Fixing a Pathway the Groovy Way

Tuesday, August 25th, 2009

It’s no secret that there are many mistakes in WikiPathways. A coworker notified me of a problem with the Focal Adhesion pathway. It’s exactly the sort of problem that takes a lot of repetitive work to fix. So I thought, instead of doing this the boring way, I would write a program to do it and have some fun at the same time. It was also the perfect opportunity to get some practice with the WikiPathways webservice.

To make it a little bit more interesting, I decided to do it with Groovy, a scripting language I’m learning. It’s very similar to Java but also has a lot of cool whizzbang features such as dynamic typing, closures and multi-line strings.
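
As a quick illustration of the three features just mentioned, here is a minimal sketch. The variable names and strings are made up for this example and have nothing to do with WikiPathways.

```groovy
// Dynamic typing: 'def' declares a variable without a fixed type,
// so it can hold a number first and a string later.
def value = 42
value = "now a string"

// Closure: a block of code that can be stored in a variable and called later.
def shout = { text -> text.toUpperCase() + "!" }

// Multi-line string: triple quotes keep the line break inside the literal.
def message = """first line
second line"""

println value
println shout("groovy")
println message.readLines().size()
```

The same script written in Java would need a class, a main method and explicit types, which is most of what makes Groovy pleasant for small one-off jobs like this one.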

You can find the full script at the end of this post, but I’ll explain it bit by bit. The first step is to get access to the WikiPathways webservice:

def wpclient;
wpclient = new WikiPathwaysClient();

After this bit of setup, wpclient can be used to interact with WikiPathways.
In this case we want to send the updated pathway back, so we’ll need to log in. Authentication is not required if you only need read access. To write data, however, you need a valid account with webservice permissions, which are available upon request.

wpclient.login("username", "********");

I only needed to get one pathway, and I knew which one: it has the identifier WP306. The code to download it is:

def wspwy = wpclient.getPathway("WP306");
def pwy = wpclient.toPathway(wspwy);

wspwy is a WSPathway object, which is a wrapper for the pathway with some meta-data from WikiPathways (species and revision number). The actual Pathway itself can be obtained with the toPathway() method.

Now is the time to manipulate the pathway, and fix it up any way imaginable. I’ll describe that below, but first I’ll show how I uploaded the pathway again using the webservice:

println "Uploading... ";
wpclient.updatePathway (wspwy.getId(), pwy,
  "autoconversion of Entrez symbols to IDs",
  wspwy.getRevision().toInteger());

So now on to the actual pathway manipulation. I haven’t explained yet what kind of problem needed to be fixed. The problem here was that none of the genes had a proper identifier. As a result, none of the links worked, and the pathway could not be linked to experimental data.

However, there is some good news because each gene is labeled with the gene name. Gene names are nice as labels because they are readable and a bit more meaningful than identifiers, but on the other hand gene names tend to be ambiguous, so we really need the identifiers as well. But gene names are a good start. Using BridgeDb we can perform a free text search for the gene name, and come up with a matching identifier.

First we need a bit of setup code for BridgeDb. In this case we’ll make use of our Human gene database, which can be downloaded from http://www.bridgedb.org/data/gene_databases

Class.forName ("org.bridgedb.rdb.IDMapperRdb");
mapper = BridgeDb.connect (
  "idmapper-pgdb:/path/to/Hs_Derby_20090720.bridge");
BioDataSource.init();

The main logic for translating gene names to gene identifiers is contained in the function labelToXref. It takes the label containing the gene name as its argument and returns an identifier, or null if nothing was found.

def labelToXref (label) {
    // do a free search for all Xrefs that match our label
    for (ref in mapper.freeAttributeSearch(label, "Symbol", 100)) {
        // check only Xrefs that are in Entrez,
        // and that are an exact match with the label
        // free search will also return partial matches.
        if (ref.getDataSource() == BioDataSource.ENTREZ_GENE &&
            mapper.getAttributes (ref, "Symbol").contains(label)) {
            return ref;
        }
    }
    return null;
}
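
The exact-match check matters because a free search returns partial matches as well. The following is a minimal sketch of that idea without BridgeDb: a plain map stands in for the gene database, and freeSearch and exactMatch are hypothetical helpers, not part of any library. The id 3643 for INSR comes from the comment in the full script; the other ids are deliberately made-up placeholders.

```groovy
// Toy stand-in for the gene database: id -> list of symbols.
// "3643" for INSR is taken from the script's comment; the other ids
// are placeholders, not real Entrez Gene ids.
def symbolsById = [
    "placeholder-1": ["INSRR"],
    "3643"         : ["INSR"],
    "placeholder-2": ["XINSR"]
]

// Hypothetical free search: returns every id that has a symbol which
// merely contains the query, so partial matches are included.
def freeSearch = { query ->
    symbolsById.findAll { id, symbols ->
        symbols.any { it.contains(query) }
    }.keySet()
}

// Keep the first hit whose symbol list contains the label exactly,
// or null when there is no such hit.
def exactMatch = { label ->
    freeSearch(label).find { id -> symbolsById[id].contains(label) }
}

println freeSearch("INSR").size()  // number of partial and exact hits
println exactMatch("INSR")
println exactMatch("FOO")
```

In this toy data the free search for INSR hits all three entries, but only one of them carries the symbol INSR itself, which is the situation the second condition in labelToXref guards against.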

Now all we have to do is loop over each element in the pathway, get its label and use labelToXref to get the correct identifier.
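
The real loop is the run() method in the full script at the end of this post. As a rough, self-contained sketch of the same pattern, the version below uses plain maps in place of pathway elements and a lookup table in place of labelToXref. All data here is hypothetical, and the digits-only test is a simplification standing in for the script's check of whether the identifier exists in the database.

```groovy
// Plain maps stand in for the pathway's data nodes (hypothetical data).
def nodes = [
    [label: "INSR",    id: "INSR"],     // symbol stored where an id belongs
    [label: "UNKNOWN", id: "UNKNOWN"],  // no match available for this one
    [label: "INSR",    id: "3643"]      // already correct
]

// Stand-in for labelToXref: returns an id for a label, or null.
def lookup = [INSR: "3643"]
def labelToId = { label -> lookup[label] }

def success = 0
def total = 0
for (node in nodes) {
    total++
    // Simplified validity test: treat an all-digit id as already fine.
    if (node.id ==~ /\d+/) continue
    def result = labelToId(node.label)
    if (result != null) {
        node.id = result   // replace the symbol with the identifier
        success++
    }
}
println "${success} out of ${total} converted"
```

Counting both the total and the number of successful conversions gives a quick sanity check before anything is uploaded.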

And that’s it. It took me about an hour to write this. It would have taken less if I hadn’t had to look up some bits of Groovy syntax. As you can see from the pathway history, almost all genes were fixed.

Here is the entire script for reference purposes:

import org.pathvisio.model.Pathway
import org.bridgedb.bio.BioDataSource
import org.bridgedb.BridgeDb
import org.pathvisio.model.ObjectType
import org.pathvisio.view.Graphics
import org.pathvisio.wikipathways.WikiPathwaysClient

// Pathway WP306 (Focal Adhesion) uses Entrez symbols instead of Entrez IDs
// Thanks to Claus Mayer for reporting this problem.
public class EntrezSymbolToNumber {
    def mapper;
    def wpclient;
    // Look up entrez gene id for a given label
    // e.g. for INSR it will return L:3643
    def labelToXref (label) {
        // do a free search for all Xrefs that match our label
        for (ref in mapper.freeAttributeSearch(label, "Symbol", 100)) {
            // check only Xrefs that are in Entrez,
            // and that are an exact match with the label
            // free search will also return partial matches.
            if (ref.getDataSource() == BioDataSource.ENTREZ_GENE &&
                mapper.getAttributes (ref, "Symbol").contains(label)) {
                return ref;
            }
        }
        return null;
    }
   
    void init() {
        wpclient = new WikiPathwaysClient();
        wpclient.login("username", "********");
        Class.forName ("org.bridgedb.rdb.IDMapperRdb");
        mapper = BridgeDb.connect ("idmapper-pgdb:/path/to/Hs_Derby_20090720.bridge");
        BioDataSource.init();
    }
   
    void run() {
        def success = 0;
        def total = 0;
        // fetch pathway through the webservice
        def wspwy = wpclient.getPathway("WP306");
        def pwy = wpclient.toPathway(wspwy);

        // loop over all data nodes
        for (dn in pwy.getDataObjects()) {
            if (dn.getObjectType() == ObjectType.DATANODE) {
                total++;
                def label = dn.getTextLabel();
                def ref = dn.getXref();
                print (label + " " + ref + " -> ");
                if (ref.getDataSource() == BioDataSource.ENTREZ_GENE
                        && !mapper.xrefExists(ref)) {      
                    def result = labelToXref (label);
                    if (result != null) {
                        println "mapping to " + result;
                        dn.setGeneID (result.getId());
                        success++;
                    }
                    else
                        println "could not map";
                }
                else
                    println "OK";
            }
        }
        println success + " out of " + total + " converted";       
        println "Uploading... ";
        wpclient.updatePathway (wspwy.getId(), pwy, "autoconversion of Entrez symbols to IDs", wspwy.getRevision().toInteger());
    }
   
    static void main (args) {
        def runner = new EntrezSymbolToNumber();
        runner.init();
        runner.run();
    }  
}

By the way, I’m using the codecolorer plugin for the pretty code formatting. (thanks to rguha for the tip, and to Dmytro for being so responsive to my bug report)