Posts Tagged ‘bridgedb’

BridgeDb: now also with metabolite identifier mapping

Thursday, March 25th, 2010

Ha, fooled ya! BridgeDb has been able to deal with metabolite identifiers since the beginning. But mapping genes is such a common problem that metabolites aren’t getting any attention. Nearly all the code examples that we have thus far are with genes.

Somebody on the mailing list asked for an example with metabolites. Well, here you go; you’ll see it’s really easy. This example takes the ChEBI identifier for methionine and looks up the corresponding PubChem identifier.

// Using the BridgeRest webservice as mapping
// service, it does compound mapping fairly well.
// We select the human species, but it doesn't really
// matter which species we pick.
Class.forName ("org.bridgedb.webservice.bridgerest.BridgeRest");
IDMapper mapper = BridgeDb.connect(
    "idmapper-bridgerest:http://webservice.bridgedb.org/Human");

// Start by defining the ChEBI identifier for
// Methionine, id 16811
Xref src = new Xref("16811", BioDataSource.CHEBI);

// the method returns a set, but actually there is only one result
for (Xref dest : mapper.mapID(src, BioDataSource.PUBCHEM))
{
    // this should print 6137,
    // the PubChem identifier for Methionine.
    System.out.println ("" + dest.getId());
}

Compile this example with org.bridgedb.jar, org.bridgedb.bio.jar and org.bridgedb.webservice.bridgerest.jar in the classpath. All three jars can be downloaded from http://bridgedb.org/data/releases/
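If the Class.forName line in the example looks a bit magical: it resembles the driver-registration pattern that JDBC also uses, where loading a class makes a protocol prefix available to a connect call. Here is a minimal, self-contained sketch of that general pattern. Everything in it (SketchMapper, SketchRegistry, SketchRestDriver, the "sketch-rest" prefix and the example.org URL) is made up for illustration. It is an assumption about the mechanism, not BridgeDb's actual code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a mapper; NOT BridgeDb's real IDMapper interface.
interface SketchMapper {
    String describe();
}

// Hypothetical registry, playing the role that a connect() call plays in the example above.
final class SketchRegistry {
    private static final Map<String, Function<String, SketchMapper>> DRIVERS = new HashMap<>();

    // Driver modules call this to announce which protocol prefix they handle.
    static void register(String protocol, Function<String, SketchMapper> driver) {
        DRIVERS.put(protocol, driver);
    }

    // Splits "protocol:location" at the first colon and hands the location to the driver.
    static SketchMapper connect(String connectionString) {
        int colon = connectionString.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected protocol:location, got " + connectionString);
        }
        String protocol = connectionString.substring(0, colon);
        Function<String, SketchMapper> driver = DRIVERS.get(protocol);
        if (driver == null) {
            throw new IllegalStateException("no driver registered for " + protocol);
        }
        return driver.apply(connectionString.substring(colon + 1));
    }
}

// A driver module registers itself in a static initializer,
// which runs the first time the class is initialized.
final class SketchRestDriver {
    static {
        SketchRegistry.register("sketch-rest", location -> () -> "REST mapper at " + location);
    }
}

class DriverRegistrationDemo {
    // Initializes the driver class by name, then connects with a protocol-prefixed string.
    static String connectAfterLoading() {
        try {
            Class.forName(SketchRestDriver.class.getName());
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        return SketchRegistry.connect("sketch-rest:http://example.org/Human").describe();
    }

    public static void main(String[] args) {
        System.out.println(connectAfterLoading());
    }
}
```

The nice property of this arrangement is that the core registry never has to know which drivers exist. You only pay for a driver (and its dependencies) if you put it on the classpath and load it.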

How to develop Modular Software

Saturday, March 6th, 2010

It’s always good to make software modular. Modular software is strong and healthy, monolithic software is sickly and bedridden. I’ve touched before on how modularity increases adaptability. But modularity also helps to keep software small, nimble and unbloated. I’ll illustrate how we’re applying modular design in BridgeDb.

Modularity is the only known antidote against bloatware. The more features a piece of software has, the larger it has to be. When you don’t use 90% of those features, it’s perceived as a problem. Bloated software takes a long time to start, fills up your hard drive, clogs your tubes. We want bioinformatics developers to use BridgeDb as much as possible, and we don’t want them to complain that BridgeDb is bloated.

For example, BridgeDb supports identifier mapping through several different web services. Some of those web services are based on SOAP, others on XML-RPC or on REST. Each type of web service needs additional libraries. If BridgeDb were one monolithic chunk, you’d always need several megabytes of library dependencies.

You may say: “A few megabytes, so what?” When I was at MediaMarkt the other day, I couldn’t even find memory sticks smaller than 2 GB anymore. But size still matters when you expect fast download times. For example, WikiPathways uses BridgeDb on each pathway page. Bigger libraries mean longer load times, which means annoyed users.

We want many features, but we don’t want bloat. The solution is to cut BridgeDb up into many small pieces, where you can choose the ones you need, and ignore the rest. You also don’t need the dependencies of the parts you ignore.

So how do you decide which pieces of BridgeDb you need? I’ve compiled this handy graph. On the right side, you see all the different “features” (i.e. identifier mapping services) that you can choose. Follow the arrows to the left, and note the modules that you encounter. Those are the modules you need for that mapping service.

If you’re getting started with modular software development, I can give you a few tips. You really don’t need any of those terribly complicated frameworks like Maven or OSGi. All you need is a good IDE like Eclipse and a bit of determination.

You have to be careful to manage the boundaries between modules. Eclipse can help you a great deal with this. Put each module in its own directory. In your Eclipse workspace, set up a separate project for each module, and add dependent projects in each project build path. This way you can never introduce cyclic dependencies or go across module boundaries. Eclipse will simply refuse to find the class and flag it as a compiler error.

For example, here is how I’ve set up BridgeDb in Eclipse. In the Package Explorer you see that I’ve defined a separate project for each module in BridgeDb.

And to complete the example, here is how I configured the build path for the org.bridgedb.bio module. As you can see, the org.bridgedb project is listed as its sole dependency.
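Eclipse does this check for you, but the rule itself (no cyclic dependencies between modules) is easy to state in code. Below is a small sketch that runs a depth-first search over a module dependency table and reports whether it contains a cycle. The class and method names are hypothetical. The org.bridgedb.bio entry follows the build path described above; the dependencies listed for org.bridgedb.webservice.bridgerest are my assumption for the sake of the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ModuleGraphCheck {
    // Returns true if some module (transitively) depends on itself.
    static boolean hasCycle(Map<String, List<String>> deps) {
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String module : deps.keySet()) {
            if (visit(module, deps, done, inProgress)) {
                return true;
            }
        }
        return false;
    }

    // Depth-first search; meeting a module that is still "in progress" means a cycle.
    private static boolean visit(String module, Map<String, List<String>> deps,
                                 Set<String> done, Set<String> inProgress) {
        if (done.contains(module)) {
            return false;
        }
        if (!inProgress.add(module)) {
            return true;
        }
        for (String dep : deps.getOrDefault(module, Collections.<String>emptyList())) {
            if (visit(dep, deps, done, inProgress)) {
                return true;
            }
        }
        inProgress.remove(module);
        done.add(module);
        return false;
    }

    // Module -> modules on its build path.
    static Map<String, List<String>> sampleModules() {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("org.bridgedb", Collections.<String>emptyList());
        deps.put("org.bridgedb.bio", Arrays.asList("org.bridgedb"));
        // Assumption for illustration only:
        deps.put("org.bridgedb.webservice.bridgerest",
                 Arrays.asList("org.bridgedb", "org.bridgedb.bio"));
        return deps;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = sampleModules();
        System.out.println("cycle? " + hasCycle(deps)); // prints "cycle? false"

        // Deliberately break the layering: let the core depend on the bio module.
        deps.put("org.bridgedb", Arrays.asList("org.bridgedb.bio"));
        System.out.println("cycle? " + hasCycle(deps)); // prints "cycle? true"
    }
}
```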

BridgeDb paper published

Tuesday, January 12th, 2010

I’m very happy that our paper on BridgeDb was accepted by BMC Bioinformatics. It’s open access, so download it to your heart’s content. BridgeDb is all about identifier mapping, which I blogged about before (here, here and here).

BridgeDb lets you find cross-references for identifiers, but BridgeDb is not simply a cross-reference database. BridgeDb provides a standard method to access other cross-reference databases. And because of that level of standardization, you can easily decide to switch to a different source of cross-references.

Deepak Singh uses the term “middleware”, which is a good way to explain it, if that sort of word means anything to you.

But let me try to explain in a different way. BridgeDb is really a travel adapter. Suppose you’re in Japan and you’ve brought some gear like a laptop, cell phone and a Nintendo DS (just in case you get stuck in a blizzard while transferring at CDG). Much to your dismay you discover, after checking into your hotel, that none of your plugs fit in Japanese electrical sockets. So what do you do? Do you go down to Akihabara and spend a grand on a new laptop, phone and portable video game unit? Or do you buy a travel adapter for $1.95?

Just like there are many different power plugs around the world, there are many databases that do identifier mapping. And just like travel adapters let you plug in your laptop anywhere, no matter what country, BridgeDb lets you use your favorite bioinformatics tool, no matter what the source of identifier mappings is (provided that the tool uses BridgeDb).

Power plugs around the world

It’s important to realize that BridgeDb is simply a conduit of information. It does not calculate cross-references from scratch, nor does it give any guarantees about the validity of those cross-references. You shouldn’t ask if BridgeDb provides better identifier mappings. That is like asking if a travel adapter provides better electricity. You still depend on the power company to give you a stable source of electricity. The travel plug just gives you flexibility to adapt to different circumstances.
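In code, the travel adapter idea boils down to writing your tool against one interface and plugging in whatever source is available. The sketch below is a toy version of that, not BridgeDb's real API: CrossRefSource, AnnotationTool and AdapterDemo are hypothetical names, and the only data in it is the methionine mapping (ChEBI 16811 to PubChem 6137) from the metabolite example. The "ChEBI:" and "PubChem:" prefixes are just notation I picked for the sketch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical "socket": the one interface the tool is written against.
interface CrossRefSource {
    Set<String> mapId(String id);
}

// A tool that only knows about the interface, never about a concrete source.
class AnnotationTool {
    private final CrossRefSource source;

    AnnotationTool(CrossRefSource source) {
        this.source = source;
    }

    String annotate(String id) {
        Set<String> refs = new TreeSet<>(source.mapId(id));
        return id + " -> " + (refs.isEmpty() ? "(no mapping)" : String.join(", ", refs));
    }
}

class AdapterDemo {
    // One source: a small in-memory table.
    static CrossRefSource localTable() {
        Map<String, Set<String>> table = new HashMap<>();
        table.put("ChEBI:16811", Collections.singleton("PubChem:6137"));
        return id -> table.getOrDefault(id, Collections.<String>emptySet());
    }

    // Another source that happens to know nothing about the identifier.
    static CrossRefSource emptyService() {
        return id -> Collections.<String>emptySet();
    }

    public static void main(String[] args) {
        // Same tool, two different sources: the quality of the answer comes from the source.
        for (CrossRefSource s : new CrossRefSource[] { localTable(), emptyService() }) {
            System.out.println(new AnnotationTool(s).annotate("ChEBI:16811"));
        }
    }
}
```

The tool runs unchanged with either source, but what it prints depends entirely on the source behind the adapter, which is exactly the point about the power company.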

Fixing a Pathway the Groovy Way

Tuesday, August 25th, 2009

It’s no secret that there are many mistakes in WikiPathways. A coworker notified me of a problem with the Focal Adhesion pathway. It’s exactly the sort of problem that requires a lot of repetitive action to fix. So I thought: instead of doing this the boring way, I’ll write a program to do it, and have some fun at the same time. This is the perfect opportunity to get some practice with the WikiPathways webservice.

To make it a little bit more interesting, I decided to do it with Groovy, a scripting language I’m learning. It’s very similar to Java but also has a lot of cool whizzbang features such as dynamic typing, closures and multi-line strings.

For reference, you can find the full script below. But I’ll explain it bit by bit. The first step is to get access to the WikiPathways webservice:

def wpclient = new WikiPathwaysClient();

After this bit of setup, wpclient can be used to interact with WikiPathways.
In this case we want to send the updated pathway back, so we’ll need to log in. Authentication is not required if you only need read access. However, if you want to write data, you have to have a valid account with webservice permissions, which is available upon request.

wpclient.login("username", "********");

I only needed to get one pathway, and I know which one. It has identifier WP306. The code to download it is:

def wspwy = wpclient.getPathway("WP306");
def pwy = wpclient.toPathway(wspwy);

wspwy is a WSPathway object, which is a wrapper for the pathway with some meta-data from WikiPathways (species and revision number). The actual Pathway itself can be obtained with the toPathway() method.

Now is the time to manipulate the pathway, and fix it up any way imaginable. I’ll describe that below, but first I’ll show how I uploaded the pathway again using the webservice:

println "Uploading... ";
wpclient.updatePathway (wspwy.getId(), pwy,
  "autoconversion of Entrez symbols to IDs",
  wspwy.getRevision().toInteger());

So now onto the actual pathway manipulation. I didn’t explain yet what kind of problem needed to be fixed. The problem here was that none of the genes had a proper identifier. As a result, none of the links worked, and the pathway could not be linked to experimental data.

However, there is some good news because each gene is labeled with the gene name. Gene names are nice as labels because they are readable and a bit more meaningful than identifiers, but on the other hand gene names tend to be ambiguous, so we really need the identifiers as well. But gene names are a good start. Using BridgeDb we can perform a free text search for the gene name, and come up with a matching identifier.

First we need a bit of setup code for BridgeDb. In this case we’ll make use of our Human gene database, which can be downloaded from http://www.bridgedb.org/data/gene_databases

Class.forName ("org.bridgedb.rdb.IDMapperRdb");
mapper = BridgeDb.connect (
  "idmapper-pgdb:/path/to/Hs_Derby_20090720.bridge");
BioDataSource.init();

The main logic for translating gene names to gene identifiers is contained in the function labelToXref. It takes the label containing the gene name as its argument, and returns an identifier, or null if nothing was found.

def labelToXref (label) {
    // do a free search for all Xrefs that match our label
    for (ref in mapper.freeAttributeSearch(label, "Symbol", 100)) {
        // check only Xrefs that are in Entrez,
        // and that are an exact match with the label
        // free search will also return partial matches.
        if (ref.getDataSource() == BioDataSource.ENTREZ_GENE &&
            mapper.getAttributes (ref, "Symbol").contains(label)) {
            return ref;
        }
    }
    return null;
}

Now all we have to do is loop over each element in the pathway, get its label and use labelToXref to get the correct identifier.
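The reason labelToXref checks for an exact match is that a free-text search also returns partial hits. Here is a self-contained Java sketch of that search-then-filter step, using an in-memory list instead of a real gene database. GeneRef, LabelLookup and the sample table are hypothetical; only the INSR to 3643 pair comes from the script, and the other two rows carry made-up identifiers that exist purely to show the filter at work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a cross-reference with a symbol attribute.
final class GeneRef {
    final String id;
    final String source;
    final String symbol;

    GeneRef(String id, String source, String symbol) {
        this.id = id;
        this.source = source;
        this.symbol = symbol;
    }

    @Override
    public String toString() {
        return source + ":" + id;
    }
}

class LabelLookup {
    // Tiny in-memory "database"; the made-up-* identifiers are invented for this sketch.
    static List<GeneRef> sampleDb() {
        return Arrays.asList(
            new GeneRef("made-up-1", "Entrez Gene", "INSRR"),
            new GeneRef("made-up-2", "Ensembl", "INSR"),
            new GeneRef("3643", "Entrez Gene", "INSR"));
    }

    // Free-text search: returns every entry whose symbol contains the text,
    // so partial matches are included.
    static List<GeneRef> freeSearch(List<GeneRef> db, String text) {
        List<GeneRef> hits = new ArrayList<>();
        for (GeneRef ref : db) {
            if (ref.symbol.contains(text)) {
                hits.add(ref);
            }
        }
        return hits;
    }

    // Keep only hits from the wanted source whose symbol matches the label exactly.
    static GeneRef labelToRef(List<GeneRef> db, String label) {
        for (GeneRef ref : freeSearch(db, label)) {
            if (ref.source.equals("Entrez Gene") && ref.symbol.equals(label)) {
                return ref;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<GeneRef> db = sampleDb();
        // Loop over some labels, the way the script loops over data nodes.
        for (String label : new String[] { "INSR", "INS" }) {
            GeneRef ref = labelToRef(db, label);
            System.out.println(label + " -> " + (ref == null ? "could not map" : ref));
        }
    }
}
```

Searching for "INSR" hits all three rows, but only the Entrez Gene row with exactly that symbol survives the filter. Searching for "INS" hits the same rows and maps to nothing, which is the behaviour you want for an ambiguous label.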

And that’s it. It took me about an hour to write this. It would have taken less if I didn’t have to look up some bits about the Groovy syntax. As you can see from the pathway history, almost all genes were fixed.

Here is the entire script for reference purposes:

import org.pathvisio.model.Pathway
import org.bridgedb.bio.BioDataSource
import org.bridgedb.BridgeDb
import org.pathvisio.model.ObjectType
import org.pathvisio.view.Graphics
import org.pathvisio.wikipathways.WikiPathwaysClient

// Pathway WP306 (Focal Adhesion) uses Entrez symbols instead of Entrez IDs
// Thanks to Claus Mayer for reporting this problem.
public class EntrezSymbolToNumber {
    def mapper;
    def wpclient;
    // Look up entrez gene id for a given label
    // e.g. for INSR it will return L:3643
    def labelToXref (label) {
        // do a free search for all Xrefs that match our label
        for (ref in mapper.freeAttributeSearch(label, "Symbol", 100)) {
            // check only Xrefs that are in Entrez,
            // and that are an exact match with the label
            // free search will also return partial matches.
            if (ref.getDataSource() == BioDataSource.ENTREZ_GENE &&
                mapper.getAttributes (ref, "Symbol").contains(label)) {
                return ref;
            }
        }
        return null;
    }
   
    void init() {
        wpclient = new WikiPathwaysClient();
        wpclient.login("username", "********");
        Class.forName ("org.bridgedb.rdb.IDMapperRdb");
        mapper = BridgeDb.connect ("idmapper-pgdb:/path/to/Hs_Derby_20090720.bridge");
        BioDataSource.init();
    }
   
    void run() {
        def success = 0;
        def total = 0;
        // fetch pathway through the webservice
        def wspwy = wpclient.getPathway("WP306");
        def pwy = wpclient.toPathway(wspwy);

        // loop over all data nodes
        for (dn in pwy.getDataObjects()) {
            if (dn.getObjectType() == ObjectType.DATANODE) {
                total++;
                def label = dn.getTextLabel();
                def ref = dn.getXref();
                print (label + " " + ref + " -> ");
                if (ref.getDataSource() == BioDataSource.ENTREZ_GENE
                        && !mapper.xrefExists(ref)) {      
                    def result = labelToXref (label);
                    if (result != null) {
                        println "mapping to " + result;
                        dn.setGeneID (result.getId());
                        success++;
                    }
                    else
                        println "could not map";
                }
                else
                    println "OK";
            }
        }
        println success + " out of " + total + " converted";       
        println "Uploading... ";
        wpclient.updatePathway (wspwy.getId(), pwy, "autoconversion of Entrez symbols to IDs", wspwy.getRevision().toInteger());
    }
   
    static void main (args) {
        def runner = new EntrezSymbolToNumber();
        runner.init();
        runner.run();
    }  
}

By the way, I’m using the codecolorer plugin for the pretty code formatting. (thanks to rguha for the tip, and to Dmytro for being so responsive to my bug report)

The “Why” of the Identifier Mapping Problem

Tuesday, August 11th, 2009

I wrote before about my current work on identifier mapping. Briefly, each of the many different databases for genes and metabolites uses its own system of identifiers. This creates big headaches when you want to compare things from different databases. You’ll have to do some work to correlate them, which is what we call the identifier mapping problem.

Why does this problem exist in the first place? Wouldn’t it be really fantastic if everybody always used the same identifiers everywhere? I don’t think that’s ever going to happen. There are practical reasons for that, but there are also fundamental problems that can never be solved.

Scientific databases are organized in a way that reflects the mindset of the scientists that created them. I noticed the same argument in an essay by Clay Shirky about the semantic web:

Because meta-data describes a worldview, incompatibility is an inevitable by-product of vigorous argument. It would be relatively easy, for example, to encode a description of genes in XML, but it would be impossible to get a universal standard for such a description, because biologists are still arguing about what a gene actually is. There are several competing standards for describing genetic information, and the semantic divergence is an artifact of a real conversation among biologists. You can’t get a standard til you have an agreement, and you can’t force an agreement to exist where none actually does.

Lactic Acid

Lactate

Here is another example from the context of bioinformatics: a chemist might create separate identifiers for lactate and lactic acid. To a chemist, these are two different things: lactate is missing a hydrogen atom and it’s even negatively charged. But when dissolved in water these two rapidly convert into each other, making them practically indistinguishable. So a chemistry-oriented database such as ChEBI describes them separately (CHEBI:24996 and CHEBI:28358), whereas a biological database such as HMDB puts both in a single record (HMDB00190). World views have affected the way these databases are set up.
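This is also why an identifier mapping has to return a set rather than a single value. The sketch below stores the one HMDB record with its two ChEBI counterparts and inverts the table to get the opposite direction. The class and method names are hypothetical; the three identifiers are the ones mentioned above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class LactateMapping {
    // One HMDB record corresponds to two ChEBI entries (lactate and lactic acid).
    static Map<String, Set<String>> hmdbToChebi() {
        Map<String, Set<String>> forward = new HashMap<>();
        forward.put("HMDB00190", new TreeSet<>(Arrays.asList("CHEBI:24996", "CHEBI:28358")));
        return forward;
    }

    // Invert a one-to-many table: every target points back at all of its sources.
    static Map<String, Set<String>> invert(Map<String, Set<String>> forward) {
        Map<String, Set<String>> back = new HashMap<>();
        for (Map.Entry<String, Set<String>> entry : forward.entrySet()) {
            for (String target : entry.getValue()) {
                back.computeIfAbsent(target, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        return back;
    }

    public static void main(String[] args) {
        // HMDB -> ChEBI yields two results for a single input.
        System.out.println(hmdbToChebi().get("HMDB00190")); // prints [CHEBI:24996, CHEBI:28358]

        // ChEBI -> HMDB: both chemical entries collapse onto the same record.
        Map<String, Set<String>> chebiToHmdb = invert(hmdbToChebi());
        System.out.println(chebiToHmdb.get("CHEBI:24996")); // prints [HMDB00190]
        System.out.println(chebiToHmdb.get("CHEBI:28358")); // prints [HMDB00190]
    }
}
```

Note that a round trip from ChEBI to HMDB and back cannot tell you which of the two ChEBI entries you started with. The distinction the chemist cares about is simply not present in the biological database.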

By the way, the article quoted above is also an argument against the whole idea of the Semantic Web of Life Sciences (SWLS), but that’s subject matter for another post.