Posts Tagged ‘identifier mapping problem’

BridgeDb: now also with metabolite identifier mapping

Thursday, March 25th, 2010

Ha, fooled ya! BridgeDb has been able to deal with metabolite identifiers since the beginning. But mapping genes is such a common problem that metabolites aren’t getting any attention. Nearly all the code examples that we have thus far are with genes.

Somebody on the mailing list asked for an example with metabolites. Well, here you go; you’ll see it’s really easy. This example takes the ChEBI identifier for methionine and looks up the corresponding PubChem identifier.

import org.bridgedb.BridgeDb;
import org.bridgedb.IDMapper;
import org.bridgedb.Xref;
import org.bridgedb.bio.BioDataSource;

public class MetaboliteExample
{
    public static void main(String[] args) throws Exception
    {
        // Using the BridgeRest webservice as mapping
        // service, it does compound mapping fairly well.
        // We select the human species, but it doesn't really
        // matter which species we pick.
        // Loading the class registers the "idmapper-bridgerest" driver.
        Class.forName("org.bridgedb.webservice.bridgerest.BridgeRest");
        IDMapper mapper = BridgeDb.connect(
            "idmapper-bridgerest:http://webservice.bridgedb.org/Human");

        // Start by defining the ChEBI identifier for
        // methionine, id 16811
        Xref src = new Xref("16811", BioDataSource.CHEBI);

        // the method returns a set, but here there is only one result
        for (Xref dest : mapper.mapID(src, BioDataSource.PUBCHEM))
        {
            // this should print 6137,
            // the PubChem identifier for methionine.
            System.out.println(dest.getId());
        }
    }
}

Compile this example with org.bridgedb.jar, org.bridgedb.bio.jar and org.bridgedb.webservice.bridgerest.jar in the classpath, which can be downloaded from http://bridgedb.org/data/releases/

BridgeDb paper published

Tuesday, January 12th, 2010

I’m very happy that our paper on BridgeDb was accepted by BMC Bioinformatics. It’s open access, so download it to your heart’s content. BridgeDb is all about identifier mapping, which I blogged about before (here, here and here).

BridgeDb lets you find cross-references for identifiers, but BridgeDb is not simply a cross-reference database. BridgeDb provides a standard method to access other cross-reference databases. And because of that level of standardization, you can easily decide to switch to a different source of cross-references.

Deepak Singh uses the term “middleware”, which is a good way to explain it, if that sort of word means anything to you.

But let me try to explain in a different way. BridgeDb is really a travel adapter. Suppose you’re in Japan and you’ve brought some gear like a laptop, cell phone and a Nintendo DS (just in case you get stuck in a blizzard while transferring at CDG). Much to your dismay you discover, after checking into your hotel, that none of your plugs fit in Japanese electrical sockets. So what do you do? Do you go down to Akihabara and spend a grand on a new laptop, phone and portable video game unit? Or do you buy a travel adapter for $1.95?

Just like there are many different power plugs around the world, there are many databases that do identifier mapping. And just like travel adapters let you plug in your laptop anywhere, no matter what country, BridgeDb lets you use your favorite bioinformatics tool, no matter what the source of identifier mappings is (provided that the tool uses BridgeDb).

Power plugs around the world

It’s important to realize that BridgeDb is simply a conduit of information. It does not calculate cross-references from scratch, nor does it give any guarantees about the validity of those cross-references. You shouldn’t ask if BridgeDb provides better identifier mappings. That is like asking if a travel adapter provides better electricity. You still depend on the power company to give you a stable source of electricity. The travel plug just gives you flexibility to adapt to different circumstances.
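The travel adapter idea can be sketched in a few lines of plain Java. This is only an illustration: `CrossRefSource`, `TableSource` and `AdapterSketch` are made-up names, not BridgeDb classes (the real interface is `IDMapper`, as used in the metabolite example). The point is that the tool is written once against the interface, and the quality of the answer depends entirely on the source you plug in.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a common mapping interface (the "socket shape").
interface CrossRefSource {
    Set<String> lookup(String id);
}

// One possible source: cross-references held in an in-memory table.
final class TableSource implements CrossRefSource {
    private final Map<String, Set<String>> table;

    TableSource(Map<String, Set<String>> table) {
        this.table = table;
    }

    @Override
    public Set<String> lookup(String id) {
        return table.getOrDefault(id, Set.of());
    }
}

final class AdapterSketch {
    // The "appliance": it only knows the interface, not the source behind it.
    static String describe(CrossRefSource source, String id) {
        Set<String> hits = source.lookup(id);
        if (hits.isEmpty()) {
            return id + " -> (no mapping)";
        }
        return id + " -> " + String.join(", ", new TreeSet<>(hits));
    }

    public static void main(String[] args) {
        // A source that knows the methionine mapping from the example post.
        CrossRefSource curated = new TableSource(
            Map.of("CHEBI:16811", Set.of("PUBCHEM:6137")));
        // A source that knows nothing; the adapter cannot make up for that.
        CrossRefSource empty = id -> Set.of();

        System.out.println(describe(curated, "CHEBI:16811"));
        System.out.println(describe(empty, "CHEBI:16811"));
    }
}
```

Swapping `curated` for `empty` changes the result without touching `describe`, which is exactly the flexibility (and the limitation) of a travel adapter.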

The “Why” of the Identifier Mapping Problem

Tuesday, August 11th, 2009

I wrote before about my current work on identifier mapping. Briefly, each of the many different databases for genes and metabolites uses its own system of identifiers. This creates big headaches when you want to compare things from different databases. You’ll have to do some work to correlate them, which is what we call the identifier mapping problem.

Why does this problem exist in the first place? Wouldn’t it be really fantastic if everybody would always use the same identifiers everywhere? I don’t think that’s ever going to happen. There are practical reasons for that, but there are also fundamental problems that can never be solved.

Scientific databases are organized in a way that reflects the mindset of the scientists that created them. I noticed the same argument in an essay by Clay Shirky about the semantic web:

Because meta-data describes a worldview, incompatibility is an inevitable by-product of vigorous argument. It would be relatively easy, for example, to encode a description of genes in XML, but it would be impossible to get a universal standard for such a description, because biologists are still arguing about what a gene actually is. There are several competing standards for describing genetic information, and the semantic divergence is an artifact of a real conversation among biologists. You can’t get a standard til you have an agreement, and you can’t force an agreement to exist where none actually does.

Lactic Acid

Lactate

Here is another example from the context of bioinformatics: a chemist might create separate identifiers for lactate and lactic acid. To a chemist, these are two different things: lactate is missing a hydrogen atom and it’s even negatively charged. But when dissolved in water these two rapidly convert into each other, making them practically indistinguishable. So a chemistry-oriented database such as ChEBI describes them separately (CHEBI:24996 and CHEBI:28358), whereas a biological database such as HMDB puts both in a single record (HMDB00190). World views have affected the way these databases are set up.
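To make the consequence concrete, here is a small sketch of what the lactate case does to a mapping table. It uses only the three identifiers mentioned above; the class names `SymmetricXrefTable` and `LactateSketch` are invented for this illustration and the code is not how any real mapping database stores its links. Going from ChEBI to HMDB gives one answer, but going back gives two, which is why a mapping function has to return a set rather than a single identifier.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical table that stores every link in both directions.
final class SymmetricXrefTable {
    private final Map<String, Set<String>> links = new HashMap<>();

    void link(String a, String b) {
        links.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
        links.computeIfAbsent(b, k -> new TreeSet<>()).add(a);
    }

    // Returns all identifiers linked to id, sorted; empty if none are known.
    Set<String> map(String id) {
        return Collections.unmodifiableSet(
            links.getOrDefault(id, new TreeSet<>()));
    }
}

final class LactateSketch {
    static SymmetricXrefTable build() {
        SymmetricXrefTable table = new SymmetricXrefTable();
        // Two ChEBI records, one HMDB record.
        table.link("CHEBI:24996", "HMDB00190");
        table.link("CHEBI:28358", "HMDB00190");
        return table;
    }

    public static void main(String[] args) {
        SymmetricXrefTable table = build();
        System.out.println(table.map("CHEBI:24996")); // [HMDB00190]
        System.out.println(table.map("HMDB00190"));   // [CHEBI:24996, CHEBI:28358]
    }
}
```

Neither answer is wrong. The two databases simply disagree about how many things there are, and no amount of mapping can settle that.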

By the way, the article quoted above is also an argument against the whole idea of the Semantic Web of Life Sciences (SWLS), but that’s subject matter for another post.

BatchMapper v0.1

Sunday, July 5th, 2009

I just released the first working version of a new tool called batchmapper. This tool lets you take a list of gene, protein or metabolite identifiers from one database and translate them to a different database.

Why is this useful? I’ll explain for metabolites, although the story is really the same for genes and proteins. Metabolites are the chemical compounds that you find naturally in the human body. Of course a lot of research is being done on metabolites, and the collected wisdom is available in a number of online databases, such as KEGG in Japan, PubChem in the USA, ChEBI in the UK and HMDB in Canada.

The glut of online databases has led to a Tower of Babel of metabolite identifiers. Glucose, one of the most important compounds in our body, may be known as HMDB00122 in Canada, C00031 in Japan, 5793 in the USA or 17634 in the UK.

batchmapper is a spin-off from recent work done by JJ and me. It’s a command line tool, so it’s not very user friendly, but it is fast, flexible and completely automatic. The translation tables can be provided in the form of text files, relational databases or webservices, or even a combination thereof. This early release is completely functional. Check out the tutorial, and leave some comments here on this blog.
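For a rough idea of what translating a list through a text-file translation table involves, here is a minimal sketch in plain Java. It is not batchmapper's actual code or file format: the tab-delimited layout, the column names and the `BatchSketch` class are all assumptions made for this illustration. The only data row is glucose, with the identifiers quoted above, and `NOT_IN_TABLE` is a deliberately unknown placeholder.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BatchSketch {
    // Assumed layout: a header row naming each system, then one row per compound.
    static final String TABLE =
        "HMDB\tKEGG\tPubChem\tChEBI\n"
        + "HMDB00122\tC00031\t5793\t17634\n";

    // Build a lookup from one column of the table to another.
    static Map<String, String> loadColumnPair(String table, String from, String to) {
        String[] rows = table.split("\n");
        List<String> header = List.of(rows[0].split("\t"));
        int src = header.indexOf(from);
        int dst = header.indexOf(to);
        if (src < 0 || dst < 0) {
            throw new IllegalArgumentException("Unknown column: " + from + " or " + to);
        }
        Map<String, String> lookup = new HashMap<>();
        for (int i = 1; i < rows.length; i++) {
            String[] cells = rows[i].split("\t");
            if (cells.length > Math.max(src, dst)) {
                lookup.put(cells[src], cells[dst]);
            }
        }
        return lookup;
    }

    // Translate every identifier in order, marking the ones without a mapping.
    static List<String> translate(List<String> ids, Map<String, String> lookup) {
        List<String> result = new ArrayList<>();
        for (String id : ids) {
            result.add(lookup.getOrDefault(id, "(unmapped)"));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> keggToChebi = loadColumnPair(TABLE, "KEGG", "ChEBI");
        List<String> input = List.of("C00031", "NOT_IN_TABLE");
        System.out.println(translate(input, keggToChebi)); // [17634, (unmapped)]
    }
}
```

Reporting unmapped identifiers explicitly, instead of silently dropping them, keeps the output aligned with the input list.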

It would be nice if all the online metabolite databases worked together and merged into a single resource, but I don’t see that happening in the near future. At least batchmapper helps to make the problem a little more manageable.