Posts Tagged ‘identifiers’

More about URI’s for BioPAX

Monday, December 3rd, 2012

In a previous post, I explained that a BioPAX document is really an RDF graph. With that in mind, you can do interesting things like inferring URIs using a SPARQL CONSTRUCT query.

What I didn’t explain is that, after adding those new inferences, the result is no longer valid BioPAX. RDF gives you lots of freedom, as well as lots of rope to hang yourself with. BioPAX has some restrictions in place that are necessary for exchange of pathway data.

Let me explain in more detail. Take a look at the BioPAX snippet below. This snippet represents more or less the same information as the first figure from my previous post. It represents Protein186961, with a bp:xref property pointing to id4, which is a UnificationXref with bp:db property FlyBase and bp:id property FBgn0034356.

 <bp:ProteinReference rdf:about="Protein186961">
  <bp:xref rdf:resource="id4" />
 </bp:ProteinReference>

 <bp:UnificationXref rdf:about="id4">
  <bp:id rdf:datatype="xsd:string">FBgn0034356</bp:id>
  <bp:db rdf:datatype="xsd:string">FlyBase</bp:db>
 </bp:UnificationXref>

After the SPARQL CONSTRUCT query, the newly inferred URIs are added back to the graph. The result looks more or less like this:

<bp:ProteinReference rdf:about="Protein186961">
 <bp:xref rdf:resource="id4" />
 <bp:xref rdf:resource=""/>
</bp:ProteinReference>

As you can see, Protein186961 now has two bp:xref properties. This kind of duplication may cause problems for software. Furthermore, the resource that the new bp:xref points to doesn’t have the correct type (UnificationXref), and it doesn’t have values for bp:db and bp:id, because our CONSTRUCT query didn’t say anything about them. Yet well-behaved pathway software might quite reasonably be looking for that information.
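
To make the problem concrete, here is a minimal sketch in plain Python of the kind of check that software might perform. It is not how the BioPAX validator works; the triples are written as simple tuples, and "inferred-uri" is a placeholder standing in for the generated URI.

```python
RDF_TYPE = "rdf:type"


def incomplete_xrefs(triples):
    """Return xref targets that lack a UnificationXref type, bp:db or bp:id."""
    targets = {o for s, p, o in triples if p == "bp:xref"}
    problems = {}
    for target in sorted(targets):
        # All (predicate, object) pairs that describe this xref target.
        props = {(p, o) for s, p, o in triples if s == target}
        predicates = {p for p, _ in props}
        missing = []
        if (RDF_TYPE, "bp:UnificationXref") not in props:
            missing.append("type")
        for required in ("bp:db", "bp:id"):
            if required not in predicates:
                missing.append(required)
        if missing:
            problems[target] = missing
    return problems


graph = [
    ("Protein186961", RDF_TYPE, "bp:ProteinReference"),
    ("Protein186961", "bp:xref", "id4"),
    ("id4", RDF_TYPE, "bp:UnificationXref"),
    ("id4", "bp:id", "FBgn0034356"),
    ("id4", "bp:db", "FlyBase"),
    # The triple added by the CONSTRUCT query; "inferred-uri" is a placeholder.
    ("Protein186961", "bp:xref", "inferred-uri"),
]

print(incomplete_xrefs(graph))
```

The original xref (id4) passes, while the inferred one is reported as missing its type, bp:db and bp:id.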

Running inferences on an RDF store gives you lots of power, but it’s not necessarily good for standardization. If you are running a large pathway database, you might want to enforce some restrictions. The online BioPAX validator created by Igor Rodchenkov et al. is the gold standard for producing correct, manageable BioPAX. Running it on the second snippet leads to this error:

But what if you want to have URIs, but you also want to keep your BioPAX valid? It’s easy – the UnificationXref in the first snippet used id4 as its resource identifier. Id4 is just an arbitrary value – we can easily replace it with something better. Instead of running a CONSTRUCT query, it’s a matter of modifying your BioPAX-generating code to write out URIs where possible. The result could look like the snippet below. Admittedly, the result has a bit of redundancy, with the two references to FBgn0034356. But that is a small price to pay. The new version is ready for SPARQL integration magic, yet it’s still standards-compliant so that mundane software can cope with it too.

 <bp:ProteinReference rdf:about="Protein186961">
  <bp:xref rdf:resource="" />
 </bp:ProteinReference>

 <bp:UnificationXref rdf:about="">
  <bp:id rdf:datatype="xsd:string">FBgn0034356</bp:id>
  <bp:db rdf:datatype="xsd:string">FlyBase</bp:db>
 </bp:UnificationXref>

Inferring URIs for BioPAX

Friday, November 16th, 2012

Here is a useful data-integration trick involving BioPAX and

BioPAX is a pathway exchange format – it is known for being somewhat complicated, but at the very basic level it’s simple: BioPAX is made up of subject-predicate-object triples. Together these triples form a graph. Thus, a BioPAX document is nothing more than a large graph. Here is a small fragment to illustrate:

Here you see a particular BiochemicalReaction, which is catalysed by a particular Protein [1]. Both the BiochemicalReaction and the Protein have numbers acting as local identifiers – they are quite useless outside this BioPAX document. To identify this particular protein in the wild, we must look at its Xref, which refers to a database (FlyBase), and an identifier (FBgn0034356). [2]

You have to imagine that this graph is much larger than just the snippet shown above, and contains lots of interesting information. And we can make it even more interesting by fetching information about this protein from external databases and integrating that into the graph.

The trouble is that the Xref is stored in two nodes: one for the identifier and one for the database. This makes data integration cumbersome, requiring comparison of two nodes at the same time. It would be more efficient to merge this data into a single node.

One possible solution is to simply concatenate the database and identifier and put that into a new node. For example, here is just one way we could do that:


But we can do even better: if we combine the two nodes into a single URI (Uniform Resource Identifier) from, we gain the added advantage of having a resolvable URI. That means that the identifier is also a link which you can open in a browser, which is just incredibly neat.

(Go ahead and open it:

We can create these URIs directly in the triple store using a SPARQL CONSTRUCT query. SPARQL is a query language for graphs – it looks for patterns in the graph, and in the case of CONSTRUCT queries, new triples are generated which can be added back into the graph. The following query generates URIs for UniProt Xrefs. Unfortunately this query only works on the Virtuoso triple store, because of the whole “bif:sprintf…” incantation, which is non-standard SPARQL. Presumably equivalent functions exist for other triple stores.

   CONSTRUCT {
    ?x BP:xref `bif:sprintf_iri (
    "", ?id)`
   }
   WHERE {
    ?x BP:xref ?blank .
    ?blank BP:id ?id .
    ?blank BP:db "UniProt"^^xsd:string
   }

If you try that, you will get a set of new triples, which looks like this when viewed in the browser:



If you want, you can try it for yourself on our live triple store with preloaded BioPAX data. Here is our live SPARQL endpoint. If you scroll down on that page you will see a few more SPARQL queries to try. To learn more, please see my presentation at the SPIN-OSS conference.
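
If you don’t have a triple store at hand, the pattern matching that the CONSTRUCT query performs can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: triples are plain tuples, each Xref has a single db and id, and the URI template (http://example.org/...) is a made-up placeholder, not the pattern used in the query above.

```python
def construct_uri_xrefs(triples, db_name, uri_template):
    """Mimic the CONSTRUCT query on a list of (subject, predicate, object) tuples.

    For every ?x that has an xref whose bp:db equals db_name, emit a new
    (?x, "bp:xref", uri) triple, with the URI built from the xref's bp:id.
    """
    db_of = {s: o for s, p, o in triples if p == "bp:db"}
    id_of = {s: o for s, p, o in triples if p == "bp:id"}
    new_triples = []
    for s, p, o in triples:
        if p == "bp:xref" and db_of.get(o) == db_name and o in id_of:
            new_triples.append((s, "bp:xref", uri_template.format(id=id_of[o])))
    return new_triples


graph = [
    ("Protein186961", "bp:xref", "id4"),
    ("id4", "bp:id", "FBgn0034356"),
    ("id4", "bp:db", "FlyBase"),
]

# Hypothetical URI template, used here for illustration only.
TEMPLATE = "http://example.org/flybase/{id}"

inferred = construct_uri_xrefs(graph, "FlyBase", TEMPLATE)
print(inferred)
```

Just like the real query, this only produces the new triples; adding them back to the graph is a separate step.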


  • [1] In standard BioPAX, there is a Catalysis object between a Protein and a BiochemicalReaction. The controlledBy relation must be inferred.
  • [2] Ignore for the moment that we’re using a gene identifier for a protein.

The “Why” of the Identifier Mapping Problem

Tuesday, August 11th, 2009

I wrote before about my current work on identifier mapping. Briefly, each of the many different databases for genes and metabolites uses its own system of identifiers. This creates big headaches when you want to compare things from different databases. You’ll have to do some work to correlate them, which is what we call the identifier mapping problem.

Why does this problem exist in the first place? Wouldn’t it be really fantastic if everybody would always use the same identifiers everywhere? I don’t think that’s ever going to happen. There are practical reasons for that, but there are also fundamental problems that can never be solved.

Scientific databases are organized in a way that reflects the mindset of the scientists that created them. I noticed the same argument in an essay by Clay Shirky about the semantic web:

Because meta-data describes a worldview, incompatibility is an inevitable by-product of vigorous argument. It would be relatively easy, for example, to encode a description of genes in XML, but it would be impossible to get a universal standard for such a description, because biologists are still arguing about what a gene actually is. There are several competing standards for describing genetic information, and the semantic divergence is an artifact of a real conversation among biologists. You can’t get a standard til you have an agreement, and you can’t force an agreement to exist where none actually does.

Lactic Acid


Here is another example from the context of bioinformatics: a chemist might create separate identifiers for lactate and lactic acid. To a chemist, these are two different things: lactate is missing a hydrogen atom and it’s even negatively charged. But when dissolved in water these two rapidly convert into each other, making them practically indistinguishable. So a chemistry-oriented database such as ChEBI describes them separately (CHEBI:24996 and CHEBI:28358) whereas a biological database such as HMDB puts both in a single record (HMDB00190). World views have affected the way these databases are set up.
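
For anyone writing mapping code, the practical consequence is that identifier mappings are not one-to-one. Here is a small sketch in Python, using only the identifiers mentioned above; the dictionary layout is just an assumption for illustration.

```python
from collections import defaultdict

# Cross-references taken from the paragraph above.
CHEBI_TO_HMDB = {
    "CHEBI:24996": "HMDB00190",
    "CHEBI:28358": "HMDB00190",
}


def invert(mapping):
    """Invert a many-to-one mapping into a one-to-many mapping."""
    inverse = defaultdict(list)
    for source, target in mapping.items():
        inverse[target].append(source)
    return {key: sorted(values) for key, values in inverse.items()}


HMDB_TO_CHEBI = invert(CHEBI_TO_HMDB)
print(HMDB_TO_CHEBI)
```

Going from ChEBI to HMDB loses the distinction between the two compounds, and going back gives you two candidates instead of one.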

By the way, the article quoted above is also an argument against the whole idea of the Semantic Web of Life Sciences (SWLS), but that’s subject matter for another post.

BatchMapper v0.1

Sunday, July 5th, 2009

I just released the first working version of a new tool called batchmapper. This tool lets you take a list of gene, protein or metabolite identifiers from one database and translate them to a different database.

Why is this useful? I’ll explain for metabolites, although the story is really the same for genes and proteins. Metabolites are the chemical compounds that you find naturally in the human body. Of course a lot of research is being done on metabolites, and the collected wisdom is available in a number of online databases, such as KEGG in Japan, PubChem in the USA, ChEBI in the UK and HMDB in Canada.

The glut of online databases has led to a Tower of Babel of metabolite identifiers. Glucose, one of the most important compounds in our body, may be known as HMDB00122 in Canada, C00031 in Japan, 5793 in the USA or 17634 in the UK.

batchmapper is a spin-off from recent work done by JJ and me. It’s a command-line tool, so it’s not very user-friendly, but it is fast, flexible and completely automatic. The translation tables can be provided in the form of text files, relational databases or web services, or even a combination thereof. This early release is completely functional. Check out the tutorial, and leave some comments here on this blog.
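
To give an idea of what such a translation boils down to, here is a minimal sketch in Python. It is not batchmapper’s code: the tab-separated table layout and the function names are assumptions made for this example, and HMDB99999 is a made-up identifier that stands for an entry missing from the table.

```python
import csv
import io

# A hypothetical tab-separated translation table. The column layout is an
# assumption for this sketch, not batchmapper's actual input format.
TABLE = (
    "HMDB\tKEGG\tPubChem\tChEBI\n"
    "HMDB00122\tC00031\t5793\t17634\n"
)


def load_table(text):
    """Parse the translation table into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def translate(ids, rows, source, target):
    """Translate identifiers from the source column to the target column.

    Identifiers that are not in the table are returned as None.
    """
    lookup = {row[source]: row[target] for row in rows}
    return [lookup.get(identifier) for identifier in ids]


rows = load_table(TABLE)
print(translate(["HMDB00122", "HMDB99999"], rows, "HMDB", "ChEBI"))
```

A real tool also has to cope with identifiers that map to several targets, which this sketch ignores.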

It would be nice if all the online metabolite databases worked together and merged into a single resource, but I don’t see that happening in the near future. At least batchmapper helps to make the problem a little more manageable.