Inferring identifiers.org URIs for BioPAX

Here is a useful data-integration trick involving BioPAX and identifiers.org.

BioPAX is a pathway exchange format. It is known for being somewhat complicated, but at the most basic level it is simple: BioPAX is made up of subject-predicate-object triples. Together these triples form a graph. Thus, a BioPAX document is nothing more than a large graph. Here is a small fragment to illustrate:

Here you see a particular BiochemicalReaction, which is catalysed by a particular Protein [1]. Both the BiochemicalReaction and the Protein have numbers acting as local identifiers, which are quite useless outside this BioPAX document. To identify this particular protein in the wild, we must look at its Xref, which refers to a database (FlyBase) and an identifier (FBgn0034356). [2]
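Because the fragment is just a handful of triples, it can also be written out as plain Python tuples. This is only a minimal sketch: the local identifiers (BiochemicalReaction1, Protein2, Xref3) are made-up placeholders, and the predicate names follow the simplified picture described here rather than the exact BioPAX vocabulary.

```python
# A tiny graph as subject-predicate-object triples.
# The local identifiers below are hypothetical placeholders.
triples = [
    ("BiochemicalReaction1", "controlledBy", "Protein2"),
    ("Protein2", "xref", "Xref3"),
    ("Xref3", "db", "FlyBase"),
    ("Xref3", "id", "FBgn0034356"),
]


def objects(graph, subject, predicate):
    """Return all objects reachable from `subject` via `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]


# Follow the protein's Xref to find its database and identifier.
# Note that two separate lookups are needed, one per node.
for xref in objects(triples, "Protein2", "xref"):
    db = objects(triples, xref, "db")[0]
    ident = objects(triples, xref, "id")[0]
    print(db, ident)
```

Even in this toy form, identifying the protein takes two lookups on the Xref node, one for the database and one for the identifier.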

You have to imagine that this graph is much larger than just the snippet shown above, and contains lots of interesting information. And we can make it even more interesting by fetching information about this protein from external databases and integrating it into this graph.

The trouble is that the Xref is stored in two nodes: one for the identifier and one for the database. This makes data integration cumbersome, requiring comparison of two nodes at the same time. It would be more efficient to merge this data into a single node.

One possible solution is to simply concatenate the database and identifier and put that into a new node. For example, here is just one way we could do that:

FlyBase~FBgn0034356

But we can do even better: if we combine the two nodes into a single URI (Uniform Resource Identifier) from identifiers.org, we gain the added advantage of having a resolvable URI. That means that the identifier is also a link which you can open in a browser, which is just incredibly neat.

http://identifiers.org/flybase/FBgn0034356

(Go ahead and open it: http://identifiers.org/flybase/FBgn0034356).
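Building such a URI from the two Xref values is a one-liner in most languages. The sketch below does it in Python; the NAMESPACES table and the function name identifiers_org_uri are assumptions made for illustration, and the table only covers the two databases mentioned in this post.

```python
from urllib.parse import quote

# Assumed mapping from BioPAX db names to identifiers.org namespace prefixes.
# Only the two databases mentioned in this post are listed.
NAMESPACES = {
    "FlyBase": "flybase",
    "UniProt": "uniprot",
}


def identifiers_org_uri(db, identifier):
    """Combine a database name and an identifier into one identifiers.org URI.

    Returns None when the database is not in the (hypothetical) mapping,
    so that unknown databases are skipped rather than guessed at.
    """
    prefix = NAMESPACES.get(db)
    if prefix is None:
        return None
    # Percent-encode anything that is not safe inside a URI path segment.
    return "http://identifiers.org/{}/{}".format(prefix, quote(identifier, safe=":"))


print(identifiers_org_uri("FlyBase", "FBgn0034356"))
```

Returning None for an unknown database is a deliberate choice in this sketch: a wrong URI would silently link to the wrong record, which is worse than no link at all.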

We can create these URIs directly in the triple store using a SPARQL CONSTRUCT query. SPARQL is a query language for graphs: it looks for patterns in the graph, and in the case of CONSTRUCT queries, it generates new triples which can be added back into the graph. The following query generates identifiers.org URIs for UniProt Xrefs. Unfortunately this query only works on the Virtuoso triple store, because of the whole "bif:sprintf..." incantation, which is non-standard SPARQL. On other triple stores, the standard SPARQL 1.1 functions IRI and CONCAT, combined with a BIND in the WHERE clause, should give an equivalent result.

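PREFIX BP: <http://www.biopax.org/release/biopax-level3.owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# For each UniProt Xref, attach an identifiers.org URI directly to ?x.
CONSTRUCT {
    ?x BP:xref
    `bif:sprintf_iri ("http://identifiers.org/uniprot/%s", ?id)`
}
WHERE {
    ?x BP:xref ?blank .
    ?blank BP:id ?id .
    ?blank BP:db "UniProt"^^xsd:string
}
LIMIT 10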

If you try that, you will get a set of new triples, which looks like this when viewed in the browser:

xsdh http://www.w3.org/2001/XMLSchema#
n2 http://biocyc.org/biopax/biopax-level3#
n4 http://identifiers.org/uniprot/
n3 http://www.biopax.org/release/biopax-level3.owl#
rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#

Subject Item
n2:Protein220382
n3:xref
n4:P23884

Subject Item
n2:Protein193864
n3:xref
n4:Q9W330
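If you don't have a Virtuoso instance at hand, the pattern that the CONSTRUCT query matches can be mimicked in plain Python. This is a rough sketch and not how a triple store works internally: the blank node labels "_:b1" and "_:b2" and the function name construct_uniprot_xrefs are invented for illustration, and it assumes each Xref node carries exactly one id.

```python
BP = "http://www.biopax.org/release/biopax-level3.owl#"

# Hypothetical input triples; "_:b1" and "_:b2" stand in for the Xref nodes.
graph = [
    ("Protein220382", BP + "xref", "_:b1"),
    ("_:b1", BP + "id", "P23884"),
    ("_:b1", BP + "db", "UniProt"),
    ("Protein193864", BP + "xref", "_:b2"),
    ("_:b2", BP + "id", "Q9W330"),
    ("_:b2", BP + "db", "UniProt"),
]


def construct_uniprot_xrefs(triples):
    """Mimic the CONSTRUCT query: one new xref triple per UniProt Xref."""
    # ?blank BP:id ?id  (assumes a single id per Xref node)
    ids = {s: o for s, p, o in triples if p == BP + "id"}
    # ?blank BP:db "UniProt"
    uniprot = {s for s, p, o in triples if p == BP + "db" and o == "UniProt"}
    new_triples = []
    for s, p, o in triples:
        # ?x BP:xref ?blank
        if p == BP + "xref" and o in uniprot and o in ids:
            uri = "http://identifiers.org/uniprot/" + ids[o]
            new_triples.append((s, BP + "xref", uri))
    return new_triples


for triple in construct_uniprot_xrefs(graph):
    print(triple)
```

As with the CONSTRUCT query, the new triples are returned separately and the original graph is left untouched, so you can decide whether to add them back.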

If you want, you can try this for yourself on our live triple store with preloaded BioPAX data. Here is our live SPARQL endpoint. If you scroll down on that page, you will see a few more SPARQL queries to try. To learn more, please see my presentation at the SPIN-OSS conference.

Footnotes:

[1] In standard BioPAX, there is a Catalysis object between a Protein and a BiochemicalReaction. The controlledBy relation must be inferred.
[2] Ignore for the moment that we're using a gene identifier for a protein.