More about identifiers.org URI's for BioPAX

In a previous post, I explained that a BioPAX document is really an RDF graph. And with that in mind, you can do interesting things like inferring identifiers.org URI's using a SPARQL CONSTRUCT query.

What I didn't explain is that, after adding those new inferences, the result is no longer valid BioPAX. RDF gives you lots of freedom, as well as lots of rope to hang yourself with. BioPAX has some restrictions in place that are necessary for exchange of pathway data.

Let me explain in more detail. Take a look at the BioPAX snippet below. This snippet represents more or less the same information as the first figure from my previous post. It represents Protein186961, with a bp:xref property pointing to id4, which is a UnificationXref with bp:db property FlyBase and bp:id property FBgn0034356.

 <bp:ProteinReference rdf:about="Protein186961">
  <bp:xref rdf:resource="id4" />
 </bp:ProteinReference>

 <bp:UnificationXref rdf:about="id4">
  <bp:id rdf:datatype="xsd:string">FBgn0034356</bp:id>
  <bp:db rdf:datatype="xsd:string">FlyBase</bp:db>
 </bp:UnificationXref>

After the SPARQL CONSTRUCT query, the newly inferred URI's are added back to the graph. The results looks more or less like this:

<bp:ProteinReference rdf:about="Protein186961">
 <bp:xref rdf:resource="id4" />
 <bp:xref rdf:resource="http://identifiers.org/flybase/FBgn0034356"/>
</bp:ProteinReference>

As you can see, Protein186961 now has two bp:xref properties. This kind of duplication may cause problems for software. Furthermore, the new bp:xref property doesn't have the correct type (UnificationXref), and it doesn't have values for bp:db and bp:id, because our CONSTRUCT query didn't say anything about them. Yet well-behaving pathway software might quite reasonably be looking for that information.

Running inferences on an RDF store gives you lots of power, but it's not necessarily good for standardization. If you are running a large pathway database, you might want to enforce some restrictions. The online BioPAX validator created by Igor Rodchenkov et al. is the gold standard for producing correct, manageable BioPAX. Running it on the second snippet leads to this error:

But what if you want to have Identifiers.org URI's, but you also want to keep your BioPAX valid? It's easy - the UnificationXref in the first snippet used id4 as resource identifier. Id4 is just an arbitrary value - we can easily replace that with something better. But instead of running a construct query, it's a matter of modifying your BioPAX generating code to write out identifiers.org URI's where possible. The result could look like the snippet below. Admittedly, the result has a bit of redundancy, with the two references to FBgn0034356. But that is a small price to pay. The new version has identifiers.org goodness ready for SPARQL integration magic, yet it's still standard compliant so that mundane software can cope with it too.

 <bp:ProteinReference rdf:about="Protein186961">
  <bp:xref rdf:resource="http://identifiers.org/flybase/FBgn0034356" />
 </bp:ProteinReference>

 <bp:UnificationXref rdf:about="http://identifiers.org/flybase/FBgn0034356">
  <bp:id rdf:datatype="xsd:string">FBgn0034356</bp:id>
  <bp:db rdf:datatype="xsd:string">FlyBase</bp:db>
 </bp:UnificationXref>