Posts Tagged ‘semantic web’

General SPARQL app for Cytoscape

Saturday, March 21st, 2015

We can now easily solve the problem of bioinformatics data integration. But how do we put that data in the hands of scientists?

At General Bioinformatics we put data in triple stores and use SPARQL to query it. Triple stores are great for data integration, but integrating data is only half of the problem: we also have to present it to scientists. The problem isn’t that SPARQL is hard to use per se (it’s really rather plain and sensible). The problem is that SPARQL is supposed to be only a piece of plumbing at the bottom of a software stack. We shouldn’t expect scientists to write SPARQL queries any more than we expect them to carry adjustable pliers to a restroom visit.

The General SPARQL app is one of the new ways to present triple data.

How do you use it?

The app lets you build a network step by step. Nodes and edges can be added to a network in a piecemeal fashion. Nodes can represent various biological entities, such as: a pathway, a protein, a reaction, or a compound. Edges can represent any type of relation between those entities.

For example, you can start by searching for a protein of interest. The app places a single node in your network. You can then right-click on this node to pull in related entities: all the pathways that are related to your protein, all the Gene Ontology annotations, all the reactions that your protein is part of, or the gene that encodes your protein. And you can continue this process, jumping from one entity to the next.

Watch this screencast and it will start to make sense:

How does it work?

In the background, the General SPARQL app maintains a list of SPARQL queries. Each item in the search menu, and each item in the context (right-click) menu, is backed by one SPARQL query. When you click on them, a query is sent off in the background, and the result is mapped to your network according to certain rules.
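To make that idea concrete, here is a minimal Python sketch of a menu backed by queries. It is not the app's actual code: the menu label, the `ex:partOfPathway` predicate, the example pathway URI, and all function names are made up for illustration, and the query results are canned rows rather than a live endpoint response.

```python
from dataclasses import dataclass, field


@dataclass
class Network:
    """A tiny stand-in for a Cytoscape network."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (source, label, target) tuples


# Hypothetical registry: each context-menu item is backed by one query template.
CONTEXT_MENU = {
    "Pathways for this protein": {
        # ex:partOfPathway is an invented predicate, not a real vocabulary term.
        "query": "SELECT ?target WHERE {{ <{node}> ex:partOfPathway ?target }}",
        "edge_label": "part_of_pathway",
    },
}


def build_query(menu_item, node_uri):
    """Fill the clicked node's URI into the query behind the menu item."""
    return CONTEXT_MENU[menu_item]["query"].format(node=node_uri)


def apply_results(network, menu_item, node_uri, rows):
    """Mapping rule: every result row becomes a new node plus an edge to it."""
    label = CONTEXT_MENU[menu_item]["edge_label"]
    network.nodes.add(node_uri)
    for row in rows:
        network.nodes.add(row["target"])
        network.edges.add((node_uri, label, row["target"]))
    return network


net = Network()
protein = "http://identifiers.org/uniprot/P23884"
query = build_query("Pathways for this protein", protein)
# Canned rows standing in for what a SPARQL endpoint would return.
rows = [{"target": "http://example.org/pathway/1"}]
apply_results(net, "Pathways for this protein", protein, rows)
print(query)
print(sorted(net.nodes))
```

Keeping the queries in a plain table like this is what makes it possible to swap in your own set without touching the rest of the program.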

When you first install the app, it comes pre-configured with a basic set of SPARQL queries, although it’s possible to provide your own set. The initial set is designed to work with public bioinformatics SPARQL endpoints provided by the EBI and Bio2RDF. But as great as these resources are, public triple stores can sometimes be overloaded. The app works with privately managed triple stores just as well.

Where can I find it?

The easiest way to get the app is from the Cytoscape App Manager. Just install Cytoscape 3.0, start it, go to Apps->App Manager in the menu, and search for “General SPARQL”. Or download it from the app store website. What’s even better is that the source code is available on GitHub.

Also, if you have a chance, come see my poster at Vizbi 2015 in Boston.

Inferring identifiers.org URIs for BioPAX

Friday, November 16th, 2012

Here is a useful data-integration trick involving BioPAX and identifiers.org.

BioPAX is a pathway exchange format – it is known for being somewhat complicated, but at the very basic level it’s simple: BioPAX is made up of subject-predicate-object triples. Together these triples form a graph. Thus, a BioPAX document is nothing more than a large graph. Here is a small fragment to illustrate:

Here you see a particular BiochemicalReaction, which is catalysed by a particular Protein [1]. Both the BiochemicalReaction and the Protein have a number acting as local identifiers – they are quite useless outside this BioPAX document. To identify this particular protein in the wild, we must look at its Xref, which refers to a database (FlyBase), and an identifier (FBgn0034356). [2]

You have to imagine that this graph is much larger than just the snippet shown above, and contains lots of interesting information. And we can make it even more interesting by fetching information about this protein from external databases and integrating it into this graph.

The trouble is that the Xref is stored in two nodes: one for the identifier and one for the database. This makes data integration cumbersome, requiring comparison of two nodes at the same time. It would be more efficient to merge this data into a single node.

One possible solution is to simply concatenate the database and identifier and put that into a new node. For example, here is just one way we could do that:

FlyBase~FBgn0034356

But we can do even better: if we combine the two nodes into a single URI (Uniform Resource Identifier) from identifiers.org, we gain the added advantage of having a resolvable URI. That means that the identifier is also a link which you can open in a browser, which is just incredibly neat.

http://identifiers.org/flybase/FBgn0034356

(Go ahead and open it: http://identifiers.org/flybase/FBgn0034356).
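Merging the two Xref fields into one URI is simple enough to sketch in a few lines of Python. The lookup table and function name below are my own invention for illustration; the only two namespaces listed are the ones that appear in this post, and a real table would need an entry for every database you care about.

```python
from urllib.parse import quote

# Assumed mapping from the database name found in a BioPAX Xref to the
# namespace used in identifiers.org URIs. Only these two occur in this post.
NAMESPACES = {
    "FlyBase": "flybase",
    "UniProt": "uniprot",
}


def identifiers_org_uri(db, identifier):
    """Combine a database name and an identifier into a single URI."""
    if db not in NAMESPACES:
        raise ValueError("no identifiers.org namespace known for " + repr(db))
    # Percent-encode anything unusual, but leave ':' alone.
    return "http://identifiers.org/{}/{}".format(
        NAMESPACES[db], quote(identifier, safe=":"))


print(identifiers_org_uri("FlyBase", "FBgn0034356"))
```

With a single string per Xref, comparing two cross-references becomes a plain equality test instead of a two-node comparison.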

We can create these URIs directly in the triple store using a SPARQL CONSTRUCT query. SPARQL is a query language for graphs: it looks for patterns in the graph, and in the case of CONSTRUCT queries, new triples are generated which can be added back into the graph. The following query generates identifiers.org URIs for UniProt Xrefs. Unfortunately this query only works on the Virtuoso triple store, because of the whole “bif:sprintf…” incantation, which is non-standard SPARQL. Presumably equivalent functions exist for other triple stores; in SPARQL 1.1, combining the standard IRI() and CONCAT() functions in a BIND should give the same result.

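PREFIX BP: <http://www.biopax.org/release/biopax-level3.owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# For every UniProt Xref, emit a direct xref to an identifiers.org URI
CONSTRUCT {
    ?x BP:xref `bif:sprintf_iri (
    "http://identifiers.org/uniprot/%s", ?id)`
}
WHERE {
   ?x BP:xref ?blank .
   ?blank BP:id ?id .
   ?blank BP:db "UniProt"^^xsd:string
}
LIMIT 10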

If you try that, you will get a set of new triples, which looks like this when viewed in the browser:

Prefix  Namespace
xsdh    http://www.w3.org/2001/XMLSchema#
n2      http://biocyc.org/biopax/biopax-level3#
n4      http://identifiers.org/uniprot/
n3      http://www.biopax.org/release/biopax-level3.owl#
rdf     http://www.w3.org/1999/02/22-rdf-syntax-ns#

Subject             Predicate   Object
n2:Protein220382    n3:xref     n4:P23884
n2:Protein193864    n3:xref     n4:Q9W330

If you want, you can try this yourself on our live triple store with preloaded BioPAX data. Here is our live SPARQL endpoint. If you scroll down on that page, you will see a few more SPARQL queries to try. To learn more, please see my presentation at the SPIN-OSS conference.
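If you don't have a triple store at hand, the effect of the CONSTRUCT query can be imitated in plain Python over a list of triples. This is only a rough sketch of the pattern matching, not how a triple store works internally. The blank node labels and the second protein (`ex:ProteinExample` with a FlyBase Xref) are invented so that there is something for the filter to reject.

```python
BP = "http://www.biopax.org/release/biopax-level3.owl#"


def construct_uniprot_xrefs(triples):
    """Return new (subject, BP:xref, identifiers.org URI) triples.

    Mirrors the WHERE pattern: ?x BP:xref ?blank, where ?blank has a
    BP:id and a BP:db equal to "UniProt".
    """
    db_of = {s: o for s, p, o in triples if p == BP + "db"}
    id_of = {s: o for s, p, o in triples if p == BP + "id"}
    new_triples = []
    for s, p, o in triples:
        if p == BP + "xref" and db_of.get(o) == "UniProt" and o in id_of:
            uri = "http://identifiers.org/uniprot/" + id_of[o]
            new_triples.append((s, BP + "xref", uri))
    return new_triples


# Toy graph: one UniProt Xref and one (made-up) FlyBase Xref.
triples = [
    ("http://biocyc.org/biopax/biopax-level3#Protein220382", BP + "xref", "_:b1"),
    ("_:b1", BP + "db", "UniProt"),
    ("_:b1", BP + "id", "P23884"),
    ("ex:ProteinExample", BP + "xref", "_:b2"),
    ("_:b2", BP + "db", "FlyBase"),
    ("_:b2", BP + "id", "FBgn0034356"),
]

new_triples = construct_uniprot_xrefs(triples)
for triple in new_triples:
    print(triple)
```

Only the UniProt Xref produces a new triple; the FlyBase one is skipped, just as the `BP:db "UniProt"` pattern in the query would skip it.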


Footnotes:

  • [1] In standard BioPAX, there is a Catalysis object between a Protein and a BiochemicalReaction. The controlledBy relation must be inferred.
  • [2] Ignore for the moment that we’re using a gene identifier for a protein.

Notes from Vizbi: automation in Cytoscape

Monday, March 5th, 2012

Cytoscape is a popular network visualisation and analysis tool. It’s great because it’s so easy to create plug-ins. Today I was fortunate enough to attend the Cytoscape developer workshop at Vizbi 2012, where I learned a few new things.

Firstly, one of my goals was to find out about the current state of Cytoscape development. Cytoscape is a great tool as long as you don’t look too closely at what’s going on inside. The upcoming third version promises to fix all the minor and major problems that exist under the hood. But Cytoscape 3 has been in the making for a long time. As a plug-in developer, you have to choose between something that works right now, but will go away eventually, or something that is clearly the future, but might take a long time to materialise.

The feeling I got from the workshop is that there is light at the end of the Cytoscape 3 tunnel. For a plug-in developer with a deadline, it’s probably best to stick with the current version for now. But if you’re not under pressure to release, it’s definitely possible to write for Cytoscape 3 and make use of a nicer and more pleasant working environment.

Besides that news, I learned some cool new tricks. Using Cytoscape Commands you can write simple macros for repetitive tasks. For example, to generate the network below, first you have to import a SIF (Simple Interaction Format) file, then import a file with node attributes, then apply a layout, and then apply a visual style. If you have to do this a couple of times it gets quite tedious. But here is how all that can be automated:

Take the following SIF data and save it, using a text editor, as network.sif:

Martijn is_involved_with LibSBGN
Chaouiya is_involved_with SBML-qual
Martijn is_involved_with SBML-qual
Martijn is_involved_with BioPreDyn
Emanuel is_involved_with LibSBGN
Emanuel is_funded_by Erasmus
Martijn is_funded_by FP7

Here are the node attributes; save them as node_types.txt:

type
LibSBGN=Project
BioPreDyn=Project
Chaouiya=Collaborator
SBML-qual=Project
Martijn=Member
Emanuel=Member
FP7=Funding
Erasmus=Funding

For the visual style, I created one in Cytoscape and saved it as style.props, using Export->Vizmap property file. And here is the magic bit: If you save the above three files in your work directory, then you can generate that picture with the script below.

network import file=network.sif
layout force-directed
node import attributes file=node_types.txt
vizmap import file=style.props

Run it from within Cytoscape with Plugins->Command Tool->Run script…, or from the command line with

./cytoscape.sh -S scriptfile
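If you need to regenerate these files often, writing them out can itself be scripted. Here is a small Python sketch that drops network.sif, node_types.txt and the command script into a work directory. The function name is my own, and the temporary directory is only there so the example runs anywhere. Note that style.props is not created here: it still has to be exported from Cytoscape by hand, as described above.

```python
import tempfile
from pathlib import Path

NETWORK_SIF = """\
Martijn is_involved_with LibSBGN
Chaouiya is_involved_with SBML-qual
Martijn is_involved_with SBML-qual
Martijn is_involved_with BioPreDyn
Emanuel is_involved_with LibSBGN
Emanuel is_funded_by Erasmus
Martijn is_funded_by FP7
"""

NODE_TYPES = """\
type
LibSBGN=Project
BioPreDyn=Project
Chaouiya=Collaborator
SBML-qual=Project
Martijn=Member
Emanuel=Member
FP7=Funding
Erasmus=Funding
"""

SCRIPT = """\
network import file=network.sif
layout force-directed
node import attributes file=node_types.txt
vizmap import file=style.props
"""


def write_cytoscape_inputs(workdir):
    """Write the network, the node attributes and the command script."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    files = {
        "network.sif": NETWORK_SIF,
        "node_types.txt": NODE_TYPES,
        "scriptfile": SCRIPT,
    }
    for name, text in files.items():
        (workdir / name).write_text(text, encoding="utf-8")
    return sorted(files)


# A throwaway directory for demonstration; point this at your real work directory.
workdir = Path(tempfile.mkdtemp())
written = write_cytoscape_inputs(workdir)
print(workdir, written)
```

After copying style.props into the same directory, the script can be run from there with the command line shown above.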

The “Why” of the Identifier Mapping Problem

Tuesday, August 11th, 2009

I wrote before about my current work on identifier mapping. Briefly, each of the many different databases for genes and metabolites uses its own system of identifiers. This creates big headaches when you want to compare things from different databases. You’ll have to do some work to correlate them, which is what we call the identifier mapping problem.

Why does this problem exist in the first place? Wouldn’t it be really fantastic if everybody always used the same identifiers everywhere? I don’t think that’s ever going to happen. There are practical reasons for that, but there are also fundamental problems that can never be solved.

Scientific databases are organized in a way that reflects the mindset of the scientists that created them. I noticed the same argument in an essay by Clay Shirky about the semantic web:

Because meta-data describes a worldview, incompatibility is an inevitable by-product of vigorous argument. It would be relatively easy, for example, to encode a description of genes in XML, but it would be impossible to get a universal standard for such a description, because biologists are still arguing about what a gene actually is. There are several competing standards for describing genetic information, and the semantic divergence is an artifact of a real conversation among biologists. You can’t get a standard til you have an agreement, and you can’t force an agreement to exist where none actually does.

[Figures: Lactic Acid, Lactate]

Here is another example from the context of bioinformatics: a chemist might create separate identifiers for lactate and lactic acid. To a chemist, these are two different things: lactate is missing a hydrogen atom, and it’s even negatively charged. But when dissolved in water these two rapidly convert into each other, making them practically indistinguishable. So a chemistry-oriented database such as ChEBI describes them separately (CHEBI:24996 and CHEBI:28358), whereas a biological database such as HMDB puts both in a single record (HMDB00190). World views have affected the way these databases are set up.
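The practical consequence for identifier mapping shows up as soon as you write the mapping down. The Python sketch below uses only the three identifiers mentioned above; the table and function names are made up for illustration. Going from ChEBI to HMDB is many-to-one, so the reverse direction cannot return a single answer.

```python
# The two ChEBI records from the text both map to the single HMDB record.
CHEBI_TO_HMDB = {
    "CHEBI:24996": "HMDB00190",
    "CHEBI:28358": "HMDB00190",
}


def invert(mapping):
    """Invert a mapping, collecting all sources that share a target."""
    inverse = {}
    for source, target in mapping.items():
        inverse.setdefault(target, []).append(source)
    return inverse


HMDB_TO_CHEBI = invert(CHEBI_TO_HMDB)

# Forward: two distinct chemical entities collapse into one record.
same_in_hmdb = CHEBI_TO_HMDB["CHEBI:24996"] == CHEBI_TO_HMDB["CHEBI:28358"]

# Backward: one record fans out into two candidates, so a round trip
# from ChEBI to HMDB and back does not recover the starting identifier.
candidates = sorted(HMDB_TO_CHEBI["HMDB00190"])

print(same_in_hmdb, candidates)
```

Any mapping tool has to decide what to do with that ambiguity, and no choice of shared identifier scheme makes it go away.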

By the way, the article quoted above is also an argument against the whole idea of the Semantic Web of Life Sciences (SWLS), but that’s subject matter for another post.