Installing the R kernel for Jupyter on linux

Monday, March 7th, 2016

To paraphrase a saying:

Give a scientist a script and they will analyse data for a day. Teach a scientist to script and they won’t have any more time to do analysis for the rest of their lifetime.

Scripts are a great way to make reproducible workflows, but they are too technical for many situations where you have to report to scientists. Jupyter notebooks are a great way to do an analysis, and report the results at the same time. A Jupyter notebook can contain the analysis, the results, and the documentation that explains the results together in a single file, making it at once understandable and reproducible.

Jupyter started its life as IPython, or “interactive Python”. But once support for languages other than Python was added, the project had to be renamed. In principle you can install support for a wide range of scripting languages, but in practice it may be a little difficult to set up. Jupyter works with multiple ‘kernels’: to get support for a different language you have to install that language, and then install the Jupyter kernel for it. It took me a while to get that working for the R scripting language. What follows are some notes I took during that process, in the hope that they are useful for anybody else trying to do the same thing.

So here is the ‘howto’

First you need to have R and Jupyter installed, but I’m assuming you already got that far. Anyway, that is the easy bit.

The R kernel for Jupyter is available here, and the installation instructions are on the same page. I’ve copied them here for convenience:

repos = c('', getOption('repos')))

When R is installing one of the dependencies, rzmq, on linux, you immediately run into Problem #1. It will complain:

missing header zmq.h

This happens quite often when installing R packages. It happens when the R package is a wrapper for a C library, and needs the development version of that C library to compile the wrapper. In general, the pattern to solve such a problem is simple. When you encounter:

Missing header xxx.h

the solution is to install the development version of the library with

apt-get install libxxx-dev
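
The naming pattern can be written down as a tiny helper. This is only a sketch of the rule of thumb above, written in Python; the function name guess_dev_package is made up for illustration, and the result is a first guess rather than a guaranteed package name.

```python
def guess_dev_package(header):
    """Guess a Debian/Ubuntu -dev package name from a missing header file name."""
    # Strip any directory part and the extension: "zmq.h" -> "zmq"
    name = header.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    # Some headers already start with "lib"; avoid producing "liblibfoo-dev"
    if not name.startswith("lib"):
        name = "lib" + name
    return name + "-dev"


print(guess_dev_package("zmq.h"))  # prints libzmq-dev
```

If the guess turns out wrong, searching the package index for the header file name is the more reliable route.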

But then you get to Problem #2

sudo apt-get install libzmq-dev
... try installing the rzmq package again
interface.cpp:123:49: error: call to zmq::context_t::context_t(int, int) uses the default argument for parameter 2, which is not yet defined
zmq::context_t* context = new zmq::context_t(1);

Annoying! It turns out that there are multiple versions of zmq, and you need to install the right one:

sudo apt-get install libzmq3-dev

We’re not there yet. When trying to install rzmq again, we run into Problem #3:

g++ -I/usr/share/R/include -DNDEBUG -I../inst/cppzmq -fpic -O3 -pipe -g -c interface.cpp -o interface.o
interface.cpp:23:14: error: expected constructor, destructor, or type conversion before '(' token
static_assert(ZMQ_VERSION_MAJOR >= 3,"The minimum required version of libzmq is 3.0.0.");

Apparently there is a check for the version of libzmq, but it isn’t working. The installation doesn’t fail because we have the wrong version of libzmq, and libzmq isn’t misreporting its own version either. The compiler simply doesn’t understand the check itself: static_assert was introduced in C++11 (the 2011 version of C++), which is not the default for this compiler, so the line is reported as a syntax error. That is a weird internal error that shouldn’t occur in a published library. We’ll have to fix rzmq. First get the source code:

git clone
cd rzmq

We have to edit one line of src/Makevars, as you can see from the following ‘patch’:

diff --git a/src/Makevars b/src/Makevars
index 7d6771c..c6fdce6 100644
--- a/src/Makevars
+++ b/src/Makevars
@@ -1,5 +1,5 @@
## -*- mode: makefile; -*-

-PKG_CPPFLAGS = -I../inst
+PKG_CPPFLAGS = -I../inst -std=c++11
PKG_LIBS = -lzmq
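
If you prefer not to edit the file by hand, the same one-line change can be scripted. The following Python sketch is just an illustration of the patch above; the helper name add_cxx11_flag is my own, and it assumes the Makevars file looks like the one in the diff.

```python
def add_cxx11_flag(makevars_text):
    """Append -std=c++11 to the PKG_CPPFLAGS line unless it is already there."""
    patched = []
    for line in makevars_text.splitlines():
        if line.startswith("PKG_CPPFLAGS") and "-std=c++11" not in line:
            line = line.rstrip() + " -std=c++11"
        patched.append(line)
    return "\n".join(patched) + "\n"


# The content of src/Makevars as shown in the diff above
original = "## -*- mode: makefile; -*-\n\nPKG_CPPFLAGS = -I../inst\nPKG_LIBS = -lzmq\n"
print(add_cxx11_flag(original))
```

Running the function twice gives the same result as running it once, so it is safe to apply to an already patched file.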

Now let’s install our own hacked package:

R CMD build .
R CMD INSTALL rzmq_0.8.0.tar.gz

(btw, why is build lowercase and INSTALL uppercase? This is exactly the type of thing that keeps R from being my favourite scripting language.)

We’re still not there.

Problem #4 – we installed the very latest rzmq fresh from source code, which now requires R version 3.1.0. I’m using a Long-Term-Support (LTS) version of Linux Mint and I don’t want to switch R versions as that could create a hassle elsewhere. Does rzmq really require 3.1.0, or does it merely say it does because it’s pretending to be cutting edge? Let’s hack it up some more.

git diff
index ba50840..228e3e0 100644
@@ -5,7 +5,7 @@ Maintainer: Whit Armstrong <>
Author: Whit Armstrong <>
Description: Interface to the ZeroMQ lightweight messaging kernel (see <> for more information).
License: GPL-3
-Depends: R (>= 3.1.0)
+Depends: R (>= 3.0.0)
SystemRequirements: ZeroMQ >= 3.0.0 libraries and headers (see <>; Debian packages libzmq3, libzmq3-dev, Fedora packages zeromq3, zeromq3-devel)

And now the R kernel installs without further problems for me. I haven’t noticed any incompatibilities with earlier R versions yet; R 3.0.0 seems to be just fine.

General SPARQL app for Cytoscape

Saturday, March 21st, 2015

We can now easily solve the problem of bioinformatics data integration. But how do we put that data in the hands of scientists?

At General Bioinformatics we put data in triple stores, and use SPARQL to query that data. Triple stores are great for data integration, but you still have to figure out how to put that data in the hands of scientists. Integrating data is only half of the problem; we also have to present that data. The problem isn’t that SPARQL is hard to use per se (it’s really rather plain and sensible). The problem is that SPARQL is supposed to be only a piece of plumbing at the bottom of a software stack. We shouldn’t expect scientists to write SPARQL queries any more than we expect them to carry adjustable pliers to a restroom visit.

The General SPARQL app is one of the new ways to present triple data.

How do you use it?

The app lets you build a network step by step. Nodes and edges can be added to a network in a piecemeal fashion. Nodes can represent various biological entities, such as: a pathway, a protein, a reaction, or a compound. Edges can represent any type of relation between those entities.

For example, you can start by searching for a protein of interest. The app places a single node in your network. You can then right-click on this node to pull in related entities. For example, all the pathways that are related to your protein. Or all the Gene Ontology annotations. Or all the reactions that your protein is part of. Or the gene that encodes for your protein. And you can continue this process, jumping from one entity to the next.

Watch this screencast and it will start to make sense:

How does it work?

In the background, the General SPARQL app maintains a list of SPARQL queries. Each item in the search menu, and each item in the context (right-click) menu, is backed by one SPARQL query. When you click on them, a query is sent off in the background, and the result is mapped to your network according to certain rules.
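
To give a feeling for what “mapping a result to the network” can mean, here is a deliberately simplified Python sketch. It is not the app’s actual code or rule set: the function name, the dictionary layout and the assumption that every result row is a (source, target) pair are all invented for illustration.

```python
def add_results_to_network(network, rows, interaction):
    """Merge query result rows into a network of nodes and edges.

    rows is an iterable of (source_id, target_id) pairs, which is an
    assumption of this sketch, not a description of the real app.
    """
    for source, target in rows:
        network["nodes"].update([source, target])             # sets ignore duplicates
        network["edges"].add((source, interaction, target))
    return network


# Start with a single protein node, then "right-click" to pull in its pathways
network = {"nodes": {"ProteinA"}, "edges": set()}
rows = [("ProteinA", "Pathway1"), ("ProteinA", "Pathway2")]
add_results_to_network(network, rows, "part_of")
print(sorted(network["nodes"]))
```

Because nodes and edges are kept in sets, running the same query twice does not duplicate anything, which is what makes the piecemeal, step-by-step style of network building possible.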

When you first install the app, it comes pre-configured with a basic set of SPARQL queries, although it’s possible to provide your own set. The initial set is designed to work with public bioinformatics SPARQL endpoints provided by the EBI and Bio2RDF. But as great as these resources are, public triple stores can sometimes be overloaded. The app works with privately managed triple stores just as well.

Where can I find it?

The easiest way to get the app is simply from the Cytoscape App manager. Just install Cytoscape 3.0, start it, and go to menu->Apps->App Manager and search for “General SPARQL”. Or download it from the app store website. What’s even better is that the source code is available on GitHub.

Also, if you have a chance, come see my poster at Vizbi 2015 in Boston.

Proxy configuration for Cytoscape

Tuesday, June 11th, 2013

In large companies, you often find that direct web access is blocked: you have to ask a proxy server to request web pages on your behalf. (The proxy also does stuff like scanning for viruses and malware.) As a consequence, all the software on your computer needs to be configured to be proxy-aware. This is usually done for you, but bioinformaticians tend to use “non-standard” software that you’ll have to configure yourself.

If you are using Cytoscape 2.X or 3.0 behind a proxy, and you know your proxy settings, you may find the following useful.

Cytoscape has a “proxy server settings” dialog, as described in the manual. The problem is that it doesn’t work – it stores the proxy settings in a special way that only some bits of Cytoscape are aware of. It does not work for plug-ins (sorry, “apps”) that make use of off-the-shelf Java libraries.

Instead, go to your Cytoscape installation directory, and look for a file named Cytoscape.vmoptions. Enter the following lines at the top. Substitute the dummy host and port (8080) values with the appropriate values for your proxy.


This method works for Cytoscape internally as well as for plug-ins and libraries, so you can just ignore the internal proxy configuration dialog. I’ve tested this for Cytoscape 2.8.2 and 2.8.3, and it’s also relevant for Cytoscape 3.0. People from the Cytoscape mailing list inform me that this will be changed in the upcoming Cytoscape 3.1.

I recommend putting the options at the top, because Cytoscape.vmoptions has a maximum of 9 options. Any more are quietly ignored.
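
As a rough sketch of the kind of lines involved, the Python snippet below generates JVM options for the standard Java proxy system properties (the same property names that appear in the debugging snippet further down). The host proxy.example.com is a placeholder, the helper names are made up, and I’m assuming one option per line in the vmoptions file.

```python
def proxy_vmoptions(host, port):
    """Return JVM options that set the standard Java HTTP and HTTPS proxy properties."""
    return [
        f"-Dhttp.proxyHost={host}",
        f"-Dhttp.proxyPort={port}",
        f"-Dhttps.proxyHost={host}",
        f"-Dhttps.proxyPort={port}",
    ]


def prepend_options(vmoptions_text, options):
    """Put the new options at the top of the existing file content."""
    return "\n".join(options) + "\n" + vmoptions_text


# proxy.example.com is a placeholder; use the host and port of your own proxy
print("\n".join(proxy_vmoptions("proxy.example.com", 8080)))
```

Putting the new options first, as prepend_options does, keeps them inside the limit on the number of options mentioned above.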

In case you want to delete some to make space, I’ll explain the meaning of the default Cytoscape.vmoptions. The first three options increase the memory available to Cytoscape, and are potentially useful to keep if you deal with large networks:


The next two deal with anti-aliasing for font rendering. That’s ancient stuff, I can’t remember the last time I saw a Java application without anti-aliased fonts. I think you can remove them safely, and in the worst case you’ll just get some ugly text.


Finally, a note for Java developers: if you are trying to debug proxy issues, use the following snippet of code just before you make a web request. Sometimes the values of system properties are not what you think they are – with this you can confirm them.

// print out proxy settings for debugging purposes
for (String key : new String[] { "proxySet", "http.proxyHost",
        "http.proxyPort", "https.proxyHost", "https.proxyPort" })
    System.out.printf ("%30s: %40s\n", key, System.getProperty(key));

More about URIs for BioPAX

Monday, December 3rd, 2012

In a previous post, I explained that a BioPAX document is really an RDF graph. And with that in mind, you can do interesting things like inferring URIs using a SPARQL CONSTRUCT query.

What I didn’t explain is that, after adding those new inferences, the result is no longer valid BioPAX. RDF gives you lots of freedom, as well as lots of rope to hang yourself with. BioPAX has some restrictions in place that are necessary for exchange of pathway data.

Let me explain in more detail. Take a look at the BioPAX snippet below. This snippet represents more or less the same information as the first figure from my previous post. It represents Protein186961, with a bp:xref property pointing to id4, which is a UnificationXref with bp:db property FlyBase and bp:id property FBgn0034356.

 <bp:ProteinReference rdf:about="Protein186961">
  <bp:xref rdf:resource="id4" />
 </bp:ProteinReference>

 <bp:UnificationXref rdf:about="id4">
  <bp:id rdf:datatype="xsd:string">FBgn0034356</bp:id>
  <bp:db rdf:datatype="xsd:string">FlyBase</bp:db>
 </bp:UnificationXref>

After the SPARQL CONSTRUCT query, the newly inferred URIs are added back to the graph. The result looks more or less like this:

<bp:ProteinReference rdf:about="Protein186961">
 <bp:xref rdf:resource="id4" />
 <bp:xref rdf:resource=""/>
</bp:ProteinReference>

As you can see, Protein186961 now has two bp:xref properties. This kind of duplication may cause problems for software. Furthermore, the new bp:xref property doesn’t have the correct type (UnificationXref), and it doesn’t have values for bp:db and bp:id, because our CONSTRUCT query didn’t say anything about them. Yet well-behaving pathway software might quite reasonably be looking for that information.

Running inferences on an RDF store gives you lots of power, but it’s not necessarily good for standardization. If you are running a large pathway database, you might want to enforce some restrictions. The online BioPAX validator created by Igor Rodchenkov et al. is the gold standard for producing correct, manageable BioPAX. Running it on the second snippet leads to this error:

But what if you want to have URIs, but you also want to keep your BioPAX valid? It’s easy – the UnificationXref in the first snippet used id4 as resource identifier. id4 is just an arbitrary value – we can easily replace that with something better. But instead of running a CONSTRUCT query, it’s a matter of modifying your BioPAX generating code to write out URIs where possible. The result could look like the snippet below. Admittedly, the result has a bit of redundancy, with the two references to FBgn0034356. But that is a small price to pay. The new version has goodness ready for SPARQL integration magic, yet it’s still standard compliant so that mundane software can cope with it too.

 <bp:ProteinReference rdf:about="Protein186961">
  <bp:xref rdf:resource="" />
 </bp:ProteinReference>

 <bp:UnificationXref rdf:about="">
  <bp:id rdf:datatype="xsd:string">FBgn0034356</bp:id>
  <bp:db rdf:datatype="xsd:string">FlyBase</bp:db>
 </bp:UnificationXref>

Inferring URIs for BioPAX

Friday, November 16th, 2012

Here is a useful data-integration trick involving BioPAX and

BioPAX is a pathway exchange format – it is known for being somewhat complicated, but at the very basic level it’s simple: BioPAX is made up of subject-predicate-object triples. Together these triples form a graph. Thus, a BioPAX document is nothing more than a large graph. Here is a small fragment to illustrate:

Here you see a particular BiochemicalReaction, which is catalysed by a particular Protein [1]. Both the BiochemicalReaction and the Protein have a number acting as local identifiers – they are quite useless outside this BioPAX document. To identify this particular protein in the wild, we must look at its Xref, which refers to a database (FlyBase), and an identifier (FBgn0034356). [2]

You have to imagine that this graph is much larger than just the snippet shown above, and contains lots of interesting information. And we can make it even more interesting by fetching information from external databases about this protein, and integrate that into this graph.

The trouble is that the Xref is stored in two nodes: one for the identifier and one for the database. This makes data integration cumbersome, requiring comparison of two nodes at the same time. It would be more efficient to merge this data into a single node.

One possible solution is to simply concatenate the database and identifier and put that into a new node. For example, here is just one way we could do that:


But we can do even better: if we combine the two nodes into a single URI (Uniform Resource Identifier) from, we gain the added advantage of having a resolvable URI. That means that the identifier is also a link which you can open in a browser, which is just incredibly neat.

(Go ahead and open it:

We can create these URIs directly in the triple store using a SPARQL CONSTRUCT query. SPARQL is a query language for graphs – it looks for patterns in the graph, and in the case of CONSTRUCT queries, new triples are generated which can be added back into the graph. The following query generates URIs for UniProt Xrefs. Unfortunately this query only works on the Virtuoso triple store, because of the whole “bif:sprintf…” incantation, which is non-standard SPARQL. Presumably equivalent functions exist for other triple stores.

CONSTRUCT {
    ?x BP:xref `bif:sprintf_iri ("", ?id)`
}
WHERE {
   ?x BP:xref ?blank .
   ?blank BP:id ?id .
   ?blank BP:db "UniProt"^^xsd:string
}
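
If SPARQL is new to you, it may help to see the same pattern matching spelled out in plain Python. This is only a toy model of what the query does, not how a triple store works internally. The triples, the accession P12345 and the URI template http://example.org/uniprot/%s are placeholders invented for this sketch.

```python
def construct_uniprot_xrefs(triples, uri_template):
    """Mimic the CONSTRUCT query on a list of (subject, predicate, object) triples."""
    # Index the id and db value of every xref node (assumes one of each per node)
    ids = {s: o for s, p, o in triples if p == "id"}
    dbs = {s: o for s, p, o in triples if p == "db"}
    new_triples = []
    for s, p, o in triples:
        # Pattern: ?x xref ?blank . ?blank id ?id . ?blank db "UniProt"
        if p == "xref" and dbs.get(o) == "UniProt" and o in ids:
            new_triples.append((s, "xref", uri_template % ids[o]))
    return new_triples


triples = [
    ("Protein1", "xref", "xref1"),
    ("xref1", "id", "P12345"),
    ("xref1", "db", "UniProt"),
    ("Protein2", "xref", "xref2"),
    ("xref2", "id", "FBgn0034356"),
    ("xref2", "db", "FlyBase"),
]
print(construct_uniprot_xrefs(triples, "http://example.org/uniprot/%s"))
```

Only Protein1 gets a new xref triple, because the pattern asks for the UniProt database specifically. The FlyBase xref of Protein2 is left alone.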

If you try that, you will get a set of new triples, which looks like this when viewed in the browser:



If you want, you can try it for yourself on our live triple store with preloaded BioPAX data. Here is our live SPARQL endpoint. If you scroll down on that page you see a few more SPARQL queries to try. To learn more, please see my presentation from the SPIN-OSS conference.


  • [1] In standard BioPAX, there is a Catalysis object between a Protein and a BiochemicalReaction. The controlledBy relation must be inferred.
  • [2] Ignore for the moment that we’re using a gene identifier for a protein

Ports, tunnels, request types and virtual hosts

Tuesday, August 28th, 2012

The internet is surely the most incredible machine on earth. For one thing, I use it to share code with other developers, using a program called subversion. But the other day, subversion was being blocked by a firewall. Fixing that problem was a great opportunity to get my hands dirty with the nuts and bolts of the internet, and I learned a lot too, which I’d like to share here.

First let me explain about ports, because it will be important later. An internet connection always involves two programs: one is the client, running on the local machine, and the other is the server, running on the remote machine. For example, the client could be Firefox on the wife’s laptop, and the server could be Apache serving images of kittens.

Now imagine that the remote machine had both a web server and an email server installed. To distinguish the traffic for each program they are assigned a port number. The web server is listening on port 80, which is the conventional port for web traffic. The email server is listening on port 25, and both happily co-operate on the same machine [1].
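
To see ports in action, here is a small Python sketch that runs two toy “servers” on the same machine, each on its own port, and then connects to both as a client. The greetings are invented, and instead of ports 80 and 25 it lets the operating system pick two free ports, so that it runs without special privileges.

```python
import socket
import threading


def start_server(greeting):
    """Listen on a free localhost port and answer each connection with `greeting`."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen()

    def serve():
        while True:
            try:
                conn, _ = server.accept()
            except OSError:
                break  # the listening socket was closed
            with conn:
                conn.sendall(greeting)

    threading.Thread(target=serve, daemon=True).start()
    return server


def ask(port):
    """Connect to localhost on the given port and read until the server hangs up."""
    chunks = []
    with socket.create_connection(("127.0.0.1", port)) as client:
        while True:
            data = client.recv(1024)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


web = start_server(b"I am the web server")
mail = start_server(b"I am the mail server")
web_port = web.getsockname()[1]
mail_port = mail.getsockname()[1]

# Same machine, different port, different program answering
print(web_port, ask(web_port))
print(mail_port, ask(mail_port))
```

The client never names the program it wants to talk to. The port number alone decides which of the two servers answers.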

The client and server must speak the same language, or protocol, to communicate. There is a whole alphabet soup of protocols such as HTTP, FTP, SMTP… Not surprisingly most of them end with the letter P. The most common one is HTTP, being the protocol used for web browsing. This protocol dictates that the browser should start by sending a request. This can be one of several request types, e.g. GET to request the latest kitty pictures, and POST to upload new ones.

Firewalls are designed to let through the ordinary, and block the unusual. Since HTTP is so common, firewalls normally let it go through unharmed. Subversion also uses HTTP, but still it was being blocked [3]. This is because subversion uses rather weird HTTP request types, such as PROPFIND [4]. This is legal according to the protocol, but it’s unusual. Firewalls find that suspicious. It’s not because subversion is trying to be funny. Honestly, I think that blocking PROPFIND is just the default setting on popular firewall software, and the sysadmins don’t bother to change the defaults. After all, Subversion is only used by developers, who make up just a fraction of the population, and they are geeks anyway, so nothing to worry about.

So what to do? Well luckily, I had an account on this particular server for a program called SSH, and with that I set up a tunnel to bypass the firewall. Here is how I did that:

First, I instructed subversion to send its requests to localhost, instead of the subversion server, and to use port 7654 instead of 80 [2]. So instead of doing a subversion checkout from, I was doing it from http://localhost:7654/bridgedb/trunk.

What is localhost? Localhost corresponds to IP address 127.0.0.1, which is a special address that sends messages right back to where they came from. Every computer, no matter how simple, can act as a server, as long as it has suitable software listening on a port. What would be the use of that? The messages are already at localhost, so what is the point in sending them there? As mentioned above, internet communication is always between two programs. They communicate even if they are written in very different programming languages, as long as they follow the right protocol. Connecting over localhost is sometimes the easiest way to get two very different pieces of software to talk to each other.

So I instructed SSH to set up a tunnel. What this means is that SSH is listening to port 7654, where it was receiving all messages from subversion. SSH does not interpret these messages; it just encrypts them, and forwards them over the internet. The unusual PROPFIND requests are now obscured by encryption. The messages arrive at the remote server on port 22, where another copy of SSH decrypts the messages and passes them on again. They continue the journey to localhost (from the server’s point of view), on port 80, where the subversion messages were expected to arrive in the first place. The beauty of this is that in spite of all the redirection, both the subversion client and server are oblivious to what is going on; they just send and receive messages as usual.

To make this trick work on windows, you can configure Putty, the windows variant of SSH:

On linux, it’s a simple matter of typing

ssh -L 7654:localhost:80

Except that in my case… it still wasn’t working.

The problem is that this particular server is actually hosting two websites: and This server was configured with a technique called virtual hosting, which is useful when you want to host several small websites. Putting each on a separate computer would be very inefficient. With virtual hosting, you can bundle multiple sites on a single server.

The web server listening on port 80 looks at the incoming requests to decide which of the virtual websites is going to handle the request. Normally, a subversion request for the page /bridgedb/trunk on the server looks like this:


But because of the way I set things up earlier, subversion thinks that it is talking to localhost. Even though the messages are forwarded to the server correctly due to SSH, when they arrive, the requests still look something like:

PROPFIND http://localhost:7654/bridgedb/trunk

Which doesn’t help the web server to decide if this request should be served by or
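
Here is a minimal Python sketch of how a web server can pick a virtual host. In HTTP/1.1 the host name usually travels in a Host header, and that is what this sketch looks at. The site names site-a.example and site-b.example and the directory paths are placeholders, not the real names of the two websites.

```python
def pick_site(request_text, sites, default=None):
    """Choose a virtual host by looking at the Host header of an HTTP request."""
    for line in request_text.split("\r\n")[1:]:  # skip the request line
        if not line:
            break  # a blank line ends the headers
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip().split(":")[0]  # drop an optional :port
            return sites.get(host.lower(), default)
    return default


# Placeholder names for the two websites hosted on one server
sites = {"site-a.example": "/var/www/site-a", "site-b.example": "/var/www/site-b"}

direct = "PROPFIND /bridgedb/trunk HTTP/1.1\r\nHost: site-a.example\r\n\r\n"
tunneled = "PROPFIND /bridgedb/trunk HTTP/1.1\r\nHost: localhost:7654\r\n\r\n"

print(pick_site(direct, sites))    # finds the directory for site-a.example
print(pick_site(tunneled, sites))  # no site is called localhost, so None
```

The tunneled request arrives at the right machine, but it carries the wrong name, so the server has nothing to match it against.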

So what to do? Next, I tricked my local computer into thinking that and localhost are the same, by adding the following line to the hosts file, which is in C:\Windows\System32\drivers\etc on windows, (you need to open notepad with sysadmin rights in order to be able to edit the file) or in /etc/hosts on linux.

This tells the operating system, that when you make a request for, it should really be sent to Which coincidentally is the IP address for localhost. This means that I can configure subversion to send to, even though is really localhost due to the hosts file, except that localhost really is due to the SSH tunnel.
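
The lookup that the operating system performs can be imitated in a few lines of Python. This is a simplified sketch of how a hosts file is read, and svn.example.org is a made-up name standing in for the real server. Keeping the first entry seen for a name is an assumption of this sketch.

```python
def parse_hosts(text):
    """Map host names to IP addresses from hosts-file style text."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, ip)  # keep the first entry seen for a name
    return mapping


hosts = """
127.0.0.1   localhost
127.0.0.1   svn.example.org   # hypothetical name of the subversion server
"""
mapping = parse_hosts(hosts)
print(mapping["svn.example.org"])  # prints 127.0.0.1
```

With such an entry in place, any program on the machine that asks for that name ends up talking to localhost, and therefore to the SSH tunnel.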

And finally it works!

  • [1] These port numbers are just conventions, and we could configure each piece of software to use a different port if we wanted. WikiPedia has a long list of conventional port numbers
  • [2] Why port 7654? For no reason other than that it was free on my machine. (In fact I could have used port 80, which is normally free, unless you’re running a web server on your computer, which I do, but that is a different story)
  • [3] The blocking could also be done by a proxy instead of a firewall, but that doesn’t matter for this discussion
  • [4] I have had problems with PROPFIND before, see also my question on stackoverflow to diagnose the problem.

So I have an SBGN-ML file, what’s next?

    Thursday, March 22nd, 2012

    The Systems Biology Graphical Notation (SBGN) is a system for drawing pathways in a very precise and standardized way. But the problem is that the software support is spotty at best. The LibSBGN project is here to help improve that situation (For a bit of history, see here and here). As part of this project, we created a file format for SBGN files, named SBGN Markup Language or SBGN-ML.

    Let me break that alphabet soup down for you:

    1. SBGN: the graphics
    2. LibSBGN: the software
    3. SBGN-ML: the file format

    Suppose you manage to procure a SBGN-ML file. You may then reasonably ask what you can do with it. Until fairly recently, the only answer that we could give to non-programmers was “not much”. That is quickly changing however. I’ll present three things you can do with an SBGN-ML file right now.

    1. Open it in PathVisio

    Using the following webstart link, you can open PathVisio with the SBGN-plug-in pre-installed. (For more information about the state of this plug-in, see the help page.) Then go to File->Import… and select SBGN-ML from the file type drop-down.

    2. Convert it into an image from the command line

    If you want to convert a bunch of SBGN-ML files to images, it’s easier to do it from the command-line. For this purpose I created a little script. First download the sbgn-to-png tarball. Unzip it, and run it from the command line using “sh “.

    3. Open it in SBGN-ED

    The SBGN-ED tool is an alternative to PathVisio for editing pathway diagrams. SBGN-ED has won the annual SBGN competition in the category “Best software support” twice in a row.

    When comparing PathVisio and SBGN-ED, the latter is probably a bit better when it comes to editing Process Description diagrams, whereas PathVisio deals better with Entity Relationship diagrams. The only caveat is that at the time of writing, SBGN-ED only supports an older version of SBGN-ML. For this reason, files generated by SBGN-ED can be read by PathVisio, but not the other way around. An update should arrive very soon though.

    Notes from Vizbi: automation in Cytoscape

    Monday, March 5th, 2012

    Cytoscape is a popular network visualisation and analysis tool. It’s great because it’s so easy to create plug-ins. Today I was fortunate enough to be attending the Cytoscape developer workshop at Vizbi 2012, where I learned a few new things.

    Firstly, one of my goals was to find out about the current state of Cytoscape development. Cytoscape is a great tool as long as you don’t look too closely at what’s going on inside. The upcoming third version promises to fix all the minor and major problems that exist under the hood. But Cytoscape 3 has been in the making for a long time. As a plug-in developer, you have to choose between something that works right now, but will go away eventually, or something that is clearly the future, but might take a long time to materialise.

    The feeling I got from the workshop is that there is light at the end of the Cytoscape 3 tunnel. For a plug-in developer with a deadline, it’s probably best to stick with the current version for now. But if you’re not under pressure to release, it’s definitely possible to write for Cytoscape 3 and make use of a nicer and more pleasant working environment.

    Besides that news, I learned some cool new tricks. Using Cytoscape Commands you can write simple macros for repetitive tasks. For example, to generate the network below, first you have to import a SIF (Simple Interaction Format) file, then import a file with node attributes, then apply a layout, and then apply a visual style. If you have to do this a couple of times it gets quite tedious. But here is how all that can be automated:

    Take the following SIF data, and save it using a text editor as network.sif

    Martijn is_involved_with    LibSBGN
    Chaouiya    is_involved_with    SBML-qual
    Martijn is_involved_with    SBML-qual
    Martijn is_involved_with    BioPreDyn
    Emanuel is_involved_with    LibSBGN
    Emanuel is_funded_by    Erasmus
    Martijn is_funded_by    FP7
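
The SIF format is simple enough to read with a few lines of code. The Python sketch below is my own illustration, not part of Cytoscape. It assumes that names contain no spaces, and it accepts more than one target per line.

```python
def parse_sif(text):
    """Turn SIF text into a list of (source, relation, target) edges."""
    edges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank lines and lone nodes carry no edge
        source, relation, *targets = fields
        for target in targets:
            edges.append((source, relation, target))
    return edges


sif = """
Martijn is_involved_with    LibSBGN
Chaouiya    is_involved_with    SBML-qual
Martijn is_involved_with    SBML-qual
Martijn is_involved_with    BioPreDyn
Emanuel is_involved_with    LibSBGN
Emanuel is_funded_by    Erasmus
Martijn is_funded_by    FP7
"""
for edge in parse_sif(sif):
    print(edge)
```

Each line becomes one edge, with the middle column as the edge type.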

    Here are the node attributes; save them as node_types.txt


    For the visual style, I created one in Cytoscape and saved it as style.props, using Export->Vizmap property file. And here is the magic bit: If you save the above three files in your work directory, then you can generate that picture with the script below.

    network import file=network.sif
    layout force-directed
    node import attributes file=node_types.txt
    vizmap import file=style.props

    Run it from within Cytoscape with Plugins->Command Tool->Run script…, or from the command line with

    ./ -S scriptfile

    Logic modeling with CellNOptR in Cytoscape

    Monday, February 27th, 2012

    A few months ago, I started work as a post-doc at the Systems Biomedicine group of the EBI. Our group makes heavy use of logical modelling as a means to understand how pathways work. For me, the most interesting thing about logical modelling is that it shows a very dynamic picture of how a pathway changes over time. By comparison, the pictures that you get from WikiPathways are very static.

    We have our own logical modelling software called CellNetOptimizer (a.k.a. CellNOptR). One of my current projects is to make the CellNOptR software more interoperable with popular tools such as Cytoscape. To this end, Emanuel Gonçalves, a master’s student in our group, has implemented a plug-in that makes CellNOptR available from Cytoscape. Work on the plug-in is progressing nicely. Below you see the video that he made to show off some of the features of this new plug-in.

    In the video, you see how you can:

    • Open a network
    • Start the CellNOptR wizard
    • Import and view experimental data
    • Train the network against the data
    • View the optimized network in Cytoscape

    Pathway Visualization to the next level

    Friday, February 25th, 2011

    The laboratory of bioinformatics of Wageningen University has put together some really cool hardware. In the picture below you see their tiled display, consisting of 12 high-resolution monitors, powered by a single workstation.

    PathVisio on a tiled display

    This setup gives you a lot of resolution to play with. We managed to display all major metabolic pathways from WikiPathways simultaneously, at full resolution, and map microarray data as well. When you’re standing right next to the screens, it feels like the data is all around you. That really encourages you to explore, and make connections across the pathways. That’s just much harder to do on a single screen.