Proxy configuration for Cytoscape

June 11th, 2013

In large companies, you often find that direct web access is blocked: you have to ask a proxy server to request web pages on your behalf. (The proxy also does things like scanning for viruses and malware.) As a consequence, all the software on your computer needs to be configured to be proxy-aware. This is usually done for you, but bioinformaticians tend to use “non-standard” software that you’ll have to configure yourself.

If you are using Cytoscape 2.X or 3.0 behind a proxy, and you know your proxy settings, you may find the following useful.

Cytoscape has a “proxy server settings” dialog, as described in the manual. The problem is that it doesn’t work – it stores the proxy settings in a special way that only some bits of Cytoscape are aware of. It does not work for plug-ins (sorry, “apps”) that make use of off-the-shelf Java libraries.

Instead, go to your Cytoscape installation directory, and look for a file named Cytoscape.vmoptions. Enter the following lines at the top. Replace the dummy host (192.168.5.130) and port (8080) values with the appropriate values for your proxy.

-DproxySet=true
-Dhttp.proxyHost=192.168.5.130
-Dhttp.proxyPort=8080
-Dhttps.proxyHost=192.168.5.130
-Dhttps.proxyPort=8080

This method works for Cytoscape internally as well as for plug-ins and libraries, so you can just ignore the internal proxy configuration dialog. I’ve tested this for Cytoscape 2.8.2 and 2.8.3, and it’s also relevant for Cytoscape 3.0. People from the Cytoscape mailing list inform me that this will be changed in the upcoming Cytoscape 3.1.

I recommend putting the options at the top, because Cytoscape.vmoptions has a maximum of 9 options. Any more are quietly ignored.

In case you want to delete some to make space, I’ll explain the meaning of the default Cytoscape.vmoptions. The first three options increase the memory available to Cytoscape, and are potentially useful to keep if you deal with large networks:

-Xms10m
-Xmx768m
-Xss10m

The next two deal with anti-aliasing for font rendering. That’s ancient stuff: I can’t remember the last time I saw a Java application without anti-aliased fonts. I think you can remove them safely, and in the worst case you’ll just get some ugly text.

-Dswing.aatext=true
-Dawt.useSystemAAFontSettings=lcd

Finally, a note for Java developers: if you are trying to debug proxy issues, use the following snippet of code just before you make a web request. Sometimes the values of system properties are not what you think they are – with this you can confirm them.

// print out proxy settings for debugging purposes
for (String key : new String[] { "proxySet", "http.proxyHost",
        "http.proxyPort", "https.proxyHost", "https.proxyPort" })
{
    System.out.printf ("%30s: %40s\n", key, System.getProperty(key));
}
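
Printing the properties shows what was configured, but not which proxy Java will actually pick for a given URL. As a complementary check, here is a minimal sketch using the standard java.net.ProxySelector. The class name ProxyCheck and the example URL are placeholders of my own, and the host and port are the same dummy values as above.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

public class ProxyCheck {

    /** Describes the first proxy that the default ProxySelector offers for a URL. */
    public static String describeProxyFor(String url) {
        List<Proxy> proxies = ProxySelector.getDefault().select(URI.create(url));
        Proxy first = proxies.get(0);
        if (first.type() == Proxy.Type.DIRECT) {
            return "DIRECT";
        }
        // For HTTP and SOCKS proxies the address is an InetSocketAddress
        InetSocketAddress address = (InetSocketAddress) first.address();
        return first.type() + " " + address.getHostString() + ":" + address.getPort();
    }

    public static void main(String[] args) {
        // Stand-in for the -D options in Cytoscape.vmoptions (dummy values)
        System.setProperty("http.proxyHost", "192.168.5.130");
        System.setProperty("http.proxyPort", "8080");
        System.out.println(describeProxyFor("http://example.org/"));
    }
}
```

If this reports DIRECT while you expected a proxy, the properties were probably not set in the JVM that is making the request.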

More about identifiers.org URIs for BioPAX

December 3rd, 2012

In a previous post, I explained that a BioPAX document is really an RDF graph. And with that in mind, you can do interesting things like inferring identifiers.org URIs using a SPARQL CONSTRUCT query.

What I didn’t explain is that, after adding those new inferences, the result is no longer valid BioPAX. RDF gives you lots of freedom, as well as lots of rope to hang yourself with. BioPAX has some restrictions in place that are necessary for exchange of pathway data.

Let me explain in more detail. Take a look at the BioPAX snippet below. This snippet represents more or less the same information as the first figure from my previous post. It represents Protein186961, with a bp:xref property pointing to id4, which is a UnificationXref with bp:db property FlyBase and bp:id property FBgn0034356.

 <bp:ProteinReference rdf:about="Protein186961">
  <bp:xref rdf:resource="id4" />
 </bp:ProteinReference>

 <bp:UnificationXref rdf:about="id4">
  <bp:id rdf:datatype="xsd:string">FBgn0034356</bp:id>
  <bp:db rdf:datatype="xsd:string">FlyBase</bp:db>
 </bp:UnificationXref>

After the SPARQL CONSTRUCT query, the newly inferred URIs are added back to the graph. The result looks more or less like this:

<bp:ProteinReference rdf:about="Protein186961">
 <bp:xref rdf:resource="id4" />
 <bp:xref rdf:resource="http://identifiers.org/flybase/FBgn0034356"/>
</bp:ProteinReference>

As you can see, Protein186961 now has two bp:xref properties. This kind of duplication may cause problems for software. Furthermore, the target of the new bp:xref property doesn’t have the correct type (UnificationXref), and it doesn’t have values for bp:db and bp:id, because our CONSTRUCT query didn’t say anything about them. Yet well-behaved pathway software might quite reasonably be looking for that information.

Running inferences on an RDF store gives you lots of power, but it’s not necessarily good for standardization. If you are running a large pathway database, you might want to enforce some restrictions. The online BioPAX validator created by Igor Rodchenkov et al. is the gold standard for producing correct, manageable BioPAX. Running it on the second snippet leads to this error:

But what if you want to have identifiers.org URIs, but you also want to keep your BioPAX valid? It’s easy – the UnificationXref in the first snippet used id4 as resource identifier. The value id4 is just arbitrary – we can easily replace it with something better. But instead of running a CONSTRUCT query, it’s a matter of modifying your BioPAX-generating code to write out identifiers.org URIs where possible. The result could look like the snippet below. Admittedly, the result has a bit of redundancy, with the two references to FBgn0034356. But that is a small price to pay. The new version has identifiers.org goodness ready for SPARQL integration magic, yet it’s still standards-compliant so that mundane software can cope with it too.

 <bp:ProteinReference rdf:about="Protein186961">
  <bp:xref rdf:resource="http://identifiers.org/flybase/FBgn0034356" />
 </bp:ProteinReference>

 <bp:UnificationXref rdf:about="http://identifiers.org/flybase/FBgn0034356">
  <bp:id rdf:datatype="xsd:string">FBgn0034356</bp:id>
  <bp:db rdf:datatype="xsd:string">FlyBase</bp:db>
 </bp:UnificationXref>
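
To give an idea of what “modifying your BioPAX-generating code” could mean, here is a minimal Java sketch that writes the UnificationXref above. The class and method names are hypothetical, and the rule “namespace = lower-cased database name” is an assumption that happens to hold for FlyBase; real code would need a proper lookup of identifiers.org namespaces.

```java
import java.util.Locale;

public class XrefWriter {

    /**
     * Builds an identifiers.org URI. Using the lower-cased database name as
     * the namespace is an assumption made for this sketch.
     */
    public static String identifiersOrgUri(String db, String id) {
        return "http://identifiers.org/" + db.toLowerCase(Locale.ROOT) + "/" + id;
    }

    /** Writes a UnificationXref whose rdf:about is the identifiers.org URI. */
    public static String unificationXref(String db, String id) {
        String uri = identifiersOrgUri(db, id);
        // No XML escaping here: db and id are assumed to be plain identifiers
        return " <bp:UnificationXref rdf:about=\"" + uri + "\">\n"
             + "  <bp:id rdf:datatype=\"xsd:string\">" + id + "</bp:id>\n"
             + "  <bp:db rdf:datatype=\"xsd:string\">" + db + "</bp:db>\n"
             + " </bp:UnificationXref>\n";
    }

    public static void main(String[] args) {
        System.out.print(unificationXref("FlyBase", "FBgn0034356"));
    }
}
```

The same URI would then be used as the rdf:resource of the bp:xref on the ProteinReference, so that both ends of the reference agree.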

Inferring identifiers.org URIs for BioPAX

November 16th, 2012

Here is a useful data-integration trick involving BioPAX and identifiers.org.

BioPAX is a pathway exchange format – it is known for being somewhat complicated, but at the very basic level it’s simple: BioPAX is made up of subject-predicate-object triples. Together these triples form a graph. Thus, a BioPAX document is nothing more than a large graph. Here is a small fragment to illustrate:

Here you see a particular BiochemicalReaction, which is catalysed by a particular Protein [1]. Both the BiochemicalReaction and the Protein have numbers acting as local identifiers – they are quite useless outside this BioPAX document. To identify this particular protein in the wild, we must look at its Xref, which refers to a database (FlyBase), and an identifier (FBgn0034356). [2]

You have to imagine that this graph is much larger than just the snippet shown above, and contains lots of interesting information. And we can make it even more interesting by fetching information about this protein from external databases, and integrating that into this graph.

The trouble is that the Xref is stored in two nodes: one for the identifier and one for the database. This makes data integration cumbersome, requiring comparison of two nodes at the same time. It would be more efficient to merge this data into a single node.

One possible solution is to simply concatenate the database and identifier and put that into a new node. For example, here is just one way we could do that:

FlyBase~FBgn0034356

But we can do even better: if we combine the two nodes into a single URI (Uniform Resource Identifier) from identifiers.org, we gain the added advantage of having a resolvable URI. That means that the identifier is also a link which you can open in a browser, which is just incredibly neat.

http://identifiers.org/flybase/FBgn0034356

(Go ahead and open it: http://identifiers.org/flybase/FBgn0034356).

We can create these URIs directly in the triple store using a SPARQL CONSTRUCT query. SPARQL is a query language for graphs – it looks for patterns in the graph, and in the case of CONSTRUCT queries, new triples are generated which can be added back into the graph. The following query generates identifiers.org URIs for UniProt Xrefs. Unfortunately this query only works on the Virtuoso triple store, because of the whole “bif:sprintf…” incantation which is non-standard SPARQL. Presumably equivalent functions exist for other triple stores.

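PREFIX BP: <http://www.biopax.org/release/biopax-level3.owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT {
    ?x BP:xref `bif:sprintf_iri (
    "http://identifiers.org/uniprot/%s", ?id)`
}
WHERE {
   ?x BP:xref ?blank .
   ?blank BP:id ?id .
   ?blank BP:db "UniProt"^^xsd:string
}
LIMIT 10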

If you try that, you will get a set of new triples, which looks like this when viewed in the browser:

xsdh http://www.w3.org/2001/XMLSchema#
n2 http://biocyc.org/biopax/biopax-level3#
n4 http://identifiers.org/uniprot/
n3 http://www.biopax.org/release/biopax-level3.owl#
rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#

Subject Item
n2:Protein220382
n3:xref
n4:P23884

Subject Item
n2:Protein193864
n3:xref
n4:Q9W330
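
If the pattern matching feels like magic, it may help to see the same idea in plain Java. The sketch below is not how a triple store works internally; it is a toy model with hypothetical names (ConstructSketch, Triple, inferXrefs) and simplified predicate names, which looks for the same three-triple pattern and emits one new xref triple per match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConstructSketch {

    /** A subject-predicate-object triple, with every part kept as a plain string. */
    public static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return subject + " " + predicate + " " + object;
        }
    }

    /** A toy graph: two proteins, each with an Xref split over a db and an id. */
    public static List<Triple> sampleGraph() {
        return Arrays.asList(
            new Triple("Protein220382", "xref", "_:b1"),
            new Triple("_:b1", "id", "P23884"),
            new Triple("_:b1", "db", "UniProt"),
            new Triple("Protein186961", "xref", "_:b2"),
            new Triple("_:b2", "id", "FBgn0034356"),
            new Triple("_:b2", "db", "FlyBase"));
    }

    /** For every "?x xref ?blank . ?blank id ?id . ?blank db <db>", emit "?x xref <uri>". */
    public static List<Triple> inferXrefs(List<Triple> graph, String db, String namespace) {
        List<Triple> inferred = new ArrayList<>();
        for (Triple xref : graph) {
            if (!xref.predicate.equals("xref")) {
                continue;
            }
            String id = null;
            boolean dbMatches = false;
            // Inspect the properties of the node that the xref points to
            for (Triple t : graph) {
                if (!t.subject.equals(xref.object)) {
                    continue;
                }
                if (t.predicate.equals("id")) {
                    id = t.object;
                }
                if (t.predicate.equals("db") && t.object.equals(db)) {
                    dbMatches = true;
                }
            }
            if (id != null && dbMatches) {
                String uri = "http://identifiers.org/" + namespace + "/" + id;
                inferred.add(new Triple(xref.subject, "xref", uri));
            }
        }
        return inferred;
    }

    public static void main(String[] args) {
        for (Triple t : inferXrefs(sampleGraph(), "UniProt", "uniprot")) {
            System.out.println(t);
        }
    }
}
```

Only the UniProt Xref matches the pattern here; the FlyBase one is left alone, just as it would be by the WHERE clause of the query.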

If you want, you can try for yourself on our live triple store with preloaded BioPAX data. Here is our live SPARQL endpoint. If you scroll down on that page you see a few more SPARQL queries to try. To learn more, please see my presentation at the SPIN-OSS conference.


Footnotes:

  • [1] In standard BioPAX, there is a Catalysis object between a Protein and a BiochemicalReaction. The controlledBy relation must be inferred.
  • [2] Ignore for the moment that we’re using a gene identifier for a protein

Ports, tunnels, request types and virtual hosts

August 28th, 2012

The internet is surely the most incredible machine on earth. For one thing, I use it to share code with other developers, using a program called subversion. But the other day, subversion was being blocked by a firewall. Fixing that problem was a great opportunity to get my hands dirty with the nuts and bolts of the internet, and I learned a lot too, which I’d like to share here.

First let me explain about ports, because it will be important later. An internet connection always involves two programs: one is the client, running on the local machine, and the other is the server, running on the remote machine. For example, the client could be Firefox on the wife’s laptop, and the server could be Apache serving images of kittens.

Now imagine that the remote machine had both a web server and an email server installed. To distinguish the traffic for each program they are assigned a port number. The web server is listening on port 80, which is the conventional port for web traffic. The email server is listening on port 25, and both happily co-operate on the same machine [1].

The client and server must speak the same language, or protocol, to communicate. There is a whole alphabet soup of protocols such as HTTP, FTP, SMTP… Not surprisingly most of them end with the letter P. The most common one is HTTP, being the protocol used for web browsing. This protocol dictates that the browser should start by sending a request. This can be one of several request types, e.g. GET to request the latest kitty pictures, and POST to upload new ones.

Firewalls are designed to let through the ordinary, and block the unusual. Since HTTP is so common, firewalls normally let it go through unharmed. Subversion also uses HTTP, but still it was being blocked [3]. This is because subversion uses rather weird HTTP request types, such as PROPFIND [4]. This is legal according to the protocol, but it’s unusual. Firewalls find that suspicious. It’s not because subversion is trying to be funny. Honestly, I think that blocking PROPFIND is just the default setting on popular firewall software, and the sysadmins don’t bother to change the defaults. After all, Subversion is only used by developers, who make up just a fraction of the population, and they are geeks anyway, so nothing to worry about.

So what to do? Well luckily, I had an account on this particular server for a program called SSH, and with that I set up a tunnel to bypass the firewall. Here is how I did that:

First, I instructed subversion to send its requests to localhost, instead of the subversion server, and to use port 7654 instead of 80 [2]. So instead of doing a subversion checkout from http://svn.bigcat.unimaas.nl/bridgedb/trunk, I was doing it from http://localhost:7654/bridgedb/trunk.

What is localhost? Localhost corresponds to IP address 127.0.0.1, which is a special address that sends messages right back to where they came from. Every computer, no matter how simple, can act as a server, as long as it has suitable software listening on a port. What would be the use of that? The messages are already at localhost, so what is the point in sending them there? As mentioned above, internet communication is always between two programs. They communicate even if they are written in very different programming languages, as long as they follow the right protocol. Connecting over localhost is sometimes the easiest way to get two very different pieces of software to talk to each other.
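
To make ports and localhost a bit more concrete, here is a minimal Java sketch in which one program plays both roles: a tiny server listens on a port on localhost, and a client connects to that port and gets its message echoed back. The class name LoopbackDemo and the “echo: ” reply are made up for illustration; the port is picked by the operating system rather than fixed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class LoopbackDemo {

    private static BufferedReader reader(Socket s) throws IOException {
        return new BufferedReader(
            new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8));
    }

    private static PrintWriter writer(Socket s) throws IOException {
        // autoflush=true, so that println() sends the line immediately
        return new PrintWriter(
            new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8), true);
    }

    /** Sends one line to a server on localhost and returns the server's reply. */
    public static String roundTrip(String message) {
        InetAddress localhost = InetAddress.getLoopbackAddress();
        // Port 0 lets the operating system pick any free port
        try (ServerSocket server = new ServerSocket(0, 1, localhost)) {
            Thread serverThread = new Thread(() -> {
                // The server side: wait for one client, answer one line
                try (Socket connection = server.accept()) {
                    String line = reader(connection).readLine();
                    writer(connection).println("echo: " + line);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            serverThread.start();

            // The client side: connect to the port the server is listening on
            try (Socket client = new Socket(localhost, server.getLocalPort())) {
                writer(client).println(message);
                return reader(client).readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello"));
    }
}
```

The client only needs to know the address and the port; it neither knows nor cares what kind of program is listening there, which is exactly what the tunnel trick relies on.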

So I instructed SSH to set up a tunnel. What this means is that SSH is listening to port 7654, where it was receiving all messages from subversion. SSH does not interpret these messages, it just encrypts them, and forwards them over the internet. The unusual PROPFIND requests are now obscured by encryption. The messages arrive at the remote server on port 22, where another copy of SSH decrypts the messages and passes them on again. They continue the journey to localhost (from the server’s point of view), on port 80, where the subversion messages were expected to arrive in the first place. The beauty of this is that in spite of all the redirection, both the subversion client and server are oblivious to what is going on, they just send and receive messages as usual.

To make this trick work on Windows, you can configure PuTTY, the Windows variant of SSH:

On Linux, it’s a simple matter of typing

ssh -L 7654:localhost:80 username@svn.example.com

Except that in my case… it still wasn’t working.

The problem is that this particular server is actually hosting two websites: http://bridgedb.org and http://svn.bigcat.unimaas.nl. This server was configured with a technique called virtual hosting, which is useful when you want to host several small websites. Putting each on a separate computer would be very inefficient. With virtual hosting, you can bundle multiple sites on a single server.

The web server listening on port 80 looks at the incoming requests to decide which of the virtual websites is going to handle the request. Normally, a subversion request for the page /bridgedb/trunk on the server svn.bigcat.unimaas.nl looks like this:

PROPFIND http://svn.bigcat.unimaas.nl/bridgedb/trunk

But because of the way I set things up earlier, subversion thinks that it is talking to localhost. Even though the messages are forwarded to the server correctly due to SSH, when they arrive, the requests still look something like:

PROPFIND http://localhost:7654/bridgedb/trunk

Which doesn’t help the web server to decide if this request should be served by bridgedb.org or svn.bigcat.unimaas.nl
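
The request lines above are simplified. In HTTP/1.1 the site name normally travels in a separate Host header, and that header is what a web server with virtual hosting looks at. Here is a minimal Java sketch of that decision; the class name, the site labels and the fallback value are all hypothetical, and a real web server is of course far more elaborate.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class VirtualHostSketch {

    /** Returns the host name from the Host header, without port, or null if absent. */
    public static String hostOf(String rawRequest) {
        for (String line : rawRequest.split("\r\n")) {
            if (line.toLowerCase(Locale.ROOT).startsWith("host:")) {
                String value = line.substring("host:".length()).trim();
                int colon = value.indexOf(':');
                return colon < 0 ? value : value.substring(0, colon);
            }
        }
        return null;
    }

    /** The two websites hosted on the same server; the labels are made up. */
    public static Map<String, String> exampleSites() {
        Map<String, String> sites = new LinkedHashMap<>();
        sites.put("bridgedb.org", "bridgedb-website");
        sites.put("svn.bigcat.unimaas.nl", "subversion-repository");
        return sites;
    }

    /** Picks the site that should handle the request, or the fallback if none matches. */
    public static String selectSite(String rawRequest, Map<String, String> sites,
            String fallback) {
        String host = hostOf(rawRequest);
        if (host == null || !sites.containsKey(host)) {
            return fallback;
        }
        return sites.get(host);
    }

    public static void main(String[] args) {
        String direct = "PROPFIND /bridgedb/trunk HTTP/1.1\r\n"
            + "Host: svn.bigcat.unimaas.nl\r\n\r\n";
        String tunnelled = "PROPFIND /bridgedb/trunk HTTP/1.1\r\n"
            + "Host: localhost:7654\r\n\r\n";
        System.out.println(selectSite(direct, exampleSites(), "default-site"));
        System.out.println(selectSite(tunnelled, exampleSites(), "default-site"));
    }
}
```

The tunnelled request carries the name localhost, which matches neither site, so the server has to fall back on some default, and that default need not be the subversion repository.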

So what to do? Next, I tricked my local computer into thinking that svn.bigcat.unimaas.nl and localhost are the same, by adding the following line to the hosts file. On Windows this is C:\Windows\System32\drivers\etc\hosts (you need to open Notepad with administrator rights in order to be able to edit the file); on Linux it is /etc/hosts.

127.0.0.1   svn.bigcat.unimaas.nl

This tells the operating system that when you make a request for svn.bigcat.unimaas.nl, it should really be sent to 127.0.0.1, which coincidentally is the IP address for localhost. This means that I can configure subversion to send to svn.bigcat.unimaas.nl (still on port 7654, where the tunnel is listening), even though svn.bigcat.unimaas.nl is really localhost due to the hosts file, except that localhost really is svn.bigcat.unimaas.nl due to the SSH tunnel.
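
If you want to confirm that the hosts entry has taken effect, a small check along these lines can help. It is only a sketch: the class name HostsCheck is made up, and it simply asks the operating system's resolver (via the standard java.net.InetAddress) whether a name points back at the local machine.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class HostsCheck {

    /** True if the name (or literal IP address) resolves to a loopback address. */
    public static boolean resolvesToLoopback(String hostName) {
        try {
            return InetAddress.getByName(hostName).isLoopbackAddress();
        } catch (UnknownHostException e) {
            // The name could not be resolved at all
            return false;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "svn.bigcat.unimaas.nl";
        System.out.println(name + " -> loopback? " + resolvesToLoopback(name));
    }
}
```

Keep in mind that a running Java program may cache earlier lookups, so run the check in a fresh process after editing the hosts file.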

And finally it works!

  • [1] These port numbers are just conventions, and we could configure each piece of software to use a different port if we wanted. Wikipedia has a long list of conventional port numbers.
  • [2] Why port 7654? For no reason other than that it was free on my machine. (In fact I could have used port 80, which is normally free, unless you’re running a web server on your computer, which I do, but that is a different story)
  • [3] The blocking could also be done by a proxy instead of a firewall, but that doesn’t matter for this discussion
  • [4] I have had problems with PROPFIND before, see also my question on Stack Overflow to diagnose the problem.

    So I have an SBGN-ML file, what’s next?

    March 22nd, 2012

    The Systems Biology Graphical Notation (SBGN) is a system for drawing pathways in a very precise and standardized way. But the problem is that the software support is spotty at best. The LibSBGN project is here to help improve that situation (For a bit of history, see here and here). As part of this project, we created a file format for SBGN files, named SBGN Markup Language or SBGN-ML.

    Let me break that alphabet soup down for you:

    1. SBGN: the graphics
    2. LibSBGN: the software
    3. SBGN-ML: the file format

    Suppose you manage to procure an SBGN-ML file. You may then reasonably ask what you can do with it. Until fairly recently, the only answer that we could give to non-programmers was “not much”. That is quickly changing however. I’ll present three things you can do with an SBGN-ML file right now.

    1. Open it in PathVisio

    Using the following webstart link, you can open PathVisio with the SBGN plug-in pre-installed. (For more information about the state of this plug-in, see the help page.) Then go to File->Import… and select SBGN-ML from the file type drop-down.

    2. Convert it into an image from the command line

    If you want to convert a bunch of SBGN-ML files to images, it’s easier to do it from the command line. For this purpose I created a little script. First download the sbgn-to-png tarball. Unpack it, and run it from the command line using “sh sbgn-to-png.sh “.

    3. Open it in SBGN-ED

    The SBGN-ED tool is an alternative to PathVisio for editing pathway diagrams. SBGN-ED has won the annual SBGN competition in the category “Best software support” twice in a row.

    When comparing PathVisio and SBGN-ED, the latter is probably a bit better when it comes to editing Process Description diagrams, whereas PathVisio deals better with Entity Relationship diagrams. The only caveat is that at the time of writing, SBGN-ED only supports an older version of SBGN-ML. For this reason, files generated by SBGN-ED can be read by PathVisio, but not the other way around. An update should arrive very soon though.

    Notes from Vizbi: automation in Cytoscape

    March 5th, 2012

    Cytoscape is a popular network visualisation and analysis tool. It’s great because it’s so easy to create plug-ins. Today I was fortunate enough to be attending the Cytoscape developer workshop at Vizbi 2012, where I learned a few new things.

    Firstly, one of my goals was to find out about the current state of Cytoscape development. Cytoscape is a great tool as long as you don’t look too closely at what’s going on inside. The upcoming third version promises to fix all the minor and major problems that exist under the hood. But Cytoscape 3 has been in the making for a long time. As a plug-in developer, you have to choose between something that works right now, but will go away eventually, or something that is clearly the future, but might take a long time to materialise.

    The feeling I got from the workshop is that there is light at the end of the Cytoscape 3 tunnel. For a plug-in developer with a deadline, it’s probably best to stick with the current version for now. But if you’re not under pressure to release, it’s definitely possible to write for Cytoscape 3 and make use of a nicer and more pleasant working environment.

    Besides that news, I learned some cool new tricks. Using Cytoscape Commands you can write simple macros for repetitive tasks. For example, to generate the network below, first you have to import a SIF (Simple Interaction Format) file, then import a file with node attributes, then apply a layout, and then apply a visual style. If you have to do this a couple of times it gets quite tedious. But here is how all that can be automated:

    Take the following SIF data, and save it using a text editor as network.sif

    Martijn is_involved_with    LibSBGN
    Chaouiya    is_involved_with    SBML-qual
    Martijn is_involved_with    SBML-qual
    Martijn is_involved_with    BioPreDyn
    Emanuel is_involved_with    LibSBGN
    Emanuel is_funded_by    Erasmus
    Martijn is_funded_by    FP7

    Here are the node attributes; save them as node_types.txt

    type
    LibSBGN=Project
    BioPreDyn=Project
    Chaouiya=Collaborator
    SBML-qual=Project
    Martijn=Member
    Emanuel=Member
    FP7=Funding
    Erasmus=Funding

    For the visual style, I created one in Cytoscape and saved it as style.props, using Export->Vizmap property file. And here is the magic bit: If you save the above three files in your work directory, then you can generate that picture with the script below.

    network import file=network.sif
    layout force-directed
    node import attributes file=node_types.txt
    vizmap import file=style.props

    Run it from within Cytoscape with Plugins->Command Tool->Run script…, or from the command line with

    ./cytoscape.sh -S scriptfile
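
The SIF format used above is simple enough to read with a few lines of code, which is handy if you want to check a file before importing it. The Java sketch below is only an illustration with made-up names (SifSketch, parseLine, degrees); it assumes one source, one interaction type and one target per line, separated by whitespace, as in network.sif above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SifSketch {

    /** The contents of network.sif from this post, one edge per line. */
    public static final List<String> NETWORK = Arrays.asList(
        "Martijn\tis_involved_with\tLibSBGN",
        "Chaouiya\tis_involved_with\tSBML-qual",
        "Martijn\tis_involved_with\tSBML-qual",
        "Martijn\tis_involved_with\tBioPreDyn",
        "Emanuel\tis_involved_with\tLibSBGN",
        "Emanuel\tis_funded_by\tErasmus",
        "Martijn\tis_funded_by\tFP7");

    /** Splits one SIF line into source, interaction type and target. */
    public static String[] parseLine(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields: " + line);
        }
        return fields;
    }

    /** Counts, for every node, the number of edges it takes part in. */
    public static Map<String, Integer> degrees(List<String> lines) {
        Map<String, Integer> result = new TreeMap<>();
        for (String line : lines) {
            String[] edge = parseLine(line);
            result.merge(edge[0], 1, Integer::sum);
            result.merge(edge[2], 1, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> entry : degrees(NETWORK).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

A node that shows up here but is missing from node_types.txt would end up without a type in Cytoscape, so a quick count like this can catch typos in node names.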

    Logic modeling with CellNOptR in Cytoscape

    February 27th, 2012

    A few months ago, I started work as a post-doc at the Systems Biomedicine group of the EBI. Our group makes heavy use of logical modelling as a means to understand how pathways work. For me, the most interesting thing about logical modelling is that it shows a very dynamic picture of how a pathway changes over time. By comparison, the pictures that you get from WikiPathways are very static.

    We have our own logical modelling software called CellNetOptimizer (a.k.a. CellNOptR). One of my current projects is to make the CellNOptR software more interoperable with popular tools such as Cytoscape. To this end, Emanuel Gonçalves, a master’s student in our group, has implemented a plug-in that makes CellNOptR available from Cytoscape. Work on the plug-in is progressing nicely. Below you see the video that he made, to show off some of the features of this new plug-in.

    http://www.youtube.com/watch?v=L343vXClXb4

    In the video, you see how you can:

    • Open a network
    • Start the CellNOptR wizard
    • Import and view experimental data
    • Train the network against the data
    • View the optimized network in Cytoscape

    Pathway Visualization to the next level

    February 25th, 2011

    The laboratory of bioinformatics of Wageningen University has put together some really cool hardware. In the picture below you see their tiled display, consisting of 12 high-resolution monitors, powered by a single workstation.

    PathVisio on a tiled display

    This setup gives you a lot of resolution to play with. We managed to display all major metabolic pathways from WikiPathways simultaneously, at full resolution, and map microarray data as well. When you’re standing right next to the screens, it feels like the data is all around you. That really encourages you to explore, and make connections across the pathways. That’s just much harder to do on a single screen.

    First release of LibSBGN

    February 10th, 2011

    After months of work, last week we finally released the first version of LibSBGN.

    So what is LibSBGN? The Systems Biology Graphical Notation (SBGN) is a standard for drawing pathways. It prescribes exactly how to draw a biochemical reaction, how one can display the effect of heat on protein degradation, or how you should present the formation of a protein complex. It’s unambiguous: no matter how complex the drawing gets, it can be interpreted in only one way. SBGN is the result of many discussions, arguments and debates over the course of several years, and it’s therefore really well thought out.

    Good software support is essential to make SBGN succeed as a standard. LibSBGN was created in an attempt to encourage uptake. As the name implies, LibSBGN is a software library that should make it easy to incorporate SBGN in pathway tools.

    LibSBGN is only a software component; it’s not a ready-to-use end-product by itself. So this announcement is probably only interesting to bioinformatics developers. Nevertheless, I hope that it will soon lead to an increased uptake of SBGN in pathway tools, which should benefit end-users of those tools as well.

    LibSBGN is already supported by a few applications, including of course PathVisio. To make sure that it works exactly the same in each tool, we’ve created a comparison gallery, containing several test-cases rendered by each tool. All the diagrams should look exactly the same for each tool. This comparison page has proven tremendously useful to check for bugs and misunderstandings.

    This is only the first release, there is still a lot to do. This first release only supports a part of SBGN called process description (PD). The coming months will see lots of work on the remaining parts of SBGN, entity relationships (ER), and activity flow (AF). And after that we’ve planned more features, such as validation rules and file format conversion.

    This is the first tangible result from something that was set in motion at a meeting in Wittenberg. LibSBGN community: thanks for your hard work and congratulations on this first milestone.

    Spaghetti DNA

    January 30th, 2011

    This is in the category “parallels between life and computers”.

    DNA is said to contain the instructions to build an organism, just like software contains instructions for a computer. Poorly structured software is sometimes called “Spaghetti Code” because it’s such a tangled mess. What about the structure of DNA? Here is a nice quote from the Linux kernel mailing list (link):

    > Human communication methods are all buggy as hell :) 
    
    Not to mention that they are slow, inefficient and ambiguous.
    
    But wht did you expect? The original authors of the code are long gone and
    maintenance is done by newcomers who are patching the code bit by bit. What
    you get from such a development model is pretty predictable: ~1 billion years
    old spaghetti DNA that no-one truly understands.

    Evolution may be a “poor development model”, but at least DNA has seen billions of years of debugging :)