Archive for the ‘Uncategorized’ Category

How to develop Modular Software

Saturday, March 6th, 2010

It’s always good to make software modular. Modular software is strong and healthy; monolithic software is sickly and bedridden. I’ve touched before on how modularity increases adaptability. But modularity also helps to keep software small, nimble and unbloated. I’ll illustrate how we’re applying modular design in BridgeDb.

Modularity is the only known antidote against bloatware. The more features a piece of software has, the larger it has to be. When you don’t use 90% of those features, it’s perceived as a problem. Bloated software takes a long time to start, fills up your hard drive, clogs your tubes. We want bioinformatics developers to use BridgeDb as much as possible, and we don’t want them to complain that BridgeDb is bloated.

For example, BridgeDb supports identifier mapping through several different web services. Some of those web services are based on SOAP, others on XML-RPC or on REST. Each type of web service needs its own additional libraries. If BridgeDb were one monolithic chunk, you’d always need several megabytes of library dependencies, even for the services you never call.

You may say: “A few megabytes, so what?” When I was at MediaMarkt the other day, I couldn’t even find memory sticks smaller than 2 GB anymore. But size still matters when you expect fast download times. For example, WikiPathways uses BridgeDb on each pathway page. Bigger libraries mean longer load times, which means annoyed users.

We want many features, but we don’t want bloat. The solution is to cut BridgeDb up into many small pieces, where you can choose the ones you need, and ignore the rest. You also don’t need the dependencies of the parts you ignore.

So how do you decide which pieces of BridgeDb you need? I’ve compiled this handy graph. On the right side, you see all the different “features” (i.e. identifier mapping services) that you can choose. Follow the arrows to the left, and note the modules that you encounter. Those are the modules you need for that mapping service.
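Reading the graph amounts to collecting everything reachable from the feature you picked. Here is a minimal sketch of that lookup in Java. The dependency table is made up for illustration: only org.bridgedb and org.bridgedb.bio are real BridgeDb module names, while the feature names, the "example." modules and the arrows between them are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ModulePicker {
    // Hypothetical dependency table: each key needs every module in its list.
    static final Map<String, List<String>> DEPENDS_ON = new LinkedHashMap<>();
    static {
        DEPENDS_ON.put("feature:soap-mapping", List.of("example.webservice.soap"));
        DEPENDS_ON.put("feature:local-file-mapping", List.of("org.bridgedb.bio"));
        DEPENDS_ON.put("example.webservice.soap",
                List.of("org.bridgedb.bio", "example.soap.library"));
        DEPENDS_ON.put("org.bridgedb.bio", List.of("org.bridgedb"));
    }

    // Follow the arrows from a feature and collect every module on the way.
    static Set<String> modulesFor(String feature) {
        Set<String> needed = new TreeSet<>();
        Deque<String> todo =
                new ArrayDeque<>(DEPENDS_ON.getOrDefault(feature, List.<String>of()));
        while (!todo.isEmpty()) {
            String module = todo.pop();
            if (needed.add(module)) {
                // First visit: its own dependencies are needed as well.
                todo.addAll(DEPENDS_ON.getOrDefault(module, List.<String>of()));
            }
        }
        return needed;
    }

    public static void main(String[] args) {
        System.out.println(modulesFor("feature:local-file-mapping"));
        System.out.println(modulesFor("feature:soap-mapping"));
    }
}
```

In this made-up table, the local-file feature pulls in two modules and no SOAP library at all, which is the whole point of cutting things up.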

If you’re getting started with modular software development, I can give you a few tips. You really don’t need any of those terribly complicated frameworks like Maven or OSGi. All you need is a good IDE like Eclipse and a bit of determination.

You have to be careful to manage the boundaries between modules. Eclipse can help you a great deal with this. Put each module in its own directory. In your Eclipse workspace, set up a separate project for each module, and add the projects it depends on to its build path. This way you can never introduce cyclic dependencies or go across module boundaries. If code reaches for a class in a module that is not on the build path, Eclipse will simply refuse to find the class and flag it as a compiler error.
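If you want the same guarantee outside the IDE, the rule "no cyclic dependencies" is easy to check yourself. The sketch below is not how Eclipse does it; it is just a plain depth-first search over a module table. The second table in main is a deliberately broken, hypothetical configuration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CycleCheck {
    // True if following dependencies from some module leads back to itself.
    static boolean hasCycle(Map<String, List<String>> dependsOn) {
        Set<String> finished = new HashSet<>();
        for (String module : dependsOn.keySet()) {
            if (visit(module, dependsOn, new HashSet<>(), finished)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String module, Map<String, List<String>> dependsOn,
                                 Set<String> onPath, Set<String> finished) {
        if (onPath.contains(module)) {
            return true;  // we walked back onto our own path
        }
        if (finished.contains(module)) {
            return false; // already proven cycle-free
        }
        onPath.add(module);
        for (String dep : dependsOn.getOrDefault(module, List.<String>of())) {
            if (visit(dep, dependsOn, onPath, finished)) {
                return true;
            }
        }
        onPath.remove(module);
        finished.add(module);
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> ok = Map.of(
                "org.bridgedb.bio", List.of("org.bridgedb"));
        // Hypothetical mistake: the core module depends back on org.bridgedb.bio.
        Map<String, List<String>> broken = Map.of(
                "org.bridgedb.bio", List.of("org.bridgedb"),
                "org.bridgedb", List.of("org.bridgedb.bio"));
        System.out.println(hasCycle(ok));
        System.out.println(hasCycle(broken));
    }
}
```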

For example, here is how I’ve set up BridgeDb in Eclipse. In the package explorer you see that I’ve defined a separate project for each module in BridgeDb.

And to complete the example, here is how I configured the build path for the org.bridgedb.bio module. As you can see, the org.bridgedb project is listed as its sole dependency.

BridgeDb paper published

Tuesday, January 12th, 2010

I’m very happy that our paper on BridgeDb was accepted by BMC Bioinformatics. It’s open access, so download it to your heart’s content. BridgeDb is all about identifier mapping, which I blogged about before (here, here and here).

BridgeDb lets you find cross-references for identifiers, but BridgeDb is not simply a cross-reference database. BridgeDb provides a standard method to access other cross-reference databases. And because of that level of standardization, you can easily decide to switch to a different source of cross-references.

Deepak Singh uses the term “middleware”, which is a good way to explain it, if that sort of word means anything to you.

But let me try to explain in a different way. BridgeDb is really a travel adapter. Suppose you’re in Japan and you’ve brought some gear like a laptop, cell phone and a Nintendo DS (just in case you get stuck in a blizzard while transferring at CDG). Much to your dismay you discover, after checking into your hotel, that none of your plugs fit in Japanese electrical sockets. So what do you do? Do you go down to Akihabara and spend a grand on a new laptop, phone and portable video game unit? Or do you buy a travel adapter for $1.95?

Just like there are many different power plugs around the world, there are many databases that do identifier mapping. And just like travel adapters let you plug in your laptop anywhere, no matter what country, BridgeDb lets you use your favorite bioinformatics tool, no matter what the source of identifier mappings is (Provided that the tool uses BridgeDb).

Power plugs around the world

It’s important to realize that BridgeDb is simply a conduit of information. It does not calculate cross-references from scratch, nor does it give any guarantees about the validity of those cross-references. You shouldn’t ask if BridgeDb provides better identifier mappings. That is like asking if a travel adapter provides better electricity. You still depend on the power company to give you a stable source of electricity. The travel plug just gives you flexibility to adapt to different circumstances.
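For the programmers among you, the travel adapter looks roughly like this in code. To be clear, this is a sketch of the idea and not BridgeDb’s actual API: the interface and class names (CrossRefSource, TableSource, RowSource, AnnotationTool) are invented for this example, and the glucose identifiers are borrowed from the BatchMapper post.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical "socket": the one interface a tool is written against.
interface CrossRefSource {
    Set<String> crossRefs(String id);
}

// First source: cross-references kept in an in-memory table.
class TableSource implements CrossRefSource {
    private final Map<String, Set<String>> table;

    TableSource(Map<String, Set<String>> table) {
        this.table = table;
    }

    @Override
    public Set<String> crossRefs(String id) {
        return table.getOrDefault(id, Set.<String>of());
    }
}

// Second source: the same information stored as two-column rows.
class RowSource implements CrossRefSource {
    private final List<String[]> rows;

    RowSource(List<String[]> rows) {
        this.rows = rows;
    }

    @Override
    public Set<String> crossRefs(String id) {
        Set<String> found = new TreeSet<>();
        for (String[] row : rows) {
            if (row[0].equals(id)) {
                found.add(row[1]);
            }
        }
        return found;
    }
}

// The tool never learns which source it is plugged into.
class AnnotationTool {
    private final CrossRefSource source;

    AnnotationTool(CrossRefSource source) {
        this.source = source;
    }

    String describe(String id) {
        return id + " -> " + new TreeSet<>(source.crossRefs(id));
    }
}

class TravelAdapterDemo {
    public static void main(String[] args) {
        CrossRefSource table = new TableSource(
                Map.of("HMDB00122", Set.of("C00031", "17634")));
        CrossRefSource rows = new RowSource(List.of(
                new String[] {"HMDB00122", "C00031"},
                new String[] {"HMDB00122", "17634"}));

        // Swapping the source changes neither the tool nor its output.
        System.out.println(new AnnotationTool(table).describe("HMDB00122"));
        System.out.println(new AnnotationTool(rows).describe("HMDB00122"));
    }
}
```

Note that the quality of the answer still depends entirely on what the source contains. The interface only makes the sources interchangeable.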

The relation between Garbage bags and Databases

Friday, January 1st, 2010

In my local supermarket, you can find two brands of garbage bags: There are the “A-brand” garbage bags, and there are the “house-brand” garbage bags. Both come in rolls of 20, each bag holds 60 liters, and both come with the same convenient closing strips. There is only one difference: each roll of A-brand garbage bags costs 40 eurocents more.

Garbage bag (*)

How could this situation exist? Why on earth would you pay anything more than the cheapest possible? They’re just garbage bags, for crying out loud! They’re probably even made in the same factory.

But for some reason, be it superior marketing, brand recognition, or some persistent belief that the more expensive bags really do hold garbage in a superior fashion, there are enough yuppies who dump the expensive brand in their trolleys without thinking twice.

But at the same time, my local supermarket would be foolish to abandon the cheap brand. There are plenty of cheapskate customers who do pay attention to price, and who are really not embarrassed to be seen with a no-brand garbage bag, every Tuesday when they put the trash out, in front of the whole onlooking neighborhood.

In marketing terms this is called segmentation: by catering to each market segment (cheapskates and yuppies), the supermarket can make more total profit than if it carried only the cheap or only the expensive brand. As always, Joel explains it best.

I’m sure this is also something the marketing geniuses of Oracle know. Oracle, well known producer of the enterprise level database product of the same name, is a marketing force to be reckoned with. You can’t just go to MediaMarkt and buy a box of Oracle. No, if you want Oracle, you have to call them. They’ll send a salesrep, who will drive to your office, show slick spreadsheets during expensive lunch while back at headquarters they calculate exactly how much you’re worth, and how much you can be squeezed for site-wide Oracle licenses.

In complete contrast, Sun is not a “marketing” company. Sun is a technology company. They’re the geeks behind the scenes, who have produced a long list of innovative server technology that you never heard about, but that nonetheless powers an important fraction of the internet infrastructure. The list goes on: Java, OpenOffice, OpenSolaris, ZFS, VirtualBox and, interestingly, also a database product named MySQL.

In the open source community, Sun is widely recognized as a company that really “gets” it. Indeed, all of the products just mentioned are open source in different degrees (and some, like Java, to a very high degree: GPL v2 with the Classpath exception).

It’s the suits versus the beards all over again. And they’re up in arms, because Oracle recently acquired Sun. What a shock: in one corner, Oracle, the most closed, most expensive, commercial database system imaginable, used by all the Fortune 500, and in the other corner MySQL, the cost-free, open-source upstart that powers small shops, blogs (including the one you’re reading) and WikiPathways, and now they’re both in the hands of a single company? Somebody check the temperature in hell!

A friend asked if I would sign a petition to stop Oracle’s acquisition of Sun, and thus also MySQL. I’m not in the habit of signing e-petitions, and I won’t sign this one either. First of all because I don’t think it will make a bit of difference, but also because I think it’s premature. This acquisition does not have to be the disaster that some make it out to be.

Just as there will remain plenty of brand-susceptible bioinformatics professors who keep claiming that Oracle is way better than MySQL, no matter what the application or how much we have to spend, there will also remain plenty of low-budget shops that can’t pay for Oracle licenses but will happily settle for MySQL’s smaller feature set.

It’s market segmentation all over again. I don’t see why Oracle won’t be able to keep MySQL open and still have a nice profitable business model. All Oracle has to do is to make the upgrade path from MySQL to Oracle a little bit easier. MySQL could be branded as entry-level Oracle, a gateway drug for newcomers in the enterprise database world.

Of course they could also easily fuck it up, but it’s not like there aren’t any competitors: there is PostgreSQL, mSQL, and of course there is always the possibility of forking MySQL itself (which many groups are doing right now). Because once the source is open, it stays that way forever.

No, I’m not worried at all. Happy New Year!

* Image licensed cc-by-sa-2.5 by M. Minderhoud.

PathVisio 2 released

Sunday, October 18th, 2009

This week we released version 2.0 of PathVisio. There has been over a year of active development since the last major release, and a ton of new features.

What is PathVisio? PathVisio is a tool for biological pathways. Stay organized! Use PathVisio as a simple notebook to collect all the various bits of information related to a biological research subject. Create images suitable for presentation or publication. Draw pathways, export them to many image formats, annotate them with links to online biological databases such as Ensembl or Entrez Gene, and add comments and literature references from PubMed.

With PathVisio you draw pathways just like you would in PowerPoint:
[Image: close-up of drawing a pathway]

What is new? New in PathVisio 2.0 is the ability to import experimental datasets and visualize them on top of pathways. Explore large datasets in a way that is more interesting and understandable than just a huge spreadsheet. Import microarray, proteomics or metabolomics data. Microarray reporters will be automatically linked to genes and protein identifiers in pathways. You can customize the visualization, using gradients, boolean color rules, or colored icons.

Here is an example of visualized microarray data:
[Image: close-up of a data visualization]

Perform over-representation analysis to find the pathway that was most affected by experimental conditions. This is great for hypothesis-generating experiment types, where you really don’t know anything in advance about your experiment.

Download pathway sets or share pathways on WikiPathways, a wiki where any researcher can contribute pathway knowledge. PathVisio is fully compatible.

Check our visual tour if you want to know more. Click here to Download PathVisio

For Developers: PathVisio has a plugin interface that lets users customize it to new analysis types, new visualization methods and new pathway formats. PathVisio is fully open source, and we’re always looking for Java developers who are interested in contributing, either to new plugins or to the core of the program. Contact us on our mailing list.

Fixing a Pathway the Groovy Way

Tuesday, August 25th, 2009

It’s no secret that there are many mistakes in WikiPathways. A coworker notified me of a problem with the Focal Adhesion pathway. It’s exactly the sort of problem that requires a lot of repetitive action to be fixed. So I thought, instead of doing this the boring way, I’d write a program to do it, and have some fun with it at the same time. This is the perfect opportunity to get some practice with the WikiPathways webservice.

To make it a little bit more interesting, I decided to do it with Groovy, a scripting language I’m learning. It’s very similar to Java but also has a lot of cool whizzbang features such as dynamic typing, closures and multi-line strings.

For reference, you can find the full script below. But I’ll explain bit by bit. First step is to get access to the WikiPathways webservice:

def wpclient;
wpclient = new WikiPathwaysClient();

After this bit of setup, wpclient can be used to interact with WikiPathways.
In this case we want to send the updated pathway back, so we’ll need to log in. Authentication is not required if you only need read access. However, if you want to write data, you have to have a valid account with webservice permissions, which is available upon request.

wpclient.login("username", "********");

I only needed to get one pathway, and I know which one. It has identifier WP306. The code to download it is:

def wspwy = wpclient.getPathway("WP306");
def pwy = wpclient.toPathway(wspwy);

wspwy is a WSPathway object, which is a wrapper for the pathway with some meta-data from WikiPathways (species and revision number). The actual Pathway itself can be obtained with the toPathway() method.

Now is the time to manipulate the pathway, and fix it up any way imaginable. I’ll describe that below, but first I’ll show how I uploaded the pathway again using the webservice:

println "Uploading... ";
wpclient.updatePathway (wspwy.getId(), pwy,
  "autoconversion of Entrez symbols to IDs",
  wspwy.getRevision().toInteger());

So now onto the actual pathway manipulation. I didn’t explain yet what kind of problem needed to be fixed. The problem here was that none of the genes had a proper identifier. As a result, none of the links worked, and the pathway could not be linked to experimental data.

However, there is some good news because each gene is labeled with the gene name. Gene names are nice as labels because they are readable and a bit more meaningful than identifiers, but on the other hand gene names tend to be ambiguous, so we really need the identifiers as well. But gene names are a good start. Using BridgeDb we can perform a free text search for the gene name, and come up with a matching identifier.

First we need a bit of setup code for BridgeDb. In this case we’ll make use of our Human gene database, which can be downloaded from http://www.bridgedb.org/data/gene_databases

Class.forName ("org.bridgedb.rdb.IDMapperRdb");
mapper = BridgeDb.connect (
  "idmapper-pgdb:/path/to/Hs_Derby_20090720.bridge");
BioDataSource.init();

The main logic to translate gene names to gene identifiers is contained in the function labelToXref. It takes the label containing the gene name as argument, and returns an identifier, or null if nothing was found.

def labelToXref (label) {
    // do a free search for all Xrefs that match our label
    for (ref in mapper.freeAttributeSearch(label, "Symbol", 100)) {
        // check only Xrefs that are in Entrez,
        // and that are an exact match with the label
        // free search will also return partial matches.
        if (ref.getDataSource() == BioDataSource.ENTREZ_GENE &&
            mapper.getAttributes (ref, "Symbol").contains(label)) {
            return ref;
        }
    }
    return null;
}

Now all we have to do is loop over each element in the pathway, get its label and use labelToXref to get the correct identifier.

And that’s it. It took me about an hour to write this. It would have taken less if I didn’t have to look up some bits about the Groovy syntax. As you can see from the pathway history, almost all genes were fixed.

Here is the entire script for reference purposes:

import org.pathvisio.model.Pathway
import org.bridgedb.bio.BioDataSource
import org.bridgedb.BridgeDb
import org.pathvisio.model.ObjectType
import org.pathvisio.view.Graphics
import org.pathvisio.wikipathways.WikiPathwaysClient

// Pathway WP306 (Focal Adhesion) uses Entrez symbols instead of Entrez ID's
// Thanks to Claus Mayer for reporting this problem.
public class EntrezSymbolToNumber {
    def mapper;
    def wpclient;
    // Look up entrez gene id for a given label
    // e.g. for INSR it will return L:3643
    def labelToXref (label) {
        // do a free search for all Xrefs that match our label
        for (ref in mapper.freeAttributeSearch(label, "Symbol", 100)) {
            // check only Xrefs that are in Entrez,
            // and that are an exact match with the label
            // free search will also return partial matches.
            if (ref.getDataSource() == BioDataSource.ENTREZ_GENE &&
                mapper.getAttributes (ref, "Symbol").contains(label)) {
                return ref;
            }
        }
        return null;
    }
   
    void init() {
        wpclient = new WikiPathwaysClient();
        wpclient.login("username", "********");
        Class.forName ("org.bridgedb.rdb.IDMapperRdb");
        mapper = BridgeDb.connect ("idmapper-pgdb:/path/to/Hs_Derby_20090720.bridge");
        BioDataSource.init();
    }
   
    void run() {
        def success = 0;
        def total = 0;
        // fetch pathway through the webservice
        def wspwy = wpclient.getPathway("WP306");
        def pwy = wpclient.toPathway(wspwy);

        // loop over all data nodes
        for (dn in pwy.getDataObjects()) {
            if (dn.getObjectType() == ObjectType.DATANODE) {
                total++;
                def label = dn.getTextLabel();
                def ref = dn.getXref();
                print (label + " " + ref + " -> ");
                if (ref.getDataSource() == BioDataSource.ENTREZ_GENE
                        && !mapper.xrefExists(ref)) {      
                    def result = labelToXref (label);
                    if (result != null) {
                        println "mapping to " + result;
                        dn.setGeneID (result.getId());
                        success++;
                    }
                    else
                        println "could not map";
                }
                else
                    println "OK";
            }
        }
        println success + " out of " + total + " converted";       
        println "Uploading... ";
        wpclient.updatePathway (wspwy.getId(), pwy, "autoconversion of Entrez symbols to IDs", wspwy.getRevision().toInteger());
    }
   
    static void main (args) {
        def runner = new EntrezSymbolToNumber();
        runner.init();
        runner.run();
    }  
}

By the way, I’m using the codecolorer plugin for the pretty code formatting. (thanks to rguha for the tip, and to Dmytro for being so responsive to my bug report)

The “Why” of the Identifier Mapping Problem

Tuesday, August 11th, 2009

I wrote before about my current work on identifier mapping. Briefly, each of the many different databases for genes and metabolites uses its own system of identifiers. This creates big headaches when you want to compare things from different databases. You’ll have to do some work to correlate them, which is what we call the identifier mapping problem.

Why does this problem exist in the first place? Wouldn’t it be really fantastic if everybody would always use the same identifiers everywhere? I don’t think that’s ever going to happen. There are practical reasons for that, but there are also fundamental problems that can never be solved.

Scientific databases are organized in a way that reflects the mindset of the scientists that created them. I noticed the same argument in an essay by Clay Shirky about the semantic web:

Because meta-data describes a worldview, incompatibility is an inevitable by-product of vigorous argument. It would be relatively easy, for example, to encode a description of genes in XML, but it would be impossible to get a universal standard for such a description, because biologists are still arguing about what a gene actually is. There are several competing standards for describing genetic information, and the semantic divergence is an artifact of a real conversation among biologists. You can’t get a standard til you have an agreement, and you can’t force an agreement to exist where none actually does.

Lactic Acid

Lactate

Here is another example from the context of bioinformatics: a chemist might create separate identifiers for lactate and lactic acid. To a chemist, these are two different things: lactate is missing a hydrogen atom, and it’s even negatively charged. But when dissolved in water these two rapidly convert into each other, making them practically indistinguishable. So a chemistry-oriented database such as ChEBI describes them separately (CHEBI:24996 and CHEBI:28358), whereas a biological database such as HMDB puts both in a single record (HMDB00190). World views have affected the way these databases are set up.
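The practical consequence is that a mapping between two databases is not always one-to-one. A small sketch, using only the three identifiers above (the class and method names are made up for illustration): going from ChEBI to HMDB and back, you can no longer tell lactate and lactic acid apart.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class LactateMapping {
    // The single HMDB record covers both ChEBI entries named in the text.
    static final Map<String, Set<String>> HMDB_TO_CHEBI =
            Map.of("HMDB00190", Set.of("CHEBI:24996", "CHEBI:28358"));

    static Set<String> hmdbToChebi(String hmdbId) {
        return new TreeSet<>(HMDB_TO_CHEBI.getOrDefault(hmdbId, Set.<String>of()));
    }

    // Invert the table: find every HMDB record that lists this ChEBI id.
    static Set<String> chebiToHmdb(String chebiId) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : HMDB_TO_CHEBI.entrySet()) {
            if (entry.getValue().contains(chebiId)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // ChEBI -> HMDB -> ChEBI: the round trip merges the two forms.
    static Set<String> roundTrip(String chebiId) {
        Set<String> result = new TreeSet<>();
        for (String hmdbId : chebiToHmdb(chebiId)) {
            result.addAll(hmdbToChebi(hmdbId));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("CHEBI:24996"));
    }
}
```

You start with one identifier and get two back. Neither database is wrong; they simply draw the line in a different place.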

By the way, the article quoted above is also an argument against the whole idea of the Semantic Web of Life Sciences (SWLS), but that’s subject matter for another post.

Mining biological pathways using WikiPathways web services

Thursday, August 6th, 2009

A website lets people interact with computers over the Internet. A web service on the other hand, lets computers interact with computers over the Internet. We’ve created a web service for WikiPathways so people can write computer scripts to do interesting new things with WikiPathways. This is all described in great detail in an article that was recently published in PLoS One.

Mining biological pathways using WikiPathways web services.
Kelder T, Pico AR, Hanspers K, van Iersel MP, Evelo C, Conklin BR.
PLoS One. 2009 Jul 30;4(7):e6447.

Naturally it’s open access, so you can read it all online. From the article:

The WikiPathways web service provides an interface for programmatic access to community-curated pathway information. […] The web service can be used by software developers to build or extend tools for analysis and integration of pathways, interaction networks and experimental data. The web services are also useful for assisting and monitoring the community-based curation process. By providing this web service, we hope to help researchers and developers build tools for pathway-based research and data analysis.

Automated access, plus the fact that all content is available under a Creative Commons license, should make WikiPathways even more useful as a scientific resource. It will be interesting to see what kind of uses people will come up with.

Martijn’s Continuous Build System part 2

Sunday, July 19th, 2009

In part 1, I described what a continuous build system is, and what it is useful for. Now I’m going to write about another important use of the build system: testing interfaces between modules.

In a modular system, the parts evolve independently in different directions and at different speeds. This is true in programming as well as in biology. Applications with a plug-in system (plug-ins, extensions, modules and drivers are really all the same thing) can add new features while avoiding bloat, can be customized to highly specific uses without burdening the user interface for everybody, etc. In the end, every interesting program will have a need for a plug-in system of some sort.

Shackled by a stable interface

In a plug-in system, you have to define an interface between the main program and the plug-ins. This interface is also called an API (Application Programming Interface). It is important that this interface is well defined and doesn’t change over time. If the API unilaterally changes, all the plug-ins will stop working. So naturally, most programs strive to keep the interfaces between the program and the plug-ins stable. This is what the Cytoscape people refer to when they are talking about the “Stable Plugin API”, a holy grail that they have unfortunately yet to achieve.

For PathVisio we use an unconstrained development model where the interface between the program and plug-ins can change at any time, as needed for the improvement of the program. How is this possible?

Linux pioneered that model: they call a stable API nonsense. The interface between drivers and the kernel changes all the time. If the Linux developers think of a better, more consistent or more efficient way to interface with the drivers they go ahead and make that change.

So how is this possible? How does Linux not degrade into a stinking heap of old drivers with interface mismatches that can’t communicate with the kernel properly? The answer is simple: because Linux is completely open source, any kernel developer can update all drivers at the same time as they change the API.

This model has two consequences

  1. Linux developers are free to improve the kernel in every way they can. They do not have to keep supporting an old crufty outdated API to keep old drivers working.
  2. Drivers for Linux have to be open source, or they run the risk of getting out of date really quickly.

The fact that their hands are not tied to a stable API gives the kernel developers enormous freedom to improve their work. Compare that to Windows. Living in a closed-source world, the Windows developers are stuck: they can never improve their kernel without breaking everything. Windows developers tried to break out of this choke hold with Vista. Vista came with a fresh new driver API, different from XP. The consequence, of course, was that several months after the release of Vista people were still complaining about broken drivers.

Of course the problems with Vista did get resolved in the end, but it took a lot of time and effort. The key difference is in who updates the drivers. In the Linux world, the person who changes the API is also the person who updates the drivers. This is only possible because the drivers are open source. The Windows developers have to notify all the driver developers about the API changes. This is a huge communication burden.

Back to bioinformatics. Unfortunately, Cytoscape can’t follow the Linux model because they want to support closed-source plugins. A number of core developers of Cytoscape live in the closed source world, and are not keen to release their plug-in source code. This means that Cytoscape has to continue on its quest for that elusive Stable plugin API.

PathVisio, on the other hand, has no such tie-in. Although the PathVisio license agreement certainly permits the development of closed-source plug-ins, we strongly discourage it. PathVisio does not seek a stable plugin API. Instead, what we have is the PathVisio promise:

If you make your plug-in open source, we will update your plug-in whenever there is an API change.

And we can fulfill this promise thanks to the continuous build system. It tests interfaces between modules. It tracks modules from multiple repositories, and runs fresh tests whenever a programmer checks in new code. Even better, all modules that interact with a changed module are tested as well, so we can check that the interface between them still works. At this moment we track dozens of modules from 10 different repositories. All this testing lets us follow a development model where any interface can be changed as needed. If something breaks, the build system will tell us and we will fix it immediately.
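To give an idea of the bookkeeping involved, here is a rough sketch of how a build system could decide what to retest after a check-in. This is not our actual build configuration. The module names are hypothetical, and I’m assuming that indirect users of a changed module get retested too.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class RetestPlanner {
    // dependsOn maps a module to the modules it uses directly.
    static Set<String> modulesToTest(Map<String, List<String>> dependsOn, String changed) {
        Set<String> toTest = new TreeSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.add(changed);
        while (!todo.isEmpty()) {
            String current = todo.pop();
            if (!toTest.add(current)) {
                continue; // already scheduled
            }
            // Anything that uses the current module may be broken by the change.
            for (Map.Entry<String, List<String>> entry : dependsOn.entrySet()) {
                if (entry.getValue().contains(current)) {
                    todo.add(entry.getKey());
                }
            }
        }
        return toTest;
    }

    public static void main(String[] args) {
        // Hypothetical modules: a core, two plug-ins and an unrelated one.
        Map<String, List<String>> dependsOn = Map.of(
                "example-core", List.<String>of(),
                "example-plugin-a", List.of("example-core"),
                "example-plugin-b", List.of("example-plugin-a"),
                "example-docs", List.<String>of());
        System.out.println(modulesToTest(dependsOn, "example-core"));
    }
}
```

A change to the core schedules both plug-ins for testing, while a change to the unrelated module schedules only that module.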

BatchMapper v0.1

Sunday, July 5th, 2009

I just released the first working version of a new tool called batchmapper. This tool lets you take a list of gene, protein or metabolite identifiers from one database and translate them to a different database.

[Image: 3D ball model of beta-D-glucose]
Why is this useful? I’ll explain for metabolites, although the story is really the same for genes and proteins. Metabolites are the chemical compounds that you find naturally in the human body. Of course a lot of research is being done on metabolites, and the collected wisdom is available in a number of online databases, such as KEGG in Japan, PubChem in the USA, ChEBI in the UK and HMDB in Canada.

The glut of online databases has led to a tower of Babel of metabolite identifiers. Glucose, one of the most important compounds in our body, may be known as HMDB00122 in Canada, C00031 in Japan, 5793 in the USA or 17634 in the UK.

batchmapper is a spin-off from recent work done by JJ and me. It’s a command line tool, so it’s not very user friendly, but it is fast, flexible and completely automatic. The translation tables can be provided in the form of text files, relational databases or webservices, or even a combination thereof. This early release is completely functional. Check out the tutorial, and leave some comments here on this blog.
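The core idea fits in a few lines of Java. The sketch below is not batchmapper’s code, and the two-column tab-separated format is an assumption made for this example. It only uses the glucose identifiers mentioned above, plus one made-up identifier that has no mapping.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BatchTranslate {
    // Each line holds "sourceId<TAB>targetId"; malformed lines are skipped.
    static Map<String, String> parseTable(List<String> lines) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            String[] columns = line.split("\t");
            if (columns.length == 2) {
                table.put(columns[0].trim(), columns[1].trim());
            }
        }
        return table;
    }

    // Translate ids in order; ids without a mapping yield an empty string.
    static List<String> translate(List<String> ids, Map<String, String> table) {
        List<String> translated = new ArrayList<>();
        for (String id : ids) {
            translated.add(table.getOrDefault(id, ""));
        }
        return translated;
    }

    public static void main(String[] args) {
        Map<String, String> hmdbToKegg = parseTable(List.of("HMDB00122\tC00031"));
        List<String> input = List.of("HMDB00122", "HMDB-unknown");
        for (String target : translate(input, hmdbToKegg)) {
            System.out.println(target.isEmpty() ? "(no mapping)" : target);
        }
    }
}
```

A real translation table can map one identifier to several targets, so a real tool needs a list per identifier instead of the single string used here.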

It would be nice if all the online metabolite databases worked together and merged into a single resource, but I don’t see that happening in the near future. At least batchmapper helps to make the problem a little more manageable.

Martijn’s Continuous Build System part 1

Thursday, June 25th, 2009

Joel said it best – Daily Builds are your friend.

A continuous build system is quality control for a program that’s being developed. It’s a computer that tests the state of the program every day. Or preferably every ten minutes. Under my desk I have a computer whose sole job it is to continuously monitor the state of PathVisio. Does it compile correctly? Are there any style problems in the source code? Are the automated tests positive? Fresh documentation is automatically generated and uploaded. The webstart version is refreshed. A fresh zip archive is created and placed in a convenient place. If there are any problems, an email is sent to our mailing list. There is an online report where you can find out everything about the current health of PathVisio.

The real effect of a preventive measure is always hard to tell. It’s like fire prevention measures. They are costly, yet most houses won’t need them because they’ll never burn to the ground. Setting up a good continuous build system is quite a bit of work. Is it just red tape, a lot of effort but YAGNI?

Here is an example of how it saved us. After I hooked up Helen’s summer-of-code project to the continuous build system, we soon got a ton of errors by email. It turned out that she had started using GroupLayout, a Java class that is only available in Java 6 and higher. However, unbeknownst to Helen, both Cytoscape and PathVisio aim to be compatible with Java 5, and I had configured the build system to check for Java 5 compatibility. So we found out about this problem right away, and Helen could fix it immediately, before going too far down a dead-end road. It would probably have taken a lot more work to fix this if we had found out later.

For PathVisio we’ve used a continuous build system since day one. But recently I’ve taken the time to make quite a few improvements. I plan to write about that in more detail in the coming days.

Edit: here is part 2