Archive for July, 2009

Martijn’s Continuous Build System part 2

Sunday, July 19th, 2009

In part 1, I described what a continuous build system is, and what it is useful for. Now I’m going to write about another important use of the build system: testing interfaces between modules.

In a modular system, the parts evolve independently in different directions and at different speeds. This is true in programming as well as in biology. Applications with a plug-in system (plug-ins, extensions, modules and drivers are really all the same thing) can add new features while avoiding bloat, can be customized to highly specific uses without burdening the user interface for everybody, etc. In the end, every interesting program will have a need for a plug-in system of some sort.

Shackled by a stable interface

In a plug-in system, you have to define an interface between the main program and the plug-ins. This interface is also called an API (Application Programming Interface). It is important that this interface is well defined and doesn’t change over time. If the API changes unilaterally, all the plug-ins will stop working. So naturally, most programs strive to keep the interfaces between the program and the plug-ins stable. This is what the Cytoscape people refer to when they talk about the “Stable Plugin API”, a holy grail that they have unfortunately yet to achieve.
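
To make the breakage concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the class names, the activate method, the settings argument); it is not the PathVisio or Cytoscape API. It only shows what happens when the host starts passing an extra argument that an old plug-in does not expect.

```python
class OldPlugin:
    """A plug-in written against version 1 of a hypothetical API."""

    def activate(self, host_name):
        return f"old plug-in running inside {host_name}"


class UpdatedPlugin:
    """The same plug-in after being updated for version 2 of the API."""

    def activate(self, host_name, settings):
        return f"updated plug-in running inside {host_name} with {settings}"


def load_plugin_v2(plugin):
    """Host code for API version 2: it now passes a settings argument."""
    try:
        return plugin.activate("host", {"verbose": True})
    except TypeError as error:
        # The plug-in still has the version 1 signature, so the call fails.
        return f"plug-in broken by API change: {error}"


if __name__ == "__main__":
    print(load_plugin_v2(OldPlugin()))
    print(load_plugin_v2(UpdatedPlugin()))
```

The old plug-in fails only at the moment the host calls it, which is exactly why an API change that nobody tests can go unnoticed until a user hits it.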

For PathVisio we use an unconstrained development model where the interface between the program and plug-ins can change at any time, as needed for the improvement of the program. How is this possible?

Linux pioneered that model: its developers call a stable API nonsense. The interface between drivers and the kernel changes all the time. If the Linux developers think of a better, more consistent or more efficient way to interface with the drivers, they go ahead and make that change.

So how is this possible? How does Linux not degrade into a stinking heap of old drivers with interface mismatches that can’t communicate with the kernel properly? The answer is simple: because Linux is completely open source, any kernel developer can update all drivers at the same time as they change the API.

This model has two consequences:

  1. Linux developers are free to improve the kernel in every way they can. They do not have to keep supporting an old crufty outdated API to keep old drivers working.
  2. Drivers for Linux have to be open source, or they run the risk of getting out of date really quickly.

The fact that their hands are not tied to a stable API gives the kernel developers enormous freedom to improve their work. Compare that to Windows. Living in a closed-source world, the Windows developers are stuck: they can never improve their kernel without breaking everything. Windows developers tried to break out of this choke hold with Vista. Vista came with a fresh new driver API, different from XP. The consequence, of course, was that several months after the release of Vista people were still complaining about broken drivers.

Of course the problems with Vista did get resolved in the end, but it took a lot of time and effort. The key difference is in who updates the drivers. In the Linux world, the person who changes the API is also the person who updates the drivers. This is only possible because the drivers are open source. The Windows developers have to notify all the driver developers about the API changes. This is a huge communication burden.

Back to bioinformatics. Unfortunately, Cytoscape can’t follow the Linux model because they want to support closed-source plug-ins. A number of core developers of Cytoscape live in the closed-source world, and are not keen to release their plug-in source code. This means that Cytoscape has to continue on its quest for that elusive Stable Plugin API.

PathVisio, on the other hand, has no such tie-in. Although the PathVisio license agreement certainly permits the development of closed-source plug-ins, we strongly discourage it. PathVisio does not seek a stable plugin API. Instead, what we have is the PathVisio promise:

If you make your plug-in open source, we will update your plug-in whenever there is an API change.

And we can fulfill this promise thanks to the continuous build system. It tests interfaces between modules. It tracks modules from multiple repositories, and runs fresh tests whenever a programmer checks in new code. Even better, all modules that interact with a changed module are tested as well, so we can check that the interface between them still works. At this moment we track dozens of modules from 10 different repositories. All this testing lets us follow a development model where any interface can be changed as needed. If something breaks, the build system will tell us and we will fix it immediately.
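
The idea of re-testing everything that touches a changed module can be sketched in a few lines of Python. This is not the code of our build system: the module names and the dependency table are made up for illustration, and following dependencies transitively (rather than only one step) is an assumption of the sketch.

```python
from collections import deque

# Hypothetical modules; each entry lists the modules it depends on.
DEPENDS_ON = {
    "core": [],
    "gui": ["core"],
    "plugin_a": ["core"],
    "plugin_b": ["gui"],
    "standalone_tool": [],
}


def modules_to_retest(changed, depends_on):
    """Return the changed module plus every module that uses it,
    directly or indirectly, in breadth-first order."""
    # Invert the table: for each module, who depends on it?
    used_by = {name: [] for name in depends_on}
    for module, dependencies in depends_on.items():
        for dependency in dependencies:
            used_by.setdefault(dependency, []).append(module)

    to_test = [changed]
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependant in used_by.get(current, []):
            if dependant not in seen:
                seen.add(dependant)
                to_test.append(dependant)
                queue.append(dependant)
    return to_test


if __name__ == "__main__":
    print(modules_to_retest("core", DEPENDS_ON))
    print(modules_to_retest("gui", DEPENDS_ON))
```

A check-in to the hypothetical core module triggers tests for gui and both plug-ins, while a change to standalone_tool triggers only its own tests.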

BatchMapper v0.1

Sunday, July 5th, 2009

I just released the first working version of a new tool called batchmapper. This tool lets you take a list of gene, protein or metabolite identifiers from one database and translate them to a different database.

[Image: 3D ball model of beta-D-glucose]
Why is this useful? I’ll explain for metabolites, although the story is really the same for genes and proteins. Metabolites are the chemical compounds that you find naturally in the human body. Of course a lot of research is being done on metabolites, and the collected wisdom is available in a number of online databases, such as KEGG in Japan, PubChem in the USA, ChEBI in the UK and HMDB in Canada.

The glut of online databases has led to a tower of Babel of metabolite identifiers. Glucose, one of the most important compounds in our body, may be known as HMDB00122 in Canada, C00031 in Japan, 5793 in the USA or 17634 in the UK.
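
To show what translating identifiers amounts to, here is a minimal Python sketch built on the glucose identifiers above. It is not batchmapper itself and says nothing about batchmapper's command line or file formats: the tab-separated layout, the column names and the function names are all assumptions made for the example.

```python
import csv
import io

# A one-row translation table using the glucose identifiers quoted above.
# The tab-separated layout and the column names are assumptions.
MAPPING_TEXT = (
    "HMDB\tKEGG\tPubChem\tChEBI\n"
    "HMDB00122\tC00031\t5793\t17634\n"
)


def load_table(text):
    """Parse a tab-separated translation table into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def translate(identifiers, table, source, target):
    """Map each identifier from the source column to the target column.
    Identifiers missing from the table are translated to None."""
    lookup = {row[source]: row[target] for row in table}
    return [lookup.get(identifier) for identifier in identifiers]


if __name__ == "__main__":
    table = load_table(MAPPING_TEXT)
    print(translate(["HMDB00122"], table, "HMDB", "KEGG"))
    print(translate(["C00031", "C99999"], table, "KEGG", "ChEBI"))
```

Unknown identifiers come back as None instead of raising an error, so a long list can be processed in one pass and the gaps inspected afterwards.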

batchmapper is a spin-off from recent work done by JJ and me. It’s a command line tool, so it’s not very user-friendly, but it is fast, flexible and completely automatic. The translation tables can be provided in the form of text files, relational databases or web services, or even a combination thereof. This early release is completely functional. Check out the tutorial, and leave some comments here on this blog.

It would be nice if all the online metabolite databases worked together and merged into a single resource, but I don’t see that happening in the near future. At least batchmapper helps to make the problem a little more manageable.