Posts Tagged ‘modularity’

The Downside of Modularity

Saturday, March 13th, 2010

I’m a big fan of modularity. I’ve even got a modular system in my living room. It consists of the following modules:

  • One module that converts a digital signal to a two-dimensional picture.
  • One module that reads a rotating plastic disk with a laser and produces a digital signal.
  • One module that gets a digital signal from a socket in the wall, stores it temporarily on magnetic disk, and sends it out again upon request.
  • One module that generates a digital signal based on a simulation of a virtual world, with which I can interact in real time using motion and pressure sensitive input devices.

In case you hadn’t guessed already, I was talking about my TV, DVD player, hard-disk recorder, and game console.

Imagine if all of this came in one device: a TV + DVD + HDR + console all-in-one. Imagine what it would cost. If even one part broke, I would have to replace the whole thing. I would never be able to move it abroad, because the HDR is tied to my cable provider. And I would never be able to get the games that do not involve Italian plumbers.

But to be fair, there are also disadvantages to modular systems. Just take a look at the remote control that comes with it.

How to develop Modular Software

Saturday, March 6th, 2010

It’s always good to make software modular. Modular software is strong and healthy; monolithic software is sickly and bedridden. I’ve touched before on how modularity increases adaptability, but modularity also helps to keep software small, nimble and unbloated. I’ll illustrate how we’re applying modular design in BridgeDb.

Modularity is the only known antidote to bloatware. The more features a piece of software has, the larger it has to be, and when you don’t use 90% of those features, all that extra weight is perceived as a problem. Bloated software takes a long time to start, fills up your hard drive, and clogs your tubes. We want bioinformatics developers to use BridgeDb as much as possible, and we don’t want them to complain that BridgeDb is bloated.

For example, BridgeDb supports identifier mapping through several different web services. Some of those web services are based on SOAP, others on XML-RPC or REST. Each type of web service requires additional libraries. If BridgeDb were one monolithic chunk, you’d always need several megabytes of library dependencies, no matter which service you actually use.

You may say: “A few megabytes, so what?” When I was at MediaMarkt the other day, I couldn’t even find memory sticks smaller than 2 GB anymore. But size still matters when you expect fast download times. For example, WikiPathways uses BridgeDb on each pathway page. Bigger libraries mean longer load times, which means annoyed users.

We want many features, but we don’t want bloat. The solution is to cut BridgeDb up into many small pieces, where you can choose the ones you need, and ignore the rest. You also don’t need the dependencies of the parts you ignore.
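A minimal sketch of this idea in Java: a tiny core module defines only the mapping interface, and each mapping service lives in its own module that implements it. The names below (IdMapper, UppercaseMapper) are illustrative, not BridgeDb’s actual API.

```java
// Sketch of the "small core + optional modules" pattern described above.
// Names are made up for illustration, not BridgeDb's real classes.

// --- core module: defines only the interface, no heavy dependencies ---
interface IdMapper {
    String map(String sourceId);
}

// --- optional module: one concrete mapping service ---
// In a real setup this would live in its own project/jar and pull in
// only the libraries (e.g. a REST client) that this service needs.
class UppercaseMapper implements IdMapper {
    public String map(String sourceId) {
        return sourceId.toUpperCase();
    }
}

public class Demo {
    public static void main(String[] args) {
        // A user picks only the module they need; the others (and their
        // dependencies) never have to be on the classpath at all.
        IdMapper mapper = new UppercaseMapper();
        System.out.println(mapper.map("ensg000001")); // prints "ENSG000001"
    }
}
```

Code that depends only on the core interface compiles without any of the service modules, which is exactly what keeps the dependency footprint small.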

So how do you decide which pieces of BridgeDb you need? I’ve compiled this handy graph. On the right side, you see all the different “features” (i.e. identifier mapping services) that you can choose. Follow the arrows to the left, and note the modules that you encounter. Those are the modules you need for that mapping service.

If you’re getting started with modular software development, I can give you a few tips. You really don’t need any of those terribly complicated frameworks like Maven or OSGi. All you need is a good IDE like Eclipse and a bit of determination.

You have to be careful to manage the boundaries between modules, and Eclipse can help you a great deal with this. Put each module in its own directory. In your Eclipse workspace, set up a separate project for each module, and add only the projects it depends on to each project’s build path. This way you can never introduce cyclic dependencies or reach across module boundaries: Eclipse will simply refuse to find the class and flag it as a compiler error.

For example, here is how I’ve set up BridgeDb in Eclipse. In the package explorer you can see that I’ve defined a separate project for each module in BridgeDb.

And to complete the example, here is how I configured the build path for one of the modules. As you can see, the org.bridgedb project is listed as its sole dependency.
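To make the layout concrete, a workspace following this scheme might look like the sketch below. Only org.bridgedb is named in the post; the other project names are made up for illustration.

```
workspace/
  org.bridgedb/                      # core module: interfaces only, no external dependencies
  org.bridgedb.rdb/                  # hypothetical: database-backed mapper, build path lists org.bridgedb
  org.bridgedb.webservice.example/   # hypothetical: one web-service mapper, build path lists org.bridgedb
```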

Martijn’s Continuous Build System part 2

Sunday, July 19th, 2009

In part 1, I described what a continuous build system is, and what it is useful for. Now I’m going to write about another important use of the build system: testing interfaces between modules.

In a modular system, the parts evolve independently in different directions and at different speeds. This is true in programming as well as in biology. Applications with a plug-in system (plug-ins, extensions, modules and drivers are really all the same thing) can add new features while avoiding bloat, and can be customized to highly specific uses without burdening the user interface for everybody. In the end, every interesting program will need a plug-in system of some sort.

Shackled by a stable interface


In a plug-in system, you have to define an interface between the main program and the plug-ins. This interface is also called an API (Application Programming Interface). It is important that this interface is well defined and doesn’t change over time: if the API changes unilaterally, all the plug-ins stop working. So naturally, most programs strive to keep the interface between the program and the plug-ins stable. This is what the Cytoscape people refer to when they talk about the “Stable Plugin API”, a holy grail that, unfortunately, they have yet to achieve.

For PathVisio we use an unconstrained development model where the interface between the program and plug-ins can change at any time, as needed for the improvement of the program. How is this possible?

Linux pioneered this model: its kernel developers famously call a stable driver API nonsense. The interface between drivers and the kernel changes all the time. If the Linux developers think of a better, more consistent or more efficient way to interface with the drivers, they go ahead and make that change.

How does Linux not degrade into a stinking heap of old drivers with interface mismatches that can’t communicate with the kernel properly? The answer is simple: because Linux is completely open source, any kernel developer can update all the drivers at the same time as they change the API.

This model has two consequences:

  1. Linux developers are free to improve the kernel in every way they can. They do not have to keep supporting an old crufty outdated API to keep old drivers working.
  2. Drivers for Linux have to be open source, or they run the risk of getting out of date really quickly.

The fact that their hands are not tied to a stable API gives the kernel developers enormous freedom to improve their work. Compare that to Windows. Living in a closed-source world, the Windows developers are stuck: they can never improve their kernel without breaking everything. They tried to break out of this chokehold with Vista, which came with a fresh new driver API, different from XP’s. The consequence, of course, was that several months after the release of Vista, people were still complaining about broken drivers.

Of course the problems with Vista did get resolved in the end, but it took a lot of time and effort. The key difference is in who updates the drivers. In the Linux world, the person who changes the API is also the person who updates the drivers. This is only possible because the drivers are open source. The Windows developers have to notify all the driver developers about the API changes. This is a huge communication burden.

Back to bioinformatics. Unfortunately, Cytoscape can’t follow the Linux model, because they want to support closed-source plug-ins. A number of Cytoscape’s core developers live in the closed-source world and are not keen to release their plug-in source code. This means that Cytoscape has to continue its quest for that elusive Stable Plugin API.

PathVisio, on the other hand, has no such tie-in. Although the PathVisio license agreement certainly permits the development of closed-source plug-ins, we strongly discourage it. PathVisio does not seek a stable plugin API. Instead, what we have is the PathVisio promise:

If you make your plug-in open source, we will update your plug-in whenever there is an API change.

And we can fulfill this promise thanks to the continuous build system. It tests the interfaces between modules: it tracks modules from multiple repositories and runs fresh tests whenever a programmer checks in new code. Even better, all modules that interact with a changed module are tested as well, so we can check that the interfaces between them still work. At this moment we track dozens of modules from 10 different repositories. All this testing lets us follow a development model where any interface can be changed as needed. If something breaks, the build system will tell us, and we will fix it immediately.
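The kind of cross-module test the build system runs can be sketched as follows. This is a toy illustration, not PathVisio’s actual API: a plug-in is exercised only through the host’s interface, so when that interface changes, the test breaks (at compile time or at run time), which is exactly the signal that tells us a plug-in needs updating.

```java
// Toy sketch of a cross-module interface test (hypothetical names,
// not PathVisio's real plug-in API).

// --- host module: the interface the host program exposes ---
interface Plugin {
    String getName();
    void init();
}

// --- plug-in module: an implementation living in another repository ---
class ExamplePlugin implements Plugin {
    private boolean ready = false;
    public String getName() { return "example"; }
    public void init() { ready = true; }
    public boolean isReady() { return ready; }
}

// --- the test the build system runs after every check-in ---
public class InterfaceTest {
    public static void main(String[] args) {
        ExamplePlugin p = new ExamplePlugin();
        Plugin asPlugin = p;  // compiles only while plug-in and host interface agree
        asPlugin.init();
        if (!p.isReady()) {
            throw new AssertionError("plugin failed to initialize through the host interface");
        }
        System.out.println("interface check passed: " + asPlugin.getName());
    }
}
```

If the host renames or changes a method in Plugin, this test stops compiling until someone updates ExamplePlugin, which is how the “we will update your plug-in” promise gets enforced in practice.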

About Modules and Superpathways

Sunday, May 10th, 2009

This post is a bit of background on the projects of two of our summer of code students: JJ and Helen.

The quintessence of modularity


For WikiPathways, people often ask for a way to compose pathways into a single large network. Currently, WikiPathways is a collection of small pathways with clear boundaries. If it were a large network, the advantage would be that you get a complete overview of the cell without arbitrary limits. On the other hand, small pathways can still be understood, edited and curated manually by a single person. Manual interaction is something that we value very much at WikiPathways, and we constantly battle the tendency to invent automatic tools for everything. Another advantage of small pathways is that we can do over-representation analysis (such as GSEA) to rank pathways depending on experimental data – of course it is not possible to make a ranking if you have only a single large network.

Jianjiong Gao is a summer student for the second year already. Last year he tried to solve this problem with the NetworkMerge plugin. It was a great result, but not yet very good at merging networks from different sources, which are usually annotated with different types of identifiers. To solve that, JJ is working this year on improving ID mapping. If you want to know more, check out JJ’s blog.

Xuemin (Helen) Liu will work on the SuperPathways project. This is similar to NetworkMerge, but focused on showing the relations between pathways. It will create a new view of a set of pathways that shows the interaction and crosstalk between them. This will let the users of WikiPathways get an overview of the whole pathway collection while still keeping the pathways themselves manageably small.

NetworkMerge and SuperPathways will no doubt be useful. But the question remains: do pathways really exist, or are they a human invention to make biology manageable? I think there is a case to be made that pathways do exist as biological entities, even though they have fuzzy boundaries. It helps to think of pathways as modules. There are two schools of thought on this issue.

Map of the phi-X174 genome


The molecular biology school thinks of pathways as separate modules that can be seen relatively independently. Although there is clearly overlap and cross-talk at times, you can still study a pathway, measure it, and talk about it as though it’s an independent entity.

On the other hand, the systems biology school considers a cell as a gigantic network of molecules all interacting with each other. There are so many influences that you can’t make any predictions about the state of a cell unless you measure all its components and know all its interactions.

The module school can be accused of oversimplifying the complexity of cells, creating artificial delineations that are not there in reality, simply because the full complexity is hard to comprehend. Evolution does not favor a clean solution; it lands on a good-enough solution, and we can’t expect cellular networks to be easy to understand.

On the other hand, there is clear evidence that evolution does encourage modularity. The reasoning is that biomolecules are organized in modules because modules can evolve independently, and thus make the organism as a whole more evolvable, i.e. more flexible to adapt over the course of generations.

Consider the phage phi-X174 (a phage is a virus that targets bacteria). It has 11 genes spread over no more than 5386 nucleotides. Its genome is so condensed that several genes overlap; in one section, three genes overlap. That means that if a single nucleotide in that region changes, three genes are affected. In the case of phi-X174, evolution selected a genome as short as possible, so that it can replicate extremely quickly. The downside is that phi-X174 can’t evolve anymore. It is an evolutionary dead end. To a phage, the genes are its modules, and overlapping genes destroy modularity.

There is a clear parallel in software development. Computer programmers usually strive for a modular design. Code is “smelly” when there are too many interactions across the boundaries of modules. You know you have a problem when you fix a bug and two new bugs appear in totally unrelated places. When this happens, it’s time to start paying off your technical debt, or face the risk of entering a spiral of doom.

In the bazaar of open source, there are always a dozen projects in any category competing for developer resources and public attention. Thus modular development is favored in the long run: a non-modular project can probably be developed more quickly at first, but this comes at the cost of the flexibility to add new features. So there is a simple natural-selection process going on, where projects have to become modular or face being out-competed. Modularity means evolvability, and ensures long-term survival.