Oracle versus Google

August 20th, 2010

A while back I wrote, rather optimistically, that I expected Oracle to play nice with the open source community:

“I don’t see why Oracle won’t be able to keep MySQL open and still have a nice profitable business model.”

However, all that seems rather less likely, now that Oracle is suing Google over patent infringement in the open source Android platform.

Here is a great point of view from a Java developer on this lawsuit. Not only does it explain what the suit is all about, it also gives some historical perspective on how Java got into this mess in the first place.

From that link:

“Android actually *was* a great platform that supported existing Java developers and libraries incredibly well (without actually being a Java environment), and for the first time there was a serious contender to “standard” Java that Sun had absolutely no control over.”

and:

“If for nothing else, Jonathan Schwartz will be remembered as the man who broke open the Sun piñata, simultaneously releasing more open-source software than any company in history and killing Sun in the process. Either Jonathan had no “step 2” or the inertia of a company built on closed-source products was too great to overcome. In either case, by spring of 2009 Sun was hemorrhaging”

It’s long but well worth a read if you are concerned about the future of Java.

Going to COMBINE 2010

July 31st, 2010

COMBINE 2010 is a meeting about all systems biology standards: SBML, SBGN, BioPAX, etc. It runs from October 6 to 9 in Edinburgh, just before the ICSB conference.

Registration just opened, and I’ve already signed up.

SBGN 5.5 Hackathon writeup

April 25th, 2010

The goal of Systems Biology Graphical Notation or SBGN is to define a set of standard shapes for things like reactions, enzymes, genes, metabolites, compartments etc. For example, here is the reaction that takes place in your body after an alcoholic beverage, using SBGN notation.

Alcohol dehydrogenase as SBGN diagram

The SBGN community organizes regular meetings or “hackathons”, where specs are discussed and new SBGN-related software is presented. This year I was present at the meeting held at Wittenberg in Germany (where you can still find the church door where Luther pinned his 95 theses). Here is my writeup, by no means complete, just a collection of impressions.

Vanted SBGN-ED
In the category SBGN-related software, Tobias Czauderna demonstrated SBGN-ED, a new plugin for Vanted that lets you create diagrams in SBGN format. Especially interesting is the nice validator that can tell you if you drew something that is not allowed by the SBGN specifications. A very complete SBGN editor and probably one of the nicest out there. No link yet, but there will be a publication very soon.

Vanted BridgeDb plugin
Unfortunately a few presentations had to be canceled the first day due to the volcanic ash that was plaguing the European skies. To fill the gaps in the agenda, we just started hacking on random stuff. One result of this was a Vanted BridgeDb plugin that I made together with Christian Kuklas. Christian immediately found a number of bugs and requested new features in BridgeDb. There is not yet an easy way to install the plugin, but if you’re interested you can play with the source code.

SBGN Exchange format
The second and third day, most participants managed to overcome the disruption of the air traffic and join the conference. Besides discussion on the planned next release of SBGN, there was another important topic: LibSBGN.

SBGN currently only exists as a specification of glyphs and symbols and what they mean; there is no computer file format. But there are now several software packages out there that deal with SBGN, and they need a standard exchange format to work together.

The existing standard file formats for pathways, SBML and BioPAX, do not store layout information, i.e. they do not store the position of the elements of a diagram. According to the SBGN spec, layout information does not carry any meaning. Biologically the diagram means the same thing, no matter if elements are arranged vertically, horizontally, in a circle, or in random order. So you would think there is no problem.

But as it turns out, it’s exactly the layout information that has to be exchanged between pathway software. A good pathway layout is really hard to produce automatically, so once you have painstakingly defined one you want to preserve it. So, weirdly enough, we want an SBGN file format to exchange information that is not part of the SBGN spec at all.

Because there is no standard, the current tools have all defined their own ad-hoc file formats out of necessity. The lack of a standard file format is really becoming an impediment to cooperation. There has been a lot of talk about what it should look like: XML? DTD? GML? An object model? But in my opinion it doesn’t matter too much in the end; you just have to make a decision and stick with it. We’ve been able to turn a lot of discussion into tangible results in the form of a new SourceForge project, where we uploaded some small samples and the beginnings of an XML Schema definition. Hopefully we can keep the momentum and get it mostly done by the next SBGN meeting.
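To make the layout point concrete, here is a minimal Java sketch of the kind of information an exchange format would have to carry. Everything in it is hypothetical (the Glyph class, its fields and its method names are mine, not part of any SBGN tool or of the file format under discussion). It only shows that two glyphs can mean the same thing according to the spec and still be different drawings.

```java
// Hypothetical data model, for illustration only.
class Glyph {
    final String glyphClass;  // the kind of symbol, e.g. "simple chemical"
    final String label;       // the text shown in the symbol
    final double x, y, width, height;  // layout: no meaning in the spec, but costly to redo

    Glyph(String glyphClass, String label,
          double x, double y, double width, double height) {
        this.glyphClass = glyphClass;
        this.label = label;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    // Same biological meaning: position and size are ignored.
    boolean sameMeaning(Glyph other) {
        return glyphClass.equals(other.glyphClass) && label.equals(other.label);
    }

    // Same drawing: meaning and layout both have to match.
    boolean sameDrawing(Glyph other) {
        return sameMeaning(other)
            && x == other.x && y == other.y
            && width == other.width && height == other.height;
    }

    public static void main(String[] args) {
        Glyph left = new Glyph("simple chemical", "ethanol", 10, 20, 60, 30);
        Glyph right = new Glyph("simple chemical", "ethanol", 200, 20, 60, 30);
        System.out.println(left.sameMeaning(right));  // true
        System.out.println(left.sameDrawing(right));  // false
    }
}
```

A file format that only preserved what sameMeaning compares would be correct according to the spec, and useless for moving a carefully arranged diagram from one tool to another.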

jSBML
Not related to SBGN directly, but I found it interesting nonetheless: a new native Java library for SBML, called jSBML, is being developed. There already was a library called libSBML, but it’s written in C++. Because libSBML was considered a “good-enough” solution for Java programs, the SBML community for a long time resisted the notion of a native Java library. But it turned out that many SBML-based Java projects were actually developing their own native library anyway. Ironically, the tendency to avoid duplication of work actually led to multiple incomplete projects that all duplicated each other. Sometimes it’s best to accept reality.

WikiPathways Curation Jamboree Evaluation

April 16th, 2010

WikiPathways content is growing nicely, but it’s not growing like one of those nice exponential curves that you see in the first slide of almost every bioinformatics presentation nowadays. We want exponential curves in our presentations too, dammit, so we want to get more people actively involved.

A big challenge for WikiPathways is to get people to take the first step, to get them over that initial hump and actually start participating. Certainly a lot of people are very interested in WikiPathways, but there is some hesitation to just start working on the content. It’s something we have to work on. Besides clearing technical hurdles, we try to gently help people, simply to get started.

As an experiment, we organized a dedicated curation jamboree, a focused effort to get together and crank through a list of curation tasks. We prepared documentation, contacted several mailing lists and harassed all our colleagues. We also put together a special chat channel where newcomers can get instantaneous contact and answers to quick questions. This event happened for two days in February.

So, was it a success? Yes, if you look at edit activity. Thomas made this graph of the number of pathways tagged with either “needs reference” (for pathways that don’t have any literature references) or “missing description” (for pathways that don’t have a nice description text). As you can see, the numbers dropped quickly during the two days of the curation event, by at least 25%. (Ignore the initial jump in the blue line; that’s due to a bug in the data collection script.) WikiPathways gained a lot of curated data in a short period of time.

Number of pathways with a curation tag over time

The most active contributors were the usual suspects: Thomas, Alex, Kristina and me, the core WikiPathways team. But you can see in the graph below that other people joined in as well. Even if they did only a few curation tasks, that’s good enough. The most important thing is to get people to take the first step. So the graph below is misleading: participating really is more important than winning.

Number of edits per user

That’s Great, Now Please Fix OpenOffice.org

April 8th, 2010

Did you hear about the OpenOffice.org mouse? It’s a mouse that has no fewer than 18 buttons. From the press release:

“With a revolutionary and patented design featuring 18 buttons, an analog joystick, and support for as many as 52 key commands, the OOMouse is intended to provide a faster and more efficient user interface for most complex software applications than the conventional icons, pull-down menus, and hotkeys presently permit.”

This is the same logic that brought us the 5-blade razor. Somebody please smack these people with a copy of The Design of Everyday Things. Why would anybody want such a complicated thing? Beginning users will not be interested in memorizing 18 unmarked buttons. And for advanced users, ordinary keyboard shortcuts are much more effective anyway, because they will prefer to keep their fingers near the home row (the row from “asdf” to “jkl;” – touch typists are trained to put their fingers there at all times). For maximum speed, you want to move your hand away from the keyboard as little as possible.

This is even more ironic when you consider that the OpenOffice.org suite, in spite of all its open-source karma, is still full of tiny annoyances. What OpenOffice.org needs now is a focus on user-friendliness, not a mouse that looks like the console of a nuclear power-plant.

Just a tiny example. I’ve used MS Word a fair bit and I’ve learned a lot of handy keyboard shortcuts. Just try this: if you want to change the style of a paragraph, just press Ctrl + Shift + S, and start typing the name of the style you want. The same trick works with Ctrl + Shift + F for font and Ctrl + Shift + P for font size (points). If you have to do this a lot, this shortcut is a great time saver. I haven’t been able to find any equivalent shortcut in OpenOffice.org. Is that the reason why we need an 18-button mouse now? Can’t we just fix the software itself?

I wonder why OpenOffice.org allowed their name to be used for such a weird project. I fear that this will only reinforce the stereotype that open source software is never user-friendly.

BridgeDb: now also with metabolite identifier mapping

March 25th, 2010

Ha, fooled ya! BridgeDb has been able to deal with metabolite identifiers since the beginning. But mapping genes is such a common problem that metabolites aren’t getting any attention. Nearly all the code examples that we have thus far are with genes.

Somebody on the mailing list asked for an example with metabolites. Well, here you go; you’ll see it’s really easy. This example takes the ChEBI identifier for methionine and looks up the corresponding PubChem identifier.

import org.bridgedb.BridgeDb;
import org.bridgedb.IDMapper;
import org.bridgedb.IDMapperException;
import org.bridgedb.Xref;
import org.bridgedb.bio.BioDataSource;

public class MetaboliteExample
{
    public static void main(String[] args)
        throws ClassNotFoundException, IDMapperException
    {
        // Using the BridgeRest webservice as mapping
        // service, it does compound mapping fairly well.
        // We select the human species, but it doesn't really
        // matter which species we pick.
        Class.forName("org.bridgedb.webservice.bridgerest.BridgeRest");
        IDMapper mapper = BridgeDb.connect(
            "idmapper-bridgerest:http://webservice.bridgedb.org/Human");

        // Start with defining the ChEBI identifier for
        // methionine, id 16811
        Xref src = new Xref("16811", BioDataSource.CHEBI);

        // the method returns a set, but actually there is only one result
        for (Xref dest : mapper.mapID(src, BioDataSource.PUBCHEM))
        {
            // this should print 6137,
            // the PubChem identifier for methionine.
            System.out.println(dest.getId());
        }
    }
}

Compile this example with org.bridgedb.jar, org.bridgedb.bio.jar and org.bridgedb.webservice.bridgerest.jar in the classpath, which can be downloaded from http://bridgedb.org/data/releases/

Google Summer of Code 2010

March 18th, 2010

Yay! It’s official, we’re going to be in the Google Summer of Code again this year. Our application as a mentoring organization was just accepted. Cytoscape, PathVisio, WikiPathways and even BridgeDb are all joined under the GenMAPP umbrella organisation. Unfortunately I don’t have time to mentor again, so I’ll be watching from the sidelines this year. But I do want to encourage students to apply.

Students from all nations, we want to hear from you! If you’re interested in developing open source bioinformatics software, please send us a proposal. Check our ideas page, write a proposal and send it to our mailing list. You have a chance to gain valuable development experience and earn a little money at the same time. The earlier you contact us, the better your chances.

The Downside of Modularity

March 13th, 2010

I’m a big fan of modularity. I’ve even got a modular system in my living room. It consists of the following modules:

  • One module that converts a digital signal to a two-dimensional picture.
  • One module that reads a rotating plastic disk with a laser and produces a digital signal.
  • One module that gets a digital signal from a socket in the wall, stores it temporarily on magnetic disk, and sends it out again upon request.
  • One module that generates a digital signal based on a simulation of a virtual world, with which I can interact in real time using motion and pressure sensitive input devices.

In case you hadn’t guessed already, I was talking about my TV, DVD player, hard-disk recorder, and game console.

Imagine if all of this came in one device. A TV+DVD+HDR+Console-in-one. Imagine what it would cost. If only one part broke, I would have to get everything anew. I would never be able to move it abroad, because the HDR is tied to my cable provider. I would never be able to get the games that do not involve Italian plumbers.

But to be fair, there are also disadvantages to modular systems. Just take a look at the remote control that comes with it:

How to develop Modular Software

March 6th, 2010

It’s always good to make software modular. Modular software is strong and healthy, monolithic software is sickly and bedridden. I’ve touched before on how modularity increases adaptability. But modularity also helps to keep software small, nimble and unbloated. I’ll illustrate how we’re applying modular design in BridgeDb.

Modularity is the only known antidote against bloatware. The more features a piece of software has, the larger it has to be. When you don’t use 90% of those features, it’s perceived as a problem. Bloated software takes a long time to start, fills up your hard drive, clogs your tubes. We want bioinformatics developers to use BridgeDb as much as possible, and we don’t want them to complain that BridgeDb is bloated.

For example, BridgeDb supports identifier mapping through several different web services. Some of those web services are based on SOAP, others on XML-RPC or on REST. For each type of web service, you need additional libraries. If BridgeDb were one monolithic chunk, you’d always need several megabytes of library dependencies.

You may say: “A few megabytes, so what?” When I was at MediaMarkt the other day, I couldn’t even find memory sticks smaller than 2 GB anymore. But size still matters when you expect fast download times. For example, WikiPathways uses BridgeDb on each pathway page. Bigger libraries mean longer load times, which means annoyed users.

We want many features, but we don’t want bloat. The solution is to cut BridgeDb up into many small pieces, where you can choose the ones you need, and ignore the rest. You also don’t need the dependencies of the parts you ignore.

So how do you decide which pieces of BridgeDb you need? I’ve compiled this handy graph. On the right side, you see all the different “features” (i.e. identifier mapping services) that you can choose. Follow the arrows to the left, and note the modules that you encounter. Those are the modules you need for that mapping service.
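Following the arrows by hand is really just a walk over the module dependencies, and that is easy to sketch in Java. The dependency table below is illustrative, not the real BridgeDb graph: I assume that the BridgeRest module depends on org.bridgedb.bio, which in turn depends on org.bridgedb. The class and method names (ModuleResolver, requiredModules) are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ModuleResolver {
    // Direct dependencies per module. Illustrative data, not the real graph.
    static final Map<String, List<String>> DEPENDS_ON = Map.of(
        "org.bridgedb", List.of(),
        "org.bridgedb.bio", List.of("org.bridgedb"),
        "org.bridgedb.webservice.bridgerest", List.of("org.bridgedb.bio"));

    // Returns the start module plus every module reachable by following the arrows.
    static Set<String> requiredModules(String start) {
        Set<String> needed = new TreeSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.push(start);
        while (!todo.isEmpty()) {
            String module = todo.pop();
            // add() returns false if we have seen this module before
            if (needed.add(module)) {
                for (String dep : DEPENDS_ON.getOrDefault(module, List.of())) {
                    todo.push(dep);
                }
            }
        }
        return needed;
    }

    public static void main(String[] args) {
        // prints [org.bridgedb, org.bridgedb.bio, org.bridgedb.webservice.bridgerest]
        System.out.println(requiredModules("org.bridgedb.webservice.bridgerest"));
    }
}
```

Anything that is not in the returned set can stay off your classpath, together with its own library dependencies.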

If you’re getting started with modular software development, I can give you a few tips. You really don’t need any of those terribly complicated frameworks like Maven or OSGi. All you need is a good IDE like Eclipse and a bit of determination.

You have to be careful to manage the boundaries between modules. Eclipse can help you a great deal with this. Put each module in its own directory. In your Eclipse workspace, set up a separate project for each module, and add dependent projects in each project build path. This way you can never introduce cyclic dependencies or go across module boundaries. Eclipse will simply refuse to find the class and flag it as a compiler error.
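Eclipse does this check for you, but it helps to see what “no cyclic dependencies” means in code. Below is a minimal sketch of a depth-first cycle check over a module graph. It is not what Eclipse actually runs, and the class name CycleCheck and the example module names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CycleCheck {
    // True if following dependencies from some module leads back to that module.
    static boolean hasCycle(Map<String, List<String>> dependsOn) {
        Set<String> done = new HashSet<>();    // modules that are fully explored
        Set<String> onPath = new HashSet<>();  // modules on the current path
        for (String module : dependsOn.keySet()) {
            if (visit(module, dependsOn, done, onPath)) {
                return true;
            }
        }
        return false;
    }

    private static boolean visit(String module, Map<String, List<String>> dependsOn,
                                 Set<String> done, Set<String> onPath) {
        if (onPath.contains(module)) return true;   // back on our own path: a cycle
        if (done.contains(module)) return false;    // explored before, no cycle there
        onPath.add(module);
        for (String dep : dependsOn.getOrDefault(module, List.of())) {
            if (visit(dep, dependsOn, done, onPath)) {
                return true;
            }
        }
        onPath.remove(module);
        done.add(module);
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> clean = Map.of(
            "bio", List.of("core"),
            "core", List.<String>of());
        Map<String, List<String>> tangled = Map.of(
            "a", List.of("b"),
            "b", List.of("a"));
        System.out.println(hasCycle(clean));    // false
        System.out.println(hasCycle(tangled));  // true
    }
}
```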

For example, here is how I’ve set up BridgeDb in Eclipse. In the package explorer you see that I’ve defined a separate project for each module in BridgeDb.

And to complete the example, here is how I configured the build path for the org.bridgedb.bio module. As you can see, the org.bridgedb project is listed as its sole dependency.

BridgeDb paper published

January 12th, 2010

I’m very happy that our paper on BridgeDb was accepted by BMC Bioinformatics. It’s open access, so download it to your heart’s content. BridgeDb is all about identifier mapping, which I blogged about before (here, here and here).

BridgeDb lets you find cross-references for identifiers, but BridgeDb is not simply a cross-reference database. BridgeDb provides a standard method to access other cross-reference databases. And because of that level of standardization, you can easily decide to switch to a different source of cross-references.

Deepak Singh uses the term “middleware”, which is a good way to explain it, if that sort of word means anything to you.

But let me try to explain in a different way. BridgeDb is really a travel adapter. Suppose you’re in Japan and you’ve brought some gear like a laptop, cell phone and a Nintendo DS (just in case you get stuck in a blizzard while transferring at CDG). Much to your dismay you discover, after checking into your hotel, that none of your plugs fit in Japanese electrical sockets. So what do you do? Do you go down to Akihabara and spend a grand on a new laptop, phone and portable video game unit? Or do you buy a travel adapter for $1.95?

Just like there are many different power plugs around the world, there are many databases that do identifier mapping. And just like travel adapters let you plug in your laptop anywhere, no matter what country, BridgeDb lets you use your favorite bioinformatics tool, no matter what the source of identifier mappings is (provided that the tool uses BridgeDb).

Power plugs around the world

It’s important to realize that BridgeDb is simply a conduit of information. It does not calculate cross-references from scratch, nor does it give any guarantees about the validity of those cross-references. You shouldn’t ask if BridgeDb provides better identifier mappings. That is like asking if a travel adapter provides better electricity. You still depend on the power company to give you a stable source of electricity. The travel plug just gives you flexibility to adapt to different circumstances.
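For the programmers among you, the travel adapter idea looks roughly like this in Java. To be clear, this is not BridgeDb’s actual API: the interface CrossRefSource and the classes InMemorySource and PathwayTool are hypothetical stand-ins, and the two sources are toy tables instead of real databases. The methionine pair (ChEBI 16811, PubChem 6137) is borrowed from the metabolite example I posted earlier.

```java
import java.util.Map;
import java.util.Optional;

// The "socket": the single interface that a tool is written against.
interface CrossRefSource {
    Optional<String> lookup(String id);
}

// A stand-in for a real cross-reference database.
class InMemorySource implements CrossRefSource {
    private final Map<String, String> table;

    InMemorySource(Map<String, String> table) {
        this.table = table;
    }

    public Optional<String> lookup(String id) {
        return Optional.ofNullable(table.get(id));
    }
}

// The tool only knows the interface, never the concrete source.
class PathwayTool {
    private final CrossRefSource source;

    PathwayTool(CrossRefSource source) {
        this.source = source;
    }

    String describe(String id) {
        return source.lookup(id)
            .map(mapped -> id + " -> " + mapped)
            .orElse(id + " -> (no mapping)");
    }
}

class AdapterDemo {
    public static void main(String[] args) {
        CrossRefSource good = new InMemorySource(Map.of("16811", "6137"));
        CrossRefSource empty = id -> Optional.empty();  // a source that knows nothing

        // Same tool, different source: only the constructor argument changes.
        System.out.println(new PathwayTool(good).describe("16811"));   // 16811 -> 6137
        System.out.println(new PathwayTool(empty).describe("16811"));  // 16811 -> (no mapping)
    }
}
```

Note that the tool gets exactly what the source gives it. The adapter layer makes switching easy, but the quality of the answers still depends entirely on the source behind it.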