Saturday, August 13, 2011

Creating scientific (bioinformatics) software: going Agile?

This post summarizes a few personal thoughts about the software development process in scientific research. They come from personal experience as a PhD in physics (computational modeling), as a bioinformatician who rose to head of software development (CTO) in a medium-sized biotech software company, as an academic research/service group leader and as a programming course lecturer. These thoughts are based on a lot of personal failures, some success stories, even more observations of other projects and, of course, the literature (see references)...

Verbier to Zinal in speed flying, 2009

The context
There has often been a gap (a precipice?) between how scientists and programmers conceive of producing software: for the scientist, the code is “just a tool to get an answer”, whereas the programmer is expected to deliver a packaged product.

I will focus on a case study where bioinformaticians (dry lab) work on an exploratory project with wet lab scientists, but most of the arguments hold in other situations, even when the scientist is also the developer.

No culture...
Let’s put it frankly: there is not much of a software development methodology culture in bioinformatics. Why? Habits are hard to change, code writers are often not trained software engineers, and there is the motto that “our domain is just different”...

The field simply lacks the maturity found, for example, in game software development, and this leads to recurrent problems.

Common problems
We can observe several recurrent situations (check your favorites).
From the collaboration (developers/scientists) point of view:
  • software folks solve their own problems, not those of the wet lab: the result is software that frustrates the scientist;
  • scientists cannot define all their needs at project start (by nature, in exploration, questions appear along the way);
  • lack of trust between dry and wet labs;
  • long delays between software releases kill dialogue and collaborative creation;
  • the software cannot adapt to and follow a moving target (which research is, by essence);
  • the scientist's fear of innovation (e.g. new ways of representing data) versus the developer's unbridled excitement about innovation, leading to useless features (although they are super cool...);
  • “unplanned” data characteristics (rare events, growing volumes that kill speed) cannot always be predicted upfront;
  • scientists are reluctant to simplify the problem, even though some level of simplification is often possible without degrading the quality of the solution, and can make the challenge programmable.

From the software creation point of view:
  • over-architecture and code duplication;
  • YAGNI (“You ain't gonna need it”) is ignored;
  • code is hard to refactor, and thus cannot follow a moving target and a changing environment (third-party libraries, new data structures...);
  • yak shaving (Wiktionary: “The actually useless activity you do that appears important when you are consciously or unconsciously procrastinating about a larger problem”);
  • buggy software;
  • untested code (well, that’s quite related to the previous point...);
  • not easy (impossible?) to hand code over from one developer to another;
  • re-re-re-inventing the wheel;
  • code is polluted by unused components and by technologies that were tried, later abandoned, but never removed;
  • just poor code... 

“Ok! But our domain is just different from all others”
As soon as you raise these issues with senior scientists and propose solutions to cope with them, they will often a) confirm that they face(d) these situations, and b) argue that their field and challenges are just unique.

In practice, you can often end up with a combination of the following situations (check your favorites again):
  • a control-freak manager who designs every detail out of fear;
  • “heroic” developers killing themselves to make the bird fly, often producing an unsustainable effort and poor code, and becoming good candidates for a splendid burnout after a variable number of years;
  • scientists surviving with hand-made, painful Excel solutions;
  • a severed communication channel between wet and dry labs;
  • a total waste of resources, loss of motivation and poor outcomes.

But is bioinformatics really different?
Well, not that much... If you look at the main traits, the situation is very similar to other fields, such as web application or game development:
  • starting from a vision, development must continuously adapt to end user needs (which no one could even imagine at the start);
  • in scientific exploration software, the moving target is the quest itself, which takes the form of:
    • hypothesis to (in-)validate, 
    • ideas from articles to test or adapt, 
    • digging in data raises new questions, thus the need for new tools. 
Moreover, the tools need to be delivered in a continuous manner, to allow faster interactions and a real dialogue between scientists and developers.

This is an increasingly common practice in software development, often summarized as “release early, release often”, with very popular tools released daily (or even hourly) on the web.

The software community answer: going Agile
All those pathologies have been the routine landscape in software development for decades. As a remedy, a methodology (or more precisely, something like a philosophy or mindset) arose, very well suited, modulo some adaptations, to the situations we face.

Agile is a set of principles, which can be implemented through Scrum, XP etc. These implementations are nevertheless not miracle recipes to be applied as-is, but offer principles that answer our questions.

To quote Clinton Keith, the Agile manifesto can be read as follows:
“We are uncovering better ways of developing software by doing it and helping others do it. Through this work we have come to value:
  • individuals and interactions over processes and tools,
  • working software over comprehensive documentation,
  • customer collaboration over contract negotiation,
  • responding to change over following a plan.
That is, while there is value in the items on the right, we value the items on the left more.”
Isn’t that pretty much what we need?

An agile methodology proposal for scientific research software development
Adopting Scrum or XP literally, by the book, might be neither the best nor the most realistic approach (if possible at all), but by keeping the root principles we can make a few adaptations.

The roles
  • The team refers mainly to the developers, although frequent dialogue with the scientists is needed. The team can be reinforced for some iterations, for example by a statistician.
  • The scientist(s) take the place of the stakeholder/customer: the person for whom the software is developed (or, more precisely, the person directly in touch with the underlying scientific question).
  • For a large project, the product owner can be the lab head, ensuring the global direction of the project. If the planned software is light, the role of the product owner will be minimal.

How it works
  • Starting from a global vision, a non-exhaustive list of items (software features) is stored in the backlog. Backlog items are short stories, with a clear completion criterion, involving at most a couple of weeks of work. The backlog is populated throughout the project, as new questions or needs appear.
  • The heartbeat of the development process is the sprint. A sprint is a development iteration, which should be short for a research project (typically two weeks). A sprint consists of:
    • highest priority items are selected from backlog at sprint start; 
    • once a sprint is launched, the sprint item list shall not change; 
    • items are split into tasks, with estimated development time (a task longer than 16 hours should be split); 
    • at the end of the sprint, fully working software must be available; at a much higher pace (daily or even hourly) the software can be built on a test server to lower the latency in the wet/dry dialogue;
    • daily timeboxed standup team meetings (15 minutes) are held to keep everyone aware of progress and to quickly raise and address impediments.
  • Only the minimal code needed at the time is produced (YAGNI).
  • The application is fully tested (TDD) to allow continuous deployment.
  • Code refactoring is explicitly encouraged over over-architecting.
  • Peer code review and pair programming are used on critical tasks or to bring junior developers up to speed more rapidly.
  • To reduce risk, develop a culture of spikes (independent pieces of code to test a technology) and prototypes (quick and dirty code to validate an idea). This will also limit the integration of adventurous code into the main project.
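The sprint planning rules above (pick the highest priority items, split any task longer than 16 hours) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed tool: the names `Task`, `BacklogItem`, `split_long_tasks` and `plan_sprint`, the "lower number means higher priority" convention, the capacity expressed in hours and the example backlog are all hypothetical.

```python
import math
from dataclasses import dataclass, field

MAX_TASK_HOURS = 16  # from the text: a task longer than 16 hours should be split


@dataclass
class Task:
    name: str
    hours: float


@dataclass
class BacklogItem:
    title: str
    priority: int  # assumption: lower number = higher priority
    tasks: list = field(default_factory=list)

    @property
    def hours(self) -> float:
        # Estimated effort of an item is the sum of its task estimates.
        return sum(task.hours for task in self.tasks)


def split_long_tasks(tasks, max_hours=MAX_TASK_HOURS):
    """Break any task longer than max_hours into equal parts that fit."""
    result = []
    for task in tasks:
        if task.hours <= max_hours:
            result.append(task)
            continue
        parts = math.ceil(task.hours / max_hours)
        for i in range(parts):
            result.append(Task(f"{task.name} (part {i + 1}/{parts})", task.hours / parts))
    return result


def plan_sprint(backlog, capacity_hours):
    """Select items in priority order while they fit in the remaining capacity.

    Items that do not fit simply stay in the backlog for a later sprint.
    """
    selected, remaining = [], capacity_hours
    for item in sorted(backlog, key=lambda it: it.priority):
        if item.hours <= remaining:
            item.tasks = split_long_tasks(item.tasks)
            selected.append(item)
            remaining -= item.hours
    return selected


# Hypothetical backlog for a two-week sprint with 40 hours of capacity.
backlog = [
    BacklogItem("Export peptide table as CSV", 2, [Task("write exporter", 6)]),
    BacklogItem("Plot spectrum annotations", 1, [Task("draw peaks", 20), Task("label ions", 8)]),
    BacklogItem("Batch import of raw files", 3, [Task("write parser", 40)]),
]
sprint = plan_sprint(backlog, capacity_hours=40)
print([item.title for item in sprint])
# → ['Plot spectrum annotations', 'Export peptide table as CSV']
```

In practice the whiteboard and index cards do this job perfectly well; the sketch only makes the selection and splitting rules explicit.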

The tools
Well, they could hardly be lighter:
  • a whiteboard with 3x5 index cards for planning and progress tracking; 
  • a continuous build/test/deploy infrastructure (the process should be automated and painless);
  • test and prod servers (typically virtualized machines). 
Such infrastructure is commonly available.
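To make the test-first habit behind the automated build more concrete, here is a minimal sketch using Python's standard `unittest` module. The function `gc_content` and its test cases are hypothetical examples chosen for illustration; in a TDD workflow the tests would be written first, and the function filled in until they pass.

```python
import unittest


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a DNA sequence (hypothetical example function)."""
    if not sequence:
        raise ValueError("empty sequence")
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


class GcContentTest(unittest.TestCase):
    def test_balanced_sequence(self):
        # Two of the four bases are G or C.
        self.assertAlmostEqual(gc_content("ATGC"), 0.5)

    def test_case_insensitive(self):
        # Lower-case input is accepted and counted the same way.
        self.assertAlmostEqual(gc_content("ggcc"), 1.0)

    def test_empty_sequence_is_rejected(self):
        with self.assertRaises(ValueError):
            gc_content("")
```

Saved in a test module, such tests can be run with `python -m unittest` on every build, so a failing test stops the deployment before the scientists ever see the regression.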

Besides proposing a framework to create software, going agile has several advantages:
  • enhancing horizontal communication (between wet and dry labs) over hierarchical management;
  • shorter cycles of working software make it possible to track progress more effectively and reduce risk;
  • iterative decisions about feature implementation limit the need for upfront design and offer the possibility to pivot and follow the path of the scientist's quest.
Finally, and maybe the most important outcome, developers and scientists really build a solution together and work towards a common goal. This mindset, setting up an environment where the product belongs to its creators, is a key part of the team's motivation and of the production of a high quality solution.

Going agile?
Going agile in an existing environment cannot be imposed by upper management or with a magic wand; it must be adopted from the base. Different techniques exist to encourage this adoption (see Clinton Keith’s, chapter 16) and experience shows that, with a careful process, motivated team members quickly see the benefit of such a move...

Further readings
