Big-Ass Servers™ and the myths of clusters in bioinformatics
Spending $55k for a 512GB machine (Big-Ass Server™ or BAS™) can be a tough sell for a bioinformatics researcher to pitch to a department head.
Speaking as someone who keeps his copy of CLR safely stored in the basement, ready to help rebuild society after a nuclear holocaust, I am painfully aware of the importance of algorithm development in the history of computing, and the possibilities for parallel computing to make problems tractable.
Dell PowerEdge R900, available in orange and lemon-lime
Having recently spent 3 years in industry, however, I am now more inclined to just throw money at problems. In the case of hardware, I think this approach is more effective than clever programming for many of the current problems posed by NGS.
From an economic and productivity perspective, I believe most bioinformatics shops doing basic research would benefit more from having access to a BAS™ than a cluster. Here's why:
- The development of multicore/multiprocessor machines and memory capacity has outpaced the speed of networks. NGS analyses tend to be memory-bound and I/O-bound rather than CPU-bound, so relying on a cluster of smaller machines can quickly overwhelm a network.
- NGS has expanded the set of high-performance applications from BLAST and protein structure prediction to dozens of different little analyses, with tools that change on a monthly basis or are homegrown to deal with special circumstances. There isn't the time or ability to write each of these for parallel architectures.
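The network-versus-memory gap in the first point can be put into rough numbers. This is a back-of-envelope sketch, not a benchmark: the 100 GB dataset and the 10 GB/s memory bandwidth are illustrative assumptions, and only the gigabit Ethernet ceiling (1 Gbit/s = 0.125 GB/s) is a fixed fact.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to stream `size_gb` gigabytes at a sustained bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


# 1 Gbit/s is 0.125 GB/s at best (protocol overhead ignored).
GIGABIT_ETHERNET_GB_PER_S = 1 / 8
# Assumed figure for illustration only; real memory bandwidth varies by machine.
ASSUMED_MEMORY_GB_PER_S = 10.0
# Hypothetical dataset size.
DATASET_GB = 100.0

network_s = transfer_seconds(DATASET_GB, GIGABIT_ETHERNET_GB_PER_S)
memory_s = transfer_seconds(DATASET_GB, ASSUMED_MEMORY_GB_PER_S)
print(f"network: {network_s:.0f} s, memory: {memory_s:.0f} s, "
      f"ratio: {network_s / memory_s:.0f}x")
```

Under these assumptions, every pass over the data that has to cross the network costs minutes rather than seconds, which is the sense in which a cluster of small machines can become network-bound.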
Myth: Google uses server farms. We should too.
Google has to focus on doing one thing very well: search.
Bioinformatics programmers have to explore a number of different questions for any given experiment. There is not time to develop a parallel solution to many of these questions as they will lead to dead ends.
Many bioinformatics problems, de novo assembly being a prime example, are notoriously difficult to divide among several machines without being overwhelmed by messaging. Imagine trying to divide a jigsaw puzzle among friends sitting at several tables: you would spend more time talking about the pieces than fitting them together.
Myth: Our development setup should mimic our production setup
An experimental computing structure with a BAS™ allows for researchers to freely explore big data without having to think about how to divide it efficiently. If an experiment is successful and there is the need to scale-up to a clinical or industrial platform, that can happen later.
Myth: Clusters have been around a long time so there is a lot of shell-based infrastructure to distribute workflows
There are tools for queueing jobs, but they offer little help in managing workflows that mix parallel and serial steps - for example, waiting for every parallel step to finish before merging results.
Various programming languages have features to take advantage of clusters. For example, R has SNOW. But Rsamtools requires you to load BAM files into memory, so a BAS™ is not just preferable for NGS analysis with R, it's required.
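On a single large machine, the "parallel steps, then wait, then merge" pattern described above needs no scheduler at all. The following is a minimal sketch using Python's standard `concurrent.futures`; the k-mer counting task, the function names, and the toy reads are all hypothetical stand-ins for a real analysis step.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def split_into_chunks(reads, n_chunks):
    """Deal reads round-robin into n_chunks lists (some may be empty)."""
    chunks = [[] for _ in range(n_chunks)]
    for i, read in enumerate(reads):
        chunks[i % n_chunks].append(read)
    return chunks


def count_kmers(reads, k=3):
    """Parallel step: count every k-mer in one chunk of reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def run_workflow(reads, n_workers=4, k=3, executor_cls=ThreadPoolExecutor):
    """Fan out over chunks, wait for all of them, then merge serially."""
    chunks = split_into_chunks(reads, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        # list() consumes the results, blocking until every chunk has
        # finished, so the merge below cannot start early.
        partial_counts = list(pool.map(partial(count_kmers, k=k), chunks))
    merged = Counter()
    for counts in partial_counts:  # serial merge step
        merged.update(counts)
    return merged


reads = ["ACGTAC", "GTACGT", "ACGACG"]  # toy data
print(run_workflow(reads, n_workers=2).most_common(3))
```

Threads are the default here only to keep the sketch simple. For CPU-bound pure-Python work, passing `executor_cls=ProcessPoolExecutor` from the same module (run from a script under an `if __name__ == "__main__":` guard) is the usual way to use all the cores.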
Myth: The rise of cloud computing and Hadoop means that homegrown clusters are irrelevant, but it also means we don't need a BAS™
The popularity of cloud computing in bioinformatics is also driven by the newfound ability to rent time on a BAS™. The main problem with cloud computing is the bottleneck posed by transferring GBs of data to the cloud.
Myth: Crossbow and Myrna are based on Hadoop, we can develop similar tools
Ben Langmead, Cole Trapnell, and Michael Schatz, alums of Steven Salzberg's group at UMD, have developed NGS solutions using the Hadoop MapReduce framework.
- Crossbow is a Hadoop-based implementation of Bowtie.
- Myrna is an RNA-Seq pipeline.
- Contrail is a de novo short read assembler.
The dynamic scripting languages used by most bioinformatics programmers are not as well suited to Hadoop as Java. To imply that we can all develop tools of similar sophistication is unrealistic. Many bioinformatics programs are not even threaded, much less designed to work across several machines.
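For readers unfamiliar with the MapReduce model these tools build on, here is a single-process simulation of its three phases in plain Python. It does not use Hadoop or any of its APIs, and the function names are hypothetical. It only shows the shape an analysis has to be recast into: here, counting reads per reference sequence from SAM records, where the third tab-separated column is the reference name and `*` means none was assigned.

```python
from collections import defaultdict


def map_alignment(sam_line):
    """Map phase: emit (reference_name, 1) for one SAM alignment record."""
    fields = sam_line.rstrip("\n").split("\t")
    reference = fields[2]  # third SAM column is RNAME
    if reference != "*":   # "*" means no reference was assigned
        yield reference, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce phase: collapse one key's values into a total."""
    return key, sum(values)


def count_reads_per_reference(sam_lines):
    """Run map, shuffle, and reduce in sequence, skipping '@' header lines."""
    pairs = (
        pair
        for line in sam_lines
        if not line.startswith("@")
        for pair in map_alignment(line)
    )
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


toy_sam = [  # hypothetical records
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr2\t5\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t16\tchr1\t250\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
print(count_reads_per_reference(toy_sam))
```

Even this trivial count has to be expressed as independent key-value emissions plus a merge. Analyses that need global state, such as assembly, are much harder to fit into that shape.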
Myth: embarrassingly parallel problems imply a cluster is needed
A server with 4 quad-core processors is often adequate for handling these embarrassing problems. Dividing the work just tends to lead to further embarrassments.
Here is a particularly telling quote from Biohaskell developer Ketil Malde on Biostar:
In general, I think HPC are doing the wrong thing for bioinformatics. It's okay to spend six weeks to rewrite your meteorology program to take advantage of the latest supercomputer (all of which tend to be just a huge stack of small PCs these days) if the program is going to run continously for the next three years. It is not okay to spend six weeks on a script that's going to run for a couple of days.
In short, I keep asking for a big PC with a bunch of the latest Intel or AMD core, and as much RAM as we can afford.
Myth: We don't have money for a BAS™ because we need a new cluster to handle things like BLAST
IBM System x3850 X5 expandable to 1536GB, mouse not included
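Even the BLAST setup we think of as the essence of parallelism (a segmented genome index, where every node gets a part of the genome) is often not the one that institutions have settled on. Many instead farm out queries to a cluster in which every node holds the full genome index in memory.
Secondly, mpiBLAST appears to be better suited to dividing an index among older machines than among today's, which typically have >32GB RAM. Here is a telling FAQ entry: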
I benchmarked mpiBLAST but I don't see super-linear speedup! Why?!
mpiBLAST only yields super-linear speedup when the database being searched is significantly larger than the core memory on an individual node. The super-linear speedup results published in the ClusterWorld 2003 paper describing mpiBLAST are measurements of mpiBLAST v0.9 searching a 1.2GB (compressed) database on a cluster where each node has 640MB of RAM. A single node search results in heavy disk I/O and a long search time.
http://www.mpiblast.org/Docs/FAQ#super-linear
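The FAQ's numbers make the point concrete. As a rough sketch, the helper below (a hypothetical function, not part of mpiBLAST) computes how many equal segments a database must be split into so that each fits in one node's RAM. It treats the quoted 1.2GB database as 1200 MB and ignores decompression and operating-system overhead, both of which are simplifying assumptions.

```python
import math


def segments_needed(database_mb: int, node_ram_mb: int) -> int:
    """Smallest number of equal segments such that each fits in node RAM."""
    if database_mb <= 0 or node_ram_mb <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(database_mb / node_ram_mb)


DATABASE_MB = 1200  # the FAQ's 1.2GB database, taken as 1200 MB (assumption)

# Node from the ClusterWorld 2003 benchmark: 640MB of RAM.
print(segments_needed(DATABASE_MB, 640))        # 2
# A node with 32GB of RAM, the lower bound mentioned above.
print(segments_needed(DATABASE_MB, 32 * 1024))  # 1
```

Once a single node can hold the whole database, the disk I/O penalty that produced the super-linear speedup is gone, and with it the main reason to segment the index.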
Your comments on this topic are welcome!
The first answer: Galaxy - it makes it easy to parallelize everything. The second answer: one large VM in Amazon EC2. In Poland we also have a central scientific computing facility that provides, among other things, large virtual machines in a similar way to Amazon.
We have heavy computations once per week, so we only utilize about 20% of the available computational time. So there is no need to buy a costly BAS.
Hi Marcin,
I don't understand how a high-memory virtual machine can effectively span multiple nodes, so there must be an actual physical BAS that is supported by your research. So it kind of sounds like you have, albeit indirectly, bought 20% of a BAS.
-jerm
Of course it is like buying 20% of a BAS for a price that is about 40% of a BAS's price. Moreover, it is limited in terms of size (68.4 GB of RAM in the case of Amazon). However, we don't have the human resources to maintain (invite tenders, back up, update, etc.) either a BAS or a cluster. We have only two bioinformaticians in the lab and I am one of them. I suppose it would be one of us responsible for maintaining the server...
EC2 is great (with or without Galaxy) and for some groups the large memory instance will suffice as a big (or medium) ass server. But for lots of applications including BLASTing against a big database (>64Gb) or doing complex assemblies, you will need more memory. It may be that EC2 will offer larger instances in the future. But I completely agree with the original post - a big-ass server is often a way better investment than a cluster. Unfortunately the guys that run high-performance compute facilities tend to be stuck in a cluster mindset, and that may not be the best solution for a lot of bioinformatics projects.
Jerm - I am definitely going to consider getting one of those bad boys!
I generally agree with the post, but am going to come from the contrarian viewpoint in general.
A BAS makes sense under a few conditions. Your utilization is predictable, and high, and you can take some downtime risk. For many applications that is more than sufficient.
Having said that, there are some flaws in the argument
1. BLAST is hardly a high end application, and MPIBlast shouldn't exist. It does because we don't know how to write distributed systems.
2. At scale, BAS is always a bad idea cause hardware will fail. You should buy cheaper servers and make your software fault tolerant, so unless you are an occasional user and working with small data sizes (which is absolutely true for many people), you should consider other options.
3. Google does a LOT more than search. They have different data stores and processing engines optimized for different problems precisely cause they have so many. Their problem set and data complexity is beyond anything NGS has to deal with (and boy should we be paying attention).
There are two real nits I do have.
1. We are terrible programmers and the quality of our science is going to suffer as long as we remain there. You are arguing for laziness and a lack of interest in solving hard computational problems. Then all the smart people are going to keep going to Facebook cause their skills get appreciated there.
2. Parallel programming is not one thing. We need a distributed systems approach, i.e. assume your networks are crappy, disk is slow and compute is fast and cheap. You can bet that the infrastructure at Google is way way way cheaper than the University cluster. Yes, they need to invest in the software, but there are lessons to be learned. Otherwise, you incur a ton of long term debt.
Of course, as scientists, we don't care about that.
Interesting post
Some points:
1: You don't have to program in Java to use Hadoop; it supports scripting languages via Hadoop Streaming and other frameworks (Dumbo etc.)
2: You mention I/O as a bottleneck. What is the I/O subsystem that feeds your BAS? One of the reasons why people move to clusters is I/O from many internal disks/spindles typically outperforms
3: What is your plan when you outgrow your BAS? Buy a bigger one for 75K? What if there is no bigger one?
4: You mention "Many bioinformatics programs are not even threaded, much less designed to work amongst several machines."
How are such programs going to benefit from a BAS?
Good post! Going to blog this on kevin-gattaca.blogspot.com
I almost agree with you totally? But I think BAS isn't a very shareable resource.
It is useful to have loads of RAM and loads of cores for one person's use. But when it is shared, you have a hard time juggling resources in a fair manner, especially in bioinformatics, where walltimes and RAM requirements are known only post-analysis.
That said Cloud computing is having trouble keeping up with I/O bound stuff like bioinformatics, and smaller cloud computing services are all trying to show that they have faster interconnects, but you can't really beat a BAS that's on a local network.
The discussion is reignited once again on biostar:
http://biostar.stackexchange.com/questions/16129/big-ass-servers-storage
Yannick Wurm tells me the British equivalent to "big-ass servers" is "fuck-off big machines". This is good to know.
http://biostar.stackexchange.com/questions/18873/peer-reviewed-justification-for-big-ass-servers
wow: http://vimeo.com/64637406