Mark Drapeau (Washington, DC) —
More data is being produced, analyzed, shared, and stored than ever before. Scientific research, particularly biological sciences like genomics, is one of the more prominent examples of this, with laboratories producing terabytes of information every minute. There are many challenges moving forward with large-scale computational research in the biosciences and other areas, and computer scientists are just beginning to understand the benefits and limits in this area.
Here, I discuss how inexpensive cloud computing options are changing the way bioscientists think strategically about storing, analyzing, and collaborating around their rapidly growing research data, and I review three lessons learned from early experiments in large-scale biocomputational research in the cloud.
Cloud Computing, -Bursting, and -Balancing for Research
Cloud computing means many different things to many different people, but to academic science laboratories (which operate in many ways like small businesses), one primary goal of leveraging the cloud is to minimize costs by eliminating the ownership of infrastructure like servers. Not so long ago, labs doing a lot of number-crunching would reserve time on a university mainframe computer. More recently, a savvy principal investigator would set up a “Linux cluster” of computers that could crunch data in-house.
The downside of both approaches is the cost, responsibility, maintenance, and even space constraints associated with these machines. The more of this burden that can be eliminated, and the more readily data can be accessed through standard devices (laptops, smartphones, etc.), the more cloud computing benefits academic researchers, who would in many cases rather spend limited grant money on other supplies or staff.
Not all data need be stored in the cloud, though. Whether we’re thinking about a modest-sized academic lab preferring to keep an in-house Linux cluster, or a corporate biotech lab using enterprise-level software, “private” clouds, and managing many employees, there are still ways in which cloud computing can empower researchers. Something called “cloudbursting” – in essence, bursting into a remote, public cloud when one’s internal capacity is temporarily reached – has caught on as a discussion topic. Another tactic, “cloud balancing,” balances workloads across multiple clouds given certain pre-set conditions. No doubt, more ways to use cloud computing for scientific data analyses will emerge in the future.
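To make the two tactics concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular cloud platform implements these features: the `Cluster` class, the cluster names, and the capacities are all hypothetical, and "lowest utilization" is just one possible pre-set condition for balancing.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """A pool of compute slots. Names and capacities here are hypothetical."""
    name: str
    capacity: int      # maximum number of concurrent jobs
    running: int = 0   # jobs currently assigned

    def free_slots(self) -> int:
        return self.capacity - self.running


def cloudburst(jobs: int, local: Cluster, public: Cluster) -> dict:
    """Fill the in-house cluster first, then overflow ("burst") to the public cloud."""
    to_local = min(jobs, local.free_slots())
    to_public = min(jobs - to_local, public.free_slots())
    local.running += to_local
    public.running += to_public
    return {
        local.name: to_local,
        public.name: to_public,
        "unplaced": jobs - to_local - to_public,
    }


def cloud_balance(jobs: int, clouds: list) -> dict:
    """Send each job to whichever cloud currently has the lowest utilization."""
    placed = {c.name: 0 for c in clouds}
    for _ in range(jobs):
        candidates = [c for c in clouds if c.free_slots() > 0]
        if not candidates:
            break  # every cloud is full; remaining jobs stay unplaced
        target = min(candidates, key=lambda c: c.running / c.capacity)
        target.running += 1
        placed[target.name] += 1
    return placed


if __name__ == "__main__":
    # A lab cluster with 100 slots receives 120 jobs: 20 burst to the public cloud.
    print(cloudburst(120, Cluster("lab", 100), Cluster("public", 1000)))
    # Two clouds of different sizes share 15 jobs in proportion to capacity.
    print(cloud_balance(15, [Cluster("cloud_a", 10), Cluster("cloud_b", 20)]))
```

In a real deployment the decision would also weigh data transfer costs and where the data are allowed to live, which this sketch ignores.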
Two Examples of Cloud Computing in Biological Research
At Seattle Children’s Hospital, researchers interested in protein interactions wanted to know more about the interrelationships of known protein sequences. Due to the sheer number of known proteins (nearly 10 million), this would have been a very difficult problem for even the most state-of-the-art computer to solve. Initial estimates indicated that it would take a single computer more than six years to find the results. But by leveraging the power of the cloud in data centers spanning multiple countries on two continents, the researchers cut their computing time substantially, down to one week.
Across town, scientists at the University of Washington are working on a project to identify key drivers for producing hydrogen, a promising alternative fuel. The method they adopted characterizes a population of strains of the bacterium Rhodopseudomonas palustris and uses genomics approaches to dissect the molecular networks of hydrogen production. Part of their process involves a series of comparisons among 16 distinct strains of R. palustris, each with about 5,000 proteins, to look for specific similarities and differences that would be helpful. Leveraging the cloud, these researchers were able to move from a time of three hours or more to analyze each strain to something far, far less – about 30 minutes total.
How is this possible? The work is split into many independent pieces that run in parallel across many machines in the cloud, so that old analyses can be done in new, more efficient ways.
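The size of the two speedups described above can be checked with some quick arithmetic. The sketch below uses only the figures quoted in this article and makes two assumptions of my own: that "more than six years" is treated as exactly six years (so the first result is a lower bound), and that "about 30 minutes total" covers all 16 strains at three hours each.

```python
HOURS_PER_YEAR = 365.25 * 24
HOURS_PER_WEEK = 7 * 24


def speedup(before_hours: float, after_hours: float) -> float:
    """How many times faster the 'after' run is than the 'before' run."""
    return before_hours / after_hours


# Seattle Children's Hospital: six years on one computer vs. one week in the cloud.
protein_speedup = speedup(6 * HOURS_PER_YEAR, 1 * HOURS_PER_WEEK)

# University of Washington: 16 strains at 3 hours each vs. 30 minutes in total
# (assumes the 30-minute figure covers all 16 strains).
strain_speedup = speedup(16 * 3, 0.5)

print(f"Protein comparison: at least {protein_speedup:.0f}x faster")
print(f"R. palustris strains: roughly {strain_speedup:.0f}x faster")
```

Under those assumptions the first project ran a little over 300 times faster and the second close to 100 times faster, which is the kind of gain one would expect from spreading independent comparisons over hundreds of machines.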
Get Your Free Cloud at the National Science Foundation
In February, the National Science Foundation, one of the primary grant-giving bodies for scientific researchers, announced that it had teamed up with Microsoft to offer some grantees free access to cloud computing resources on Microsoft’s new platform named Azure. With an annual budget of about $7 billion, the NSF funds roughly 20 percent of federally supported university research, and thus this is a very significant advance for cloud computing in the scientific community.
Dan Reed, Corporate Vice President of Technology Policy and Strategy and leader of the Microsoft eXtreme Computing Group, and sponsor of the collaboration with NSF, noted that, “Windows Azure provides on-demand compute and storage to host, scale and manage Web applications on the Internet… Microsoft researchers and developers will work with grant recipients to equip them with a set of common tools, applications and data collections that can be shared with the broad academic community.”
The value of this cloud access is increasingly obvious to researchers. Dr. Ed Lazowska, a computer scientist at the University of Washington, was cited in The New York Times as saying that “the explosion of data being collected by scientists had transformed the staffing needs of the typical scientific research program on campus from a half-time graduate student one day a week to a full-time employee dedicated to managing the data… such exponential growth in cost was increasingly hampering scientific research.” Thus, being able to use Windows Azure fully or partly should free valuable resources for other tasks.
BLAST Your DNA Samples Into the Cloud
In March 2010, Steve Ballmer, the CEO of Microsoft, commented, “Technology has an exponential path in front of it, meaning it has the ability to propel science, medicine, business, social issues and personal interactions in ways that are increasingly important to society and our own everyday lives.” In that vein, Microsoft announced in November that a critical genomics tool called BLAST (Basic Local Alignment Search Tool) would be made available through the cloud, enabling scientists to conduct their genomics analyses far more easily and quickly than before.
Researchers in fields ranging from bioinformatics to energy to drug research – like the Seattle-based researchers described above – use BLAST to sift through large databases, identify new animal species, improve drug effectiveness, produce biofuels, and much more. What the new NCBI BLAST hosted in the cloud does is provide a user-friendly Web interface and access to back-end (and largely out-of-sight) cloud computing on Windows Azure for very large BLAST computations. In more advanced scenarios, scientists will not only be able to conduct BLAST analyses on their private data collections, but also include public data hosted entirely in the cloud (including data from peer-reviewed scientific publications).
When I was in graduate school working on different aspects of genetics with Prof. Tony Long, and later at NYU working on genomics as a postdoctoral fellow, I used BLAST frequently. It was hosted by a part of the National Institutes of Health (NIH) called the National Center for Biotechnology Information (NCBI). NCBI BLAST was slow. You would enter just a single DNA or protein sequence and then “blast” it against the millions of sequences in the public database to find matches, and this could take a while. Users would get a message saying “Results ready in 30 seconds…” and then after 30 seconds get another one saying “Still working… Results ready in 90 seconds…” and so on. And this was 10 years ago.
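For readers who have never used it, the basic idea of searching one query against a database of sequences can be sketched with a toy example. To be clear, this is not BLAST: the real tool finds short matching "words", extends them into local alignments, and scores the results statistically. The sketch below only does a crude version of the first step, counting shared short words, and the sequences and function names are made up for illustration.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all length-k substrings ("words") in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_search(query: str, database: dict, k: int = 4) -> list:
    """Rank database entries by how many k-letter words they share with the query."""
    query_words = kmers(query, k)
    hits = []
    for name, seq in database.items():
        shared = len(query_words & kmers(seq, k))
        if shared:
            hits.append((name, shared))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# A made-up three-entry "database"; a real one holds millions of sequences.
database = {
    "seq_a": "ATGGCGTACGTTAGC",
    "seq_b": "TTTTTTTTTTTTTTT",
    "seq_c": "GGCGTACG",
}

print(toy_search("GCGTACGT", database))
```

Even in this toy form, each database entry is checked independently of the others, which is why this kind of search splits up so naturally across many machines.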
These scientific tools in the cloud help labs on the small end of the scale. “NCBI BLAST on Windows Azure gives all research organizations the same computing resources that traditionally only the largest labs have been able to afford,” said Bob Muglia, a Senior Vice President at Microsoft overseeing all its work in the cloud. “It shows how Windows Azure provides the genuine platform-as-a-service capabilities that technical computing applications need to extract insights from massive data, in order to help solve some of the world’s biggest challenges across science, business and government.”
Now, with much more sophisticated tools to collect biological and other data, scientists are being overpowered with a “data tsunami” of sorts. As Dan Reed wrote in a blog post on the issue called The Future of Discovery and the Power of Simplicity, “Simply put, science is in transition from data poverty to data plethora. The implication is that future advantage will accrue to those who can best extract insights from this data tsunami… I believe this will have a transformative, democratizing effect – driving change and creating discovery and innovation opportunities.”
More broadly, computer science researchers from different companies are interested in bringing the full force of cloud computing resources to technical specialists in many different science and engineering disciplines. Microsoft Research, for its part, is driving a worldwide program to engage the research community. Microsoft’s technical computing initiative is aimed at bringing supercomputing power and resources – particularly through Windows Azure – for modeling and prediction to more organizations across science, business and government.
Lessons Learned About Large-Scale Computational Research in the Cloud
The application of BLAST on Windows Azure to research conducted by the University of Washington and Children’s Hospital groups mentioned above taught Microsoft Research many important lessons about how to structure large-scale research projects in the cloud. Moreover, most of what was learned is applicable not just to BLAST, but to any parallel jobs run at large scale in the cloud. Here are three lessons learned about large-scale computational research in the cloud:
- Design for failure: Large-scale data-set computation will nearly always result in some sort of failure. In the week-long run of the Children’s Hospital project, there were a number of failures: both failures of individual machines, and entire data centers taken down for regular updates. In each case, Windows Azure produced messages about the failure, and had mechanisms in place to make sure jobs were not lost.
- Structure for speed: Structuring individual tasks in an optimal way can significantly reduce the total computation run time. Researchers conducted several test runs before embarking on runs of the whole dataset to make sure that input data was partitioned to get the most use out of each worker node.
- Scale for cost savings: If a few long-running jobs are processing alongside many shorter jobs, it is important to not have idle worker nodes continuing to run up costs once their piece of the job is done. Researchers learned to detect which computers were idle and shut them down to avoid unnecessary costs.
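The three lessons above can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the code the researchers or Windows Azure actually used: the function names, the simulated failure, and the worker bookkeeping are all hypothetical.

```python
def partition(items: list, n_chunks: int) -> list:
    """Structure for speed: split the input into near-equal chunks, one per worker."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def run_with_retry(task, chunk, max_attempts: int = 3):
    """Design for failure: re-run a chunk if the worker handling it fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(chunk)
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up only after the final attempt


def idle_workers(assignments: dict) -> list:
    """Scale for cost savings: find workers with no pending work so they can be shut down."""
    return [worker for worker, pending in assignments.items() if not pending]


def make_flaky_task(fail_first_n: int):
    """Build a stand-in task that fails its first few calls, to simulate machine failures."""
    calls = {"count": 0}

    def task(chunk):
        calls["count"] += 1
        if calls["count"] <= fail_first_n:
            raise RuntimeError("simulated worker failure")
        return sum(len(seq) for seq in chunk)  # placeholder for the real analysis

    return task


if __name__ == "__main__":
    print(partition(list(range(10)), 3))
    print(run_with_retry(make_flaky_task(2), ["ACGT", "AC"]))
    print(idle_workers({"worker_1": [], "worker_2": ["chunk_7"]}))
```

The test runs mentioned in the second lesson correspond to trying different values of `n_chunks` on a small sample before committing to the full dataset.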
Working on large-scale projects in the cloud isn’t like typical, traditional biological research. For a period of time (when I was in the laboratory) you actually had to understand how to run Linux and use programming languages to control robots and genomic sequencing machines, and then other languages like R to custom-analyze the data. And some people will still do this. But now science appears to be entering a phase of being aided by new kinds of apps in new form factors, backed by the technology underlying cloud computing, to make the most of the time, effort, and, ultimately, discovery and public good that come from government research grant dollars. And in the long run, such advances help everyone. The cloud is simply accelerating the pace at which discovery comes about.
Dr. Mark Drapeau is the Director of U.S. Public Sector Social Engagement at Microsoft, and the Editor of SECTOR: PUBLIC. He is also a trained research biologist. Mark last wrote about the relationship between Wikileaks and the Open Government movement. You can follow him on Twitter at @cheeky_geeky.
Images from Wikipedia except DNA, the desk and the rocket, used under Creative Commons.