Today’s Hottest Big Data Technology – Already Obsolete?

Database guru and 2014 Turing Award winner Dr. Michael Stonebraker is not a big fan of MapReduce. Or so we can readily surmise, based on some of the comments he made (together with co-author David DeWitt, another highly acclaimed database researcher) in a blog post in early 2008. He and DeWitt called MapReduce a “giant step backward … sub-optimal … not novel at all … missing most of the features that are routinely included in current DBMS … incompatible with all of the tools DBMS users have come to depend on.”[1]

More than five years later, his opinion had hardly changed, as he stated simply that MapReduce “is too inefficient for production use.”[2]

His more recent criticisms of MapReduce have been offered in the context of Hadoop, the Apache open-source implementation of MapReduce and supporting technologies.

Now it seems to us that the name “Hadoop” has become synonymous with “Big Data” for many, so perhaps some clarification is in order.

Hadoop was created about ten years ago by a software developer named Doug Cutting. He had earlier developed an open-source Web indexing and search technology he called Lucene, and with another developer named Mike Cafarella, a Web crawler and HTML parser called Nutch.

As a contract programmer (and later employee) at Yahoo!, Cutting was looking for ways to more efficiently handle the massive workloads involved in keeping an Internet search engine up-to-date and performing well.

Meanwhile, engineers at Google had developed a number of useful Web-scale tools and technologies, and, as they continue to do to this day, they shared the fruits of their efforts with the world by way of research papers.

In October 2003, Google published a paper describing a reliable, scalable, high-performance and low-cost distributed file system they called the Google File System, or GFS.[3] In December 2004, they published a paper describing in detail a high-volume data processing framework that ran on top of GFS.[4] They called this processing framework or model MapReduce, after the two basic (and batch-oriented) phases of processing it performs – a map phase that applies a user-supplied function to the distributed input data to produce intermediate key-value pairs, and a reduce phase that aggregates those intermediate results by key.
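
To make the model concrete, the canonical example from Google’s MapReduce paper is counting word occurrences across a large collection of documents. What follows is a minimal, single-machine sketch in Python of the two phases; the real framework splits the input across many machines, runs mappers and reducers in parallel, and shuffles the intermediate pairs between them, and the function names below are ours, not Hadoop’s or Google’s.

    from collections import defaultdict

    # Map phase: emit an intermediate (word, 1) pair for every word in a document.
    def map_phase(document_text):
        for word in document_text.split():
            yield (word, 1)

    # Shuffle: group intermediate values by key. In a real cluster the framework
    # does this across the network; here it is a simple in-memory dictionary.
    def shuffle(pairs):
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    # Reduce phase: aggregate the grouped values for each key.
    def reduce_phase(key, values):
        return key, sum(values)

    documents = ["the quick brown fox", "the lazy dog", "the fox"]
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
    print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}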

Cutting recognized immediately that Google’s ideas represented a powerful solution to the problems he was working on for Yahoo!, and so, starting in 2004, he undertook the development of an open-source, Java-based implementation of MapReduce and GFS, with help from Cafarella and others.

He named the project “Hadoop” (after his son’s toy elephant). Within Hadoop, he kept the name MapReduce (which was never trademarked) for the analytical framework, and named the associated file system HDFS, for Hadoop Distributed File System.

At its core, Hadoop itself consists of just four components – MapReduce, HDFS, a library of common utilities, and a resource management and scheduling layer called YARN (a later addition, introduced with Hadoop 2).

As of today, the Apache Hadoop “stack” has grown considerably, and now includes:

  • Ambari – a Web-based administration tool for Hadoop implementations;
  • Avro – a framework for serializing data and associated schema for efficient network transfers;
  • Cassandra – a distributed, NoSQL DBMS originally developed at Facebook;
  • Chukwa – a toolset for collecting and analyzing log files across large distributed systems;
  • HBase – a distributed, NoSQL DBMS developed as an open-source implementation of Google’s “Bigtable”, a massively scalable, sparse, non-relational tabular data store;
  • Hive – a data warehouse infrastructure that runs on top of Hadoop; it provides a high-level query language, and automatically converts and submits queries as MapReduce, Spark (see below), or Tez (below) jobs;
  • Mahout – a free collection of scalable, machine learning algorithms that can be run on Hadoop;
  • Pig – a high-level language for generating MapReduce programs;
  • Spark – a cluster-computing and in-memory alternative to MapReduce that can outperform MapReduce by orders of magnitude on certain classes of problems (a short example follows below). Spark also supports SQL data access, streaming data analytics, distributed graph processing, and other features not available in MapReduce;
  • Tez – a framework with capabilities similar to Spark, but intended for low-level integration with other tools;
  • ZooKeeper – services to help manage the configuration and coordination of large-scale distributed systems.

(Not to mention hundreds of other non-Apache technologies that are part of the Hadoop “ecosystem”.)
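
To give a feel for why Spark in particular has attracted so much attention, here is the same word count expressed against PySpark, Spark’s Python API, in a handful of lines; an equivalent hand-written Hadoop MapReduce job typically runs to dozens of lines of Java. This is a sketch only: the input path is hypothetical, and a real deployment would point at an HDFS cluster rather than local mode.

    from pyspark import SparkContext

    # Spark keeps intermediate results in memory between stages, which is where
    # much of its speedup over disk-based MapReduce comes from.
    sc = SparkContext("local[*]", "WordCount")

    counts = (sc.textFile("hdfs:///data/documents/*.txt")  # hypothetical input path
                .flatMap(lambda line: line.split())        # map: one word per record
                .map(lambda word: (word, 1))                # intermediate (key, value) pairs
                .reduceByKey(lambda a, b: a + b))           # reduce: sum the counts per word

    for word, count in counts.collect():
        print(word, count)

    sc.stop()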

So getting back to Dr. Stonebraker, he has recently observed that 1) Hadoop now refers to a collection of technologies built around HDFS, of which MapReduce is just one part; 2) Google has long since abandoned MapReduce and so should we; and 3) as a distributed file system HDFS is fine, but as the foundation for a DBMS, it is “a fate worse than death”[5].

And as a true die-hard DBMS guy, Stonebraker of course has to remind us that in a distributed file system world, “features that users take for granted in a DBMS environment, such as load-balancing, auditing, resource governors, data independence, data integrity, high availability, concurrency control, and quality-of-service will be slow to come…”[6]

We certainly live in interesting times.

Jim Tyson

I am an IT Senior Executive with 30+ years of experience. I have a passion for both human nature and Information Technology. Visit us at www.smdi.com and share your comments on this blog, on LinkedIn, or tweet them to @JimT_SMDI.

 

___________________________________________________________

[1] MapReduce: A major step backwards: 2008. https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html.

[2] No Hadoop: The Future of the Hadoop/HDFS Stack | Intel Science & Technology Center for Big Data: 2015. http://istc-bigdata.org/index.php/no-hadoop-the-future-of-the-hadoophdfs-stack/. Accessed: 2015-06-22.

[3] Google Research Publication: The Google File System: 2015. http://research.google.com/archive/gfs.html. Accessed: 2015-06-22.

[4] Google Research Publication: MapReduce: 2015. http://research.google.com/archive/mapreduce.html. Accessed: 2015-06-22.

[5] Hadoop at a Crossroads?: 2015. http://cacm.acm.org/blogs/blog-cacm/177467-hadoop-at-a-crossroads/fulltext. Accessed: 2015-06-22.

[6] Ibid.
