The Noisy Channel


DeWitt and Stonebraker vs. MapReduce, Round 2

April 15th, 2009 · 6 Comments · General

A few months ago, database titans David J. DeWitt and Michael Stonebraker wrote a polemic entitled “MapReduce: A major step backwards” that received a lot of attention, including responses like “Relational Database Experts Jump The MapReduce Shark“. Those unfamiliar with MapReduce might want to take a look at the Wikipedia entry.

Well, they’re at it again. As Eric Lai reports in Computerworld, DeWitt and Stonebraker have written a paper with Daniel J. Abadi, Samuel Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin entitled “A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks“. The authors are a who’s who of database researchers, and the paper will appear in SIGMOD Record.

Zenobia Godschalk of Aster Data, a database that integrates SQL with MapReduce, has already chimed in: “why wouldn’t you use both SQL AND MapReduce?”

But this is no time for post-partisanship. Not when the database guys are clearly looking for a fight. As Lai suggests, this paper may be a response to Google’s announcement last November that it used MapReduce to sort a terabyte of data in just 68 seconds. Unfortunately, it seems likely that people will eventually reach the obvious conclusion that different approaches are better suited to different tasks. But hopefully we’ll see some nice sparks fly in the meantime.

6 responses so far ↓

  • 1 John B. Lee // Apr 15, 2009 at 1:31 am

    I’m amused by this debate, particularly since I’ve sort of gone down this path recently. I tried Hadoop on a tiny cluster with relatively small data and just did not get performance that was any better than more traditional techniques. I think my data and cluster are too small.

    Like most things, “it depends” seems to be the answer.

    I still want to build a distributed database system that has MapReduce as its core and simply “compiles” SQL (or whatever *query* language you want) into a MapReduce job.
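    (To make the idea concrete: here is a minimal sketch, my own illustration rather than anything from an actual SQL-on-MapReduce system, of what “compiling” a simple aggregate query into a MapReduce job might look like. A query like `SELECT word, COUNT(*) FROM docs GROUP BY word` becomes a map phase emitting `(key, 1)` pairs, a shuffle grouping them by key, and a reduce phase summing each group.)

    ```python
    # Toy sketch: hand-compiling SELECT word, COUNT(*) FROM docs GROUP BY word
    # into map / shuffle / reduce phases. All names here are illustrative.
    from collections import defaultdict

    def map_phase(records):
        """Emit (group_key, 1) for each record -- the mapper a
        GROUP BY ... COUNT(*) query would compile to."""
        for record in records:
            yield record, 1

    def shuffle(pairs):
        """Group intermediate pairs by key (the framework's shuffle step)."""
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        """Sum the values for each key, i.e. COUNT(*) per group."""
        return {key: sum(values) for key, values in groups.items()}

    docs = ["to", "be", "or", "not", "to", "be"]
    result = reduce_phase(shuffle(map_phase(docs)))
    print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
    ```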

  • 2 Mark Reid // Apr 15, 2009 at 1:37 am

    Just a small correction: The Google blog post you refer to says they sorted a petabyte of data in just over 6 hours. They were able to sort a terabyte in 68 seconds.

  • 3 Daniel Tunkelang // Apr 15, 2009 at 1:44 am

    Mark, corrected–thanks!

  • 4 Albert Sheu // Apr 15, 2009 at 3:30 am

    You should also totally check out Hive (http://wiki.apache.org/hadoop/Hive), which is what we use really extensively at Facebook right now. It seems pretty similar to what Aster is doing — it’s an implementation of the Hive Query Language HQL over an HDFS data store plus table metadata. Language is pretty young and the tutorial needs some work, but it’s succinct and damn fast; a query that normally took me 4 hours over the MySQL tier took under 20 minutes using Hive. However, its true beauty really only shows over huge, terabyte-order datasets, which is why most people aren’t going to see the benefit in it until they reach that big of a data need.

  • 5 Ashutosh // May 16, 2012 at 2:06 pm

    You were able to bring down the execution time of a query from 4 hours to 20 minutes… Care to elaborate? Your schema, indexes, hardware, etc.?

  • 6 Daniel Tunkelang // May 16, 2012 at 2:42 pm

    Ashutosh, you might want to reach out to Albert directly — I doubt he’s following this comment thread 3 years and 2 jobs later.
