Earlier this week EMC announced it will create its own distribution for Apache Hadoop. Hadoop provides distributed computing capabilities that enable organizations to process very large amounts of data quickly. As I have written previously, the Hadoop market continues to grow and evolve. In fact, the rate of change may be accelerating. Let’s start with what EMC announced and then I’ll address what the announcement means for the market.
EMC announced three new offerings, slated for the third quarter of 2011, that leverage its acquisition of Greenplum last year, ranging from an open source version to incorporation in its data warehouse appliance.
The EMC Greenplum HD Community Edition is a free, open source version of the Apache Hadoop stack comprising HDFS, MapReduce, Zookeeper, Hive and HBase. EMC extends Hadoop with fault tolerance for the Name Node and Job Tracker, both of which are well-known points of failure in standard Hadoop implementations.
The EMC Greenplum HD Enterprise Edition, interface-compatible with the Apache Hadoop stack, provides several additional features including snapshots, wide-area replication, a Network File System (NFS) interface and some management tools. EMC also claims performance increases of two to five times the performance over standard packaged versions of Apache Hadoop.
The EMC Greenplum HD Data Computing Appliance integrates Apache Hadoop with the Greenplum database and computing hardware. The appliance configuration provides SQL access and analytics to Hadoop data residing on the Hadoop Distributed File System (HDFS) as external tables, eliminating the need to materialize the data in the Greenplum database.
Until now Cloudera has dominated the emerging commercial Hadoop market and faced little or no competition since it introduced the Cloudera Distribution for Hadoop (CDH). The EMC announcements are both good and bad news for Cloudera. On the one hand they suggest – you might even say validate – that Cloudera has chosen a valuable market. EMC seems to be willing to invest heavily to try to get a share of it. On the other hand, Cloudera now faces a competitor that has significant resources. For customers competition is generally a good thing, of course, as it pushes vendors to innovate and improve their products to win more business.
EMC’s approach to the market differs dramatically from IBM’s strategy. IBM announced on Twitter at its Big Data Symposium held this week that it is putting all its weight behind Apache Hadoop in the hope of avoiding the fragmentation that plagued the UNIX market for years. EMC’s Enterprise Edition promises to tackle issues well known to the Hadoop market, but EMC faces competition from others who are also tackling these issues. If lower-cost or free competitive offerings adequately address these issues it could seriously undercut the market for EMC’s Enterprise Edition. While EMC brings more enterprise credentials to the Hadoop market than Cloudera, it has less experience with Hadoop. Multiple vendors are attempting to bring enterprise class capabilities to Hadoop, and it’s too soon to see who will succeed. However, overall, the Hadoop market will benefit from all the attention and investment.
I find it interesting and a little ironic that prior to its acquisition by EMC, Greenplum (along with Aster Data, now part of Teradata) helped popularize MapReduce, one of Hadoop’s most commonly used components, by embedding MapReduce as part of its databases. These proprietary implementations could be credited with helping to bring Hadoop into the mainstream big-data market because they combined data warehousing with MapReduce. It spawned a debate in which database guru Mike Stonebraker at first dismissed MapReduce and then embraced it. The debate attracted attention, a key ingredient in building any new market. Now EMC Greenplum completes the circle by embracing Hadoop.
To its credit, EMC aligned a dozen partners around these announcements, creating an ecosystem of third-party products and services. Concurrent, CSC, Datameer, Informatica, Jaspersoft, Karmasphere, MicroStrategy, Pentaho, SAS, SnapLogic, Talend and VMware all announced their support for the EMC products in one form or another. Most of these companies also partner with Cloudera, so this is a good move but not a coup for EMC.
The Hadoop market continues to evolve. We are now analyzing the data collected in our benchmark research on the state of the large-scale or now called the big data market, including Hadoop. Stay tuned for the results. It will be interesting to see where the market ends up. I expect more changes and innovation driven in part by the increased competition.
The Hadoop market is no longer a one-elephant race.
David Menninger – VP & Research Director