
It’s part of my job to cover the ecosystem of Hadoop, the open source big data technology, but sometimes it makes my head spin. If this is not your primary job, how can you possibly keep up? I hope that a discussion of what I’ve found to be most important will help those who don’t have the time and energy to devote to this wide-ranging topic.

I was a little late to the party. I first wrote about Hadoop for Ventana Research in 2010. Apache Hadoop was then about four years old and consisted of three modules, three top-level projects and a few subprojects. It didn't reach the version 1.0 designation until a year later, in December 2011. Since then it has continued to evolve at a pace that is always steady and sometimes dizzying. Today the Apache Software Foundation lists four modules and 11 projects on its Hadoop page and a total of 35 projects that fall into the big data category.

The open source model has had a major impact on the big data market, yet in some ways, the open source approach has succeeded despite its shortcomings. For one thing, it is not an ideal business model. Few “pure” open source companies have been able to make a profit. Red Hat is the most notable financial success in the open source world. Hortonworks, one of the Hadoop distribution vendors, strives to be entirely open source but has struggled to make a profit.

Instead, when it comes to commercializing open source technologies, most vendors use a hybrid licensing model that combines open source components with licensed products to create revenue opportunities. So far, this model hasn’t proven to be financially viable either. Cloudera and MapR have chosen a hybrid Hadoop model, but they are private companies that don’t disclose their financials publicly. By some analysts’ estimates Cloudera won’t be profitable until 2018, and MapR has indicated it won’t have a positive cash flow until mid-2017.

The real, if nonmonetary, value of an open source model is that it helps create a large community, one that few organizations could create on their own. Here the Hadoop community is an outstanding example. The Strata+Hadoop World events will take place in five different locations this year, and organizers expect to attract a combined audience of more than 10,000 attendees. The Hadoop Summits will take place in four different cities and also attract thousands of attendees. On the adoption front, nearly half (48%) of the participants in our big data integration benchmark research said they now use Hadoop or plan to use it within 12 months.

A large community such as this one typically spawns more innovation than a small community. This is both the blessing and the curse of the Hadoop ecosystem.

Hadoop constantly changes. New projects are created as the community seeks to improve or extend the existing capabilities. For example, in many cases the MapReduce programming model is being supplemented or replaced by Spark, as I have noted. In its original incarnation, Hadoop was primarily a batch-oriented system, but as it grew in popularity, users started to apply it in real-time scenarios including Internet of Things (IoT) applications, which I've written about. Multiple Apache projects have sprung up to deal with streaming data, including Flink, Kafka, NiFi, Spark Streaming and Storm.
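To make that shift concrete, here is a minimal sketch of the canonical word count computation expressed against the Spark RDD API rather than as a hand-written MapReduce job; the HDFS paths are hypothetical placeholders.

```python
# A minimal PySpark sketch of word count, the computation most often used
# to illustrate MapReduce; Spark expresses it in a few chained operations.
# The HDFS paths below are hypothetical placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

counts = (sc.textFile("hdfs:///data/corpus.txt")   # read lines from HDFS
            .flatMap(lambda line: line.split())    # split lines into words
            .map(lambda word: (word, 1))           # emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))      # sum counts per word

counts.saveAsTextFile("hdfs:///data/word_counts")
sc.stop()
```

The same logic in classic MapReduce requires separate mapper and reducer programs plus job configuration, which is a large part of Spark's appeal.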

Regarding the last capability, all the major Hadoop distribution vendors have adopted some form of streaming data processing. Cloudera uses Spark and is adding Envelope and Kudu for low-latency workloads. Earlier this year, Hortonworks launched its second product, Hortonworks Data Flow, which is based on Kafka, NiFi and Storm for streaming data. MapR introduced MapR Streams to deal with streaming data and IoT applications using the Kafka API. It's clear that Hadoop vendors see a need to support streaming data, but the variety of approaches creates confusion for organizations about which one to use.
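Because MapR Streams implements the Kafka API, a producer written against that API can, in principle, target either system. Here is a minimal sketch using the open source kafka-python client; the broker address, topic name and message fields are illustrative assumptions.

```python
# Minimal sketch of publishing a streaming event via the Kafka API,
# using the open source kafka-python client (pip install kafka-python).
# The broker address, topic name and payload are illustrative assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker1:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish an IoT-style reading; downstream consumers (Storm, Spark
# Streaming and so on) would subscribe to the same topic.
producer.send("sensor-readings", {"device_id": 42, "temperature": 21.5})
producer.flush()
producer.close()
```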

Early Hadoop distributions did not emphasize security and governance. In our research more than half (56%) of organizations said they do not plan to deploy big data integration capabilities because doing so poses security risks or issues. Now those gaps are being addressed. The Apache Knox, Ranger and Sentry projects add security capabilities to Hadoop distributions. Unfortunately, there is not much consistency among vendors on which of these projects they support, again creating confusion about which to use. Two other Apache projects, Atlas and Falcon, are designed to support data governance capabilities. Atlas and Ranger are still in incubation, the Apache process for accepting new projects, but nothing prevents vendors from adopting them at this stage.

So how should your organization deal with all these moving parts? Here's my recipe. First, it is important to have the skilled resources needed to manage big data projects. In our research 44 percent of organizations reported that they don't have the Hadoop-specific skills needed. Those without them should consider hiring or contracting appropriately skilled Hadoop resources. However, some vendors provide packaged Hadoop offerings that reduce the need to have all the skills in house. For instance, there are cloud-based versions of Cloudera, Hortonworks and MapR, and Amazon EMR provides a managed Hadoop framework. Some vendors have recognized the shortage of skills and built businesses around offering big data as a service, including Altiscale and BlueData.
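As one example of the managed approach, an Amazon EMR cluster can be provisioned with a few API calls rather than in-house cluster administration. A minimal sketch with boto3 follows; the instance types, counts, release label and IAM role names are illustrative defaults, not recommendations.

```python
# Minimal sketch of provisioning a managed Hadoop cluster on Amazon EMR
# with boto3. Instance types, counts, release label and role names are
# illustrative assumptions, not recommendations.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-hadoop-cluster",
    ReleaseLabel="emr-4.7.0",             # assumed release label
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",    # default roles EMR can create
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])              # identifier of the new cluster
```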

Analytic database and data warehouse vendors have also attempted to make it easier to access and take advantage of Hadoop. These products typically take the form of SQL capabilities on Hadoop, an appliance configuration that comes installed with Hadoop or a cloud-based service that includes Hadoop. This table summarizes several vendors’ offerings.

[Table: Hadoop ecosystem offerings by vendor]
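The general pattern behind these SQL-on-Hadoop offerings can be sketched with Spark SQL, one open source implementation of the idea: register data stored in HDFS as a table and query it with ordinary SQL. The path, file format and column names below are hypothetical.

```python
# A sketch of the SQL-on-Hadoop pattern using Spark SQL: data stored in
# HDFS is exposed as a table and queried with ordinary SQL. The path,
# format and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-hadoop").getOrCreate()

events = spark.read.parquet("hdfs:///warehouse/events")  # assumed Parquet data
events.createOrReplaceTempView("events")

spark.sql("""
    SELECT product_id, COUNT(*) AS purchases
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY product_id
    ORDER BY purchases DESC
    LIMIT 10
""").show()

spark.stop()
```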

The Open Data Platform initiative (ODPi), an industry consortium, attempts to reduce the skills needed to master different projects and versions within the Hadoop ecosystem by defining specifications for a common set of core Hadoop components. Currently Hortonworks and IBM offer ODPi-compliant versions of their Hadoop distributions, but Cloudera and MapR do not. The specification provides value to those who are looking for stable versions of the core Hadoop components.

The SQL on Hadoop products mentioned above still require that an organization have Hadoop, but it is worth considering whether you need Hadoop at all. Snowflake Computing was founded on the premise that organizations want to take advantage of the SQL skills they already have. This vendor built a cloud-based elastic data warehouse service that can scale and accommodate diverse data types while retaining a SQL interface. This approach may not be far-fetched; our research shows that relational databases are still the most commonly used big data technology.
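The appeal for SQL-skilled teams is that working with such a service looks like working with any relational database. A minimal sketch using Snowflake's Python connector follows; the account, credentials and table queried are placeholders.

```python
# Minimal sketch of querying a cloud data warehouse with plain SQL,
# using Snowflake's Python connector (pip install snowflake-connector-python).
# The account, credentials and table queried are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    user="EXAMPLE_USER",
    password="...",            # placeholder; use proper secret management
    account="example_account",
)
try:
    cur = conn.cursor()
    cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    for region, total in cur:  # cursors are iterable row sources
        print(region, total)
finally:
    conn.close()
```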

To say the least, the Hadoop ecosystem is varied and complex. The large community surrounding big data continues to produce innovations that add to the complexity. While organizations can derive significant value from Hadoop, it does require investment. As your organization considers its investments in big data, determine which approach best suits its requirements and the skills available.

Regards,

David Menninger

SVP & Research Director

Follow me on Twitter @dmenningerVR and connect with me on LinkedIn.

There has been a spate of acquisitions in the data warehousing and business analytics market in recent months. In May 2010 SAP announced an agreement to acquire Sybase, primarily for its mobility technology, while it was already advancing its own efforts with SAP HANA and BI. In July 2010 EMC agreed to acquire data warehouse appliance vendor Greenplum. In September 2010 IBM countered by acquiring Netezza, a competitor of Greenplum. In February 2011 HP, having given up on its original focus with HP Neoview, announced the acquisition of analytics database vendor Vertica. Even Microsoft shipped a new release of its SQL Server database in 2010 and advanced its appliance efforts. Now, less than one month later, Teradata has announced its intent to acquire Aster Data for analytics and data management. Teradata bought an 11% stake in Aster Data in September, so its purchase of the rest of the company shouldn't come as a complete surprise. My colleague had raised the question of whether Aster Data could be the new Teradata; now it is part of Teradata.

All these plays have implications for how enterprises manage and use their fast-growing stores of data. We're living in an era of large-scale data, as I wrote recently. Founded in 1979, Teradata has been a dominant player in this market for years. Teradata was a pioneer in massively parallel processing (MPP) for database systems, which I recently described, a concept behind much of today's analytic database market, including all the recently acquired vendors mentioned above. When I worked at Oracle in the late 1990s, Teradata was the chief competitor when pursuing 1 terabyte (TB) data warehouse opportunities. Yes, managing a single terabyte was considered a significant challenge then, one that few vendors were ready to take on. Although data volumes have grown, little else has changed: Oracle now competes against more providers, despite promoting its second-generation Oracle Exadata appliance and its Oracle 11g Release 2 database at the 2010 Oracle OpenWorld.

Of course, much has changed since then. Over the last few years Aster Data established itself as a player in the data warehousing market with an MPP relational database product. It also embraced the MapReduce parallel processing technology earlier than most other data warehouse vendors. MapReduce is a key concept of the increasingly significant Apache Hadoop project; it appears that Aster Data's proprietary implementation of MapReduce was a significant factor in Teradata's decision to acquire it. The tone of the joint Teradata/Aster Data briefing for analysts even suggests that Teradata is attempting an end run around Hadoop. Aster Data has successfully developed a market niche among customers analyzing unstructured data and social networks, both activities for which one might use Hadoop. The company also had success in other segments, such as financial services, marketing services, retail and e-commerce.
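For readers unfamiliar with the model, MapReduce splits a computation into a map phase that emits key-value pairs and a reduce phase that aggregates them. Below is a minimal sketch of the canonical word count example as Hadoop Streaming scripts; it illustrates the general concept only, not Aster Data's proprietary SQL-MapReduce implementation.

```python
# A sketch of the classic MapReduce word count as two Hadoop Streaming
# scripts. In practice these would be separate files passed to the
# streaming jar as -mapper and -reducer.
import sys


def mapper():
    # Map phase: emit a (word, 1) pair for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))


def reducer():
    # Reduce phase: the framework sorts mapper output by key, so all
    # counts for a given word arrive consecutively and can be summed.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))
```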

Aster Data customers should benefit from the increased resources Teradata can invest in developing the product, and the rich heritage of Teradata in this space should enhance the infrastructure supporting the Aster Data product line. For example, Teradata's workload management tools are among the best in the industry. However, even if the association with Teradata brings some of these capabilities to the Aster Data platform, it's likely that the cost advantages of Aster Data over Teradata will decline over time. Integration into the Teradata organization and technology stack could divert Aster Data from its previous path of innovation, so customers may see a longer time between releases and perhaps a less ambitious product roadmap.

Obviously Teradata has many more customers than Aster Data and is most concerned about them. They, too, might see some negative impact on development schedules, but on the plus side they instantly gain a new source of technology that could benefit them. The question, which will remain unanswered until a roadmap is published, is how quickly Teradata customers can take advantage of the innovations Aster Data has brought to market.

Teradata officials talked about Aster Data retaining some independence in operations, in the product line and perhaps in identity, but integration is the key to making the acquisition valuable to the Teradata customer base. One likely outcome favorable to current and future Teradata customers is more support for industry-standard server hardware, which was specifically mentioned as a benefit of the acquisition. Teradata customers may also benefit from the columnar database capabilities if those capabilities are ported to the main Teradata product line.

There were a couple of notable omissions from the discussion as the acquisition was announced. Both Teradata and Aster Data had partnerships with SAS and Cloudera, the Hadoop vendor. The SAS relationship provided advanced analytics including statistical analyses and data mining embedded within the database. Teradata may be looking to shift to using Aster Data’s embedded analytic capabilities. With respect to Hadoop, although both companies had publicly announced partnerships with Cloudera, neither had offered any deep integration with Hadoop. I expect such an arm’s-length relationship to continue, but I suspect the new combination will put more weight behind the embedded SQL-MR capability that Aster Data developed. Teradata may be large enough to attempt such an independent strategy, but I think it would be a mistake to alienate the entire Hadoop community that I have been researching, which in my opinion is a market not served by Teradata today.

Customers of other data warehousing companies should feel little immediate negative impact from this coupling. In fact, more competition over the analysis of unstructured data could spur those companies to enhance their product offerings. As I noted at the beginning, the market for independent products has shrunk over the last six months, so the remaining vendors could see an increase in revenues as customers who preferred Aster Data or others to the larger companies start to consider alternatives. However, at some point the game of musical chairs in large-scale database acquisitions is going to stop, and when it does, if your vendor is still reliant on venture capital (VC) funding (i.e., does not yet have positive cash flow), you could have a problem. The VCs may decide to stop funding such a company, which could force it to scale back its plans.

Probably some other shoes will drop before the market shift is over. Dell had a partnership with Aster Data that is now called into question; Dell therefore might seek an alternative partner or even acquire one of the other vendors to get into the game. Potential candidates include several we have been assessing: 1010Data, Calpont, Kognitio and ParAccel. I didn't include Infobright in this list because it doesn't have an MPP offering, which seems to be table stakes now. This latest acquisition could also lead to another round of acquisitions based on Hadoop or NoSQL technologies. Or perhaps the game will take a step in the direction of complex event processing or predictive analytics. We'll have to wait and see, although at this rate we may not have to wait long!

Regards,

 David Menninger – VP & Research Director
