It’s part of my job to cover the ecosystem of Hadoop, the open source big data technology, but sometimes it makes my head spin. If this is not your primary job, how can you possibly keep up? I hope that a discussion of what I’ve found to be most important will help those who don’t have the time and energy to devote to this wide-ranging topic.

I was a little late to the party. I first wrote about Hadoop for Ventana Research in 2010. Apache Hadoop was then about four years old and consisted of three modules, three top-level projects and a few subprojects. It didn't reach the version 1.0 designation until a year later, in December 2011. Since then it has continued to evolve at a pace that is always steady and sometimes dizzying. Today the Apache Foundation lists four modules and 11 projects on its Hadoop page, and a total of 35 projects that fall into the big data category.

The open source model has had a major impact on the big data market, yet in some ways, the open source approach has succeeded despite its shortcomings. For one thing, it is not an ideal business model. Few “pure” open source companies have been able to make a profit. Red Hat is the most notable financial success in the open source world. Hortonworks, one of the Hadoop distribution vendors, strives to be entirely open source but has struggled to make a profit.

Instead, when it comes to commercializing open source technologies, most vendors use a hybrid licensing model that combines open source components with licensed products to create revenue opportunities. So far, this model hasn’t proven to be financially viable either. Cloudera and MapR have chosen a hybrid Hadoop model, but they are private companies that don’t disclose their financials publicly. By some analysts’ estimates Cloudera won’t be profitable until 2018, and MapR has indicated it won’t have a positive cash flow until mid-2017.

The real, if nonmonetary, value of an open source model is that it helps create a large community, one that few organizations could create on their own. Here the Hadoop community is an outstanding example. The Strata+Hadoop World events will take place in five different locations this year, and organizers expect to attract a combined audience of more than 10,000 attendees. The Hadoop Summits will take place in four different cities and also attract thousands of attendees. On the adoption front, nearly half (48%) of the participants in our big data integration benchmark research said they now use Hadoop or plan to use it within 12 months.

A large community such as this one typically spawns more innovation than a small community. This is both the blessing and the curse of the Hadoop ecosystem.

Hadoop constantly changes. New projects are created as the community seeks to improve or extend existing capabilities. For example, in many cases the MapReduce programming model is being supplemented or replaced by Spark, as I have noted. In its original incarnation, Hadoop was primarily a batch-oriented system, but as it grew in popularity users started to apply it in real-time scenarios, including Internet of Things (IoT) applications, which I've written about. Multiple Apache projects sprang up to deal with streaming data, including Flink, Kafka, NiFi, Spark Streaming and Storm.
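To make the streaming discussion concrete, here is a minimal sketch of consuming a Kafka topic with Spark Streaming. It assumes a PySpark environment with the Spark-Kafka integration package on the classpath and a broker at localhost:9092; the topic name "sensor-events" is a hypothetical placeholder.

```python
# A minimal Spark Streaming sketch that consumes a Kafka topic.
# Assumes the spark-streaming-kafka integration package is available and a
# Kafka broker is running at localhost:9092; the topic name is hypothetical.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Read directly from Kafka; records arrive as (key, value) pairs.
stream = KafkaUtils.createDirectStream(
    ssc, ["sensor-events"], {"metadata.broker.list": "localhost:9092"})

# Count the messages that arrive in each micro-batch and print the result.
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()
```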

Regarding streaming specifically, all the major Hadoop distribution vendors have adopted some form of streaming data support. Cloudera uses Spark and is adding Envelope and Kudu for low-latency workloads. Earlier this year, Hortonworks launched its second product, Hortonworks DataFlow, which is based on Kafka, NiFi and Storm. MapR introduced MapR Streams to deal with streaming data and IoT applications using the Kafka API. It's clear that Hadoop vendors see a need to support streaming data, but the variety of approaches creates confusion for organizations about which one to use.
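Because MapR Streams exposes the Kafka API, code written against that API can in principle target either system. Below is a minimal producer/consumer sketch using the open source kafka-python client; the broker address and topic name are hypothetical placeholders.

```python
# A minimal sketch of the Kafka API using the kafka-python client.
# The broker address and topic name are hypothetical placeholders.
from kafka import KafkaProducer, KafkaConsumer

# Produce a few messages to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("device-readings", value=("reading %d" % i).encode("utf-8"))
producer.flush()

# Consume the messages back, starting from the earliest offset.
consumer = KafkaConsumer(
    "device-readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000)  # stop iterating if no message arrives for 5s

for message in consumer:
    print(message.offset, message.value.decode("utf-8"))
```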

Early Hadoop distributions did not emphasize security and governance. In our research more than half (56%) of organizations said they do not plan to deploy big data integration capabilities because of security risks or issues. Now those gaps are being addressed. The Apache Knox, Ranger and Sentry projects add security capabilities to Hadoop distributions. Unfortunately, there is not much consistency among vendors in which of these projects they support, again creating confusion about which to use. Two other Apache projects, Atlas and Falcon, are designed to support data governance capabilities. Atlas and Ranger are still in incubation, the Apache process for accepting new projects, but nothing prevents vendors from adopting them at this stage.

So how should your organization deal with all these moving parts? Here's my recipe. First, it is important to have the skilled resources needed to manage big data projects. In our research 44 percent of organizations reported that they don't have the Hadoop-specific skills needed. Those without them should consider hiring or contracting appropriately skilled Hadoop resources. However, some vendors provide packaged Hadoop offerings that reduce the need to have all the skills in house. For instance, there are cloud-based versions of Cloudera, Hortonworks and MapR, and Amazon EMR provides a managed Hadoop framework. Other vendors, including Altiscale and BlueData, recognized the shortage of skills and have built businesses around offering big data as a service.

Analytic database and data warehouse vendors have also attempted to make it easier to access and take advantage of Hadoop. These products typically take the form of SQL capabilities on Hadoop, an appliance configuration that comes installed with Hadoop or a cloud-based service that includes Hadoop. This table summarizes several vendors’ offerings.

[Table: Hadoop ecosystem vendor offerings]
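As a simple illustration of what SQL on Hadoop looks like in practice, here is a sketch using Spark SQL with Hive support enabled. The "sales" table and its columns are hypothetical, and a comparable query could be issued through Hive, Impala or another SQL-on-Hadoop engine.

```python
# A minimal SQL-on-Hadoop sketch using Spark SQL with Hive support.
# The "sales" table and its columns are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sql-on-hadoop-sketch")
         .enableHiveSupport()  # read tables registered in the Hive metastore
         .getOrCreate())

# Standard SQL against data stored in Hadoop, no MapReduce code required.
result = spark.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
    ORDER BY total_sales DESC
""")
result.show()
```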

The Open Data Platform initiative (ODPi), an industry consortium, attempts to reduce the skills needed to master different projects and versions within the Hadoop ecosystem by defining specifications for a common set of core Hadoop components. Currently Hortonworks and IBM offer ODPi-compliant versions of their Hadoop distributions, but Cloudera and MapR do not. The specification provides value to those who are looking for stable versions of the core Hadoop components.

The SQL on Hadoop products mentioned above still require that an organization have Hadoop, but it is worth considering whether you need Hadoop at all. Snowflake Computing was founded on the premise that organizations want to take advantage of the SQL skills they already have. This vendor built a cloud-based elastic data warehouse service that can scale and accommodate diverse data types while retaining a SQL interface. This approach may not be far-fetched; our research shows that relational databases are still the most commonly used big data technology.

To say the least, the Hadoop ecosystem is varied and complex. The large community surrounding big data continues to produce innovations that add to the complexity. While organizations can derive significant value from Hadoop, it does require investment. As your organization considers its investments in big data, determine which approach best suits its requirements and the skills available.

Regards,

David Menninger

SVP & Research Director

Follow me on Twitter @dmenningerVR and connect with me on LinkedIn.

It has been more than five years since James Dixon of Pentaho coined the term “data lake.” His original post suggests, “If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state.” The analogy is a simple one, but in my experience talking with many end users there is still mystery surrounding the concept. In this post I’d like to clarify what a data lake is, review the reasons an organization might consider using one and the challenges they present, and outline some developments in software tools that support data lakes.

Data lakes offer a way to deal with big data. A data lake combines massive storage capacity for any type of data in any format with the processing power to transform and analyze that data. Often data lakes are implemented using Hadoop technology. Raw, detailed data from various sources is loaded into a single consolidated repository to enable analyses that span all the data available to the user.

To understand why data lakes have become popular, it helps to contrast this approach with the enterprise data warehouse (EDW). In some ways an EDW is similar to a data lake: both act as a centralized repository for information from across an organization. However, the data loaded into an EDW is generally summarized, structured data. EDWs are typically based on relational database technologies, which are designed to deal with structured information. And while advances have been made in the scalability of relational databases, they are generally not as scalable as Hadoop, so it is not practical to store all the raw data that comes into the organization; hence the need for summarization. In contrast, a data lake contains the most granular data generated across the organization. That data may be structured information, such as sales transactions, or unstructured information, such as email exchanged in customer service interactions.
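As a rough sketch of the "load the raw detail first, decide later" pattern described above, the following PySpark snippet lands semi-structured records in a Hadoop-based lake without any upfront summarization. The HDFS paths and the arrival_date field are hypothetical.

```python
# A rough sketch of landing raw, semi-structured data in a Hadoop-based
# data lake. The HDFS paths and the arrival_date field are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-landing-sketch").getOrCreate()

# Read the raw records as-is; the schema is inferred, not designed upfront,
# and nothing is summarized or discarded.
raw = spark.read.json("hdfs:///landing/customer_service/emails/*.json")

# Append the full detail to the lake, partitioned by arrival date so later
# analyses can scan only the slices they need.
(raw.write
    .mode("append")
    .partitionBy("arrival_date")
    .parquet("hdfs:///lake/customer_service/emails"))
```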

Hadoop is often used with data lakes because it can store and manage large volumes of both structured and unstructured data for subsequent analytic processing. The advent of Hadoop made it feasible and more affordable to store much larger volumes of information, and organizations began collecting and storing the raw detail from various systems throughout the organization. Hadoop has also become a repository for unstructured information such as social media and semistructured data such as log files. In fact, our benchmark research shows that social media data is the second-most important source of external information used in big data analytics.

In addition to handling larger volumes and more varieties of information, data lakes enable faster access to information as it is generated. Since data is gathered in its raw form, no preprocessing is needed, so information can be added to the data lake as soon as it is generated and collected. This approach has caused some controversy, with many industry analysts and even vendors raising concerns about data lakes turning into data swamps. In general, these concerns center on the lack of governance of the data in a data lake, an appropriate topic here. These collections of data should be governed like any other set of information assets within an organization. The challenge was that most governance tools and technologies had been developed for relational databases and EDWs. In essence, the big data technologies used for data lakes had gotten ahead of themselves, without incorporating all the features needed to support enterprise deployments.

Another, perhaps more minor, controversy centers on terminology. I raise this issue so that, regardless of the terminology a vendor chooses, you can recognize data lakes and be aware of their challenges. Cloudera uses the term Enterprise Data Hub to represent essentially the same concept as a data lake. Hortonworks embraces the data lake terminology, as evidenced in this post. IBM acknowledges the value of data lakes as well as their challenges in this post, but Jim Kobielus, IBM's Big Data Evangelist, questioned the terminology in a more recent post on LinkedIn, and the term "data lake" is not featured prominently on IBM's website.

Despite the controversy and challenges, data lakes continue to grow in popularity. They provide important capabilities for data science. First, they contain the detailed data necessary to perform predictive analytics. Second, they allow efficient access to unstructured data such as social media or other text from customer interactions. For businesses, this information can create a more complete profile of customers and their behavior. Data lakes also make data available sooner than it might be in a conventional EDW architecture. Our data and analytics in the cloud benchmark research shows that one in five (21%) organizations integrate their data in real time. The research also shows that those who integrate their data more often are more satisfied and more confident in their results. Granted, a data lake contains raw information that may require more analysis or manipulation since it has not yet been cleansed, but time is money, and faster access can often lead to new revenue opportunities. Half the participants in our predictive analytics benchmark research said they have created new revenue opportunities with their analytics.

Cognizant of the lack of governance and management tools, some organizations hesitated to adopt data lakes while others went ahead. Vendors in this space have advanced their capabilities in the meantime. Some, such as Informatica, are bringing data governance capabilities from the EDW world to data lakes. I wrote about the most recent release of Informatica's big data capabilities, which it calls Intelligent Data Lake. Other vendors are bringing their EDW capabilities to data lakes as well. Information Builders and Teradata both made data lake announcements this spring. In addition, a new category of vendors is emerging focused specifically on data lakes. Podium Data says it provides an "enterprise data lake management platform," Zaloni calls itself "the data lake company," and Waterline Data draws its name "from the metaphor of a data lake where the data is hidden below the waterline."

Is it safe to jump in? Well, just like you shouldn’t jump into a lake without knowing how to swim, you shouldn’t jump into a data lake without plans for managing and governing the information in it. Data lakes can provide unique opportunities to take advantage of big data and create new revenue opportunities. With the right tools and training, it might be worth testing the water.

Regards,

David Menninger

SVP & Research Director
