It has been more than five years since James Dixon of Pentaho coined the term “data lake.” His original post suggests, “If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state.” The analogy is a simple one, but in my experience talking with many end users there is still mystery surrounding the concept. In this post I’d like to clarify what a data lake is, review the reasons an organization might consider using one and the challenges they present, and outline some developments in software tools that support data lakes.

Data lakes offer a way to deal with big data. A data lake combines massive storage capabilities for any type of data in any format with the processing power to transform and analyze the data. Often data lakes are implemented using Hadoop technology. Raw, detailed data from various sources is loaded into a single consolidated repository to enable analyses that look across any data available to the user. To understand why data lakes have become popular it’s helpful to contrast this approach with the enterprise data warehouse (EDW). In some ways an EDW is similar to a data lake. Both act as a centralized repository for information from across an organization. However, the data loaded into an EDW is generally summarized, structured data. EDWs are typically based on relational database technologies, which are designed to deal with structured information. And while advances have been made in the scalability of relational databases, they are generally not as scalable as Hadoop. Because these technologies are not as scalable, it is not practical to store all the raw data that comes into the organization. Hence there is a need for summarization. In contrast, a data lake contains the most granular data generated across the organization. The data may be structured information, such as sales transaction data, or unstructured information, such as email exchanged in customer service interactions.
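To make the contrast between summarized and granular data concrete, here is a minimal Python sketch. The transaction records, field names and the `summarize` helper are all hypothetical, invented for illustration; they do not come from any particular EDW or Hadoop product. The sketch shows that once data has been rolled up along one dimension, questions along another dimension can only be answered by going back to the raw detail.

```python
from collections import defaultdict

# Hypothetical raw sales transactions, as they might land in a data lake.
raw_transactions = [
    {"region": "East", "product": "A", "amount": 120.0},
    {"region": "East", "product": "B", "amount": 80.0},
    {"region": "West", "product": "A", "amount": 50.0},
]

def summarize(rows, key):
    """Total the 'amount' field of each row, grouped by the given key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)

# A warehouse-style summary: only regional totals are retained.
warehouse_summary = summarize(raw_transactions, "region")

# A new question (sales by product) cannot be derived from the regional
# totals alone; it needs the granular records kept in the lake.
by_product = summarize(raw_transactions, "product")

print(warehouse_summary)
print(by_product)
```

The point of the sketch is only the asymmetry: the regional summary is smaller and easier to consume, but the product-level view is recoverable only because the detail was kept.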

Hadoop is often used with data lakes because it can store and manage large volumes of both structured and unstructured data for subsequent analytic processing. The advent of Hadoop made it feasible and more affordable to store much larger volumes of information, and organizations began collecting and storing the raw detail from various systems throughout the organization. Hadoop has also become a repository for unstructured information such as social media and semistructured data such as log files. In fact, our benchmark research shows that social media data is the second-most important source of external information used in big data analytics.

In addition to handling larger volumes and more varieties of information, data lakes enable faster access to information as it is generated. Since data is gathered in its raw form, no preprocessing is needed. Therefore, information can be added to the data lake as soon as it is generated and collected. This approach has caused some controversy, with many industry analysts and even vendors raising concerns about data lakes turning into data swamps. In general, these concerns center on the lack of governance of the data in a data lake, which is a valid topic to raise. These collections of data should be governed like any other set of information assets within an organization. The challenge was that most of the governance tools and technologies had been developed for relational databases and EDWs. In essence, the big data technologies used for data lakes had gotten ahead of themselves, without incorporating all the features needed to support enterprise deployments.

Another, perhaps more minor, controversy centers on terminology. I raise this issue so that, regardless of the terminology a vendor chooses, you can recognize data lakes and be aware of the challenges. Cloudera uses the term Enterprise Data Hub to represent essentially the same concept as a data lake. Hortonworks embraces the data lake terminology as evidenced in this post. IBM acknowledges the value of data lakes as well as their challenges in this post, but Jim Kobielus, IBM’s Big Data Evangelist, questioned the terminology in a more recent post on LinkedIn, and the term “data lake” is not featured prominently on IBM’s website.

Despite the controversy and challenges, data lakes are continuing to grow in popularity. They provide important capabilities for data science. First, they contain the detailed data necessary to perform predictive analytics. Second, they allow efficient access to unstructured data such as social media or other text from customer interactions. For businesses, this information can create a more complete profile of customers and their behavior. Data lakes also make data available sooner than it might be available in a conventional EDW architecture. Our data and analytics in the cloud benchmark research shows that one in five (21%) organizations are integrating their data in real time. The research also shows that those who integrate their data more often are more satisfied and more confident in their results. Granted, a data lake contains raw information, and it may require more analysis or manipulation since the data is not yet cleansed, but time is money and faster access can often lead to new revenue opportunities. Half the participants in our predictive analytics benchmark research said they have created new revenue opportunities with their analytics.

Cognizant of the lack of governance and management tools, some organizations hesitated to adopt data lakes, while others went ahead. Vendors in this space have advanced their capabilities in the meantime. Some, such as Informatica, are bringing data governance capabilities from the EDW world to data lakes. I wrote about the most recent release of Informatica’s big data capabilities, which it calls Intelligent Data Lake. Other vendors are bringing their EDW capabilities to data lakes as well. Information Builders and Teradata both made data lake announcements this spring. In addition, a new category of vendors is emerging focused specifically on data lakes. Podium Data says it provides an “enterprise data lake management platform,” Zaloni calls itself “the data lake company,” and Waterline Data draws its name “from the metaphor of a data lake where the data is hidden below the waterline.”

Is it safe to jump in? Well, just like you shouldn’t jump into a lake without knowing how to swim, you shouldn’t jump into a data lake without plans for managing and governing the information in it. Data lakes can provide unique opportunities to take advantage of big data and create new revenue opportunities. With the right tools and training, it might be worth testing the water.

Regards,

David Menninger

SVP & Research Director

On Monday, March 21, Informatica, a vendor of information management software, announced Big Data Management version 10.1. My colleague Mark Smith covered the introduction of v. 10.0 late last year, along with Informatica’s expansion from data integration to broader data management. Informatica’s Big Data Management 10.1 release offers new capabilities, including self-service data preparation for Hadoop, currently a hot topic, which Informatica is calling Intelligent Data Lake. The term “data lake” describes large collections of detailed data from across an organization, often stored in Hadoop. With this release Informatica seeks to add more enterprise capabilities to data lake implementations.

This is the latest step in Informatica’s big data efforts. The company has been investing in Hadoop for five years, and I covered some of its early efforts. The Hadoop market has been evolving over that time, growing in popularity and maturing in terms of information management and data governance requirements. Our big data benchmark research has shown increases of more than 50 percent in the use of Hadoop, with our big data analytics research showing 37 percent of participants using it in production. Building on decades of experience in providing technology to integrate and manage data in data marts and data warehouses, Informatica has been extending these capabilities to the big data market and Hadoop specifically.

The Intelligent Data Lake capabilities are the most significant features of version 10.1. They include self-service data preparation, automation of some data integration tasks, and collaboration features to share information among those working with the data. The concept of self-service data preparation has become popular of late. Our big data analytics research shows that preparing data for analysis and reviewing it for quality and consistency are the two most time-consuming tasks, so making data preparation easier and faster would benefit most organizations. Recognizing this market opportunity, several vendors are competing in this space; Informatica’s offering is called REV. With version 10.1 the Big Data Management product will have similar capabilities, including a familiar spreadsheet-style interface for working with and blending data as it is loaded into the target system. However, the REV capabilities available as part of Informatica’s cloud offering are separate from those in Big Data Management 10.1. They require separate licenses, and there is no upgrade path or option; as a result, sharing work between the two environments is limited. Informatica faces two challenges with self-service: how users view its self-service capabilities and user interface compared with those of its competitors, and whether analysts and data scientists will be inclined to use Informatica’s products, since they are mostly targeted at the data preparation process rather than the analytic process.

The collaborative capabilities of 10.1 should help organizations with their information management processes. Our most recent findings on collaboration come from our data and analytics in the cloud research, which shows that only 30 percent of participants are satisfied with their collaborative capabilities. The new release enables those who are working with the data to tag it with comments about what they found valuable or not, note issues with data quality and point others toward useful transformations they have performed. This type of information sharing can help reduce some of the time spent on data preparation. Ideally these collaboration capabilities could be surfaced all the way through the business intelligence and analytic process, but Informatica would have to do that through its technology partners since it does not offer products in those markets.

Version 10.1 includes other enhancements. The company has made additional investments in its use of Apache Spark both for performance purposes and for its machine-learning capabilities. I recently wrote about Spark and its rise in adoption. More transformations are now implemented in Spark rather than in Hadoop’s MapReduce, a change that Informatica claims speeds up processing by up to 500 percent. It also uses Spark to speed up the matching and linking processes in its master data management functions.

I should note that although Informatica is adopting these open source technologies, its product is not open source. Much of big data development is driven by the open source community, and that presents an obstacle to Informatica. Our next-generation predictive analytics research shows that Apache Hadoop is the most popular distribution, with 41 percent of organizations using or planning to use this distribution. Informatica itself does not provide a distribution of Hadoop but partners with vendors that do. Whether Informatica can win over a significant portion of the open source community remains a question. Whether it has to is another. In positioning release 10.1 the company describes the big data use cases as arising alongside conventional data warehouse and business intelligence use cases.

This release includes a “live data map” that monitors data landing in Hadoop (or other targets). The live data map infers the data format (such as social security numbers, dates and schemas) and creates a searchable index of the types of data it has catalogued; this enables organizations to easily identify, for instance, all the places where personally identifiable information (PII) is stored. They can use this information to ensure that the appropriate governance policies are applied to this data. Informatica has also enhanced its security capabilities in Big Data. Its Secure@Source product, which won an Innovation Award from Ventana Research last year, provides enterprise visibility and advanced analytics on sensitive data threats. The latest version adds support for Apache Hive tables and Salesforce data. Thus for applications that require these capabilities a more secure environment is available.
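As a rough illustration of the general idea of inferring data types and indexing where they occur, here is a minimal Python sketch. It is not Informatica’s implementation or API; the patterns, the 80 percent threshold, the function names and the sample datasets are all assumptions made up for this example, and a real product would use far richer detection logic.

```python
import re
from collections import defaultdict

# Hypothetical patterns, chosen only for illustration.
PATTERNS = {
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}

def infer_type(values, threshold=0.8):
    """Return the first label whose pattern fully matches at least
    `threshold` of the non-empty values, or None if nothing qualifies."""
    values = [v for v in values if v]
    if not values:
        return None
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if hits / len(values) >= threshold:
            return label
    return None

def build_index(datasets):
    """Map each inferred type to the (dataset, column) pairs where it occurs."""
    index = defaultdict(list)
    for dataset_name, columns in datasets.items():
        for column_name, values in columns.items():
            label = infer_type(values)
            if label is not None:
                index[label].append((dataset_name, column_name))
    return dict(index)

# Hypothetical sample of data that has landed in the lake.
datasets = {
    "crm_customers": {
        "tax_id": ["123-45-6789", "987-65-4321"],
        "signup": ["2016-03-21", "2015-11-02"],
    },
    "web_logs": {
        "message": ["GET /index.html", "POST /login"],
    },
}

index = build_index(datasets)
# Look up every location that appears to hold social security numbers.
pii_locations = index.get("ssn", [])
print(pii_locations)
```

With an index like this, a governance team could look up a sensitive type once and get back every dataset and column that appears to contain it, which is the kind of search the paragraph above describes.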

The product announcement was timed to coincide with the Strata Hadoop conference, a well-attended industry event that many vendors use to gain maximum visibility for such announcements. However, availability of the product release is planned for the second quarter of 2016. As an organization matures in its use of Hadoop, it will need to apply proper data management and governance practices. With version 10.1 Informatica is one of the vendors to consider in meeting those needs.

Regards,

David Menninger

SVP & Research Director
