A data lake is a centralized repository designed to house big data in structured, semi-structured and unstructured form. I have been covering the data lake topic for several years and encourage you to check out an earlier perspective called Data Lakes: Safe Way to Swim in Big Data? for background. Our data lake research has uncovered some points to consider in your efforts, and I’d like to offer a deeper dive into our findings.
Planning and deploying data lakes can be time-consuming and complicated. And while working with data lakes carries complications and risks, it also offers significant benefits, such as creating a competitive advantage and helping lower operational costs. You can learn more about data lake benefits and challenges in this dedicated perspective on the confluence with data warehouses. Our research shows that data lakes require some commitment and determination: organizations that have used data lakes for at least two years are more satisfied with their deployment results than those that have used them for less than two years.
One of the reasons data lakes require time and investment is that most successful implementations rely on a big data framework designed to store and process huge datasets. Hadoop was popular in the early days of data lakes and is still widely deployed by organizations. It has been successfully used to scale up to thousands of nodes, and to store and process petabytes of structured and unstructured data. More recently, organizations have turned to cloud-based and on-premises object storage as alternatives to HDFS (the Hadoop Distributed File System). Object storage is generally inexpensive and allows storage and compute to be separated, which provides the scalability data lakes require.
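The separation of storage and compute can be sketched in miniature. The object store below is a stand-in for a service such as Amazon S3 (the class and its methods are hypothetical, not any real storage API): a flat key/value store that can grow cheaply, while the compute function reads objects on demand and can be scaled out independently by adding workers.

```python
# Illustrative sketch only: ObjectStore stands in for a real object-storage
# service; the names and methods here are hypothetical.

class ObjectStore:
    """Flat key/value storage layer: cheap, durable, scaled on its own."""
    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list(self, prefix: str = "") -> list:
        return [k for k in self._objects if k.startswith(prefix)]


def total_revenue(store: ObjectStore, prefix: str) -> float:
    """Compute layer: fetches CSV objects as needed. More workers could
    run functions like this in parallel without changing the storage."""
    total = 0.0
    for key in store.list(prefix):
        for line in store.get(key).decode().splitlines()[1:]:  # skip header
            _, amount = line.split(",")
            total += float(amount)
    return total


store = ObjectStore()
store.put("sales/2024-q1.csv", b"order,amount\n1,100.0\n2,250.5\n")
store.put("sales/2024-q2.csv", b"order,amount\n3,49.5\n")
print(total_revenue(store, "sales/"))  # 400.0
```

Because the compute function holds no data of its own, storage capacity and processing power can be grown (or shrunk) independently, which is the property that makes object storage attractive for data lakes.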
Whether organizations used Hadoop or object storage, our research shows greater confidence and satisfaction than among organizations relying on data lakes built on relational databases or other alternatives. We found that 60% of those using Hadoop or object storage were satisfied or somewhat satisfied with their organization’s ability to analyze big data, compared to only 18% of those who were not; similarly, 51% were confident or very confident, compared to only 27%.
Organizations are continuously growing their data lakes, collecting data from various sources, teams and departments. But collecting data is only half of the equation. Many organizations can’t take full advantage of their data lakes because they don’t know what data actually exists. The data needs to be cataloged so that it can be called up for on-demand analysis by analysts and data scientists. Catalogs can help provide context—where the data originated, when it was last updated and how it can and should be used. Data catalogs also improve governance and reduce the risk of errors by indicating which sources are approved and certified. Perhaps the most fundamental benefit of a data catalog is that it makes it easier to search for and find the information needed for an analysis. Our research found that data catalog users are significantly more satisfied (68% compared to 39% who are not using data catalogs) in their organization’s ability to analyze big data.
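The metadata a catalog captures can be pictured as a simple record per dataset. The sketch below is a minimal, hypothetical catalog (the fields and class names are illustrative, not any vendor's schema): each entry records origin, freshness and certification status, and search can be limited to approved sources.

```python
# Minimal data-catalog sketch; fields and names are hypothetical.
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    name: str
    description: str
    source: str            # where the data originated
    last_updated: str      # ISO date of the last refresh
    certified: bool = False  # approved for governed use?

class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str, certified_only: bool = False) -> list:
        """Match on name or description; optionally keep only
        certified sources to reduce the risk of using bad data."""
        term = term.lower()
        return [e.name for e in self._entries.values()
                if (term in e.name.lower() or term in e.description.lower())
                and (e.certified or not certified_only)]

catalog = DataCatalog()
catalog.register(CatalogEntry("sales_orders", "Order-level sales data",
                              "ERP export", "2020-05-01", certified=True))
catalog.register(CatalogEntry("sales_scratch", "Ad hoc sales extract",
                              "analyst upload", "2020-03-15"))
print(catalog.search("sales"))                       # both datasets
print(catalog.search("sales", certified_only=True))  # ['sales_orders']
```

Even this toy version shows the two benefits discussed above: analysts can find data by searching, and governance improves because certified sources are distinguishable from ad hoc ones.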
Organizations face a multitude of challenges with data lakes, such as replicating data, data security and data governance, including regulations such as GDPR that restrict where data can be located. Data virtualization can be used to solve some of these challenges by accessing data in place as it is needed rather than moving or copying the data to another location. In our research, we found that nearly one-quarter (24%) of organizations already include data virtualization in their data lake implementations. Additionally, nearly half (47%) of organizations are planning to include the capability at some point in the future.
Data virtualization can integrate various data sources across multiple data types and locations, providing end users with a single logical layer through which data can be easily accessed and shared across the organization. It also offers a way to unify data governance and security controls. Our research shows a significant difference in satisfaction between organizations using data virtualization and those that are not: more than three-quarters (79%) of organizations using data virtualization are satisfied with their data lake deployments, compared to only 36% of those that are not.
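The "single logical layer" idea can be sketched as follows (a toy illustration with hypothetical names, not any data virtualization product's API): each source keeps its data in place and exposes only a fetch function, and the virtual view unions matching rows at query time instead of copying them into a central store.

```python
# Toy data-virtualization sketch; all names are hypothetical.

class VirtualView:
    """A single logical layer over sources that each keep data in place."""
    def __init__(self):
        self._sources = {}   # source name -> zero-argument fetch function

    def register(self, name, fetch):
        self._sources[name] = fetch

    def query(self, predicate):
        """Pull matching rows from every source on demand; no rows are
        replicated into the view itself. Governance and security checks
        could be enforced here, at the single point of access."""
        return [row for fetch in self._sources.values()
                for row in fetch() if predicate(row)]

# Two "remote" systems, each retaining its own data where it lives.
crm_rows = [{"customer": "Acme", "region": "EU"},
            {"customer": "Globex", "region": "US"}]
warehouse_rows = [{"customer": "Initech", "region": "EU"}]

view = VirtualView()
view.register("crm", lambda: crm_rows)
view.register("warehouse", lambda: warehouse_rows)

eu = view.query(lambda r: r["region"] == "EU")
print([r["customer"] for r in eu])  # ['Acme', 'Initech']
```

Because the EU rows are read in place rather than replicated, this access pattern also illustrates how virtualization can help with data-location rules such as those in GDPR: the data need not be copied out of its home system to be queried.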
Armed with this information, your organization will be better prepared to take a deeper dive into data lakes. Adopting a big data framework may not be as easy as trying to use your existing platform, but the research shows it can be worth the effort. Think about adopting a data catalog, if you haven’t already, to make it easier for your analysts to find and take advantage of the data you have collected. And consider the role of data virtualization in your data lake implementation for assembling and governing large amounts of data from various parts of the organization.
Ventana Research is currently conducting Data Lake Dynamic Insights research. We invite you to take the survey and contribute your organization’s perspective on data lakes and gain immediate guidance on improving your efforts. Thank you for your consideration!