A data lake is a centralized repository designed to house big data in structured, semi-structured and unstructured form. I have been covering the data lake topic for several years and encourage you to check out an earlier perspective called Data Lakes: Safe Way to Swim in Big Data? for background. Our data lake research has uncovered some points to consider in your efforts, and I’d like to offer a deeper dive into our findings.
Data virtualization is not new, but it has changed over the years. The term describes a process of combining data on the fly from multiple sources rather than copying that data into a common repository such as a data warehouse or a data lake, which I have written about. There are many reasons for an organization concerned with managing its data to consider data virtualization, most stemming from the fact that the data does not have to be copied to a new location. It could, for instance, eliminate the cost of building and maintaining a copy of one of the organization’s big data sources. Recognizing these benefits, many database and data integration companies offer data virtualization products. Denodo, one of the few independent, best-of-breed vendors in this market today, brings these capabilities to big data sources and data lakes.
It has been more than five years since James Dixon of Pentaho coined the term “data lake.” His original post suggests, “If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state.” The analogy is a simple one, but in my experience talking with many end users there is still mystery surrounding the concept. In this post I’d like to clarify what a data lake is, review the reasons an organization might consider using one and the challenges they present, and outline some developments in software tools that support data lakes.
Topics: Big Data, Data Science, Predictive Analytics, Social Media, Business Analytics, Business Intelligence, Data Governance, Data Lake, Governance, Risk & Compliance (GRC), Information Management, Uncategorized, Strata+Hadoop