David Menninger's Analyst Perspectives

Databricks Lakehouse Platform Streamlines Big Data Processing

Written by David Menninger | Oct 26, 2021 10:00:00 AM

Databricks is a data engineering and analytics cloud platform built on Apache Spark that processes and transforms huge volumes of data and supports data exploration through machine learning models. It can enable data engineers, data scientists, analysts and other workers to process big data and unify analytics through a single interface. The platform supports streaming data, SQL queries, graph processing and machine learning. It also offers a collaborative user interface, the workspace, where workers can create data pipelines in multiple languages, including Python, R, Scala and SQL, and prototype and train machine learning models.
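As a rough illustration, a simple pipeline step written in one of those workspace notebooks might look like the following PySpark sketch. The source path, column names and target table are hypothetical; in a Databricks notebook the SparkSession is already available as spark, so the explicit builder call is only there to keep the snippet self-contained.

```python
# Minimal sketch of a notebook-style pipeline step; paths and names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("events-pipeline").getOrCreate()

# Read raw event data, derive a daily aggregate and persist it as a table.
events = spark.read.json("/data/raw/events/")            # hypothetical source path
daily = (
    events
    .withColumn("event_date", F.to_date("event_timestamp"))
    .groupBy("event_date", "event_type")
    .count()
)
daily.write.mode("overwrite").saveAsTable("daily_event_counts")   # hypothetical target table
```

The same transformation could equally be written in SQL, Scala or R within the same workspace, which is part of the collaboration story.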

Databricks recently closed its Series H funding round of $1.6 billion, reaching a post-money valuation of $38 billion. With this round, Databricks has raised a total of nearly $3.6 billion. The company intends to use the funds to enter new markets and grow its partner ecosystem.

Databricks Lakehouse Platform is the company's flagship product, combining aspects of data warehouse and data lake systems in a unified platform. Business workers can store both structured and unstructured data in the platform and use it for analytics workloads and data science. The Lakehouse also includes capabilities such as schema enforcement, auditing, versioning and access controls. Databricks Lakehouse is an example of the emerging data platforms we have written about previously.
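To make two of those capabilities concrete, here is a minimal sketch of schema enforcement and versioning with the Delta Lake format that underpins the Lakehouse. The table path and sample data are hypothetical, and the snippet assumes an environment where the Delta Lake libraries are already configured, as they are in Databricks.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Write an initial version of a Delta table (path is hypothetical).
customers = spark.createDataFrame(
    [(1, "Acme"), (2, "Globex")], ["customer_id", "name"]
)
customers.write.format("delta").mode("overwrite").save("/delta/customers")

# Schema enforcement: appending rows with a mismatched schema is rejected
# unless schema evolution is explicitly requested with mergeSchema.
more = spark.createDataFrame(
    [(3, "Initech", "US")], ["customer_id", "name", "country"]
)
more.write.format("delta").mode("append") \
    .option("mergeSchema", "true").save("/delta/customers")

# Versioning ("time travel"): read the table as it looked at an earlier version,
# which supports auditing and rollback scenarios.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/delta/customers")
```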

Organizations are collecting large amounts of data from many different sources, and storing this big data, which can take any form (images, audio files and other unstructured data), becomes challenging and requires a different architectural approach. We assert that by 2025, three-quarters of organizations will require unstructured data capabilities in their data lakes to maximize the value of audio, video and image data. Databricks enables workers to query the data lake with SQL, build data sets to generate machine learning models, and create automated extract, transform and load pipelines and visual dashboards.
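For example, querying files that already sit in the lake can be as simple as registering them as a view and running SQL against it; the lake path and column names below are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-sql").getOrCreate()

# Expose raw files in the data lake as a temporary view, then query them with SQL.
orders = spark.read.parquet("/data/lake/orders/")        # hypothetical lake path
orders.createOrReplaceTempView("orders")

top_products = spark.sql("""
    SELECT product_id, SUM(quantity) AS units_sold
    FROM orders
    GROUP BY product_id
    ORDER BY units_sold DESC
    LIMIT 10
""")
top_products.show()
```

The resulting data set could feed a dashboard or serve as training data for a machine learning model.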

Earlier this year, Databricks also introduced Delta Sharing, which is included in the open-source Delta Lake 1.0 project. It establishes a common standard for sharing all data types, structured and unstructured, through an open protocol that can be used from SQL, visual analytics tools and programming languages such as Python and R. Large-scale datasets can be shared in the Apache Parquet and Delta Lake formats in real time without copying.

Organizations are using a multitude of systems, which introduces complexity and, more importantly, delay, as workers invariably need to move or copy data between different systems. Teams must grapple with data silos that prevent a single source of truth, the expense of maintaining complicated data pipelines and reduced decision-making speed. Using a unified platform such as Databricks allows traditional analytics, data science and machine learning to coexist in the same system. With Delta Sharing, workers can connect to the shared data through pandas, Tableau or other systems that implement the open protocol, without having to deploy a specific platform first. This can reduce the access time and work for data providers.
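As a sketch of what that looks like from the consumer side, the open-source delta-sharing Python client can load a shared table straight into a pandas DataFrame. The profile file and the share, schema and table names below are hypothetical and would normally be issued by the data provider.

```python
import delta_sharing

# Credentials file ("profile") supplied by the data provider; the name is hypothetical.
profile = "config.share"

# Discover what the provider has shared.
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load a shared table directly into pandas, with no Databricks deployment required.
df = delta_sharing.load_as_pandas(f"{profile}#retail_share.sales.transactions")
print(df.head())
```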

Databricks continues to expand its portfolio of big data software around the Databricks Lakehouse Platform, adding more capabilities and integrations to tap the broader market. Databricks can connect to a variety of popular cloud storage offerings, including AWS S3, Azure storage and Google Cloud Storage. It also offers several built-in tools to support data science, business intelligence reporting and machine learning operations. I recommend that organizations looking to organize data and analytics operations into a single platform consider the capabilities of the Databricks platform.

Regards,

David Menninger