Databricks is a cloud data engineering and analytics platform built on Apache Spark. It processes and transforms large volumes of data and supports data exploration through machine learning models. It enables data engineers, data scientists, analysts and other users to process big data and unify analytics through a single interface. The platform supports streaming data, SQL queries, graph processing and machine learning. It also offers a collaborative user interface, the workspace, where users can build data pipelines in multiple languages, including Python, R, Scala and SQL, and prototype and train machine learning models.
Access to external data can provide a competitive advantage. Our research shows that more than three-quarters (77%) of participants consider external data to be an important part of their machine learning (ML) efforts. The most important external data source identified is social media, followed by demographic data from data brokers. Organizations also identified government data, market data, environmental data and location data as important external data sources. External data is not limited to ML analyses, though. Our research shows that external data sources are also a routine part of data preparation processes, with 80% of organizations incorporating one or more external data sources. A similar proportion of participants in our research (84%) include external data in their data lakes.
The technology industry throws around a lot of similar terms with different meanings as well as entirely different terms with similar meanings. In this post, I don’t want to debate the meanings and origins of different terms; rather, I’d like to highlight a technology weapon that you should have in your data management arsenal. We currently refer to this technology as data virtualization. Other similar terms you may have heard include data fabric, data mesh and [data] federation. I’ll briefly discuss these terms and how I see them being used, but ultimately, I’d like to share with you some research that shows why data virtualization can be valuable, regardless of what you call it.
Alteryx is a data analytics software company that offers data preparation and analytics tools to simplify and automate data wrangling, data cleaning and modeling processes, enabling line-of-business personnel to quickly access, manipulate, analyze and output data. Its platform brings together tools for a variety of analytic functions, including diagnostic, predictive, prescriptive and geospatial analytics, and can connect to various data warehouses, cloud applications, spreadsheets and other sources.