David Menninger's Analyst Perspectives

Stream and Event Processing Require Real-Time Analytics

Written by David Menninger | Oct 25, 2023 10:00:00 AM

In my past perspectives, I’ve written about the evolution from data at rest to data in motion and the fact that you can’t rely on dashboards for real-time analytics. Organizations are becoming increasingly event-driven and are operating based on streaming data. At the same time, analytics is becoming more tightly intertwined with operations. More than one-fifth of organizations (22%) describe their analytics workloads as real time in our Data and Analytics Benchmark Research, and nearly half (47%) report that their IoT workloads require seconds or sub-second latency. If organizations can’t rely on dashboards for real-time analytics, what should they consider?

Let’s start by considering the requirements for real-time analytics. Clearly, latency is one consideration: the analysis must be completed quickly enough to meet the requirements of the specific use case. Some use cases, such as digital ad serving, require very low latency. Others, such as social media monitoring, are less demanding. In either case, though, one of the key requirements is automation. Manually monitoring continuous streams of data is neither an effective nor an efficient way to analyze and respond to events as they occur.

Another question is how much context is needed to provide an accurate analysis. In some cases, a single event might be sufficient, for instance if you are monitoring temperature and need to know whether any single reading is outside normal operating limits. In other cases, such as social media monitoring, you may need a window of events to see whether sentiment has shifted over a period of time. And in still other situations, you may need large amounts of historical context, for example a customer’s past purchase history, so you can make product recommendations while the customer is browsing your site or speaking with a customer service agent.
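The first two levels of context can be sketched in a few lines of plain Python. This is an illustration only, not tied to any particular product; the threshold and window size are made-up values.

```python
from collections import deque

TEMP_LIMIT = 90.0   # hypothetical operating limit for the single-event rule
WINDOW_SIZE = 5     # hypothetical window: average sentiment over the last 5 events

def check_temperature(reading: float) -> bool:
    """Single-event context: one reading alone is enough to decide."""
    return reading > TEMP_LIMIT

class SentimentWindow:
    """Windowed context: a shift in sentiment only shows up across several events."""
    def __init__(self, size: int = WINDOW_SIZE):
        self.events = deque(maxlen=size)  # old events fall off automatically

    def add(self, score: float) -> float:
        """Ingest one sentiment score and return the current windowed average."""
        self.events.append(score)
        return sum(self.events) / len(self.events)

# A single out-of-range reading triggers an alert immediately...
assert check_temperature(95.2) is True

# ...whereas a sentiment shift emerges only over the window.
window = SentimentWindow()
for score in [0.6, 0.5, -0.2, -0.4, -0.5]:
    avg = window.add(score)
```

The third level, large historical context such as purchase history, generally means joining the stream against a data platform rather than keeping state in the stream processor itself.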

The options to handle these requirements fall into two broad categories: data platforms and stream-processing frameworks. My colleague, Matt Aslett, has written about and evaluated operational data platforms. Data platform vendors, both relational and non-relational (NoSQL), have been reducing the latency and increasing the throughput of their systems. In many use cases, these capabilities may be sufficient. In other cases, organizations may need to consider a stream-processing framework, or they may have already adopted a streaming data platform and wish to enrich the data as it streams through their organization.

There are a variety of Apache open-source projects associated with processing streaming data, including Beam, Calcite, Flink, Kafka, Spark and Storm. Among these, Kafka and Spark have been widely adopted. Kafka has emerged as a de facto standard for streaming data, with many vendors supporting it in their data platforms or offering it as a managed service. However, while Kafka provides a robust data streaming platform, it is limited in the types of transformations and analyses it supports. Perhaps as a result of these limitations, Apache Flink is now on a growth trajectory that matches Kafka’s earlier one. Flink provides stateful computations over data streams, including an SQL processing layer.
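“Stateful computation over a stream” simply means the processor remembers something between events. The plain-Python sketch below mimics the idea of a keyed running aggregate, which Flink maintains per key; it is a conceptual illustration, not Flink’s actual API.

```python
from collections import defaultdict

class KeyedRunningAverage:
    """Conceptual sketch of a keyed, stateful stream computation:
    state persists across events and is partitioned by key."""
    def __init__(self):
        self.count = defaultdict(int)    # events seen per key
        self.total = defaultdict(float)  # running sum per key

    def on_event(self, key: str, value: float) -> float:
        """Process one event and emit the updated average for its key."""
        self.count[key] += 1
        self.total[key] += value
        return self.total[key] / self.count[key]

# Hypothetical stream of (key, value) events from two sensors.
stream = [("sensor-a", 10.0), ("sensor-b", 4.0), ("sensor-a", 20.0)]
agg = KeyedRunningAverage()
results = [agg.on_event(k, v) for k, v in stream]
# results == [10.0, 4.0, 15.0] -- the third event updates sensor-a's state
```

The point is that each event is processed once as it arrives, yet the answer reflects everything seen so far for that key, without re-scanning the stream.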

Apache Spark has also emerged as a de facto standard for processing large amounts of data and is the backbone of many data lake implementations. Within Spark are projects to support SQL, as well as Structured Streaming, which applies SQL to streaming data. For those that require more sophisticated analyses, Spark also includes a machine learning (ML) library.
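Applying SQL to a stream typically means grouping events into time windows and aggregating within each one. The plain-Python sketch below shows the idea behind a tumbling-window GROUP BY, roughly what a windowed aggregation expresses in Structured Streaming’s SQL; it is illustrative only, not the Spark API, and the event data is made up.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed, non-overlapping windows
    and count events per (window, key), like a windowed GROUP BY."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

# Hypothetical clickstream: (timestamp in seconds, event type).
events = [(1, "click"), (4, "click"), (12, "view"), (15, "click")]
print(tumbling_window_counts(events, 10))
# {(0, 'click'): 2, (10, 'view'): 1, (10, 'click'): 1}
```

In a real streaming engine the windows close incrementally as event time advances, but the grouping logic is the same.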

Basically, organizations have three categories to choose from for real-time analytics over streaming data. First, they can rely on their data platform provider: most now offer some support for stream processing and, depending on the use case and the relationship with the vendor, that may be sufficient. For organizations that have made big commitments to a data lake architecture, Spark is likely to be a good alternative, again depending on the use case. And for organizations heavily invested in Kafka or one of its variants, Flink may be the best option, enabling the analyses to be associated with the streams themselves, independent of where the data lands and which data platforms are used.

Some operations don’t require real time, some require near real time, and still others require nearly instantaneous responses. Explore your options so you are prepared to meet your organization’s needs.

Regards,

David Menninger