Splunk Makes Machine-Generated Big Data Serve Analytics

Posted by David Menninger, Sep 9, 2011 9:35:04 AM

Splunk may be one of the biggest software companies you’ve never heard of. I’ve been following the seven-year-old company for over six months now and recently attended its second annual user conference. Splunk focuses on analyzing large volumes of machine-generated data in underlying applications and systems, which includes application and system logs, network traffic, sensor data, click streams and other loosely structured information sources. Many of these “big data” sources are the same sources analyzed with Hadoop, according to our recently published benchmark research. However, Splunk takes a different approach that focuses on performing simple analyses on this data in real time rather than the batch-based advanced analytics we see as the most common use for Hadoop.

Although privately held, Splunk operates much like a public company and appears to be grooming itself for an initial public offering. In its fiscal year ended January 31, 2011, Splunk reported $66 million in revenue, and it has announced that its goals for FY 2012 include generating $100 million in revenue. With 68% and 70% growth in its first two quarters this year, Splunk appears to be on track to meet this goal. CEO Godfrey Sullivan, formerly CEO of Hyperion, has a successful track record in the business intelligence software space. All these indications suggest a promising future for the company. Data originates from a variety of sources in ever-increasing volumes, and organizations are trying to figure out how they can maximize the value of this data. Splunk has grown rapidly because the tool is simple for IT professionals, whether an individual or a whole department, to adopt and apply to machine or IT-specific data, an area in which our IT Analytics benchmark research finds plenty of demand.

As stated above, Splunk focuses on a specific segment of the big-data market: machine-generated data. This type of data originates constantly from many sources throughout an organization and in large quantities. The other common characteristic of machine-generated data is that generally it is less structured than data in typical relational databases. Often the information is captured as logs consisting of text files containing various record lengths and record structures. To effectively utilize this loosely structured information in real time, two challenges must be overcome: loading the data quickly and easily navigating through and analyzing the information once it is loaded.

Splunk tackles the first challenge by loading the information in its raw form. No preprocessing is necessary, so no delay is introduced and no data is “lost.” Retaining all the raw data has business value as well. If you later decide that you want to investigate some piece of information that you previously didn’t think was important, it will be available for analysis.
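To make the value of keeping raw data concrete, here is a minimal Python sketch. It is not Splunk's implementation: the sample log lines, the key=value layout and the function name `extract_field` are all assumptions made for illustration. The lines are stored exactly as received, and a field that nobody declared at load time is pulled out later, at query time.

```python
import re

# Hypothetical raw log lines, stored exactly as received (no parsing at load time).
RAW_EVENTS = [
    "2011-09-09 09:35:04 web01 GET /cart status=200 bytes=5120",
    "2011-09-09 09:35:05 web02 GET /checkout status=500 bytes=312",
    "2011-09-09 09:35:07 web01 POST /checkout status=200 bytes=2048",
]


def extract_field(events, name):
    """Pull a key=value field out of raw text at query time.

    Events that lack the field are skipped rather than rejected,
    because loosely structured logs vary from record to record.
    """
    pattern = re.compile(r"\b" + re.escape(name) + r"=(\S+)")
    values = []
    for line in events:
        match = pattern.search(line)
        if match:
            values.append(match.group(1))
    return values


# "bytes" was never declared as a column, yet it is still available for analysis.
print(extract_field(RAW_EVENTS, "bytes"))  # ['5120', '312', '2048']
```

Because nothing was discarded when the lines were stored, asking for a different field later is only a matter of changing the name passed to the function.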

A search-based mechanism provides the solution to the second challenge. Our information applications research shows the importance of search, which ranked third on the list of very important analysis capabilities overall, and for end users specifically it topped the list of very important capabilities (46%), ahead of navigating to and retrieving information. Search-based access to analytics has been a major driver of growth and was highlighted by my colleague in 2009. Search overcomes the issues created by the lack of “structure” in the machine-generated data. In reality the data has plenty of structure – users search for strings representing occurrences of certain types of events. Splunk supplements the query mechanism with analytical functions that can be used to create aggregates, time-period comparisons and other common analyses. In addition, queries can be saved for reuse and as the basis of reports, dashboards and alerts. I heard anecdotal proof of the value of search at the Splunk user conference from two undergraduate students who, as part of their summer internship, had learned the Splunk query techniques quickly and implemented reports and analyses for monitoring the systems of a major financial services software company.
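The combination of string search and a simple aggregate can be sketched in a few lines of Python. This is an illustration only and does not use Splunk's query syntax; the sample events and the function names `search` and `count_by_minute` are assumptions. The sketch finds events containing a string and then counts them per minute, which is the kind of time-period comparison described above.

```python
from collections import Counter

# Hypothetical events; each line starts with a "YYYY-MM-DD HH:MM:SS" timestamp.
EVENTS = [
    "2011-09-09 09:35:04 ERROR payment timeout",
    "2011-09-09 09:35:40 INFO login ok",
    "2011-09-09 09:36:02 ERROR payment timeout",
    "2011-09-09 09:36:15 ERROR db connection refused",
]


def search(events, term):
    """Return the events whose raw text contains the search term."""
    return [line for line in events if term in line]


def count_by_minute(events):
    """Aggregate events into one-minute buckets keyed by 'YYYY-MM-DD HH:MM'."""
    return Counter(line[:16] for line in events)


errors = search(EVENTS, "ERROR")
per_minute = count_by_minute(errors)
for minute, count in sorted(per_minute.items()):
    print(minute, count)
```

Saving the pair of calls as a named function would be the rough equivalent of a saved query that feeds a report or an alert threshold.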

Architecturally, Splunk employs massively parallel processing to spread the data and processing across a number of individual servers. At query time, a proprietary MapReduce mechanism – one not based on Hadoop – gathers the data from the individual nodes to satisfy the user’s request. Users do not need to know about the MapReduce mechanism. The translation of the query to the appropriate execution strategy is done automatically. However, as with any distributed data system, some knowledge of how the data gets distributed across the nodes can be helpful in identifying performance bottlenecks and tuning certain slowly running queries.
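The general map-and-reduce pattern behind this kind of query can be sketched as follows. Splunk's mechanism is proprietary, so this shows the generic technique under stated assumptions rather than the product's internals: the node names, the sample data and the functions `map_node` and `distributed_count` are hypothetical. Each node counts its own events (the map step), and the partial results are merged into one answer (the reduce step).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical shards: each node holds only its own slice of the events.
NODES = {
    "indexer-1": ["ERROR disk full", "INFO started", "ERROR disk full"],
    "indexer-2": ["WARN slow query", "ERROR timeout"],
    "indexer-3": ["INFO started", "INFO stopped"],
}


def map_node(events):
    """Map step: count one node's events by severity (the first token)."""
    return Counter(line.split(" ", 1)[0] for line in events)


def distributed_count(nodes):
    """Run the map step on every node in parallel, then merge the partials."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_node, nodes.values()))
    # Reduce step: adding Counters sums the per-node counts key by key.
    return reduce(lambda left, right: left + right, partials, Counter())


totals = distributed_count(NODES)
print(dict(sorted(totals.items())))  # {'ERROR': 3, 'INFO': 3, 'WARN': 1}
```

The sketch also hints at why data distribution matters for tuning: if one node held most of the events, its map step would dominate the total query time.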

The currently released version, Splunk 4.2, was introduced earlier in 2011 and includes real-time alerting on streams of data. It also includes a new agent-based data collection mechanism, called a universal forwarder, that makes the task easier and provides more reliability when collecting data from multiple endpoints or devices. Splunk separates the workload between indexers that perform the data loading and search heads that execute the queries. Version 4.2 introduced search-head pooling for load balancing so searches can be directed to any one of the search heads; it also provides high availability among the search processes.

At the conference, Splunk introduced version 4.3 and made the beta version available to registered users. One of the more popular demonstrations was Splunk 4.3 running as a non-Flash application on the iPad. The company also made a number of announcements of specific applications and extensions of the product. Splunk Storm provides visibility and operational analytics of cloud-based applications. Splunk App for Citrix XenDesktop and Splunk App for VMware provide visibility into virtualized and private cloud environments. The company also introduced a software development kit (SDK) for the Python programming language, which is open source and available on GitHub.

The Splunk product is not perfect, of course. Continued investment in the user interface is needed to make it easier to use. Currently users have to learn the Splunk query syntax – I was walked through those internals to show that this is easy – and a graphical query interface would make the product more widely usable. When I probed about high availability, it became clear that you can use the Splunk tools to load dual systems so that a standby system is ready in case of failure, but that is not done automatically. Still, the company’s representatives were open about the product’s shortcomings with both me and its customers, which was refreshing.

Nearly every organization has some form of machine data. Splunk says that more than 2,900 enterprises have found a reason to purchase its products. The company’s mission is now to raise its visibility and broaden its applicability. Splunk provides a free, limited-capability version of its product so you can try it for yourself and see if it applies to your needs.

Regards,

David Menninger – VP & Research Director

Big Data, Predictive Analytics, Sales Performance, Social Media, Supply Chain Performance, Business Analytics, Business Intelligence, Business Performance, Customer & Contact Center, Machine data, Operational Intelligence, IT Performance Management (ITPM)

Authors:

David Menninger

Executive Director, Technology Research

David Menninger leads technology software research and advisory for Ventana Research, now part of ISG. Building on over three decades of enterprise software leadership experience, he guides the team responsible for a wide range of technology-focused data and analytics topics, including AI for IT and AI-infused software.
