We recently published the results of our benchmark research on Big Data to complement the previously published benchmark research on Hadoop and Information Management. Ventana Research undertook this research to acquire real-world information about levels of maturity, trends and best practices in organizations’ use of large-scale data management systems now commonly called Big Data. The results are illuminating.
Volume, velocity and variety of data (the so-called three V’s) are often cited as characteristics of big data. Our research offers insight into each of these three categories. Regarding volume, over half the participating organizations process more than 10 terabytes of data, and 10% process more than 1 petabyte of data. In terms of velocity, 30% are producing more than 100 gigabytes of data per day. In terms of the variety of data, the most common types of big data are structured, containing information about customers and transactions. However, one-third (31%) of participants are working with large amounts of unstructured data. Nine out of 10 participants rate scalability and performance as the most important evaluation criteria, suggesting that, of the three V’s, volume and velocity are more important concerns than variety.
This research shows that big data is not a single thing with one uniform set of requirements. Hadoop, a well-publicized technology for dealing with big data, gets a lot of attention (including from me), but there are other technologies being used to store and analyze big data. The research data shows an environment that is still evolving. The majority of organizations still use relational databases but not exclusively: More than 90 percent of participants using relational databases also use at least one other technology for some of their big-data operations. One-third (34%) are using data warehouse appliances, which typically combine relational database technology with massively parallel processing. About as many (33%) are using in-memory databases. Each of these alternatives is being more widely used than Hadoop. As well, 15% use specialized databases such as columnar technologies, and one-quarter (26%) are using other technologies.
While these technologies enable organizations to do things they haven’t done before, there is no technological silver bullet that will solve all big-data challenges. Organizations struggle with people and process issues as well. In fact, our research shows that the most troublesome issues are not technical but people-related: staffing and training. Big data itself and these new approaches to processing it require additional resources and specialized skills. Hence we see high levels of interest in big-data industry events such as Hadoop World and the Strata Conference. Recognizing the dearth of trained resources here, some academic institutions have launched degree programs in analyzing big data, and IBM has started BigData University.
Research participants cited real-time capabilities and integration as their key technical challenges. The velocity with which they generate data and the fact that over half the organizations analyze their data more than once a day are forcing them to seek real-time capabilities; the pace of business today demands that they extract all useful information as soon as possible to support rapid decision-making. With respect to integration, fewer than half of participants are satisfied with integration of third-party products, and almost two-thirds cite lack of integration as an obstacle to analyzing big data. Three-quarters have integrated query and reporting with their big-data systems, but more advanced analytics such as data mining, visualization and what-if analysis are seldom available as integrated capabilities. Responding to such comments, vendors have been racing to integrate their business intelligence and information management products with big-data sources. As you consider big-data projects and technologies, make sure that the vendors you select can handle the big-data sources you must use.
Looking ahead, we expect more changes in this evolving landscape. In some ways big-data challenges, and the presence of Hadoop in particular, have paved the way for technologies other than relational databases. NoSQL alternatives, such as Cassandra, MongoDB and Couchbase, are gaining notice in enterprise IT organizations after the success of Hadoop. In-memory databases, once considered a niche technology, are now treated by SAP, in HANA, as its primary big-data analytical platform. There are differing opinions about whether these various big-data technologies will converge or diverge. We can look to the past for some indications of where the market might go. Over the years a variety of alternatives to relational databases have emerged, including OLAP, data warehouse appliances and columnar databases; each eventually was absorbed into relational databases.
We also see signs of the major relational vendors embracing big-data technologies. IBM acquired Netezza for its massively parallel data warehouse appliance technology. IBM has also invested heavily in Hadoop. Oracle introduced its own line of data warehouse appliances and recently brought a big-data appliance to market that includes Hadoop and NoSQL technologies. Microsoft has invested in massively parallel processing and Hadoop. We also see independent vendors such as Hadapt combining relational database technology with Hadoop. The past is not necessarily an indication of the future, but our research and recent market dynamics both suggest it may be premature to write off the relational database vendors as out of touch.
In light of this information, I recommend that your organization explore various alternatives for solving specific challenges. At a minimum you should be aware of the alternatives so when the need arises you will know what is available. Use our big-data research to guide your use of these technologies and to help avoid some of the obstacles they present so you can be more successful in applying big data to business decisions.
David Menninger – VP & Research Director