There has been a spate of acquisitions in the data warehousing and business analytics market in recent months. In May 2010 SAP announced an agreement to acquire Sybase, primarily for its mobility technology, at a time when it had already been advancing its own efforts with SAP HANA and BI. In July 2010 EMC agreed to acquire data warehouse appliance vendor Greenplum. In September 2010 IBM countered by acquiring Netezza, a competitor of Greenplum. In February 2011 HP, after giving up on its original focus on HP Neoview, announced its acquisition of analytics vendor Vertica, which had been advancing its own efforts efficiently. Even Microsoft shipped a new release of its SQL Server database in 2010 and advanced its appliance efforts. Now, less than one month later, Teradata has announced its intent to acquire Aster Data for analytics and data management. Teradata bought an 11% stake in Aster Data in September, so its purchase of the rest of the company shouldn’t come as a complete surprise. My colleague had raised the question of whether Aster Data could be the new Teradata; instead, it is now part of Teradata.
All these plays have implications for how enterprises manage and use their fast-growing stores of data. We’re living in an era of large-scale data, as I wrote recently. Founded in 1979, Teradata has been a dominant player in this market for years. Teradata was a pioneer in massively parallel processing (MPP) for database systems, which I recently described and which is a concept behind much of today’s analytic database market, including all the recently acquired vendors mentioned above. When I worked at Oracle in the late 1990s, Teradata was the chief competitor when pursuing 1 terabyte (TB) data warehouse opportunities. Yes, managing a single terabyte was considered a significant challenge then, one that few vendors were ready to take on. Although the data volumes have grown, little else has changed since those years, with Oracle now competing against more providers despite its recent promotions of its second-generation Oracle Exadata appliance and its Oracle 11g Release 2 database at the 2010 Oracle OpenWorld.
Of course, that has all changed long since. Over the last few years Aster Data established itself as a player in the data warehousing market with an MPP relational database product. It also embraced the MapReduce parallel processing technology earlier than most other data warehouse vendors. MapReduce is a key concept of the increasingly significant Apache Hadoop project; it appears that Aster Data’s proprietary implementation of MapReduce was a significant factor in Teradata’s decision to acquire it. The tone of the joint Teradata/Aster Data briefing for analysts even suggests that Teradata is attempting an end-run around Hadoop. Aster Data has successfully developed a market niche of customers doing analysis of unstructured data and social networks. Both of these are activities one might use Hadoop to do. The company also had success in other segments, such as financial services, marketing services, retail and e-commerce.
Aster Data customers should benefit from the increased resources Teradata can invest in developing the product, and the rich heritage of Teradata in this space should enhance the infrastructure supporting the Aster Data product line. For example, Teradata’s workload management tools are among the best in the industry. However, even if the association with Teradata brings some of these capabilities to the Aster Data platform, it’s likely that the cost advantages of Aster Data over Teradata will decline over time. Integration into the Teradata organization and technology stack could detour Aster Data from its previous path of innovation. So customers may see a longer time between releases and maybe a less ambitious product roadmap.
Obviously Teradata has many more customers than Aster Data and is most concerned about them. They, too, might see some negative impact on development schedules, but on the plus side they instantly receive a new source of technology that could be beneficial to them. The question, which will remain unanswered until a roadmap is published, is how quickly Teradata customers can take advantage of the innovations Aster Data has brought to market.
Teradata officials talked about Aster Data retaining some independence in operations, in the product line and perhaps in identity, but integration is the key to making the acquisition valuable to the Teradata customer base. One likely outcome favorable to current and future Teradata customers is more support for industry-standard server hardware, which was specifically mentioned as a benefit of the acquisition. Teradata customers may also benefit from the columnar database capabilities if those capabilities are ported to the main Teradata product line.
There were a couple of notable omissions from the discussion as the acquisition was announced. Both Teradata and Aster Data had partnerships with SAS and Cloudera, the Hadoop vendor. The SAS relationship provided advanced analytics including statistical analyses and data mining embedded within the database. Teradata may be looking to shift to using Aster Data’s embedded analytic capabilities. With respect to Hadoop, although both companies had publicly announced partnerships with Cloudera, neither had offered any deep integration with Hadoop. I expect such an arm’s-length relationship to continue, but I suspect the new combination will put more weight behind the embedded SQL-MR capability that Aster Data developed. Teradata may be large enough to attempt such an independent strategy, but I think it would be a mistake to alienate the entire Hadoop community that I have been researching, which in my opinion is a market not served by Teradata today.
Customers of other data warehousing companies should feel little immediate negative impact from this coupling. In fact, more competition over the analysis of unstructured data could spark those companies to enhance their product offerings. As I noted at the beginning, the market for independent products has shrunk over the last six months, so the remaining vendors could see an increase in revenues as customers who were using Aster Data or others in preference to larger companies may start to consider other alternatives. However, at some point the musical chairs of large-scale database acquisitions is going to stop, and when it does, if your vendor is still reliant on venture capital (VC) funding (i.e., does not yet have positive cash flow), you could have a problem. The VCs may decide to stop funding such a company, which could force it to scale back its plans.
Probably some other shoes will drop before the market shift is over. Dell had a partnership with Aster Data that is now called into question. Dell therefore might seek an alternative partner or even acquire one of the other vendors to get into the game. Potential candidates, many of which we have been assessing, include 1010data, Calpont, Kognitio and ParAccel. I didn’t include Infobright in this list because it doesn’t have an MPP offering, which seems to be table stakes now. This latest acquisition could also lead to another round of acquisitions based on Hadoop or NoSQL technologies. Or perhaps the game may take a step in the direction of complex event processing or predictive analytics. We’ll have to wait and see, although at this rate we may not have to wait long!
David Menninger – VP & Research Director
This is the second in a series of posts on the architectures of analytic databases. The first post addressed massively parallel processing (MPP) and database technology. In this post, we’ll look at columnar database technology. Given the recent announcement of HP’s plan to acquire Vertica, a columnar database vendor, there is likely to be even more interest in columnar database technology, how it operates and what benefits it offers.
Fundamentally, columnar database technology offers two primary benefits: increased speed and reduced storage requirements. We repeatedly emphasize the importance of speed to end users. Our benchmark research on business intelligence and performance management shows that performance is a key consideration among those seeking improvement in this area. Besides speed and reduced storage, other benefits may exist in particular implementations, but these are the two most significant – and most directly attributable to columnar technology.
Columnar technology was made popular by Sybase, now part of SAP, with its IQ product. Today, many other vendors have brought columnar database products to market, among them 1010data, Calpont, Infobright, ParAccel, Sand Technology, SenSage and Vertica, which was just acquired by HP. Recently, traditional row-oriented database vendors – among them Aster Data (now being acquired by Teradata), EMC Greenplum and Oracle – have added some columnar capabilities to their products. Additionally, in-memory database technologies frequently utilize column-based architectures.
Columnar database technology typically includes columnar storage and/or columnar database execution. We’ll talk about both and why each is important.
Columnar storage turns the traditional relational (row-oriented) database on its side. Instead of storing data in rows, which is the norm for databases such as IBM DB2, Microsoft SQL Server, MySQL and Oracle, data is stored by column. Whereas a row would consist of customer name, order date, amount of the order, order number, shipment date, method of shipment, and so on, a column of data would consist of all the customer names. A separate column would contain all the order dates. Another would contain all the order amounts, and so on. Row-oriented storage is more efficient (that is, faster) when recording or retrieving the data necessary to process a transaction. The disk only needs to go to one location for either operation. Column-oriented storage, on the other hand, is more efficient (that is, faster) when querying the same item across many rows of data, which is common in most business intelligence (BI) queries. Since all the similar items are grouped together on disk, the database can scan them more rapidly than if it had to retrieve all the fields in a record just to get at one or two of them. The difference is magnified when querying hundreds of millions or billions of rows.
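To make the contrast concrete, here is a minimal Python sketch of the two layouts. It is only an illustration held in memory, not how any particular vendor lays data out on disk, and the order records and field names are made up for the example.

```python
# Hypothetical order records; the field names and values are illustrative only.
orders = [
    {"customer": "Acme", "order_date": "2011-03-01", "amount": 120.0},
    {"customer": "Globex", "order_date": "2011-03-01", "amount": 75.5},
    {"customer": "Acme", "order_date": "2011-03-02", "amount": 30.0},
]

# Row-oriented layout: all fields of one order sit next to each other.
row_store = [tuple(order.values()) for order in orders]

# Column-oriented layout: all values of one field sit next to each other.
column_store = {field: [order[field] for order in orders] for field in orders[0]}


def total_amount_rows(rows):
    """BI-style query on rows: every full record is visited to read one field."""
    return sum(row[2] for row in rows)


def total_amount_columns(columns):
    """The same query on columns: only the 'amount' list is scanned."""
    return sum(columns["amount"])


def fetch_order_rows(rows, index):
    """Transaction-style lookup on rows: one location holds the whole record."""
    return rows[index]


def fetch_order_columns(columns, index):
    """The same lookup on columns: one access per column to rebuild the record."""
    return tuple(values[index] for values in columns.values())
```

Both layouts return the same answers; what differs is how much data each one has to touch. The aggregate reads a single list in the column layout, while rebuilding one order requires a visit to every column.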
Columnar organization yields several storage-related benefits as well. First, columnar databases make little or no use of user-defined indexes, so they require no additional storage other than that for the data. Second, once the data is sorted and stored by column, the potential for compression increases dramatically. One simple example I like to use is to imagine storing billions of stock quotes, instances of web activity, network activity or any other type of data that includes a date field. Even if you kept five years of history, there are only 1826 or 1827 (depending on leap years) unique date values in that time period. With a row-oriented storage approach, you would need to store all of those billions of values on disk. Using a columnar storage technique you could potentially store each date only once and record the number of occurrences for that date. This technique, referred to as run-length encoding, can lead to dramatic compression ratios, as suggested by the example. Other techniques are available too, but I won’t go into those at this point. What is important to note is that the benefits of compression are significant enough that row-oriented vendors have figured out how to engineer some of these same techniques into their storage algorithms.
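The date example can be sketched in a few lines of Python. This is a simplified illustration of run-length encoding on a sorted column, not any vendor's storage format; the function names and sample dates are assumptions made for the example.

```python
from datetime import date
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, count) pairs back into the original list."""
    return [value for value, count in runs for _ in range(count)]


def distinct_days(first_year, years):
    """Number of calendar days in a span of whole years."""
    return (date(first_year + years, 1, 1) - date(first_year, 1, 1)).days


# A sorted date column with many repeats, as in the stock quote example.
dates = ["2011-03-01"] * 4 + ["2011-03-02"] * 3 + ["2011-03-03"] * 5
runs = rle_encode(dates)
# runs == [("2011-03-01", 4), ("2011-03-02", 3), ("2011-03-03", 5)]
```

Twelve stored values shrink to three pairs here; with billions of rows and at most 1826 or 1827 distinct dates, the ratio becomes far more dramatic. The `distinct_days` helper simply checks that arithmetic: a five-year span containing one leap day gives 1826 and one containing two gives 1827.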
Let’s move on to the second major aspect of columnar databases, columnar execution. Columnar databases not only can reduce the amount of storage required, but also in many cases can reduce the amount of memory required to process the data. If the compact representation of the data can be retained as queries are processed, it follows that less memory is required to manipulate the data. Without going into all the details, some columnar database engines can select subsets of data and perform joins on the compact representation, resulting in even more efficiency and performance gains.
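As a rough illustration of working on the compact representation, the Python sketch below answers a filter-and-count query directly on run-length encoded data without expanding it. It assumes the column is stored as (value, count) pairs; the function names are hypothetical, and real engines use considerably more sophisticated techniques.

```python
def count_matching(runs, predicate):
    """Count rows whose value satisfies the predicate, one test per run."""
    return sum(count for value, count in runs if predicate(value))


def matching_row_ranges(runs, predicate):
    """Return half-open (start, end) row ranges that satisfy the predicate.

    The ranges could then be used to pick the same rows from other columns.
    """
    ranges = []
    start = 0
    for value, count in runs:
        if predicate(value):
            ranges.append((start, start + count))
        start += count
    return ranges


# A date column of 12 rows stored as 3 runs (illustrative data).
date_runs = [("2011-03-01", 4), ("2011-03-02", 3), ("2011-03-03", 5)]


def on_or_after_march_2(day):
    return day >= "2011-03-02"


matched = count_matching(date_runs, on_or_after_march_2)
ranges = matching_row_ranges(date_runs, on_or_after_march_2)
```

The predicate is evaluated three times rather than twelve, and the 12 individual values never have to be materialized in memory. That is the essence of the memory and performance savings described above.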
Columnar databases do have their downsides. They are typically less efficient when it is necessary to update or delete data, for several reasons. First and foremost, updating or deleting a single row of data requires finding several locations on disk where the individual columns are stored. Even single-row retrievals can be slower, resulting in a noticeable performance difference. I know of one financial services firm whose spreadsheets contain thousands of individual data references in separate cells. Even though each individual lookup only takes a small fraction of a second longer, the overall performance is much slower because the difference is magnified many times over. The second reason updates or deletes can be slower is the organization of the data. If a single value in the middle of a long list of values is deleted or updated, some portion of the page needs to be reorganized. Depending on the vendor’s approach, this reorganization issue can become significant over time as more and more updates or deletes are processed.
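The reorganization cost is easy to see in a small Python sketch. It assumes, purely for illustration, a column held as run-length encoded (value, count) pairs; the `rle_update` function is hypothetical, and actual products organize their pages differently.

```python
def rle_update(runs, row, new_value):
    """Change the value at one row position in a run-length encoded column.

    Updating the middle of a run splits it into up to three runs, so the
    stored structure has to be rebuilt rather than patched in place.
    """
    total_rows = sum(count for _, count in runs)
    if not 0 <= row < total_rows:
        raise IndexError("row position out of range")

    rebuilt = []
    start = 0
    for value, count in runs:
        end = start + count
        if start <= row < end and value != new_value:
            before = row - start
            after = end - row - 1
            if before:
                rebuilt.append((value, before))
            rebuilt.append((new_value, 1))
            if after:
                rebuilt.append((value, after))
        else:
            rebuilt.append((value, count))
        start = end

    # Merge neighbours that now hold the same value.
    merged = []
    for value, count in rebuilt:
        if merged and merged[-1][0] == value:
            merged[-1] = (value, merged[-1][1] + count)
        else:
            merged.append((value, count))
    return merged


column = [("A", 5), ("B", 3)]
split_result = rle_update(column, 2, "C")
merge_result = rle_update(column, 4, "B")
```

In the first call a single changed value turns two runs into four; in the second, the changed row is absorbed into the neighbouring run. Either way the whole list of runs is walked and rewritten, which hints at why frequent updates can erode the benefits of a compact columnar layout.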
Because of these issues with columnar databases, at least two hybrid implementations have emerged. One form of hybrid is a row-oriented database with some columnar capabilities added, such as those noted above. The other form is a columnar database with some row-oriented capabilities added on. I suggest that you think about it this way: while a hybrid architecture minimizes the downsides of each approach, it does not eliminate them.
The bottom line is that your database engine will be primarily column- or row-oriented, and you should consider that in the selection and evaluation process. Over time I suspect the hybrid techniques may approach each other in terms of performance and capabilities, but for the time being I think you will still see some differences depending on the specific use case and workload you are trying to manage.
Columnar databases can also be implemented as MPP (massively parallel processing) systems, as hardware appliances or as in-memory systems. Calpont, ParAccel and Vertica are examples of MPP columnar databases. Kickfire, acquired by Teradata, is probably the best example of an appliance-based columnar system, while the SAP High-Performance Analytic Appliance (HANA) is probably the most ambitious in-memory columnar system.
I hope you can use this information as you investigate database vendors and evaluate their product offerings. Consider whether you need columnar storage alone or if you need columnar execution as well. Consider whether the cost of learning and managing a separate system is worth the benefits. Consider your current and future use cases and how they match a row-versus-column orientation. Regardless of what you think you understand about the different approaches, make sure to conduct a proof of concept with your data and your workload. As you might imagine from this article, the differences between these approaches can be subtle, and a proof of concept is likely to be the only way to evaluate which approach works best for you.
David Menninger – VP & Research Director