Open source business intelligence (BI) software vendor Jaspersoft recently announced general availability of its flagship product Jaspersoft 4 and earlier this week announced a new reporting project that provides data connectors to a variety of large-scale data sources.
Jaspersoft claims millions of downloads of its software and has a staff of 120 employees. The company focuses on the information delivery aspects of BI and partners with another open source vendor, Talend, which I recently wrote about. Jaspersoft 4 has a robust set of traditional BI features based around a report metaphor. It includes ad-hoc reports, pixel-perfect reporting, dashboards and online analytical processing (OLAP), both in-memory OLAP and relational OLAP (ROLAP). But as with many business intelligence tools, these capabilities are read-only; Jaspersoft hasn’t yet tackled planning, what-if analyses or other advanced analyses such as statistics and predictive analytics.
In my previous life, I partnered with Jaspersoft as well as other BI tool vendors. I observed that while the platform was robust and capable, the look and feel needed an overhaul because it looked antiquated and lacked the interactivity available in modern Web interfaces. Jaspersoft 4 addresses that issue with a new user interface that makes the product look much more modern and provides much better interactivity for the end user. The new interface is part of a re-architecting of the product that separates the presentation layer from the events layer and the structure. (The product is also architected with a data abstraction layer, but more about that later.)
The separation of the presentation layer is important to Jaspersoft’s focus on enabling embedded BI capabilities. The company also has invested in multitenancy capabilities, including per-tenant user interface themes. So, for example, if your organization has multiple divisions with different branding, each division could have its own look and feel even if they all run on the same instance of Jaspersoft.
These features are derived from Jaspersoft’s strategy to build platform as a service (PaaS) and software as a service (SaaS) offerings. As the software market evolves, more and more products will be delivered as a service, and Jaspersoft is betting that it can capture a large share of this market by architecting its products to behave well when running as a service and by supporting the data sources these service providers might choose to leverage.
For software vendors, committing to re-architecting is always a difficult decision because it takes lots of development effort without providing much visible benefit in the form of new features. So it is often driven by a fear that the company can’t be competitive with its previous approach. Jaspersoft sees the market evolving toward a model where BI becomes an embedded capability; I endorsed that notion in my posting “What Is Wrong with Business Intelligence?”
Jaspersoft 4 provides a complete Web-based BI product stack, while many BI products still have developer or designer components that must run outside of the browser environment. The advantage of a completely Web-based stack is that it’s easier to cross the boundaries between end user and developer. For example, if you want to allow certain groups of users in your organization to design new reports or screens, you can do this with Jaspersoft 4 simply by turning certain features on or off for certain classes of users. If the developer tool were a completely separate product, doing that would be much more difficult.
Despite all the focus on Web-based delivery, mobile delivery platforms are not currently a core focus of this vendor. Jaspersoft 4 works on mobile devices via a browser, but it does not provide native applications and therefore cannot support gestures and features specific to a platform. Given the rising interest in mobile BI, as evidenced in our business intelligence and performance management benchmark research, Jaspersoft will need to do more to take full advantage of mobile platforms. I expect to hear more from them in the future on this subject.
Independent of Jaspersoft 4, the company created a new project to provide native access to a variety of large-scale data sources including massively parallel processing (MPP) database technologies, Hadoop and NoSQL. The news here is not the support of MPP technologies, although Jaspersoft now supports additional MPP database products. Nor is it the support of Hadoop, although it is among the early group of BI vendors to support Hadoop. No, the real news is support for a variety of other large-scale data technologies often referred to as NoSQL alternatives, including Cassandra, CouchDB, GemFire, HBase, Infinispan, MongoDB, Neo4J, Redis, Riak and VoltDB.
These projects are in relatively early states. The connectors to all the open source technologies are themselves open source and available for download at the project link above. Jaspersoft refers to them as beta versions, but that might be a little generous in terms of how far along they are: They support parameterized reports only and generally require some modification of the code to make the connectors work with a particular data source. As with many open source projects, the goal of releasing them in their current state is to engage others in helping to develop these connectors further.
It is an interesting play on the part of Jaspersoft. It gives the company an opportunity to grab market share in a segment where there is little or no competition currently. I suspect that this market is likely to grow. As more software is delivered as a service, performance and scalability become much bigger concerns than adherence to standards. A company that makes its living developing and delivering software services can afford to stray from industry standards under the covers of the services that it provides. In many cases that’s the only way to deliver the performance and scalability users require. In fact, that’s how some of today’s most popular technologies originated: Google, Facebook, Yahoo and others did not want to constrain their growth by limiting themselves to the capabilities of standard relational databases. The question for Jaspersoft is whether the technologies it is involved with continue to grow and whether they represent a large enough market to justify the investment it is making. There’s certainly enough interest in the NoSQL market to attract attention these days, as evidenced by a recent article in the New York Times. The companies behind some of these technologies have attracted venture capital investments.
As a Jaspersoft customer or potential customer, what does this mean to you? If you use one of the NoSQL technologies, you now have a potential reporting solution. If you don’t, then the question is whether the energy devoted to these NoSQL alternatives detracts from investment and attention in more traditional data sources. It’s a long list of data sources to support. The quality assurance process alone could become a big burden going forward. Jaspersoft does have the advantage of its open source model to attract support from the community for many of these connectors, but depending on the level of external support I would be concerned about negative impacts on the timeliness or quality of the new releases of the many product lines.
Charting a new direction has risks. This could be a brilliant move on the part of Jaspersoft, or it could turn out to be a costly mistake. Only time will tell.
David Menninger – VP & Research Director
This is the first in a series of posts on the architectures of analytic databases – relational database technology that has been “supercharged” in some way to handle large amounts of data, such as typical data warehouse workloads. The series will examine massively parallel processing (MPP), columnar databases, appliances and in-database analytics. Our purpose is to help those evaluating analytic database technologies understand some of the alternative approaches so they can differentiate between vendors’ offerings. There is no single best solution for all analytics or all types of applications; usually the decision involves a series of trade-offs. By understanding what you might be giving up or gaining, you can make a better decision about which solution best fits your organization’s needs.
Analytic databases exist because of the need for speed and scale. Our benchmark research on business intelligence and performance management shows that performance and scalability are key considerations for organizations seeking improvement in this area. Some years ago, operating on summary data via OLAP cubes could provide speed to end users, but that approach is no longer sufficient for many types of analyses.
In this post we’ll look at MPP concepts including how these techniques speed up processing, what some of the challenges are and the different approaches that vendors use. Fundamentally, massively parallel processing is about splitting a database workload across multiple servers or nodes and executing a portion of the workload in parallel on each of the nodes. Once the parallel activities are completed, the results from each node are stitched together to deliver the end results.
MPP systems eliminate the need to manually split workloads across multiple machines. The manual technique, known as sharding, became a common way to overcome limitations of single-server database systems such as MySQL. Many organizations found that a single instance of their database could not deliver the performance or scale they needed, so they created application-level schemes to split the data manually across multiple machines. For instance, all data for customers whose last names begin with A-M would be stored in one database instance on one machine and all data for N-Z on another. With this approach and a lot of effort (i.e., coding) you could theoretically get almost twice the performance and twice the scale out of your system. However, the amount of coding and maintenance in such a system can become significant. Instead, many organizations have opted to purchase an MPP system with the built-in ability to split workloads.
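The manual approach amounts to a routing layer in application code. The sketch below is purely illustrative – the shard names, helper functions and in-memory dictionaries are stand-ins for connections to real database instances:

```python
# Application-level sharding: route each customer to a database
# instance based on the first letter of the last name.
# The dictionaries here are stand-ins for real database connections.

SHARDS = {
    "A-M": {},  # would be a connection to the first database instance
    "N-Z": {},  # would be a connection to the second instance
}

def shard_for(last_name: str) -> dict:
    """Pick the shard that owns this customer's data."""
    first = last_name[0].upper()
    return SHARDS["A-M"] if first <= "M" else SHARDS["N-Z"]

def save_customer(last_name: str, record: dict) -> None:
    shard_for(last_name)[last_name] = record

def load_customer(last_name: str) -> dict:
    return shard_for(last_name)[last_name]

save_customer("Adams", {"orders": 1})
save_customer("Nguyen", {"orders": 2})
```

Note that any query spanning both shards – a count of all customers, say – now requires application code to query each shard separately and merge the results, which is exactly the work an MPP system automates.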
As you explore different MPP architectures you’ll need to understand how a vendor segments the data across the nodes in the system as well as how it segments the workload. Both are necessary to accomplish the maximum degree of parallelization in the system. If the data is not evenly distributed across the nodes, it can affect the speed of the queries. For example, suppose you have a cluster consisting of five nodes. If 80% of your data is on one node and 5% of the data is on each of the other nodes, it’s obvious that a query would have to wait for the node with 80% of the data to finish its work while the nodes with only 5% of the data probably sat idle. If the data were split evenly with 20% on each node, it’s likely that all five nodes would finish their work at approximately the same time and considerably faster than the unbalanced arrangement.
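A common way to achieve that even split is to hash a distribution key and assign each row to a node based on the hash value. This hypothetical sketch (not any particular vendor's algorithm) shows 10,000 customer IDs landing roughly evenly across five nodes:

```python
# Hash-based distribution: each row is assigned to a node by hashing
# its distribution key, which tends to spread rows evenly and avoid
# the skewed 80%/5% scenario described above.
import hashlib

NUM_NODES = 5

def node_for(key: str) -> int:
    """Map a distribution key to one of the nodes."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

rows_per_node = [0] * NUM_NODES
for customer_id in range(10_000):
    rows_per_node[node_for(str(customer_id))] += 1

# Each node receives roughly 10,000 / 5 = 2,000 rows.
```

The choice of key matters as much as the hash: hashing a low-cardinality column (such as a status code with three values) would leave most nodes empty no matter how good the hash function is.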
When it comes to how the workload is segmented across the nodes, the data distribution dictates which nodes must be involved in queries, but vendors have adopted different styles of workload distribution. MPP systems can be designed with either homogeneous or heterogeneous workload distribution. In the case of homogeneous nodes, the workload – whether loading or querying data – is distributed to all nodes in the system. In the case of heterogeneous nodes, each node is designated for a specific type of workload. For instance, queries are directed to one set of nodes and load processes are directed to a different set of nodes.
It is not obvious that one approach is better than the other. However, in the case of homogeneous nodes, the burden is on the software vendor to design a system that can ensure an optimized and balanced workload. In the case of heterogeneous nodes, the burden shifts somewhat to the enterprise deploying the system: If the load processes are slow, you can alter the number of nodes dedicated to loading, either by adding nodes to the system or by redesignating some of the query nodes for data loading. I like to think of this trade-off as the difference between an automatic transmission and a standard transmission. Some people like the control offered by the standard transmission, while others don’t want to have to think about shifting.
Another issue to understand is where queries are initiated within the system. The initiation process can become a bottleneck if all queries must funnel through one or a few nodes, because the node initiating a query has extra responsibilities: Some amount of post-processing by the query initiator is necessary. For instance, if the user requests the top 10 products based on sales, each node produces its list of the top 10 within its segment of the data and ships the results to the initiator. The initiator then compares the top-10 result sets to come up with the top 10 overall. Ideally, the bulk of the work is done in parallel on each of the nodes and only a small amount of work is done on the aggregated results on the initiating node.
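The top-10 example above follows a scatter-gather pattern, which can be sketched as follows (a simplified illustration; the segment contents are made up, and a real system would run the per-node step on separate machines rather than in one process):

```python
# Scatter-gather "top 10" query: each node computes a local top 10
# over its own segment, and the initiator merges the partial results.
import heapq

def local_top10(segment: dict) -> list:
    """Scatter step: runs in parallel on each node, over local data only."""
    return heapq.nlargest(10, segment.items(), key=lambda kv: kv[1])

def initiator_merge(partials: list) -> list:
    """Gather step: small final merge on the initiating node."""
    merged = [item for part in partials for item in part]
    return heapq.nlargest(10, merged, key=lambda kv: kv[1])

# Sales by product, segmented across three nodes.
segments = [
    {f"p{i}": i * 10 for i in range(0, 20)},
    {f"p{i}": i * 10 for i in range(20, 40)},
    {f"p{i}": i * 10 for i in range(40, 60)},
]
top10 = initiator_merge([local_top10(s) for s in segments])
```

The key property is that each node ships at most 10 rows to the initiator, no matter how large its segment is, so the serial portion of the work stays small.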
Unfortunately it’s not always that simple, and the load on the initiator can become significant. Not all problems can be neatly divided, performed on a specific segment of the data and then combined at the end of the process for some finishing touches. Some calculations require data to move back and forth across multiple nodes. For example, suppose data is segmented across nodes by customer ID and you want to calculate each customer’s revenue as a percentage of the best-selling product in each region, state or city – in other words, which customers are contributing the most to the products that are selling well in that area. The data for one customer may all be on the same node, but to calculate total product sales in each region, state or city you have to look at sales from all the nodes, since any customer could have made a purchase in any region.
The simplest approach is to ship all the product sales data to the query initiator and have the initiator perform the rest of the processing. While this approach may occasionally be the best strategy, it usually is not. The query optimizer needs to evaluate a series of trade-offs to get the best response time for the query. Should it ship all the data to one node and do all the processing there? While not highly parallel, that strategy might work if the total amount of data that needs to be combined is relatively small – for example, if the sales are restricted to just a few products and a few time periods.
Alternatively, should the query optimizer ship some portion of data to each of the nodes to speed up processing? This scenario might work if there is a relatively small set of data that can be constructed (for example, product sales by region) that would be combined with large amounts of data on each node. These examples oversimplify the complexities that the optimizers must deal with. For examples of some of the considerations in MPP query plan optimization and the way some vendors have tackled these issues, see these posts by EMC Greenplum and Vertica.
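As a rough illustration of that second strategy, the sketch below computes a small aggregate once, “broadcasts” a copy to every node and lets each node finish its join against large local data in parallel. All data, node structures and function names here are hypothetical:

```python
# Broadcast strategy sketch: a small intermediate result (total sales
# per region) is computed once and shipped to every node; each node
# then joins its large local data against that small table locally.

def region_totals(all_nodes: list) -> dict:
    """Build the small broadcast table from every node's sales."""
    totals: dict = {}
    for node in all_nodes:
        for sale in node:
            totals[sale["region"]] = totals.get(sale["region"], 0.0) + sale["amount"]
    return totals

def local_join(node: list, broadcast: dict) -> list:
    """Runs on each node: local rows joined to the broadcast table."""
    return [
        {**sale, "pct_of_region": sale["amount"] / broadcast[sale["region"]]}
        for sale in node
    ]

# Two nodes, segmented by customer; both customers bought in "east".
nodes = [
    [{"customer": "c1", "region": "east", "amount": 30.0}],
    [{"customer": "c2", "region": "east", "amount": 70.0}],
]
totals = region_totals(nodes)
results = [row for n in nodes for row in local_join(n, totals)]
```

The trade-off the optimizer weighs is the cost of shipping the broadcast table to every node versus the cost of shuffling the much larger detail data between nodes.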
Queries can be much more complicated than the examples given above; data distribution schemes vary and the speed of moving data across nodes is affected by interconnect speeds, storage medium and volume of data to be moved. Query optimizers need to consider many points and must have algorithms that produce efficient plans.
An IT organization considering an MPP database product needs to understand a few things in order to evaluate and maximize the performance of different database systems, starting with the data distribution scheme. Typically, the person designing the database implementation must provide some guidance or information to the MPP database product (such as which keys or columns to use in its distribution scheme) to help it distribute data in a way that produces a highly parallel and evenly distributed workload.
I recommend that you test your typical query workload against the database to see what kind of results you get. If your workload never changes, this approach may be sufficient. However, if your workload has an ad-hoc component or if it changes over time, odds are you will encounter some sort of challenge in query tuning sooner or later. In that case, you need to understand the types of techniques the optimizer uses in its query plan design.
Here are some of the issues to consider:
• Where do queries get initiated? Are there enough nodes processing queries?
• Similarly, are there enough nodes loading data?
• How does data loading impact query processing? If loading and query nodes are separate, conflicts between the two may be easier to understand and address.
• How much data and which data (if any) should be replicated across nodes? Keeping frequently accessed data on all nodes can minimize the movement of data across nodes and speed processing.
• Can redundant copies of the data (perhaps used to achieve high availability) be used in the query plans to minimize data movement across nodes and facilitate more local joins?
• What tools are available to you to understand the query execution plans? Can you use these tools to identify the bottlenecks?
The bottom line is that you need to understand the “secrets” of the MPP database systems you are considering or using. One is the data distribution strategy and the extent to which you can or need to influence it. Another key is the workload distribution strategy and what causes it difficulties. Evaluate those difficulties and determine how they might affect your typical workloads. Finally, there are additional tricks or techniques that can be used to maximize parallelization and minimize data movement. Evaluate whether those additional techniques are well documented and easily applied when needed.
David Menninger – VP & Research Director