December 11, 2015

OLAP Cubing for Big Data

Authored by Pradeep Srinivasa, Santosh Kumar and Sarnath Kannan

OLAP (Online Analytical Processing) cubes are used predominantly to view and analyze business data across multiple dimensions and to gather insights that help define a business strategy. As data grows in volume and velocity, building and managing these cubes requires a Big Data solution. In this blog, we discuss our approach to building a Big Data solution for OLAP processing of large datasets residing in HDFS and for serving the resulting cubes, and we compare our results with Apache Kylin [1], an open source distributed analytics engine.

If you like our work, please help us present our work at the Hadoop Summit 2016 by voting for our idea here.

What is an OLAP Cube?

According to the definition on the olap.com website [2], "An OLAP (Online Analytical Processing) Cube is a data structure that allows fast analysis of data according to the multiple dimensions that define a business problem. A multidimensional cube for reporting sales might be, for example, composed of 7 Dimensions: Salesperson, Sales Amount, Region, Product, Region, Month, and Year."

If the above definition went over your head, imagine that you are a senior manager of sales and you want a view of what your team has done this year. So you ask your BA (Business Analyst) to build dashboards showcasing sales performance. Your BA gets hold of a BI (Business Intelligence) tool and builds dashboards such as sales by region for a year (e.g. sales in Texas for 2015, sales in Arizona for 2015, etc.), sales by region for all 12 months of the year, sales of a particular product in Texas in May 2015, and so on. If you pause and think of these dashboards in terms of SQL, they are all GROUP BY statements with some aggregations projected out. As your dataset grows, these SQL statements take longer and longer to compute and your BI tool is no longer interactive. To alleviate this performance issue, people started creating data cubes, which are nothing but aggregations of data across combinations of different dimensions that can then be queried directly. Microsoft came up with MDX to formalize access to cubes (MDX is to data cubes what SQL is to raw data).
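
To make this concrete, here is a minimal SQL sketch of those dashboard queries, assuming a hypothetical sales table with Region, ProductID, Year, Month and Amount columns:

    -- Sales by region for 2015
    SELECT Region, SUM(Amount) AS TotalSales FROM sales WHERE Year = 2015 GROUP BY Region;

    -- Sales by region for every month of 2015
    SELECT Region, Month, SUM(Amount) AS TotalSales FROM sales WHERE Year = 2015 GROUP BY Region, Month;

    -- Sales of one product (a made-up ProductID) in Texas in May 2015
    SELECT SUM(Amount) AS TotalSales FROM sales
    WHERE Region = 'Texas' AND ProductID = 101 AND Year = 2015 AND Month = 5;

Every such query scans the raw table; a data cube pre-computes these GROUP BY results once so that the dashboard only has to look them up.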

OLAP on Big Data

Coming back to our discussion: to build an Online Analytical Processing (OLAP) cube, we need a specification that defines the data source from which to build the cube, the dimensions and metrics present in the dataset, the combinations of dimensions to be used for building the cube, the metrics to be aggregated, and so on. Once the cubes are built, they need to be served from a fast data store. The entire process, from cube building through cube storage to querying, must be scalable.

Cube Specification

In our solution we define the specification in an XML file with the following sections. The examples given below are early representations intended to communicate the idea; we are in the process of formalizing the specification.

  1. Data Source Section – The input for the aggregation can currently be a CSV file, a Hive table, a Hive query or a Hive view.

    Our current design abstracts the data source component and hence is easily extensible to other data sources as long as they can provide an input-format, HDFS path and a column de-serializer.

    Fig. 1: Cube Specification - Data Source Section

  2. Columns Section – This section describes the columns in the dataset that are to be used. While columns can be discovered from the Hive metastore, they are difficult to discover for sources like SQL queries. We differentiate columns as atomic and non-atomic: atomic columns are non-decomposable, whereas non-atomic columns are decomposable. For example, a timestamp can be decomposed into year, month, etc. The user can define any user-defined Java class as the decomposer.

    In the future, we plan to support creating derived dimensions as a function of many columns and derived dimensions. This will allow for a very generic, powerful and flexible usage.

    Fig. 2: Cube Specification - Columns Section - Column Decomposition

  3. Aggregation Section – This specifies the type of aggregation along with the columns to be grouped and the metrics to be computed. The different aggregation types are listed below; we use the "SUM" operation for illustration, and a rough SQL analogy for these types is sketched after the list. In the future, we also plan to support custom aggregation functions specified as static functions that are both associative and commutative.
    • Simple Aggregation – This is the basic type of aggregation. It aggregates the metrics specified in the "Using" clause, grouped by all the columns/dimensions mentioned in the configuration.

      Fig. 3: Cube Specification - Aggregation Section - Simple Aggregation

    • Drilldown Aggregation – This specifies an array of aggregations needed to build a drill-down across the listed dimensions. The order in which the dimensions are specified defines the drill-down hierarchy; our system makes no implicit assumptions about hierarchies. The total number of aggregations equals the number of hierarchy levels.

      Fig. 4: Cube Specification - Aggregation Section - Drilldown Aggregation

    • Capped-PowerSet Aggregation – This specifies a list of aggregations that form the power set of all the listed dimensions; for N dimensions the total number of aggregations is 2^N. Capped-PowerSet marks certain dimensions as required in order to avoid computing meaningless combinations (e.g. the {Year, Month} combination is meaningless unless it is accompanied by a ProductID or similar).

      Fig. 5: Cube Specification - Aggregation Section - Capped-PowerSet Aggregation

    • Join Aggregations [Future] – This specifies a list of aggregations that are created by the cross-product of the groupings listed by the referred aggregations. The metric dimensions are not specified; instead, they are assumed to be the union of those of the referred aggregations. To filter out unwanted dimensions, we plan to introduce a new clause. After that filtering is done, the result is passed through the set of "Required" dimensions, which further filters the joined aggregations.
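
To make the aggregation types above concrete, here is the rough SQL (HiveQL) analogy promised earlier. This is only a sketch of the semantics, not how our engine executes the aggregations, and the table and column names are hypothetical (Year and Month stand for columns obtained by decomposing a timestamp, as in the Columns Section):

    -- Simple Aggregation: a single grouping over the listed dimensions
    SELECT Region, ProductID, SUM(SalesAmount) AS TotalSales
    FROM sales GROUP BY Region, ProductID;

    -- Drilldown Aggregation over the hierarchy Region > ProductID:
    -- one aggregation per hierarchy level, i.e. (Region) and (Region, ProductID)
    SELECT Region, ProductID, SUM(SalesAmount) AS TotalSales
    FROM sales GROUP BY Region, ProductID GROUPING SETS ((Region), (Region, ProductID));

    -- Capped-PowerSet over {ProductID, Year, Month} with ProductID required:
    -- conceptually the full 2^3 groupings of a CUBE, minus every grouping that lacks ProductID
    SELECT ProductID, Year, Month, SUM(SalesAmount) AS TotalSales
    FROM sales GROUP BY ProductID, Year, Month
    GROUPING SETS ((ProductID), (ProductID, Year), (ProductID, Month), (ProductID, Year, Month));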

Cube Building – Map Reduce

Once we have the specification for the cube, we compute all the aggregations through a single Map-Reduce program in one shot (very much like how Hive computes GROUPING SETS, we believe). The idea of emitting multiple outputs from a single MR job formed the basis of our Big Data OLAP solution.

Cube Storage – ElasticSearch

The computed results are then stored as key-value pairs in ElasticSearch [3], which indexes all the data in order to serve queries fast. The idea here is to represent each aggregation as a document within ElasticSearch. The fields of the document are the different dimensions used for grouping; the metrics and other meta-information about the aggregation can also be represented as document fields.

Operations like slicing and dicing boil down to simple search queries over the REST API. ElasticSearch provides a JSON-based DSL (Domain Specific Language) for querying and filtering. The figure below shows how ElasticSearch filters can be used for slicing and dicing. These filters must be aimed at the ElasticSearch index that contains the cube of interest.
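
Logically, a slice pins one dimension to a single value while a dice restricts several dimensions to subsets. Over a pre-computed cube with hypothetical dimensions Region and ProductID, the equivalent SQL would be:

    -- Slice: fix one dimension to a single value
    SELECT Region, ProductID, TotalSales FROM sales_cube WHERE Region = 'Texas';

    -- Dice: restrict several dimensions to subsets of values
    SELECT Region, ProductID, TotalSales FROM sales_cube
    WHERE Region IN ('Texas', 'Arizona') AND ProductID IN (101, 102);

In our setup, such equality and membership conditions map naturally onto ElasticSearch term filters against the index that holds the cube.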

We are currently working on enabling SQL support, which will accelerate queries by routing them to the pre-computed data cubes. This will allow existing BI tools to connect seamlessly to data sources without knowing about the existence of the cubes.

Apache Kylin

Our work is essentially similar to what Apache Kylin does. Kylin uses HBase as its store and relies on carefully designed row-keys for searching data in the cubes. If we understand it right, the row-keys are made up of a bitmask representing the dimensions that are grouped, followed by the values of each of those dimensions. The values corresponding to a row-key are the different metrics calculated for that combination of dimensions.

In our opinion, row-key based search in HBase is essentially a search over lexicographically ordered data, and this can cause unnecessary lags in OLAP cube search (especially when you are slicing and dicing the cube). For example, say we want to find all words in an English dictionary whose second letter is 'a'. We still need to go through every chapter of the dictionary, and inside each chapter we still need to scan until we find our results. Our solution instead uses a search mechanism powered by an inverted index (courtesy of ElasticSearch). An inverted index does not require such near-full scans and should be able to retrieve data much faster. ElasticSearch lifts this burden from us, and additionally we don't have to worry about storage, dictionary building, compression, the search process or the search API. This makes our cube engine very lightweight.

We wanted to put our notions to the test. Does row-key based search introduce lags in search timing? If so, does an inverted-index based approach fare any better?

Experiment

Test Setup

We used an 8-node Hadoop cluster for our tests. Each node has 4 cores and 3 GB RAM. HBase runs on 3 of these nodes and ElasticSearch runs in standalone mode on 1 machine.

 

Data Set

We used a synthetic dataset: a modest table with 2 dimensions (ProductID and BranchID), 1 metric (Quantity) and 10 million rows. There were 1,000 distinct ProductIDs and 1,000 distinct BranchIDs.

 

Cube Spec

We built a cube combining "ProductID" and "BranchID". This means there are up to 1,000 × 1,000 = 1 million combinations of ProductID and BranchID. This cube was built using our solution as well as Apache Kylin.
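
In SQL terms, the cube we materialize is essentially the result of the following aggregation (roughly 1 million output rows):

    SELECT ProductID, BranchID, SUM(Quantity) AS TotalQuantity
    FROM TEST_TABLE GROUP BY ProductID, BranchID;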

Storage Result

Apache Kylin reports a cube size of 192 MB (as shown in its portal for the MEDIUM cube size), while ElasticSearch reports 55 MB. However, this is not an apples-to-apples comparison: we store only the SUM, whereas Kylin stores both SUM and COUNT. We are including this information just for the sake of reporting, so take it with a pinch of salt.

Random Query Test

To test the sensitivity of row-key based search to the sliced dimension, we used the following 2 SQL queries.

Query #1: SELECT ProductID, SUM(Quantity) FROM TEST_TABLE WHERE ProductID = <pid> GROUP BY ProductID, BranchID

Query #2: SELECT BranchID, SUM(Quantity) FROM TEST_TABLE WHERE BranchID = <bid> GROUP BY ProductID, BranchID

For Kylin, we submitted the SQL queries directly to its REST API. For our solution, we converted the SQL into ElasticSearch queries and invoked the ElasticSearch REST API.

We ran each query for 100 different random ProductIDs and BranchIDs via "curl --output /dev/null" from a script and timed it using the "time" command. While this measurement includes several overheads besides the query time itself, it was sufficient to test our hypothesis.

Random Query Result

The graph below shows the time taken in milliseconds for the 100 queries. The top 2 lines belong to Kylin and the bottom ones to ElasticSearch. The graph shows the fluctuation in performance introduced by the row-key based approach; you can see that the ElasticSearch-based approach does not show such fluctuation. We do not want to comment yet on the apparent speed advantage of the ElasticSearch-based approach over Kylin: Kylin carries an SQL overhead, and we will make that comparison apples-to-apples when our SQL wrappers are ready. We hope that we will still be faster than Kylin.

Fig. 6: Query Performance of Kylin vs Our Solution

Okay, you came all the way to the end. If you liked our work, please help us present it at the Hadoop Summit 2016 by voting for our idea here.

References

  1. Apache Kylin: http://kylin.apache.org/
  2. OLAP Definition: http://olap.com/learn-bi-olap/olap-bi-definitions/olap-cube/
  3. ElasticSearch: https://www.elastic.co/