Pivot analysis is an essential and integral component for many business enterprise reporting. This type of analysis has been in existence for nearly 3 decades and tools like MS Excel has democratized the availability of this feature, making it even more popular with just a simple drag and drop. Pivot tables are often seen as a form of simplifying the complex OLAP (Online Analytics Processing) analysis and it is of no surprise that Microsoft went ahead and trademarked the compound term – PivotTable, given its popularity.
Engineering aspect of Pivot
On a simpler definition, pivoting involves picking up the distinct values of a given column of choice in a table, name them as new column label/dimension, and perform the preferred summarization/aggregation over the factual data. This operation requires the following engineering steps involved:
Identify unique/distinct property of a single or a set of columns
Create new set of column labels.
Execute standard aggregation / custom aggregation function
Re-plot the existing tables with the new aggregations
Challenges of Pivot Analysis in Big Data
The above engineering might look so simple in MSExcel, but it requires following data engineering capabilities:
Performing a permutation and combination of unique values across different columns in case of multi-column selections
Performing data aggregations – e.g:Sum,Avg
Performing data distinct value identifications
In addition to above challenges, the real world business applications impose multi-dimensional challenges such as:
The pivot query often may vary in terms of column choices from one analysis to another analysis, hence the above engineering steps may have to repeated for each query in a standard OLAP query – similar challenges are evinced in Ad-hoc query use cases.
Large volume of data in Terabytes, Petabytes.
Support in SQL on Hadoop – SQL tools in Bigdata shall support this feature.
The above problems are getting really complex in the big data scenario, as there were no proper solution / framework available.
Let’s take a sample retail dataset.
If the developer wants to pivot the dataset by Category and Product columns, then the pivoting result should look like the below table,
In the case of pivoting only with one column i.e. Product, then the result should like below,
As stated above, Hive doesn’t support pivoting a table. We can achieve this pivoting by a couple of ways,
CROSS TAB Query
One method is crosstab query; the simplest way to do this is to use the CASE function. Query to pivoting the data based on category and product is shown in below,
The crosstab query construction can become complex when the number of columns in SELECT clause varies in nature, and one has to know the unique column values are to be used (e.g.: Electronics for category column) for Pivoting before issuing Hive query. However, the challenges lies with determining the values of columns on the fly and issue dynamically generated query based on pivoting needs. Another major disadvantage of crosstab query is the length of SQL query and performance issues.
UDAF based Techniques
Another way is to achieve this pivoting in Hive is using User Defined Aggregation Function technique (thanks to Brickhouse collect UDAF ). This UDAF Aggregates values into a list or a map. For pivoting, we can look up the category and product value from the grouping map.
For using this Collect UDAF, first we have to add the Brickhouse jar file into the Hive class path.
An example query for pivoting the Sales data by category and product is shown in below,
We found the above two procedures were suited in our implementations, and we hope these techniques would help to realize Pivoting needs in a similar scenario.
We also observed Apache Spark has the ability to pivot data, creating pivot tables, with a DataFrame.
The above procedures are NOT our endorsements to fit for every kind of your Pivot requirements, and we advise the developers to take a thorough due-diligence before they decide to incorporate the above approaches in their solution.