Polybase is a technology that is designed to query data from Hadoop Distributed File System. It works by integrating SQL Server Parallel Data Warehouse and supports SQL queries to read and write data from HDFS.
In simple, we can create an external PDW table that reference Hadoop data and then write SQL statement to query data from HDFS. With this approach it provides capabilities to get data by joining native relational tables and Hadoop data.
Let’s have a look on the benefits of using PolyBase:
- No need to understand Map Reduce that is mostly written in Java technologies. Write simple queries to process and load data from HDFS.
- We can connect it (PWD) with Visual Studio SQL server object Explorer and from SQL Server 2014.
- It gives tremendous high performance in data reading, writing due to PDW’s parallel data processing capabilities. It creates “HDFS bridge” to query multiple Hadoop data nodes concurrently.
- It has capabilities to fetch data directly from HDFS and bypass MapReduce (as most of Hadoop’s components like Hive, Sqoop instead use MapReduce to extract data). So PolyBase works in cost effective manners due to its intelligence to decide when and where to use MapReduce and when to directly access HDFS.
- It can be integrated with BI tools like Power Pivot, Power View, and SSRS etc.
- No extra burden to learn Hadoop, MapReduce or Java technology. As most of the developers already know SQL queries and can work easily with it.
CREATE HADOOP CLUSTER GSL_HDFS_CLUSTER WITH (namenode=‘localhost’, nnport=9000 jobtracker=‘localhost’, jtport = 9010); CREATE HADOOP FILEFORMAT TEXT_FORMAT WITH (INPUTFORMAT = 'org.apache.hadoop.mapreduce.lib.input.TextInputFormat', OUTPUTFORMAT = 'org.apache.hadoop.mapreduce.lib.output.TextOutputFormat', ROW_DELIMITER = '0x7c0x0d0x0a', COLUMN_DELIMITER = '0x7c‘); CREATE EXTERNAL TABLE hdfsCustomer ( c_custkey bigint not null, c_name varchar(25) not null, c_address varchar(40) not null, c_nationkey integer not null, … ) WITH (LOCATION =hdfs('/tpch1gb/customer.tbl’, GSL_HDFS_CLUSTER, TEXT_FORMAT);
- Three times faster than Hive and Sqoop, due to PDW’s parallel query execution framework.
Limitations and drawbacks with PolyBase:
- PolyBase does not support DML and has dependency with PDW external table.
- Works with Microsoft technologies and is not an open source like Hadoop, Hive, HBase.
- Can read or write data on HDFS (using SQL statements) but doesn’t have the capability to write custom logics or apps (as we can in MapReduce program).
- Till now no option to work/integrate with SQL Azure.