July 18, 2014

950 Views

PolyBase: A Superior Alternative to process and query data of Hadoop

Polybase is a technology that is designed to query data from Hadoop Distributed File System. It works by integrating SQL Server Parallel Data Warehouse and supports SQL queries to read and write data from HDFS.

In simple, we can create an external PDW table that reference Hadoop data and then write SQL statement to query data from HDFS. With this approach it provides capabilities to get data by joining native relational tables and Hadoop data.

Let’s have a look on the benefits of using PolyBase:

  • No need to understand Map Reduce that is mostly written in Java technologies. Write simple queries to process and load data from HDFS.
  • We can connect it (PWD) with Visual Studio SQL server object Explorer and from SQL Server 2014.
  • It gives tremendous high performance in data reading, writing due to PDW’s parallel data processing capabilities. It creates “HDFS bridge” to query multiple Hadoop data nodes concurrently.
  • It has capabilities to fetch data directly from HDFS and bypass MapReduce (as most of Hadoop’s components like Hive, Sqoop instead use MapReduce to extract data). So PolyBase works in cost effective manners due to its intelligence to decide when and where to use MapReduce and when to directly access HDFS.
  • It can be integrated with BI tools like Power Pivot, Power View, and SSRS etc.
  • No extra burden to learn Hadoop, MapReduce or Java technology. As most of the developers already know SQL queries and can work easily with it.
    Syntax examples:
    CREATE HADOOP CLUSTER  GSL_HDFS_CLUSTER
    WITH (namenode=‘localhost’, nnport=9000
          jobtracker=‘localhost’, jtport = 9010);
    CREATE HADOOP FILEFORMAT TEXT_FORMAT 
    WITH (INPUTFORMAT = 'org.apache.hadoop.mapreduce.lib.input.TextInputFormat', 
          OUTPUTFORMAT = 'org.apache.hadoop.mapreduce.lib.output.TextOutputFormat',
          ROW_DELIMITER = '0x7c0x0d0x0a',   
          COLUMN_DELIMITER = '0x7c‘);
    CREATE EXTERNAL TABLE hdfsCustomer
      ( c_custkey       bigint not null,    
        c_name          varchar(25) not null,
        c_address       varchar(40) not null,
        c_nationkey     integer not null,
        …
      )
    WITH (LOCATION =hdfs('/tpch1gb/customer.tbl’, GSL_HDFS_CLUSTER, TEXT_FORMAT);
    
  • Three times faster than Hive and Sqoop, due to PDW’s parallel query execution framework.

Limitations and drawbacks with PolyBase:

  • PolyBase does not support DML and has dependency with PDW external table.
  • Works with Microsoft technologies and is not an open source like Hadoop, Hive, HBase.
  • Can read or write data on HDFS (using SQL statements) but doesn’t have the capability to write custom logics or apps (as we can in MapReduce program).
  • Till now no option to work/integrate with SQL Azure.

References: