This blog post will provide some insights into MuleSoft's integration options with Azure HDInsight and Collibra. Hive is an application running via the Hadoop framework, providing a SQL-like interface known as the Hive Query Language (HQL) for data retrieval or processing. Collibra is an enterprise-oriented data governance platform that allows enterprises to organize and manage data assets and their related policies and rules.
MuleSoft provides a simple and easy-to-use connector for Collibra that includes data governance operations which can be ingested into Collibra. The connector comes with a simple configuration option to select a combination of operations and asset type IDs enabling faster delivery of APIs. The connector also supports the bulk insertion of data assets and preserves data relationships across assets. It eliminates the need for point-to-point connectivity, as these are available as reusable APIs.
Data governance is a centralized control mechanism that enables the availability, protection, and integrity of the organization's business and technological assets (data). Data governance must be implemented across people, processes, and technologies. Another goal of data governance is to ensure that a consistent mapping of business and technological assets is carried out and that it is always accessible for easy search.
A 'data governance council' is formed to oversee people and process-related data governance activities. Collibra specializes in data governance technology solutions across industries that enable organizations to ensure that their corporate data meets regulatory standards and provides users with auditing and tracking capabilities.
The different types of assets include technology, data, business, and policy which are distributed across multiple data sources such as Hive, Qlikview, Datameer, and Informatica. As Collibra uses MuleSoft integration platform, it is necessary to aggregate metadata from all these data sources and ingest the same into the tool to be available as Collibra assets.
The diagram below shows the overall Mulesoft Azure integration approach. The Hive System APIs are used to extract metadata from the HDInsight Cluster using the WebHCat REST API. The Collibra System API ingests extracted metadata through the upsert operation of the Mule Collibra connector. A process API triggers the entire flow through a scheduler.
WebHCat provides a list of https://cwiki.apache.org/confluence/display/Hive/WebHCat+Reference DDL (data definition language) resources. The API provides a SQL-like interface to interact with the Hadoop cluster.
MuleSoft integration with Azure HDInsight
This section illustrates the steps to invoke the Azure HDInsight API to retrieve the database, table, and columns asset information.
Provisioning an HDInsight cluster on Azure cloud
The link provides a way of provisioning an HDInsight Hive cluster on Azure Cloud https://docs.microsoft.com/en-us/azure/hdinsight/hadoop/apache-hadoop-linux-create-cluster-get-started-portal
Once the cluster is configured and running, the cluster size with the header and the worker nodes must be indicated along with the status as 'running'.
Configure HTTP Request connector on Mule
Configure the HTTP Request Connection to specify the HDInsight host and port location, as shown in the snapshot below. The authentication is provided through the basic authentication option with credentials. The configuration is also provided below.
Query database names from Hive
The Mule flow returns a list of all databases within the Hadoop cluster as shown in the postman response below.
The snapshot shows the configuration of the WebHCat REST API on mule.
The response from Postman showing the list of database names within the HDInsight cluster.
Query tables from Hive
The Mule flow returns a list of all tables from the Hive database as shown in the postman response below. In this case, the request is sent with database name as 'default'.
The response from postman shows a list of tables within the database ‘default’.
Query columns from Hive
The Mule flow returns a list of all columns for a specified database and table as shown in the postman response below. In this case, the request is sent with database name as 'default' and table name as 'my_current_table_v1'.
The response from postman shows the list of columns in the table ‘my_current_table_v1’
MuleSoft integration with Collibra
This section illustrates the steps to configure the Collibra connector and use the bulk upsert operation for each of the assets.
Configure the DGC connector
Collibra offers a DGC connector for Mule known as 'Collibra Connect'. The snapshot below shows the configuration details needed such as the credentials and base application URL.
The following sections demonstrate how to upsert the Collibra to database, row, column assets after Hive's retrieval.
Upsert database assets
Collibra connector has a bulk upsert functionality that has been leveraged for this implementation. We will choose the option "Upsert Assets by external entity ID" for upserting database properties. In this case, we must include an external system ID ('hdInsighthive' in this case), resource type Id (database) and Collibra domain ID (referenced from global property).
The transformation needed for payload before calling the Collibra connector's upsert operation is given in the snapshot.
To enable the creation of database assets in Collibra, the name or ID of the default domain and the external system ID is provided as shown in the snapshot below.
Upsert table assets
The snapshots shows the configuration used for upserting the table assets. Care must be taken to specify the external system ID as shown below.
To enable the creation of table assets in Collibra, the ‘Asset Type Id’ must have ‘Table’ selected along with the ‘Default Domain Id’ and ‘external system id’ as shown in the snapshot below.
Upsert column assets
The snapshot shows the configuration details for upserting the column assets. Here, specifying the external system id is mandatory.
To enable the upsert of column assets in Collibra, the ‘Asset Type Id’ must have ‘Column’ selected along with the ‘Default Domain Id’ and ‘External System Id’ as shown in the snapshot below.
This post provides several options for MuleSoft to integrate with various systems. In the absence of a connector as in the case of Azure HDInsight Hive, we can use the available REST API which can be easily invoked from MuleSoft. In the case of integration with Collibra, a connector was leveraged to make the bulk upsert of assets to the data governance platform.