
Dynamic Ingestion & Curate Framework using Talend (Sketch-t)
Kuntal Chowdhury Senior Technical Architect, Digital & Analytics | November 11, 2020

Summary of the Solution Framework

In a traditional ETL or data warehouse solution, data must be ingested into the data lake from various source systems and then cleansed before it can be processed further by downstream applications. Additionally, data migration from on-premises systems to the cloud is becoming increasingly popular. The idea is to leverage a data warehouse solution built with Talend to ingest data from any structured or semi-structured data source into any destination simply by adding metadata information to a metadata file/table.

The Talend-based accelerator provides a customized, metadata-driven framework for data processing.

Business Benefits

  • This data ingestion framework reduces development and data migration costs compared to traditional ETL methods and directly increases the productivity of the IT team (ETL development and support)
  • One-step recovery from the point of failure using ETL job load statistics; issues are easier to troubleshoot and fix than with the traditional approach
  • The configurable metadata is easy to maintain because this approach requires fewer ETL jobs
  • The accelerator supports schema evolution: a change in the schema of an existing feed has no impact on the solution framework, eliminating the need for code changes and reducing the effort required for data migration and change management
  • Supports data ingestion as well as data cleansing, with configurable rules kept in the data store or in the Drools engine depending on rule complexity
  • Job run statistics are captured for monitoring and auditing purposes and can be used to create dashboards for reporting and visualization
  • No dependency on a Talend SME to use this accelerator
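To make the job-run-statistics bullet above concrete, here is a minimal Python sketch of how one audit record could be assembled per run. The field names are hypothetical illustrations, not the accelerator's actual audit table schema:

```python
from datetime import datetime, timezone

def build_audit_record(job_name, source, target, rows_read, rows_written, status):
    """Assemble one job-run statistics record (hypothetical fields)
    of the kind a framework could persist for monitoring dashboards."""
    return {
        "job_name": job_name,
        "source": source,
        "target": target,
        "rows_read": rows_read,
        "rows_written": rows_written,
        "status": status,
        "run_ts": datetime.now(timezone.utc).isoformat(),
    }

record = build_audit_record("mysql_to_snowflake", "mydb.employee",
                            "Ingestion.employee_cleansed", 1000, 998, "SUCCESS")
```

Records like this, written at the end of every job run, are what feed the monitoring dashboard described later.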

Solution Framework Overview

Solution Framework
Figure 1: Solution Framework Overview

Sample Metadata File Contents and Description

Here is what a metadata file looks like for ingesting a MySQL table into a Snowflake database:

Field Name          | Description                                          | Sample Value                                   | Optional
Source DB           | DB name of the source                                | mydb                                           |
Source Name         | Source table name                                    | employee                                       |
Source Type         | Type of source to be ingested                        | MySQL                                          |
Target Type         | Type of destination                                  | Snowflake                                      |
Target DB           | DB name of the target                                | Ingestion                                      |
Target Name         | Target table name                                    | employee_cleansed                              |
Extract Query       | Source extraction query                              |                                                | Y
Ingestion Flag      | Should be set for the ingestion to happen            | Y                                              |
Logs to Kafka Flag  | Should be set if the audit logs are sent to a Kafka topic | N                                         |
Parm File           | Contains source and target connection strings        | C:/config/mysql_snowflake_ingestion_parm.txt   |
Delta Flag          | Should be set for incremental data extraction        | N                                              |
Comparison Field    | Should be set if Delta Flag is set to Y              | "last_update_dt|date|YYYYMMDD"                 | Y
Curate Flag         | Should be set if data cleansing is to be done        | Y                                              |
Curate Type         | Should be set if Curate Flag is set to Y             | Basic                                          | Y
Rule File           | A file holding the curate rules                      | C:/data/emp_data_schema_mysql.txt              | Y
Table 1: Metadata File Contents and Description
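To illustrate how the metadata in Table 1 could drive extraction, here is a minimal Python sketch that builds the source query: it falls back to a full table SELECT when Extract Query is empty and appends an incremental filter when Delta Flag is Y. The dict-based metadata row and the `{last_watermark}` placeholder are assumptions for illustration, not the accelerator's actual implementation:

```python
def build_extract_query(meta):
    """Derive the source extraction SQL from one metadata row (a dict
    keyed by the field names of Table 1). The watermark value would
    normally come from the previous successful run's audit statistics."""
    query = meta.get("Extract Query") or f"SELECT * FROM {meta['Source Name']}"
    if meta.get("Delta Flag") == "Y":
        # Comparison Field is encoded as name|type|format,
        # e.g. last_update_dt|date|YYYYMMDD
        field, _ftype, _fmt = meta["Comparison Field"].split("|")
        # {last_watermark} is a placeholder bound at run time
        query += f" WHERE {field} > '{{last_watermark}}'"
    return query

meta = {
    "Source Name": "employee",
    "Delta Flag": "Y",
    "Comparison Field": "last_update_dt|date|YYYYMMDD",
}
```

Calling `build_extract_query(meta)` on the row above yields an incremental query filtered on `last_update_dt`; with Delta Flag set to N it would simply return `SELECT * FROM employee`.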

Monitoring Dashboard

Monitoring Dashboard
Figure 2: Monitoring Dashboard


This framework can be extended to ingest data from any structured data source (an RDBMS such as Oracle or MySQL, local files, FTP server pulls, etc.) and store the data in any destination (AWS S3, Azure ADLS, RDS, etc.). For more advanced requirements, the cleansing rules can be created in a rule engine such as Drools. The ABaC table data can be used to create dashboards in any BI tool, such as Power BI or Tableau, to understand and monitor the KPIs of each daily batch run. The objective of this write-up is to help create an ingestion and cleansing framework in Talend that can be leveraged to on-board any source system feed with minimal or no development changes, reducing the overall time to market (TTM).
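As an illustration of how "Basic" curate rules from the Rule File might be applied, the following Python sketch splits incoming rows into clean and rejected sets. The column|rule line format and the rule names here are assumptions for illustration; in the actual framework the rules live in the data store or in Drools depending on their complexity:

```python
def load_rules(lines):
    """Parse a 'Basic' curate rule file where each non-empty line is
    column|rule (assumed format, e.g. 'email|not_null')."""
    return [tuple(line.strip().split("|")) for line in lines if line.strip()]

# Hypothetical rule names mapped to validation checks
CHECKS = {
    "not_null": lambda v: v not in (None, ""),
    "positive": lambda v: float(v) > 0,
}

def curate(rows, rules):
    """Split rows (dicts) into clean and rejected sets per the rules."""
    clean, rejected = [], []
    for row in rows:
        if all(CHECKS[rule](row.get(col)) for col, rule in rules):
            clean.append(row)
        else:
            rejected.append(row)
    return clean, rejected

rules = load_rules(["email|not_null", "salary|positive"])
good, bad = curate(
    [{"email": "a@x.com", "salary": "50000"}, {"email": "", "salary": "40000"}],
    rules,
)
```

Rejected rows would typically be written to an error table alongside the audit statistics so that failures are visible on the monitoring dashboard.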