Data is growing faster than ever, with the volume of new digital data created on top of what already exists increasing exponentially. Moving data from point X to point Y can be a difficult problem to solve with proprietary tooling. Data sharing is critical today as enterprises look to exchange data securely with customers, suppliers, and partners: for example, a trader may want to publish sales data to its distributor in real time, or a distributor may want to share real-time inventory. Until now, data sharing has been severely limited.
According to one of the renowned technological research and consulting firms, organizations that share data in real time will generate more revenue and deliver more value to the business than those that do not. Real-time data sharing plays a pivotal role in planning more in-depth business strategies and campaigns, and unlocks multiple business benefits.
One of the key challenges enterprises must overcome is the ability to securely share data for analytics, both internally and outside of the organization. By prioritizing secure data-sharing practices as a business capability, analytics leaders will be equipped with the right data at the right time to provide business insights, recommendations, and benefits.
One of the significant issues observed in many organizations is sharing data across distinct systems and across organizational boundaries.
- Consider an automobile engine manufacturer that wants to access engine performance data from all the different automobiles it supplies. Since every automobile company uses a different set of systems to store and manage data, acquiring data from all sources requires a complex setup and close collaboration
- Over the last couple of decades, data-sharing solutions have taken two forms: homegrown tooling (SFTP, SSH) and third-party commercial products, both of which have become exceedingly difficult to manage, maintain, and scale as data requirements grow
- Various opinion polls and surveys conducted by technological research firms confirm that data and analytics organizations that promote real-time data sharing in a dependable, secure, scalable, and optimized manner achieve greater stakeholder engagement than those that do not
Delta Sharing is an open-source protocol created to solve this problem. It is the industry's first open standard for sharing data in a secure manner, allowing users to access that data securely both within and between organizations.
Delta Sharing supports open data formats (not just SQL tables) and can scale to support big data. Because it builds on the Delta Lake storage format, it offers a variety of ways to consume data. Sharing and consuming data from external sources allows for collaboration with customers, establishing new partnerships, and generating new revenue.
Delta Sharing objectives and purpose
- Share real-time/batch data without physically copying it – Today, most enterprise data is stored in cloud data lakes and lakehouse systems. Delta Sharing assists in securely sharing any existing dataset stored in Delta Lake or Apache Parquet format
- Scalability – It leverages the cost and elasticity of cloud storage systems to share massive datasets economically and efficiently
- Highly secure, tracked, and governed – It allows granting, tracking, and auditing of shared data from a single point of access
- Broad client support – Data recipients can connect to a Delta Sharing endpoint directly from pandas, Apache Spark, and other tools with ease, without needing any specific compute platform
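The recipient-side setup behind these objectives is driven by a small "profile" file that Delta Sharing clients read to locate and authenticate to a sharing server. The sketch below parses and validates such a profile with only the standard library; the endpoint URL and token values are made up for illustration, but the field names follow the open protocol's profile-file format.

```python
import json

# A Delta Sharing profile file holds what a recipient needs to connect:
# the sharing server endpoint and a bearer token for authentication.
# (Values below are illustrative, not real credentials.)
PROFILE = """
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "faaie590d541265bcab1f2de9813274b"
}
"""

def load_profile(text):
    """Parse a profile and check the fields a client would require."""
    profile = json.loads(text)
    for field in ("shareCredentialsVersion", "endpoint", "bearerToken"):
        if field not in profile:
            raise ValueError("profile is missing required field: " + field)
    return profile

profile = load_profile(PROFILE)
print(profile["endpoint"])  # -> https://sharing.example.com/delta-sharing/
```

With the real `delta-sharing` Python package, a recipient points such a profile at a table (e.g. `delta_sharing.load_as_pandas("profile.share#share.schema.table")`) and gets a pandas DataFrame back, without running any particular compute platform.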
Delta Sharing ecosystem
How it Works
Delta Sharing is a REST protocol that allows data to be shared across environments without the provider and recipient being on the same cloud platform. It uses popular cloud object stores such as Azure Data Lake Storage, AWS S3, and Google Cloud Storage to securely share large datasets.
The data-sharing process involves Data Providers and Data Recipients. A Delta Lake table is shared as a dataset, which is a collection of Parquet and JSON files: the Parquet files store the data, and the JSON files store the transaction log. The Data Provider decides what data to share and runs a sharing server that implements the Delta Sharing protocol and manages access for Data Recipients. A Data Recipient needs a Delta Sharing client (Apache Spark, Python, Tableau, etc.) that supports the protocol.
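The Parquet-plus-JSON layout above can be made concrete with a toy version of a Delta-style transaction log. This is a deliberately simplified sketch: real commits in `_delta_log` carry many more action types (metadata, protocol, commit info), but the core idea is the same, with `add` actions registering Parquet data files and `remove` actions tombstoning them.

```python
import json

# Two toy commit files, each newline-delimited JSON, as in a Delta log.
COMMIT_00 = "\n".join([
    json.dumps({"add": {"path": "part-0000.snappy.parquet", "size": 1024}}),
    json.dumps({"add": {"path": "part-0001.snappy.parquet", "size": 2048}}),
])
COMMIT_01 = json.dumps({"remove": {"path": "part-0000.snappy.parquet"}})

def live_files(commits):
    """Replay commits in order and return the current set of data files."""
    files = set()
    for commit in commits:
        for line in commit.splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)

print(live_files([COMMIT_00, COMMIT_01]))  # -> ['part-0001.snappy.parquet']
```

Replaying the log this way is what gives every reader a consistent snapshot of the table: the sharing server resolves the current file list from the log before handing anything to a recipient.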
Azure Event/IoT Hubs. Event Hubs and IoT Hub are event producer/consumer services. Each data source sends a stream of events to its associated hub.
Azure Databricks. Databricks is a Spark-based, in-memory distributed analytics engine available on Azure. It reads data from the hubs using the relevant libraries, transforms it, and writes it to the data lake in Delta format using Spark Structured Streaming.
Azure Data Lake Gen 2. Data Lake Storage Gen2 makes Azure Storage the foundation for building enterprise data lakes on Azure. Designed from the start to service multiple petabytes of information while sustaining hundreds of gigabits of throughput, Data Lake Storage Gen2 allows you to easily manage massive amounts of data.
Delta Sharing Server. Delta Sharing is a Linux Foundation open-source framework; its reference server performs the data-sharing activity, implementing the protocol for secure data transfer.
The protocol works like this:
- The Data Recipient's client authenticates to the sharing server via a bearer token or another method and queries a specific table. It can also request a subset of the table's dataset by supplying filter criteria
- The Delta Sharing server validates the client's access, logs the request details, and decides which dataset needs to be shared
- The Delta Sharing server returns pre-signed URLs so the client or data recipient can read the data of the Delta table directly from cloud storage, in parallel
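The pre-signed URLs in the last step work like the time-limited signed links that cloud object stores issue: the holder can fetch exactly one object until the link expires, with no storage credentials involved. The sketch below illustrates the concept with a generic HMAC scheme; the hostname, key, and query parameters are invented for illustration, and real providers (S3, ADLS, GCS) each use their own signing algorithms.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"server-side-signing-key"  # held only by the storage/sharing layer

def presign(path, expires_in=900, now=None):
    """Return a time-limited, tamper-evident URL for one data file.
    Illustrative only -- not a real cloud provider's signing scheme."""
    expiry = int((now if now is not None else time.time()) + expires_in)
    payload = ("%s:%d" % (path, expiry)).encode()
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    query = urlencode({"expires": expiry, "sig": signature})
    return "https://storage.example.com/%s?%s" % (path, query)

def verify(url, now=None):
    """Check signature and expiry before serving the file bytes."""
    base, query = url.split("?", 1)
    path = base[len("https://storage.example.com/"):]
    params = dict(p.split("=", 1) for p in query.split("&"))
    expiry = int(params["expires"])
    payload = ("%s:%d" % (path, expiry)).encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    fresh = (now if now is not None else time.time()) < expiry
    return hmac.compare_digest(expected, params["sig"]) and fresh

url = presign("share1/table1/part-0001.snappy.parquet")
print(verify(url))  # -> True while the link is still fresh
```

Because the links are short-lived and scoped to single files, the sharing server never proxies the data itself: recipients pull the Parquet files straight from cloud storage, which is what makes the transfer parallelizable and cheap.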
Capabilities of the design
- Data providers can share one or more tables, or subsets of tables, as required by data recipients
- Data providers and recipients need not be on the same platform
- Data transfer is quick, low-cost, and parallelizable using the underlying cloud storage
- Data recipients always view data consistently because the data provider performs Atomicity, Consistency, Isolation, and Durability (ACID) transactions on Delta Lake
- Data recipient identity is verified using the provider-issued token before queries against the table are executed
- The Delta Sharing server issues pre-signed URLs so the client or data recipient can read from the Delta table in parallel
- It has built-in integration with Unity Catalog, which provides granular administrative and security controls, making it easy and secure to share data internally or externally
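The "subsets of tables" capability works because the server can prune the file list before issuing pre-signed URLs, using partition metadata from the Delta log, in the spirit of the protocol's predicate hints. The sketch below shows that pruning step over a hypothetical file listing; the paths and partition column are invented for illustration, and a real server would resolve this listing from the transaction log.

```python
# Hypothetical file listing with per-file partition metadata, as a
# sharing server would resolve it from the Delta transaction log.
FILES = [
    {"path": "date=2023-01-01/part-0.parquet", "partitionValues": {"date": "2023-01-01"}},
    {"path": "date=2023-01-02/part-1.parquet", "partitionValues": {"date": "2023-01-02"}},
    {"path": "date=2023-01-03/part-2.parquet", "partitionValues": {"date": "2023-01-03"}},
]

def prune(files, column, value):
    """Keep only files whose partition value matches; the recipient then
    downloads just these via their pre-signed URLs."""
    return [f["path"] for f in files if f["partitionValues"].get(column) == value]

print(prune(FILES, "date", "2023-01-02"))  # -> ['date=2023-01-02/part-1.parquet']
```

Pruning on the server side means a recipient who asks for one day of data never sees, or pays to transfer, the files for any other day.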
Data sharing by various cloud providers
Cloud providers are beginning to take note of this need and have begun introducing new features and capabilities to the market. Snowflake, for example, provides the capability to share data through its data sharing and marketplace offerings, which enable sharing selected objects in a database in your account with other Snowflake accounts.
Below is a comparison of Databricks Delta Sharing and Snowflake data sharing.

| | Databricks Delta Sharing | Snowflake data sharing |
| --- | --- | --- |
| Cost | Consumption pricing model | Compute resources used to query the shared data |
| Commercial clients | Supports many clients | Limited availability |
Databricks' Delta Sharing provides similar features, with the added advantage of a fully open protocol and Delta Lake support for data sharing.
Challenges and difficulties
- Hierarchical queries have been a bottleneck area, and ingesting small datasets via long-running Spark jobs takes time
- It is difficult to find the optimum combination of factors to determine the appropriate cluster configuration
- Revoking access to data once it has been shared is painful
This blog provides insight into Delta Sharing and how it reduces the complexity of ELT and manual sharing while preventing lock-in to a single platform. It addresses the various aspects in detail, along with the pain points and a comparison, to help build a robust data-sharing platform across the same or different cloud tenants. Together, these secure, live data-sharing capabilities of Delta Sharing promote scalable, seamless interaction between data providers and consumers within the Lakehouse paradigm.
- https://docs.microsoft.com/en-us/azure/databricks/data-sharing/delta-sharing/
- https://github.com/delta-io/delta-sharing
- https://databricks.com/product/delta-sharing
- Data Sharing is a Key Digital Transformation Capability (gartner.com)