Data Masking Techniques in Azure Data Platform
October 14, 2022

In today’s world, data is the new oil. Data is everywhere and keeps growing exponentially, so it is essential to protect sensitive data in a controlled way. Data breaches and cyberattacks can cause devastating damage to any organization.

According to Forbes (reference cited below), a major UK hotel chain suffered a data leak through a third-party application, affecting almost 339 million hotel guests. The company received a hefty penalty of £18.4 million for failing to comply with General Data Protection Regulation (GDPR) requirements.

So, organizations must proactively protect their data and regularly update their protective measures. Even access management for employees within the organization must be structurally controlled to avoid any data leakage.

With security compliance and data privacy regulations moving to the forefront of concerns, keeping Personally Identifiable Information (PII) safe on public cloud platforms has become a mandatory compliance requirement.

Data privacy legislation such as GDPR promotes data masking. The average cost of a data breach is $4 million, which motivates companies to invest in information security solutions such as data masking, which can be cheaper to implement than some encryption solutions.

In this article, we will discuss different approaches to implementing data masking on the Azure data platform based on the customer requirement and PII sensitivity of data.

“Data masking is replacing high-value data items with low-value tokens partially or fully” – Gartner.

Though Azure databases (SQL/Synapse) offer data masking as a built-in feature, we will talk about the entire process of protecting data at various stages within the data platform.

Need for data masking:

  • Expansion and maturation of privacy laws in global jurisdictions
  • Growth of advanced analytics and AI/ML projects, their migration to cloud-based data lakes, and increased data security awareness around them
  • Increased attention to protecting data from insiders using zero-trust principles whereby all access must be precisely defined and authorized
  • Compliance requirements involve internal policies, a broad spectrum of regulations, and industry standards, including the Payment Card Industry Data Security Standard (PCI DSS), the Health Insurance Portability and Accountability Act (HIPAA), GDPR, and privacy legislation. These aim to protect data from abuse, prevent fraud, and maintain privacy

Insider threats arise from a growing number of employees and contractors, including:

  • Users of nonproduction databases (for example, data scientists, programmers, testers, and database administrators)
  • Users of analytical and training databases (analysts, researchers, and trainees), and users with access to insufficiently protected production environments

External threats include data exfiltration by various techniques that could exploit overexposed sensitive data, as well as compromised insider identities with access to sensitive data.

Benefits

  • Increase trust in data
  • Improve process efficiencies
  • Speed up operationalization of analytical models

Types of data masking

Data masking is a technique used to protect sensitive data by creating a version of the data that looks structurally like the original but hides (masks) sensitive information, so it is of no use to an attacker if compromised. The format of the data remains the same; only the values change. The masked copy replaces the real data for testing, demos, or other use cases that don’t require the data itself.


  1. Static data masking

This creates clean copies of the databases, modified to obfuscate all sensitive information, allowing them to be shared with non-production users. Once masked using a static masking approach, the data cannot be unmasked.

Example: Azure Data Lake

  2. Dynamic data masking (DDM)

This enables control over who can view masked data and which roles can view sensitive information. DDM is used to prevent unauthorized access to specific pieces of data by limiting the amount of sensitive data revealed. Data masked using a dynamic data masking approach can be unmasked and viewed without restrictions by authorized users.

Example: Azure Synapse Analytics, Azure SQL DB
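To illustrate the dynamic approach, here is a minimal Python sketch: the stored value never changes, and the mask is applied per query based on the caller’s role. The role names, data, and masking rule are hypothetical illustrations of the concept, not an Azure feature or API.

```python
# Minimal sketch of dynamic data masking: the stored value is unchanged;
# the mask is applied at query time based on the caller's role.
# Role names, sample data, and the masking rule are illustrative.

STORE = {"customer_1": {"email": "jane.doe@example.com"}}

AUTHORIZED_ROLES = {"admin", "auditor"}  # roles excluded from masking

def mask_email(value: str) -> str:
    # Expose only the first letter, in the spirit of DDM's email function
    return value[0] + "XXX@XXX.com"

def query_email(record_id: str, role: str) -> str:
    value = STORE[record_id]["email"]
    return value if role in AUTHORIZED_ROLES else mask_email(value)

print(query_email("customer_1", "admin"))    # original value
print(query_email("customer_1", "analyst"))  # masked: jXXX@XXX.com
```

Note that the underlying store is never modified; only the query result differs by role, which is the defining property of dynamic masking.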

Data confidentiality classification

  • Before ingesting data into the cloud, it should be classified according to the scheme below
  • Classification E data should be ingested to the highly classified landing zone with just-in-time access, customer-managed keys for encryption, and inbound or outbound restrictions applied

Classification levels:

A: Public
  • Anyone can access
  • Can be sent to anyone, for example, open government data

B: Internal use only
  • Only employees can access
  • Cannot be shared outside the company

C: Confidential
  • Should be shared only if needed for a specific task
  • Cannot be sent outside the company without a non-disclosure agreement

D: Sensitive (personal data)
  • Personal data that must be masked and shared only on a need-to-know basis for a limited time
  • Cannot be sent to unauthorized personnel or outside the company

E: Restricted
  • Only to be shared with named individuals who are accountable for its protection, for example, legal documents and trade secrets

Approach 1: Mask at source


  • In this approach, the data is masked inside the source storage system
  • Only administrators can see the data, whereas unprivileged users get the masked data
  • Some popular tools are
    • IBM InfoSphere Optim Data Privacy
    • Voltage SecureData

Not all storage systems support data masking. Also, once the data is masked on-premises, it cannot be unmasked in the cloud.

Approach 2: Mask at copy/transform


  • If the target storage system doesn’t support data masking, we can apply this approach
  • In this approach, we’ll copy the unmasked data from the source storage, mask it, and copy it into the target. The masking logic should run inside the source storage boundary, i.e., within the same virtual private network or subscription to which the unprivileged user shouldn’t have access. Sensitive information is masked before it reaches the target environment

This option is simple and can be used when we’re not looking for advanced masking functions.

The options for data masking here are minimal. At the time of writing, Azure Data Factory does not provide built-in data masking functionality. Masking can be implemented in a limited fashion using expression functions (for example, in data flows), which might not suffice for enterprise needs.
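The mask-at-copy step is therefore often implemented as custom transform code running inside the source boundary. Here is a minimal Python sketch of the idea for a CSV copy; the column names and masking rules are illustrative assumptions, not an Azure Data Factory API.

```python
import csv
import io

# Columns to mask and the (illustrative) rules applied while copying.
MASK_RULES = {
    "ssn": lambda v: "XXX-XX-" + v[-4:],          # keep last 4 digits
    "name": lambda v: v[0] + "*" * (len(v) - 1),  # keep first letter
}

def copy_with_masking(source, target):
    """Copy CSV rows from source to target, masking sensitive columns in flight."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(target, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for col, rule in MASK_RULES.items():
            if col in row and row[col]:
                row[col] = rule(row[col])
        writer.writerow(row)

src = io.StringIO("name,ssn,city\nAlice,123-45-6789,Seattle\n")
dst = io.StringIO()
copy_with_masking(src, dst)
print(dst.getvalue())  # name and ssn are masked; city passes through
```

The key design point is that only already-masked rows ever leave the source boundary; the target environment never sees the sensitive values.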

Approach 3: Mask at cloud landing zone (Azure Data Lake - static data masking (SDM))


  • Azure Data Lake does not have native data masking capabilities, so masking must be custom implemented
  • SDM involves creating a duplicate version of the dataset containing fully or partially masked data. The masked copy is maintained separately from the production database
  • SDM permanently replaces sensitive data by altering data at rest, so ideally we shouldn’t apply it to production data itself. It is more appropriate for producing production-like data for development environments
  • This masking technique can be implemented along with access control lists (ACLs) and role-based access control (RBAC) at the data lake layer
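A custom SDM step of this kind can be sketched in a few lines of Python. This is a minimal illustration assuming a salted one-way hash as the replacement technique; the salt, field names, and sample data are hypothetical.

```python
import hashlib

# Static data masking sketch: build a masked copy of a dataset in which
# sensitive values are irreversibly replaced (here via salted hashing).
# The salt, field names, and sample data are illustrative.

SALT = b"per-environment-secret-salt"

def tokenize(value: str) -> str:
    """One-way replacement: same input -> same token, but not reversible."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:12]

def static_mask(rows, sensitive_fields):
    masked = []
    for row in rows:
        copy = dict(row)  # the original dataset is left untouched
        for field in sensitive_fields:
            if field in copy:
                copy[field] = tokenize(copy[field])
        masked.append(copy)
    return masked

production = [{"email": "jane@contoso.com", "country": "UK"}]
dev_copy = static_mask(production, {"email"})
# dev_copy keeps referential consistency (same email -> same token),
# while the production data remains unchanged.
```

Because the replacement is one-way, the masked copy cannot be unmasked, which is exactly the property static data masking requires before sharing data with non-production users.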

Approach 4: Mask at target (dynamic data masking - DDM)


  • It is called dynamic data masking because it masks information in real time as users access it
  • Objective: In a production environment
      • Masked data can be unmasked and seen by authorized users
      • Masked data cannot be unmasked or seen by unauthorized users

We need to define a “security policy” that hides the sensitive data in the result set of a query over designated database fields, while the data in the database itself is not changed.

This can be implemented with the help of RBAC at the data lake layer and DDM in the SQL or Synapse database.

The security policy has the following aspects:

SQL users excluded from masking
  • A set of SQL users or Azure AD identities that get unmasked data in SQL query results
  • Users with administrator privileges are always excluded from masking and see the original data without a mask

Masking rules
  • A set of rules that define the designated fields to be masked and the masking function to be used
  • The designated fields can be defined using a database schema name, table name, and column name

Masking functions
  • A set of methods that control data exposure for different scenarios

Dynamic data masking capabilities (Azure SQL / Synapse)

Default
  • Complete masking according to the data types of the designated fields:
  • For string data types (char, nchar, varchar, nvarchar, text, ntext), use XXXX, or fewer Xs if the field is shorter than four characters
  • For numeric data types (bigint, bit, decimal, int, money, numeric, smallint, smallmoney, tinyint, float, real), use a zero value
  • For date/time data types (date, datetime2, datetime, datetimeoffset, smalldatetime, time), use 01-01-1900
  • For XML, the document <masked/> is used
  • For special data types (timestamp, hierarchyid, GUID, binary, image, varbinary, spatial types), use an empty value

Credit card
  • Exposes the last four digits of the designated fields and adds a constant string prefix in the form of a credit card: XXXX-XXXX-XXXX-1234

Email
  • Exposes the first letter and replaces the domain with XXX.com, using a constant string prefix in the form of an email address: aXXX@XXX.com

Random number
  • Generates a random number according to the selected boundaries and actual data types. If the designated boundaries are equal, the masking function produces a constant number

Custom text
  • Exposes the first and last characters and adds a custom padding string in the middle: prefix[padding]suffix. If the original string is shorter than the exposed prefix and suffix, only the padding string is used
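To make the logic of these functions concrete, here is a Python sketch that mirrors the rules above. Azure applies these functions server-side; this is only an illustration of their behavior, not the actual implementation.

```python
import random

# Illustrative re-implementations of the DDM masking functions' logic.
# Azure applies these server-side; this sketch only mirrors the rules.

def mask_default_string(value: str) -> str:
    # XXXX, or fewer Xs when the field is shorter than four characters
    return "X" * min(len(value), 4)

def mask_credit_card(value: str) -> str:
    # Expose the last four digits behind a constant prefix
    return "XXXX-XXXX-XXXX-" + value[-4:]

def mask_email(value: str) -> str:
    # Expose the first letter and replace the domain with XXX.com
    return value[0] + "XXX@XXX.com"

def mask_random_number(low: int, high: int) -> int:
    # A constant number when the designated boundaries are equal
    return random.randint(low, high)

def mask_custom_text(value: str, prefix: int, padding: str, suffix: int) -> str:
    # Expose first/last characters around a padding string; only the
    # padding is used when the value is shorter than prefix + suffix
    if len(value) <= prefix + suffix:
        return padding
    return value[:prefix] + padding + value[len(value) - suffix:]

print(mask_credit_card("4111111111111234"))         # XXXX-XXXX-XXXX-1234
print(mask_email("jane.doe@contoso.com"))           # jXXX@XXX.com
print(mask_custom_text("5551234567", 2, "...", 2))  # 55...67
```

In Azure SQL or Synapse, these functions are attached to columns through masking rules in the security policy rather than called directly.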

Business scenario

Based on the business need and the PII sensitivity or criticality of the data, the following decision tree can help identify the right approach.

[Figure: decision tree for selecting a data masking approach]

Summary

In this article, we discussed how to implement data masking at various stages in the Azure data platform, from the source, through copy/transform, to the target system. The approach can be chosen based on the client’s needs, and each approach is unique: approaches one and two are suitable for non-production environments, approach three applies static masking at the data lake layer, and approach four, using the masking feature at the database level, would be the right choice for implementing data masking in production without any data leakage from the source systems.

For any organization, meeting both confidentiality and usability requirements means combining data masking with other security features, such as auditing and row-level security. Security versus performance can also be a trade-off, with implications for pipeline latency and cost that must be considered.

References:

  1. Marriott Hit With £18.4 Million GDPR Fine Over Massive 2018 Data Breach (forbes.com)
  2. GDPR Fines & Data Breach Penalties (gdpreu.org)
  3. https://github.com/uglide/azure-content/blob/master/articles/sql-database/sql-database-dynamic-data-masking-get-started.md
