Digitization, the Internet of Things, connected humans, connected machines, and a connected world are all generating data at an unprecedented scale. IDC predicts that by 2025 there will be 175 zettabytes of data, and the volume continues to grow exponentially. These trends give the impression that data is available in abundance and at everyone's disposal. That is not exactly the case. Many artificial intelligence (AI) projects and innovations stall because data is unavailable or because the data collection process is too cumbersome and time consuming. Believe it or not, great ideas get nipped in the bud due to data scarcity!
What are the industry challenges around data?
- Internet companies, financial service providers, telecom service providers, healthcare providers, retailers, and the like have access to huge amounts of data because they have a large number of touchpoints with their customers. However, this data is not available to others: there is abundance on one side and scarcity on the other.
- Even though some organizations possess large volumes of data, access is governed by stringent regulations such as GDPR, which stipulates that data can be processed only for lawful purposes and with user consent. Even sharing data across lines of business within an organization is a challenge. Fragmented and siloed data does not provide the holistic view needed to solve business problems.
- Hackers prowl the darknet, constantly looking for unauthorized access to steal valuable information. This makes organizations paranoid about data security and extra cautious about sharing data. Data loss is a serious reputational issue and can make or break a brand.
- Research and development (R&D) requires data to test hypotheses and drive inventions. Bringing new ideas to life is always challenging because prior data does not exist, and new product introduction is constrained by the lack of real data from the field.
- Deep learning and AI projects need large volumes of structured and unstructured data to train models. Labelled training data is scarce and expensive to prepare.
These challenges constrain an organization's innovation journey and decelerate data monetization and the realization of business benefits. When data is abundant, masking, obfuscation, anonymization, and other privacy-enhancing techniques can ensure data security and privacy. However, when data is scarce, or simply does not exist, synthetic data is a good alternative for pursuing innovation.
What is synthetic data?
Synthetic data is data generated based on the characteristics of the different variables, an understanding of the probability distribution of the expected data, and the relationships between the variables. There are three types of synthetic data: fully synthetic, partially synthetic, and hybrid synthetic.
Fully synthetic data: The data is generated entirely from scratch, based on assumptions about the probability density function of the data, the correlations among the variables, and constraints or rules on the variance of the data.
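As a simple illustration (a minimal sketch assuming NumPy; the means, standard deviations, and correlation values below are made-up assumptions), fully synthetic data can be drawn from an assumed multivariate normal distribution with a chosen correlation structure:

```python
# Fully synthetic data from an assumed distribution and correlation structure.
import numpy as np

rng = np.random.default_rng(42)

means = [70.0, 120.0, 5.5]               # assumed means for three variables
stds = np.array([10.0, 15.0, 1.2])       # assumed standard deviations
corr = np.array([[1.0, 0.6, -0.3],       # assumed correlations between variables
                 [0.6, 1.0, 0.1],
                 [-0.3, 0.1, 1.0]])
cov = np.outer(stds, stds) * corr         # covariance from correlation and stds

synthetic = rng.multivariate_normal(means, cov, size=10_000)
print(np.corrcoef(synthetic, rowvar=False).round(2))  # should approximate `corr`
```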
Partially synthetic data: When some variables in a large dataset are too sensitive to disclose, those variables can be replaced with synthetic values. Imputations are applied to mask the data and make it untraceable, for example, personally identifiable information (PII).
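For example (a minimal sketch assuming pandas and the open-source Faker library; the column names and values are hypothetical), PII columns can be swapped for synthetic values while the rest of each record is left untouched:

```python
# Replace sensitive columns with synthetic, untraceable values.
import pandas as pd
from faker import Faker

fake = Faker()

patients = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones"],              # hypothetical PII columns
    "email": ["alice@example.com", "bob@example.com"],
    "blood_pressure": [118, 135],                      # non-sensitive values kept as-is
})

patients["name"] = [fake.name() for _ in range(len(patients))]
patients["email"] = [fake.email() for _ in range(len(patients))]
print(patients)
```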
Hybrid synthetic data: A limited volume of original data, or data prepared by domain experts, is used as input for generating hybrid data. The underlying distribution of the original data is studied and a nearest neighbor of each data point is created, while preserving the relationships and integrity among the other variables in the dataset. This ensures that the generated synthetic data closely resembles the original data.
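A minimal sketch of the nearest-neighbor idea, assuming NumPy and scikit-learn (hybrid_synthesize is a hypothetical helper, not a library function): each synthetic point is an interpolation between a seed record and one of its nearest neighbors, keeping the generated data close to the original while preserving relationships between variables.

```python
# Hybrid synthesis by interpolating between seed points and their neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hybrid_synthesize(seed: np.ndarray, per_point: int = 5, k: int = 3,
                      random_state: int = 0) -> np.ndarray:
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(seed)   # +1: each point is its own neighbor
    _, idx = nn.kneighbors(seed)

    rows = []
    for i, neighbours in enumerate(idx):
        for _ in range(per_point):
            j = rng.choice(neighbours[1:])                    # skip the point itself
            lam = rng.uniform(0, 1)
            rows.append(seed[i] + lam * (seed[j] - seed[i]))  # interpolate toward neighbor
    return np.asarray(rows)
```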
How can synthetic data be generated?
Generating structured data:
Hybrid synthetic data can be generated from limited initial data with the help of non-parametric CART (classification and regression tree) based methods or parametric regression-based methods. Classification and regression trees are based on conditional probabilities that describe the relationships between the various attributes in the dataset, and these relationships prove very helpful in generating synthetic data. CART models are essentially used for estimating the conditional probability distribution of binary or multi-class outcomes based on multivariate predictors. Class imbalance in the generated data can be addressed with SMOTE (the Synthetic Minority Oversampling Technique). Data for regression can be generated by ensuring that the randomly generated values follow the assumed probability density function and that the correlations between the variables are retained.
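The CART-based approach can be illustrated with a short sketch. This is a minimal, illustrative implementation (not DataGenie's), assuming pandas and scikit-learn and a seed DataFrame whose columns are all numeric (categoricals label-encoded): the first column is bootstrapped from the seed data, and each subsequent column is synthesized by fitting a CART model on the columns generated so far and drawing donor values from the matching leaf.

```python
# Sequential CART-based synthesis of tabular data from a small seed dataset.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def cart_synthesize(seed_df: pd.DataFrame, n_rows: int, random_state: int = 0) -> pd.DataFrame:
    """Synthesize columns one by one with CART models fitted on the seed data.

    Assumes all columns are numeric (label-encode categoricals first); because
    values are drawn from leaf "donors", encoded categories remain valid codes.
    """
    rng = np.random.default_rng(random_state)
    cols = list(seed_df.columns)
    synth = pd.DataFrame(index=range(n_rows))

    # Bootstrap the first column from its observed (empirical) distribution.
    synth[cols[0]] = rng.choice(seed_df[cols[0]].to_numpy(), size=n_rows, replace=True)

    for i, col in enumerate(cols[1:], start=1):
        X_seed, y_seed = seed_df[cols[:i]], seed_df[col].to_numpy()
        tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=random_state)
        tree.fit(X_seed, y_seed)

        # Route seed and synthetic rows to leaves, then draw a donor value from
        # the seed observations that share the synthetic row's leaf.
        seed_leaves = tree.apply(X_seed)
        synth_leaves = tree.apply(synth[cols[:i]])
        synth[col] = [rng.choice(y_seed[seed_leaves == leaf]) for leaf in synth_leaves]

    return synth
```

If the synthesized target variable ends up imbalanced, this can be handled separately, for example with the SMOTE implementation in the imbalanced-learn library (SMOTE().fit_resample(X, y)).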
Time series data can be generated along similar lines. Multivariate time series data can exhibit seasonality, trend, and correlation between variables; decomposing these features from the limited base data, or taking them as user input, allows synthetic time series to be generated.
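As an illustration of the decomposition idea (a minimal sketch, not DataGenie's method, assuming statsmodels and a pandas Series `seed` with a regular frequency), the seed series is split into trend, seasonal, and residual components, and a synthetic series is produced by keeping the trend and seasonality while bootstrapping the residuals:

```python
# Synthetic time series from a decomposed seed series.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

def synthesize_series(seed: pd.Series, period: int = 12, random_state: int = 0) -> pd.Series:
    rng = np.random.default_rng(random_state)
    parts = seasonal_decompose(seed, model="additive", period=period)

    # The moving-average trend leaves NaNs at both ends; fill them in.
    trend = parts.trend.ffill().bfill()
    resid = parts.resid.dropna().to_numpy()

    # Keep trend and seasonality, bootstrap the residual (noise) component.
    noise = rng.choice(resid, size=len(seed), replace=True)
    return trend + parts.seasonal + noise
```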
Generating unstructured data:
Computer vision and deep learning (DL) are widely used these days to realize AI use cases. However, getting large volumes of quality images to train DL models has always been a challenge. Image augmentation is one technique for multiplying the number of available input images. Applying permutations and combinations of techniques such as rotation, flipping, translation, Gaussian noise, color space transformation, channel rotation, cropping, padding, blurring, and distortion can be used to augment existing images.
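For instance, an augmentation pipeline along these lines can be put together with torchvision (a minimal sketch; the specific transform parameters and the file name part_defect.jpg are illustrative assumptions):

```python
# Each pass of a source image through the pipeline yields a different augmented copy.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),        # small random rotations
    transforms.RandomHorizontalFlip(p=0.5),       # mirror half the images
    transforms.ColorJitter(0.2, 0.2, 0.2),        # brightness/contrast/saturation shifts
    transforms.GaussianBlur(kernel_size=3),       # mild blurring
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # crop and rescale
])

source = Image.open("part_defect.jpg")            # hypothetical input image
for i in range(20):                               # 20 augmented variants
    augment(source).save(f"part_defect_aug_{i}.jpg")
```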
There have been many advancements in recent years in GANs (Generative Adversarial Networks) for image generation. A GAN consists of two networks: a generator and a discriminator (adversary). Starting from random noise, the generator produces sample output images, and the discriminator tries to distinguish the generated images from real ones. The closer the generated images come to real images, the more likely they are to be accepted as realistic. While GANs are very useful, they have their own limitations, as they require a reasonably large volume of input data to create output data. Deep Convolutional GANs, Self-Attention GANs, and BigGANs are being adopted for image generation. By applying transfer learning on augmented and generated images, projects can continue to pursue their AI innovations. These techniques help move a project from a data-scarce to a data-adequate situation.
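To make the generator/discriminator interplay concrete, here is a minimal, DCGAN-style training-step sketch in PyTorch. It is illustrative only (not a production GAN and not HCLTech's implementation): the generator maps 100-dimensional noise to 64x64 RGB images, and the discriminator outputs one real/fake logit per image.

```python
# Minimal DCGAN-style generator, discriminator, and one adversarial training step.
import torch
import torch.nn as nn

latent_dim = 100

generator = nn.Sequential(
    nn.ConvTranspose2d(latent_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),  # 4x4
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),         # 8x8
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),           # 16x16
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(True),            # 32x32
    nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),                                     # 64x64 RGB
)

discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),     # 32x32
    nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),   # 16x16
    nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),  # 8x8
    nn.Conv2d(256, 1, 8), nn.Flatten(),                             # one logit per image
)

criterion = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

def train_step(real_images: torch.Tensor):
    """One adversarial update given a batch of real images scaled to [-1, 1]."""
    batch = real_images.size(0)
    noise = torch.randn(batch, latent_dim, 1, 1)
    fake_images = generator(noise)

    # Discriminator: real images labelled 1, generated images labelled 0.
    d_loss = criterion(discriminator(real_images), torch.ones(batch, 1)) + \
             criterion(discriminator(fake_images.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: try to make the discriminator label its fakes as real.
    g_loss = criterion(discriminator(fake_images), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```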
Synthetic data also holds a lot of promise for generating other types of unstructured data such as text and audio. Recurrent neural networks and sequential GANs are applied for text generation, and WaveNets for audio generation.
Ensuring the quality of synthetic data:
It is important to ascertain the quality of the generated data. For structured data, this can be done by plotting the data, analyzing summary statistics, checking the data distribution against expected distributions (e.g., normal, skewed, bimodal, multimodal), running hypothesis tests, comparing the regression coefficients of the original and generated data, and evaluating model performance metrics.
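As a small example of such checks (a sketch assuming pandas and SciPy; compare_columns is a hypothetical helper), the summary statistics of each numeric column can be compared side by side, and a two-sample Kolmogorov-Smirnov test can flag columns whose synthetic distribution drifts from the original:

```python
# Column-by-column comparison of original and synthetic tables.
import pandas as pd
from scipy import stats

def compare_columns(original: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for col in original.select_dtypes("number").columns:
        # Two-sample Kolmogorov-Smirnov test: a small p-value suggests the
        # synthetic distribution differs noticeably from the original.
        ks_stat, p_value = stats.ks_2samp(original[col], synthetic[col])
        rows.append({
            "column": col,
            "orig_mean": original[col].mean(),
            "synth_mean": synthetic[col].mean(),
            "orig_std": original[col].std(),
            "synth_std": synthetic[col].std(),
            "ks_stat": ks_stat,
            "p_value": p_value,
        })
    return pd.DataFrame(rows)
```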
The quality of the generated images can be ascertained by using them to train deep learning models, monitoring the convergence of training, and measuring the improvement in the mean average precision (mAP) and Intersection over Union (IoU) of the inferences.
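IoU itself is straightforward to compute for axis-aligned bounding boxes; the snippet below is a minimal illustration, with boxes given as (x1, y1, x2, y2). mAP evaluation tools compute this overlap under the hood when matching predictions to ground truth.

```python
# Intersection over Union for two axis-aligned boxes in (x1, y1, x2, y2) format.
def iou(box_a, box_b):
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a prediction partially overlapping a ground-truth box.
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.14
```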
HCLTech DataGenie:
HCLTech has incubated a solution for synthetic data generation called DataGenie. The solution focuses on generating structured tabular data and images. A minimal set of input data, prepared in consultation with a subject matter expert (SME) or collected from the field, is required for generating large volumes of tabular data of the categorical, continuous, or time series type. Such data can be used for modeling classification, regression, or forecasting problems. DataGenie can also augment images and generate new images using GANs. Around a hundred input images are required to generate thousands of output images.
DataGenie has been deployed to generate data for the following use cases, helping to train models with a reasonable amount of data and improving model performance. The generated data helped kickstart innovation that would otherwise not have been possible.
- For a disease detection use case in the medical vertical, it created over 50,000 rows of patient data from just 150 rows. Models trained on the synthetic data achieved a disease classification accuracy of 90%.
- For a medical device, it generated reagent usage data (time series) to forecast expected reagent usage.
- It created 1000+ images from just 50 images for an automobile dent/scratch detection use case. It also augmented medical images for knee joint detection and localization.
Conclusion
Synthetically generated data holds a lot of promise in highly regulated industries such as financial services, medical devices, healthcare, and clinical trials. Image augmentation and generation, combined with transfer learning, holds huge promise when getting real data from the field is a challenge. Though synthetic data is not a complete substitute for real data, since it is practically impossible to cover all real-world scenarios, it helps kickstart AI projects while real data collection progresses in the field. Labelling effort can also be saved, since labelling can be part of the generation process itself. Training deep learning models on both synthetic and real data helps protect the model against adversarial attacks, improves data security, and increases model robustness. Because the model is exposed to data that differs slightly from the real data, overfitting issues are also mitigated. Innovation should never be constrained by a lack of data. Time is of the essence, and if synthetic data helps, let's go for it!
For more details on HCLTech DataGenie, please contact Jayachandran.ki@hcl.com