The digitally connected world generates, collects and analyzes large volumes of data. Not all the data that is generated is collected, and not all the data that is collected is analyzed. Many data privacy and security constraints must be addressed before all collected data can be made available for analysis. For example, GDPR places multiple restrictions on capturing, storing, processing, sharing and disposing of personal data.
Another challenge is that data collection and analysis often happen in silos. An organization's ecosystem consists of many entities: employees, customers, partners, suppliers, regulators, and society at large. When it comes to data, however, these entities are viewed in isolation, which leads to incomplete data collection, partial analysis and uncertain insights. If these entities could share data without compromising the privacy and security of the participants, the quality of analysis and insights would improve vastly.
For every step an organization takes to make data available to third parties, it takes two steps back when a data breach occurs (e.g., Capital One, Equifax, Cambridge Analytica). Far more effort goes into securing data than into sharing it. With rapid advances in cloud technology, organizations are increasingly adopting the cloud for data storage and computing. Data security and privacy remain key concerns that inhibit the journey towards a world of “open data”. It is important to know how different stakeholders will benefit from, or be affected by, the sharing of data. The following diagram provides an illustrative view.
Privacy Enhancing Techniques (PETs): Democratizing data, machine learning models and insights across ecosystem participants can create substantial value for all of them. A fine balance must be struck between sharing data across participants and complying with data security and privacy laws. Privacy Enhancing Techniques offer a breakthrough approach to this. Though these techniques have existed for some time, they have not been applied at the scale required in a “data hungry” machine learning world. Using PETs, data can be shared in a secure and trustworthy manner, and models can be built and deployed without disclosing the actual data. The following are some of the widely used PETs.
Differential Privacy: A method for publicly sharing information about a dataset by describing the patterns within it while withholding information about the individuals in it. This is typically achieved by adding calibrated random noise to the results of queries on the data. E.g.: In a study of cancer patients, information about a patient’s specific condition is protected. Nothing is disclosed about the patient beyond what was already generally available, and the inference from the study would not materially change if the patient were removed from the study.
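To make the idea concrete, here is a minimal sketch of one common way to achieve differential privacy for a counting query: the Laplace mechanism. The record layout, the function names and the choice of epsilon are assumptions made for illustration; they do not come from any particular study or library.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records that match the predicate."""
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity of this query is 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical patient records, for illustration only.
    patients = [{"age": age} for age in (34, 61, 47, 70, 52)]
    rng = random.Random()
    noisy = private_count(patients, lambda p: p["age"] > 50, epsilon=1.0, rng=rng)
    print(f"Noisy count of patients over 50: {noisy:.2f}")
```

A smaller epsilon adds more noise and gives stronger privacy; a larger epsilon gives more accurate answers. Because the noise masks the contribution of any single record, the published count looks much the same whether or not a given patient is in the study.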
Federated learning: The typical approach is to bring data from different sources into a centralized data lake or warehouse, where machine learning models are built. The data moves from its sources across systems and environments. In federated learning, data stays where it was generated. Models are trained individually at each data source and only the trained model is sent to a centralized server. In other words, the model moves rather than the data. Google pioneered this approach, which has a lot of potential where sharing data across networks is risky or a cause for concern. E.g.: AI in radiology is an emerging trend. Where X-ray, CT and MRI studies are done for patients, hospitals and diagnostic companies are restricted from sharing such data with central servers for analysis. In such cases, the ML models are trained on on-premise servers and the trained model is sent to a central model server. The enriched model is then sent to another hospital, where it is further trained on a new set of data, so the model matures continuously. While the model is shared, the data stays within the hospital, so patient privacy is protected.
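The hospital-to-hospital flow described above can be sketched in a few lines. This is a toy illustration, not a production setup: the model is a single weight fitted to y ≈ w·x, and the hospital names and data points are invented for the example.

```python
def local_train(weight, data, lr=0.01, epochs=50):
    """Gradient descent for y ≈ weight * x, run on one site's private data."""
    for _ in range(epochs):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight


# Hypothetical private datasets; each one stays inside its own hospital.
hospital_data = {
    "hospital_a": [(1.0, 3.1), (2.0, 5.9), (3.0, 9.2)],
    "hospital_b": [(1.5, 4.4), (2.5, 7.6), (4.0, 11.9)],
    "hospital_c": [(0.5, 1.6), (3.5, 10.4), (5.0, 15.1)],
}


def federated_rounds(sites, rounds=5):
    """Pass the model from site to site; only the weight leaves each site."""
    weight = 0.0  # global model held by the central server
    for _ in range(rounds):
        for data in sites.values():
            # The server sends the current weight to the site,
            # the site trains locally and returns the updated weight.
            weight = local_train(weight, data)
    return weight


if __name__ == "__main__":
    print(f"Learned weight: {federated_rounds(hospital_data):.3f}")
```

A common variant, federated averaging, trains at all sites in parallel and has the server average the returned weights instead of passing the model along sequentially.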
Homomorphic Encryption: Homomorphic encryption transforms data into ciphertext while preserving the relationships between elements. The encrypted data can be processed and analyzed, and insights can be gleaned, much as they could from the original data. Homomorphic encryption schemes allow mathematical operations such as addition, multiplication and polynomial transformations to be performed directly on encrypted data; decrypting the result gives the same answer as performing the operations on the original data. By applying this technique, we can store data securely in the cloud and still use the out-of-the-box analytics capabilities that the platforms offer.
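As an illustration, the sketch below implements a toy version of the Paillier cryptosystem, a well-known scheme that supports addition on ciphertexts (it is additively homomorphic, not fully homomorphic). The primes are far too small for real use and the function names are chosen for this example only.

```python
import math
import secrets


def generate_keys(p: int, q: int):
    """Build a Paillier key pair from two primes (toy sizes, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1
    # mu is the modular inverse of L(g^lam mod n^2), where L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)


def encrypt(public_key, message: int) -> int:
    n, g = public_key
    while True:
        r = secrets.randbelow(n - 1) + 1  # random value in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(g, message, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(public_key, private_key, ciphertext: int) -> int:
    n, _ = public_key
    lam, mu = private_key
    return ((pow(ciphertext, lam, n * n) - 1) // n * mu) % n


def add_encrypted(public_key, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    n, _ = public_key
    return (c1 * c2) % (n * n)


def multiply_by_constant(public_key, c: int, k: int) -> int:
    """Raising a ciphertext to the power k multiplies the plaintext by k."""
    n, _ = public_key
    return pow(c, k, n * n)


if __name__ == "__main__":
    public, private = generate_keys(104729, 1299709)
    total = add_encrypted(public, encrypt(public, 1200), encrypt(public, 345))
    print(decrypt(public, private, total))
```

The party that performs `add_encrypted` never sees 1200 or 345; it only handles ciphertexts. Only the holder of the private key can read the sum.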
Zero Knowledge Proof: A cryptographic method by which the actual data of the subject is not revealed. Instead, a response is provided that either validates or invalidates a query from a third party, backed by a proof that the response is correct. It has wide application in securing personally identifiable information (PII) in health care, financial services, telecom, etc., and in securing blockchain transactions. E.g.: If a third party would like to check whether a customer's salary exceeds $100,000 when processing a loan, the KYC provider responds only with a Yes or No, without disclosing the actual salary.
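A threshold check like the salary example requires a range proof, which is too involved for a short snippet. The sketch below instead shows a simpler classic protocol, Schnorr identification, in which a prover convinces a verifier that it knows a secret x behind a public value y = g^x mod p without revealing x. The group parameters are toy-sized and the function names are assumptions for this example.

```python
import secrets

# Toy group parameters (far too small for real use):
# P is prime, Q = (P - 1) / 2 is prime, and G generates the subgroup of order Q.
P = 2039
Q = 1019
G = 4


def keygen():
    """Prover picks a secret x and publishes y = G^x mod P."""
    x = secrets.randbelow(Q - 1) + 1
    return x, pow(G, x, P)


def commit():
    """Prover picks a fresh random r and sends the commitment t = G^r mod P."""
    r = secrets.randbelow(Q)
    return r, pow(G, r, P)


def challenge():
    """Verifier replies with a random challenge c."""
    return secrets.randbelow(Q)


def respond(x, r, c):
    """Prover answers with s = r + c * x mod Q; r masks the secret x."""
    return (r + c * x) % Q


def verify(y, t, c, s):
    """Verifier accepts if G^s == t * y^c mod P."""
    return pow(G, s, P) == (t * pow(y, c, P)) % P


if __name__ == "__main__":
    secret, public = keygen()
    r, t = commit()
    c = challenge()
    s = respond(secret, r, c)
    print("Proof accepted:", verify(public, t, c, s))
```

The verifier learns only that the check passes. Because r is random and used once, the response s reveals nothing about x on its own.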
Secure multiparty computation: Secure multiparty computation (MPC) is a method that enables multiple parties to compute jointly on their data without any participant revealing its own data. MPC can also support private multi-party data analysis and machine learning. In this case, the parties exchange encrypted data and train a machine learning model on the combined data without disclosing the actual data. This removes the need for a centralized data aggregator.
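One simple building block often used in MPC is additive secret sharing. The sketch below, with invented function names and values, lets several parties compute the sum of their private numbers: each party splits its value into random-looking shares, and only sums of shares are ever published.

```python
import secrets

MODULUS = 2**61 - 1  # all arithmetic is done modulo this number


def share(value: int, n_parties: int):
    """Split a value into n random shares that add up to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def secure_sum(private_values):
    """Compute the sum of all values without any party revealing its own."""
    n = len(private_values)
    # distributed[i][j] is the share that party i sends to party j.
    distributed = [share(value, n) for value in private_values]
    # Each party j adds up the shares it received and publishes only that total.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    return sum(partial_sums) % MODULUS


if __name__ == "__main__":
    # Hypothetical private inputs held by three different organizations.
    print(secure_sum([120, 75, 310]))
```

Any single share is a uniformly random number, so it carries no information about the value it came from. The sum is recovered only when all the partial totals are combined, and no central aggregator ever holds the raw inputs.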
Libraries and frameworks for secure machine learning: While we secure data, it is equally important to secure the machine learning models. Models can be subject to different types of attack, such as membership inference, model inversion and model parameter extraction. PySyft, TF Encrypted and TF Privacy are some of the frameworks that help in securing models. PySyft extends PyTorch, TensorFlow and Keras with capabilities for remote execution, differential privacy, homomorphic encryption, secure multi-party computation and federated learning. TF Encrypted is a framework for encrypted machine learning in TensorFlow that integrates secure multi-party computation and homomorphic encryption. It also offers a high-level API, TF Encrypted Keras, which aims to make privacy-preserving and secure machine learning possible without requiring expertise in building customized encryption algorithms. TF Privacy is a library that applies differential privacy concepts so that models do not memorize specific training data and generalize better.
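To show the idea behind differentially private training, here is a plain-Python sketch of the core step of DP-SGD, the technique that TF Privacy builds on: each example's gradient is clipped, and Gaussian noise is added before averaging. This is not the TF Privacy API; the function and parameter names are assumptions for this example.

```python
import math
import random


def clip(gradient, max_norm):
    """Scale a gradient down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in gradient]


def private_mean_gradient(per_example_gradients, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, add Gaussian noise to the sum, then average."""
    clipped = [clip(g, max_norm) for g in per_example_gradients]
    dim = len(clipped[0])
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    # The noise is scaled to the clipping bound, which caps the
    # influence any single training example can have on the update.
    stddev = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, stddev) for s in summed]
    return [v / len(clipped) for v in noisy]


if __name__ == "__main__":
    rng = random.Random()
    gradients = [[3.0, 4.0], [0.3, 0.4], [-1.0, 2.0]]  # made-up values
    print(private_mean_gradient(gradients, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much one example can move the model, and the noise hides what remains of that contribution, which is what discourages the model from memorizing individual training records.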
Adopting the principles of privacy by design (i.e., ensuring privacy in the design and throughout the life cycle of the system) and secure machine learning helps to build trustworthiness into AI solutions. Since the data is transformed or hashed, it becomes easier to make data public for audit trails and for explaining models. It also creates opportunities for democratizing data, which leads to wider participation among stakeholders.
As more organizations adopt cloud technologies for data storage and data analysis, and proceed to embed AI models in their business processes, it is important to ensure that concerns around data and model security are addressed. With many AI applications moving from the proof-of-concept stage to production, the risks around data and model security are very high. Data scientists and data engineers should ensure that security and privacy are integral features of model development; they can never be an add-on. This must be a key consideration when AI application architecture reviews and testing are undertaken. Though secure machine learning has some limitations around model training time, inference time and performance metrics, the benefits of ensuring a secure environment far outweigh the drawbacks. While many see regulations like GDPR as a bane for innovation, there is a flip side. Regulations provide clarity on the art of the possible, so that innovators are clear about the boundaries within which they need to operate. To be successful in the AI journey, it is important for organizations to be more responsible with customer data and to go the extra mile to gain the confidence and trust of their customers. For more details on secure and collaborative machine learning, please contact NEXT.firstname.lastname@example.org