Background
Column-level encryption is widely used to protect business-critical information in databases. For example, a company may consider customer names and details (such as addresses and phone numbers) to be critical business information and be reluctant to share them in their original form with every user of a table. It would therefore prefer to store this information in a form that general users cannot easily read. At the same time, certain users, such as senior management or the sales team, may need the original values to contact customers, pursue leads, and so on. It is not difficult to see that encrypting selected columns suits such a scenario well: general users of the table see only the encrypted values, while selected users such as sales and management see the information in its decrypted form (the original text).
In certain cases, where we never want the information to be decrypted at all, other techniques such as one-way hashing can be used, but that is not the topic of our discussion today.
The reason for writing this blog is to present a simple technique for encrypting and decrypting column values in Hive that spares the user from key management issues while, at the same time, allowing only selected users (like senior management and sales in the example above) to decrypt.
Existing Solutions
We looked around for existing solutions in the Cloudera stack:
- HDFS Encryption: This is filesystem-level encryption. Typically, we can create an encryption zone for the Hive warehouse directory and encrypt all our data, as explained here. However, this approach does not allow encryption at the column level; instead, it encrypts/decrypts an entire table (or database) at once. Having established that, we can use it as a supplement to strengthen overall encryption, but not for column-level encryption in Hive.
- Hive UDF: Hive provides UDFs such as aes_encrypt, which can encrypt the data of a given column. The shortcoming of this approach is that the encryption key must be managed by the user, because it has to be supplied as an argument to the UDF (see the short sketch after this list). This does not suit our requirement either, because we do not want to share the encryption key with any user; to keep the key secure, key management should happen internally, and the user should not have to provide a key at all.
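To see the shortcoming concretely, here is a hypothetical JDBC client using the built-in aes_encrypt (available in recent Hive releases). The host, credentials, table, and the hard-coded 16-byte key are all placeholders; the point is that the raw key sits in the query text, visible to anyone who can run or log the query:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BuiltInAesDemo {
    public static void main(String[] args) throws Exception {
        // Standard Hive JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver2-host:10000/default", "user", "password");
             Statement stmt = conn.createStatement();
             // The 16-byte AES key is passed as a plain argument -- every user
             // who runs this query must know (and can leak) the key.
             ResultSet rs = stmt.executeQuery(
                 "SELECT base64(aes_encrypt(customer_name, '1234567890123456')) "
                     + "FROM customers")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```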
We also explored some ongoing work (ref1, ref2) that might suit our requirement, but none of these features has yet made it into a production release of Hive. Finally, we decided to come up with a simple solution for column-level encryption in Hive, which is essentially a blend of what we have already discussed.
Solution in Cloudera Stack
Step 1: Create two custom Hive UDFs, encrypt_hive and decrypt_hive, which perform encryption and decryption of column values respectively. Internally, these UDFs first retrieve the secure key (key management and secure key access are explained in the next step). Once the secure key is retrieved successfully, the UDF encrypts/decrypts the column values using AES or any other standard encryption algorithm. To learn how to create a custom Hive UDF, refer here. For a sample Java program showing how to encrypt/decrypt with AES, refer here.
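As a minimal sketch of Step 1, the encrypt-side UDF might look like the following. The class name EncryptHive, the AES/ECB/PKCS5Padding cipher, the Base64 encoding of the ciphertext, and the stubbed getSecureKey() (implemented over JDBC in Step 2) are illustrative choices, not a definitive implementation; in particular, a production version should prefer an authenticated cipher mode over ECB:

```java
package com.example.hive.udf;

import java.nio.charset.StandardCharsets;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class EncryptHive extends UDF {

    public Text evaluate(Text plainText) throws Exception {
        if (plainText == null) {
            return null;
        }
        // Retrieve the secure key internally -- the caller never supplies it.
        // Step 2 shows how this lookup can be done over JDBC; stubbed here.
        String secureKey = getSecureKey();
        SecretKeySpec keySpec =
            new SecretKeySpec(secureKey.getBytes(StandardCharsets.UTF_8), "AES");
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, keySpec);
        byte[] encrypted =
            cipher.doFinal(plainText.toString().getBytes(StandardCharsets.UTF_8));
        // Base64-encode so the ciphertext can live in an ordinary STRING column.
        return new Text(Base64.encodeBase64String(encrypted));
    }

    private String getSecureKey() {
        // Placeholder only: replaced by the Sentry-protected JDBC lookup of Step 2.
        return "0123456789abcdef"; // 16 bytes -> AES-128
    }
}
```

decrypt_hive is the mirror image: Base64-decode the input, initialize the cipher with Cipher.DECRYPT_MODE, and return the recovered plaintext.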
Step 2: Create a new Hive table with a single column and store the secure key as that column's value. To retrieve this key from inside the UDFs created in step 1 (encrypt_hive and decrypt_hive), we can use a JDBC connection (a minimal sketch follows the note below). However, to ensure that only selected users can retrieve this key, two things are required:
- LDAP/Kerberos authentication: HiveServer2 needs to be configured to use LDAP or Kerberos authentication over the JDBC connection, so that only users with valid LDAP/Kerberos credentials can establish JDBC connections to HiveServer2. To learn how to configure HiveServer2, refer here.
- Sentry for table-level authorization: Using Sentry, we can grant only selected users read-only access to the Hive table holding the secure key. This way, we can restrict access to the secure key to senior management. To learn how to create a new role, grant authorization over Hive tables using Sentry, and map roles to Linux (or LDAP) groups, refer here.
Note: The Sentry policy file authorizes roles, which map to Linux groups; since those groups contain individual users, this effectively gives us user-level authorization.
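Below is a minimal sketch of the key lookup, assuming the key lives in a hypothetical single-column table (e.g. `CREATE TABLE secure_keys (key_value STRING)`). The class name, JDBC URL, and query are placeholders, and how the UDF obtains the calling user's credentials is deployment-specific; the important point is that the connection succeeds only if the credentials pass LDAP/Kerberos authentication and Sentry grants the user SELECT on the key table:

```java
package com.example.hive.udf;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class KeyFetcher {

    // Hypothetical HiveServer2 endpoint and key table.
    private static final String JDBC_URL =
        "jdbc:hive2://hiveserver2-host:10000/default";
    private static final String QUERY =
        "SELECT key_value FROM secure_keys LIMIT 1";

    public static String getSecureKey(String user, String password) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Authentication (LDAP/Kerberos) and authorization (Sentry) are both
        // enforced here: unauthorized users simply cannot read the key.
        try (Connection conn = DriverManager.getConnection(JDBC_URL, user, password);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(QUERY)) {
            return rs.next() ? rs.getString(1) : null;
        }
    }
}
```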
Step 3: Finally, it is of utmost importance to secure the UDFs themselves, because they contain sensitive details such as the name of the table holding the secure key and the connection details used to retrieve it. To secure the UDFs, we can make use of Sentry again (illustrative grants are sketched after Step 4). To understand how to secure a UDF through Sentry, refer here.
Step 4: Depending on the requirement, we can have separate Sentry roles to authorize encryption and decryption operations (in other words, access to the encrypt_hive and decrypt_hive UDFs). For example, as explained earlier, only the management role can decrypt the column values, but we can add another role that allows a new user, "etl", to encrypt. This user can then be dedicated to running ETL jobs, so that we encrypt the data while loading/importing it into the data lake.
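The sketch below covers the Sentry side of Steps 3 and 4. It assumes the cluster uses Sentry's SQL GRANT interface (a policy-file deployment would express the same rules in its sentry-provider.ini); the role, group, table, and jar names are illustrative. Packaging encrypt_hive and decrypt_hive in separate jars is one way to let Sentry authorize the two operations independently:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SentryGrants {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver2-host:10000/default", "admin", "password");
             Statement stmt = conn.createStatement()) {
            // Management: may read the key table and use both UDF jars,
            // i.e. can encrypt and decrypt.
            stmt.execute("CREATE ROLE mgmt_role");
            stmt.execute("GRANT ROLE mgmt_role TO GROUP management");
            stmt.execute("GRANT SELECT ON TABLE secure_keys TO ROLE mgmt_role");
            stmt.execute("GRANT ALL ON URI 'hdfs:///apps/hive/udfs/encrypt-udf.jar' "
                + "TO ROLE mgmt_role");
            stmt.execute("GRANT ALL ON URI 'hdfs:///apps/hive/udfs/decrypt-udf.jar' "
                + "TO ROLE mgmt_role");

            // ETL: may read the key and use only the encrypt jar, so the "etl"
            // user can encrypt during loads but never decrypt.
            stmt.execute("CREATE ROLE etl_role");
            stmt.execute("GRANT ROLE etl_role TO GROUP etl");
            stmt.execute("GRANT SELECT ON TABLE secure_keys TO ROLE etl_role");
            stmt.execute("GRANT ALL ON URI 'hdfs:///apps/hive/udfs/encrypt-udf.jar' "
                + "TO ROLE etl_role");
        }
    }
}
```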
Pros & Cons
PROS:
- Simplicity: This approach is simple to implement and understand.
- Scalability: Any number of users can be mapped to the authorized group to gain access to the secure key.
- Hidden key management: The user never handles the secure key; key management is completely hidden from the user.
CONS:
- UDFs must be secured: Since the UDFs contain connection details for the table holding the secure key, extra care is necessary to make sure they are not exposed in any way.
- Setup overhead: Additional user groups must be created, and LDAP or Kerberos authentication must be set up.
Summary
To summarize our approach: to achieve column-level encryption in Hive, we can use custom Hive UDFs together with Sentry, keeping key management secure by storing the secure key in a separate, access-controlled Hive table. These custom UDFs can then be used to encrypt and decrypt the column values of any Hive table.
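Putting it all together, a hypothetical end-to-end usage over JDBC might look as follows. The function names, jar paths, and the customers table are illustrative, and the session's credentials must pass both authentication and the Sentry grants sketched above:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ColumnEncryptionDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver2-host:10000/default",
                 "mgmt_user", "password");
             Statement stmt = conn.createStatement()) {
            // One-time registration of the permanent functions (Hive 0.13+).
            stmt.execute("CREATE FUNCTION encrypt_hive AS "
                + "'com.example.hive.udf.EncryptHive' "
                + "USING JAR 'hdfs:///apps/hive/udfs/encrypt-udf.jar'");
            stmt.execute("CREATE FUNCTION decrypt_hive AS "
                + "'com.example.hive.udf.DecryptHive' "
                + "USING JAR 'hdfs:///apps/hive/udfs/decrypt-udf.jar'");

            // A management user sees the original values; general users without
            // access to decrypt_hive see only the stored ciphertext.
            try (ResultSet rs = stmt.executeQuery(
                     "SELECT customer_name, decrypt_hive(phone_number) "
                         + "FROM customers")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getString(2));
                }
            }
        }
    }
}
```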
References
- Cloudera: http://www.cloudera.com
- Hive Wiki: https://cwiki.apache.org/confluence/display/Hive/Home