Solution to the take home assignment for the DE Engineering interview at Fetch
- Install the necessary python modules for the assignment
make pip-install
- Install docker images
make docker-start
- Run alter_table.sql file to convert "app_version" from integer to varchar
make alter-table
This step would require you to enter the password for the database
- Initiate the process of extracting messages, encryption and data loading
make run
In order to check the results, we can type the following in the terminal:
psql -d postgres -U postgres -p 5432 -h localhost -W
After entering the password, we type the following
select COUNT(*) from user_logins;
You sould see 99 records in the database
Our code consist of 5 python files
- SQS.py which contains code to load SQS Queue object and retrieve messages from the queue. For this, I used Boto3 module in Python.
- AESCipher.py which is used to create a cipher for AES Encryption of an attribute
- PDatabase.py which contains code to feed encrypted messages to our database. I used psycopg for connecting and quering the database.
We also have the following files
- alter_table.sql in the scripts folder which contains SQl statement to change datatype of "app_version" attribute.
- docker-compose.yaml containing the docker config to be used by our application
- config.cfg which contains all the config details for AWS, SQS and PostgreSQL Database
For PII Masking, we wanted a method that would allow analysts to easily recognize duplicate values while securely hiding the infrmation.
- We need some kind of encryption method where the same plaintext would produce the same result all the time. This is called deterministic encryption.
- It's mentioned in the assignment that PII may be need to be recovered later on.
Based on the above criteria, I came up with two methods :
- SHA256 hashing which is deterministic, but recovering original PII values is practically impossible unless we maintain a lookup table for every value in the table.
- AES encryption with ECB mode, which is also deterministic, but unline hashing, requires us to only create a random key for each unique attribute we want to encrypt.
Ultimately, I went with the second one because it is more memory and space efficient.
-
How will the application run in production?
For production, I would want to
- dockerize the application instead of directly running a python script
- Use module level logging to log any error or warning we get at any step of the pipeline
- Create a log every time the application is run for easier debugging
-
What other components to add to make this production ready?
- Use Apache Airflow or any other orchestration tool for managing the whole pipeline
- Consider replacing postgreSQL with a cloud data warehouse as data grows.
- Have a highly secure cloud storage to store & retrieve ciphers when done using the application.
-
How can this application scale with growing dataset?
- Moving from postgreSQL to a cloud data warehouse as data grows.
- Use Spark for all data-preprocessing and data transformations.
- The current application works on batches of data, with growing data we may consider moving to streaming ETL.
-
How can PII be recovered later on?
Since we are using key-based encryption, we can use the same key to decrypt our data. Please note that we have one unique key per attribute, so even if data grows, we don't need to recompute the key and can use the same cipher to encrypt/decrypt.
-
What are the assumptions you made?
There was an error in our database where the "app_version" attribute table was assigned an integer type. In the messages, it was of the form "x.x.x", where x is a number. I assume this was intentional and I was able to correct it with an SQL script.