Client is a FinTech company, aimed at simplifying access to financial services for over two billion un-banked/under banked people in the emerging markets of Africa, Asia, and Latin America using the mobile devices.
Client is looking to create data pipeline solution on Hadoop environment that can support ingestion of large volumes of daily data from multiple sources and handle regular batch processing workloads. Solution also needs to be designed keeping in mind security constraints.
Data solution to be implemented needed to address multiple needs
- Data Encryption: To encrypt the data in HDFS, it needed to be moved to a Unix machine that was responsible for data encryption and encrypted data (subset of columns) written back to HDFS. we used an external library, spark-sftp with spark to import and export data to and fro between hdfs layer and unix box. This pipeline was coded in python giving us flexibility to make it generic to support data from multiple locations, schema and compression.
- Data lake on HDFS : Current data, of around 100 GB across multiple tables, was hosted on DB2 Unix box and sftp paths. Historical data had to be imported first followed by delta data for all tables at daily, weekly and monthly intervals for different tables. Data stored on HDFS further needed to be encrypted.
- Existing nifi based pipeline that fetched data from DB2 > HDFS > spark-sftp > HDFShad an overhead of spark-sftp. We needed a solution that could support encryption on the fly. To enable this, we wrote a custom nar file, which basically is a custom processor for nifi and translated the java code that did the encryption on unix box to it.
- Post this we were able to fetch data and nifi being a flow-based tool, data now was encrypted on the fly and then stored on HDFS.
- More improvements were realized:
- Compression: To save disk space on on-premises Hadoop cluster we chose parquet as our file format and also enabled snappy compression in NiFi while writing data to disk.
- Throughput: Data was pulled, encrypted and stored on a hdfs layer via a single node nifi setup. We upgraded our nifi setup to cluster mode of 2 nodes. With this we increased our parallel computations, thread to process data faster, hence throughput was substantially improved.
- Data quality: Of the all file formats, CSV was pretty notorious for its control over schema. Hence, we made some improvements like making its schema specific by converting it to Json and then saving it down the line in parquet. Timestamps and date were also critical. We used Nifi capability to keep our dates and timestamps specific considering time zone changes as well.
- Data quality validation: Once the data had been ingested, we had to make sure number of records/ day are correct. Correctness factors being number of records ingested, schema, actual data, mandatory columns, encrypted columns and compression.
A custom spark-scala based code scheduled by airflow was implemented to check if the data had been ingested in the designated layer. After confirming data availability, validation code goes on to check multiple parameters, namely, record count, schema, mandatory columns, datetime and timestamps correctness, encrypted columns correctness, compression. We made it generic to work with multiple tables
- Compression helped save disk usage in TBs.
- Optimized NiFi pipelines saved enormous time, increasing throughput and decreasing latency.
- Establishment of ELK helped save time and increase correctness of validation.
- Usage of JSON helped maintain consistent schema around the platform.