Client is an Indian e-commerce company that connects Indian manufacturers or suppliers with buyers and provides B2C, B2B and C2C sales services through its web portal.
Client has deployed an on-premise Oracle RDBMS infrastructure to store its data in order of 500 GB at present which grows 20GB on a monthly basis. The current Oracle RDBMS infrastructure is used to manipulate data through Stored Procedures, extract the data manually through a WEBERP system and use that extracted data to create the business reports manually in MS Excel. The reports were based on monthly data with limited view into daily trends.
The client wanted to revamp the data processing and reporting process so that reports can be generated daily or on demand.
Our data engineering team had multiple meetings with client stakeholders to understand their current data setup & future technology roadmap. Client’s other business functions had already initiated shifting to AWS cloud and hence, it made sense to develop proposed data warehouse using AWS Redshift.
Additionally, data pipeline & reporting solution was also proposed to be built using serverless components of AWS, namely, AWS Glue & AWS QuickSight.
Below is the high-level architecture diagram of the proposed solution:
Client provided us with a set of 50 business views (reports), covering 90% of business queries that client wanted data warehouse to answer. It was also desired for data warehouse to be able to provide any adhoc information that could be requested in the future. We followed below steps to build the solution:
1. Since data store was supposed to answer any adhoc data requests in the future, we built in intermediate staging layer to capture data from input source as it is with limited transformation. This meant we would have entire historical data in one place that could be referred to later if needed. This data layer was provided by S3.
2. Created a logical view of data warehouse model to capture all facts and dimensions provided in business views. Logical model once approved was converted into physical AWS Redshift model with appropriate partition and sort keys defined.
3. Although data warehouse is supposed to be “write once, read many”; it was requested to allow the client team to make infrequent updates for records that are prone to be modified. Initial data for tables (multiple files for each table) was provided in pre agreed S3 location in CSV (Comma Separated Value) files. Each row in CSV file had a logical operator indicating record was meant for Insertion, Updating or Deletion. Tables further had hierarchical relationship and dependencies between themselves which meant that any update or deletion logic in ETL jobs had to handle complex sequential atomic updating scenarios.
4. Glue jobs were created to move data from S3 to staging layer (AWS Redshift tables acting as temporary storage for daily processing). At this stage, only columns needed for business reports were moved with changes in data types for some columns.
5. Post having data in staging tables in AWS Redshift, AWS Redshift queries (as a part of cron job deployed on EC2 instance) were used to move data into warehouse tables in AWS Redshift itself. At this stage, aggregations and selections were performed. Job performance metrics and associated logs were stored in a metadata table and notifications were sent via email to developer team on completion or failure of any step of data pipeline.
For going live
1. 3 years of historical data was moved into AWS Redshift and resulting numbers in data warehouse were tallied with existing reports. This took couple of weeks. Meanwhile, incremental data in CSV files for each day was collected in S3 location.
2. Once historical data was verified, backlogged incremental data of 2 weeks was processed in a day and incremental process was set up next day itself.
3. Currently, we run daily incremental data process successfully and any error scenarios are captured and promptly triggered.
4. System is designed to recover from breakdown or unexpected shutdown for any number of days.