Client is an Indian e-commerce company that connects Indian manufacturers or suppliers with buyers and provides B2C, B2B and C2C sales services through its web portal.
Client has deployed an on-premise Oracle RDBMS infrastructure to store its data in order of 500 GB at present which grows 20GB on a monthly basis. The current Oracle RDBMS infrastructure is used to manipulate data through Stored Procedures, extract the data manually through a WEBERP system and use that extracted data to create the business reports manually in MS Excel. The reports were based on monthly data with limited view into daily trends.
The client wanted to revamp the data processing and reporting process so that reports can be generated daily or on demand.
Our data engineering team understood the client’s current data setup & future technology roadmap. It was decided to go with AWS Redshift for data warehousing needs.
Additionally, a data pipeline & reporting solution was also proposed to be built using serverless components of AWS, namely, AWS Glue & AWS QuickSight. We were provided with a set of 50 business views (reports), covering 90% of business queries. It was also desired for data warehouse design to support any future adhoc information request. Following development steps were followed
- We built an intermediate staging layer to capture data from the input source as it is with limited transformation. This meant we would have entire historical data in one place that could be referred to later if needed. This data layer was provided by S3.
- Created a logical view of the data warehouse model to capture all facts and dimensions provided in business views. Logical model once approved was converted into a physical AWS Redshift model with appropriate partition and sort keys.
- Although data warehouse is supposed to be “write once, read many”; it was requested to allow the client team to make infrequent updates for records that are prone to be modified. Initial data for tables (multiple files for each table) was provided in a pre-agreed S3 location in csv files. Each row in the csv file had a logical operator indicating record was meant for Insertion, Updating or Deletion. Tables further had hierarchical relationships and dependencies between themselves which meant that any update or deletion logic in ETL jobs had to handle complex sequential atomic update scenarios.
- Glue jobs were created to move data from S3 to the staging layer (AWS Redshift tables acting as temporary storage for daily processing). At this stage, only columns needed for business reports were moved with changes in data types for some columns.
- Post having data in staging tables in AWS Redshift, AWS Redshift queries (as a part of cron job deployed on EC2 instance) were used to move data into warehouse tables in AWS Redshift itself. At this stage, aggregations and selections were performed. Job performance metrics and associated logs were stored in a metadata table and notifications were sent via email to the developer team on completion or failure of any step of the data pipeline.
We went live with the data warehouse in 6 months with 3 years of historical data. Incremental processes were set up to ingest data on an ongoing basis with an automated mechanism for handling any failures.