Client is an Indonesia-based mobile telecommunications services operator. The operator’s coverage includes Java, Bali, and Lombok as well as the principal cities in and around Sumatra, Kalimantan and Sulawesi. Client offers data communication, broadband Internet, mobile communication and 3G services over GSM 900 and GSM 1800 networks.
Current Scenario & Business Need:
Currently, client has an on premise Teradata Warehouse infrastructure to store their multi Terabytes of data which is increasing in volume day by day.
- The current Teradata infrastructure has already been exhausted by as much as 70% which prompted the client to look for alternative solution apart from scaling the Teradata warehouse vertically.
- The maintenance of on-premise infrastructure at such a scale is also a concern for client which made the client look for a cloud based data warehousing solution.
- The existing client setup used Ab Initio for ETL requirements which increased the cost of operations. The client wanted a low cost ETL solution for the cloud infrastructure.
- Client had an existing PowerBI licence which the client wanted to utilize to generate reports to data-driven business decisions.
Our team met business stakeholders to understand their current setup, overall business requirements and propose a solution. We decided to develop a scalable Data Warehouse, ETL and reporting solution for the client using AWS Cloud Services. We used
- AWS Redshift to create a Data Warehouse
- AWS Glue for the ETL process
- Existing PowerBI for reporting purposes
Here is the architecture diagram of the proposed solution:
Data Extraction & Storage:
We extracted the relevant data from Teradata warehouse and stored it into a Virtual Machine in CSV, Avro and Parquet formats to test the compatibility of different data storage formats with the ETL process – AWS Glue. The files were uploaded to S3 from the virtual machine via a secure channel to make sure that the data is not compromised during transit.
Our next step was to prepare the data, i.e., perform the logical operations on the data so that it can be stored in the data warehouse as well as could be used for reporting purposes with least/no manipulation. We wrote ETL scripts using pyspark and scheduled those scripts to be executed at some specific time using AWS Glue.
AWS Glue processed the data from S3, made necessary transformations and stored the resultant data into AWS Redshift. We wrote more data manipulation queries in SQL to further process the data stored in AWS Redshift and prepare it for reporting.
We used PowerBI to present the reporting data. We installed powerBI desktop on an EC2 instance so that the data can be accessed from AWS Redshift with minimal network latency and reports can be generated in seconds.
The reports developed were populated with data within 10 seconds due to current setup.