Bulk Load Data into Aurora RDS

The Challenge

Amazon Aurora is a MySQL- and PostgreSQL-compatible relational database built for the cloud that combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open-source databases.

Amazon Aurora is up to five times faster than standard MySQL databases and three times faster than standard PostgreSQL databases. It provides the security, availability, and reliability of commercial databases at 1/10th the cost. Amazon Aurora is fully managed by Amazon Relational Database Service (RDS), which automates time-consuming administration tasks like hardware provisioning, database setup, patching, and backups.

Amazon Aurora features a distributed, fault-tolerant, self-healing storage system that auto-scales up to 64TB per database instance. It delivers high performance and availability with up to 15 low-latency read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across three Availability Zones (AZs).

The Client communicates with many manufacturers, receiving data through several channels (FTP, API, e-mail, etc.). A MuleSoft flow transforms the received data into a unified file format. One file can contain anywhere from several hundred to millions of records, usually around 5–20 GB of data. We wanted to speed up the data load into the database in a secure manner, so we decided to set up a data pipeline to automate the process and use S3 buckets for file uploads from the MuleSoft flow.

In theory, setting up a data pipeline to load data from an S3 bucket into an Aurora instance is very simple. In practice, it is a convoluted, multi-step process that is not as straightforward as it sounds. Welcome to the world of managed services.

Solution

InterWorks’ analysis and development team proposed and built a data load solution that loads data in parallel from multiple files into Aurora RDS while keeping track of each file’s status. These are the AWS services and components involved in the solution:

  • An S3 bucket structure for receiving files from different vendors;
  • A trigger that publishes a message to an SNS topic when a new file arrives in S3 (see the wiring sketch after this list);
  • A Lambda function that loads the data into Aurora RDS (MySQL engine);
  • A Lambda function that indexes the data in Elasticsearch;
  • A process that keeps track of the number of processed files using DynamoDB;
  • The database and Elasticsearch cluster are placed in a private subnet and are available only to internal applications with special permissions, so the data is completely secured.
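The event wiring between the S3 bucket, the SNS topic and the Lambda functions can be expressed in a handful of boto3 calls. This is only a minimal sketch: the bucket name, topic ARN, function ARN and the incoming/ key prefix are placeholders, and it assumes the topic and functions already exist (in the actual solution this wiring is created by the CloudFormation templates).

```python
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")
lambda_client = boto3.client("lambda")

BUCKET = "vendor-inbound-files"                                    # placeholder bucket name
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:new-file-topic"    # placeholder topic
LOADER_ARN = "arn:aws:lambda:us-east-1:123456789012:function:aurora-loader"  # placeholder

# 1. Publish an SNS message for every object created under the vendor prefix.
#    (The topic's access policy must also allow s3.amazonaws.com to publish; omitted here.)
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "TopicArn": TOPIC_ARN,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "incoming/"}]}
                },
            }
        ]
    },
)

# 2. Allow SNS to invoke the loader Lambda, then subscribe it to the topic.
#    The indexing Lambda is subscribed the same way.
lambda_client.add_permission(
    FunctionName=LOADER_ARN,
    StatementId="AllowSNSInvoke",
    Action="lambda:InvokeFunction",
    Principal="sns.amazonaws.com",
    SourceArn=TOPIC_ARN,
)
sns.subscribe(TopicArn=TOPIC_ARN, Protocol="lambda", Endpoint=LOADER_ARN)
```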

Our solution is based on AWS serverless Lambda functions for tracking, loading and indexing the data. The MuleSoft flow uploads files to S3 buckets under a predefined key. The SNS topic triggers two Lambda functions – one that uses Aurora’s LOAD DATA FROM S3 command to bulk-load data into Aurora RDS, and a second one that indexes data keys in Elasticsearch to provide user-friendly data search for the client. We leverage the atomicity of the DynamoDB item update operation, i.e. at any given moment only one process is allowed to modify a particular item. If multiple parallel database clients try to update the same item, DynamoDB serializes write access to that item. This keeps our file counts accurate even when multiple Lambda functions run in parallel.
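As an illustration, a loader Lambda along these lines combines the two ideas above: Aurora MySQL’s LOAD DATA FROM S3 statement for the bulk load, and an atomic DynamoDB counter for tracking. This is a minimal sketch rather than the production code: the table names, environment variables and key scheme are assumptions, pymysql has to be packaged with the function, and the Aurora cluster needs an IAM role that allows it to read from the bucket.

```python
import json
import os

import boto3
import pymysql  # assumed to be packaged with the Lambda deployment

dynamodb = boto3.resource("dynamodb")
tracking_table = dynamodb.Table(os.environ["TRACKING_TABLE"])  # hypothetical tracking table

def handler(event, context):
    # The SNS message wraps the original S3 event notification.
    s3_event = json.loads(event["Records"][0]["Sns"]["Message"])
    record = s3_event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    conn = pymysql.connect(
        host=os.environ["DB_HOST"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        database=os.environ["DB_NAME"],
    )
    try:
        with conn.cursor() as cur:
            # Aurora MySQL reads the file straight from S3, provided the cluster
            # has an IAM role with s3:GetObject permission on the bucket.
            cur.execute(
                f"LOAD DATA FROM S3 's3://{bucket}/{key}' "
                "INTO TABLE staging_records "          # hypothetical table name
                "FIELDS TERMINATED BY ',' "
                "LINES TERMINATED BY '\\n' "
                "IGNORE 1 LINES"
            )
        conn.commit()
    finally:
        conn.close()

    # Atomic counter update: DynamoDB serializes concurrent writes to the same
    # item, so parallel Lambda runs cannot lose an increment.
    tracking_table.update_item(
        Key={"batch_id": key.split("/")[0]},           # hypothetical key scheme
        UpdateExpression="ADD processed_files :one",
        ExpressionAttributeValues={":one": 1},
    )

    return {"status": "loaded", "file": key}
```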

The solution is built inside a custom secured VPC, deployed across separate public, private and data subnets. The assigned execution roles and policies have the minimum required permissions. Using CloudFormation templates, the solution can be quickly deployed to different AWS accounts or as different environments within the same AWS account.
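Deploying the same template as separate environments can be scripted with boto3; the stack naming scheme and the Environment parameter below are assumptions for illustration, not the actual template interface.

```python
import boto3

cloudformation = boto3.client("cloudformation")

def deploy(environment: str, template_body: str) -> str:
    """Create one stack per environment (e.g. dev, test, prod) from the same template."""
    response = cloudformation.create_stack(
        StackName=f"data-load-{environment}",           # hypothetical naming scheme
        TemplateBody=template_body,
        Parameters=[{"ParameterKey": "Environment", "ParameterValue": environment}],
        Capabilities=["CAPABILITY_NAMED_IAM"],           # the template creates IAM roles
    )
    return response["StackId"]
```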

Benefits and Results

  • Fast data transformation using MuleSoft and fast data load using AWS managed services (S3, Lambda functions and Aurora RDS).
  • Unifying the data with MuleSoft allows the AWS solution to be built once and reused across the client’s different providers.
  • Data can be loaded into Aurora RDS as a transactional or operational database, but also into ETL-ready databases such as Redshift (see the sketch after this list).
  • We built a tracking process that tracks file processing status and notifies the client about problems.
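For the Redshift variant, the same S3 files can be ingested with a COPY statement. The sketch below uses the Redshift Data API; the cluster identifier, database, table and IAM role ARN are placeholders rather than values from the actual deployment.

```python
import boto3

redshift_data = boto3.client("redshift-data")

def copy_file_to_redshift(bucket: str, key: str) -> str:
    """Kick off a COPY of one S3 file into a Redshift staging table (asynchronous)."""
    sql = (
        "COPY staging_records "                                            # hypothetical table
        f"FROM 's3://{bucket}/{key}' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role' "    # placeholder role
        "FORMAT AS CSV IGNOREHEADER 1"
    )
    response = redshift_data.execute_statement(
        ClusterIdentifier="analytics-cluster",   # placeholder
        Database="analytics",                    # placeholder
        DbUser="loader",                         # placeholder
        Sql=sql,
    )
    # Statement id; progress can be checked later with describe_statement.
    return response["Id"]
```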


See the complete case study here >>