Split/Join integration pattern in case of distributed systems

The Challenge

Each processing tool is dependent on the memory of the machine where it is running on. Even in cases of serverless processing using Lambda functions, we are constrained by the time that Lambda function can run. Some integration tools first read complete file into memory and then process it. When it comes to large files, there is not always enough memory or time. Natural approach in this case is a Split/Join integration pattern. What we are solving here is a ‘distributed’ approach of Split/Join using S3 based file data load and AWS services.

We can split one large file, process the chunks and upload them to S3 bucket. With the chunking approach, we solved the time and memory problem. Multiple chunks can be processed in parallel independently. For example, Lambda functions that will execute Aurora LOAD DATA FROM S3. Here is the tricky Join part. All chunks will be processed in parallel and we need to find a way how to proceed after they all complete.

 

 

 

Solution

InterWorks’ team came up with a solution based on:

  • Smart file naming of the CSV files created by integration tool,
  • DynamoDB table and its ADD update command

 

The integration tool will create files in the following format <<Client_Name>>_<<date>>-<<chunk_number>>-<<total_number_of_chunks>>.csv

Each DynamoDB item update operation is atomic, i.e. at a particular moment only one is allowed to modify particular item. If multiple parallel db clients try to update the same item, DynamoDB serializes the write access to that item. Each Lambda function that processes the S3 events will receive the filename. It will parse the file name and extract the total_number_of_chunks. Once the database load has been completed, the Lambda function will execute a command similar to the following:

var params = {

TableName:table,

Key:  {“file_name”: { S: file_name }   },

UpdateExpression: “SET total_parts = if_not_exists(total_parts, :total_parts) ADD processed_parts :processed_parts”,

ExpressionAttributeValues:{ “:total_parts”:{N:total_parts},  “:processed_parts”:{N:’1′}   },

ReturnValues:”UPDATED_NEW”

};

The Lambda function that will receive response where total_parts=processed_parts is our virtual Joint point. This Lambda function will continue the flow.

Benefits and Results

  • Proposed tracking mechanism enables out-of-the-box solution for Split/Join integration pattern in case of distributed systems;
  • Approach can be used in all scenarios where there is not enough memory or time for processing large files;
  • AWS solution is quickly available and you pay only for the running time of the resources.

 

See the complete case study here >>