AWS Public Sector Blog

Efficient large-scale serverless data processing for slow downstream systems

Public sector agencies around the world are facing an ever-growing deluge of data. A significant portion of this data is semi-structured and organized into formats such as CSV or JSON files. Managing and analyzing this data efficiently is crucial for effective governance and service delivery. Traditional data processing methods often struggle to handle this volume efficiently, leading to processing delays and system overload. This post demonstrates how to build serverless workflows on Amazon Web Services (AWS) that process such data using AWS Step Functions and integrate with downstream systems that have concurrency limitations.

The education data management challenge

Consider a state-wide education department that needs to process and analyze millions of student records, including academic performance data, attendance records, standardized test scores, and administrative information. These records must be processed efficiently to:

  • Track student progress across multiple academic years
  • Identify at-risk students who might need additional support
  • Generate insights for resource allocation across schools
  • Create personalized learning recommendations
  • Comply with federal and state reporting requirements

The challenge becomes particularly complex when the department needs to integrate this processing with existing student information systems and legacy educational databases that have limited capacity to handle concurrent requests.

AWS Step Functions distributed map

AWS Step Functions provides a powerful feature called Map for running workflow steps concurrently. Map offers two distinct processing modes: Inline and Distributed.

In Inline mode, the Map state accepts an array of JSON objects as input and runs the configured steps in parallel for each item in the array. This mode supports up to 40 concurrent iterations, which can be limiting for large-scale data processing: larger datasets must be worked through in smaller batches, increasing overall processing time.

To address the limitations of Inline mode, Step Functions offers a more scalable alternative called Distributed Map. This mode lets you run a piece of logic many times in parallel at a much higher scale: up to 10,000 concurrent executions of the Map state, significantly enhancing performance for large-scale data processing. It also offers more flexibility in data input, processing data directly from Amazon Simple Storage Service (Amazon S3) in the form of S3 objects, CSV files, or JSON files. These capabilities make Distributed Map particularly well suited for high volumes of data that need to be processed efficiently and quickly. You can find more details about this feature in Using Map state in Distributed mode for large-scale parallel workloads in Step Functions and Step Functions Distributed Map – A Serverless Solution for Large-Scale Parallel Data Processing.
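To make the mode concrete, the following is a minimal sketch of a Distributed Map state in Amazon States Language, expressed here as a Python dict. The state names, bucket, key, and Lambda integration are hypothetical placeholders, not a definitive implementation.

```python
# Sketch of a Distributed Map state (ASL as a Python dict).
# All names (ProcessStudentRecords, ValidateRecord, bucket/key) are hypothetical.
distributed_map_state = {
    "ProcessStudentRecords": {
        "Type": "Map",
        # ItemReader streams items straight from a CSV object in S3,
        # so the full dataset never has to fit in the workflow payload.
        "ItemReader": {
            "Resource": "arn:aws:states:::s3:getObject",
            "ReaderConfig": {"InputType": "CSV", "CSVHeaderLocation": "FIRST_ROW"},
            "Parameters": {"Bucket": "education-data-bucket", "Key": "grades/semester-end.csv"},
        },
        # DISTRIBUTED mode runs each item in its own child workflow execution.
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
            "StartAt": "ValidateRecord",
            "States": {
                "ValidateRecord": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "End": True,
                },
            },
        },
        "MaxConcurrency": 1000,  # up to 10,000 in Distributed mode
        "End": True,
    }
}
```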

For example, when processing end-of-semester grade reports for millions of students across thousands of schools, the Distributed Map capability can parallelize the processing, reducing what might have been days of sequential processing into hours or even minutes.

The integration challenge

A typical education data processing architecture might work as follows:

  1. Large CSV files containing student data arrive in an S3 bucket (for example, nightly data dumps from schools)
  2. Each CSV file triggers a Step Functions workflow with a distributed map
  3. Each iteration processes a student record and needs to:
    • Validate the data format
    • Check for missing information
    • Update the student information system
    • Generate progress reports
    • Update analytics API

The following diagram illustrates this architecture.

Figure 1. Example architecture showing an education file processing system with Step Functions that is integrated with a legacy system. Note: For the purpose of this blog, let’s assume that the Legacy worker is running on Amazon Elastic Kubernetes Service (Amazon EKS).
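The first two per-record steps, validating the data format and checking for missing information, can be sketched as a small Python function. The field names here are hypothetical placeholders for whatever schema the student CSV actually uses.

```python
# Hypothetical required fields for a student record; the real schema
# would come from the education department's data dictionary.
REQUIRED_FIELDS = ("student_id", "school_id", "grade")

def validate_record(record: dict) -> list:
    """Return the names of required fields that are missing or empty.

    An empty list means the record passed validation and can proceed
    to the downstream update and reporting steps.
    """
    return [field for field in REQUIRED_FIELDS if not record.get(field)]
```

Each Distributed Map iteration could call a function like this first and route invalid records to a dead-letter location instead of the legacy system.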

Many existing student information systems and educational databases weren’t designed to handle thousands of concurrent requests. When processing education records at scale, we face two levels of concurrency challenges:

  1. Within-workflow concurrency: Step Functions Distributed Map can process thousands of records concurrently within a single workflow execution, potentially overwhelming downstream systems.
  2. Concurrent workflow executions: In production environments, multiple workflow executions might run simultaneously (for example, processing different batches of student data, handling different school districts, or running different types of processing on the same student database). This compounds the concurrency issue because these separate workflows are unaware of each other but impact the same downstream systems.

Step Functions provides mechanisms such as setting the MaxConcurrency parameter and batching the input items of the Map state to control concurrency within a single workflow. Step Functions also provides built-in capabilities to configure automatic retries and error handling when dealing with slower downstream systems. However, when multiple instances of the same workflow run simultaneously, additional concurrency controls across these parallel executions become necessary.
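As an illustration of the in-workflow controls just mentioned, the fragments below show (as Python dicts) a MaxConcurrency cap with item batching on the Map state, and a Retry policy with exponential backoff on a downstream Task state. The concrete values and the custom error name Legacy.ThrottlingError are hypothetical.

```python
# In-workflow throttling on the Map state: cap parallel child executions
# and group records so each invocation handles a batch, not a single item.
map_controls = {
    "MaxConcurrency": 50,
    "ItemBatcher": {"MaxItemsPerBatch": 100},
}

# Automatic retries with exponential backoff on the Task state that calls
# the slow downstream system ("Legacy.ThrottlingError" is a placeholder).
task_retry = {
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed", "Legacy.ThrottlingError"],
            "IntervalSeconds": 5,
            "MaxAttempts": 5,
            "BackoffRate": 2.0,
        }
    ]
}
```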

This post explores three approaches to handle this challenge:

  1. Concurrency control with external data store
  2. Concurrency control with a queue
  3. Concurrency control with activities

Concurrency control with external data store

In the context of a state-wide education department, legacy student information systems and educational databases often have limited capacity to handle high volumes of concurrent requests. To address this, you can use an external data store such as Amazon DynamoDB to implement a locking mechanism. This prevents the Step Functions Distributed Map from overwhelming these systems while processing millions of student records.

For example, when processing end-of-semester grade reports for millions of students, the Step Functions workflow can first attempt to acquire a lock stored in DynamoDB. If the lock is unavailable (indicating that another workflow is already processing data), the workflow waits for a specified time before retrying. After the lock is acquired, the Distributed Map step begins processing student records in parallel. After completing the processing, the workflow releases the lock, allowing the next batch of records to be processed.
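The lock semantics described above can be sketched as follows. In the real workflow the check-and-increment would be a single DynamoDB UpdateItem call with a ConditionExpression such as "currentlockcount < :limit", which makes the operation atomic; here an in-memory class stands in for the table so the acquire/release logic can be shown end to end. Class and attribute names are hypothetical.

```python
class ConcurrencyLock:
    """In-memory stand-in for a DynamoDB counter-based lock item."""

    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0  # mirrors the item's currentlockcount attribute

    def try_acquire(self) -> bool:
        # In DynamoDB, a conditional update makes this check-and-increment
        # atomic across concurrent workflow executions.
        if self.count < self.limit:
            self.count += 1
            return True
        return False  # condition failed: the workflow waits, then retries

    def release(self) -> None:
        # Decrement on completion so the next waiting workflow can proceed.
        self.count = max(0, self.count - 1)
```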

This approach protects legacy systems from overload, even during large-scale data processing tasks like updating student records, generating progress reports, or syncing data with external systems. It proves especially valuable when the department must meet federal and state reporting requirements while maintaining system stability.

The following diagram shows the architecture: a Step Functions Distributed Map triggered by a file arriving in an S3 bucket, with locking logic that controls the number of concurrent executions using a counter stored in the database.

Figure 2. Example architecture showing concurrency control with an external data store. Note: For the purpose of this blog, let’s assume that the Legacy worker is running on Amazon EKS.

To explore this option with a working sample and learn how to handle failure scenarios, refer to Controlling concurrency in distributed systems using AWS Step Functions in the AWS Compute Blog. This architecture presents an effective way to control the concurrency in a distributed system by integrating with an external data store such as DynamoDB.

This approach offers a configurable mechanism for managing concurrency across multiple Step Functions workflows. It also provides easy tracking of current concurrent executions through simple queries to the external data store.

Concurrency control with a queue

An alternative approach is to introduce a buffering mechanism such as an Amazon Simple Queue Service (Amazon SQS) queue between Step Functions and the downstream system. Amazon SQS is a fully managed and highly scalable message queuing service. Standard SQS queues offer nearly unlimited throughput, automatic scaling, and built-in resiliency. Integrating Step Functions Distributed Map with Amazon SQS is therefore a practical way to handle highly concurrent requests.

In this architecture, two SQS queues, Queue 1 and Queue 2, act as buffers between the workflow steps and the corresponding external APIs. The Step Functions workflow’s Distributed Map can process the incoming file at very high concurrency and queue up the requests to the APIs. You can configure an AWS Lambda function to consume the messages from each queue using event source mapping, which provides a seamless way for Lambda functions to integrate with and consume messages from messaging and streaming sources such as Amazon SQS. When consuming messages from Amazon SQS, you can configure maximum concurrency for the Lambda function, giving you direct control over the rate at which messages are consumed from the queue. You can read more about this mechanism in Introducing maximum concurrency of AWS Lambda functions when using Amazon SQS as an event source.
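The maximum-concurrency setting lives on the event source mapping that attaches the consumer Lambda function to the queue. The following is a sketch of the parameters you might pass to the Lambda CreateEventSourceMapping API; the ARN, function name, and values are placeholders (MaximumConcurrency accepts values from 2 to 1,000).

```python
# Hypothetical parameters for lambda_client.create_event_source_mapping(...)
# attaching the consumer function to Queue 1 with a concurrency cap.
event_source_mapping_params = {
    "EventSourceArn": "arn:aws:sqs:us-east-1:123456789012:queue-1",
    "FunctionName": "legacy-api-consumer",
    "BatchSize": 10,  # messages handed to each invocation
    # Caps concurrent Lambda invocations for this queue, which in turn
    # caps the request rate against the legacy API behind it.
    "ScalingConfig": {"MaximumConcurrency": 5},
}
```

Because each queue gets its own mapping, Queue 1 and Queue 2 can be tuned independently to match what each downstream API tolerates.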

The following architecture diagram shows a potential implementation of this approach with our current example.

Figure 3. Example architecture showing a Step Functions distributed map triggered by a file arriving in an S3 bucket. Each distributed map step puts a message on an SQS queue. A Lambda function consumes the message and invokes the legacy API.

In this approach, you’re integrating Step Functions with the downstream system asynchronously. Step Functions provides the wait-for-callback integration pattern (.waitForTaskToken) to handle these kinds of integrations: Step Functions passes a callback task token when the message is inserted into the SQS queue, and the Lambda function that consumes the message uses that token to signal Step Functions that the downstream service finished handling the request.
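A minimal sketch of that callback in the consumer Lambda function follows. For testability the Step Functions client is passed in as a parameter; in a real handler it would be created at module scope with boto3.client("stepfunctions"), and call_legacy_api is a placeholder for the actual call into the legacy system.

```python
import json

def call_legacy_api(payload):
    # Placeholder for the request to the legacy student information system.
    return {"status": "updated", "student_id": payload.get("student_id")}

def handler(event, sfn_client):
    # Each SQS record carries the payload plus the task token that the
    # workflow attached when it enqueued the message.
    for record in event["Records"]:
        body = json.loads(record["body"])
        token = body["taskToken"]
        try:
            result = call_legacy_api(body["payload"])
            # Resume the waiting workflow execution with the result.
            sfn_client.send_task_success(taskToken=token, output=json.dumps(result))
        except Exception as exc:
            # Surface the failure so the workflow's Retry/Catch logic can act.
            sfn_client.send_task_failure(taskToken=token, error="LegacyApiError", cause=str(exc))
```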

This approach provides the flexibility to configure distinct concurrency settings for different downstream APIs, making it particularly suitable for scenarios with frequent temporary load spikes.

Concurrency control with activities

Another effective mechanism for rate limiting in Step Functions workflows is the use of Step Functions activities. Activities provide a way to coordinate tasks between Step Functions and external workers, allowing fine-grained control over task execution rates. In this example, the legacy systems act as workers that poll a Step Functions activity to receive work. The architecture for this mechanism is illustrated in the following diagram.

Figure 4. Example architecture showing concurrency control with an activity state in a Step Functions workflow. Note: For the purpose of this blog, let’s assume that the Legacy worker is running on Amazon EKS.
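A worker's polling loop can be sketched as follows. Because the worker pulls work with GetActivityTask (which long-polls for up to about a minute and returns an empty task token when nothing is scheduled), the legacy system sets its own processing rate. The client is injected for testability; the activity ARN, worker name, and processing callable are hypothetical.

```python
import json

def poll_once(sfn_client, activity_arn, worker_name, process):
    """Fetch and handle at most one activity task; return True if work was done."""
    task = sfn_client.get_activity_task(activityArn=activity_arn, workerName=worker_name)
    token = task.get("taskToken")
    if not token:
        return False  # long poll timed out with no work scheduled
    try:
        result = process(json.loads(task["input"]))
        sfn_client.send_task_success(taskToken=token, output=json.dumps(result))
    except Exception as exc:
        sfn_client.send_task_failure(taskToken=token, error="ProcessingError", cause=str(exc))
    return True
```

A worker would typically run this in a loop (optionally with several threads, each counting as one unit of concurrency), which is what makes the rate limit precise: the workflow can never push work faster than the workers poll for it.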

This approach is ideal for scenarios requiring precise control over downstream processing rates, offering a straightforward rate-limiting solution using native Step Functions features.

Conclusion

In this post, you’ve learned how to use AWS Step Functions to process large-scale semi-structured data without creating and maintaining underlying infrastructure. You’ve also learned how to control the concurrency of such processes when dealing with slow downstream systems. By implementing these patterns for education data management, school districts and education departments can efficiently process large volumes of student data while respecting the limitations of their legacy student information systems. This enables them to modernize their data processing capabilities without overwhelming existing infrastructure, ultimately leading to better educational outcomes through data-driven decision-making.

To learn more about distributed data processing with Step Functions, visit Serverless Land.

Subscribe to the AWS Public Sector Blog newsletter to get the latest information on AWS tools, solutions, and innovations from the public sector delivered to your inbox, or contact us.

Aneel Murari

Aneel is a senior solutions architect for AWS, based in Washington, DC. He works with nonprofit financial institutions and other public sector customers, helping them design scalable, high-performing, and secure solutions on AWS. He has over 20 years of software development and architecture experience and holds a graduate degree in computer science.

Gowtham Shankar

Gowtham is a solutions architect at AWS. He’s passionate about working with customers to design and implement cloud-based architectures to address business challenges effectively. Gowtham actively engages in various open-source projects, collaborating with the community to drive innovation.