AWS Step Functions - Scaling Large Tasks With Lambdas

Isaiah Weaver
Feb 15, 2024
Updated: Apr 6

Recently, a client project required us to process a set of files stored in S3, and we needed infrastructure that could scale to handle the job. We chose AWS Step Functions. A single Lambda can run for at most 15 minutes, but Step Functions provide a way to connect lambdas to complete larger tasks, which let us take advantage of Lambda’s scalability. Step Functions also help keep costs down because no compute instance has to run 24/7: an EC2 instance is billed for its entire running time, whereas here we pay only while the task is running.

Now that we have established the use case for Step Functions, let’s look at how they work.

How Step Functions Work

A Step Functions workflow is essentially a state machine: a series of steps (in our case, lambda functions) connected according to a set of state rules. Workflows allow for useful functionality, like ensuring specific actions only run once and reliably passing the output of one action as the input to the next to complete a series of dependent tasks.
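To make the chaining concrete, here is a minimal sketch of a two-step workflow written in Amazon States Language and built as a Python dictionary. The state names and Lambda ARNs are hypothetical placeholders, not values from our project. By default, the output of a Task state becomes the input of the state named in its Next field.

```python
import json

# Hypothetical Lambda ARNs, used only for illustration.
FIRST_ARN = "arn:aws:lambda:us-east-1:123456789012:function:first-step"
SECOND_ARN = "arn:aws:lambda:us-east-1:123456789012:function:second-step"


def build_chain_definition(first_arn: str, second_arn: str) -> dict:
    """Return a workflow definition in which FirstStep feeds SecondStep."""
    return {
        "Comment": "Two Lambda tasks run in sequence",
        "StartAt": "FirstStep",
        "States": {
            "FirstStep": {
                "Type": "Task",
                "Resource": first_arn,
                # The output of FirstStep is passed as input to SecondStep.
                "Next": "SecondStep",
            },
            "SecondStep": {
                "Type": "Task",
                "Resource": second_arn,
                "End": True,
            },
        },
    }


definition = build_chain_definition(FIRST_ARN, SECOND_ARN)
print(json.dumps(definition, indent=2))
```

The resulting JSON is what you would supply as the state machine definition, whether through the console or an infrastructure-as-code tool.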

Step Functions also have a feature called MapState (the Map state type), which runs the same set of steps once for each item in an input collection, with the iterations running in parallel. You’re still restricted by the 15-minute runtime limit for lambdas, meaning each individual task must be small. However, the purpose of MapState is to divide a large job into small pieces, so the runtime limit stops mattering when the work can be split that way. This feature worked perfectly in our scenario, as we could distribute the workload across multiple lambdas and consolidate the results afterward.

MapState has two main modes: Inline and Distributed. The main difference is the amount and size of the input data.

Inline mode is easy to set up and suits low concurrency needs or small amounts of data passed in and out of the MapState.
Inline mode takes its input from the workflow’s execution input (for example, the request that started it) or from a previous lambda in the workflow.
Distributed mode lets you fan out the items of a large file to up to 10,000 parallel workers when you need high concurrency or your input data is large.
It reads a JSON or CSV file from S3, parses each line (or each item, in the case of JSON), and gives that data to the worker lambdas.
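The difference between the two modes shows up mostly in the state definition. The sketch below builds one Map state of each kind as a Python dictionary. The worker ARN, bucket, key, and the `$.items` path are hypothetical, and the concurrency values are arbitrary examples rather than recommendations.

```python
def worker_processor(worker_arn: str, mode: dict) -> dict:
    """Sub-workflow run once per item: a single Lambda task."""
    return {
        "ProcessorConfig": mode,
        "StartAt": "Work",
        "States": {
            "Work": {"Type": "Task", "Resource": worker_arn, "End": True}
        },
    }


def inline_map_state(worker_arn: str, max_concurrency: int = 10) -> dict:
    """Inline mode: items come from an array in the state's own input."""
    return {
        "Type": "Map",
        "ItemsPath": "$.items",  # assumed location of the array in the input
        "MaxConcurrency": max_concurrency,
        "ItemProcessor": worker_processor(worker_arn, {"Mode": "INLINE"}),
        "End": True,
    }


def distributed_map_state(worker_arn: str, bucket: str, key: str,
                          max_concurrency: int = 1000) -> dict:
    """Distributed mode: items are read from a JSON file in S3."""
    return {
        "Type": "Map",
        "MaxConcurrency": max_concurrency,
        "ItemReader": {
            "Resource": "arn:aws:states:::s3:getObject",
            "ReaderConfig": {"InputType": "JSON"},
            "Parameters": {"Bucket": bucket, "Key": key},
        },
        "ItemProcessor": worker_processor(
            worker_arn, {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"}
        ),
        "End": True,
    }


# Hypothetical names, for illustration only.
WORKER_ARN = "arn:aws:lambda:us-east-1:123456789012:function:worker"
inline_state = inline_map_state(WORKER_ARN)
distributed_state = distributed_map_state(WORKER_ARN, "example-bucket", "input/items.json")
```

Only the Distributed version has an ItemReader; the Inline version relies entirely on data already present in the workflow state.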

For our use case, we needed to process X text files of Y lines each, where X was expected to be small but recurring and Y was unknown. We needed to make a network request for each line in the file and modify the output data accordingly. Given the time limit of Lambda runtime and the potentially large number of lines, we could not reliably have a single lambda process an entire file.
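A quick back-of-the-envelope check shows why a single lambda is risky here. The per-request latency below is a made-up figure for illustration, not a measurement from the project, and the calculation assumes the requests run one after another.

```python
LAMBDA_TIMEOUT_SECONDS = 15 * 60  # Lambda's maximum runtime


def max_lines_per_invocation(seconds_per_request: float) -> int:
    """How many sequential requests fit inside one Lambda invocation."""
    return int(LAMBDA_TIMEOUT_SECONDS // seconds_per_request)


def fits_in_one_lambda(line_count: int, seconds_per_request: float) -> bool:
    """True if processing every line sequentially stays under the timeout."""
    return line_count * seconds_per_request <= LAMBDA_TIMEOUT_SECONDS


# Hypothetical latency of half a second per request.
ASSUMED_SECONDS_PER_REQUEST = 0.5
limit = max_lines_per_invocation(ASSUMED_SECONDS_PER_REQUEST)
print(f"At most {limit} lines fit in a single invocation")
print(fits_in_one_lambda(1_000, ASSUMED_SECONDS_PER_REQUEST))
print(fits_in_one_lambda(5_000, ASSUMED_SECONDS_PER_REQUEST))
```

Because Y is unknown, any fixed ceiling like this one can be exceeded by a sufficiently long file, which is what makes per-line fan-out attractive.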

Using Step Functions To Solve The Problem

Our solution was to use a MapState in Distributed mode on the input file and run a group of lambdas in parallel to make the network requests and modify the output. Again, given the unknown and potentially large file size, Inline mode was not an option. Luckily, the client wanted to use S3 for file upload, so Distributed mode was easy to set up.

Our workflow had a 3-step design (graph from AWS console shown below): the pre-process, worker, and post-process tasks.

The pre-process task reads the file from S3, maps the input data to JSON format for compatibility with Distributed mode, and writes it back to S3.
Next, a Distributed MapState reads the JSON file from S3 and processes each line with a pool of worker-task lambdas, which means each Lambda only needs to make a single request, thus avoiding the 15-minute timeout.
Finally, the post-process task creates a new file based on the results from the worker task and uploads it to S3. It also takes care of some temporary files we needed to create in the worker task to keep track of the Distributed MapState output.
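The three tasks can be sketched as plain functions. This is a simplified illustration rather than our production code: the record shape (`index`, `line`, `result`) is an assumption, the S3 reads and writes are omitted, and the network call is passed in as a `fetch` function so the example runs without any external service.

```python
import json
from typing import Callable, List


def preprocess(text: str) -> str:
    """Turn raw file text into a JSON array with one item per non-empty line."""
    items = [
        {"index": i, "line": line}
        for i, line in enumerate(text.splitlines())
        if line.strip()
    ]
    return json.dumps(items)


def worker(item: dict, fetch: Callable[[str], str]) -> dict:
    """Handle a single item: make one request and return the modified record."""
    return {"index": item["index"], "result": fetch(item["line"])}


def postprocess(results: List[dict]) -> str:
    """Merge worker results into one output file, in original line order."""
    ordered = sorted(results, key=lambda record: record["index"])
    return "\n".join(record["result"] for record in ordered) + "\n"


def fake_fetch(line: str) -> str:
    """Stand-in for the real network request."""
    return line.upper()


items = json.loads(preprocess("alpha\nbeta\n"))
# Workers finish in no particular order, so simulate that by reversing.
results = [worker(item, fake_fetch) for item in reversed(items)]
output = postprocess(results)
print(output, end="")
```

Carrying the line index through each record is one way to restore the original order after the parallel step, since the workers do not finish in sequence.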

Step Function Workflow (Taken from AWS Console)
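For readers who cannot see the graph, a definition along the following lines would produce the same three-box shape. It is a sketch under assumptions: the state names, ARNs, bucket, and key are hypothetical, and details such as retries and error handling are left out.

```python
import json


def build_workflow(pre_arn: str, worker_arn: str, post_arn: str,
                   bucket: str, key: str) -> dict:
    """Pre-process task, then a Distributed Map over S3 items, then post-process."""
    return {
        "StartAt": "PreProcess",
        "States": {
            "PreProcess": {
                "Type": "Task",
                "Resource": pre_arn,
                "Next": "ProcessLines",
            },
            "ProcessLines": {
                "Type": "Map",
                # Read the JSON file written by the pre-process task.
                "ItemReader": {
                    "Resource": "arn:aws:states:::s3:getObject",
                    "ReaderConfig": {"InputType": "JSON"},
                    "Parameters": {"Bucket": bucket, "Key": key},
                },
                "ItemProcessor": {
                    "ProcessorConfig": {
                        "Mode": "DISTRIBUTED",
                        "ExecutionType": "STANDARD",
                    },
                    "StartAt": "Worker",
                    "States": {
                        "Worker": {
                            "Type": "Task",
                            "Resource": worker_arn,
                            "End": True,
                        }
                    },
                },
                "Next": "PostProcess",
            },
            "PostProcess": {
                "Type": "Task",
                "Resource": post_arn,
                "End": True,
            },
        },
    }


# Hypothetical ARNs and S3 location, for illustration only.
ARN_PREFIX = "arn:aws:lambda:us-east-1:123456789012:function:"
workflow = build_workflow(
    ARN_PREFIX + "pre-process",
    ARN_PREFIX + "worker",
    ARN_PREFIX + "post-process",
    "example-bucket",
    "preprocessed/items.json",
)
print(json.dumps(workflow, indent=2))
```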

AWS Step Functions proved to be the right solution for our specific use case. The ability to trigger intricate tasks on demand, without the overhead of a constantly running server, kept costs down while letting us accomplish tasks that would have been infeasible with a single lambda. Several months later, our client continues to use this solution with no issues.
