Best practices for Amazon Redshift Lambda User-Defined Functions

When you work with Lambda user-defined functions (UDFs) in Amazon Redshift, following best practices can help you streamline feature development and reduce common performance bottlenecks and unnecessary costs.

Which programming language could improve your UDF performance? How else can you benefit from batch processing? Which concurrency management considerations apply to your case? In this post, we answer these and other questions by providing a consolidated view of practices to improve your Lambda UDF efficiency. We explain how to choose a programming language, use existing libraries effectively, minimize payload sizes, manage return data, and take advantage of batch processing. We discuss scalability and concurrency considerations at both the account and per-function levels. Finally, we examine the benefits and nuances of using external services with your Lambda UDFs.

Background

Amazon Redshift is a fast, petabyte-scale cloud data warehouse service that makes it simple and cost-effective to analyze data using standard SQL and existing business intelligence tools.

AWS Lambda is a compute service that lets you run code without provisioning or managing servers. It supports a wide variety of programming languages and automatically scales your applications.

Amazon Redshift Lambda UDFs allow you to run Lambda functions directly from SQL, which unlocks capabilities such as external API integration, unified code deployment, better compute scalability, and cost separation.

Prerequisites

AWS account setup requirements
Basic Lambda function creation knowledge
Amazon Redshift cluster access and UDF permissions.

Performance optimization best practices

The following diagram provides visual references for the best practices described in this section.

Use efficient programming languages

You can choose from Lambda’s wide variety of runtime environments and programming languages. This choice affects both performance and billing. More performant code may help reduce the cost of Lambda compute and improve SQL query speed. Faster SQL queries could also help reduce costs for Redshift Serverless and potentially improve throughput for provisioned clusters, depending on your specific workload and configuration.

When choosing a programming language for your Lambda UDFs, benchmarks can help predict performance and cost implications. The well-known Benchmarks Game, hosted by Debian, publishes micro-benchmark results for different languages. For example, its Python vs. Go (Golang) comparison shows run times up to two orders of magnitude shorter and about half the memory consumption when Go is used instead of Python. That may translate into better Lambda UDF performance and lower Lambda costs in the corresponding scenarios.

Use existing libraries efficiently

For every language that Lambda supports, you can explore the available libraries to find ones that implement your tasks more efficiently in terms of speed and resource consumption. When you move logic to Lambda UDFs, review this aspect carefully.

For instance, if your Python function manipulates datasets, consider using the pandas library.

Avoid unnecessary data in payloads

Lambda limits request and response payload sizes to 6 MB for synchronous invocations. With that limit in mind, Redshift makes a best effort to batch values so that the number of batches (and hence Lambda calls) is minimal, which reduces communication overhead. Unnecessary data, such as arguments added for future use but not immediately actionable, can reduce the efficiency of this effort.
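To see why extra arguments matter, the following Python sketch estimates how many rows fit under the 6 MB limit with and without an unused argument. It is only a back-of-the-envelope estimate: it assumes rows are serialized as JSON arrays, ignores the request envelope, and does not reflect the actual batching logic inside Redshift. The sample values and the rows_per_payload helper are made up for illustration.

```python
import json

LAMBDA_PAYLOAD_LIMIT = 6 * 1024 * 1024  # 6 MB limit for synchronous invocations


def rows_per_payload(sample_row, limit=LAMBDA_PAYLOAD_LIMIT):
    """Rough number of rows shaped like sample_row that fit in one payload."""
    # +1 accounts for the comma separating rows in a JSON array.
    row_bytes = len(json.dumps(sample_row).encode("utf-8")) + 1
    return limit // row_bytes


# Hypothetical rows: one carries only the value the function needs,
# the other also carries a 200-character argument the function never reads.
lean_row = ["customer-000123"]
padded_row = ["customer-000123", "x" * 200]

lean_rows = rows_per_payload(lean_row)
padded_rows = rows_per_payload(padded_row)
print(f"rows per call without the extra argument: {lean_rows}")
print(f"rows per call with the extra argument:    {padded_rows}")
```

In this illustration, the unused argument cuts the number of rows per call by more than a factor of ten, which would mean correspondingly more Lambda invocations for the same query.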

Keep the returned data size in mind

From the point of view of Redshift, each Lambda function is a closed system, so it is impossible to know the size of the returned data before executing the function. If the returned payload is larger than the Lambda payload limit, Redshift has to retry with a smaller outbound batch. This continues until the returned payload fits. Although this is a best effort, the process might add notable overhead.

To avoid this overhead, you can use your knowledge of the Lambda code to directly set the maximum batch size on the Redshift side with the MAX_BATCH_SIZE clause in your Lambda UDF definition.
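As a rough illustration, the following Python sketch derives a MAX_BATCH_SIZE value from a known worst-case ratio between returned and received bytes, then renders a UDF definition that uses it. The expansion factor, the headroom value, and the names f_expand_text and expand-text-udf are assumptions made for the example. The statement follows the CREATE EXTERNAL FUNCTION syntax as we understand it, so check the clause details and the allowed size range against the Redshift documentation before using it.

```python
LAMBDA_PAYLOAD_LIMIT = 6 * 1024 * 1024  # 6 MB limit for synchronous invocations


def max_batch_size_kb(expansion_factor, headroom=0.8):
    """Outbound batch size (KB) whose worst-case response stays under the limit.

    expansion_factor: assumed worst-case ratio of returned bytes to sent bytes.
    headroom: fraction of the limit to target, leaving room for overhead.
    """
    budget_bytes = LAMBDA_PAYLOAD_LIMIT * headroom / expansion_factor
    return max(1, int(budget_bytes // 1024))


def render_udf_ddl(size_kb):
    """Render a UDF definition with a MAX_BATCH_SIZE clause (hypothetical names)."""
    return (
        "CREATE OR REPLACE EXTERNAL FUNCTION f_expand_text(VARCHAR)\n"
        "RETURNS VARCHAR\n"
        "STABLE\n"
        "LAMBDA 'expand-text-udf'\n"
        "IAM_ROLE default\n"
        f"MAX_BATCH_SIZE {size_kb} KB;"
    )


# Assumption: the function returns at most 4x the bytes it receives.
print(render_udf_ddl(max_batch_size_kb(expansion_factor=4)))
```

Setting the limit up front trades slightly smaller batches for the absence of retries, which tends to pay off when the function reliably returns more data than it receives.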

Take advantage of processing values in batches

Batched calls open new optimization opportunities for your UDFs. Because a batch of many values is passed to the function at once, you can apply various optimization techniques.

One example is memoization (result caching): your function avoids running the same logic on the same values, which reduces the total execution time. The standard Python library functools provides convenient caching and least recently used (LRU) caching decorators that implement exactly that.
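The following is a minimal sketch of memoization in a Python Lambda UDF handler using functools.lru_cache. The normalize_country function is a hypothetical stand-in for more expensive per-value logic, and the cache size is an arbitrary choice. The request and response field names (arguments, success, num_records, results, error_msg) follow the Lambda UDF JSON interface as we understand it, so verify them against the current Redshift documentation.

```python
import json
from functools import lru_cache


@lru_cache(maxsize=10_000)
def normalize_country(value):
    """Hypothetical per-value logic; imagine something expensive here."""
    if value is None:
        return None
    return value.strip().upper()


def lambda_handler(event, context):
    """Process one batch; repeated values are served from the cache."""
    try:
        # event["arguments"] holds one list of arguments per row.
        results = [normalize_country(row[0]) for row in event["arguments"]]
        response = {
            "success": True,
            "num_records": len(results),
            "results": results,
        }
    except Exception as exc:  # report the failure back to Redshift
        response = {"success": False, "error_msg": str(exc)}
    return json.dumps(response)
```

Because the decorated function lives at module level, the cache can also be reused by later invocations that land on the same warm execution environment, not only within a single batch. Memoization is only safe when the function is deterministic for a given input.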

Scalability and concurrency management

Increase the account-level concurrency

Redshift uses advanced congestion control to provide the best performance in a highly competitive environment. Lambda provides a default concurrency limit of 1,000 concurrent executions per AWS Region for an account. If that is not enough, you can request an account-level quota increase for Lambda concurrency, which can be as high as tens of thousands.

Note that even with restricted concurrency, our Lambda UDF implementation makes a best effort to minimize congestion and equalize the chances for function calls across Redshift clusters in your account.
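Before requesting a quota increase, it can help to check how much concurrency the account currently has. The sketch below reads the account-level limit and the unreserved portion through the Lambda GetAccountSettings API. The client is passed in as a parameter (for example, one created with boto3.client("lambda")), and the concurrency_headroom helper name is our own.

```python
def concurrency_headroom(lambda_client):
    """Return (account_limit, unreserved) concurrent executions for the Region."""
    limits = lambda_client.get_account_settings()["AccountLimit"]
    return (
        limits["ConcurrentExecutions"],
        limits["UnreservedConcurrentExecutions"],
    )


# Example usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   total, unreserved = concurrency_headroom(boto3.client("lambda"))
#   print(f"account limit: {total}, unreserved: {unreserved}")
```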

Restrict function concurrency with reserved concurrency

You might want to isolate some Lambda functions in a restricted concurrency scope. For example, you might have a data science team experimenting with embedding generation using Lambda UDFs, and you don’t want them to consume much of your account’s Lambda concurrency. In that case, you can set reserved concurrency for their specific functions to operate within.

Learn more about reserved concurrency in Lambda.
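As a minimal sketch, reserved concurrency can be set programmatically through the Lambda PutFunctionConcurrency API. The helper name, the function name embedding-udf-sandbox, and the limit of 50 are illustrative assumptions. The client is passed in as a parameter, for example one created with boto3.client("lambda").

```python
def cap_function_concurrency(lambda_client, function_name, limit):
    """Set reserved concurrency for one function and return the applied value."""
    response = lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )
    return response["ReservedConcurrentExecutions"]


# Example usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   cap_function_concurrency(boto3.client("lambda"), "embedding-udf-sandbox", 50)
```

Reserved concurrency acts both as a guarantee and as a ceiling for the function, so the experimental function in this example can never use more than the configured number of concurrent executions.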

Integration and external services

Call existing external services for optimal execution

In some cases, it might be worth using existing external services or components of your application instead of re-implementing the same tasks in the Lambda code. For example, you can use Open Policy Agent (OPA) for policy checking or the managed service Protegrity to protect your sensitive data. There are also a variety of services that provide hardware acceleration for computationally heavy tasks.

Note that some services have their own batching control with a limited batch size. For such cases, we implemented a per-function batch row count setting, MAX_BATCH_ROWS, as a clause in the Lambda UDF definition.

To learn more about external service interaction using Lambda UDFs, refer to the following links:
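The following Python sketch shows how a handler might forward a batch to a service that accepts a limited number of items per request. The limit of 25 and the call_service function are hypothetical placeholders for a real client. The request and response field names follow the Lambda UDF JSON interface as we understand it. If MAX_BATCH_ROWS in the UDF definition matches the service limit, each Lambda invocation maps to a single service call; the chunking loop is only a defensive fallback.

```python
import json

SERVICE_MAX_BATCH = 25  # hypothetical per-request limit of the external service


def call_service(values):
    """Stand-in for a real client call; rejects oversized batches."""
    if len(values) > SERVICE_MAX_BATCH:
        raise ValueError("batch exceeds service limit")
    return [None if v is None else len(v) for v in values]


def lambda_handler(event, context):
    """Forward the rows of one Redshift batch to the external service."""
    try:
        values = [row[0] for row in event["arguments"]]
        results = []
        # With MAX_BATCH_ROWS <= SERVICE_MAX_BATCH, this loop runs at most once.
        # The chunking only guards against a mismatched UDF definition.
        for start in range(0, len(values), SERVICE_MAX_BATCH):
            chunk = values[start:start + SERVICE_MAX_BATCH]
            results.extend(call_service(chunk))
        response = {
            "success": True,
            "num_records": len(results),
            "results": results,
        }
    except Exception as exc:  # report the failure back to Redshift
        response = {"success": False, "error_msg": str(exc)}
    return json.dumps(response)
```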

Conclusion

Lambda UDFs provide a way to extend your data warehouse capabilities. By implementing the best practices from this post, you may help optimize your Lambda UDFs for performance and cost efficiency. The key takeaways from this post are:

performance optimization, showing how to choose efficient programming languages and tools, minimize payload sizes, and leverage batch processing to reduce execution time and costs
scalability management, showing how to configure appropriate concurrency settings at both account and function levels to handle varying workloads effectively
integration efficiency, explaining how to benefit from external services to avoid reinventing functionality while maintaining optimal performance.

For more information, visit the Redshift documentation and explore the integration examples referenced in this post.

AWS Big Data Blog