AWS HPC Blog
Enhanced Performance for Whisper Audio Transcription on AWS Batch and AWS Inferentia
In our first blog, Whisper audio transcription powered by AWS Batch and AWS Inferentia, we presented a solution for reducing the cost of Whisper audio transcription by processing it asynchronously as a batch workload using AWS Batch. Since our initial release, we’ve implemented several key optimizations that significantly boost the performance and cost efficiency of running Whisper on AWS Inferentia.
In this post, we’ll review the key optimizations and performance gains for our Whisper audio transcription solution.
Performance Improvements at a Glance
- 50% reduction in processing time: Tasks that previously took 20 minutes now complete in 10 minutes.
- 50% reduction in resources: Achieve the same results with half the computing resources.
- Enhanced cost efficiency: Better resource utilization through multiple jobs per Inferentia2 chip.
Key Optimizations
Let’s look at each of the key optimizations.
Preloaded Model Files
Model file loading accounts for a significant portion of each job’s runtime, especially when download bandwidth from external sources is limited. In the previous version, the model files were fetched remotely at runtime for every single Batch job. By caching the model files inside the container image, we’ve eliminated these redundant downloads and significantly reduced the initialization overhead of each transcription job.
We achieve this by preloading the model files while building the container image, which is then stored in Amazon Elastic Container Registry (Amazon ECR). Pulling the image from Amazon ECR provides much higher bandwidth than downloading the model from an external source at runtime.
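As an illustration, here is a minimal sketch of a build-time preload step. It assumes the Hugging Face openai/whisper-large-v3 checkpoint, the huggingface_hub library, and a /opt/models cache path; the model, library, and path used by the sample repository may differ.

```python
# preload_model.py - run during `docker build` (for example, RUN python preload_model.py)
# so the Whisper weights are baked into the image instead of being fetched at job runtime.
from huggingface_hub import snapshot_download

# Hypothetical model ID and cache directory; adjust to match the solution's actual model.
snapshot_download(
    repo_id="openai/whisper-large-v3",
    local_dir="/opt/models/whisper-large-v3",
)
```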
Projection Layer Optimization
In the model’s projection layer, we’ve optimized how sequence lengths are handled. A padding and unpadding mechanism processes all inputs at a fixed maximum length, creating uniform-sized batches that AWS Inferentia hardware accelerators can process efficiently. By combining consistent sequence lengths with hardware-specific optimizations, we’ve achieved a 30% improvement in model inference speed. The padding adds minimal overhead in exchange for significantly better hardware utilization, and the padded tokens are removed before downstream processing.
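To illustrate the idea, here is a minimal padding and unpadding sketch using PyTorch tensors. The function names and the maximum length are placeholders for illustration, not the solution’s actual code.

```python
import torch

MAX_LEN = 448  # placeholder fixed maximum sequence length

def pad_to_max(token_ids: torch.Tensor, pad_id: int = 0):
    """Right-pad a 1-D token tensor to MAX_LEN; return the padded tensor and the true length."""
    true_len = token_ids.shape[0]
    padded = torch.full((MAX_LEN,), pad_id, dtype=token_ids.dtype)
    padded[:true_len] = token_ids
    return padded, true_len

def unpad(output: torch.Tensor, true_len: int):
    """Drop the padded positions before downstream processing."""
    return output[:true_len]

# Uniform input lengths let the accelerator run the projection layer with a single, fixed shape.
ids = torch.tensor([50258, 50259, 50359])
padded, n = pad_to_max(ids)
logits = padded          # stand-in for the projection-layer output
trimmed = unpad(logits, n)
```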
Improved Resource Utilization
Each Inferentia2 chip contains two Neuron cores, so using both cores efficiently is critical for maximizing performance and cost efficiency.
In our previous blog post, we demonstrated how to track Inferentia2 Neuron core utilization through Amazon CloudWatch metrics. Our analysis revealed a significant inefficiency: one of the two Neuron cores on each Inferentia2 chip remained largely idle because each Batch job was allocated all of the host’s resources. Our latest solution update addresses this by reducing the host-level vCPU and memory requirements for each job. This allows AWS Batch to allocate one Neuron core per transcription job, effectively doubling capacity by running two jobs concurrently on a single AWS Inferentia2 chip. As illustrated in Figure 1 below, this change maximizes hardware utilization while delivering substantial cost savings for your inference workloads.

Figure 1. Improved Neuron core utilization with updated solution
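For illustration, the following is a hedged boto3 sketch of registering an AWS Batch job definition with reduced host resources. The job definition name, image URI, vCPU and memory values, and the Neuron device mapping are assumptions for this sketch; the sample repository’s job definition may partition Neuron cores differently (for example, via Neuron runtime configuration).

```python
import boto3

batch = boto3.client("batch")

# Request roughly half of the host's vCPU and memory so that two transcription jobs
# can be placed on the same Inferentia2 host, each using one Neuron core.
batch.register_job_definition(
    jobDefinitionName="whisper-transcribe-one-core",  # hypothetical name
    type="container",
    containerProperties={
        "image": "<account>.dkr.ecr.<region>.amazonaws.com/whisper-neuron:latest",  # placeholder
        "resourceRequirements": [
            {"type": "VCPU", "value": "2"},       # assumed reduced vCPU allocation
            {"type": "MEMORY", "value": "8192"},  # assumed reduced memory (MiB)
        ],
        "linuxParameters": {
            # Expose a Neuron device to the container; how cores on the chip are split
            # between concurrent jobs follows the solution's actual configuration.
            "devices": [
                {
                    "hostPath": "/dev/neuron0",
                    "containerPath": "/dev/neuron0",
                    "permissions": ["READ", "WRITE"],
                }
            ]
        },
    },
)
```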
Dynamic ECS Neuron AMI Integration
AWS Systems Manager Parameter Store provides secure, hierarchical storage for configuration data and secrets. We use the public Parameter Store parameter for the latest recommended ECS Neuron AMI, which enables automatic retrieval and deployment of the most up-to-date, optimized base image: /aws/service/ecs/optimized-ami/amazon-linux-2023/neuron/recommended/image_id
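For example, you can resolve the current AMI ID yourself with a short boto3 call. This is a minimal sketch; the solution’s CloudFormation template accepts the resulting value as the “Latest ECS Neuron AMI image ID” parameter described in the deployment steps below.

```python
import boto3

ssm = boto3.client("ssm")

# Public parameter for the recommended ECS-optimized Neuron AMI (Amazon Linux 2023).
PARAM = "/aws/service/ecs/optimized-ami/amazon-linux-2023/neuron/recommended/image_id"

ami_id = ssm.get_parameter(Name=PARAM)["Parameter"]["Value"]
print(ami_id)  # e.g., ami-0123456789abcdef0
```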
Getting Started with the Enhanced Solution
As described in the previous sections, these optimizations significantly improve utilization. Most notably, reducing the host-level vCPU and memory requirements lets AWS Batch allocate one Neuron core per job and run two transcription jobs concurrently on a single Inferentia2 chip, doubling effective capacity. Together with model preloading and the padding optimization, these enhancements cut processing time for a one-hour audio file by 50% (from 20 minutes to 10 minutes per task) while also halving resource consumption, resulting in substantially better cost efficiency for Whisper audio transcription workloads.
You can use the CloudFormation template to accelerate the deployment of this solution with the following steps. Alternatively, you can run the buildArch.sh script from the code repository to deploy the infrastructure automatically to the default Virtual Private Cloud (VPC), or deploy programmatically as sketched after the steps below.
- Choose the following Launch Stack link to launch the solution in your preferred AWS Region: Launch Stack
- For Stack name, enter a unique stack name.
- Set the parameters, which include:
- VPC ID
- VPC Subnet IDs (Private subnets recommended)
- VPC security group IDs
- VPC route table IDs
- Default Queue min vCPU count
- Default Queue max vCPU count
- EBS boot size (GiB) of root volume
- Latest ECS Neuron AMI image ID
- Acknowledge the capabilities.
- Choose Create stack.
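If you prefer to script the deployment instead of using the console, the following boto3 sketch mirrors the steps above. The template URL and parameter key names are assumptions for illustration; check the CloudFormation template in the sample repository for the actual keys and defaults.

```python
import boto3

cfn = boto3.client("cloudformation")

# Hypothetical parameter keys and placeholder values -- match them to the actual template.
params = {
    "VpcId": "vpc-0123456789abcdef0",
    "SubnetIds": "subnet-aaaa1111,subnet-bbbb2222",      # private subnets recommended
    "SecurityGroupIds": "sg-0123456789abcdef0",
    "RouteTableIds": "rtb-0123456789abcdef0",
    "MinvCpus": "0",
    "MaxvCpus": "256",
    "EbsBootSize": "100",                                 # GiB
    "EcsNeuronAmiId": "ami-0123456789abcdef0",            # or resolve via the SSM parameter shown earlier
}

cfn.create_stack(
    StackName="whisper-batch-inferentia",                 # any unique stack name
    TemplateURL="https://<bucket>.s3.amazonaws.com/whisper-batch.yaml",  # placeholder
    Parameters=[{"ParameterKey": k, "ParameterValue": v} for k, v in params.items()],
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],  # acknowledge the capabilities
)
```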
For a step-by-step guide to implementing this enhanced solution, refer to the README.md file in the accompanying AWS Samples repository on GitHub.
Conclusion
Watch this space for future updates as we explore additional ways to enhance performance and reduce costs. Do you have questions or feedback on implementing this update? Please email us at ask-hpc@amazon.com and tell us about your experience!