AWS HPC Blog

Category: AWS Batch

Rearchitecting AWS Batch managed services to leverage AWS Fargate

AWS service teams continuously improve the underlying infrastructure and operations of managed services, and AWS Batch is no exception. The AWS Batch team recently moved most of their job scheduler fleet to a serverless infrastructure model leveraging AWS Fargate. I had a chance to sit with Devendra Chavan, Senior Software Development Engineer on the AWS Batch team, to discuss the move to AWS Fargate and its impact on the Batch managed scheduler service component.

Blender on Batch

Efficient and cost-effective rendering pipelines with Blender and AWS Batch

This blog post explains how to run parallel rendering workloads and produce an animation in a cost and time effective way using AWS Batch and AWS Step Functions. AWS Batch manages the rendering jobs on Amazon Elastic Compute Cloud (Amazon EC2), and AWS Step Functions coordinates the dependencies across the individual steps of the rendering workflow. Additionally, Amazon EC2 Spot instances can be used to reduce compute costs by up to 90% compared to On-Demand prices.

Getting Started with NVIDIA Clara Parabricks on AWS Batch using AWS CloudFormation

In this blog post, we’ll show how you can run NVIDIA Parabricks on AWS Batch leveraging AWS CloudFormation templates. Parabricks is a GPU-accelerated tool for secondary genomic analysis. It reduces the runtime of variant calling on a 30x human genome from 30 hours to just 30 minutes, and leverages AWS Batch to provide an interface that scales compute jobs across multiple instances in the cloud.

Understanding the AWS Batch termination process

In this blog post, we help you understand the AWS Batch job termination process and how you may take actions to gracefully terminate a job by capturing SIGTERM signal inside the application. It provides you with an efficient way to exit your Batch jobs. You also get to know about how job timeouts occur, and how the retry operation works with both traditional AWS Batch jobs and array jobs.

Bayesian ML Models at Scale with AWS Batch

Ampersand is a data-driven TV advertising technology company that provides aggregated TV audience impression insights and planning on 42 million households, in every media market, across more than 165 networks and apps and in all dayparts (broadcast day segments). The Ampersand Data Science team estimated that building their statistical models would require up to 600,000 physical CPU hours to run, which would not be feasible without using a massively parallel and large-scale architecture in the cloud. AWS Batch enabled Ampersand to compress their time of computation over 500x through massive scaling while optimizing their costs using Amazon EC2 Spot. In this blog post, we will provide an overview of how Ampersand built their TV audience impressions (“impressions”) models at scale on AWS, review the architecture they have been using, and discuss optimizations they conducted to run their workload efficiently on AWS Batch.

Figure 2: Identification of redun jobs and grouping them into Array Jobs to run on AWS Batch. (Top) redun represents the workflow as an Expression Graph (top-left), and identifies reductions (red boxes) that are ready to be executed. The redun Scheduler creates a redun Job (J1, J2, J3) for each reduction and dispatches those jobs to Executors based on the task-specific configuration. The Batch Executor allows jobs to accumulate for up to three seconds (default) in order to identify compatible jobs for grouping into an Array Job, which are then submitted to AWS Batch (top-right). (Bottom) As jobs complete in AWS Batch, the success (green) and failure (red) is propagated back to Executors, the Scheduler, and eventually substituted back into the Expression Graph (bottom-left).

Data Science workflows at insitro: how redun uses the advanced service features from AWS Batch and AWS Glue

Matt Rasmussen, VP of Software Engineering at insitro, expands on his first post on redun, insitro’s data science tool for bioinformatics, to describe how redun makes use of advanced AWS features. Specifically, Matt describes how AWS Batch’s Array Jobs is used to support workflows with large fan-out, and how AWS Glue’s DynamicFrame is used to run computationally heterogenous workflows with different back-end needs such as Spark, all in the same workflow definition.

Figure 1: Evaluating a sequence alignment workflow using graph reduction.** In redun, workflows are represented as an Expression Graph (left) which contain concrete value nodes (grey) and Expression nodes (blue). The redun scheduler identifies tasks that are ready to execute by finding subtrees that can be reduced (red boxes), substituting task results back into the Expression Graph (red arrows). The scheduler continues to find reductions until the Expression Graph reduces to a single concrete value (grey box, far right). If any reduction has been done before (determine by comparing an Expression's hash), the redun scheduler can replay the reduction from a central cache and skip task re-execution.

Data Science workflows at insitro: using redun on AWS Batch

Matt Rasmussen, VP of Software Engineering at insitro describes their recently released, open-source data science framework, redun, which allows data scientists to define complex scientific workflows that scale from their laptop to large-scale distributed runs on serverless platforms like AWS Batch and AWS Glue. I this post, Matt shows how redun lends itself to Bioinformatics workflows which typically involve wrapping Unix-based programs that require file staging to and from object storage. In the next blog post, Matt describes how redun scales to large and heterogenous workflows by leveraging AWS Batch features such as Array Jobs and AWS Glue features such as Glue DynamicFrame.