AWS Public Sector Blog
Benchmarking PacBio whole genome sequencing variant pipeline analysis with AWS HealthOmics workflows
As genomics research continues to push the frontiers of medicine and population health, long-read sequencing is increasingly essential for resolving complex genomic regions, identifying structural variants, and understanding genetic diversity at scale. PacBio HiFi sequencing produces reads that are both highly accurate and long, making them suitable for comprehensive whole genome sequencing (WGS).
Running large-scale long-read WGS analysis requires scalable, secure, and production-grade compute environments. AWS HealthOmics is purpose-built for this, and bioinformaticians use it to run containerized workflows and process large genomic data volumes with high reliability and flexibility. PacBio’s integration with the AWS HealthOmics workflow architecture has been pivotal for large-scale genomics programs, offering secure and efficient data processing while streamlining delivery of results at scale. By using the robust security and scalable infrastructure of HealthOmics, the PacBio workflow can process large-scale long-read sequencing data while maintaining the data governance standards required for national healthcare initiatives. This real-world deployment demonstrates the ability of HealthOmics to support PacBio’s nationwide precision medicine programs with enterprise-grade reliability and performance.
This guide demonstrates PacBio’s WGS variant pipeline implementation on AWS HealthOmics, offering performance optimization insights and evidence-based recommendations for cost-effective deployment at scale based on extensive benchmarking.
PacBio HiFi sequencing produces long reads—typically 15–25 kilobases in length—with high base accuracy (90 percent of bases score over Q30). This combination means that researchers can cover repetitive and Guanine-Cytosine (GC)-rich regions, phase haplotypes accurately, and detect structural variants that are often missed by short-read technologies. Key applications of long-read sequencing analysis include:
- Single nucleotide variants (SNVs) and small insertions and deletions – Nucleotide base level changes and small multi-base alterations
- Structural variant discovery – Resolving large insertions, deletions, and rearrangements with base-level precision
- Haplotype phasing – Assigning variants to maternal or paternal alleles across long haplotype blocks
- Complex locus assembly – Disentangling repetitive elements and gene duplications in disease-associated loci
- De novo assembly and pan-genome construction – Generating high-quality, contiguous genome assemblies that reflect population-specific diversity
- DNA methylation – Identifying active or inactive regions from 5-methylcytosine marks at CpG sites (5mCpG) across the genome
Technical overview of the PacBio HiFi WGS variant pipeline
PacBio’s WGS variant pipeline, defined in Workflow Description Language (WDL), offers a modular, containerized solution for secondary and tertiary analysis. The pipeline incorporates HiFi alignment tools alongside variant callers designed for SNVs, small insertions and deletions, copy number variants (CNVs), and structural variants (SVs). Additionally, it includes capabilities for tandem repeat (TR) genotyping, gene typing within segmental duplications, phasing variants into haplotypes, and consensus 5mCpG probability. For multi-sample cohorts, the pipeline provides joint-calling capabilities for both small and structural variants. Comprehensive annotation tools support both small and structural variants. The following are the stages for analyzing human HiFi data:
- Read alignment – Align long reads to a reference genome using pbmm2, a mapper and aligner optimized for HiFi reads. The aligner handles split mappings and large gaps efficiently, which is critical for accurately mapping structural variants and segmental duplications. The output is a sorted and indexed Binary Alignment Map (BAM) file with rich metadata for downstream processing.
- SNVs and small variants calling – For detecting SNVs and small insertions or deletions (indels), the workflow uses a variant caller (DeepVariant) trained specifically on HiFi reads. It applies machine learning (ML) models tailored to the error profiles and read characteristics of long-read data, producing high-confidence variant calls with low false-positive rates, even in difficult-to-map regions.
- Structural variant detection – This step identifies larger genomic alterations—such as deletions, insertions, inversions, and translocations—by comparing read signatures and alignment patterns. The PacBio structural variant caller (pbsv) is specifically designed for long-read data and can detect complex breakpoints, tandem repeats, and multi-allelic events that are often missed by short-read approaches.
- Variants in duplicated regions – Paraphase calls small variants in segmental duplications by realigning reads from gene families to a single reference copy and identifying haplotypes.
- Tandem repeat variant caller – Tandem repeat genotyping tool (TRGT) analyzes and genotypes tandem repeat variation in HiFi reads. Beyond standard-size genotyping, this tool provides comprehensive analysis of sequence composition, repeat mosaicism, CpG methylation patterns, and visual representation of repeat-spanning reads.
- Phasing and haplotype resolution – To support downstream interpretation, the workflow includes a phasing stage that assigns variants to haplotypes. HiPhase uses the long-range information in HiFi reads to resolve haplotype blocks spanning tens of kilobases, improving interpretability in clinical and population genomics contexts.
- 5mCpG detection – pb-CpG-tools generates haplotype-specific site methylation probabilities for CpGs from aligned HiFi reads.
- Cohort-level genotyping (optional) – For cohorts with multiple samples, the workflow can be extended with a joint-genotyping step to harmonize variant calls across samples using GLnexus and pbsv.
- Annotation – The pipeline includes two annotation tools, slivar and svpack, to provide comprehensive annotation for small variants and structural variants.
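The alignment and structural variant stages above can be sketched as command lines. The following is a minimal illustration only—file names are hypothetical placeholders, and flag spellings should be verified against your installed pbmm2 and pbsv versions—since in the pipeline itself these tasks run as containerized WDL steps:

```python
# Sketch only: assembles the command lines for the alignment (pbmm2) and
# SV-calling (pbsv) stages described above. File names are placeholders.

def pbmm2_align_cmd(reference, hifi_bam, out_bam, sample):
    # pbmm2 aligns HiFi reads and emits a sorted, indexed BAM
    return ["pbmm2", "align", "--preset", "HIFI", "--sort",
            "--sample", sample, reference, hifi_bam, out_bam]

def pbsv_cmds(reference, aligned_bam, sample):
    # pbsv runs in two passes: discover SV signatures, then call variants
    svsig = f"{sample}.svsig.gz"
    return [
        ["pbsv", "discover", aligned_bam, svsig],
        ["pbsv", "call", reference, svsig, f"{sample}.vcf"],
    ]

align = pbmm2_align_cmd("GRCh38.fa", "HG002.hifi_reads.bam",
                        "HG002.aligned.bam", "HG002")
sv_steps = pbsv_cmds("GRCh38.fa", "HG002.aligned.bam", "HG002")
```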
The workflow consists of seven key steps (numbered 1–7 in the following architecture diagram):
- Deploy the HiFi-WGS pipeline with AWS CloudFormation, triggering container migration
- AWS Lambda functions orchestrate AWS CodeBuild for container processing
- Input data flows as unaligned BAM files into Amazon Simple Storage Service (Amazon S3)
- Pipeline execution monitoring through Amazon CloudWatch
- Event-driven workflow management using Amazon EventBridge
- Result storage in output S3 bucket
- Performance analysis using HealthOmics tools for cost and utility tracking
This serverless, managed architecture offers scalable, cost-effective genomic data processing with operational visibility and optimization capabilities.

Figure 1: Architecture diagram illustrating the AWS HealthOmics WGS pipeline workflow and integration points
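For the event-driven step (5) above, an EventBridge rule needs an event pattern that matches HealthOmics run state transitions. The following is an illustrative sketch; the `source` and `detail-type` values follow the HealthOmics EventBridge integration, but you should verify the exact field names against the service documentation for your Region:

```python
import json

# Illustrative EventBridge event pattern for HealthOmics run status
# changes. Field names and values should be checked against the
# HealthOmics EventBridge documentation before use.
event_pattern = {
    "source": ["aws.omics"],
    "detail-type": ["Run Status Change"],
    "detail": {
        # Trigger downstream processing only on terminal states
        "status": ["COMPLETED", "FAILED"]
    },
}

print(json.dumps(event_pattern, indent=2))
```

Attached to a rule, this pattern lets a Lambda target copy results or notify operators as soon as a run finishes.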
PacBio WGS variant pipeline on AWS HealthOmics at scale
AWS HealthOmics manages the pipeline’s deployment and execution, handling infrastructure, container orchestration, and WDL support. It runs each task as a containerized job with customizable resources, manages dependencies, and handles data storage in Amazon S3 with AWS Identity and Access Management (IAM) governed permissions. Workflow inputs, outputs, and logs are traceable through the HealthOmics console and Amazon CloudWatch.
The WGS pipeline on AWS HealthOmics solution deployment is based on HiFi-human-WGS-WDL v2.1.2, offering a streamlined approach to genomic analysis. The CloudFormation stack deployment automates Docker image migration and configures Amazon Elastic Container Registry (Amazon ECR) policies and HealthOmics workflow IAM roles. The pipeline follows these steps:
- The CloudFormation template in the aws-samples repository builds an AWS stack that migrates required Docker images to ECR private repositories with appropriate permissions. This customizable stack can be adjusted to follow the principle of least privilege access for the S3 buckets and ECR images needed by the HealthOmics service.
- The stack uses CodeBuild for Docker image migration. Confirm that the CodeBuild project reaches a SUCCEEDED status before proceeding with your pipeline operations.
- Store unaligned HiFi BAM files in either Amazon S3 or a HealthOmics sequence store. Although both are compatible, the HealthOmics sequence store provides additional genomics-specific features and better metadata management. Our implementation uses PacBio’s reference data resources and public HiFi datasets on HG002 for validation.
- Clone the HiFi-human-WGS-WDL repository and confirm Docker images match your chosen pipeline version (the template is configured for v2.1.2). For different versions, adjust image hash values in the CloudFormation template. Use the provided sample template in the aws-samples repository to create workflow parameters for the variant analysis pipeline.
- HealthOmics enables parallel processing of multiple samples. As a managed service, it handles run submissions and integrates with Amazon EventBridge for notifications and downstream pipeline triggers. Refer to the AWS blog for Amazon EventBridge rules configuration examples.
- HealthOmics workflows integrate with CloudWatch for monitoring pipeline progress. Workflow logs are available using CloudWatch streams and copied to output S3 buckets, where results are organized by run ID for sample tracking.
- The HealthOmics run_analyzer tool provides detailed cost and resource usage insights per sample, offering recommendations for optimal instance types. Our benchmarking results are based on these analytics. Install aws-healthomics-tools from PyPI or follow the instructions later in this post.
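Submitting one run per sample (step 5 above) can be sketched with boto3. The request fields below follow the HealthOmics StartRun API, but the workflow parameter keys (`sample_id`, `hifi_reads`) are hypothetical placeholders—use the parameter template from the aws-samples repository for the real names:

```python
# Sketch of submitting one sample as a HealthOmics run. The boto3 call
# is left in comments so the sketch stays self-contained; parameter
# keys inside "parameters" are hypothetical placeholders.

def build_run_request(workflow_id, role_arn, sample_id, hifi_bam_uri, output_uri):
    return {
        "workflowId": workflow_id,
        "workflowType": "PRIVATE",
        "roleArn": role_arn,
        "name": f"hifi-wgs-{sample_id}",
        "storageType": "DYNAMIC",  # dynamic storage performed well in our tests
        "parameters": {
            # Hypothetical keys; see the sample parameter template
            "sample_id": sample_id,
            "hifi_reads": [hifi_bam_uri],
        },
        "outputUri": output_uri,
    }

request = build_run_request(
    "1234567",
    "arn:aws:iam::111122223333:role/HealthOmicsRunRole",
    "HG002",
    "s3://my-input-bucket/HG002.hifi_reads.bam",
    "s3://my-output-bucket/runs/",
)
# import boto3
# omics = boto3.client("omics")
# run = omics.start_run(**request)  # one call per sample; runs execute in parallel
```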
How to create and run PacBio WGS variant pipeline in HealthOmics private workflows
The CloudFormation stack automates the creation of the Docker images necessary for PacBio WGS variant pipeline analysis. Follow the steps outlined in PacBio WGS analysis with HealthOmics workflows. Key steps to create the PacBio WGS variant pipeline private workflow:
- Download the HiFi-human-WGS-WDL v2.1.2 repository from PacBio, or your preferred version from the PacBio repository; if you choose a different version, you might need to modify the Docker image paths in the CloudFormation template
- As prerequisites for the CloudFormation deployment, create an AWS Key Management Service (AWS KMS) key and make two public subnets available in an Amazon Virtual Private Cloud (Amazon VPC) so the Lambda function can perform its operations. Create self-referencing security groups that allow inbound HTTPS traffic on port 443.
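The self-referencing security group rule above can be sketched as the ingress payload accepted by the EC2 API (via `authorize_security_group_ingress` in boto3). The group ID below is a hypothetical placeholder:

```python
# Sketch of the self-referencing security group rule described above:
# inbound HTTPS (TCP 443) allowed only from members of the same group.
# The group ID is a hypothetical placeholder.

def self_referencing_https_rule(group_id):
    return {
        "GroupId": group_id,
        "IpPermissions": [{
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            # The source is the security group itself (self-referencing)
            "UserIdGroupPairs": [{"GroupId": group_id}],
        }],
    }

rule = self_referencing_https_rule("sg-0123456789abcdef0")
# With boto3: ec2.authorize_security_group_ingress(**rule)
```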
Pipeline evaluation
After run completion, use the run_analyzer tool to evaluate compute utilization and costs:
pip install aws-healthomics-tools
aws-healthomics-tools run_analyzer <RUN_ID> -o Pacbio-WGS-run_analyser_outputs.csv
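The generated CSV can be post-processed to rank tasks by cost. The following toy example uses an inline synthetic CSV; the column names (`task`, `estimatedUSD`) and cost values are hypothetical, so check the header of your own run_analyzer output and adjust accordingly:

```python
import csv
import io

# Toy example: sum estimated cost per task and flag the most expensive
# one. Column names ("task", "estimatedUSD") and values are hypothetical;
# inspect the header of your generated CSV and adapt the field names.
sample_csv = io.StringIO(
    "task,estimatedUSD\n"
    "pbmm2_align,4.10\n"
    "deepvariant_call_variants,9.80\n"
    "pbsv_call,1.25\n"
)

rows = list(csv.DictReader(sample_csv))
total = sum(float(r["estimatedUSD"]) for r in rows)
top = max(rows, key=lambda r: float(r["estimatedUSD"]))
print(f"total ${total:.2f}; most expensive task: {top['task']}")
# → total $15.15; most expensive task: deepvariant_call_variants
```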
Benchmarking results for end-to-end pipeline HiFi analyses
We use the PacBio public HG002 HiFi datasets to explore various optimization strategies on HealthOmics in the US East (N. Virginia) us-east-1 AWS Region. The WGS variant analysis pipeline includes tasks with varying resource demands, from a single CPU to 64 CPUs and up to 256 GB of RAM. HealthOmics provisions compute resources dynamically. The most intensive task, DeepVariant calling, requires 64 CPUs and 239 GB of RAM for HiFi reads, with GPU acceleration available. Our benchmarking focused on cost-performance optimization through task accelerators and the impact of different storage types on pipeline performance, aiming to establish the most efficient production configuration.
The most demanding tasks were pbmm2 reads alignment and DeepVariant variant calling. We evaluated DeepVariant processing using multiple GPU accelerators: NVIDIA Tesla T4 (omics.g4), NVIDIA Tesla A10G (omics.g5), and NVIDIA L4 (omics.g6). These were benchmarked against CPU-based acceleration (omics.m) using both static and dynamic HealthOmics storage options.
The optimal configuration used omics.g5.2xlarge with NVIDIA Tesla A10G, completing the pipeline in 8.67 hours at $21.26 compute cost using dynamic file storage. GPU containers for DeepVariant delivered a 21.3 percent cost reduction and 8.5 percent faster completion compared to CPU-based deployment, demonstrating clear benefits in both performance and cost-efficiency.
Storage configuration analysis showed that NVIDIA Tesla T4 (omics.g4) and L4 (omics.g6) instances benefited significantly from dynamic storage, while standard omics and Tesla A10G (omics.g5) instances showed minimal variation across storage types. This suggests storage optimization strategies should be instance-specific.
The HealthOmics run_analyzer tool identified a potential cost optimization to $19.15 using the Tesla A10G with static storage. Analysis revealed opportunities to optimize memory allocation across several tasks: the pbmm2 aligner could operate with 64 GB, DeepVariant make_examples with 16 GB, pbsv_call with 16 GB, DeepVariant postprocess_variants with 8 vCPUs and 64 GB, and hiphase with 32 GB. Note that these requirements may vary with sequencing depth and genetic diversity, because populations with higher genetic diversity typically present more variants relative to the reference. In certain scenarios, particularly when scaling with the number of variants, PacBio’s default compute requirements remain essential.
The following bar graphs show analysis presented in three panels: run time in hours, actual cost in dollars, and optimal cost in dollars after implementing run-analyzer recommendations. The NVIDIA Tesla A10G with static storage demonstrates the best optimal cost at $19.15 while also maintaining competitive runtime performance at 8.67 hours. The optimal cost is achievable with HealthOmics tool run_analyzer recommended compute configurations.

Figure 2: Price-performance comparison across different accelerator types and storage configurations for WGS variant analysis pipeline on AWS HealthOmics
AWS HealthOmics provides a robust security framework essential for genomic data, implementing comprehensive measures aligned with regulations such as HIPAA, GDPR, and ISO 27001. This includes end-to-end encryption, KMS keys, role-based access controls, and secure logging and auditing. This security architecture enables organizations to scale genomic analysis workflows while adhering to strict data governance, privacy, and regulatory requirements, making HealthOmics well-suited for handling sensitive data at scale.
Conclusion
This analysis demonstrates the successful implementation of PacBio’s HiFi WGS variant analysis pipeline on AWS HealthOmics, achieving significant performance and cost optimizations. GPU acceleration, particularly with NVIDIA Tesla A10G (omics.g5.2xlarge), delivered a 21.3 percent cost reduction and 8.5 percent faster pipeline completion compared to CPU-based processing. Our evaluation revealed that NVIDIA Tesla T4 and L4 instances benefit from dynamic storage, while Tesla A10G maintained consistent performance across storage types. The HealthOmics run_analyzer tool identified an optimal cost configuration of $19.15 using Tesla A10G with static storage, achieved through appropriate resource allocation. The ability of HealthOmics to dynamically provision resources, coupled with its robust security framework compliant with HIPAA, GDPR, and ISO 27001, makes it suitable for processing sensitive genomic data at scale. These findings provide valuable guidance for implementing efficient, secure, and scalable genomic analysis workflows while maintaining strict data governance and privacy standards.
Get started
To begin analyzing PacBio HiFi data at scale:
- Explore the PacBio open source HiFi-human-WGS-WDL pipeline
- Learn how AWS HealthOmics enables secure and scalable genomics analysis
- Contact the AWS Genomics team to request support or a tailored workshop