Democratizing and Accelerating Genome Sequencing Analysis with NVIDIA Clara Parabricks v4.0

The field of computational biology relies on bioinformatics tools that are fast, accurate, and easy to use. As next-generation sequencing (NGS) is becoming faster and less costly, a data deluge is emerging, and there is an ever-growing need for accessible, high-throughput, industry-standard analysis.

At GTC 2022, we announced the release of NVIDIA Clara Parabricks v4.0, which brings significant improvements to how genomic researchers and bioinformaticians deploy and scale genome sequencing analysis pipelines.

  • Clara Parabricks software is now free to researchers on NGC as individual tools or as a unified container. A licensed version is available through NVIDIA AI Enterprise for customers requiring enterprise-grade support.
  • Clara Parabricks now integrates easily with common workflow languages such as Workflow Description Language (WDL) and NextFlow, so GPU-accelerated and third-party tools can be interwoven and deployed at scale on-premises and in the cloud. The Cromwell workflow management system from the Broad Institute is also supported.
  • Clara Parabricks can now be deployed on the Broad Institute’s Terra SaaS platform, making it available to the 25,000+ Terra scientists. Whole genome sequencing analysis takes just over one hour with Clara Parabricks, compared to 24 hours in a CPU environment, and costs fall by more than 50%.
  • Clara Parabricks continues to focus on GPU-accelerated, industry-standard, and deep-learning-based tools and now includes the latest DeepVariant v1.4 germline caller. Sequencer-agnostic tooling and deep learning approaches remain a focus of development.
  • Clara Parabricks is now available through more cloud providers and partners, including Amazon Web Services, Google Cloud Platform, Terra, DNAnexus, Lifebit, Agilent Technologies, UK Biobank Research Analysis Platform (RAP), Oracle Cloud Infrastructure, Naver Cloud, Alibaba Cloud, and Baidu AI Cloud.

License-free use for research and development

Clara Parabricks v4.0 is now available entirely free of charge for research and development. This means fewer technical barriers than ever before, including the removal of the install scripts and the enterprise license server present in previous versions of the genomic analysis software. 

This also simplifies deployment significantly: you can pull and run Clara Parabricks Docker containers quickly on any NVIDIA-certified system, on-premises or in the cloud.
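As an illustration of the pull-and-run model described above, here is a minimal Python sketch that assembles a `docker run` command for the FQ2BAM tool. The image tag, the host directory, and the file names are assumptions made for the example; check the Clara Parabricks collection on NGC and the Clara Parabricks documentation for the current container tag and the full set of tool options.

```python
import shlex

# Assumed image tag; confirm the current one in the Clara Parabricks collection on NGC.
PARABRICKS_IMAGE = "nvcr.io/nvidia/clara/clara-parabricks:4.0.0-1"


def fq2bam_docker_command(workdir, ref, fastq_r1, fastq_r2, out_bam,
                          image=PARABRICKS_IMAGE):
    """Build a `docker run` argument list that runs `pbrun fq2bam` in a container.

    File names are relative to `workdir`, which is mounted at /workdir
    inside the container.
    """
    return [
        "docker", "run", "--rm",
        "--gpus", "all",                      # expose all host GPUs to the container
        "--volume", f"{workdir}:/workdir",    # mount the host data directory
        "--workdir", "/workdir",
        image,
        "pbrun", "fq2bam",
        "--ref", f"/workdir/{ref}",
        "--in-fq", f"/workdir/{fastq_r1}", f"/workdir/{fastq_r2}",
        "--out-bam", f"/workdir/{out_bam}",
    ]


if __name__ == "__main__":
    # Hypothetical paths and file names, for illustration only.
    cmd = fq2bam_docker_command(
        workdir="/data/sample1",
        ref="reference.fasta",
        fastq_r1="sample_1.fq.gz",
        fastq_r2="sample_2.fq.gz",
        out_bam="sample.bam",
    )
    print(shlex.join(cmd))
```

The function only builds and prints the command; passing the list to `subprocess.run` on a machine with Docker and the NVIDIA container runtime installed would launch the tool.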

Commercial users that require enterprise-level technical and engineering support for their production workflows, or to work with NVIDIA experts on new features, applications, and performance optimizations, can now subscribe to NVIDIA AI Enterprise Support. This support will be available for Parabricks v4.0 with the upcoming release of NVIDIA AI Enterprise v3.0.

An NVIDIA AI Enterprise Support subscription comes with full-stack support (from container-level, through to full on-premises and cloud deployment), access to NVIDIA Parabricks experts, security notifications, enterprise training in areas such as IT or data science, and deep learning support for TensorFlow, PyTorch, NVIDIA TensorRT, and NVIDIA RAPIDS. Learn more about NVIDIA AI Enterprise Support Services and Training.

A table showing Clara Parabricks license options.
Figure 1. Access all the tools within Clara Parabricks at no cost, including the pipelines and workflows

Deploying in WDL and NextFlow workflows

You can now pull Clara Parabricks directly from NGC collection containers with no licensing server, meaning that it can easily be run as part of scalable and flexible bioinformatics workflows on a variety of systems and platforms.

This includes workflows for the popular bioinformatics workflow managers WDL and NextFlow, which are available in the new Clara-Parabricks-Workflows GitHub repo for general use by the bioinformatics community. You can find WDL and NextFlow workflows or modules for the following:

  • BWA-MEM alignment and processing with Clara Parabricks FQ2BAM
  • A germline calling workflow running accelerated HaplotypeCaller and DeepVariant, with the option to apply the GATK best practices
  • A BAM2FQ2BAM workflow to extract reads and realign to new reference genomes (such as the T2T completed human genome)
  • A somatic workflow using accelerated Mutect2, with an optional panel of normals
  • A workflow to generate a new panel of normals for somatic variant calling from VCFs
  • A workflow to build reference indexes (required for several of the workflows and tasks listed earlier)

In addition, a workflow for calling de novo mutations in trio data developed in collaboration with researchers at the National Cancer Institute will be available later this year.

These workflows bring impressive flexibility, enabling users to interweave the GPU-accelerated tools of Clara Parabricks with third-party tooling. They can specify individual compute resources for each task, before deploying at a massive scale on local clusters (on SLURM, for example) or on cloud platforms. See the Clara-Parabricks-Workflows GitHub repo for example configurations and recommended GPU instances.

A diagram showing how to pull directly from the Clara Parabricks Docker and specify gpuType and gpuCount compute requirements.
Figure 2. Pull directly from the Clara Parabricks Docker container and specify gpuType and gpuCount compute requirements
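To make the per-task resource specification concrete, the following Python sketch renders a WDL `runtime` section of the kind Figure 2 describes, with a container image plus `gpuType` and `gpuCount`. The helper name `wdl_runtime_block` and all values passed to it are hypothetical; the Clara-Parabricks-Workflows repo remains the reference for real configurations and recommended GPU instances.

```python
def wdl_runtime_block(docker_image, gpu_type, gpu_count, cpu, memory_gb):
    """Render a WDL `runtime` section as text for a single GPU task.

    This is a hypothetical helper for illustration; it only formats
    strings and does not validate values against any backend.
    """
    lines = [
        "runtime {",
        f'    docker: "{docker_image}"',
        f'    gpuType: "{gpu_type}"',
        f"    gpuCount: {gpu_count}",
        f"    cpu: {cpu}",
        f'    memory: "{memory_gb} GB"',
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Example values are assumptions, not recommendations.
    print(
        wdl_runtime_block(
            docker_image="nvcr.io/nvidia/clara/clara-parabricks:4.0.0-1",
            gpu_type="nvidia-tesla-v100",
            gpu_count=2,
            cpu=24,
            memory_gb=120,
        )
    )
```

Because each task carries its own runtime section, a GPU-accelerated alignment step and a CPU-only third-party step in the same workflow can request different resources.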

Run on-premises or in the cloud

Clara Parabricks is well-suited to cloud deployment. It is available to run on several cloud platforms, including Amazon Web Services, Google Cloud Platform, DNAnexus, Lifebit, Baidu AI Cloud, Naver Cloud, Oracle Cloud Infrastructure, Alibaba Cloud, Terra, and more.

Clara Parabricks v4.0 WDL workflows are now integrated into the Broad Institute’s Terra platform for its 25,000+ scientists to run accelerated genomic analyses. Terra’s scalable platform runs on top of Google Cloud, which hosts a fleet of NVIDIA GPUs. A FASTQ to VCF analysis on a 30x whole genome takes 24 hours in a CPU environment compared to just over one hour with Clara Parabricks in Terra. In addition, costs are reduced by over 50%, from $5 to $2 (Figure 3).
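The arithmetic behind these figures can be checked with a short Python sketch. It uses the numbers quoted above and treats "just over one hour" as exactly one hour, which is an approximation, so the real speedup is slightly lower than the value computed here.

```python
def percent_reduction(before, after):
    """Return the reduction from `before` to `after` as a percentage of `before`."""
    return 100.0 * (before - after) / before


def speedup(before, after):
    """Return how many times faster `after` is than `before`."""
    return before / after


# Figures quoted in the text for a 30x whole genome, FASTQ to VCF, in Terra.
cpu_hours, gpu_hours = 24.0, 1.0   # GPU runtime approximated as one hour
cpu_cost, gpu_cost = 5.0, 2.0      # US dollars

print(f"Cost reduction: {percent_reduction(cpu_cost, gpu_cost):.0f}%")   # 60%
print(f"Approximate speedup: {speedup(cpu_hours, gpu_hours):.0f}x")      # 24x
```

A drop from $5 to $2 is a 60% reduction, which is consistent with the "over 50%" stated above.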

In the Terra platform, researchers can gain access to a wealth of data much more easily than in an on-premises environment. They can access the Clara Parabricks workspace at the push of a button, rather than manually managing and configuring the hardware. Get started at the Clara Parabricks page on the Terra Community Workbench.

Graph showing time and cost comparison between CPU and GPU for 30x whole genome sequencing in Terra.
Figure 3. FASTQ to VCF runs in Terra

Runtimes and compute cost (preemptible pricing) for germline analysis of a 30x whole genome (including BWA-MEM, MarkDuplicates, BQSR, and HaplotypeCaller) are greatly reduced when using Clara Parabricks and NVIDIA GPUs.

Clara Parabricks v4.0 tools and features

Clara Parabricks v4.0 is a more focused genomic analysis toolset than previous versions, with rapid alignment, gold standard processing, and high accuracy variant calling. It offers the flexibility to freely and seamlessly intertwine GPU and CPU tasks, and it prioritizes GPU acceleration of the most popular and most bottlenecked tools in the genomics workflow. Clara Parabricks can also integrate cutting-edge deep learning approaches in genomics.

Diagram showing the NVIDIA Clara Parabricks v4.0 toolset.
Figure 4. The NVIDIA Clara Parabricks v4.0 toolset

The Clara Parabricks tools are now offered both as individual containers in the Clara Parabricks collection on NGC and as a unified container that includes every tool. The individual containers are lean, and they let the Clara Parabricks team push more frequent per-tool releases so that users get the latest versions sooner.

The first of these releases is for DeepVariant v1.4. This latest version of DeepVariant increases accuracy across multiple genomics sequencers. There is an additional read insert size feature for Illumina whole genome and whole exome models, which reduces errors by 4-10%, and direct phasing for more accurate variant calling in PacBio sequencing runs. This means that you can now perform the high-accuracy process of phased variant calling for PacBio data directly in DeepVariant, with pipelines such as DeepVariant-WhatsHap-DeepVariant or PEPPER-Margin-DeepVariant.

DeepVariant v1.4 is also compatible with multiple custom DeepVariant models for emerging genomics sequencing instruments. The models have been GPU-accelerated in collaboration with the NVIDIA Clara Parabricks team to provide rapid and high-accuracy variant calls across sequencing instruments. DeepVariant v1.4 is now available in the Clara Parabricks collection on NGC.

Deep learning approaches to genomics and precision medicine are a big focus for Clara Parabricks and are highlighted in the GTC 2022 NVIDIA and Broad Institute announcement on further developments to the Genome Analysis Toolkit (GATK) and large language models for DNA and RNA.

Get started with Clara Parabricks v4.0 

To start using Clara Parabricks for free, visit the Clara Parabricks collection on NGC. You can also request a free Clara Parabricks NVIDIA LaunchPad lab to get hands-on experience running accelerated industry-standard tools for germline and somatic analysis for an exome and whole genome dataset.

For more information about Clara Parabricks, including technical details on the tools available, see the Clara Parabricks documentation.
