Expanding Accelerated Genomic Analysis to RNA, Gene Panels, and Annotation

Clara Parabricks v3.7 adds support for gene panels, RNA-Seq, and short tandem repeats, along with updates to GATK (4.2) and DeepVariant (1.1) and PON support for Mutect2.

The release of NVIDIA Clara Parabricks v3.6 last summer added multiple accelerated somatic variant callers and novel tools for annotation and quality control of VCF files to the comprehensive toolkit for whole genome and whole exome sequencing analysis.

In the January 2022 release of Clara Parabricks v3.7, NVIDIA expanded the scope of the toolkit to new data types while continuing to improve upon existing tools:

  • Added support for RNASeq analysis
  • Added support for UMI-based gene panel analysis with an accelerated implementation of Fulcrum Genomics’ fgbio pipeline
  • Added support for Mutect2 Panel of Normals (PON) filtering to bring the accelerated mutectcaller tool in line with the GATK best practices for calling tumor-normal samples
  • Incorporated a bam2fq method that enables accelerated realignment of reads to new references
  • Added support for short tandem repeat assays with ExpansionHunter
  • Accelerated post-calling VCF analysis steps by up to 15X
  • Updated HaplotypeCaller to match GATK v4.1 and updated DeepVariant to v1.1

Clara Parabricks v3.7 significantly broadens the scope of the toolkit while continuing to invest in its field-leading whole genome and whole exome pipelines.

Enabling reference genome realignment with bam2fq and fq2bam

To address recent updates to the human reference genome and make realigning reads tractable for large studies, NVIDIA developed a new bam2fq tool. Parabricks bam2fq can extract reads in FASTQ format from BAM files, providing an accelerated replacement for tools like GATK SamToFastq or bazam.

By combining bam2fq with Parabricks fq2bam, you can fully realign a 30X BAM file from one reference (for example, hg19) to an updated one (hg38 or CHM13) in 90 minutes using eight NVIDIA V100 GPUs. Internal benchmarks have shown that realigning to hg38 and rerunning variant calling captures several thousand more true-positive variants in the Genome in a Bottle HG002 truth set compared to relying solely on hg19.

The improvements in variant calls from realignment are nearly the same as those from aligning to hg38 in the first place. While this workflow was possible before, it was prohibitively slow. With Clara Parabricks, NVIDIA has made reference genome updates practical for even the largest WGS studies.
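
The two-step realignment described above can be sketched as a pair of command lines. The Python below only builds the argument lists and does not run anything. The file names are hypothetical, and the flag names (`--in-bam`, `--out-prefix`, `--ref`, `--in-fq`, `--out-bam`) as well as the naming of the extracted FASTQ files are assumptions that should be checked against the help output of `pbrun bam2fq` and `pbrun fq2bam` for your installed version.

```python
import shlex


def realignment_commands(in_bam, new_ref, out_prefix, out_bam):
    """Build (but do not run) the bam2fq and fq2bam steps of a realignment."""
    # Assumed naming of the paired FASTQ files written by bam2fq.
    fq1 = f"{out_prefix}_1.fastq.gz"
    fq2 = f"{out_prefix}_2.fastq.gz"
    # Flag names below are assumptions; verify them against the tool's help.
    extract = ["pbrun", "bam2fq", "--in-bam", in_bam, "--out-prefix", out_prefix]
    align = ["pbrun", "fq2bam", "--ref", new_ref,
             "--in-fq", fq1, fq2, "--out-bam", out_bam]
    return [extract, align]


# Hypothetical file names, for illustration only.
for cmd in realignment_commands("sample.hg19.bam", "hg38.fa",
                                "sample", "sample.hg38.bam"):
    print(shlex.join(cmd))
```

Keeping the commands as argument lists makes it straightforward to hand them to `subprocess.run` later, once the flags have been confirmed.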

More options for RNASeq transcript quantification and fusion calling in Clara Parabricks

With version 3.7, Clara Parabricks adds two new tools for RNASeq analysis as well.

Transcript quantification is one of the most commonly performed analyses for RNASeq data. Kallisto is a rapid method for expression quantification that relies on pseudoalignment. While Clara Parabricks already included STAR for RNASeq alignment, Kallisto adds a complementary method that can run even faster.
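
To give a feel for why pseudoalignment is fast, here is a toy sketch of the idea: instead of computing a base-by-base alignment, each read is assigned to the set of transcripts that contain all of its k-mers. This is a simplified illustration only, not Kallisto's implementation (which uses a transcriptome de Bruijn graph and tolerates sequencing errors); the transcript names and sequences are made up.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts compatible with every k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            break  # no transcript contains all k-mers seen so far
    return compatible or set()


# Hypothetical transcripts that share a prefix and then diverge.
transcripts = {
    "tx1": "ACGTACGTTAGC",
    "tx2": "ACGTACGGGGCA",
}
index = build_kmer_index(transcripts, k=5)
print(sorted(pseudoalign("ACGTACG", index, k=5)))  # read from the shared prefix
print(sorted(pseudoalign("CGTTAGC", index, k=5)))  # read from the tx1-only region
```

Reads from the shared region stay ambiguous between both transcripts, which is the information a quantification step then resolves statistically.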

Fusion calling is another common RNASeq analysis. In Clara Parabricks 3.7, Arriba provides a second method for calling gene fusions based on the output of the STAR aligner. Arriba can call significantly more types of events than STAR-Fusion, including the following:

  • Viral integration sites
  • Internal tandem duplications
  • Whole exon duplications
  • Circular RNAs
  • Enhancer-hijacking events involving immunoglobulin and T-cell receptor loci
  • Breakpoints in introns and intergenic regions

Together, the addition of Kallisto and Arriba makes Clara Parabricks a comprehensive toolkit for many transcriptome analyses.

Simplifying and accelerating gene panel and UMI analysis

While whole genome and whole exome sequencing are increasingly common in both research and clinical practice, gene panels dominate the clinical space.

Gene panel workflows commonly use unique molecular identifiers (UMIs) attached to reads to improve the limits of detection for low-frequency mutations. NVIDIA accelerated the Fulcrum Genomics fgbio UMI pipeline and consolidated the eight-step pipeline into a single command in v3.7, with support for multiple UMI formats.

Workflow diagram shows the support for multiple UMI formats, with the single command pbrun umi on Clara Parabricks.
Figure 1. The Fulcrum Genomics fgbio UMI pipeline accelerated with a single command on Clara Parabricks
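
The reason UMIs improve detection limits can be shown with a toy example: reads that share a mapping position and a UMI are assumed to come from the same source molecule, so disagreements among them are likely sequencing errors and can be voted away. The sketch below uses a plain per-base majority vote on made-up reads. It is not the fgbio algorithm, which works with base qualities and can tolerate mismatches in the UMI itself.

```python
from collections import Counter, defaultdict


def umi_consensus(reads):
    """Collapse reads sharing (position, UMI) into one consensus sequence.

    reads: iterable of (position, umi, sequence) tuples. Sequences within a
    group are assumed to have equal length.
    """
    groups = defaultdict(list)
    for pos, umi, seq in reads:
        groups[(pos, umi)].append(seq)

    consensus = {}
    for key, seqs in groups.items():
        # Majority vote per column; an isolated sequencing error is outvoted.
        consensus[key] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


# Hypothetical reads, for illustration only.
reads = [
    (100, "AACG", "ACGTT"),
    (100, "AACG", "ACGTT"),
    (100, "AACG", "ACCTT"),  # differs at the third base: likely an error
    (100, "GGTA", "ACGAT"),  # different UMI: a different source molecule
]
for key, seq in umi_consensus(reads).items():
    print(key, seq)
```

The read with a different UMI is kept as its own molecule, so a real low-frequency variant is not averaged away by reads from other molecules.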

Detecting changes in short tandem repeats with ExpansionHunter

Expansions of short tandem repeats (STRs) are well-established causes of certain neurological disorders, and STRs are also historically important markers for fingerprinting samples for forensic and population genetic purposes.

NVIDIA enabled genotyping of these sites in Clara Parabricks by adding support for ExpansionHunter in version 3.7. It’s now easy to go from raw reads to genotyped STRs entirely using the Clara Parabricks command-line interface.
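
As a rough illustration of what genotyping an STR measures, the sketch below counts the longest run of consecutive copies of a repeat unit in a single sequence. This is only a toy: ExpansionHunter itself works from aligned reads and can estimate repeats longer than a read, which simple string counting cannot do. The example sequence is made up.

```python
def longest_repeat_run(sequence, motif):
    """Return the largest number of back-to-back copies of motif in sequence."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    best = 0
    for start in range(len(sequence)):
        count = 0
        pos = start
        # Walk forward one repeat unit at a time from this start position.
        while sequence.startswith(motif, pos):
            count += 1
            pos += len(motif)
        best = max(best, count)
    return best


# Hypothetical read: five CAG copies, a break, then one more copy.
read = "TTGA" + "CAG" * 5 + "TT" + "CAGCA"
print(longest_repeat_run(read, "CAG"))
```

Trying every start position keeps the function correct for repeat units that overlap themselves, at the cost of some redundant work on long sequences.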

Improving Mutect2 somatic mutation calls with PON support

It is common practice to filter somatic mutation calls against a set of mutations from known normal samples, also called a Panel of Normals (PON). NVIDIA added support for both publicly available PON sets and custom PONs to the mutectcaller tool, which now provides an accelerated version of the GATK best practices for somatic mutation calling.
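
Conceptually, PON filtering is a lookup of each candidate somatic call against the set of variants observed in normal samples. The sketch below shows that idea with exact matching on (chromosome, position, ref, alt). It is a simplification under that assumption and does not reproduce how GATK or mutectcaller apply a PON; all coordinates are made up.

```python
def filter_against_pon(calls, panel_of_normals):
    """Keep only calls whose (chrom, pos, ref, alt) key is absent from the PON."""
    pon_keys = set(panel_of_normals)
    return [call for call in calls if call not in pon_keys]


# Hypothetical variants, for illustration only.
panel_of_normals = [
    ("chr1", 1000, "A", "G"),  # recurrent artifact seen in normal samples
    ("chr2", 2000, "C", "T"),
]
tumor_calls = [
    ("chr1", 1000, "A", "G"),  # also in the PON: removed
    ("chr3", 3000, "C", "T"),  # not in the PON: kept
]
print(filter_against_pon(tumor_calls, panel_of_normals))
```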

Accelerating post-calling VCF annotation and quality control

In the v3.6 release, NVIDIA added the vbvm, vcfanno, frequencyfiltration, vcfqc, and vcfqcbybam tools that made post-calling VCF merging, annotation, filtering, and quality control easier to use.

The v3.7 release improved upon these tools by completely rewriting the backend of vbvm, vcfqc and vcfqcbybam, all of which are now more robust and up to 15x faster.

In the case of vcfanno, NVIDIA developed a new annotation tool called snpswift, which adds features and acceleration while retaining the core capability of accurate allele-based database annotation of VCF files. The new snpswift tool also supports annotating a VCF file with gene name data from ENSEMBL, helping to make sense of coding variants. While the new post-calling pipeline looks similar to the one from v3.6, you should find that your analysis runs even faster.
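
"Allele-based" annotation means a database entry is attached only when the variant's alleles match, not merely its position. The sketch below illustrates this on a single tab-separated VCF data line. It is not snpswift's implementation: the `DB_ID` tag, the database contents, and the function name are hypothetical, and multi-allelic records are not handled.

```python
def annotate_vcf_line(line, database, tag="DB_ID"):
    """Append tag=<value> to INFO when (CHROM, POS, REF, ALT) is in database."""
    fields = line.rstrip("\n").split("\t")
    # VCF columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, ...
    chrom, pos, _id, ref, alt = fields[:5]
    hit = database.get((chrom, int(pos), ref, alt))
    if hit is not None:
        info = fields[7]
        fields[7] = f"{tag}={hit}" if info == "." else f"{info};{tag}={hit}"
    return "\t".join(fields)


# Hypothetical database keyed by exact allele, for illustration only.
database = {("chr1", 1000, "A", "G"): "var0001"}

same_allele = "chr1\t1000\t.\tA\tG\t50\tPASS\tDP=30"
other_allele = "chr1\t1000\t.\tA\tT\t50\tPASS\tDP=30"
print(annotate_vcf_line(same_allele, database))   # INFO gains the DB_ID tag
print(annotate_vcf_line(other_allele, database))  # same position, left unchanged
```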

Summary

With Clara Parabricks v3.7, NVIDIA is demonstrating a commitment to making Parabricks the most comprehensive solution for accelerated analysis of genomic data. It is an extensive toolkit for WGS, WES, and now RNASeq analysis, as well as gene panel and UMI data.

Try out Clara Parabricks for free for 90 days and run this tutorial on your own data.
