X+Y, in probability, is a beautiful mess | Visualizing continuous convolutions

GPU-Accelerated Single-Cell RNA Analysis with RAPIDS-singlecell

Single-cell sequencing has become one of the most prominent technologies used in biomedical research. Its ability to decipher changes in the transcriptome and epigenome on a cell level has enabled researchers to gain valuable new insights. As a result, single-cell experiments have grown in size and complexity by a factor of over 100, with experiments involving more than 1 million cells becoming increasingly common. 

However, the resulting data must be analyzed in a highly iterative process. It is crucial that fast algorithms are used for these iterative steps to enable quick turnaround times.

For more consistent single-cell analysis using Python, developers at scverse have worked toward building an entire ecosystem to help researchers perform analyses. At the core of this ecosystem lies a data structure that maintains annotations of various transformations throughout the data processing pipeline during single-cell analysis. 

AnnData, a Python package for handling annotated data matrices in memory and on disk, is the core structure used in the Scanpy library, which is the main single-cell analysis suite within the scverse ecosystem. Scanpy builds on top of other libraries common to the PyData ecosystem—such as NumPy, SciPy, Numba, and Scikit-learn—for just about all of the typical analysis steps. 

However, Scanpy algorithms are mostly CPU-based and slow down significantly with larger experiments. The highly iterative nature of the single-cell analysis process only compounds this problem. 

GPUs for single-cell analysis

General feasibility of running downstream single-cell RNA sequencing (scRNA-seq) analysis on the GPU with RAPIDS and Scanpy has been shown in Accelerating Single-Cell Genomic Analysis with GPUs. This work resulted in the rapids-single-cell-examples GitHub repo, which contains a series of example notebooks written by the RAPIDS and NVIDIA Parabricks teams. RAPIDS is an open-source suite of libraries for GPU-accelerated data science with Python. Parabricks is a free suite of GPU-accelerated, industry-standard genomic analysis tools based on deep learning.

While these example notebooks demonstrate a few typical single-cell RNA workflows on the GPU, they were never intended for daily use nor as a GPU-accelerated replacement for libraries like Scanpy. 

Drawing inspiration from prior work on rapids-single-cell-examples is an emerging library called rapids-singlecell, a GPU-accelerated tool for scRNA analysis. This tool aims to be a daily drivable single-cell analysis suite that is compatible with the scverse ecosystem. It uses RAPIDS and CuPy to provide GPU-accelerated functions that are near-drop-in replacements for the corresponding functions in Scanpy.

On average, users can expect performance to increase between 10x and 20x by using rapids-singlecell. For more details, see Accelerating Single Cell Genomic Analysis Using RAPIDS.

Faster single-cell analysis using RAPIDS

rapids-singlecell follows a usability model similar to that of the scverse Python libraries. It is also written in Python but runs many of the performance-critical pieces on the GPU, hiding the complexities normally associated with writing applications in CUDA (the programming platform typically used to write accelerated algorithms for NVIDIA GPUs).

rapids-singlecell consists of five categories, which are described in the sections that follow. Each category accelerates a different piece of the typical single-cell analysis workflow.

For more information, including the various APIs offered by rapids-singlecell, see the rapids-singlecell documentation.

cunnData

The AnnData, or annotated data object, is a widely used data structure for handling single-cell RNA sequencing data. As shown in Figure 1, it consists of several attributes, including a count matrix attribute, X, which represents the expression levels of genes in each cell. AnnData objects also contain annotation dataframes for cells (.obs attribute) and genes (.var attribute), which store additional information such as cell type and gene names.

Diagram illustrating the structure of the cunnData class, similar to the AnnData Python package, but with data stored in GPU memory. The main components include: .X matrix and its layers stored in GPU memory, .obs DataFrame, .var DataFrame, .uns dictionary, .obsm Mapping, and .varm Mapping.
Figure 1. Schematic representation of the cunnData class structure

In contrast, cunnData (Figure 1) is a minimized and lightweight version of the AnnData object for the GPU that replaces the scverse standard for preprocessing. Instead of storing the count matrix .X on the CPU, cunnData stores it on the GPU as CuPy sparse matrices. This makes it faster and more efficient to perform computations on the count matrix.

cunnData also includes additional features, such as the ability to store different versions of the count matrix (for example, raw integer counts) in .layers. Unlike AnnData, which stores .layers in host (CPU) memory, cunnData also stores .layers on the GPU, reducing the need to copy data from host to GPU memory and enabling accelerated computations.

cunnData supports unstructured annotations in the .uns attribute, as well as multidimensional annotations of cells and genes in the .obsm and .varm attributes, which are stored in the host memory. These annotations enable users to include additional information about their data, such as spatial coordinates or principal component analysis (PCA) embeddings.

Similarly, cunnData supports slicing like AnnData. These slices, however, are always full copies of the original data, as opposed to views. Overall, cunnData enables a faster approach to preprocessing scRNA-seq data compared to the more feature-rich CPU-bound AnnData object.
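
To make the copy-versus-view distinction concrete, here is a toy sketch in plain Python. ToyData is a hypothetical stand-in, not the cunnData implementation; it only mimics the behavior described above, where a slice is an independent copy and changing it leaves the original untouched.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyData:
    """Hypothetical stand-in for an annotated matrix (not the real cunnData)."""
    X: list                      # cells x genes count matrix (list of rows)
    obs: list                    # one annotation dict per cell
    layers: dict = field(default_factory=dict)

    def __getitem__(self, keep):
        # keep is a list of booleans, one per cell; the result is a full copy
        rows = [i for i, k in enumerate(keep) if k]
        return ToyData(
            X=[copy.deepcopy(self.X[i]) for i in rows],
            obs=[copy.deepcopy(self.obs[i]) for i in rows],
            layers={name: [copy.deepcopy(m[i]) for i in rows]
                    for name, m in self.layers.items()},
        )


data = ToyData(X=[[1, 0], [0, 5], [2, 2]],
               obs=[{"n_genes": 1}, {"n_genes": 1}, {"n_genes": 2}])
subset = data[[o["n_genes"] >= 2 for o in data.obs]]
subset.X[0][0] = 99          # modifying the slice does not touch the original
```

With a view-based design, the last assignment could write through to the parent object; with copy semantics, it costs extra memory but the two objects stay independent.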

The Python snippet below demonstrates the conversion of an AnnData object (a standard data structure for handling single-cell RNA sequencing data) into a cunnData object.

import scanpy as sc
import rapids_singlecell as rsc
adata = sc.read("PATH TO DATASET")
cudata = rsc.cunnData.cunnData(adata=adata) 

Preprocessing

The preprocessing functions are stored in cunnData_funcs, which provides accelerated alternatives to the Scanpy preprocessing functions. These functions work on the cunnData object and use RAPIDS cuML and CuPy to dramatically accelerate Scanpy functions based on Scikit-learn, NumPy, and SciPy.

Filtering cells and genes can be accomplished with filter_cells and filter_genes functions, respectively. Quality control is handled with the calculate_qc_metrics function.

# Basic QC rapids-singlecell
rsc.pp.flag_gene_family(cudata,gene_family_name="MT", gene_family_prefix="mt-")
rsc.pp.calculate_qc_metrics(cudata,qc_vars=["MT"])
cudata = cudata[cudata.obs["n_genes_by_counts"] > 500]
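# The comparison on the next line was truncated in the source; max_pct_mt is a placeholder for the cutoff
cudata = cudata[cudata.obs["pct_counts_MT"] < max_pct_mt]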

To normalize your data, cunnData_funcs provides GPU alternatives to the normalize_total, log1p, and the recently introduced normalize_pearson_residuals functions from Scanpy. Annotating highly variable genes is accelerated for all flavors supported in Scanpy (including seurat, cellranger, seurat_v3, pearson_residuals), as well as poisson_gene_selection, which is adapted from scvi-tools.

# log normalization and highly variable gene selection
cudata.layers["counts"] = cudata.X.copy()
rsc.pp.normalize_total(cudata,target_sum=1e4)
rsc.pp.log1p(cudata)
rsc.pp.highly_variable_genes(cudata,n_top_genes=5000,flavor="seurat_v3",layer = "counts")
cudata = cudata[:,cudata.var["highly_variable"]==True]
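
As a rough illustration of what total-count normalization followed by log1p computes, here is a minimal CPU sketch on a tiny dense matrix. The helper names mirror the Scanpy functions, but these are simplified stand-ins written for this example, not the rapids-singlecell or Scanpy implementations, and the count values are made up.

```python
import math


def normalize_total(counts, target_sum=1e4):
    """Scale each cell (row) so that its counts sum to target_sum."""
    out = []
    for row in counts:
        total = sum(row)
        scale = target_sum / total if total > 0 else 0.0
        out.append([v * scale for v in row])
    return out


def log1p(matrix):
    """Apply log(1 + x) element-wise."""
    return [[math.log1p(v) for v in row] for row in matrix]


counts = [[1, 3, 0], [10, 0, 10]]   # 2 cells x 3 genes, toy values
normalized = normalize_total(counts, target_sum=1e4)
logged = log1p(normalized)
```

After normalization every cell has the same total count, so differences in sequencing depth no longer dominate; zeros stay zero after log1p, which preserves sparsity.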

The regress_out function, used to remove unwanted sources of variation, is accelerated with the cuML linear regression estimator. It also supports multitarget regression, which was introduced in cuML in version 22.12, while staying backwards compatible with prior versions. 

Principal component analysis wraps cuML PCA, Truncated SVD, and Incremental PCA to give you the same options offered by Scanpy. With the PCA version in cunnData_funcs, you can choose which layer you want to use for the analysis, an additional feature not currently supported by the Scanpy PCA function.

# Regression, scaling and PCA
rsc.pp.regress_out(cudata,keys=["total_counts", "pct_counts_MT"])
rsc.pp.scale(cudata,max_value=10)
rsc.pp.pca(cudata, n_comps = 100)

sc.pl.pca_variance_ratio(cudata, log=True,n_pcs=100)
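
The idea behind regressing out a covariate can be sketched for a single gene and a single covariate: fit a line by ordinary least squares and keep the residuals. This is a simplified illustration with made-up numbers and a hypothetical helper name; the accelerated function works on whole matrices and supports several covariates at once.

```python
def regress_out_one(y, x):
    """Return residuals of y after fitting y = a + b * x by ordinary least squares."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


total_counts = [1000.0, 2000.0, 3000.0, 4000.0]   # covariate per cell (toy)
gene_expr = [1.1, 1.9, 3.2, 3.8]                   # one gene across cells (toy)
residuals = regress_out_one(gene_expr, total_counts)
```

The residuals sum to zero and are uncorrelated with the covariate, which is what "removing an unwanted source of variation" means in this linear setting.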

cunnData_funcs can accelerate preprocessing by a factor of 10 to 20 (Tables 1-3). After preprocessing, the cunnData object is transformed into an AnnData object.

adata_preprocessed = cudata.to_AnnData()

Tools

scanpy_gpu provides accelerated functions that work on the AnnData object. To keep the syntax as close as possible between Scanpy and rapids-singlecell, metadata is also written to the .uns attribute. This attribute is useful for storing trained parameters such as the variance ratio, which is computed during the PCA computation. scanpy_gpu provides a PCA function for the AnnData object equivalent to the one in cunnData_funcs.

Scanpy already includes support for computing UMAP and nearest neighbors on the GPU using cuML. scanpy_gpu extends Scanpy GPU support by adding more algorithms, such as accelerated graph-based clustering using Leiden and Louvain from cuGraph, as well as the Force Atlas 2 algorithm for visually laying out graph data. scanpy_gpu also uses PCA and kernel density estimation (KDE) from cuML. Diffusion maps are computed using the CuPy library, in a similar manner to how Scanpy uses SciPy and NumPy for scientific computing.

For batch correction, scanpy_gpu provides a GPU port of Harmony Integration, called harmony_gpu. PyMDE (minimum distortion embedding), a function that enables embedding single-cell data while jointly learning the graph and the low-dimensional representation in a probabilistic manner, has also been adapted from scvi-tools.

As a near-drop-in replacement, rapids-singlecell relies on Scanpy for visualization, so plotting works directly within the scverse framework.

Decoupler

The decoupler tool uses a unified framework to implement several statistical methods for inferring biological activity, that is, cellular, molecular, and physiological processes in living organisms, such as gene set and transcription factor activity. decoupler_gpu re-implements and accelerates the weighted sum (run_wsum) and multivariate linear model (run_mlm) methods. The GPU port in rapids-singlecell uses the same nets/models as decoupler. Table 1 shows a performance increase of up to 37x for wsum.
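
The core of a weighted-sum score can be sketched in a few lines: for each sample, multiply the expression of each gene in a set by its weight and add the results. The gene names, weights, and helper below are made up for illustration, and the sketch omits any normalization or significance testing that the library may perform.

```python
def weighted_sum(expression, weights):
    """Score one sample: sum of expression * weight over the genes in the set."""
    return sum(expression[g] * w for g, w in weights.items() if g in expression)


# Toy inputs: gene names and weights are hypothetical
cell = {"GENE_A": 2.0, "GENE_B": 0.5, "GENE_C": 1.0}
pathway = {"GENE_A": 1.0, "GENE_B": -2.0, "GENE_D": 0.7}
score = weighted_sum(cell, pathway)
```

Genes missing from the sample (GENE_D here) simply contribute nothing. Across many cells and many gene sets this becomes a matrix product, which is the kind of operation that maps well onto a GPU.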

Squidpy developments

rapids-singlecell is continually expanding with new accelerated functions for the scverse ecosystem. Comprehensive tests have been added to the library to ensure the correctness and reliability of the code. Squidpy enables detailed analysis and visualization of spatial molecular data. It facilitates understanding of complex cell interactions and spatial patterns, greatly contributing to the expansion of the scverse ecosystem. 

Some functions have been accelerated with rapids-singlecell. Spatial auto-correlation with Moran’s I and Geary’s C promises a performance increase of up to 100x. The ligand-receptor (ligrec) interaction analysis in Squidpy has also been optimized and accelerated, resulting in a performance increase of more than 10x.
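
For readers unfamiliar with Moran's I, the statistic can be sketched directly from its textbook definition. This is a plain-Python illustration on a toy four-spot neighborhood graph, not the Squidpy or rapids-singlecell implementation; the function name and data are assumptions made for the example.

```python
def morans_i(values, weights):
    """Moran's I for values x_i and a spatial weight matrix w_ij (list of lists)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_total = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    variance_sum = sum(d * d for d in dev)
    return (n / w_total) * (cross / variance_sum)


# Four spots in a row; neighbors are adjacent spots (toy example)
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
smooth = morans_i([1.0, 2.0, 3.0, 4.0], chain)       # similar neighbors
alternating = morans_i([1.0, 4.0, 1.0, 4.0], chain)  # dissimilar neighbors
```

Positive values indicate that neighboring spots have similar expression, and negative values indicate that they differ. The double loop over all pairs is what makes the computation expensive at scale and a natural target for GPU acceleration.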

Benchmarks

Our benchmark results show that using GPU acceleration with the rapids-singlecell package and the decoupler functions can significantly improve the performance of scRNA-seq analysis. 

For example, running a sample rapids-singlecell notebook with about 90 K cells end-to-end on a server node with two AMD Epyc Milan 7543 CPUs, 500 GB memory, and an NVIDIA A100 80 GB GPU was completed in just 51 seconds using the rapids-singlecell package, compared to 1,106 seconds with the traditional Scanpy CPU workflow.

Similarly, the decoupler functions also demonstrated significant speed improvements, with the mlm function running in just 12 seconds on the GPU compared to 83 seconds on the CPU, and the wsum method completing in just 26 seconds on the GPU compared to 16 minutes and 10 seconds on the CPU. 

Overall, these results demonstrate the potential for GPU acceleration to make scRNA-seq analysis faster and more efficient. These benchmark results are summarized in Table 1.

Function | CPU | GPU | Speedup
Whole notebook (excluding decoupler functions) | 1,106 s (18.5 min) | 51 s | 21x
Preprocessing | 74 s | 8 s | 9x
Regress out | 27 s | 1.6 s | 16x
PCA | 35 s | 0.7 s | 50x
HVG (Seurat v3) | 3.2 s | 0.4 s | 8x
Harmony | 417 s | 18 s | 23x
Neighbors | 22 s | 5.1 s | 4.3x
UMAP | 36 s | 0.4 s | 90x
TSNE | 133 s | 2.4 s | 55x
Louvain | 17 s | 0.6 s | 28x
Leiden | 14 s | 0.2 s | 70x
Logistic regression | 58 s | 3.7 s | 15x
Draw graph (FA2) | 256 s | 0.3 s | 850x
run_mlm (DoRothEA) | 83 s | 12 s | 7x
run_wsum (PROGENy) | 970 s (16 min) | 26 s | 37x
Table 1. Server node benchmark for a dataset of 90,000 cells
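
The speedup column is simply the CPU time divided by the GPU time. A short sketch that recomputes a few rows of Table 1 follows; the helper name is ours, and the timings are copied from the table.

```python
def speedup(cpu_seconds, gpu_seconds):
    """Speedup factor: how many times faster the GPU run is."""
    return cpu_seconds / gpu_seconds


# (CPU s, GPU s) pairs copied from Table 1
timings = {
    "Whole notebook": (1106, 51),
    "PCA": (35, 0.7),
    "UMAP": (36, 0.4),
    "run_wsum": (970, 26),
}
for name, (cpu, gpu) in timings.items():
    print(f"{name}: {speedup(cpu, gpu):.1f}x")
```

The recomputed factors agree with the table to within rounding.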

In addition to the previous benchmark results, running a sample rapids-singlecell notebook with 500 K cells on the server node takes about 2 minutes when using rapids-singlecell. The same analysis takes about 41 minutes on the CPU.

Furthermore, using pearson_residuals for highly variable gene selection and normalization can also be accelerated using GPUs, providing additional speed improvements in scRNA-seq analysis.

rapids-singlecell is not only capable of accelerating single-cell data analysis on high-end server nodes, but also on consumer-grade hardware. Running the same notebook end-to-end with 500,000 cells on a desktop system with an AMD 5950X CPU, 64 GB memory, and an NVIDIA RTX 3090 GPU takes around 5 minutes using rapids-singlecell. Although the system was using the RAPIDS Memory Manager (RMM) and unified memory to oversubscribe the GPU's memory, it saw a significant speedup compared to the CPU server. These benchmark results are summarized in Table 2.

Function | CPU | GPU (A100) | GPU (3090) | Speedup (A100 vs. CPU)
Whole notebook (excluding PR functions) | 2,460 s (41 min) | 110 s | 290 s | 22x
Preprocessing | 305 s | 28 s | 169 s | 10x
HVG (Seurat v3) | 48 s | 1.5 s | 13 s | 32x
Regress out | 104 s | 5.1 s | 16 s | 20x
Scale | 8.4 s | 1.3 s | 5 s | 6.4x
PCA | 86 s | 3.7 s | 35 s | 23x
Neighbors | 74 s | 17.1 s | 18.3 s | 4.3x
UMAP | 281 s (4.6 min) | 6.7 s | 7.6 s | 60x
TSNE | 786 s (13 min) | 10 s | 12.9 s | 105x
Louvain | 283 s (4.5 min) | 4.5 s | 5.7 s | 62x
Leiden | 282 s (4.5 min) | 0.6 s | 0.9 s | 470x
Logistic regression | 452 s (7.5 min) | 33 s | 63 s | 13x
Diffusion map | 30 s | 0.75 s | 1.3 s | 40x
HVG (PR) | 104 s | 2.1 s | 15.6 s | 50x
Normalize (PR) | 22 s | 0.3 s | 1 s | 73x
Table 2. Server node and consumer system benchmark for a dataset of 500,000 cells

Running the same sample notebook (Table 1) with about 90 K cells end-to-end on the desktop system takes only 48 seconds when using rapids-singlecell. In comparison, the traditional Scanpy CPU workflow takes 774 seconds. The accelerated decoupler functions also demonstrate significant speed improvements on consumer-grade hardware. These benchmark results are summarized in Table 3.

Function | CPU | GPU | Speedup
Whole notebook (excluding decoupler functions) | 774 s (13 min) | 48 s | 16x
Preprocessing | 114 s | 6 s | 19x
Regress out | 62 s | 1.6 s | 39x
PCA | 42 s | 0.7 s | 60x
HVG (Seurat v3) | 2.7 s | 0.4 s | 6.7x
Harmony | 175 s | 21.7 s | 8x
Neighbors | 14.9 s | 4.6 s | 3.2x
UMAP | 31 s | 0.3 s | 103x
TSNE | 95 s | 1.4 s | 68x
Louvain | 9.3 s | 0.5 s | 18x
Leiden | 13.2 s | 0.1 s | 130x
Logistic regression | 76 s | 3.75 s | 20x
Draw graph (FA2) | 191 s | 0.23 s | 830x
run_mlm (DoRothEA) | 55 s | 12 s | 4.5x
run_wsum (PROGENy) | 690 s (11.5 min) | 28 s | 26x
Table 3. Consumer system benchmark for a dataset of 90,000 cells

Installation 

There are multiple methods for installing rapids-singlecell. The easiest method is to use one of the yaml files provided within the GitHub repository. These set up the entire environment with everything needed for running the example notebooks.

conda env create -f conda/rsc_rapids_23.02.yml

You can also install rapids-singlecell from PyPI into a Conda environment and install RAPIDS from Conda. The default installer does not include RAPIDS or CuPy. Scanpy is also excluded because it is technically not necessary.

pip install rapids-singlecell

Finally, you can install the entire library, including the RAPIDS dependencies, from PyPI using the new experimental PyPI packages from RAPIDS. However, this method of installation requires the user to properly set up CUDA so that it can be found by RAPIDS and CuPy.

To do this, you can use the following command:

pip install 'rapids-singlecell[rapids]' --extra-index-url=https://pypi.nvidia.com

Conclusion

With the rapids-singlecell library, it is possible to run the complete analysis of 500 K cells in less time than it takes a CPU to compute only its UMAP embedding. Therefore, it enables a much faster iterative process in single-cell data analysis stages. 

rapids-singlecell also enables the bioinformatician to analyze the data with a physician or biologist in real time, leading to better collaboration and interpretation of the data. In our experience, it is possible to analyze 200 K cells without any issues, even on a consumer-class 3090 series graphics card. Even better, RMM enables the GPU memory to be oversubscribed and spilled to the main memory, enabling scales well over 500 K cells.

With the datacenter-class NVIDIA A100 80 GB GPU, you can analyze matrices containing as many as 2^31-1 (approximately 2.15 billion) non-zero counts. (Note that this is the current limit imposed by the CuPy 32-bit integer-based indexing for sparse matrix calculations.) This powerful capability enables users to analyze datasets with over 1 million cells.
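
To see how the 32-bit indexing limit translates into a cell count, here is a back-of-the-envelope sketch. The figure of 2,000 non-zero genes per cell is a hypothetical average chosen for illustration, not a number from this post, and the helper name is ours.

```python
INT32_MAX = 2**31 - 1   # largest index a signed 32-bit integer can hold


def max_cells(avg_nonzero_genes_per_cell):
    """Upper bound on cells before the non-zero count exceeds INT32_MAX."""
    return INT32_MAX // avg_nonzero_genes_per_cell


# 2,000 detected genes per cell is a hypothetical average, not a figure from the text
print(INT32_MAX)          # 2147483647
print(max_cells(2000))    # 1073741
```

Under that assumption the limit sits at roughly 1.07 million cells; sparser data would allow more cells, and denser data fewer.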

The upwards of 20x speedup that rapids-singlecell provides enables researchers to focus more on analyzing and interpreting their single-cell data, rather than waiting for lengthy computational processes. In the true spirit of RAPIDS, this ultimately enhances productivity and fosters new insights into cellular biology that were not possible before.

Quicker Cures: How Insilico Medicine Uses Generative AI to Accelerate Drug Discovery

While generative AI is a relatively new household term, drug discovery company Insilico Medicine has been using it for years to develop new therapies for debilitating diseases. The company’s early bet on deep learning is bearing fruit — a drug candidate discovered using its AI platform is now entering Phase 2 clinical trials to treat…

Snowflake and NVIDIA Team to Help Businesses Harness Their Data for Generative AI in the Data Cloud

Snowflake, the Data Cloud Company, and NVIDIA today announced at Snowflake Summit 2023 that they are partnering to provide businesses of all sizes with an accelerated path to create customized generative AI applications using their own proprietary data, all securely within the Snowflake Data Cloud.

Maximizing Network Performance for Storage with NVIDIA Spectrum Ethernet

As data generation continues to increase, linear performance scaling has become an absolute requirement for scale-out storage. Storage networks are like car roadway systems: if the road is not built for speed, the potential speed of a car does not matter. Even a Ferrari is slow on an unpaved dirt road full of obstacles.

Scale-out storage performance can be hindered by the Ethernet fabric that connects the storage nodes. NVIDIA accelerated Ethernet can remove performance bottlenecks, enabling maximum storage performance for applications in general, and AI/ML in particular.

Scale-out storage requires a strong network

Every second, 54,000 pictures are taken worldwide. By the time you read this, that number will be even higher. No matter what your business is, chances are you have massive amounts of data that must be stored and analyzed, with the amount growing every day. 

The old scale-up approach of using ever-larger storage filers has been replaced with a scale-out approach to deliver storage that scales linearly in both capacity and performance.

With scale-out storage, or distributed storage, several smaller nodes are configured and connected to act as one logical unit. A single file or object can be spread across many nodes. 

As more scale is needed, additional storage nodes are easily added to increase both storage capacity and performance. This applies to both a traditional enterprise storage vendor solution, or a software-defined solution, with the software and hardware sourced independently. 

Distributed storage enables flexible scaling and cost efficiency, but requires a high-performance network to connect the storage nodes. Many data center switches are ill-suited to the unique traffic characteristics of storage, and in fact can cripple the performance of scale-out storage solutions.  

How storage traffic is different from traditional traffic

For many use cases, network traffic is consistent and homogeneous, and traditional Ethernet suffices. However, traffic generated by storage devices may cause the issues detailed below.

Network stress 

Current storage solutions benefit from faster SSDs and storage interfaces, such as NVMe and PCIe Gen 4 (soon PCIe Gen 5), that are designed to provide higher performance. 

Congestion 

When the storage fabric is saturated, network congestion becomes inevitable, just like roadway congestion when too much traffic is on the highway. Network congestion is particularly problematic for scale-out storage because each storage node is expected to offer fast data delivery. But when congestion occurs, many data center switches have a fairness problem, where some nodes will be slowed much more than others. A single file or object is usually spread across many nodes, so anything that slows a single node effectively slows the whole cluster.

Bursty traffic

Most storage workloads are bursty, generating intense data transfers and repeatedly requiring large amounts of bandwidth for short periods. When that happens, the network switch must use its buffer to absorb the burst until the transient burst is over, thus preventing packet loss. Otherwise, that packet loss will require data retransmissions, significantly deteriorating application performance.

Storage jumbo frames 

Traditional data center network traffic uses a maximum transmission unit (MTU) of 1.5 KB. Scale-out storage nodes perform better when they can use 9 KB “jumbo frames,” which increase throughput while reducing CPU processing overhead. Many data center switches built with commodity switch ASICs perform poorly or unpredictably with jumbo frames.
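
A quick way to see why jumbo frames reduce processing overhead is to count how many frames a transfer needs at each size. The sketch below treats the MTU as the usable bytes per frame and ignores protocol headers; the 1 GB transfer size and helper name are assumptions chosen for illustration.

```python
import math


def frames_needed(transfer_bytes, bytes_per_frame):
    """Frames required to move transfer_bytes (headers ignored for simplicity)."""
    return math.ceil(transfer_bytes / bytes_per_frame)


transfer = 10**9                       # 1 GB transfer, chosen for illustration
standard = frames_needed(transfer, 1500)
jumbo = frames_needed(transfer, 9000)
print(standard, jumbo, round(standard / jumbo, 1))
```

Moving the same data in about one-sixth as many frames means about one-sixth as many per-frame operations for the hosts and the switch.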

Low latency 

One of the ways storage IOPS have improved is through the orders-of-magnitude latency reduction for the read/write operations in flash-based media. However, those costly performance improvements can be lost when the network introduces high latency—especially due to excessive buffering.

Both training and inference require adequate amounts of data with high-speed access, to make sure that GPU processors are fed quickly enough to keep them fully engaged. During training, WRITE operations are performed by all nodes to improve model accuracy. This results in a burst, making it imperative for switches to handle congestion effectively. Finally, lower storage latency enables GPUs to handle compute tasks more efficiently.

Why ASICs are suboptimal for storage traffic 

Most data center switches are built using commodity switch ASICs that were cost-optimized for traditional data traffic patterns and packet sizes. To keep costs low while achieving their bandwidth targets, Ethernet switch chip vendors compromised fairness by using a split buffer architecture.

Every switch has a buffer to absorb traffic bursts and prevent packet loss when congestion occurs. The common approach is to have a buffer that is shared across many ports. However, not all shared buffers are the same—there are different buffer architectures.

Commodity switches do not have a fully shared buffer—they use either an ingress-shared buffer, or an egress-shared buffer. 

With an ingress-shared buffer, there is a static mapping between a group of incoming ports and a specific memory slice. These ports can use only the memory in that assigned slice and not the whole buffer, not even if the rest of the buffer is available and no one is using it. 

With an egress-shared buffer, the mapping is between a group of outgoing ports and a specific buffer memory slice. Again, each group of egress ports can only use its assigned buffer slice, not the whole buffer.

With these two architectures, flows that stay within the same memory slice do not behave like flows that travel between memory slices. If many flows use ports with the same buffer, those ports will experience higher latency and lower throughput, while traffic using other slices of the buffer will enjoy higher performance. 

Storage performance depends on which ports the storage traffic (and other traffic) is using and how busy the buffer slices for those ports are. This is why switches that use split buffers often experience issues related to fairness, predictability, and microburst absorption.
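
The difference between a split buffer and a fully shared buffer can be illustrated with a deliberately simple model: a burst arrives at once, nothing drains, and anything that does not fit is dropped. All numbers and names below are hypothetical; real switches drain queues continuously and apply admission thresholds that this sketch ignores.

```python
def absorb_burst(burst_per_group, slice_capacity, shared):
    """Return packets dropped when each port group receives a burst.

    shared=False: each group may only use its own slice (split buffer).
    shared=True: all groups draw from one pool of the same total size.
    """
    if shared:
        pool = slice_capacity * len(burst_per_group)
        return max(0, sum(burst_per_group) - pool)
    return sum(max(0, burst - slice_capacity) for burst in burst_per_group)


# Hypothetical numbers: 4 port groups, 100 packets of buffer per slice,
# and a 250-packet burst landing on a single group
burst = [250, 0, 0, 0]
dropped_split = absorb_burst(burst, slice_capacity=100, shared=False)
dropped_shared = absorb_burst(burst, slice_capacity=100, shared=True)
```

With the same total memory, the split design drops 150 packets because three slices sit idle, while the shared pool absorbs the whole burst.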

Why deep-buffer switches are suboptimal for storage

The term deep-buffer switch usually refers to a switch that offers much more buffering (GBs rather than MBs). Deep-buffer switches are often promoted for use as routers, because they can absorb and hold large traffic bursts if there is a mismatch in network speeds or an incast situation.

But in most data center applications, including scale-out storage, the deep-buffer switches negatively impact performance for the following reasons:

Job completion time 

With parallel file systems, the storage node with the slowest response dictates the time required to fetch a file. Unlike commodity switch ASICs that have a sliced on-chip buffer, deep-buffer switches have both on-chip and off-chip buffers, and they all are sliced, not fully shared, buffers. 

Think of how many ways flows can go before they leave the switch. They can stay within one on-chip memory slice (fastest), travel between on-chip memory slices (slower), or travel between on-chip and off-chip memory slices (very slow). 

All these flows will behave differently, and hence will cause fairness and predictability issues for storage traffic. Because these issues slow down one or more nodes, they adversely impact job completion time and slow down the whole distributed storage cluster.

Latency

The larger the switch buffer, the longer the queue each packet must go through and the greater the latency. The tested average port-to-port latency of a deep-buffer switch is more than 500 microseconds. By comparison, NVIDIA Spectrum 1, a fully shared buffer switch from the same generation, has a latency of just 0.3 microseconds: it takes nanoseconds rather than microseconds to switch or route a packet.

Deep-buffer latency is more than 1,000x higher. Does this happen only when congestion occurs? No. Under congestion, deep-buffer latencies will be much higher; in fact, up to 20 milliseconds, or 50,000x higher. While 500 microseconds of latency might be okay for a router between data centers, within a data center it spells death to flash storage performance.

Power and cost 

Deep-buffer switches need hundreds of watts of power to operate even when idling, making their ongoing operational cost higher. The initial purchase cost of deep-buffer switches is also much higher. This might be justified if performance was better, but real-world testing has proven just the opposite. 

Choosing an inappropriate network switch will severely slow down your storage workloads, making your expensive fast storage act like cheaper and slower storage. 

With NVIDIA Spectrum, both CapEx and OpEx can be reduced. Watts can also be used for other purposes within a rack.  

NVIDIA Spectrum switches are optimized for storage

With commodity switch ASICs, flows are either staying at the same memory slice or traveling between memory slices. 

With NVIDIA Spectrum switches, all flows behave the same due to the fully shared buffer. The value of this architecture is maximum burst absorption capacity as well as optimal fair and predictable performance. All traffic flows through a switch receive the same treatment and generally enjoy the same good performance, regardless of which ingress and egress ports they use.

Benchmarking the deep-buffer switch and NVIDIA Spectrum

The first case uses FIO, a common storage benchmarking tool, for WRITE operations from two initiators to one target while background traffic is running. This is a typical storage scenario.

The team measured the time required for the FIO job to complete (shorter is better). With the deep-buffer switch, the FIO job took 87 seconds. With the NVIDIA Spectrum switch, the job ran 40% faster and completed after just 51 seconds.

Bar graph comparing NVIDIA Spectrum and deep-buffer switch showing that NVIDIA Spectrum is 40% faster at job completion time.
Figure 1. Storage write operations are 40% faster with the NVIDIA Spectrum switch compared to a deep-buffer switch

Deep-buffer switches greatly increase latency, which slows down your storage and reduces your application performance. But how high can the latency go?

For the second case, the team took the deep-buffer switch and tested how latency is impacted under different congestion use cases. The maximum buffer occupancy is only around 10% of the whole buffer size.

Two line graphs comparing real and projected latency compared to buffer size and buffer occupancy.
Figure 2. Real and projected latency compared to buffer size and buffer occupancy

Two meaningful insights can be derived from the graph on the left of Figure 2. First, deep-buffer switch latency is 50,000x higher than Spectrum switches (2–19 milliseconds compared to only 300 nanoseconds for Spectrum). 

Second, linear dependency is apparent between buffer occupancy and latency. In other words, testing proved that the larger the occupied buffer, the greater the latency.
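
The linear relationship has a simple first-order explanation: a packet that arrives behind a queue must wait for the queued bytes to drain through the egress link. The sketch below assumes a constant drain rate and a hypothetical 100 Gb/s port; it is a simplified model with an illustrative helper name, not a reproduction of the measurements in Figure 2.

```python
def queueing_delay_us(occupied_bytes, link_gbps):
    """Time to drain occupied_bytes through a link, in microseconds."""
    bits = occupied_bytes * 8
    return bits / (link_gbps * 1e9) * 1e6


# Hypothetical 100 Gb/s port and buffer occupancies
for megabytes in (1, 10, 100):
    delay = queueing_delay_us(megabytes * 10**6, link_gbps=100)
    print(f"{megabytes} MB queued -> {delay:.0f} us")
```

In this model, ten times the occupancy means ten times the wait, so a buffer measured in gigabytes can add delays in the millisecond range once it starts to fill.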

With that understanding, the graph on the right of Figure 2 projects the maximum latency per deep-buffer ASIC (such as Jericho 1, Jericho 2, or Ramon). These very high latency numbers are incompatible with data center applications in general and fast storage solutions in particular.

For the third case, the team used two Windows machines and simultaneously copied a file from each to the same target storage.

With the deep-buffer switch, one Windows machine had about three times the bandwidth of the other (830 MBps compared to 290 MBps). With the Spectrum switch, each machine had 584 MBps (50% of the total, as expected).
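
One way to put a number on this unfairness is Jain's fairness index, a standard networking metric that the original test did not report; it is introduced here only as an illustration. It equals 1.0 when all flows get the same throughput and falls toward 1/n as one flow dominates.

```python
def jain_fairness(throughputs):
    """Jain's fairness index: 1.0 means perfectly equal shares."""
    total = sum(throughputs)
    squares = sum(t * t for t in throughputs)
    return total * total / (len(throughputs) * squares)


deep_buffer = jain_fairness([830, 290])   # MBps figures from the test above
spectrum = jain_fairness([584, 584])
print(round(deep_buffer, 3), spectrum)
```

The deep-buffer result scores about 0.81, while the equal split scores exactly 1.0.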

Real-world testing showed that the supposed benefits of deep-buffer switches, such as absorbing packets and preventing drops, do not translate into a positive impact on data center applications.

Deep-buffer switches may be needed for long haul or WAN connections; however, they are suboptimal for data center applications and will have negative effects, particularly when the workload is scaled beyond just two nodes, as in this use case. 

Two stacked graphs comparing the deep-buffer switch and NVIDIA Spectrum, where the bandwidth performance is better with the NVIDIA Spectrum and noted in green.
Figure 3. The deep-buffer switch provides unfair bandwidth per node (left) while the NVIDIA Spectrum switch provides equal bandwidth (right)

These three use cases demonstrate proof points for why deep-buffer switches adversely impact AI/ML and storage workloads, while Spectrum switches provide maximized performance.

Summary

NVIDIA Spectrum Ethernet switches are built for AI/ML and storage workloads, and perform better than switches with split buffers or deep buffers. They handle congestion better, prevent packet loss, and outperform with jumbo frames (preferred for storage). NVIDIA Spectrum Ethernet switches provide overall good application performance with consistently low network latency.

Learn more about NVIDIA Spectrum Ethernet switches. Dive deeper into networking storage in the NVIDIA Developer Forums.

Categories
Offsites

Preference learning with automated feedback for cache eviction

Caching is a ubiquitous idea in computer science that significantly improves the performance of storage and retrieval systems by storing a subset of popular items closer to the client based on request patterns. An important algorithmic piece of cache management is the decision policy used for dynamically updating the set of items being stored, which has been extensively optimized over several decades, resulting in several efficient and robust heuristics. While applying machine learning to cache policies has shown promising results in recent years (e.g., LRB, LHD, storage applications), it remains a challenge to outperform robust heuristics in a way that can generalize reliably beyond benchmarks to production settings, while maintaining competitive compute and memory overheads.

In “HALP: Heuristic Aided Learned Preference Eviction Policy for YouTube Content Delivery Network”, presented at NSDI 2023, we introduce a scalable state-of-the-art cache eviction framework that is based on learned rewards and uses preference learning with automated feedback. The Heuristic Aided Learned Preference (HALP) framework is a meta-algorithm that uses randomization to merge a lightweight heuristic baseline eviction rule with a learned reward model. The reward model is a lightweight neural network that is continuously trained with ongoing automated feedback on preference comparisons designed to mimic the offline oracle. We discuss how HALP has improved infrastructure efficiency and user video playback latency for YouTube’s content delivery network.

Learned preferences for cache eviction decisions

The HALP framework computes cache eviction decisions based on two components: (1) a neural reward model trained with automated feedback via preference learning, and (2) a meta-algorithm that combines the learned reward model with a fast heuristic. As the cache observes incoming requests, HALP continuously trains a small neural network that predicts a scalar reward for each item, formulating this as a preference learning problem driven by pairwise preference feedback. This aspect of HALP is similar to reinforcement learning from human feedback (RLHF) systems, but with two important distinctions:

  • Feedback is automated and leverages well-known results about the structure of offline optimal cache eviction policies.
  • The model is learned continuously using a transient buffer of training examples constructed from the automated feedback process.
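To make the idea of learning a scalar reward from pairwise feedback concrete, here is a minimal sketch. The post does not state which loss HALP uses, so the logistic (Bradley-Terry style) loss below is an assumption, and the linear `reward` function is a stand-in for the small neural network. "Preferred" is taken to mean the item that is the better eviction choice; the two feature values are hypothetical.

```python
import math

def reward(weights, features):
    """Scalar reward: a linear stand-in for the small neural reward model."""
    return sum(w * x for w, x in zip(weights, features))

def pairwise_loss(weights, preferred, other):
    """Logistic loss: small when `preferred` scores higher than `other`."""
    margin = reward(weights, preferred) - reward(weights, other)
    return math.log1p(math.exp(-margin))

def sgd_step(weights, preferred, other, lr=0.1):
    """One gradient step on the pairwise loss for a single labeled comparison."""
    margin = reward(weights, preferred) - reward(weights, other)
    grad_scale = -1.0 / (1.0 + math.exp(margin))  # d(loss) / d(margin)
    return [w - lr * grad_scale * (p - o)
            for w, p, o in zip(weights, preferred, other)]

# Hypothetical features for one labeled pair.
weights = [0.0, 0.0]
evict_first, keep = [1.2, 0.9], [0.1, 0.2]
before = pairwise_loss(weights, evict_first, keep)
weights = sgd_step(weights, evict_first, keep)
after = pairwise_loss(weights, evict_first, keep)
print(after < before)  # True
```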

The eviction decisions rely on a filtering mechanism with two steps. First, a small subset of candidates is selected using a heuristic that is efficient, but suboptimal in terms of performance. Then, a re-ranking step optimizes from within the baseline candidates via the sparing use of a neural network scoring function to “boost” the quality of the final decision.

As a production-ready cache policy implementation, HALP not only makes eviction decisions, but also subsumes the end-to-end process of sampling pairwise preference queries used to efficiently construct relevant feedback and update the model to power eviction decisions.

A neural reward model

HALP uses a lightweight two-layer multilayer perceptron (MLP) as its reward model to selectively score individual items in the cache. The features are constructed and managed as a metadata-only “ghost cache” (similar to classical policies like ARC). After any given lookup request, in addition to regular cache operations, HALP conducts the bookkeeping (e.g., tracking and updating feature metadata in a capacity-constrained key-value store) needed to update the dynamic internal representation. This includes: (1) externally tagged features provided by the user as input, along with a cache lookup request, and (2) internally constructed dynamic features (e.g., time since last access, average time between accesses) constructed from lookup times observed on each item.
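The bookkeeping for the two dynamic features named above can be sketched as follows. This is an illustrative assumption, not HALP's implementation: the class name `FeatureStore`, the tuple layout, and the choice to drop the least recently touched metadata when the store is full are all hypothetical.

```python
from collections import OrderedDict

class FeatureStore:
    """Metadata-only store: tracks access statistics per key, never the cached bytes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.meta = OrderedDict()  # key -> (last_access, access_count, first_access)

    def record_access(self, key, now):
        if key in self.meta:
            _, count, first = self.meta.pop(key)
            self.meta[key] = (now, count + 1, first)  # re-insert at the newest position
        else:
            self.meta[key] = (now, 1, now)
            if len(self.meta) > self.capacity:
                self.meta.popitem(last=False)  # drop the least recently touched entry

    def features(self, key, now):
        """Return (time since last access, average time between accesses)."""
        last, count, first = self.meta[key]
        # With a single access there is no gap yet; infinity marks "unknown".
        avg_gap = (last - first) / (count - 1) if count > 1 else float("inf")
        return (now - last, avg_gap)

store = FeatureStore(capacity=2)
for t in (0, 10, 30):
    store.record_access("a", t)
print(store.features("a", now=35))  # (5, 15.0)
```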

HALP learns its reward model fully online starting from a random weight initialization. This might seem like a bad idea, especially if the decisions are made exclusively for optimizing the reward model. However, the eviction decisions rely on both the learned reward model and a suboptimal but simple and robust heuristic like LRU. This allows for optimal performance when the reward model has fully generalized, while remaining robust to a temporarily uninformative reward model that is yet to generalize, or in the process of catching up to a changing environment.

Another advantage of online training is specialization. Each cache server runs in a potentially different environment (e.g., geographic location), which influences local network conditions and what content is locally popular, among other things. Online training automatically captures this information while reducing the burden of generalization, as opposed to a single offline training solution.

Scoring samples from a randomized priority queue

It can be impractical to optimize for the quality of eviction decisions with an exclusively learned objective for two reasons:

  1. Compute efficiency constraints: Inference with a learned network can be significantly more expensive than the computations performed in practical cache policies operating at scale. This limits not only the expressivity of the network and features, but also how often these are invoked during each eviction decision.
  2. Robustness for generalizing out-of-distribution: HALP is deployed in a setup that involves continual learning, where a quickly changing workload can generate request patterns that are temporarily out-of-distribution with respect to previously seen data.

To address these issues, HALP first applies an inexpensive heuristic scoring rule that corresponds to an eviction priority to identify a small candidate sample. This process is based on efficient random sampling that approximates exact priority queues. The priority function for generating candidate samples is intended to be quick to compute using existing manually tuned algorithms, e.g., LRU. However, it can be configured to approximate other cache replacement heuristics by editing a simple cost function. Unlike prior work, where randomization was used to trade off approximation quality for efficiency, HALP also relies on the inherent randomization of the sampled candidates across time steps to provide the exploratory diversity needed for both training and inference.

The final evicted item is chosen from among the supplied candidates, equivalent to the best-of-n reranked sample, corresponding to maximizing the predicted preference score according to the neural reward model. The same pool of candidates used for eviction decisions is also used to construct the pairwise preference queries for automated feedback, which helps minimize the training and inference skew between samples.
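The two-stage decision described in this section can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function name `choose_victim`, the parameters `sample_size` and `num_candidates`, and the use of plain last-access times as the LRU priority are hypothetical, and `reward_model` is any callable that returns a predicted preference score where higher means a better eviction choice.

```python
import random

def choose_victim(last_access, reward_model, sample_size=8, num_candidates=3, rng=random):
    """Two-stage eviction: heuristic candidate selection, then learned re-ranking.

    last_access: dict mapping key -> last access time (the LRU priority).
    reward_model: callable key -> predicted preference score (higher = evict first).
    """
    # Stage 1: approximate an LRU priority queue by random sampling, then keep
    # the least recently used keys from the sample as eviction candidates.
    sample = rng.sample(list(last_access), min(sample_size, len(last_access)))
    candidates = sorted(sample, key=lambda k: last_access[k])[:num_candidates]
    # Stage 2: best-of-n re-ranking with the learned reward model.
    victim = max(candidates, key=reward_model)
    return victim, candidates

last_access = {"a": 1, "b": 2, "c": 3, "d": 4}
scores = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 1.0}
victim, candidates = choose_victim(last_access, scores.get, sample_size=4, num_candidates=3)
print(victim, candidates)  # b ['a', 'b', 'c']
```

Note that "d" has the highest learned score but is never evicted here, because the heuristic stage did not nominate it. This is the sense in which the heuristic bounds how far an uninformative reward model can stray. The returned candidate pool is also what would feed the pairwise preference queries.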

An overview of the two-stage process invoked for each eviction decision.

Online preference learning with automated feedback

The reward model is learned from online feedback in the form of automatically assigned preference labels. For each queried sample of items, and wherever feasible, a label records the ranked preference ordering of the items by the time until their next re-access, measured from a given snapshot in time. This mirrors the oracle optimal policy, which, at any given time, evicts the item in the cache whose next access lies farthest in the future.
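As an offline illustration of the oracle and of the label it implies for a pair of items, consider the sketch below. It assumes the full request trace is known in advance, which is only possible in hindsight; the function names are hypothetical.

```python
def next_access(trace, start, key):
    """Index of the first request for `key` at or after position `start`, else infinity."""
    for i in range(start, len(trace)):
        if trace[i] == key:
            return i
    return float("inf")

def oracle_victim(trace, start, cached):
    """Offline oracle: evict the cached item whose next access is farthest away."""
    return max(cached, key=lambda k: next_access(trace, start, k))

def preference_label(trace, start, a, b):
    """Return the item of the pair that is re-accessed later, or None on a tie."""
    ta, tb = next_access(trace, start, a), next_access(trace, start, b)
    if ta == tb:
        return None  # e.g., neither item is ever requested again
    return a if ta > tb else b

trace = ["x", "y", "z", "x", "y"]
print(oracle_victim(trace, 1, ["x", "y", "z"]))  # x
print(preference_label(trace, 3, "x", "y"))      # y
```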

Generation of the automated feedback for learning the reward model.

To make this feedback process informative, HALP constructs pairwise preference queries that are most likely to be relevant for eviction decisions. In sync with the usual cache operations, HALP issues a small number of pairwise preference queries while making each eviction decision, and appends them to a set of pending comparisons. The labels for these pending comparisons can only be resolved at a random future time. To operate online, HALP also performs some additional bookkeeping after each lookup request to process any pending comparisons that can be labeled incrementally after the current request. HALP indexes the pending comparison buffer by each element involved in a comparison, and recycles the memory consumed by stale comparisons (those in which neither item may ever be re-accessed) to ensure that the memory overhead associated with feedback generation remains bounded over time.
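A pending-comparison buffer of this kind might look like the sketch below. It is an assumption-laden illustration rather than HALP's data structure: the class name, the `max_age` staleness rule, and the linear scan used to recycle stale pairs are all hypothetical simplifications.

```python
from collections import defaultdict

class PendingComparisons:
    """Holds unlabeled pairs until a later request reveals which item returns first."""

    def __init__(self, max_age):
        self.max_age = max_age           # age after which an unresolved pair is stale
        self.pairs = {}                  # pair_id -> (item_a, item_b, issue_time)
        self.by_item = defaultdict(set)  # item -> ids of pending pairs that mention it
        self.next_id = 0

    def add(self, a, b, now):
        self.pairs[self.next_id] = (a, b, now)
        self.by_item[a].add(self.next_id)
        self.by_item[b].add(self.next_id)
        self.next_id += 1

    def _drop(self, pair_id):
        a, b, _ = self.pairs.pop(pair_id)
        for item in {a, b}:
            self.by_item[item].discard(pair_id)
            if not self.by_item[item]:
                del self.by_item[item]

    def on_request(self, key, now):
        """Label every pending pair containing `key`; return (evict_first, keep) tuples."""
        labeled = []
        for pair_id in list(self.by_item.get(key, ())):
            a, b, _ = self.pairs[pair_id]
            other = b if key == a else a
            labeled.append((other, key))  # `key` came back first, so `other` was the better eviction
            self._drop(pair_id)
        # Recycle stale pairs so memory stays bounded (linear scan kept for brevity).
        for pair_id in [p for p, (_, _, t) in self.pairs.items() if now - t > self.max_age]:
            self._drop(pair_id)
        return labeled

buf = PendingComparisons(max_age=5)
buf.add("a", "b", now=0)
buf.add("c", "d", now=0)
print(buf.on_request("b", now=1))   # [('a', 'b')]
print(buf.on_request("z", now=10))  # []
print(len(buf.pairs))               # 0
```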

Overview of all main components in HALP.

Results: Impact on the YouTube CDN

Through empirical analysis, we show that HALP compares favorably to state-of-the-art cache policies on public benchmark traces in terms of cache miss rates. However, while public benchmarks are a useful tool, they are rarely sufficient to capture all the usage patterns across the world over time, not to mention the diverse hardware configurations that we have already deployed.

Until recently, YouTube servers used an optimized LRU-variant for memory cache eviction. HALP increases YouTube’s memory egress/ingress, the ratio of the total bandwidth egress served by the CDN to the ingress consumed for retrieval due to cache misses, by roughly 12% and memory hit rate by 6%. This reduces latency for users, since memory reads are faster than disk reads, and also improves egress capacity for disk-bound machines by shielding the disks from traffic.
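To relate the egress/ingress ratio to a byte miss ratio, the short calculation below assumes a simplified model in which every missed byte is retrieved exactly once, so that the byte miss ratio is the reciprocal of egress/ingress. The traffic volumes are made-up numbers; only the roughly 12% improvement comes from the text.

```python
def egress_ingress_ratio(bytes_served, bytes_missed):
    """Bytes served to clients per byte retrieved because of a cache miss."""
    return bytes_served / bytes_missed

def byte_miss_ratio(bytes_served, bytes_missed):
    """Fraction of served bytes that had to be retrieved on a miss."""
    return bytes_missed / bytes_served

before = egress_ingress_ratio(1000.0, 250.0)  # hypothetical volumes, ratio 4.0
after = before * 1.12                         # the roughly 12% improvement
relative_miss_reduction = 1.0 - before / after
print(round(relative_miss_reduction, 3))      # 0.107
```

Under this simplified model, a 12% higher egress/ingress ratio corresponds to a byte miss ratio that is about 10.7% lower in relative terms.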

The figure below shows a visually compelling reduction in the byte miss ratio in the days following HALP’s final rollout on the YouTube CDN, which is now serving significantly more content from within the cache with lower latency to the end user, and without having to resort to more expensive retrieval that increases the operating costs.

Aggregate worldwide YouTube byte miss ratio before and after rollout (vertical dashed line).

An aggregated performance improvement could still hide important regressions. In addition to measuring overall impact, we also conduct a machine-level analysis in the paper to understand HALP’s impact on different racks, and find it to be overwhelmingly positive.

Conclusion

We introduced a scalable state-of-the-art cache eviction framework that is based on learned rewards and uses preference learning with automated feedback. Because of its design choices, HALP can be deployed in a manner similar to any other cache policy without the operational overhead of having to separately manage the labeled examples, training procedure and the model versions as additional offline pipelines common to most machine learning systems. Therefore, it incurs only a small extra overhead compared to other classical algorithms, but has the added benefit of being able to take advantage of additional features to make its eviction decisions and continuously adapt to changing access patterns.

This is the first large-scale deployment of a learned cache policy to a widely used and heavily trafficked CDN, and has significantly improved the CDN infrastructure efficiency while also delivering a better quality of experience to users.

Acknowledgements

Ramki Gummadi is now part of Google DeepMind. We would like to thank John Guilyard for help with the illustrations and Richard Schooler for feedback on this post.

Categories
Misc

Deep Learning Digs Deep: AI Unveils New Large-Scale Images in Peruvian Desert

Researchers at Yamagata University in Japan have harnessed AI to uncover four previously unseen geoglyphs (images on the ground, some as wide as 1,200 feet, made using the land’s elements) in Nazca, a seven-hour drive south of Lima, Peru. The geoglyphs (a humanoid, a pair of legs, a fish and a bird) …

Categories
Misc

Speech AI Spotlight: Visualizing Spoken Language and Sounds on AR Glasses

Audio can include a wide range of sounds, from human speech to non-speech sounds like barking dogs and sirens. When designing accessible applications for people with hearing difficulties, the application should be able to recognize sounds and understand speech.

Such technology would help deaf or hard-of-hearing individuals visualize both speech, such as human conversations, and non-speech sounds. By combining speech AI and sound AI, you can overlay the visualizations onto AR glasses, making it possible for users to see and interpret sounds that they wouldn’t be able to hear otherwise.

According to the World Health Organization, about 1.5B people (nearly 20% of the global population) live with hearing loss. This number could rise to 2.5B by 2050.

Cochl, an NVIDIA partner based in San Jose, is a deep-tech startup that uses sound AI technology to understand any type of audio. They are also a member of the NVIDIA Inception Program, which helps startups build their solutions faster by providing access to cutting-edge technology and NVIDIA experts.

The company’s Cochl.Sense platform can recognize 37 environmental sounds, and the company went one step further by adding cutting-edge speech-to-text technology. This gives a truly complete understanding of the world of sound.

AR glasses to visualize any sound

AR glasses have the potential to greatly improve the lives of people with hearing loss as an accessible tool to visualize sounds. This technology can help enhance their communication abilities and make it easier for them to navigate and participate in the world around them.

Video 1. Cochl.Sense and NVIDIA Riva working on Microsoft HoloLens 2!

In this scenario, automatic speech recognition (ASR) is used to enable the glasses to recognize and understand human speech. This technology can be integrated into the glasses in several ways:

  • A microphone captures the speech of a person talking to a deaf or hard-of-hearing individual, and ASR algorithms interpret and transcribe that speech into text. The text is then displayed on the glasses, enabling the deaf or hard-of-hearing person to read and understand the speech.
  • ASR enables the glasses to respond to voice commands so that users can control the glasses with their voice.
  • The glasses can display all conversations on the screen, for example transcribing voice directions from maps while you drive, along with other sounds such as horns, sirens from emergency vehicles, and wind noise.

The technology behind the solution

Cochl used NVIDIA Riva to power its ASR capabilities within its software stack. Riva is a GPU-accelerated, fully customizable SDK for developing speech AI applications. By using Riva, the platform has been able to expand its capabilities to understand a wide range of sounds, including non-speech sounds.

“We’ve tested lots of speech recognition services, but only Riva provided exceptionally high and stable real-time performance. So now we can make our sound AI system be closer to human auditory perception,” said Yoonchang Han, co-founder and CEO at Cochl.

“As we have observed, AR glasses are most likely to be used in open spaces with noisy environments. NVIDIA Riva has helped us transcribe speech accurately even in noisy environments and has given us a seamless experience to integrate into our Cochl.Sense platform.”

Future of assistive technology

Creating a generalized AI system that perceives sounds like humans is a huge challenge. To make AR glasses more accessible, lighter wearable technology is required.

However, at this point, they are still an ideal medium for translating sounds and speech to visual information. By integrating machine listening functionality, AR glasses can bring safer, more convenient, and more enjoyable daily life to deaf or hard-of-hearing people all around the world.

Cochl is also exploring more use cases for speech AI, such as offering closed captioning for any videos on AR glasses and visualizing multi-speaker transcriptions. To provide the best experience for individuals with hearing difficulties, the company is exploring ways to analyze and visualize music so that users can understand, at a minimum, its genre and emotion.

They are excited to experiment with more NVIDIA solutions including Riva, NVIDIA NeMo, and NVIDIA TensorRT.

Get started with speech AI today

Interested in adding speech AI to your VR applications? Browse these resources to get started:

Categories
Misc

Model Parallelism and Conversational AI Workshops

Join these upcoming workshops to learn how to train large neural networks, or build a conversational AI pipeline.

Categories
Offsites

SoundStorm: Efficient parallel audio generation

The recent progress in generative AI unlocked the possibility of creating new content in several different domains, including text, vision and audio. These models often rely on the fact that raw data is first converted to a compressed format as a sequence of tokens. In the case of audio, neural audio codecs (e.g., SoundStream or EnCodec) can efficiently compress waveforms to a compact representation, which can be inverted to reconstruct an approximation of the original audio signal. Such a representation consists of a sequence of discrete audio tokens, capturing the local properties of sounds (e.g., phonemes) and their temporal structure (e.g., prosody). By representing audio as a sequence of discrete tokens, audio generation can be performed with Transformer-based sequence-to-sequence models — this has unlocked rapid progress in speech continuation (e.g., with AudioLM), text-to-speech (e.g., with SPEAR-TTS), and general audio and music generation (e.g., AudioGen and MusicLM). Many generative audio models, including AudioLM, rely on auto-regressive decoding, which produces tokens one by one. While this method achieves high acoustic quality, inference (i.e., calculating an output) can be slow, especially when decoding long sequences.

To address this issue, in “SoundStorm: Efficient Parallel Audio Generation”, we propose a new method for efficient and high-quality audio generation. SoundStorm addresses the problem of generating long audio token sequences by relying on two novel elements: 1) an architecture adapted to the specific nature of audio tokens as produced by the SoundStream neural codec, and 2) a decoding scheme inspired by MaskGIT, a recently proposed method for image generation, which is tailored to operate on audio tokens. Compared to the autoregressive decoding approach of AudioLM, SoundStorm is able to generate tokens in parallel, thus decreasing the inference time by 100x for long sequences, and produces audio of the same quality and with higher consistency in voice and acoustic conditions. Moreover, we show that SoundStorm, coupled with the text-to-semantic modeling stage of SPEAR-TTS, can synthesize high-quality, natural dialogues, allowing one to control the spoken content (via transcripts), speaker voices (via short voice prompts) and speaker turns (via transcript annotations), as demonstrated by the examples below:

Input: Text (transcript used to drive the audio generation in bold)

  • Something really funny happened to me this morning. | Oh wow, what? | Well, uh I woke up as usual. | Uhhuh | Went downstairs to have uh breakfast. | Yeah | Started eating. Then uh 10 minutes later I realized it was the middle of the night. | Oh no way, that’s so funny!
  • I didn’t sleep well last night. | Oh, no. What happened? | I don’t know. I I just couldn’t seem to uh to fall asleep somehow, I kept tossing and turning all night. | That’s too bad. Maybe you should uh try going to bed earlier tonight or uh maybe you could try reading a book. | Yeah, thanks for the suggestions, I hope you’re right. | No problem. I I hope you get a good night’s sleep

Input: Audio prompt [audio sample]

Output: Audio prompt + generated audio [audio sample]

SoundStorm design

In our previous work on AudioLM, we showed that audio generation can be decomposed into two steps: 1) semantic modeling, which generates semantic tokens from either previous semantic tokens or a conditioning signal (e.g., a transcript as in SPEAR-TTS, or a text prompt as in MusicLM), and 2) acoustic modeling, which generates acoustic tokens from semantic tokens. With SoundStorm we specifically address this second, acoustic modeling step, replacing slower autoregressive decoding with faster parallel decoding.

SoundStorm relies on a bidirectional attention-based Conformer, a model architecture that combines a Transformer with convolutions to capture both local and global structure of a sequence of tokens. Specifically, the model is trained to predict audio tokens produced by SoundStream given a sequence of semantic tokens generated by AudioLM as input. When doing this, it is important to take into account the fact that, at each time step t, SoundStream uses up to Q tokens to represent the audio using a method known as residual vector quantization (RVQ), as illustrated below on the right. The key intuition is that the quality of the reconstructed audio progressively increases as the number of generated tokens at each step goes from 1 to Q.
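The residual structure of RVQ, and the reason more tokens per time step give a better reconstruction, can be illustrated with a toy sketch. This is not SoundStream: the codebooks here are random rather than learned, the dimensions are tiny, and a zero codeword is added to every level (an assumption made purely so the toy error can never increase from one level to the next).

```python
import random

def nearest(codebook, vector):
    """Index of the codeword closest to `vector` in squared Euclidean distance."""
    def dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vector, codebooks):
    """One token per level; each level quantizes the residual left by the previous ones."""
    residual, tokens = list(vector), []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens

def rvq_decode(tokens, codebooks):
    """Sum the chosen codewords of the first len(tokens) levels."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

rng = random.Random(0)
dim, Q = 4, 3
# Each level: a zero codeword plus 15 random codewords at a shrinking scale.
codebooks = [
    [[0.0] * dim] + [[rng.gauss(0, 0.5 ** q) for _ in range(dim)] for _ in range(15)]
    for q in range(Q)
]
x = [rng.gauss(0, 1) for _ in range(dim)]
tokens = rvq_encode(x, codebooks)
# Squared reconstruction error when decoding with the first 0, 1, ..., Q tokens.
errors = [
    sum((a - b) ** 2 for a, b in zip(x, rvq_decode(tokens[:q], codebooks)))
    for q in range(Q + 1)
]
print(tokens, errors)
```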

At inference time, given the semantic tokens as input conditioning signal, SoundStorm starts with all audio tokens masked out, and fills in the masked tokens over multiple iterations, starting from the coarse tokens at RVQ level q = 1 and proceeding level-by-level with finer tokens until reaching level q = Q.

There are two crucial aspects of SoundStorm that enable fast generation: 1) tokens are predicted in parallel during a single iteration within an RVQ level, and 2) the model architecture is designed in such a way that the complexity is only mildly affected by the number of levels Q. To support this inference scheme, during training a carefully designed masking scheme is used to mimic the iterative process used at inference.
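The control flow of this coarse-to-fine, parallel fill-in can be sketched with a stand-in model. Everything here is a simplification: `toy_model` returns random tokens and confidences in place of the Conformer, the even unmasking schedule is an assumption chosen for brevity (MaskGIT-style decoders use their own schedules), and the vocabulary size of 1024 is hypothetical. The point is only to show why the number of model calls depends on the number of levels and iterations rather than on the sequence length.

```python
import math
import random

MASK = None

def toy_model(tokens, level, rng):
    """Hypothetical stand-in for the Conformer: proposes a (token, confidence) pair
    for every time step at `level` in a single parallel call."""
    return [(rng.randrange(1024), rng.random()) for _ in tokens[level]]

def parallel_decode(num_steps, num_levels, iters_per_level, rng):
    tokens = [[MASK] * num_steps for _ in range(num_levels)]
    model_calls = 0
    for level in range(num_levels):  # coarse-to-fine over RVQ levels
        for it in range(iters_per_level):
            masked = [t for t in range(num_steps) if tokens[level][t] is MASK]
            if not masked:
                break
            proposals = toy_model(tokens, level, rng)
            model_calls += 1
            # Unmask an even share of the remaining positions, most confident
            # first; the final iteration fills everything that is left.
            keep = math.ceil(len(masked) / (iters_per_level - it))
            masked.sort(key=lambda t: proposals[t][1], reverse=True)
            for t in masked[:keep]:
                tokens[level][t] = proposals[t][0]
    return tokens, model_calls

tokens, calls = parallel_decode(num_steps=50, num_levels=4, iters_per_level=3,
                                rng=random.Random(0))
print(calls)  # 12 parallel calls, versus 50 * 4 = 200 token-by-token steps
```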

SoundStorm model architecture. T denotes the number of time steps and Q the number of RVQ levels used by SoundStream. The semantic tokens used as conditioning are time-aligned with the SoundStream frames.

Measuring SoundStorm performance

We demonstrate that SoundStorm matches the quality of AudioLM’s acoustic generator, replacing both AudioLM’s stage two (coarse acoustic model) and stage three (fine acoustic model). Furthermore, SoundStorm produces audio 100x faster than AudioLM’s hierarchical autoregressive acoustic generator (top half below) with matching quality and improved consistency in terms of speaker identity and acoustic conditions (bottom half below).

Runtimes of SoundStream decoding, SoundStorm and different stages of AudioLM on a TPU-v4.
Acoustic consistency between the prompt and the generated audio. The shaded area represents the inter-quartile range.

Safety and risk mitigation

We acknowledge that the audio samples produced by the model may be influenced by the unfair biases present in the training data, for instance in terms of represented accents and voice characteristics. In our generated samples, we demonstrate that we can reliably and responsibly control speaker characteristics via prompting, with the goal of avoiding unfair biases. A thorough analysis of any training data and its limitations is an area of future work in line with our responsible AI Principles.

In turn, the ability to mimic a voice can have numerous malicious applications, including bypassing biometric identification and using the model for the purpose of impersonation. Thus, it is crucial to put in place safeguards against potential misuse: to this end, we have verified that the audio generated by SoundStorm remains detectable by a dedicated classifier, the same one described in our original AudioLM paper. Hence, as a component of a larger system, we believe that SoundStorm would be unlikely to introduce risks beyond those discussed in our earlier papers on AudioLM and SPEAR-TTS. At the same time, relaxing the memory and computational requirements of AudioLM would make research in the domain of audio generation more accessible to a wider community. In the future, we plan to explore other approaches for detecting synthesized speech, e.g., with the help of audio watermarking, so that any potential product usage of this technology strictly follows our responsible AI Principles.

Conclusion

We have introduced SoundStorm, a model that can efficiently synthesize high-quality audio from discrete conditioning tokens. When compared to the acoustic generator of AudioLM, SoundStorm is two orders of magnitude faster and achieves higher temporal consistency when generating long audio samples. By combining a text-to-semantic token model similar to SPEAR-TTS with SoundStorm, we can scale text-to-speech synthesis to longer contexts and generate natural dialogues with multiple speaker turns, controlling both the voices of the speakers and the generated content. SoundStorm is not limited to generating speech. For example, MusicLM uses SoundStorm to synthesize longer outputs efficiently (as seen at I/O).

Acknowledgments

The work described here was authored by Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour and Marco Tagliasacchi. We are grateful for all discussions and feedback on this work that we received from our colleagues at Google.