NVIDIA Grace Hopper Superchip Sweeps MLPerf Inference Benchmarks

In its debut on the MLPerf industry benchmarks, the NVIDIA GH200 Grace Hopper Superchip ran all data center inference tests, extending the leading performance of NVIDIA H100 Tensor Core GPUs. The overall results showed the exceptional performance and versatility of the NVIDIA AI platform from the cloud to the network’s edge.

Creating Immersive Events with OpenUSD and Digital Twins

Moment Factory is a global multimedia entertainment studio that combines specializations in video, lighting, architecture, sound, software, and interactivity to create immersive experiences for audiences around the world. 

From live performances and multimedia shows to interactive installations, Moment Factory is known for some of the most awe-inspiring and entertaining experiences that bring people together in the real world. These include dazzling visuals at Billie Eilish’s Happier Than Ever world tour, Lumina Night Walks at natural sites around the world, and digital placemaking at the AT&T Discovery District.

With a team of over 400 professionals and offices in Montreal, Tokyo, Paris, New York City, and Singapore, Moment Factory has become a global leader in the entertainment industry.

Figure 1. Billie Eilish engaged Moment Factory to oversee creative direction, stage design, and content creation for her Happier Than Ever world tour

Streamlining immersive experience development with OpenUSD

Bringing these experiences to life requires large teams of highly skilled experts with diverse specialties, all using unique tools. To achieve optimal efficiency in their highly complex production processes, Moment Factory looked to implement an interoperable open data format and development platform that could seamlessly integrate all aspects, from concept to operation.

Moment Factory chose Universal Scene Description, also known as OpenUSD, as the solution. OpenUSD is an extensible framework and ecosystem for describing, composing, simulating, and collaborating within 3D worlds. NVIDIA Omniverse is a software platform that enables teams to develop OpenUSD-based 3D workflows and applications. It provides the unified environment to visualize and collaborate on digital twins in real time with live connections to Moment Factory’s tools.

Using OpenUSD with Omniverse enables Moment Factory to unify data from their diverse digital content creation (DCC) tools to form a digital twin of a real-world environment. Every member of the team can interact with this digital twin and iterate on their aspect of the project without affecting other elements.

For example, a scenographer can work on a base set and unique scene pieces using Vectorworks, 3D design software. At the same time in the same scene, an AV (audio visual) and lighting designer can take care of lighting and projectors with Moment Factory’s proprietary live entertainment operating system and virtual projection mapping software, X-Agora.

Simultaneously, artists and designers can render and create eye-catching visuals in the scene using tools like Epic Games Unreal Engine, Blender, and Adobe Photoshop—without affecting layers of the project still in progress.

“USD is unique in that it can be fragmented into smaller pieces that enable people to work on their own unique parts of a project while staying connected,” said Arnaud Grosjean, solution architect and project lead for Moment Factory’s Innovation Team. “Its flexibility and interoperability allows us to create powerful, custom 3D pipelines.”

Figure 2. USD scenes are composed of nondestructive layers such as venue, scenography, AV, and sensor data from diverse data sources

Digital twins simulate real-world experiences 

To simulate immersive events before deploying them in the real world, Moment Factory is developing digital twins of their installations in NVIDIA Omniverse, where they can visualize and collaborate on the digital twins in real time with live connections to their DCC tools.

The first digital twin they’ve created is that of Blackbox, which serves as an experimentation and prototyping space where they can preview fragments of immersive experiences before real-world deployment. It is a critical space for nearly every phase of the project lifecycle, from conception and design to integration and operation. 

To build the digital twin of the Blackbox, Moment Factory used USD Composer, a fully customizable foundation application built on NVIDIA Omniverse.

Moment Factory digital twin
Figure 3. Live video projection and lighting in the Blackbox is reflected in real time in the digital twin of the Blackbox, shown in the screen on the right

The virtual replica of the installation enables the team to run innumerable iterations on the project to test for various factors. They can also better sell concepts for immersive experiences to prospective customers, who can see the show before live production in a virtual environment.

One of the key challenges in the process for building large-scale immersive experiences is reaching a consensus among various stakeholders and managing changes. 

“Everyone has their own idea of how a scene should be structured, so we needed a way to align everyone contributing to the project in a unified, dynamic environment,” explained Grosjean. “With the digital twin, potential ideas can be tested and simulated with stakeholders across every core expertise.”

As CAD drafters, AV designers, interactive designers, and others contribute to the digital twin of the Blackbox, artists and 2D/3D designers can render and experiment with beauty shots of the immersive experience in action.

To see the digital twin of the Blackbox in action, join the Omniverse Livestream with Moment Factory on Wednesday, September 13.

Developing Omniverse Connectors and extensions

Moment Factory is continuously building and testing extensions for Omniverse to bring new functionalities and possibilities into their digital twins.

They developed an Omniverse Connector for X-Agora, their proprietary multi-display software that allows you to design, plan, and operate shows. The software now has a working implementation of a Nucleus connection, USD import/export, and an early live mode implementation.

Video projection is a key element of immersive events. The team will often experiment with mapping and projecting visual content onto architectural surfaces, scenic elements, and sometimes even moving objects, transforming static spaces into dynamic and captivating environments.

NDI, which stands for Network Device Interface, is a popular IP video protocol developed by NewTek that allows for efficient live video production and streaming across interconnected devices and systems. In their immersive experiences, Moment Factory typically connects a media system to physical projectors using video cables. With NDI, they can replicate this connection within a virtual venue, effectively simulating the entire experience digitally.

To enable seamless connectivity between the Omniverse RTX Renderer and their creative content, Moment Factory developed an NDI extension for Omniverse. The extension supports more than just video projection and allows the team to simulate LED walls, screens, and pixel fields to mirror their real-world setup in the digital twin.

The extension, which was developed with Omniverse Kit, also enables users to use video feeds as dynamic textures. Developers at Moment Factory used the kit-cv-video-example and kit-dynamic-texture-example to develop the extension.

Anyone can access and use Moment Factory’s Omniverse-NDI-extension on GitHub, and install it on the Omniverse Launcher or launch with:

$ ./link_app.bat --app create
$ ./app/omni.create.bat --/rtx/ecoMode/enabled=false --ext-folder exts --enable mf.ov.ndi

Extensions in Omniverse serve as reusable components or tools that developers can build to accelerate and add new functionalities for 3D workflows. They can be built for simple tasks like randomizing objects or used to enable more complex workflows like visual scripting.
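
For readers new to Kit extensions, the following is a minimal sketch of what an extension entry point looks like in Python; the class name, extension ID string, and messages are placeholders for illustration, not Moment Factory's code.

import omni.ext

# Minimal Kit extension skeleton (illustrative placeholder, not Moment Factory's code)
class ExampleExtension(omni.ext.IExt):
    def on_startup(self, ext_id):
        # Called when the extension is enabled in an Omniverse Kit app
        print(f"[example.extension] startup: {ext_id}")

    def on_shutdown(self):
        # Called when the extension is disabled or the app closes
        print("[example.extension] shutdown")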

The team also developed an extension for converting MPCDI, a VESA standard describing multiprojector rigs, to USD called the Omniverse-MPCDI-converter. They are currently testing extensions for MVR (My Virtual Rig) and GDTF (General Device Type Format) converters to import lighting fixtures and rigs into their digital twins.

Even more compelling is a lidar UDP simulator extension, which is being developed to enable sensor simulation in Omniverse and connect synthetic data to lidar-compatible software.

You can use Moment Factory’s NDI and MPCDI extensions today in your workflows. Stay tuned for new extensions coming soon.

To build extensions like Moment Factory does, get started with all the Omniverse Developer Resources you’ll need, like documentation, tutorials, USD resources, GitHub samples, and more.

Get started with NVIDIA Omniverse by downloading the standard license free, or learn how Omniverse Enterprise can connect your team.

Developers can check out these Omniverse resources to begin building on the platform. 

Stay up to date on the platform by subscribing to the newsletter and following NVIDIA Omniverse on Instagram, LinkedIn, Medium, Threads, and Twitter.

For more, check out our forums, Discord server, Twitch, and YouTube channels.

Accelerating Vector Search: Using GPU-Powered Indexes with RAPIDS RAFT

In the AI landscape of 2023, vector search is one of the hottest topics due to its applications in large language models (LLMs) and generative AI. Semantic vector search enables a broad range of important tasks like detecting fraudulent transactions, recommending products to users, using contextual information to augment full-text searches, and finding actors that pose potential security risks.

Data volumes continue to soar, and traditional methods for comparing items one by one have become computationally infeasible. Vector search methods use approximate lookups, which are more scalable and can handle massive amounts of data more efficiently. As we show in this post, accelerating vector search on the GPU provides not only faster search times but also substantially faster index build times.

This post provides:

  • An introduction to vector search with a brief review of popular applications
  • An overview of the RAFT library for accelerating vector search on the GPU
  • Performance comparison of GPU-accelerated vector search indexes against the state-of-the-art on the CPU

The second post in this series dives deeper into each of the GPU-accelerated indexes mentioned in this post and gives a brief explanation of how the algorithms work, along with a summary of important parameters to fine-tune their behavior. For more information, see Accelerating Vector Search: Fine-Tuning GPU Index Algorithms.

What is vector search?

Diagram shows a list of vectors that may have been encoded from sources like images, documents, or videos and a query vector for which you would like to find the closest vectors from the list.
Figure 1. Vector search process

Figure 1 shows that vector search entails creating an index of vectors and performing lookups to find some number of vectors in the index that are closest to a query vector. The vectors could be as small as three-dimensional points from a lidar point cloud or larger embeddings from text documents, images, or videos.

Vector search is the process of querying a database to find the most similar vectors. This similarity search is done on numerical vectors that can represent any type of object (Figure 2). These vectors are often embeddings created from multimedia like images, video, and text fragments or entire documents that went through a deep learning model to encode their semantic characteristics into a vector form.

Embedding vectors typically have the advantage of being a smaller object than the original document (lower dimensionality), while maintaining as much information about the source as possible. Therefore, two documents that are similar often have similar embeddings.

Image of a 3D point cloud such as one created from LIDAR.
Figure 2. Vectors represent data points in higher dimensions

The points in Figure 2 are 3D but they could be 500 dimensions or even higher.

This makes it easier to compare objects, as the embedding vectors are smaller and retain most of the information. When two documents share similar characteristics, their embedding vectors are often spatially close, or similar.

Approximate methods for vector search

To handle larger datasets efficiently, approximate nearest neighbor (ANN) methods are often used for vector search. ANN methods speed up the search by approximating the closest vectors. This avoids the exhaustive distance computation often required by an exact brute-force approach, which requires comparing the query against every single vector in the database.
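
To make the cost of exact search concrete, here is a minimal NumPy sketch of brute-force nearest-neighbor lookup over random placeholder data; it computes a distance from the query to every vector in the database, which is exactly the per-query work that ANN methods try to avoid.

import numpy as np

# Brute-force search compares the query against every vector: O(N * dim) work per query
rng = np.random.default_rng(0)
database = rng.random((100_000, 96), dtype=np.float32)   # N database vectors (placeholder)
query = rng.random((96,), dtype=np.float32)

dists = ((database - query) ** 2).sum(axis=1)    # squared Euclidean distance to every vector

k = 10
nearest = np.argpartition(dists, k)[:k]          # indices of the k closest vectors
nearest = nearest[np.argsort(dists[nearest])]    # sort those k by distance
print(nearest)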

In addition to the search compute cost, storing many vectors can also consume a large amount of memory. To ensure both fast searches and low memory usage, you must index vectors in an efficient way. As we outline a bit later, this can sometimes benefit from compression. A vector index is a space-efficient data structure built on mathematical models that is used for efficiently querying several vectors at a time.

Updating the indexes, such as from inserting and deleting vectors, can cause problems when indexes take hours or even days to build. It turns out that these indexes can often be built much faster on the GPU. We showcase this performance later in the post.

Vector search in LLMs

LLMs have become popular for capturing and preserving the semantic meaning and context of the original documents. This means that the vectors produced by LLMs can be searched using vector similarity search. This search finds items that happen to contain similar words, shapes, or moving objects. It also finds vectors that contextually and semantically mean similar things.

This semantic search doesn’t rely on exact word matching. For example, searching for the phrase “I would like to buy a muscle car” in an image database should be able to contextualize the sentence to understand the following:

  • Buying a car is different from renting a car, so you’d expect to find vectors closer to car dealerships and reviews from car purchasers, rather than car rental companies.
  • A muscle car is different from a bodybuilder so you’d expect to find vectors about Dodge Chargers and not Arnold Schwarzenegger.
  • Buying a muscle car is different from buying muscle relaxers or economy vehicles.

More recently, large transformer-based language models like ChatGPT, Llama, NeMo, and BERT have provided significant technical leaps that are increasing the contextual awareness of the models and making them even more useful and applicable to more industries.

In addition to creating embedding vectors that can be stored and later searched, these new LLM models use semantic search in pipelines that generate new content from context gleaned by finding similar vectors. This content generation process, shown in Figure 3, is known as retrieval-augmented generative AI.

Using vector search in a vector database

A vector database stores high-dimensional vectors (for example, embeddings), and facilitates fast and accurate search and retrieval based on vector similarity (for example, ANN algorithms). Some databases are purpose-built for vector search (for example, Milvus). Other databases include vector search capabilities as an additional feature (for example, Redis).

Choosing which vector database to use depends on the requirements of your workflow.

Retrieval-augmented language models allow pretrained models to be customized for specific products, services, or other domain-specific use cases by augmenting a search with additional context that has been encoded into vectors by the LLM and stored in a vector database.

More specifically, a search is encoded into vector form and similar vectors are found in the vector database to augment the search. The vectors are then used with the LLM to formulate an appropriate response. Retrieval-augmented LLMs are a form of generative AI and they have revolutionized the industry of chatbots and semantic text search.

Workflow diagram shows how vector search is often combined with LLMs to perform semantic search.
Figure 3. Example workflow of a text retrieval application using RAPIDS RAFT for vector search

Other applications of vector similarity search

In addition to retrieval-augmented LLMs for generative AI, vector embeddings have been around for some time and have found many useful applications in the real world:

  • Recommender systems: Provide personalized suggestions according to what a user has shown interest in or interacted with.
  • Finance: Fraud detection models vectorize user transactions, making it possible to determine whether those transactions are similar to typical fraudulent activities.
  • Cybersecurity: Uses embeddings to model and search behaviors of bad actors and anomalous activities.
  • Genomics: Finds similar genes and cell structures in genomics analysis, such as single-cell RNA analysis.
  • Chemistry: Models molecular descriptors or fingerprints of chemical structures to compare them or find similar structures in a database.

We are always interested in learning about your use cases so don’t hesitate to leave a comment if you either use vector search already or would like to discuss how it could benefit your application.

RAPIDS RAFT library for vector search

RAFT is a library of composable building blocks for accelerating machine learning algorithms on the GPU, such as those used in nearest neighbors and vector search. ANN algorithms are among the core building blocks that comprise vector search libraries. Most importantly, these algorithms can greatly benefit from GPU acceleration.

For more information about RAFT’s core APIs and the various accelerated building blocks that it contains, see Reusable Computational Patterns for Machine Learning and Data Analytics with RAPIDS RAFT.

ANN for fast searches

In addition to brute-force for exact search, RAFT currently provides three different algorithms for ANN search:

  • IVF-Flat
  • IVF-PQ
  • CAGRA

The choice of the algorithm can depend upon your needs, as they each offer different advantages. Sometimes, brute force can even be the better option. More are being added in upcoming releases.

Because these algorithms are not doing an exact search, it is possible that some highly similar vectors are missed. The recall metric can be used to represent how many neighbors in the results are actual nearest neighbors of the query. Most of our benchmarks target recall levels of 85% and higher, meaning 85% (or more) of the relevant vectors were retrieved.
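
As a concrete illustration of how recall is computed, the short sketch below compares ANN results against exact ground-truth neighbors; the two small arrays are placeholders for results you would get from an index and from a brute-force search.

import numpy as np

def recall_at_k(ann_neighbors, true_neighbors):
    # Fraction of the true k nearest neighbors that the ANN search returned,
    # averaged over all queries. Both arrays have shape (n_queries, k).
    hits = sum(len(set(a) & set(t)) for a, t in zip(ann_neighbors, true_neighbors))
    return hits / true_neighbors.size

ann = np.array([[1, 2, 3, 4, 5], [10, 11, 12, 13, 14]])     # ANN results (placeholder)
truth = np.array([[1, 2, 3, 9, 8], [10, 11, 12, 13, 14]])   # exact results (placeholder)
print(recall_at_k(ann, truth))  # (3 + 5) / 10 = 0.8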

To tune the resulting indexes for different levels of recall, use various settings, or hyperparameters, when training approximate nearest-neighbors algorithms. Reducing the recall score often increases the speed of your searches and increasing the recall decreases the speed. This is known as the recall-speed tradeoff.

For more information, see Accelerating Vector Search: Fine-Tuning GPU Index Algorithms.

Performance comparison

GPUs excel at processing a lot of data at one time. All the algorithms just mentioned can outperform corresponding algorithms on the CPU when computing the nearest neighbors for thousands or tens of thousands of points at a time.

However, CAGRA was specifically engineered with online search in mind, which means that it outperforms the CPU even when only querying the nearest neighbors for a few data points at a time.

Figure 4 and Figure 5 show benchmarks that we performed by building an index on 100M vectors and querying only 10 vectors at a time. In Figure 4, CAGRA outperforms HNSW, which is one of the most popular indexes for vector search on CPU, in raw search performance even for an extremely small batch size of 10 vectors. This speed comes at a memory cost, however. In Figure 5, you can see that CAGRA’s memory footprint is a bit higher than the other nearest neighbors methods.

Bar chart compares throughput performance (queries per second) for RAFT’s GPU algorithms against HNSW on the CPU.
Figure 4. Vector search throughput at 95% recall on DEEP-100M dataset, batch size of 10

In Figure 5, the host memory of IVF-PQ is for the optional refinement step.

Bar chart compares memory usage for RAFT’s GPU algorithms against HNSW on the CPU.
Figure 5. GPU memory usage

Figure 6 presents a comparison of the index build times and shows that indexes can often be built faster on the GPU.

Bar chart compares index build time for RAFT’s GPU algorithms against HNSW on the CPU.
Figure 6. Index build time for the best-performing index at 95% recall

Summary

From feature stores to generative AI, vector similarity search can be applied in every industry. Vector search on the GPU performs at lower latency and achieves higher throughput for every level of recall for both online and batch processing.

RAFT is a set of composable building blocks that can be used to accelerate vector search in any data source. It has pre-built APIs for Python and C++. Integration for RAFT is underway for Milvus, Redis, and FAISS. We encourage database providers to try RAFT and consider integrating it into their data sources.

In addition to state-of-the-art ANN algorithms, RAFT contains other GPU-accelerated building blocks, such as matrix and vector operations, iterative solvers, and clustering algorithms. The second post in this series dives deeper into each of the GPU-accelerated indexes mentioned in this post and gives a brief explanation of how the algorithms work, along with a summary of important parameters to fine-tune their behavior. For more information, see Accelerating Vector Search: Fine-Tuning GPU Index Algorithms.

RAPIDS RAFT is fully open source and available on the /rapidsai/raft GitHub repo. You can also follow us on Twitter at @rapidsai.

Accelerating Vector Search: Fine-Tuning GPU Index Algorithms

The first post in this series introduced vector search indexes, explained the role they play in enabling a widespread range of important applications, and provided a brief overview of vector search on the GPU with the RAFT library.

In this post, we dive deeper into each of the GPU-accelerated indexes mentioned in Part 1 and give a brief explanation of how the algorithms work, along with a summary of important parameters to fine-tune their behavior.

We then go through a simple end-to-end example to demonstrate RAFT’s Python APIs on a question-and-answer problem with a pretrained large language model and provide a performance comparison of RAFT’s algorithms against HNSW for a few different scenarios involving different numbers of query vectors being passed to the search algorithm concurrently.

This post provides:

  • An overview of vector search index algorithms that can be used with GPUs
  • An end-to-end example demonstrating how easy it can be to run vector search on the GPU with Python
  • Performance comparison of vector search on the GPU against HNSW, the current state-of-the-art method on the CPU

Vector search indexes

When working with vector search, the vectors are often converted to an indexed format that is optimized for fast lookups. Choosing the right indexing algorithm is important as it can affect both index build and search times. Furthermore, each different index type comes with its own set of knobs for fine-tuning the behavior, trading off index construction time, storage cost, search quality, and search speed.

When the right indexing algorithm is paired with the correct parameter settings, vector search on the GPU provides both faster build and search times for all levels of recall.

IVF-Flat

IVF-Flat is the simplest index type, so it is a good place to start. In this algorithm, the set of training vectors is first split into a number of clusters and then stored in GPU memory, organized by their closest cluster centers. The index-building step is faster than that of the other algorithms presented in this post, even at high numbers of clusters.

To search an IVF-Flat index, the closest clusters to each query vector are selected, and the k-nearest neighbors (k-NN) are computed from each of those closest clusters. Because IVF-Flat stores the vectors in an exact, or flat format, meaning without compression, it has the advantage of computing exact distances within each of the clusters it searches. As we describe later in this post, this provides an advantage that often has a higher recall than IVF-PQ when the same number of closest clusters are searched. IVF-Flat index is a good choice when the full index can fit in GPU memory.

RAFT’s IVF-Flat index contains a couple of parameters to help trade off the query performance and accuracy:

  • When training the index, the n_lists parameter determines the number of clusters to partition the training dataset.
  • The search parameter n_probes determines the number of closest clusters to search through to compute the nearest neighbors for a set of query points.

In general, a smaller number of probes leads to a faster search at the expense of recall. When the number of probes is set to the number of lists, exact results are computed. However, in that case, a call to RAFT’s brute-force search is more performant.
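
A minimal sketch of building and searching an IVF-Flat index with pylibraft is shown below, using random CuPy data; the n_lists and n_probes values are arbitrary starting points rather than tuned recommendations.

from pylibraft.neighbors import ivf_flat
import cupy as cp

# Random placeholder data; in practice these would be your embeddings
dataset = cp.random.random((100_000, 96), dtype=cp.float32)
queries = cp.random.random((1_000, 96), dtype=cp.float32)

# n_lists sets how many clusters the training vectors are partitioned into
index = ivf_flat.build(ivf_flat.IndexParams(n_lists=1024), dataset)

# n_probes sets how many of the closest clusters are searched for each query;
# the call returns the distances and indices of the 10 closest vectors
hits = ivf_flat.search(ivf_flat.SearchParams(n_probes=50), index, queries, k=10)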

IVF-PQ

When your dataset becomes too large to fit on the GPU, you gain some mileage by compressing the vectors using the IVF-PQ index type. Like IVF-Flat, IVF-PQ splits the points into a number of clusters (also specified by a parameter called n_lists) and searches the closest clusters to compute the nearest neighbors (also specified by a parameter called n_probes), but it shrinks the sizes of the vectors using a technique called product quantization.

Compressing the index ultimately allows for more vectors to be stored on the GPU. The amount of compression can be controlled with tuning parameters, which we describe later in this post, but higher levels of compression can provide a faster lookup time at the cost of recall. IVF-PQ is currently RAFT’s most memory-efficient vector index.

RAFT’s IVF-PQ provides two parameters that control memory usage:

  • pq_dim sets the target dimensionality of the compressed vector.
  • pq_bits sets the number of bits for each vector element after compression.

We recommend setting the former to a multiple of 32 while the latter is limited to a range of 4-8 bits. By default, RAFT selects a dimensionality value that minimizes quantization loss according to pq_bits, but this value can be adjusted to lower the memory footprint for each vector. It is useful to play with these parameters to see which settings work best for you.

When using large amounts of compression, an additional refinement step can be performed by querying the IVF-PQ index for a larger number of neighbors than needed and computing an exact search over the resulting neighbors to reduce the set down to the final desired number. The refinement step requires the original uncompressed dataset on the host memory.
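
An IVF-PQ sketch along the same lines, again with random CuPy data; the pq_dim and pq_bits values are illustrative, and the optional refinement step described above is omitted here.

from pylibraft.neighbors import ivf_pq
import cupy as cp

dataset = cp.random.random((1_000_000, 96), dtype=cp.float32)
queries = cp.random.random((1_000, 96), dtype=cp.float32)

# pq_dim and pq_bits control how aggressively each vector is compressed
index = ivf_pq.build(ivf_pq.IndexParams(n_lists=1024, pq_dim=32, pq_bits=8), dataset)

# As with IVF-Flat, n_probes trades recall for search speed
hits = ivf_pq.search(ivf_pq.SearchParams(n_probes=50), index, queries, k=10)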

For more information about building an IVF-PQ index, with in-depth details and recommendations, see the complete guide to RAFT IVF-PQ notebook on our GitHub repo.

CAGRA

CAGRA is RAFT’s new state-of-the-art ANN index. It is a high-performance, GPU-accelerated, graph-based method that has been specifically optimized for small-batch cases, where each lookup contains only one or a few query vectors. Like other popular graph-based methods, such as hierarchical navigable small-world graphs (HNSW) and SONG, an optimized k-NN graph is built at index training time with various qualities that yield efficient search at reasonable levels of recall.

CAGRA performs a search by first randomly selecting candidate vertices from the graph and then expanding, or traversing, those vertices to compute distances to their children, storing off the nearest neighbors along the way (Figure 1). Each time it traverses a set of vertices, it has performed one iteration.

Diagram shows how CAGRA can map subgraphs to separate thread blocks, enabling parallelism even for a single query.
Figure 1. CAGRA using multiple thread blocks

In Figure 1, CAGRA is using multiple thread blocks to visit more graph nodes in parallel. This is maximizing GPU utilization for single-query searches.

Because CAGRA returns the approximate nearest neighbors like the algorithms described earlier, it also provides a few parameters to control the recall and the speed.

The main parameter that can be adjusted to trade off search speed is itopk_size, which specifies the size of an internal sorted list that stores the nodes that can be explored in the next iteration. Higher values of itopk_size keep a larger search context in memory that improves recall at the cost of more time spent in maintaining the queue.

The parameter search_width defines the number of the closest parent vertices that are traversed to expand their children in each search iteration.

Another useful parameter is the number of iterations to perform. The setting is selected automatically by default, but this can be changed to a higher or lower value to trade off recall for a faster search.

CAGRA’s optimized graph is fixed-degree, which is tuned using the parameter graph_degree. The fixed-degree makes better use of GPU resources by keeping the number of computations uniform when searching the graph. It builds the initial k-NN graph by computing an actual k-NN, for example by using IVF-PQ explained earlier, to compute the nearest neighbors of all the points in the training dataset.

The number of k-nearest neighbors (k) of this intermediate k-NN graph can be tuned using a parameter called intermediate_graph_degree to trade off the quality of the final searchable CAGRA graph.

A higher quality graph can be built with a larger intermediate_graph_degree value, which means that the final optimized graph is more likely to find nearest neighbors that yield a high recall. RAFT provides several useful parameters to tune the CAGRA algorithm. For more information, see the CAGRA API documentation.

This parameter can also be used to control how thoroughly the overall space is covered by the search, but greater coverage comes at the cost of having to search more of the graph to find the nearest neighbors, which reduces search performance.
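
To make these search-time knobs concrete, here is a small sketch of a CAGRA build and search with the parameters discussed above set explicitly; the values are illustrative, and a fuller example follows in the next section.

from pylibraft.neighbors import cagra
import cupy as cp

dataset = cp.random.random((10_000, 96), dtype=cp.float32)
queries = cp.random.random((10, 96), dtype=cp.float32)

# graph_degree fixes the degree of the optimized search graph
index = cagra.build(cagra.IndexParams(graph_degree=32), dataset)

# Larger itopk_size and search_width generally raise recall but slow the search
search_params = cagra.SearchParams(
    itopk_size=128,   # size of the internal sorted candidate list
    search_width=4,   # parent vertices expanded per search iteration
)
hits = cagra.search(search_params, index, queries, k=10)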

Getting started with pylibraft

Pylibraft is the lightweight Python library of RAFT and enables you to use RAFT’s ANN algorithms for vector search right in Python. Pylibraft can accept any object that supports __cuda_array_interface__, such as a Torch or CuPy array.

The following example briefly demonstrates how you can build and query a RAFT CAGRA index with Pylibraft.

from pylibraft.neighbors import cagra
import cupy as cp

# On small batch sizes, using "multi_cta" algorithm is efficient
index_params = cagra.IndexParams(graph_degree=32)
search_params = cagra.SearchParams(algo="multi_cta")

corpus_embeddings = cp.random.random((1500,96), dtype=cp.float32)
query_embeddings = cp.random.random((1,96), dtype=cp.float32)

cagra_index = cagra.build(index_params, corpus_embeddings)
# Find the 10 closest vectors
hits = cagra.search(search_params, cagra_index, query_embeddings, k=10)

With the recent success of LLMs, semantic search is a perfect way to showcase vector similarity search in action using RAFT. In the following example, a DistilBERT transformer model combined with each of the three ANN indexes is used to solve a simple question retrieval problem. The Simple English Wikipedia dataset is used to answer the user’s search query.

The language model first transforms the training sentences into vector embeddings that are inserted into a RAFT ANN index. The inference is done by encoding the query and using our trained ANN index to find vectors similar to the encoded query vector. The answer that you return to the user is the nearest article in Simple Wikipedia, which you fetch using the closest vector from the similarity search.

You can get started with RAFT by using pylibraft and this notebook for a question-retrieval task.

Benchmarks

Using GPU as a hardware accelerator for your vector search application can lead to an increase in performance, and it is best showcased on large datasets. The benchmarks can be fully reproduced by following RAFT’s end-to-end benchmark documentation. Our benchmarks consider that the data is already available for computation, which means that data transfer is not taken into consideration, although this should not be a significant difference thanks to the high transfer speed of recent NVIDIA hardware (over 25 GB/s).

We used the DEEP-100M dataset on an H100 GPU to compare RAFT indexes with HNSW running on an Intel Xeon Platinum 8480CL CPU.

Bar chart compares the throughput of HNSW, the state-of-the-art on CPU, against RAFT’s ANN algorithms for a single query at a time at various levels of recall.
Figure 2. Comparing ANN algorithms at various levels of recall and throughput, batch of 1

Figure 2 compares ANN algorithms at various levels of recall and throughput for a single query. At high levels of recall, RAFT’s methods demonstrate higher throughput than other alternative libraries.

We ran a performance comparison on queries for a single vector at a time, called online search. It’s one of the main use cases for vector search. RAFT-based indexes provide a higher throughput, measured in queries-per-second (QPS), than other libraries that are using CPU or GPU.

Bar chart compares throughput for HNSW, the state-of-the-art on CPU, against RAFT’s ANN algorithms for a query of 10 vectors at a time.
Figure 3. Comparing ANN algorithms at various levels of recall and throughput, batch of 10

Figure 3 compares ANN algorithms at various levels of recall and throughput with a batch size of 10 queries. RAFT’s methods demonstrate higher throughput than HNSW for all experiments.

The benefits of using GPU for vector search applications are most prevalent at higher batch sizes. The performance gap between CPU and GPU is significant and can scale up easily. Figure 3 shows that for a batch size of 10, only RAFT-based indexes are relevant when comparing the number of queries per second. For a batch size of 10k (Figure 4), CAGRA outperforms all other indexes by far.

Bar chart compares throughput for HNSW, the state-of-the-art on CPU, against RAFT’s ANN algorithms for a query of 10k vectors at a time.
Figure 4. Comparing ANN algorithms at various levels of recall and throughput

Figure 4 compares ANN algorithms at various levels of recall and throughput with a batch size of 10K queries. RAFT’s methods demonstrate higher throughput than HNSW for all experiments.

Summary

Each different vector search index type has benefits and drawbacks which ultimately depend on your needs. This post outlined some of those benefits and drawbacks, providing a brief explanation of how each different algorithm works, along with a few of the most important parameters that can be tuned to trade off storage costs, build times, search quality, and search performance. In all cases, GPUs can improve both index construction and search performance.

RAPIDS RAFT is fully open source and available on the /rapidsai/raft GitHub repo. You can get started with RAFT by reading through the docs, running the reproducible benchmarking suite, or building upon the example vector search template project. Also be sure to look for options to enable RAFT indexes in Milvus, Redis, and FAISS. Finally, you can follow us on Twitter at @rapidsai.

Leading MLPerf Inference v3.1 Results with NVIDIA GH200 Grace Hopper Superchip Debut

AI is transforming computing, and inference is how the capabilities of AI are deployed in the world’s applications. Intelligent chatbots, image and video synthesis from simple text prompts, personalized content recommendations, and medical imaging are just a few examples of AI-powered applications.

Inference workloads are both computationally demanding and diverse, requiring that platforms be able to process many predictions on never-seen-before data quickly as well as run inference on a breadth of AI models. Organizations looking to deploy AI need a way to evaluate the performance of infrastructure objectively across a breadth of workloads, environments, and deployment scenarios. This is true for both AI training and inference.

MLPerf Inference v3.1, developed by the MLCommons consortium, is the latest edition of an industry-standard AI inference benchmark suite. It complements MLPerf Training and MLPerf HPC. MLPerf Inference v3.1 measures inference performance across a variety of important workloads, including image classification, object detection, natural language processing, speech recognition, and recommender systems, across common data center and edge deployment scenarios.

MLPerf Inference v3.1 includes two important updates to better reflect modern AI use cases:

  • The addition of a large language model (LLM) test based on GPT-J, an open-source, 6B-parameter LLM, to represent text summarization, a form of generative AI.
  • An updated DLRM test with a new model architecture and a substantially larger dataset that mirrors the DLRM update introduced in MLPerf Training v3.0. The update better reflects the scale and complexity of modern recommender systems.

Powered by the full NVIDIA AI Inference software stack, including the latest TensorRT 9.0, NVIDIA made submissions in MLPerf Inference v3.1 using a wide array of products. These included the debut submission of the NVIDIA GH200 Grace Hopper Superchip, which extended the great per-accelerator performance delivered by the NVIDIA H100 Tensor Core GPU. NVIDIA also submitted the NVIDIA L4 Tensor Core GPU for mainstream servers, as well as both the NVIDIA Jetson AGX Orin and Jetson Orin NX platforms for edge AI and robotics.  

The rest of this post provides highlights of the NVIDIA submissions as well as a peek into how these exceptional results were achieved.

Grace Hopper Superchip extends NVIDIA Hopper inference performance

The NVIDIA GH200 Grace Hopper Superchip combines the NVIDIA Hopper GPU and the NVIDIA Grace CPU through coherent NVLink-C2C at 900 GB/s to create a single superchip. That’s 7x the bandwidth of PCIe Gen5, at 5x lower power. It also incorporates up to 576 GB of fast-access memory through the combination of 96 GB of HBM3 GPU memory and up to 480 GB of low-power, high-bandwidth LPDDR5X memory.

The GH200 Grace Hopper Superchip has integrated power management features that enable the GH200 to take advantage of the energy efficiency of the Grace CPU to balance efficiency and performance. For more information, see NVIDIA Grace Hopper Superchip Architecture In-Depth and the NVIDIA Grace Hopper Superchip Architecture whitepaper.

Diagram shows the GH200 with 96 GB HBM3 was used for MLPerf Inference v3.1 submission.
Figure 1. Logical overview of the NVIDIA GH200 Grace Hopper Superchip

The NVIDIA GH200 Grace Hopper Superchip is designed for the versatility required to deliver leading performance across compute and memory-intensive workloads. It also delivers substantially higher performance on the most demanding frontier workloads, such as large transformer-based models with hundreds of billions or trillions of parameters, recommender systems with multi-terabyte embedding tables, and vector databases.

In addition to being built for the most intensive AI workloads, the GH200 Grace Hopper Superchip also shines on the popular, mainstream workloads tested by MLPerf Inference. It ran every test, demonstrating its seamless support for the full NVIDIA software stack. It extended the exceptional performance achieved by NVIDIA’s single H100 SXM submission on every workload.

Bar chart shows that NVIDIA Grace Hopper delivered up to 17% better performance than H100 SXM with the help of larger memory capacity, wider memory bandwidth, and sustaining higher GPU clock frequency.
Figure 2. NVIDIA Grace Hopper MLPerf Inference data center performance results compared to DGX H100 SXM

MLPerf Inference: Datacenter v3.1, Closed. Submission IDs: NVIDIA 3.1-0107(1xH100 SXM), 3.1-0110(1xGH200 Grace Hopper Superchip)
The MLPerf name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. For more information, see www.mlcommons.org.

The GH200 Grace Hopper Superchip incorporates 96 GB of HBM3 and provides up to 4 TB/s of HBM3 memory bandwidth, compared to 80 GB and 3.35 TB/s for the H100 SXM, respectively. This larger memory capacity, as well as greater memory bandwidth, enabled the use of larger batch sizes for workloads on the NVIDIA GH200 Grace Hopper Superchip compared to the NVIDIA H100 SXM. For example, both RetinaNet and DLRMv2 ran with up to double the batch sizes in the Server scenario and 50% greater batch sizes in the Offline scenario.

The GH200 Grace Hopper Superchip’s high-bandwidth NVLink-C2C link between the NVIDIA Hopper GPU and the Grace CPU enables fast communication between the CPU and GPU, which can help boost performance.

For example, in the MLPerf DLRMv2 workload, transferring a batch of tensors over PCIe takes approximately 22% of the batch inference time on H100 SXM. The GH200 Grace Hopper Superchip, however, performed the same transfer using just 3% of the inference time as a result of NVLink-C2C.

Thanks to higher memory bandwidth and larger memory capacity, the Grace Hopper Superchip delivered up to a 17% per-chip performance advantage over the H100 SXM GPU on MLPerf Inference v3.1 workloads. These results showcase the performance and versatility of both the GH200 Grace Hopper Superchip and the NVIDIA software stack.

Optimizing GPT-J 6B for LLM inference

To represent LLM inference workloads, MLPerf Inference v3.1 introduces a new test based on the GPT-J 6B model: an LLM with 6B parameters. The task tested by the new benchmark is text summarization using the CNN/DailyMail dataset.

The NVIDIA platform delivered strong results on the GPT-J workload, with the GH200 Grace Hopper Superchip delivering the highest per-accelerator performance in both the Offline and Server scenarios. The NVIDIA L4 GPU also delivered strong performance, outpacing the best CPU-only result by up to 6x in a single-slot PCIe card with a thermal design power (TDP) of just 72 watts.

To achieve these results, NVIDIA software for LLM inference intelligently applies both FP8 and FP16 precisions to increase performance while also meeting target accuracy requirements.

A key challenge for performing GPT-J inference is the high memory consumption of the key-value (KV) cache in the transformer block. By storing the KV cache in the FP8 data format, the NVIDIA submission significantly increased the batch size used. This boosted GPU memory utilization and enabled better use of the immense compute performance of NVIDIA GPUs.
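
To see why the KV cache dominates memory, here is a rough back-of-the-envelope sizing for GPT-J 6B (28 transformer layers, hidden size 4096); the sequence length and batch size are arbitrary example values, not the submission's settings.

# Rough KV-cache sizing for GPT-J 6B; seq_len and batch are example values only
n_layers, hidden = 28, 4096
seq_len, batch = 2048, 64

elems = 2 * n_layers * hidden * seq_len * batch   # 2 = one key and one value per token
fp16_gb = elems * 2 / 1e9                          # 2 bytes per FP16 element
fp8_gb = elems * 1 / 1e9                           # 1 byte per FP8 element
print(f"KV cache: {fp16_gb:.0f} GB in FP16 vs {fp8_gb:.0f} GB in FP8")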

Diagram shows the architecture of the GPT-J model, including input, output, and internal mechanism.
Figure 3. GPT-J architecture

Enabling DLRM-DCNv2 submissions

MLPerf Inference v3.1 introduced an update to the DLRMv1 model used in prior versions of the benchmark. This DLRMv2 model replaces the interactions layer with a three-layer DCNv2 cross network. DLRMv2 also uses multi-hot categorical inputs rather than one-hot, which are synthetically generated from the Criteo Terabyte Click Logs Dataset.

One of the challenges of recommender inference arises from fitting the embedding tables on the system. By converting the model to FP16 precision, including the embedding table, we could both improve performance and halve the memory footprint of the embedding table, reducing it to 49 GB. This enables the entire embedding table to fit within a single H100 GPU. 

To enable our submission on the L4 GPU, which has 24 GB of memory, NVIDIA software intelligently splits the embedding table between GPU and host memory using row-frequency data obtained by analyzing the training dataset. Using this data, NVIDIA software can minimize memory transfers between the host CPU and GPU by storing the most frequently used embedding table rows on the GPU.
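
The idea behind frequency-based placement can be illustrated with a small conceptual sketch (not the NVIDIA implementation): rows that were accessed most often in the training data are pinned in GPU memory, and the long tail of rarely used rows stays in host memory.

import numpy as np

# Conceptual sketch of frequency-based embedding placement (not NVIDIA's implementation).
# row_freq would come from analyzing the training dataset; gpu_budget_rows from GPU memory.
n_rows, gpu_budget_rows = 1_000_000, 200_000
rng = np.random.default_rng(0)
row_freq = rng.zipf(1.2, n_rows)            # skewed access counts, typical of recommenders

hottest_first = np.argsort(row_freq)[::-1]  # most frequently accessed rows first
gpu_rows = set(hottest_first[:gpu_budget_rows].tolist())

def lookup_location(row_id):
    # Frequently used rows live on the GPU; the rest are fetched from host memory
    return "gpu" if row_id in gpu_rows else "host"

print(lookup_location(int(hottest_first[0])), lookup_location(int(hottest_first[-1])))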

The NVIDIA platform demonstrated exceptional results on DLRMv2, with GH200 showing up to a 17% increase compared to the great performance delivered by H100 SXM. 

Maximizing parallelism on NVIDIA Jetson Orin with Programmable Vision Accelerator

The Jetson AGX Orin series and Jetson Orin NX series are embedded modules for edge AI and robotics, based on the NVIDIA Orin system-on-chip (SoC). To deliver exceptional AI performance and efficiency across a range of use cases, Jetson Orin incorporates multiple compute engines beyond the GPU, including deep learning accelerators (DLAs) and a programmable vision accelerator (PVA).

These accelerators can be used to offload the GPU and enable additional AI inference performance on the Jetson Orin modules.

NVDLA is a fixed-function accelerator optimized for deep learning operations and is designed to do full hardware acceleration of convolutional neural network inferencing. 

Diagram of the NVIDIA Orin SoC shows the individual blocks, including CPU, GPU, dedicated accelerators, cache, and memory interface.
Figure 4. NVIDIA Orin system-on-chip

For the first time in MLPerf Inference v3.1, we demonstrate the concurrent use of the PVA alongside GPU and DLA for inference. The second-generation PVA provides dedicated hardware for various computer vision kernels such as filtering, warping, and fast Fourier transforms (FFT). It also supports advanced programmed kernels, which can serve as the backend runtime of TensorRT custom plug-ins.

With the 23.08 Jetson CUDA-X AI Developer Preview, we’ve included a sample PVA SDK. This package provides runtime support for a non-maximum suppression (NMS) layer. It demonstrates that the PVA can serve as a highly capable accelerator, complementing the powerful Jetson Orin GPU.

NVIDIA has developed a TensorRT custom NMS PVA plug-in as a reference for Jetson Orin users and it was included as part of the NVIDIA MLPerf Inference v3.1 submission.

In the NVIDIA MLPerf Inference v3.0 RetinaNet submission on NVIDIA Orin platforms, the GPU handled all outputs from the ResNext + FPN backbone, both those produced on the GPU itself and those produced on the two DLAs.

Diagram shows how inference queries are sent to the GPU and DLAs, and then the GPU handles the outputs from the DLAs.
Figure 5. GPU responsible for GPU and DLA outputs in MLPerf Inference v3.0

Figure 5 shows how, in MLPerf Inference v3.0 submissions, the GPU was responsible for outputs from the ResNext+FPN backbone from both the GPU and the DLAs.

By using the NMS PVA plug-in, the NMS operator is now offloaded from GPU to PVA, enabling three fully parallel inference flows on Jetson Orin AGX and Jetson Orin NX. The output from the ResNext and the FPN backbone running on the two DLAs is now consumed by the two PVAs running the NMS PVA plug-in inside the end-to-end RetinaNet TensorRT engine.

Diagram shows how inference queries are sent to the GPU and DLAs, the PVAs consume the outputs from the DLAs, and then the GPU and two PVAs create output.
Figure 6. Fully parallel computations in MLPerf Inference v3.1

In Figure 6, the NVIDIA MLPerf Inference v3.1 submission enables computations to run fully in parallel through optimized use of Jetson Orin PVAs.

This careful use of PVA along with GPU and DLA boosts performance by 30% on both the Jetson AGX Orin 64GB and the Jetson Orin NX 16GB modules. When this use of PVA is coupled with a newly optimized NMS Opt GPU plug-in, Jetson AGX Orin delivers 61% higher performance and 38% better power efficiency on the RetinaNet workload. The Jetson Orin NX 16GB showed an even larger gain, with an 84% performance boost on the same test.

Algorithmic optimizations further improve BERT performance

In MLPerf Inference v3.1, NVIDIA made a submission on the BERT Large workload using the L4 GPU in the open division using techniques developed by the OmniML team. OmniML is a startup acquired by NVIDIA in early 2023 that brought expertise in machine learning algorithmic model optimization for use cases spanning cloud platforms to edge devices.

The open division submission on BERT applied structured pruning with distillation techniques to improve performance by up to 4.7x while maintaining 99% accuracy. This submission demonstrates the potential of algorithmic optimizations to significantly enhance the already exceptional performance of the NVIDIA platform.

NVIDIA deployed a proprietary, automatic, structured pruning tool that uses a gradient-based sensitivity analysis to prune the model to the given target FLOPs and fine-tune it with distillation to recover most of the accuracy. The number of transformer layers, attention heads, and linear layer dimensions were pruned in all the transformer layers in the model while the embedding dimension was kept unchanged.

Compared to the original MLPerf Inference BERT INT8 model, our pruned model reduced the number of parameters by 4x and the number of FLOPs by 5.6x. This model has a varying number of heads and linear layer dimensions in each layer. The resulting TensorRT engine built from the pruned model is 3.4x smaller, 177 MB compared to 607 MB.

The fine-tuned model is quantized to INT8 precision using the same technique employed in the NVIDIA closed division submission. The submission also employed distillation during quantization-aware training (QAT) to achieve an accuracy that is 99% or higher.

Scenario                          Closed Division    Open Division    Speedup
Offline (samples/sec)             1,029              4,609            4.5x
Server (samples/sec)              899                4,265            4.7x
Single Stream p90 latency (ms)    2.58               0.82             3.1x
Table 1. BERT Large performance metrics for both closed division and open division

To better understand how each of the model optimizations affects performance, NVIDIA performed a stacking analysis and applied different model optimization methods individually (Figure 7).

Diagram stacks quantization performance on optimization and pruning (Closed Division) and then on distillation (Open Division). The accuracy baseline is the FP32 model (not listed).
Figure 7. Stacking performance analysis

Figure 7 shows that, through model pruning and distillation, the NVIDIA open division submission on the BERT workload using L4 provides a 4.5x speedup compared to the same GPU running the closed division workload in the offline scenario.

Each model optimization method applied can be easily integrated with each other. Together, they yielded a substantial performance improvement compared to the baseline model.

NVIDIA accelerated computing boosts performance for inference and AI training workloads

In its MLPerf debut, the GH200 Grace Hopper Superchip turned in exceptional performance on all workloads and scenarios in the closed division of the data center category, boosting performance by up to 17% over the NVIDIA single-chip H100 SXM submission. The NVIDIA software stack fully supports the GH200 Grace Hopper Superchip today. 

For mainstream servers, the L4 GPU delivered a large performance leap over CPU-only offerings in a compact, low-power PCIe add-in card.

For edge AI and robotics applications, the Jetson AGX Orin and Jetson Orin NX modules achieved great performance. Software optimizations helped to further unlock the potential of the powerful NVIDIA Orin SoC that powers those modules. It boosted performance on RetinaNet, a popular AI network for object detection, by up to 84%.

In this round, NVIDIA also submitted results in the open division, providing a first look at the potential for model optimizations to speed inference performance dramatically while still achieving excellent accuracy.

The latest MLPerf Inference v3.1 benchmarks show that the NVIDIA accelerated computing platform continues to deliver leadership performance and versatility. There’s innovation at every layer of the technology stack, from cloud to edge, at the speed of light.

Webinar: Boost Your AI Development with ClearML and NVIDIA TAO

On Sept. 19, learn how NVIDIA TAO integrates with the ClearML platform to deploy and maintain machine learning models in production environments.

NVIDIA Partners with India Giants to Advance AI in World’s Most Populous Nation

The world’s largest democracy is poised to transform itself and the world, embracing AI on an enormous scale. Speaking with the press Friday in Bengaluru, in the context of announcements from two of India’s largest conglomerates, Reliance Industries Limited and Tata Group, NVIDIA founder and CEO Jensen Huang detailed plans to bring AI technology and…

NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs

Large language models offer incredible new capabilities, expanding the frontier of what is possible with AI. But their large size and unique execution characteristics can make them difficult to use in cost-effective ways. 

NVIDIA has been working closely with leading companies, including Meta, Anyscale, Cohere, Deci, Grammarly, Mistral AI, MosaicML (now part of Databricks), OctoML, Tabnine, and Together AI, to accelerate and optimize LLM inference. 

Those innovations have been integrated into the open-source NVIDIA TensorRT-LLM software, set for release in the coming weeks. TensorRT-LLM consists of the TensorRT deep learning compiler and includes optimized kernels, pre- and post-processing steps, and multi-GPU/multi-node communication primitives for groundbreaking performance on NVIDIA GPUs. It enables developers to experiment with new LLMs, offering peak performance and quick customization capabilities, without requiring deep knowledge of C++ or NVIDIA CUDA.

TensorRT-LLM improves ease of use and extensibility through an open-source modular Python API for defining, optimizing, and executing new architectures and enhancements as LLMs evolve, and can be customized easily.  

For example, MosaicML has seamlessly added the specific features it needs on top of TensorRT-LLM and integrated them into its inference serving. Naveen Rao, vice president of engineering at Databricks, notes that “it has been an absolute breeze.”

“TensorRT-LLM is easy to use, feature-packed with streaming of tokens, in-flight batching, paged-attention, quantization, and more, and is efficient,” Rao said. “It delivers state-of-the-art performance for LLM serving using NVIDIA GPUs and allows us to pass on the cost savings to our customers.”

Performance comparison

Summarizing articles is just one of the many applications of LLMs. The following benchmarks show performance improvements brought by TensorRT-LLM on the latest NVIDIA Hopper architecture. 

The following figures reflect article summarization using an NVIDIA A100 and NVIDIA H100 with CNN/Daily Mail, a well-known dataset for evaluating summarization performance.  

In Figure 1, H100 alone is 4x faster than A100. Adding TensorRT-LLM and its benefits, including in-flight batching, results in an 8x total increase, delivering the highest throughput. 

GPT-J performance comparison between A100 and H100 with and without TensorRT-LLM.
Figure 1. GPT-J-6B: A100 compared to H100 with and without TensorRT-LLM

Text summarization, variable I/O length, CNN / DailyMail dataset | A100 FP16 PyTorch eager mode | H100 FP8 | H100 FP8, in-flight batching, TensorRT-LLM

On Llama 2—a popular language model released recently by Meta and used widely by organizations looking to incorporate generative AI—TensorRT-LLM can accelerate inference performance by 4.6x compared to A100 GPUs.

Llama 2 70B performance comparison between A100 and H100 with and without TensorRT-LLM.
Figure 2. Llama 2 70B, A100 compared to H100 with and without TensorRT-LLM

Text summarization, variable I/O length, CNN / DailyMail dataset | A100 FP16 PyTorch eager mode| H100 FP8 | H100 FP8, in-flight batching, TensorRT-LLM

LLM ecosystem explosion

The ecosystem is innovating rapidly, developing new and diverse model architectures. Larger models unleash new capabilities and use cases. Some of the largest, most advanced language models, like Meta’s 70-billion-parameter Llama 2, require multiple GPUs working in concert to deliver responses in real time. Previously, developers looking to achieve the best performance for LLM inference had to rewrite and manually split the AI model into fragments and coordinate execution across GPUs.

TensorRT-LLM uses tensor parallelism, a type of model parallelism in which individual weight matrices are split across devices. This enables efficient inference at scale, with each model running in parallel across multiple GPUs connected through NVLink and across multiple servers, without developer intervention or model changes.
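
To make the idea concrete, here is a minimal NumPy sketch of tensor parallelism for a single linear layer: the weight matrix is split column-wise across two hypothetical devices, each computes a partial result, and concatenating the shards reproduces the unsplit layer. It illustrates the arithmetic only; a real runtime overlaps these steps with NVLink communication.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # activations: batch of 4, hidden size 8
W = rng.standard_normal((8, 16))       # full weight matrix of one linear layer

# Tensor parallelism: split the weight matrix column-wise across two "devices".
W_dev0, W_dev1 = np.split(W, 2, axis=1)

# Each device multiplies the same activations by its shard of the weights.
y_dev0 = x @ W_dev0
y_dev1 = x @ W_dev1

# Gathering the shards reproduces the result of the unsplit layer.
y_parallel = np.concatenate([y_dev0, y_dev1], axis=1)
assert np.allclose(y_parallel, x @ W)
```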

As new models and model architectures are introduced, developers can optimize their models with the latest NVIDIA AI kernels available open source in TensorRT-LLM. The supported kernel fusions include cutting-edge implementations of FlashAttention and masked multi-head attention for the context and generation phases of GPT model execution, along with many others.
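
For reference, the computation those fused attention kernels implement is standard scaled dot-product attention with a causal mask. A minimal single-head NumPy version, purely illustrative and nothing like the fused kernels themselves, looks like this:

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (seq, seq) similarity scores
    mask = np.triu(np.ones_like(scores, dtype=bool), 1)
    scores = np.where(mask, -np.inf, scores)           # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
    return weights @ v

rng = np.random.default_rng(0)
seq, d = 5, 8
q, k, v = (rng.standard_normal((seq, d)) for _ in range(3))
print(causal_attention(q, k, v).shape)  # (5, 8)
```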

Additionally, TensorRT-LLM includes fully optimized, ready-to-run versions of many LLMs widely used in production today. This includes Meta Llama 2, OpenAI GPT-2 and GPT-3, Falcon, Mosaic MPT, BLOOM, and a dozen others, all of which can be implemented with the simple-to-use TensorRT-LLM Python API.

These capabilities help developers create customized LLMs faster and more accurately to meet the needs of virtually any industry.

In-flight batching

Today’s large language models are extremely versatile. A single model can be used simultaneously for a variety of tasks that look very different from one another. From a simple question-and-answer response in a chatbot to the summarization of a document or the generation of a long chunk of code, workloads are highly dynamic, with outputs varying in size by several orders of magnitude. 

This versatility can make it difficult to batch requests and execute them in parallel effectively, a common optimization for serving neural networks, because some requests in a batch can finish much earlier than others.

To manage these dynamic loads, TensorRT-LLM includes an optimized scheduling technique called in-flight batching. This takes advantage of the fact that the overall text generation process for an LLM can be broken down into multiple iterations of execution on the model. 

With in-flight batching, rather than waiting for the whole batch to finish before moving on to the next set of requests, the TensorRT-LLM runtime immediately evicts finished sequences from the batch. It then begins executing new requests while other requests are still in flight. In-flight batching and the additional kernel-level optimizations improve GPU utilization and at least double the throughput on a benchmark of real-world LLM requests on H100 Tensor Core GPUs, helping to reduce TCO.
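
The scheduling idea can be sketched in a few lines of plain Python: at every model iteration, finished sequences leave the batch immediately and queued requests take their slots. This illustrates only the policy; the actual TensorRT-LLM runtime implements it inside its batch manager and kernels.

```python
import random
from collections import deque

random.seed(0)

# Each request needs a different number of decoding iterations (output lengths vary widely).
queue = deque({"id": i, "steps_left": random.randint(1, 8)} for i in range(10))
MAX_BATCH = 4
batch, completed, iteration = [], [], 0

while queue or batch:
    # Admit new requests into any free slots instead of waiting for the batch to drain.
    while queue and len(batch) < MAX_BATCH:
        batch.append(queue.popleft())

    # One model iteration: every in-flight request generates one token.
    for req in batch:
        req["steps_left"] -= 1
    iteration += 1

    # Evict finished sequences immediately so their slots can be reused next iteration.
    finished = [r for r in batch if r["steps_left"] == 0]
    completed.extend(r["id"] for r in finished)
    batch = [r for r in batch if r["steps_left"] > 0]

print(f"served {len(completed)} requests in {iteration} iterations")
```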

H100 Transformer Engine with FP8

LLMs contain billions of model weights and activations, typically trained and represented with 16-bit floating point (FP16 or BF16) values where each value occupies 16 bits of memory. At inference time, however, most models can be effectively represented at lower precision, like 8-bit or even 4-bit integers (INT8 or INT4), using modern quantization techniques. 

Quantization is the process of reducing the precision of a model’s weights and activations with minimal loss of accuracy. Using lower precision means that each parameter is smaller, and the model takes up less space in GPU memory. This enables inference on larger models with the same hardware while spending less time on memory operations during execution. 
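
As a simple illustration of the memory savings, the following sketch applies symmetric per-tensor INT8 quantization to a weight matrix in NumPy (INT8 is used here only because plain NumPy has no FP8 dtype):

```python
import numpy as np

rng = np.random.default_rng(0)
weights_fp16 = rng.standard_normal((4096, 4096)).astype(np.float16)

# Symmetric per-tensor quantization: map the weight range onto signed 8-bit integers.
scale = np.abs(weights_fp16).max() / 127.0
weights_int8 = np.round(weights_fp16 / scale).astype(np.int8)

# Dequantize to check the approximation error introduced by the lower precision.
dequantized = weights_int8.astype(np.float16) * scale
max_error = np.abs(weights_fp16 - dequantized).max()

print(f"FP16 size: {weights_fp16.nbytes / 2**20:.1f} MiB")   # 32.0 MiB
print(f"INT8 size: {weights_int8.nbytes / 2**20:.1f} MiB")   # 16.0 MiB
print(f"max absolute error: {max_error:.4f}")
```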

NVIDIA H100 GPUs with TensorRT-LLM give users the ability to convert their model weights into a new FP8 format easily and compile their models to take advantage of optimized FP8 kernels automatically. This is made possible through Hopper Transformer Engine technology and done without having to change any model code. 

The FP8 data format introduced by the H100 enables developers to quantize their models and radically reduce memory consumption without degrading model accuracy. FP8 quantization retains higher accuracy compared to other data formats like INT8 or INT4 while achieving the fastest performance and offering the simplest implementation.
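
A rough sketch of the per-tensor scaling involved: the E4M3 variant of FP8 has a maximum finite magnitude of 448, so a scaling factor derived from the tensor’s absolute maximum maps values into that range. The actual cast to FP8 happens in the Transformer Engine hardware; this sketch only computes the scale.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in the E4M3 format

def fp8_scale(tensor: np.ndarray) -> float:
    """Per-tensor scaling factor that maps the tensor into the FP8 E4M3 range."""
    amax = float(np.abs(tensor).max())
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0

rng = np.random.default_rng(0)
activations = rng.standard_normal((1024, 1024)).astype(np.float16) * 30.0

scale = fp8_scale(activations)
scaled = activations / scale               # now bounded by the FP8 dynamic range
print(scale, float(np.abs(scaled).max()))  # max magnitude is ~448 after scaling
```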

Summary

LLMs are advancing rapidly. Diverse model architectures are being developed daily and contribute to a growing ecosystem. In turn, larger models unleash new capabilities and use cases, driving adoption across all industries.

LLM inference is reshaping the data center. Higher performance with increased accuracy yields better TCO for enterprises. Model innovations enable better customer experiences, translating into higher revenue and earnings.

When planning inference deployment projects, there are still many other considerations to achieve peak performance using state-of-the-art LLMs. Optimization rarely happens automatically. Users must consider and tune factors such as parallelism, end-to-end pipelines, and advanced scheduling techniques. And they require a computing platform that can handle mixed precision without diminishing accuracy.

TensorRT-LLM comprises the TensorRT deep learning compiler, optimized kernels, pre- and post-processing, and multi-GPU/multi-node communication in a simple, open-source Python API for defining, optimizing, and executing LLMs for inference in production. 

Get started with TensorRT-LLM

NVIDIA TensorRT-LLM is now available in early access and soon will be integrated into the NVIDIA NeMo framework—part of NVIDIA AI Enterprise, an enterprise-grade AI software platform with security, stability, manageability, and support. Developers and researchers will be able to access TensorRT-LLM through the NeMo framework on NGC or through the source repository on GitHub. 

Note that you must be registered in the NVIDIA Developer Program to apply for the early access release. You must also be logged in using your organization’s email address. We cannot accept applications from accounts using Gmail, Yahoo, QQ, or other personal email accounts.

To participate, fill out the short application form and provide details about your use case.

Categories
Misc

Workshop: Fundamentals of Deep Learning

Learn key techniques and tools required to train a deep learning model in this virtual hands-on workshop.

Categories
Misc

Tata Partners With NVIDIA to Build Large-Scale AI Infrastructure

NVIDIA today announced an extensive collaboration with Tata Group to deliver AI computing infrastructure and platforms for developing AI solutions. The collaboration will bring state-of-the-art AI capabilities within reach of thousands of organizations, businesses, and AI researchers, and hundreds of startups in India.