Categories
Misc

Accelerating Load Times for DirectX Games and Apps with GDeflate for DirectStorage

Load times. They are the bane of any developer trying to construct a seamless experience. Trying to hide loading in a game by forcing a player to shimmy through narrow passages or take extremely slow elevators breaks immersion.

Now, developers have a better solution. NVIDIA collaborated with Microsoft and IHV partners to develop GDeflate for DirectStorage 1.1, an open standard for GPU compression. The current Game Ready Driver (version 526.47) contains NVIDIA RTX IO technology, including optimizations for GDeflate.

GDeflate: An Open GPU Compression Standard

GDeflate is a high-performance, scalable, GPU-optimized data compression scheme that can help applications make use of the sheer amount of data throughput available on modern NVMe devices. It makes streaming decompression from such devices practical by eliminating CPU bottlenecks from the overall I/O pipeline. GDeflate also provides bandwidth amplification effects, further improving the effective throughput of the I/O subsystem.

The GDeflate source code will be released on GitHub under a permissive license for IHVs and ISVs. We want to encourage quick adoption of GDeflate as a data-parallel compression standard, facilitating its uptake across the PC ecosystem and on other platforms.

To show the benefits of GDeflate, we measured system performance without compression, with standard CPU-side decompression, and with GPU-accelerated GDeflate decompression on a representative game-focused dataset, containing texture and geometry data.

Figure 1. Data throughput with no compression, Zlib, a CPU implementation of GDeflate, and the GPU version of GDeflate across varying staging buffer sizes
Figure 2. Processing cycles with no compression, Zlib, a CPU implementation of GDeflate, and the GPU version of GDeflate across varying staging buffer sizes

As you can see from Figures 1 and 2, the data throughput of uncompressed streaming is limited by the system bus bandwidth at about 3 GB/s, which is the limit of a Gen3 PCIe interconnect.

When applying traditional compression with decompression happening on the CPU, it’s the CPU that becomes the overall bottleneck, resulting in lower throughput than would otherwise be possible with uncompressed streaming. Not only does it underutilize available I/O resources of the system, but it also takes away CPU cycles from other tasks needing CPU resources.

With GPU-accelerated GDeflate decompression, the system can deliver effective bandwidth well in excess of what’s possible without applying compression, effectively multiplying data throughput by the compression ratio. The CPU remains fully available for performing other important tasks, maximizing system-level performance.
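
For example, if assets compress at a 2:1 ratio, a 7 GB/s NVMe read stream can deliver roughly 14 GB/s of uncompressed data to the GPU.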

GDeflate is available as a standard GPU decompression option in DirectStorage 1.1—a modern I/O streaming API from Microsoft. We’re looking forward to next-generation game engines benefiting from GDeflate by dramatically reducing loading times.

Resource streaming and data compression

Today’s video games feature extremely detailed interactive environments, requiring the management of enormous assets. This data must be delivered first to the end user’s system, and then, at runtime, actively streamed to the GPU for processing. The bulk of a game’s content package is made up of resources that naturally target the GPU: textures, materials, and geometry data.

Traditional data compression techniques are applicable to game content that rarely changes. For example, a texture that is authored only one time may have to be loaded multiple times as the player advances through a game level. Such assets are usually compressed when they are packaged for distribution and decompressed on demand when the game is played. It has become standard practice to apply compression to game assets to reduce the size of the downloadable (and its installation footprint).

However, most data compression schemes are designed for CPUs and assume serial execution semantics. In fact, the process of data compression is usually described in fundamentally serial terms: a stream of data is scanned serially while looking for redundancies or repeated patterns, and repeated occurrences of such patterns are replaced with references to their earlier occurrences. As a result, such algorithms can’t easily scale to data-parallel architectures or accommodate the need for faster decompression rates demanded by modern game content.
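
To see why this is inherently serial, consider the following minimal LZ77-style sketch in Python. It is illustrative only; it is not DEFLATE or GDeflate, and the names are hypothetical. The compressor scans the input one position at a time against a window of already-processed data, and the decompressor may need the byte it has only just produced to resolve the next back-reference.

def lz_compress(data: bytes, window: int = 4096, min_match: int = 4):
    # Emit a list of ('lit', byte) literals and ('ref', distance, length) back-references
    out = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Serial dependency: the search window is everything already processed
        for j in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and data[j + length] == data[i + length]
                   and length < 255):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            out.append(('ref', best_dist, best_len))
            i += best_len
        else:
            out.append(('lit', data[i]))
            i += 1
    return out

def lz_decompress(tokens) -> bytes:
    out = bytearray()
    for tok in tokens:
        if tok[0] == 'lit':
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                # Each byte may depend on one produced just before it: inherently serial
                out.append(out[-dist])
    return bytes(out)

sample = b"the quick brown fox jumps over the lazy dog " * 64
assert lz_decompress(lz_compress(sample)) == sample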

At the same time, recent advances in I/O technology have dramatically improved available I/O bandwidth on the end user system. It’s typical to see a consumer system equipped with a PCIe Gen3 or Gen4 NVMe device, capable of delivering up to 7 GB/s of data bandwidth.

To put this in perspective, at this rate, it is possible to fill the entire 24 GB of frame buffer memory on the high-end NVIDIA GeForce RTX 4090 GPU in a little over 3 seconds!

To keep up with these system-level I/O speed improvements, we need dramatic advances in data compression technology. At these rates, it is no longer practical to use the CPU for data decompression on the end user’s system: doing so requires an unacceptably large fraction of precious CPU cycles to be spent on this auxiliary task, and it may slow down the entire system.

The CPU shouldn’t become the bottleneck that holds back the I/O subsystem.

Data-parallel decompression and GDeflate architecture

With Moore’s law ending, we can no longer expect to get “free” performance improvements from serial processors.

High-performance systems have long embraced large-scale data parallelism to continue scaling performance for many applications. On the other hand, parallelizing the traditional data compression algorithms has been challenging, due to fundamental serial assumptions “baked” into their design.

What we need is a GPU-friendly data compression approach that can scale performance as GPUs become wider and more parallel.

This is the problem that we set out to address with GDeflate, a novel data-parallel compression scheme optimized for high-throughput GPU decompression. We designed GDeflate with the following goals:

  • Deliver high-performance GPU-optimized decompression to support the fastest NVMe devices
  • Offload the CPU to avoid making it the bottleneck during I/O operations
  • Port easily to a variety of data-parallel architectures, including CPUs and GPUs
  • Allow cheap implementation in fixed-function hardware, using existing IP
  • Establish GDeflate as a data-parallel data compression standard

As you could guess from its name, GDeflate builds upon the well-established RFC 1951 DEFLATE algorithm, expanding and adapting it for data-parallel processing. While more sophisticated compression schemes exist, the simplicity and robustness of the original DEFLATE data coding make it an appealing choice for highly tuned GPU-based implementations.

Existing fixed-function implementations of DEFLATE can also be easily adapted to support GDeflate for improved compatibility and performance.

Two-level parallelism

The GDeflate bitstream is designed to be consumed by a many-core SIMD machine, explicitly exposing parallelism at two levels.

First, the original data stream is segmented into 64 KB tiles, which are processed independently. This coarse-grained decomposition provides thread-level parallelism, enabling multiple tiles to be processed concurrently on multiple cores of the target processor. This also enables random access to the compressed data at tile granularity. For example, a streaming engine may request a sparse set of tiles to be decompressed in accordance with the required working set for a given frame.

Also, 64 KB happens to be the standard tile size for tiled or sparse resources in graphics APIs (DirectX and Vulkan), which makes GDeflate compatible with future on-demand streaming architectures leveraging these API features.

Second, the bitstream within tiles is specifically formatted to expose finer-grained, SIMD-level parallelism. We expect that a cooperative group of threads will process individual tiles, as the group can directly parse the GDeflate bitstream using hardware-accelerated data-parallel operations, commonly available on most SIMD architectures.

All threads in the SIMD group share the decompression state. The formatting of the bitstream is carefully constructed to enable highly optimized cooperative processing of compressed data.

This two-level parallelization strategy enables GDeflate implementations to scale easily across a wide range of data-parallel architectures, also providing necessary headroom for supporting future, even wider data-parallel machines without compromising decompression performance.
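
To make the tile-level decomposition concrete, here is a minimal Python sketch of the same idea under stated assumptions: zlib stands in for the actual GDeflate codec, and CPU threads stand in for GPU cores. Data is split into 64 KB tiles, each tile is compressed independently, and an arbitrary sparse subset of tiles can later be decompressed in parallel.

import zlib
from concurrent.futures import ThreadPoolExecutor

TILE_SIZE = 64 * 1024  # 64 KB, matching the tiled-resource granularity mentioned above

def compress_tiles(data: bytes) -> list:
    # Each tile is compressed independently, so any tile can be decoded on its own
    tiles = [data[i:i + TILE_SIZE] for i in range(0, len(data), TILE_SIZE)]
    return [zlib.compress(tile) for tile in tiles]

def decompress_sparse(compressed_tiles, wanted_indices):
    # Random access at tile granularity: only the requested tiles are decoded,
    # and the decodes run concurrently because tiles share no state
    with ThreadPoolExecutor() as pool:
        results = pool.map(zlib.decompress,
                           (compressed_tiles[i] for i in wanted_indices))
    return dict(zip(wanted_indices, results))

asset = bytes(range(256)) * 2048                     # ~512 KB of sample data (8 tiles)
tiles = compress_tiles(asset)
working_set = decompress_sparse(tiles, [0, 3, 5])    # a frame requests a sparse set of tiles
assert working_set[3] == asset[3 * TILE_SIZE:4 * TILE_SIZE]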

NVIDIA RTX IO supports DirectStorage 1.1

NVIDIA RTX IO is now included in the current Game Ready Driver (version 526.47), which offers accelerated decompression throughput.

Both DirectStorage and RTX IO leverage the GDeflate compression standard.

“Microsoft is delighted to partner with NVIDIA to bring the benefits of next-generation I/O to Windows gamers. DirectStorage for Windows will enable games to leverage NVIDIA’s cutting-edge RTX IO and provide game developers with a highly efficient and standard way to get the best possible performance from the GPU and I/O system. With DirectStorage, game sizes are minimized, load times reduced, and virtual worlds are free to become more expansive and detailed, with smooth and seamless streaming.”

Bryan Langley, Group Program Manager for Windows Graphics and Gaming

Getting started with DirectStorage in RTX IO drivers

We have a few more recommendations to help ensure the best possible experience using DirectStorage with GPU decompression on NVIDIA GPUs.

Preparing your application for DirectStorage

Achieving maximum end-to-end throughput with DirectStorage GPU decompression requires enqueuing a sufficient number of read requests to keep the pipeline fully saturated.

In preparation for DirectStorage integration, applications should group resource I/O and creation requests close together in time. Ideally, resource I/O and creation operations occur in their own CPU thread, separate from threads doing other loading screen activities like shader creation.

Assets on disk should also be packaged together in large enough chunks so that DirectStorage API call frequency is kept to a minimum and CPU costs are minimized. This ensures that enough work can be submitted to DirectStorage to keep the pipeline fully saturated.

For more information about general best practices, see Using DirectStorage and the DirectStorage 1.1 Now Available Microsoft post.

Deciding the staging buffer size

  • Make sure to change the default staging buffer size whenever GPU decompression is used. The current 32 MB default isn’t sufficient to saturate modern GPU capabilities.
  • Make sure to benchmark different platforms with varying NVMe, PCIe, and GPU capabilities when deciding on the staging buffer size. We found that a 128 MB staging buffer size is a reasonable default. Smaller GPUs may require less and larger GPUs may require more.

Compression ratio considerations

  • Make sure to measure the impact that different resource types have on compression savings and GPU decompression performance.
  • In general, various data types, such as texture and geometry, compress at different ratios. This can cause some variation in GPU decompression execution performance.
  • This won’t have a significant effect on end-to-end throughput. However, it may result in variation in latency when delivering the resource contents to their final locations.

Windows File System

  • Try to keep disk files accessed by DirectStorage separate from files accessed by other I/O APIs. Shared file use across different I/O APIs may result in the loss of bypass I/O improvements.

Command queue scheduling when background streaming

  • In Windows 10, command queue scheduling contention can occur between DirectStorage copy and compute command queues, and application-managed copy and compute command queues.
  • The NVIDIA Nsight Systems, PIX, and GPUView tools can assist in determining whether background streaming with DirectStorage is in contention with important application-managed command queues.
  • In Windows 11, overlapped execution between DirectStorage and application command queues is fully expected.
  • If overlapped execution results in suboptimal performance of application workloads, we recommend throttling back DirectStorage reads. This helps maintain critical application performance while background streaming is occurring.

Summary

Next-generation game engines require streaming huge amounts of data to create increasingly realistic, detailed game worlds. Given that, it’s necessary to rethink game engines’ resource streaming architecture and to fully leverage improvements in I/O technology.

Using the GPU as an accelerator for compute-intensive data decompression becomes critical for maximizing system performance and reducing load times.

The NVIDIA RTX IO implementation of GDeflate is a scalable GPU-optimized compression technology that enables applications to benefit from the computational power of the GPU for I/O acceleration. It acts as a bandwidth amplifier for the high-performance I/O capabilities of today’s and future systems.

Categories
Misc

Data Storytelling Best Practices for Data Scientists and AI Practitioners

Storytelling with data is a crucial soft skill for AI and data professionals. To ensure that stakeholders understand the technical requirements, value, and impact of data science team efforts, it is necessary for data scientists, data engineers, and machine learning (ML) engineers to communicate effectively.

This post provides a framework and tips you can adopt to incorporate key elements of data storytelling into your next presentation, pitch, or proposal. It aims to accomplish the following:

  • Introduce storytelling within the context of data science and machine learning
  • Highlight the benefits of effective storytelling for data science practitioners
  • Provide tips on how to cultivate data storytelling skills

What is storytelling with data

Data storytelling is the ability to add contextual information to key data and insights to help develop viewpoints and realizations for project stakeholders. Data scientists and AI practitioners must effectively convey the impact of data-driven action or reasoning.  

Data and machine learning practitioners can use data storytelling to more effectively communicate with clients, project stakeholders, team members, and other business entities. A compelling narrative can help your audience understand complex concepts and can help win new projects.

Data storytelling case study

This section explores the key structural components of a data-driven story. 

The article, What Africa Will Look Like in 100 Years, leverages data and visualizations to tell a narrative of the ongoing transformation occurring in Africa from the viewpoint of major African cities such as Lagos, Dakar, and Cairo.

The strategic composition of this article presents the problem, background, and solution. This approach provides a strong foundation for any data-driven narrative. The article also includes facts, anecdotes, data, and charts and graphs. Together, these produce a free-flowing, well-structured, engaging, and informative account of the subject matter.

The opening sections of this article describe the context and main point: “Can Africa translate its huge population growth into economic development and improved quality of life?” 

Information such as key dates, figures, and first-person statements creates a picture grounded in reality, allowing the reader to form a full understanding of the subject matter. The presentation of data using charts and graphs allows for the visualization of the transformations of Africa’s major cities. Specific data points include population growth, education rate, and life expectancy. Personal experiences and first-hand accounts from citizens of the focus cities provide additional context.

An effective framework for storytelling in data science

This section explores how storytelling in the data science field should be structured and presented. The goal is to equip you with an easy-to-follow framework for your next presentation, article, or video to stakeholders. 

The recipe for success when storytelling can be distilled into three individual components: context, dispute, and solution (Figure 1). These components can be combined with other methods to tell a compelling story with data. 

  • Context: Lay the foundation for your narrative and provide some background
  • Dispute: Discuss the problem associated with the context
  • Solution: Explain and discuss the solution that either ends or mitigates the identified problem
Figure 1. The components of storytelling: context, dispute, and solution

Context

In storytelling, context involves providing information to reinforce, support, and reveal the key findings extracted from data samples. Without context, collated data are only collections of alphanumeric representations of information that alone don’t provide any actionable insight into the issue or topic. Presenting data together with reinforcing context and other supporting elements can aid understanding and help audiences reach meaningful conclusions. 

You can use many different methods to create context when storytelling. Context for data is produced by leveraging a collection of reinforcing materials such as actors, anecdotes, visualizations, data labels, diagrams, and more.

To provide an example, consider the sentence below:

“200,000 plug-in electric vehicles were sold in the United Kingdom in 2021, representing an approximate 140% year-on-year increase.” 

Adding contextual information and supporting anecdotes can increase relatability, as shown in the paragraph below: 

“James’s interest in electric vehicles was sparked by a conversation he overheard on the radio about climate change. He did some research and found that a Volkswagen ID.3 would be a great choice for him. James decided to buy the car and by mid-2021, he was one of the many UK residents who had made the switch to electric vehicles. Sales of electric vehicles in 2021 more than doubled what they were in 2020, due to the public’s increasing awareness of climate change and its effects.”

Charts and diagrams are also important to include. They visualize data to aid understanding and provide additional support (Figure 2).

Figure 2. A bar chart showing the sales volume of plug-in electric vehicles in selected European countries in 2021, an example of data visualization that helps to provide context in data storytelling
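
For readers who want to produce a chart like Figure 2, here is a rough Python sketch. The UK figure comes from the sample outline later in this post; the other values are placeholders, not real data.

import matplotlib.pyplot as plt

countries = ['Germany', 'UK', 'France', 'Norway', 'Sweden']
# Hypothetical 2021 plug-in EV sales; only the UK value (305,300) is taken from this post
sales_2021 = [350_000, 305_300, 300_000, 150_000, 110_000]

fig, ax = plt.subplots(figsize=(7, 4))
ax.bar(countries, sales_2021)
ax.set_title('Plug-in electric vehicle sales in selected European countries, 2021')
ax.set_ylabel('Vehicles sold')
plt.tight_layout()
plt.show()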

Dispute

Dispute, in the context of data storytelling, is a problem, conflict, argument, debate, or issue. To drive home the impact of introducing a new tool or adopting a new methodology, it helps to call out the key dispute.

Below is an example of a dispute that helps drive the point of the initial electric vehicle data:

“The United Kingdom is a net importer of fossil fuels for the use of energy and electricity generation. Fossil fuels power our transportation, electrical, and technological services, and even domestic items heavily reliant on fossil fuels’ energy output. The problem is that the UK is determined to significantly reduce its dependence on fossil fuels by 2050. Hence, the question is how the UK can reduce its fossil fuel consumption and move to low-carbon energy sources as an alternative. In addition, fossil fuels are a massive contributor to climate change and extreme weather.”

Solution

The third, and final element to consider when connecting storytelling with data is the solution. The solution can come in many forms, such as reconfiguring an existing system, implementing new methodologies, or becoming aware of educational materials and how to best use them.

The proposed solution should be direct, obvious, and memorable. If proposed solutions are ambiguous, stakeholders will ask more questions. A direct solution, on the other hand, allows for action and the formation of future steps.

Below is an example of a proposed solution:

“Awareness is the first step to making the national UK goal of reducing fossil fuel dependency by 2050. To reach more people like James, we propose a scale-up of the WWF Carbon footprint app to include AI-powered functionality that enables services such as energy consumption prediction per household based on historical data and predicted energy demands. This scale-up initiative will require funding of £100 million and will be delivered to the public a year after project approval.”

The proposed solution contains a reference to the story to make it easier to remember. It also includes project cost and timeline information, making the proposal direct and actionable.

Sample outline 

Use the sample outline below as a reference for your next data storytelling project.

Opening section

  • Start with a factual statement of your key data point or dataset summary that highlights the impact of the dispute, lack of solution, or the impact of a possible solution. For example, “305,300 plug-in electric vehicles were sold in the United Kingdom in 2021, representing an approximate 140% year-on-year increase.”
  • Expand on the initial opening section by including several paragraphs introducing, explaining, and expanding on the context.

Middle section

  • Introduce, explain, and expand on the dispute.
  • Include anecdotes, facts, figures, charts, and diagrams to contextualize the dispute and present the problem.
  • Introduce, explain, and expand on the solution as it relates to the dispute.
  • Include anecdotes, facts, figures, charts, and diagrams to illustrate the impact and value of the proposed solution.

Closing section

  • Summarize your main points. Show the benefits a solution would bring, and the undesired consequences of not having a solution.
  • Include a call to action as a next step that encapsulates the desired outcome of the story told with data.
Figure 3. The key components and accompanying attributes of effective data storytelling

Summary

Companies and organizations are becoming more data-driven every day. As a result, AI and data professionals of all levels need to develop data storytelling skills to bridge gaps of understanding related to technicalities, datasets, and technologies. The information in this post will give you a strong foundation from which to start building your data storytelling skills.

Categories
Misc

Tiny Computer, Huge Learnings: Students at SMU Build Baby Supercomputer With NVIDIA Jetson Edge AI Platform

“DIY” and “supercomputer” aren’t words typically used together. But a do-it-yourself supercomputer is exactly what students built at Southern Methodist University, in Dallas, using 16 NVIDIA Jetson Nano modules, four power supplies, more than 60 handmade wires, a network switch and some cooling fans. The project, dubbed SMU’s “baby supercomputer,” aims to help educate those…

Categories
Misc

Deploying a 1.3B GPT-3 Model with NVIDIA NeMo Megatron

Large language models (LLMs) are some of the most advanced deep learning algorithms that are capable of understanding written language. Many modern LLMs are built using the transformer network introduced by Google in 2017 in the Attention Is All You Need research paper.

NVIDIA NeMo Megatron is an end-to-end GPU-accelerated framework for training and deploying transformer-based LLMs up to a trillion parameters. In September 2022, NVIDIA announced that NeMo Megatron is now available in Open Beta, allowing you to train and deploy LLMs using your own data. With this announcement, several pretrained checkpoints have been uploaded to HuggingFace, enabling anyone to deploy LLMs locally using GPUs.

This post walks you through the process of downloading, optimizing, and deploying a 1.3 billion parameter GPT-3 model using NeMo Megatron. The process uses NVIDIA Triton Inference Server, powerful open-source inference-serving software that can deploy a wide variety of models and serve inference requests on both CPUs and GPUs in a scalable manner.

System requirements

While training LLMs requires massive amounts of compute power, trained models can be deployed for inference at a much smaller scale for most use cases.

The models from HuggingFace can be deployed on a local machine with the following specifications:

  • A modern Linux OS (tested with Ubuntu 20.04).
  • An NVIDIA Ampere architecture GPU or newer with at least 8 GB of GPU memory.
  • At least 16 GB of system memory.
  • Docker version 19.03 or newer with the NVIDIA Container Runtime.
  • Python 3.7 or newer with PIP.
  • A reliable Internet connection for downloading models.
  • A permissive firewall, if serving inference requests from remote machines.

Preparation

NeMo Megatron is now in Open Beta and available for anyone who completes the free registration form. Registration is required to gain access to the training and inference containers, as well as helper scripts to convert and deploy trained models.

Several trained NeMo Megatron models are hosted publicly on HuggingFace, including 1.3B, 5B, and 20B GPT-3 models. These models have been converted to the .nemo format which is optimized for inference.

Converted models cannot be retrained or fine-tuned, but they enable fully trained models to be deployed for inference. These models are significantly smaller than the pre-conversion checkpoints and are supported by FasterTransformer (FT), a backend in Triton Inference Server that runs LLMs across GPUs and nodes.

For the purposes of this post, we used the 1.3B model, which has the quickest inference speeds and can comfortably fit in memory for most modern GPUs.

To convert the model, run the following steps.

Download the 1.3B model to your system. Run the following command in the directory where you want to keep converted models for NVIDIA Triton to read:

wget https://huggingface.co/nvidia/nemo-megatron-gpt-1.3B/resolve/main/nemo_gpt1.3B_fp16.nemo

Make a note of the folder to which the model was copied, as it is used throughout the remainder of this post.

Verify the MD5sum of the downloaded file:

$ md5sum nemo_gpt1.3B_fp16.nemo
38f7afe7af0551c9c5838dcea4224f8a  nemo_gpt1.3B_fp16.nemo

Use a web browser to log in to NGC at ngc.nvidia.com. Enter the Setup menu by selecting your account name. Select Get API Key followed by Generate API Key to create the token. Make a note of the key as it is only shown one time.

In the terminal, add the token to Docker:

$ docker login nvcr.io
Username: $oauthtoken
Password: 

When prompted for the password, paste the token that was generated. The username must be exactly $oauthtoken, as this indicates that a personal access token is being used.

Pull the latest training and inference images for NeMo Megatron:

$ docker pull nvcr.io/ea-bignlp/bignlp-training:22.08.01-py3
$ docker pull nvcr.io/ea-bignlp/bignlp-inference:22.08-py3

At the time of publication, the latest image tags are 22.08.01-py3 for training and 22.08-py3 for inference. We recommend checking for newer tags on NGC and pulling those, if available.

Verify that the images were pulled successfully (the image IDs might differ for different tags):

$ docker images | grep "ea-bignlp/bignlp"
nvcr.io/ea-bignlp/bignlp-training                       22.08.01-py3                         d591b7488a47   11 days ago     17.3GB
nvcr.io/ea-bignlp/bignlp-inference                      22.08-py3                            77a6681df8d6   2 weeks ago     12.2GB

Model conversion

To optimize throughput and latency of the model, it can be converted to the FT format, which contains performance modifications to the encoder and decoder layers in the transformer architecture.

FT can serve inference requests at least 3x faster than non-FT counterparts. The NeMo Megatron training container includes the FT framework as well as scripts to convert a .nemo file to the FT format.

Triton Inference Server expects models to be stored in a model repository. Model repositories contain checkpoints and model-specific information that Triton Inference Server reads to tune the model at deployment time. As with the FT framework, the NeMo Megatron training container includes scripts to convert the FT model to a model repository for Triton.

Converting a model to the FT format and creating a model repository for the converted model can be done in one pass in a Docker container. To create an FT-based model repository, run the following command. Items that might have to change for your setup are described in the list that follows.

docker run --rm \
    --gpus all \
    --shm-size=16GB \
    -v /path/to/checkpoints:/checkpoints \
    -v /path/to/checkpoints/output:/model_repository \
    nvcr.io/ea-bignlp/bignlp-training:22.08.01-py3 \
    bash -c 'export PYTHONPATH=/opt/bignlp/FasterTransformer:${PYTHONPATH} && \
    cd /opt/bignlp && \
    python3 FasterTransformer/examples/pytorch/gpt/utils/nemo_ckpt_convert.py \
        --in-file /checkpoints/nemo_gpt1.3B_fp16.nemo \
        --infer-gpu-num 1 \
        --saved-dir /model_repository/gpt3_1.3b \
        --weight-data-type fp16 \
        --load-checkpoints-to-cpu 0 && \
    python3 /opt/bignlp/bignlp-scripts/bignlp/collections/export_scripts/prepare_triton_model_config.py \
        --model-train-name gpt3_1.3b \
        --template-path /opt/bignlp/fastertransformer_backend/all_models/gpt/fastertransformer/config.pbtxt \
        --ft-checkpoint /model_repository/gpt3_1.3b/1-gpu \
        --config-path /model_repository/gpt3_1.3b/config.pbtxt \
        --max-batch-size 256 \
        --pipeline-model-parallel-size 1 \
        --tensor-model-parallel-size 1 \
        --data-type bf16'

This command launches a Docker container to run the conversion. The following list describes a few important parameters and their functions:

  • -v /path/to/checkpoints:/checkpoints: Specify the local directory where checkpoints were saved. This is the directory that was mentioned during the checkpoint download step earlier. The final :/checkpoints directory in the command should stay the same.
  • -v /path/to/checkpoints/output:/model_repository: Specify the local directory to save the converted checkpoints to. Make a note of this location as it is used in the deployment step later. The final :/model_repository directory in the command should stay the same.
  • nvcr.io/ea-bignlp/bignlp-training:22.08.01-py3: If a newer image exists on NGC, replace this tag with the new version.
  • --in-file /checkpoints/nemo_gpt1.3B_fp16.nemo: The name of the downloaded checkpoint to convert. If you are using a different version, replace the name here.
  • --infer-gpu-num 1: This is the number of GPUs to use for the deployed model. If using more than one GPU, increase this number to the desired amount. The remainder of this post assumes that the value of 1 was used here.
  • --model-train-name gpt3_1.3b: The name of the deployed model. If you are using a different model name, make a note of the new name as NVIDIA Triton requests require the name to be specified.
  • --tensor-model-parallel-size 1: If you are using a different GPU count for inference, this number must be updated. The value should match that of --infer-gpu-num from earlier.

After running the command, verify that the model has been converted by viewing the specified output directory. The output should be similar to the following (truncated for brevity):

$ ls -R output/
output/:
gpt3_1.3b

output/gpt3_1.3b:
1-gpu  config.pbtxt

output/gpt3_1.3b/1-gpu:
config.ini
merges.txt
model.final_layernorm.bias.bin
model.final_layernorm.weight.bin
...

Model deployment

Now that the model has been converted to a model repository, it can be deployed with Triton Inference Server. Do this using the NeMo Megatron Inference container, which has NVIDIA Triton built in.

By default, NVIDIA Triton uses three ports for HTTP, gRPC, and metric requests.

docker run --rm \
    --name triton-inference-server \
    -d \
    --gpus all \
    -p 8000-8002:8000-8002 \
    -v /path/to/checkpoints/output:/model_repository \
    nvcr.io/ea-bignlp/bignlp-inference:22.08-py3 \
    bash -c 'export CUDA_VISIBLE_DEVICES=0 && \
    tritonserver --model-repository /model_repository'

The following list describes a few important parameters and their functions:
  • -d: This tells Docker to run the container in the background. The server remains online and available for requests until the container is killed.
  • -p 8000-8002:8000-8002: NVIDIA Triton communicates using ports 8000 for HTTP requests, 8001 for gRPC requests, and 8002 for metrics information. These ports are mapped from the container to the host, allowing the host to handle requests directly and route them to the container.
  • -v /path/to/checkpoints/output:/model_repository: Specify the location where the converted checkpoints were saved to on the machine. This should match the model repository location from the conversion step earlier.
  • nvcr.io/ea-bignlp/bignlp-inference:22.08-py3: If a newer version exists on NGC, replace this tag with the new version.
  • export CUDA_VISIBLE_DEVICES=0: Specify which devices to use. If the model was converted to use multiple GPUs earlier, this should be a comma-separated list of the GPUs up to the desired number. For example, if you are using four GPUs, this should be CUDA_VISIBLE_DEVICES=0,1,2,3.

To verify that the container was launched successfully, run docker ps, which should show output similar to the following:

CONTAINER ID   IMAGE                                          COMMAND                  CREATED              STATUS              PORTS                                                           NAMES
f25cf23b75b7   nvcr.io/ea-bignlp/bignlp-inference:22.08-py3   "/opt/nvidia/nvidia_…"   About a minute ago   Up About a minute   0.0.0.0:8000-8002->8000-8002/tcp, :::8000-8002->8000-8002/tcp   triton-inference-server

Check the logs to see if the model was deployed and ready for requests (output truncated for brevity).

$ docker logs triton-inference-server
I0928 14:29:34.011299 1 server.cc:629] 
+-----------+---------+--------+
| Model     | Version | Status |
+-----------+---------+--------+
| gpt3_1.3b | 1       | READY  |
+-----------+---------+--------+

I0928 14:29:34.131430 1 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA A100-SXM4-80GB
I0928 14:29:34.132280 1 tritonserver.cc:2176] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                       |
| server_version                   | 2.24.0                                                                                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | /model_repository                                                                                                                                                                            |
| model_control_mode               | MODE_NONE                                                                                                                                                                                    |
| strict_model_config              | 0                                                                                                                                                                                            |
| rate_limit                       | OFF                                                                                                                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                    |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                     |
| response_cache_byte_size         | 0                                                                                                                                                                                            |
| min_supported_compute_capability | 6.0                                                                                                                                                                                          |
| strict_readiness                 | 1                                                                                                                                                                                            |
| exit_timeout                     | 30                                                                                                                                                                                           |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0928 14:29:34.133520 1 grpc_server.cc:4608] Started GRPCInferenceService at 0.0.0.0:8001
I0928 14:29:34.133751 1 http_server.cc:3312] Started HTTPService at 0.0.0.0:8000
I0928 14:29:34.174655 1 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002

If the output is similar to what’s shown here, the model is ready to receive inference requests.
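
As an optional sanity check before building full inference requests, you can query the standard Triton readiness endpoints from Python. This short sketch assumes the tritonclient[http] package, which is installed in the next section, and the gpt3_1.3b model name used throughout this post.

import tritonclient.http as httpclient

# Connect to the local server and confirm both the server and the model are ready
client = httpclient.InferenceServerClient('localhost:8000')
print('Server ready:', client.is_server_ready())
print('Model ready: ', client.is_model_ready('gpt3_1.3b'))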

Sending inference requests

With a local Triton Inference Server running, you can start sending inference requests to the server. NVIDIA Triton’s client API supports multiple languages including Python, Java, and C++. For the purposes of this post, we provide a sample Python application.

from argparse import ArgumentParser
import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype
from transformers import GPT2Tokenizer

def fill_input(name, data):
    # Wrap a NumPy array as a Triton InferInput with matching shape and dtype
    infer_input = httpclient.InferInput(name, data.shape, np_to_triton_dtype(data.dtype))
    infer_input.set_data_from_numpy(data)
    return infer_input

def build_request(query, host, output):
    with httpclient.InferenceServerClient(host) as client:
        request_data = []
        # Tokenized prompt, its length, and the requested number of output tokens
        request = np.array([query]).astype(np.uint32)
        request_len = np.array([[len(query)]]).astype(np.uint32)
        request_output_len = np.array([[output]]).astype(np.uint32)
        # Sampling parameters: top_k=1 with top_p=0.0 requests greedy decoding
        top_k = np.array([[1]]).astype(np.uint32)
        top_p = np.array([[0.0]]).astype(np.float32)
        temperature = np.array([[1.0]]).astype(np.float32)

        request_data.append(fill_input('input_ids', request))
        request_data.append(fill_input('input_lengths', request_len))
        request_data.append(fill_input('request_output_len', request_output_len))
        request_data.append(fill_input('runtime_top_k', top_k))
        request_data.append(fill_input('runtime_top_p', top_p))
        request_data.append(fill_input('temperature', temperature))
        # Send the request to the deployed model and extract the generated token IDs
        result = client.infer('gpt3_1.3b', request_data)
        output = result.as_numpy('output_ids').squeeze()
        return output

def main():
    parser = ArgumentParser('Simple Triton Inference Requestor')
    parser.add_argument('query', type=str, help='Enter a text query to send to '
                        'the Triton Inference Server in quotes.')
    parser.add_argument('--output-length', type=int, help='Specify the desired '
                        'length for output.', default=30)
    parser.add_argument('--server', type=str, help='Specify the host:port that '
                        'Triton is listening on. Defaults to localhost:8000',
                        default='localhost:8000')
    args = parser.parse_args()

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    query = tokenizer(args.query).input_ids
    request = build_request(query, args.server, args.output_length)
    print(tokenizer.decode(request))

if __name__ == '__main__':
    main()

At a high level, the script does the following:

  1. Takes an input request from the user, such as, “Hello there! How are you today?”
  2. Tokenizes the input using a pretrained GPT-2 tokenizer from HuggingFace.
  3. Builds an inference request using several required and optional parameters, such as request, temperature, output length, and so on.
  4. Sends the request to NVIDIA Triton.
  5. Decodes the response using the tokenizer from earlier.

To run the code, several Python dependencies are required. These packages can be installed by running the following command:

$ pip3 install numpy tritonclient[http] transformers

After the dependencies are installed, save the code to a local file and name it infer.py. Next, run the application as follows:

$ python3 infer.py "1 2 3 4 5 6"

This sends the prompt “1 2 3 4 5 6” to the local inference server and should output the following to complete the sequence up to the default response token limit of 30:

“1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36”

The server can now respond to any HTTP requests using this basic formula and can support multiple concurrent requests both locally and remotely.
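
As a rough sketch of concurrent use, the build_request helper from infer.py can be driven from multiple threads in one process. This assumes the script above was saved as infer.py in the current directory; each worker opens its own HTTP connection, so the requests overlap on the server.

from concurrent.futures import ThreadPoolExecutor
from transformers import GPT2Tokenizer
from infer import build_request  # the script saved earlier in this post

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
prompts = ['1 2 3 4 5 6', 'Hello there! How are you today?', 'Once upon a time']

def run(prompt):
    # Tokenize, send the request, and decode the generated token IDs
    ids = tokenizer(prompt).input_ids
    return tokenizer.decode(build_request(ids, 'localhost:8000', 30))

with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for completion in pool.map(run, prompts):
        print(completion)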

Summary

Large language models are powering a growing number of applications. With the public release of several NeMo Megatron models, it’s now possible to deploy trained models locally.

This post outlined how to deploy public NeMo Megatron models using a simple Python script. You can test more robust models and use cases by downloading the larger models hosted on HuggingFace.

For more information about using NeMo Megatron, see the NeMo Megatron documentation and NVIDIA/nemo GitHub repo.

Categories
Misc

Explainer: What Are Graph Neural Networks?

GNNs apply the predictive power of deep learning to rich data structures that depict objects and their relationships as points connected by lines in a graph.

Categories
Offsites

Researchers thought this was a bug (Borwein integrals)

Categories
Misc

Meet the Omnivore: Indie Showrunner Transforms Napkin Doodles Into Animated Shorts With NVIDIA Omniverse

Editor’s note: This post is a part of our Meet the Omnivore series, which features individual creators and developers who use NVIDIA Omniverse to accelerate their 3D workflows and create virtual worlds. 3D artist Rafi Nizam has worn many hats since starting his career as a web designer more than two decades ago, back when…

Categories
Misc

New Research Highlights Speed and Cost Savings of Clara Parabricks for Genomic Analyses

Many organizations are using Clara Parabricks for fast human genome and exome analysis for large population projects, critically-ill patients, clinical workflows, and cancer genomics projects. Their work aims to accurately and quickly identify disease-causing variants, keeping pace with accelerated next-generation sequencing as well as accelerated genomic analyses. 

Most recently, two peer-reviewed scientific publications in August and September highlight the speed, accuracy, and cost savings of Clara Parabricks for de novo and pathogen workflows. 

Genome variant identification to track Malaria transmission

Lead Purdue University researcher Dr. Giovanna Carpi and her team sought to understand the performance of Clara Parabricks, relative to existing methods used by the malaria research community for variant identification, using 1,000 malaria genomes to track malaria transmission and monitor antimalarial drug resistance.

Dr. Carpi, who has been researching pathogen genomics for many years, demonstrated a 27x increase in analysis speed and a 5x decrease in cost compared to the conventional CPU pipeline, while delivering 99.9% accuracy. The malaria genome is relatively large (24 Mb) and AT-rich, which makes it quite challenging to analyze. Dr. Carpi used publicly available data from the MalariaGEN consortium, consisting of raw Illumina reads. The research is presented in A GPU-Accelerated Compute Framework for Pathogen Genomic Variant Identification to Aid Genomic Epidemiology of Infectious Disease: A Malaria Case Study, published in Briefings in Bioinformatics.

The ability to sequence and analyze whole-genome pathogens quickly helps public health officials understand the spread of a disease, drug resistance, and also new variants’ transmissibility and severity. The World Health Organization (WHO) reported 241 million cases of malaria in 2020 compared to 227 million cases in 2019, and an estimated 627,000 deaths in 2020—an increase of 69,000 deaths over the previous year.

Malaria is caused by Plasmodium parasites that are transmitted to people through the bites of infected female Anopheles mosquitoes. Africa carries a disproportionately high share of the global malaria burden, with children under five years of age accounting for 80% of the total deaths in the region. 

Dr. Carpi noted, “The ability to generate analysis-ready variant outputs in less than five minutes with greater than 99.9% accuracy for large-scale whole-genome Plasmodium studies at lower costs, remarkably reduces the computational bottleneck that most malaria genomics programs currently face, and facilitates decentralized bioinformatics analyses in endemic countries.” Visit malaria-parabricks-pipeline on GitHub to download this Clara Parabricks workflow for malaria and to learn more.

Figure 1. Clara Parabricks variant calling acceleration and scalability, shown as a workflow (left) and runtime comparisons on CPU and GPU (right). GPU-accelerated Clara Parabricks shows a 27x acceleration compared to GATK in a CPU environment.

Discovering de novo variants in autism patients

Separately, Dr. Tychele Turner and her team from Washington University in St. Louis developed a fast genomics workflow for discovering de novo variants (DNVs) in autism patients using GPU-accelerated Clara Parabricks. Dr. Turner is a geneticist/genomicist with a deep interest in understanding the genetic architecture of human disease. Her lab is focused on the genomics of neurodevelopmental disorders, optimization of genomic workflows, and application of novel genomic technologies to understand disease. The research is presented in De Novo Variant Calling Identifies Cancer Mutation Signatures in the 1000 Genomes Project, published in Human Mutation.

Dr. Turner worked closely with the NVIDIA genomic team to integrate her trio analysis into NVIDIA Clara Parabricks. Dr. Turner was astonished to see a 100x speedup in turnaround time for a trio analysis using NVIDIA Clara Parabricks. The initial analysis to generate DNVs on GPUs took 8.5 hours using a server with just 4 GPUs versus 800 hours on CPUs. When the team further parallelized the workflow on GPUs, the run time was further shortened to less than one hour. 

Dr. Turner has focused most of her career on DNVs, which are novel variants present in children’s DNA but not present in their parent’s DNA. These DNVs can be assessed by sequencing the DNA from a child and both parents followed by a comparative analysis, called a trio analysis. In the general population, each individual has around 40 to 100 DNVs and most DNVs do not affect the genes. 

However, a genetic disease often results when a Single Nucleotide Variant (SNV) in a base pair (A, T, C, G), small insertion/deletion (indel), or Structural Variant (SV) alters a gene and affects the resulting protein production or function. This is the case with some neurodevelopmental disorders, where enrichment of protein-coding DNVs in patients has been identified in phenotypes including autism, epilepsy, intellectual disability, and congenital heart defects.

These fast results held promise not only for scientific discovery but also for Dr. Turner’s vision of same-day clinical results. To confirm the accuracy of the de novo variant calls from the new GPU-based workflow, the team leveraged NVIDIA Clara Parabricks to study a family with monozygotic twins, also known as identical twins, who have the same DNA. 

The results showed the same number of DNVs in both the GPU-based and the previous CPU-based workflows, and in both cases about 20% of the DNVs were at CpG sites, indicating that the NVIDIA Clara Parabricks workflow produced equivalent results, but 100x faster. This meant that their autism genomic research could be completed faster, variants could be discovered faster, and hopefully insights for patients can be understood faster.

Figure 2. Mutational properties of de novo variants

Dr. Turner remarked, “Utilization of GPUs is enabling rapid bioinformatic analyses to move forward to a one-hour genomic workup.” 

With the new GPU-based DNV genomic analysis workflow, the team proceeded to study sequence data from the 1000 Genomes Project, an international research consortium that has sequenced representative cohorts from African, East Asian, South Asian, and European populations. The 1000 Genomes Project aims to describe and characterize the variations found in human genomes as a basis for investigating the relationship between genetic polymorphisms and phenotypes by sequencing 2,600 individuals from 26 populations from around the world.  

Recently, The New York Genome Center sequenced these individuals at high depth and made the data publicly available. The population included 602 trios of families with no autism. This was the first opportunity to look at DNVs with no known phenotypes as a control to understand the level of DNVs in population and compare those to the autism cohort.

The DNV analysis of the 1000 Genomes Project individuals ended up surprising Dr. Turner’s team. They saw a bimodal distribution of the number of DNVs, with peaks at 200, a little larger than expected, and at 2,000, much larger than expected. Dr. Turner looked at the various cohorts in the 1000 Genomes Project data and noticed that the CEU population, a cohort of European individuals, has been studied for a longer time and therefore has also been cultured more, potentially leading to more cell line artifacts.

One individual, identified as NA12878 in the cohort, was sequenced multiple times: in 2012, 2013, twice in 2018, and in 2020. Dr. Turner showed that the number of DNVs had increased over time; the 2020 sample had the most DNVs, supporting the conclusion that more cell line artifacts were present in the 2020 samples than in the 2012 sample. The team concluded that although the 1000 Genomes Project is an excellent source of data for genomic study, it may not be ideal for filtering datasets for patient controls, due to the prevalence of cell line artifacts.

Though the 1000 Genomes Project provides critical biological and practical insights, only 20% of the children have the expected number of DNVs and considerable evidence indicates that excessive DNVs are cell line artifacts. The excess DNVs match mutation signatures of B-cell lymphoma cancers, demonstrating that cell line artifacts are not accumulating in a random manner. 

Protein-coding DNVs were identified in DNA repair genes and may contribute to the excess DNVs. In the cohort of 602 individuals, protein-coding DNVs were significantly enriched in IGLL5, a gene known to have excess mutations in B-cell lymphomas, and the individuals with these DNVs all have more than 100 DNVs. Protein-coding DNVs were also identified at clinically relevant variant sites, warranting caution in using this data as a binary filtering set for patients. Future genome sequencing studies should focus on either family-based approaches or DNA derived directly from blood to build good controls and reference databases.

Dr. Turner commented, “My lab was excited to develop a de novo variant calling workflow that utilizes GPUs which enabled us to quickly analyze nearly 4,800 whole-genome sequenced parent-child trios to gain important biological insights.” 

An accelerated suite of tools to power genomic research

Clara Parabricks v4.0 is a more focused genomic analysis toolset than previous versions, with rapid alignment, gold-standard processing, and high-accuracy variant calling. It offers the flexibility to freely and seamlessly intertwine GPU and CPU tasks, and it prioritizes GPU acceleration of the most popular and most bottlenecked tools in the genomics workflow. Clara Parabricks can also integrate cutting-edge deep learning approaches in genomics.

Figure 3. The toolset of NVIDIA Clara Parabricks v4.0

You can register to download Clara Parabricks for free. You can also request a free Clara Parabricks NVIDIA LaunchPad Lab demo to experience accelerated industry-standard tools for germline and somatic analysis for an exome and whole genome dataset. 

For more information about Clara Parabricks, including technical details on the tools available, check out the Clara Parabricks documentation.

Categories
Misc

Unearthing Data: Vision AI Startup Digs Into Digital Twins for Mining and Construction

Skycatch, a San Francisco-based startup, has been helping companies mine both data and minerals for nearly a decade. The software-maker is now digging into the creation of digital twins, with an initial focus on the mining and construction industry, using the NVIDIA Omniverse platform for connecting and building custom 3D pipelines. SkyVerse, which is a…

Categories
Misc

Take the Green Train: NVIDIA BlueField DPUs Drive Data Center Efficiency

The numbers are in, and they paint a picture of data centers going a deeper shade of green, thanks to energy-efficient networks accelerated with data processing units (DPUs). A suite of tests run with help from Ericsson, RedHat and VMware show power reductions up to 24% on servers using NVIDIA BlueField-2 DPUs. In one case…
