Categories
Misc

Selecting Large Language Model Customization Techniques

Large language models (LLMs) are becoming an integral tool for businesses to improve their operations, customer interactions, and decision-making processes. However, off-the-shelf LLMs often fall short in meeting the specific needs of enterprises due to industry-specific terminology, domain expertise, or unique requirements.

This is where custom LLMs come into play.

Enterprises need custom models to tailor the language processing capabilities to their specific use cases and domain knowledge. Custom LLMs enable a business to generate and understand text more efficiently and accurately within a certain industry or organizational context.

Custom models empower enterprises to create personalized solutions that align with their brand voice, optimize workflows, provide more precise insights, and deliver enhanced user experiences, ultimately driving a competitive edge in the market.

This post covers various model customization techniques and when to use them. NVIDIA NeMo supports many of the methods.

NVIDIA NeMo is an end-to-end, cloud-native framework to build, customize, and deploy generative AI models anywhere. It includes training and inferencing frameworks, guardrail toolkits, data curation tools, and pretrained models, offering an easy, cost-effective, and fast way to adopt generative AI.

Selecting an LLM customization technique

You can categorize techniques by the trade-offs between dataset size requirements and the level of training effort during customization compared to the downstream task accuracy requirements.

Figure 1. LLM customization techniques available with NVIDIA NeMo

Figure 1 shows the following popular customization techniques:

  • Prompt engineering: Manipulates the prompt sent to the LLM but doesn’t alter the parameters of the LLM in any way. It is light in terms of data and compute requirements.
  • Prompt learning: Uses pairs of prompts and completions to impart task-specific knowledge to LLMs through virtual tokens. This process requires more data and compute but provides better accuracy than prompt engineering.
  • Parameter-efficient fine-tuning (PEFT): Introduces a small number of parameters or layers to existing LLM architecture and is trained with use-case–specific data, providing higher accuracy than prompt engineering and prompt learning, while requiring more training data and compute.
  • Fine-tuning: Involves updating the pretrained LLM weights, unlike the three customization techniques outlined earlier, which keep those weights frozen. This means fine-tuning also requires the most training data and compute of these techniques. However, it provides the highest accuracy for specific use cases, justifying the cost and complexity.

For more information, see An Introduction to Large Language Models: Prompt Engineering and P-Tuning.

Prompt engineering

Prompt engineering involves customization at inference time with show-and-tell examples. An LLM is provided with example prompts and completions, plus detailed instructions, which are prepended to a new prompt to generate the desired completion. The parameters of the model are not changed.

Few-shot prompting: This approach prepends a few sample prompt-and-completion pairs to the prompt, so that the LLM learns how to generate responses for a new, unseen prompt. While few-shot prompting requires relatively little data compared with other customization techniques and does not require fine-tuning, it does add to inference latency.
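As a concrete illustration, here is a minimal Python sketch of assembling a few-shot prompt; the sentiment examples and the query are invented for illustration, and any LLM client would consume the resulting string.

```python
# Minimal sketch: build a few-shot prompt from prompt/completion pairs.
# The sentiment examples and query are invented for illustration.
examples = [
    ("Review: The checkout flow was seamless.", "Sentiment: positive"),
    ("Review: My order arrived two weeks late.", "Sentiment: negative"),
]
query = "Review: Support resolved my issue in minutes."

# Prepend the solved examples so the model can infer the task format,
# then leave the final completion for the model to generate.
prompt = "\n\n".join(f"{p}\n{c}" for p, c in examples)
prompt += f"\n\n{query}\nSentiment:"
```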

Chain-of-thought reasoning: Just as humans decompose bigger problems into smaller ones and apply chain of thought to solve problems effectively, chain-of-thought reasoning is a prompt engineering technique that helps LLMs improve their performance on multi-step tasks. It involves breaking a problem down into simpler steps, each of which requires slow and deliberate reasoning. This approach works well for logical, arithmetic, and deductive reasoning tasks.
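For example, a zero-shot chain-of-thought prompt can be as simple as appending a reasoning cue; this small sketch uses an invented arithmetic word problem.

```python
# Minimal sketch: a zero-shot chain-of-thought prompt. The word problem
# is invented for illustration.
question = (
    "A warehouse holds 128 crates. A truck removes 36 crates and "
    "another delivers 17. How many crates remain?"
)
# The trailing cue encourages the model to emit intermediate steps
# (128 - 36 = 92, then 92 + 17 = 109) before its final answer.
prompt = f"{question}\nLet's think step by step."
```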

System prompting: This approach adds a system-level prompt alongside the user prompt to give the LLM specific and detailed instructions about how to behave. The system prompt acts as an additional input that shapes the LLM’s response; its quality and specificity can have a significant impact on the relevance and accuracy of that response.
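The sketch below shows one common way to express this, using the chat-message convention many LLM APIs accept; the role names and content are illustrative rather than tied to any specific service.

```python
# Minimal sketch: pair a system-level prompt with a user prompt using
# the chat-message convention many LLM APIs accept. Content is invented.
messages = [
    {
        "role": "system",
        "content": (
            "You are a billing assistant for a telecom company. "
            "Answer concisely and cite the relevant plan name."
        ),
    },
    {"role": "user", "content": "Why did my bill increase this month?"},
]
```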

Prompt learning

Prompt learning is an efficient customization method that makes it possible to use pretrained LLMs on many downstream tasks without needing to tune the pretrained model’s full set of parameters. It includes two variations with subtle differences called p-tuning and prompt tuning; both methods are collectively referred to as prompt learning.

Prompt learning enables adding new tasks to LLMs without overwriting or disrupting previous tasks for which the model has already been pretrained. Because the original model parameters are frozen and never altered, prompt learning also avoids catastrophic forgetting issues often encountered when fine-tuning models. Catastrophic forgetting occurs when LLMs learn new behavior during the fine-tuning process at the cost of foundational knowledge gained during LLM pretraining.

Figure 2. Prompt learning applied to LLMs

Instead of selecting discrete text prompts in a manual or automated fashion, prompt tuning and p-tuning use virtual prompt embeddings that you can optimize by gradient descent. These virtual token embeddings stand in contrast to the discrete, hard, or real tokens that make up the model’s vocabulary. Virtual tokens are purely 1D vectors with dimensionality equal to that of each real token embedding. In training and inference, continuous token embeddings are inserted among discrete token embeddings according to a template provided in the model’s config.

Prompt tuning: For a pretrained LLM, soft prompt embeddings are initialized as a 2D matrix of size total_virtual_tokens × hidden_size. Each task that the model is prompt-tuned to perform has its own associated 2D embedding matrix. Tasks do not share any parameters during training or inference. The NeMo framework prompt tuning implementation is based on The Power of Scale for Parameter-Efficient Prompt Tuning.
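The following PyTorch sketch shows the core mechanics under illustrative sizes; it is not the NeMo implementation, just a trainable soft prompt matrix prepended to frozen token embeddings.

```python
import torch
import torch.nn as nn

# Minimal sketch of prompt tuning, assuming a frozen decoder-only LLM
# that exposes its embedding layer; all sizes are illustrative.
total_virtual_tokens, hidden_size, vocab_size = 20, 4096, 32000

embedding = nn.Embedding(vocab_size, hidden_size)  # stands in for frozen base weights
embedding.weight.requires_grad_(False)

# One trainable 2D matrix of soft prompt embeddings per task.
soft_prompt = nn.Parameter(torch.randn(total_virtual_tokens, hidden_size) * 0.02)

input_ids = torch.randint(0, vocab_size, (1, 8))   # a tokenized prompt
token_embeds = embedding(input_ids)
# Prepend virtual token embeddings; only soft_prompt receives gradients.
inputs_embeds = torch.cat(
    [soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1), token_embeds],
    dim=1,
)
```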

P-tuning: An LSTM or MLP model called prompt_encoder is used to predict virtual token embeddings. prompt_encoder parameters are randomly initialized at the start of p-tuning. All base LLM parameters are frozen, and only the prompt_encoder weights are updated at each training step. When p-tuning completes, prompt-tuned virtual tokens from prompt_encoder are automatically moved to prompt_table where all prompt-tuned and p-tuned soft prompts are stored. prompt_encoder is then removed from the model. This enables you to preserve previously p-tuned soft prompts while still maintaining the ability to add new p-tuned or prompt-tuned soft prompts in the future.

prompt_table uses the task name as a key to look up the correct virtual tokens for a specified task. The NeMo framework p-tuning implementation is based on GPT Understands, Too.
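A rough PyTorch sketch of these p-tuning mechanics follows, with illustrative sizes and a simplified encoder; the real NeMo implementation differs in detail.

```python
import torch
import torch.nn as nn

# Minimal sketch of a p-tuning prompt encoder; sizes are illustrative.
total_virtual_tokens, hidden_size = 20, 4096

class PromptEncoder(nn.Module):
    """Predicts virtual token embeddings from trainable seed embeddings."""
    def __init__(self):
        super().__init__()
        self.seed = nn.Parameter(torch.randn(total_virtual_tokens, hidden_size) * 0.02)
        self.lstm = nn.LSTM(hidden_size, hidden_size // 2,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden_size, hidden_size),
                                 nn.ReLU(),
                                 nn.Linear(hidden_size, hidden_size))

    def forward(self):
        out, _ = self.lstm(self.seed.unsqueeze(0))
        return self.mlp(out).squeeze(0)  # (total_virtual_tokens, hidden_size)

# After training, the predicted embeddings are frozen and stored in a
# prompt table keyed by task name; the encoder itself is then discarded.
prompt_table = {"sentiment-task": PromptEncoder()().detach()}
```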

Parameter-efficient fine-tuning

Parameter-efficient fine-tuning (PEFT) techniques use clever optimizations to selectively add and update a small number of parameters or layers in the original LLM architecture. Using PEFT, these parameters are trained for specific use cases: pretrained LLM weights are kept frozen while significantly fewer parameters are updated using domain- and task-specific datasets. This enables LLMs to reach high accuracy on trained tasks.

There are several popular parameter-efficient alternatives to fine-tuning pretrained language models. Unlike prompt learning, these methods do not insert virtual prompts into the input. Instead, they introduce trainable layers into the transformer architecture for task-specific learning. This helps attain strong performance on downstream tasks while reducing the number of trainable parameters by several orders of magnitude (closer to 10,000x fewer parameters) compared to fine-tuning.

  • Adapter Learning
  • Infused Adapter by Inhibiting and Amplifying Inner Activations (IA3)
  • Low-Rank Adaptation (LoRA)

Adapter Learning: Introduces small feed-forward layers between the layers of the core transformer architecture. Only these layers (adapters) are trained at fine-tuning time for specific downstream tasks. The adapter layer generally uses a down-projection W_down to project the input h to a lower-dimensional space, followed by a nonlinear activation function f and an up-projection W_up. A residual connection adds this output back to the input, leading to the final form:

$h \leftarrow h + f(hW_{down})W_{up}$

Adapter modules are usually initialized such that the initial output of the adapter is always zeros to prevent degradation of the original model’s performance due to the addition of such modules. The NeMo framework adapter implementation is based on Parameter-Efficient Transfer Learning for NLP.
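The update above maps directly to a few lines of PyTorch; this is a generic sketch with illustrative sizes, not the NeMo module.

```python
import torch
import torch.nn as nn

# Minimal sketch of an adapter module matching h <- h + f(h @ W_down) @ W_up.
class Adapter(nn.Module):
    def __init__(self, hidden_size=4096, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()
        # Zero-initialize the up-projection so the adapter starts as an
        # identity mapping and does not perturb the pretrained model.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))
```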

IA3: Adds even fewer parameters than adapters, simply scaling the hidden representations in the transformer layer using learned vectors. These scaling parameters can be trained for specific downstream tasks. The learned vectors l_k, l_v, and l_ff respectively rescale the keys and values in the attention mechanism and the inner activations in the position-wise feed-forward network. This technique also makes mixed-task batches possible, because each sequence of activations in the batch can be separately and cheaply multiplied by its associated learned task vector. The NeMo framework IA3 implementation is based on Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning.
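A simplified PyTorch sketch of the rescaling idea follows, showing only the key/value vectors; shapes are illustrative.

```python
import torch
import torch.nn as nn

# Minimal sketch of IA3 rescaling in one attention layer: learned vectors
# l_k and l_v elementwise-rescale keys and values (l_ff would rescale the
# FFN's inner activations the same way). Sizes are illustrative.
head_dim = 128
l_k = nn.Parameter(torch.ones(head_dim))
l_v = nn.Parameter(torch.ones(head_dim))

def ia3_attention_inputs(k, v):
    # k, v: (batch, heads, seq_len, head_dim) from the frozen projections.
    return k * l_k, v * l_v  # broadcast over batch, heads, and sequence

k = torch.randn(1, 8, 16, head_dim)
v = torch.randn(1, 8, 16, head_dim)
k_scaled, v_scaled = ia3_attention_inputs(k, v)
```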

Figure 3. LoRA for parameter-efficient fine-tuning

LoRA: Injects trainable low-rank matrices into transformer layers to approximate weight updates. Instead of updating the full pretrained weight matrix W, LoRA updates its low-rank decomposition, reducing the number of trainable parameters by a factor of 10,000 and the GPU memory requirements by 3x compared to fine-tuning. This update is applied to the query and value projection weight matrices in the multi-head attention sub-layer. Updating the low-rank decomposition instead of the entire matrix has been shown to match or exceed fine-tuning in model quality, enabling higher training throughput with no additional inference latency.

The NeMo framework LoRA implementation is based on Low-Rank Adaptation of Large Language Models. For more information about how to apply LoRA to an extractive QA task, see the LoRA tutorial notebook.
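For intuition, here is a generic PyTorch sketch of a LoRA-augmented linear layer; the rank, scaling, and sizes are illustrative, and this is not the NeMo implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of LoRA: the frozen weight W is adapted through a
# trainable low-rank update B @ A. Sizes are illustrative.
class LoRALinear(nn.Module):
    def __init__(self, in_features=4096, out_features=4096, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)       # frozen pretrained W
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))  # zero-init:
        self.scaling = alpha / rank                  # no change at start of training

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling
```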

Fine-tuning

When data and compute resources have no hard constraints, customization techniques such as supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) are great alternative approaches to PEFT and prompt engineering. Fine-tuning can help achieve the best accuracy on a range of use cases as compared to other customization approaches.

Supervised fine-tuning: SFT is the process of fine-tuning all the model’s parameters on labeled data of inputs and outputs that teaches the model domain-specific terms and how to follow user-specified instructions. It is typically done after model pretraining. Using pretrained models enables many benefits that include the use of state-of-the-art models without having to train from scratch, reduced computation costs, and reduced data collection needs as compared to the pretraining stage. A form of SFT is referred to as instruction tuning because it involves fine-tuning language models on a collection of datasets described through instructions.

Figure 4. Supervised fine-tuning with labeled instruction-following data

SFT with instructions leverages the intuition that NLP tasks can be described through natural language instructions, such as “Summarize the following article into three sentences.” or “Write an email in Spanish about an upcoming school festival.” This method successfully combines the strengths of fine-tuning and prompting paradigms to improve LLM zero-shot performance at inference time.
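As a sketch of how such instruction data might be organized for SFT, the snippet below converts hypothetical instruction records into prompt/completion pairs; the records, field names, and template are invented for illustration.

```python
# Minimal sketch: format instruction-following examples into
# prompt/completion pairs for SFT. Records and template are invented.
records = [
    {"instruction": "Summarize the following article into three sentences.",
     "input": "<article text>", "output": "<three-sentence summary>"},
    {"instruction": "Write an email in Spanish about an upcoming school festival.",
     "input": "", "output": "<email text>"},
]

def to_training_pair(rec):
    prompt = rec["instruction"]
    if rec["input"]:                      # optional context follows the instruction
        prompt += f"\n\n{rec['input']}"
    return {"prompt": prompt, "completion": rec["output"]}

sft_dataset = [to_training_pair(r) for r in records]
```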

The instruction tuning process involves performing fine-tuning on the pretrained model on a mixture of several NLP datasets expressed through natural language instructions that are blended in varying proportions. At inference time, the fine-tuned model is evaluated on unseen tasks and this process is known to substantially improve zero-shot performance on unseen tasks. SFT is also an important intermediary step in the process of improving LLM capabilities using reinforcement learning, which we describe next.

Reinforcement learning with human feedback: Reinforcement learning with human feedback (RLHF) is a customization technique that enables LLMs to achieve better alignment with human values and preferences. It uses reinforcement learning to enable the model to adapt its behavior based on the feedback it receives. It involves a three-stage fine-tuning process that uses human preference as the loss function. The SFT model fine-tuned with instructions as described in the earlier section is considered the first stage in the RLHF technique.

Figure 5. Aligning LLM behavior with human preferences using reinforcement learning

The SFT model is trained as a reward model (RM) in stage 2 of RLHF. A dataset consisting of prompts with multiple responses ranked by humans is used to train the RM to predict human preference.
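A common way to train such an RM is with a pairwise ranking loss over human preferences; the sketch below shows that objective with invented reward values, standing in for the scores an RM head would produce for the preferred ("chosen") and less-preferred ("rejected") responses to the same prompts.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a stage-2 reward model objective: maximize the margin
# between rewards for chosen and rejected responses. Values are invented.
reward_chosen = torch.tensor([1.3, 0.4])
reward_rejected = torch.tensor([0.2, 0.9])

# Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected), averaged.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
```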

After the RM is trained, stage 3 of RLHF focuses on fine-tuning the initial policy model against the RM using reinforcement learning with a proximal policy optimization (PPO) algorithm. These three stages of RLHF performed iteratively enable LLMs to generate outputs that are more aligned with human preferences and can follow instructions more effectively.

While RLHF results in powerful LLMs, the downside is that this method can be misused and exploited to generate undesirable or harmful content. The NeMo method uses the PPO value network as a critic model to guide the LLMs away from generating harmful content. There are other approaches being actively explored in the research community to steer the LLMs towards appropriate behavior and reduce toxic generation or hallucinations where LLMs make up facts.

Customize your LLMs

This post covered various model customization techniques and when to use them. Many of those methods are supported by NVIDIA NeMo.

NeMo provides an accelerated workflow for training with 3D parallelism techniques. It offers a choice of several customization techniques and is optimized for at-scale inference of large-scale models for language and image applications, with multi-GPU and multi-node configurations.

Download the NeMo framework today and customize pretrained LLMs on your preferred on-premises and cloud platforms.

Categories
Misc

Just Released: NVIDIA Modulus 23.08

An abstract image representing Modulus.NVIDIA Modulus is now part of the NVIDIA AI Enterprise suite, supporting PyTorch 2.0, CUDA 12, and new samples.An abstract image representing Modulus.

NVIDIA Modulus is now part of the NVIDIA AI Enterprise suite, supporting PyTorch 2.0, CUDA 12, and new samples.

Categories
Offsites

Advances in document understanding

The last few years have seen rapid progress in systems that can automatically process complex business documents and turn them into structured objects. A system that can automatically extract data from documents, e.g., receipts, insurance quotes, and financial statements, has the potential to dramatically improve the efficiency of business workflows by avoiding error-prone, manual work. Recent models, based on the Transformer architecture, have shown impressive gains in accuracy. Larger models, such as PaLM 2, are also being leveraged to further streamline these business workflows. However, the datasets used in academic literature fail to capture the challenges seen in real-world use cases. Consequently, academic benchmarks report strong model accuracy, but these same models do poorly when used for complex real-world applications.

In “VRDU: A Benchmark for Visually-rich Document Understanding”, presented at KDD 2023, we announce the release of the new Visually Rich Document Understanding (VRDU) dataset that aims to bridge this gap and help researchers better track progress on document understanding tasks. We list five requirements for a good document understanding benchmark, based on the kinds of real-world documents for which document understanding models are frequently used. Then, we describe how most datasets currently used by the research community fail to meet one or more of these requirements, while VRDU meets all of them. We are excited to announce the public release of the VRDU dataset and evaluation code under a Creative Commons license.

Benchmark requirements

First, we compared state-of-the-art model accuracy (e.g., with FormNet and LayoutLMv2) on real-world use cases to academic benchmarks (e.g., FUNSD, CORD, SROIE). We observed that state-of-the-art models did not match academic benchmark results and delivered much lower accuracy in the real world. Next, we compared typical datasets for which document understanding models are frequently used with academic benchmarks and identified five dataset requirements that allow a dataset to better capture the complexity of real-world applications:

  • Rich Schema: In practice, we see a wide variety of rich schemas for structured extraction. Entities have different data types (numeric, strings, dates, etc.) that may be required, optional, or repeated in a single document or may even be nested. Extraction tasks over simple flat schemas like (header, question, answer) do not reflect typical problems encountered in practice.
  • Layout-Rich Documents: The documents should have complex layout elements. Challenges in practical settings come from the fact that documents may contain tables, key-value pairs, switch between single-column and double-column layouts, have varying font sizes for different sections, include pictures with captions, and even footnotes. Contrast this with datasets where most documents are organized in sentences, paragraphs, and chapters with section headers — the kinds of documents that are typically the focus of classic natural language processing literature on long inputs.
  • Diverse Templates: A benchmark should include different structural layouts or templates. It is trivial for a high-capacity model to extract from a particular template by memorizing the structure. However, in practice, one needs to be able to generalize to new templates/layouts, an ability that the train-test split in a benchmark should measure.
  • High-Quality OCR: Documents should have high-quality Optical Character Recognition (OCR) results. Our aim with this benchmark is to focus on the VRDU task itself and to exclude the variability brought on by the choice of OCR engine.
  • Token-Level Annotation: Documents should contain ground-truth annotations that can be mapped back to the corresponding input text, so that each token can be annotated as part of the corresponding entity. This is in contrast with simply providing the text of the value to be extracted for the entity, and it is key to generating clean training data where we do not have to worry about incidental matches to the given value. For instance, in some receipts, the ‘total-before-tax’ field may have the same value as the ‘total’ field if the tax amount is zero. Having token-level annotations prevents us from generating training data where both instances of the matching value are marked as ground truth for the ‘total’ field, thus producing noisy examples, as sketched below.
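To make the receipt example concrete, consider a hypothetical record format in which each entity stores the indices of the OCR tokens it covers rather than just its surface string; the field names and values below are invented, not the VRDU schema.

```python
# Hypothetical token-level annotation record (not the VRDU schema).
# Each entity points at token indices, so the two occurrences of
# "$120.00" are unambiguous even though their text is identical.
document = {
    "tokens": ["Subtotal:", "$120.00", "Tax:", "$0.00", "Total:", "$120.00"],
    "entities": [
        {"type": "total-before-tax", "token_indices": [1]},
        {"type": "total", "token_indices": [5]},
    ],
}
```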

VRDU datasets and tasks

The VRDU dataset is a combination of two publicly available datasets, Registration Forms and Ad-Buy Forms. These datasets provide examples that are representative of real-world use cases and satisfy the five benchmark requirements described above.

The Ad-Buy Forms dataset consists of 641 documents with political advertisement details. Each document is either an invoice or receipt signed by a TV station and a campaign group. The documents use tables, multi-columns, and key-value pairs to record the advertisement information, such as the product name, broadcast dates, total price, and release date and time.

The Registration Forms dataset consists of 1,915 documents with information about foreign agents registering with the US government. Each document records essential information about foreign agents involved in activities that require public disclosure. Contents include the name of the registrant, the address of related bureaus, the purpose of activities, and other details.

We gathered a random sample of documents from the public Federal Communications Commission (FCC) and Foreign Agents Registration Act (FARA) sites, and converted the images to text using Google Cloud’s OCR. We discarded a small number of documents that were several pages long and the processing did not complete in under two minutes. This also allowed us to avoid sending very long documents for manual annotation — a task that can take over an hour for a single document. Then, we defined the schema and corresponding labeling instructions for a team of annotators experienced with document-labeling tasks.

The annotators were also provided with a few sample labeled documents that we labeled ourselves. The task required annotators to examine each document, draw a bounding box around every occurrence of an entity from the schema for each document, and associate that bounding box with the target entity. After the first round of labeling, a pool of experts was assigned to review the results. The corrected results are included in the published VRDU dataset. Please see the paper for more details on the labeling protocol and the schema for each dataset.

Existing academic benchmarks (FUNSD, CORD, SROIE, Kleister-NDA, Kleister-Charity, DeepForm) fall short on one or more of the five requirements we identified for a good document understanding benchmark. VRDU satisfies all of them. See our paper for background on each of these datasets and a discussion of how they fail to meet one or more of the requirements.

We built four different model training sets with 10, 50, 100, and 200 samples respectively. Then, we evaluated the VRDU datasets using three tasks (described below): (1) Single Template Learning, (2) Mixed Template Learning, and (3) Unseen Template Learning. For each of these tasks, we included 300 documents in the testing set. We evaluate models using the F1 score on the testing set.

  • Single Template Learning (STL): This is the simplest scenario where the training, testing, and validation sets only contain a single template. This simple task is designed to evaluate a model’s ability to deal with a fixed template. Naturally, we expect very high F1 scores (0.90+) for this task.
  • Mixed Template Learning (MTL): This task is similar to the task that most related papers use: the training, testing, and validation sets all contain documents belonging to the same set of templates. We randomly sample documents from the datasets and construct the splits to make sure the distribution of each template is not changed during sampling.
  • Unseen Template Learning (UTL): This is the most challenging setting, where we evaluate if the model can generalize to unseen templates. For example, in the Registration Forms dataset, we train the model with two of the three templates and test the model with the remaining one. The documents in the training, testing, and validation sets are drawn from disjoint sets of templates. To our knowledge, previous benchmarks and datasets do not explicitly provide such a task designed to evaluate the model’s ability to generalize to templates not seen during training.

The objective is to be able to evaluate models on their data efficiency. In our paper, we compared two recent models using the STL, MTL, and UTL tasks and made three observations. First, unlike with other benchmarks, VRDU is challenging and shows that models have plenty of room for improvement. Second, we show that few-shot performance for even state-of-the-art models is surprisingly low, with even the best models achieving an F1 score below 0.60. Third, we show that models struggle to deal with structured repeated fields and perform particularly poorly on them.

Conclusion

We release the new Visually Rich Document Understanding (VRDU) dataset that helps researchers better track progress on document understanding tasks. We describe why VRDU better reflects practical challenges in this domain. We also present experiments showing that VRDU tasks are challenging and that recent models have substantial headroom for improvement, in contrast to the datasets typically used in the literature, where F1 scores of 0.90+ are typical. We hope the release of the VRDU dataset and evaluation code helps research teams advance the state of the art in document understanding.

Acknowledgements

Many thanks to Zilong Wang, Yichao Zhou, Wei Wei, and Chen-Yu Lee, who co-authored the paper along with Sandeep Tata. Thanks to Marc Najork, Riham Mansour and numerous partners across Google Research and the Cloud AI team for providing valuable insights. Thanks to John Guilyard for creating the animations in this post.

Categories
Misc

Speed Up GPU Crash Debugging with NVIDIA Nsight Aftermath

NVIDIA Nsight Developer Tools provide comprehensive access to NVIDIA GPUs and graphics APIs for performance analysis, optimization, and debugging activities. When using advanced rendering techniques like ray tracing or path tracing, Nsight tools are your companion for creating a smooth and polished experience. 

At SIGGRAPH 2023, NVIDIA hosted a lab exploring how to use NVIDIA Nsight Tools to debug and profile ray tracing applications. New versions of the NVIDIA Nsight Aftermath SDK, NVIDIA Nsight Graphics, and NVIDIA Nsight Systems are also available. For more information on Nsight Tools released at SIGGRAPH, check out the latest video on NVIDIA Graphics Tools.

This post explores how Nsight Aftermath SDK 2023.2 speeds up GPU crash debugging with improved event marker performance.  

Nsight Aftermath SDK GPU crash postmortem analysis

Few issues are as pressing as a GPU crash, which can abruptly block development progress until resolved. Developers and end users alike find these crashes frustrating, especially when they can’t capture useful debugging information from the GPU pipeline at the moment of failure. To shed light on hidden exceptions, the Nsight Aftermath SDK opens a window into the GPU at the moment a game fails. This helps pinpoint the source of the issue and guides the developer in resolving it.

The Nsight Aftermath SDK generates GPU crash dump files that load into NVIDIA Nsight Graphics to visualize the GPU state—revealing MMU fault information, warp details, problematic shader source, and more. Integrating Aftermath into existing crash reporters also provides more granular pipeline dumps from end users’ machines, yielding actionable reports. Today’s update to the Nsight Aftermath SDK improves the contextual data provided through low-overhead, application-specific markers.

Event marker performance has been enhanced in the Nsight Aftermath SDK for DirectX 12 applications. You can insert these markers into your CPU code at desired intervals, and the significantly reduced overhead makes them usable in shipping applications. Markers are written to the Aftermath crash dump file, indicating where in the application’s frame a GPU exception occurred. With this information, you can determine the workload executing on the GPU and view what shaders were in use at the time of the crash.

The 2023.2 version of the Nsight Aftermath SDK also supports collecting and displaying shader register values to aid in debugging streaming multiprocessor (SM) exceptions. On the SM, registers store the results of instructions as they are executing. This data is particularly relevant to determining the source of a crash if a shader workload triggered the failure. After being written to an Nsight Aftermath dump file, you can inspect the register values for faulting threads in Nsight Graphics. This helps you determine where and why the shader execution failed.

Figure 1. Nsight Aftermath SDK exposes shader register values that correspond to the line of shader source code that caused an exception

SM register data is now available for DirectX 12 and Vulkan applications. Note that viewing this data requires NVIDIA Nsight Graphics Pro. Coordinate with your NVIDIA Developer Technologies or Developer Relations contact, or reach out, to request access.

Nsight Aftermath is also now compatible with the latest applications using cutting-edge DirectX 12 features through the DirectX Agility SDK.

Getting Started with Nsight Aftermath SDK and event markers

Getting started with the SDK is easy. Here are some tips to help you use GPU crash dumps and event markers. More information is included in the Read Me section of the download.

  1. Download Nsight Aftermath SDK 2023.2.
  2. Enable GPU crash dump creation by calling GFSDK_Aftermath_EnableGpuCrashDumps. Note that crash dumps aren’t generated for devices created before that call, so make sure it’s enabled first.
  3. Set the Nsight Aftermath options to control what information is captured.
    For example, you can enable the shader debug information and runtime shader error reporting flags when you initialize Nsight Aftermath for the device.

    Tip: To use event markers, make sure that the event marker flag is enabled at this step. You can also use the Nsight Aftermath Monitor application to enable SM register collection.

Figure 2. The Nsight Aftermath Monitor, included in both the SDK and Nsight Graphics, is the command center for collecting crash information
  4. When your GPU crash dump has been collected, open it with Nsight Graphics for rich data visualization. Nsight Graphics will help you analyze the crash and determine how to resolve it.

Tip: The Aftermath API provides a simple and lightweight solution for inserting event markers on the GPU timeline. To keep CPU overhead to a minimum, you can set dataSize=0 to instruct Aftermath to rely on the application to manage and resolve marker data itself.

Download NVIDIA Nsight Developer Tools

Download all of the new Nsight Developer Tools announced at SIGGRAPH.

Dive deeper or ask questions in Developer Tools forums or learn more about graphics development with Nsight Tools at SIGGRAPH 2023.

Categories
Misc

Scale XR Workflows with NVIDIA CloudXR Suite

NVIDIA is providing developers with an advanced platform to create scalable, branded, custom extended reality (XR) products with the new NVIDIA CloudXR Suite.

Built on a new architecture, NVIDIA CloudXR Suite is a major step forward in scaling the XR ecosystem. It provides a platform for developers, professionals, and enterprise teams to flexibly orchestrate and scale XR workloads across operating systems, including Windows virtual machines and Linux-based systems such as containers.

With the NVIDIA CloudXR streaming stack, users can build flexible, high-performance cloud solutions capable of streaming the most demanding immersive experiences. Teams can also access NVIDIA streaming technology to effectively manage the quality of the streaming across large public and private networks, including the internet.

Figure 1. NVIDIA CloudXR Suite consists of three components: CloudXR Essentials, CloudXR Server Extensions, and CloudXR Client Extensions

Immersive content developers face the challenge of supporting both tethered devices, driven by high-powered graphics cards, and mobile devices with limited graphics power.

Using NVIDIA CloudXR, developers can create high-quality versions of applications—built to take advantage of powerful GPUs—and still target users with mobile XR devices using NVIDIA CloudXR streaming. 

Additionally, cloud service providers (CSPs), orchestrators, and system integrators can extend GPU services with interactive graphics to support next-generation XR applications. 

NVIDIA CloudXR Suite is composed of three components: CloudXR Essentials, CloudXR Server Extensions, and CloudXR Client Extensions. 

CloudXR Essentials provides the underlying streaming layer, complete with new improvements such as 5G L4S optimizations, QoS algorithms, and enhanced logging tools. Essentials also includes the SteamVR plug-in, along with sample clients and a new server-side API that can be directly integrated into XR applications. This removes the need for a separate XR runtime.

CloudXR Server Extensions extend server-side interfaces with a source code addition to Collabora’s Monado OpenXR runtime. The new CloudXR Server API contained in CloudXR Essentials, plus the OpenXR API, represent the gateway to scaling XR distribution for orchestration partners.

CloudXR Client Extensions empower users to build custom CloudXR client applications. The first offering is a Unity plug-in for CloudXR. With this plug-in, Unity app developers can more easily build applications with branded custom interfaces and lobbies before connecting to the CloudXR streaming server. 

“The new CloudXR Server Extensions bring more opportunities for software developers to use Monado’s OpenXR runtime to build next generation immersive experiences,” said Frédéric Plourde, XR Lead at Collabora.

High-quality XR streaming for clients

PureWeb helps organizations across industries adopt real-time 3D technology to improve operations. Using NVIDIA CloudXR, PureWeb shows customers how to deliver complex, immersive XR workloads at scale, on their current devices.

“We want to provide our customers with access to the GPU resources and streaming technologies they need, so they can share immersive experiences without worrying about not having enough computing resources,” said Chris Jarabek, VP of Product Development at PureWeb. “With this new advancement through CloudXR Suite, we can better scale with OpenXR and build with that.” 

The team at Innoactive has integrated NVIDIA CloudXR into their VR application deployment platform, Innoactive Portal. Many Innoactive customers are using the platform to provide high-quality immersive training to users, wherever they are.

“Many of our customers are building applications and plan to stream them to all-in-one headsets or other mobile XR devices,” said Daniel Seidl, CEO of Innoactive. “With the Unity plug-in, our customers can now stream from AWS and Microsoft Azure with CloudXR, making XR streaming even more accessible from the cloud.”

Learn more and download NVIDIA CloudXR Suite

Categories
Offsites

AdaTape: Foundation model with adaptive computation and dynamic read-and-write

Adaptive computation refers to the ability of a machine learning system to adjust its behavior in response to changes in the environment. While conventional neural networks have a fixed function and computation capacity, i.e., they spend the same number of FLOPs for processing different inputs, a model with adaptive and dynamic computation modulates the computational budget it dedicates to processing each input, depending on the complexity of the input.

Adaptive computation in neural networks is appealing for two key reasons. First, the mechanism that introduces adaptivity provides an inductive bias that can play a key role in solving some challenging tasks. For instance, enabling different numbers of computational steps for different inputs can be crucial in solving arithmetic problems that require modeling hierarchies of different depths. Second, it gives practitioners the ability to tune the cost of inference through greater flexibility offered by dynamic computation, as these models can be adjusted to spend more FLOPs processing a new input.

Neural networks can be made adaptive by using different functions or computation budgets for various inputs. A deep neural network can be thought of as a function that outputs a result based on both the input and its parameters. To implement adaptive function types, a subset of parameters are selectively activated based on the input, a process referred to as conditional computation. Adaptivity based on the function type has been explored in studies on mixture-of-experts, where the sparsely activated parameters for each input sample are determined through routing.

Another area of research in adaptive computation involves dynamic computation budgets. Unlike in standard neural networks, such as T5, GPT-3, PaLM, and ViT, whose computation budget is fixed for different samples, recent research has demonstrated that adaptive computation budgets can improve performance on tasks where transformers fall short. Many of these works achieve adaptivity by using dynamic depth to allocate the computation budget. For example, the Adaptive Computation Time (ACT) algorithm was proposed to provide an adaptive computational budget for recurrent neural networks. The Universal Transformer extends the ACT algorithm to transformers by making the computation budget dependent on the number of transformer layers used for each input example or token. Recent studies, like PonderNet, follow a similar approach while improving the dynamic halting mechanisms.

In the paper “Adaptive Computation with Elastic Input Sequence”, we introduce a new model that utilizes adaptive computation, called AdaTape. This model is a Transformer-based architecture that uses a dynamic set of tokens to create elastic input sequences, providing a unique perspective on adaptivity in comparison to previous works. AdaTape uses an adaptive tape reading mechanism to determine a varying number of tape tokens that are added to each input based on the input’s complexity. AdaTape is very simple to implement, provides an effective knob to increase accuracy when needed, and is much more efficient than other adaptive baselines because it directly injects adaptivity into the input sequence instead of the model depth. Finally, AdaTape offers better performance on standard tasks, like image classification, as well as algorithmic tasks, while maintaining a favorable quality and cost tradeoff.

Adaptive computation transformer with elastic input sequence

AdaTape uses both the adaptive function types and a dynamic computation budget. Specifically, for a batch of input sequences after tokenization (e.g., a linear projection of non-overlapping patches from an image in the vision transformer), AdaTape uses a vector representing each input to dynamically select a variable-sized sequence of tape tokens.

AdaTape uses a bank of tokens, called a “tape bank”, to store all the candidate tape tokens that interact with the model through the adaptive tape reading mechanism. We explore two different methods for creating the tape bank: an input-driven bank and a learnable bank.

The general idea of the input-driven bank is to extract a bank of tokens from the input while employing a different approach than the original model tokenizer for mapping the raw input to a sequence of input tokens. This enables dynamic, on-demand access to information from the input that is obtained using a different point of view, e.g., a different image resolution or a different level of abstraction.

In some cases, tokenization at a different level of abstraction is not possible, so an input-driven tape bank is not feasible; for example, it’s difficult to further split each node in a graph transformer. To address this issue, AdaTape offers a more general approach for generating the tape bank by using a set of trainable vectors as tape tokens. This approach is referred to as the learnable bank and can be viewed as an embedding layer where the model can dynamically retrieve tokens based on the complexity of the input example. The learnable bank enables AdaTape to generate a more flexible tape bank, providing it with the ability to dynamically adjust its computation budget based on the complexity of each input example. For example, more complex examples retrieve more tokens from the bank, which lets the model not only use the knowledge stored in the bank, but also spend more FLOPs processing it, since the input is now larger.

Finally, the selected tape tokens are appended to the original input and fed to the following transformer layers. For each transformer layer, the same multi-head attention is used across all input and tape tokens. However, two different feed-forward networks (FFN) are used: one for all tokens from the original input and the other for all tape tokens. We observed slightly better quality by using separate feed-forward networks for input and tape tokens.
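The PyTorch sketch below captures the flavor of reading from a learnable tape bank; the scoring rule, the fixed number of selected tokens, and all sizes are simplifications of the paper’s adaptive tape reading mechanism, which selects a variable-length sequence per input.

```python
import torch
import torch.nn as nn

# Minimal sketch of tape token selection from a learnable bank, loosely
# following AdaTape's idea. Sizes and the scoring rule are illustrative;
# a fixed k is used here, whereas AdaTape reads a variable number.
bank_size, hidden_size, k = 256, 768, 8
tape_bank = nn.Parameter(torch.randn(bank_size, hidden_size) * 0.02)

def read_tape(input_tokens):
    # input_tokens: (seq_len, hidden_size). Score bank entries against a
    # query summarizing the input, then append the top-k tape tokens.
    query = input_tokens.mean(dim=0)        # (hidden_size,)
    scores = tape_bank @ query              # (bank_size,)
    top = scores.topk(k).indices
    return torch.cat([input_tokens, tape_bank[top]], dim=0)

extended = read_tape(torch.randn(196, hidden_size))  # 196 patches + 8 tape tokens
```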

An overview of AdaTape. For different samples, we pick a variable number of different tokens from the tape bank. The tape bank can be driven from input, e.g., by extracting some extra fine-grained information or it can be a set of trainable vectors. Adaptive tape reading is used to recursively select different sequences of tape tokens, with variable lengths, for different inputs. These tokens are then simply appended to inputs and fed to the transformer encoder.

AdaTape provides helpful inductive bias

We evaluate AdaTape on parity, a very challenging task for the standard Transformer, to study the effect of inductive biases in AdaTape. In the parity task, given a sequence of 1s, 0s, and -1s, the model has to predict the evenness or oddness of the number of 1s in the sequence. Parity is the simplest non-counter-free or periodic regular language, but perhaps surprisingly, the task is unsolvable by the standard Transformer.
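Generating parity examples is straightforward, as the short sketch below shows; the sequence lengths are arbitrary.

```python
import random

# Minimal sketch of generating parity-task examples as described above:
# the label is whether the count of 1s in the sequence is even or odd.
def make_example(max_len=16):
    seq = [random.choice([1, 0, -1]) for _ in range(random.randint(1, max_len))]
    label = seq.count(1) % 2  # 0 = even number of 1s, 1 = odd
    return seq, label

print(make_example())
```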

Evaluation on the parity task. The standard Transformer and Universal Transformer were unable to perform this task, both showing performance at the level of a random guessing baseline.

Despite being evaluated on short, simple sequences, both the standard Transformer and the Universal Transformer were unable to perform the parity task, as they cannot maintain a counter within the model. However, AdaTape outperforms all baselines by incorporating a lightweight recurrence within its input selection mechanism, providing an inductive bias that enables the implicit maintenance of a counter, which is not possible in standard Transformers.

Evaluation on image classification

We also evaluate AdaTape on the image classification task. To do so, we trained AdaTape on ImageNet-1K from scratch. The figure below shows the accuracy of AdaTape and the baseline methods, including A-ViT and the Universal Transformer ViT (UViT and U2T), versus their speed (measured as the number of images processed per second). In terms of quality and cost tradeoff, AdaTape performs much better than the alternative adaptive transformer baselines. In terms of efficiency, larger AdaTape models (in terms of parameter count) are faster than smaller baselines. Such results are consistent with findings from previous work showing that adaptive model depth architectures are not well suited for many accelerators, like the TPU.

We evaluate AdaTape by training on ImageNet from scratch. For A-ViT, we not only report their results from the paper but also re-implement A-ViT by training from scratch, i.e., A-ViT(Ours).

A study of AdaTape’s behavior

In addition to its performance on the parity task and ImageNet-1K, we also evaluated the token selection behavior of AdaTape with an input-driven bank on the JFT-300M validation set. To better understand the model’s behavior, we visualized the token selection results on the input-driven bank as heatmaps, where lighter colors mean that position is more frequently selected. The heatmaps reveal that AdaTape more frequently picks the central patches. This aligns with our prior knowledge, as central patches are typically more informative — especially in the context of datasets with natural images, where the main object is in the middle of the image. This result highlights the intelligence of AdaTape, as it can effectively identify and prioritize more informative patches to improve its performance.

We visualize the tape token selection heatmap of AdaTape-B/32 (left) and AdaTape-B/16 (right). The hotter / lighter color means the patch at this position is more frequently selected.

Conclusion

AdaTape is characterized by elastic sequence lengths generated by the adaptive tape reading mechanism. This also introduces a new inductive bias that enables AdaTape to have the potential to solve tasks that are challenging for both standard transformers and existing adaptive transformers. By conducting comprehensive experiments on image recognition benchmarks, we demonstrate that AdaTape outperforms standard transformers and adaptive architecture transformers when computation is held constant.

Acknowledgments

One of the authors of this post, Mostafa Dehghani, is now at Google DeepMind.

Categories
Misc

Shutterstock Brings Generative AI to 3D Scene Backgrounds With NVIDIA Picasso

Picture this: Creators can quickly create and customize 3D scene backgrounds with the help of generative AI, thanks to cutting-edge tools from Shutterstock. The visual-content provider is building services using NVIDIA Picasso — a cloud-based foundry for developing generative AI models for visual design. The work incorporates Picasso’s latest feature — announced today during NVIDIA…

Categories
Misc

Content Creation ‘In the NVIDIA Studio’ Gets Boost From New Professional GPUs, AI Tools, Omniverse and OpenUSD Collaboration Features

AI and accelerated computing were in the spotlight at SIGGRAPH — the world’s largest gathering of computer graphics experts — as NVIDIA founder and CEO Jensen Huang announced during his keynote address updates to NVIDIA Omniverse, a platform for building and connecting 3D tools and applications, as well as acceleration for Universal Scene Description (known as OpenUSD), the open and extensible ecosystem for 3D worlds.

Categories
Misc

NVIDIA Omniverse Opens Portals to Vast Worlds of OpenUSD

New Omniverse Cloud APIs Help Developers Adopt OpenUSD; Generative AI Model ChatUSD LLM Converses in USD; RunUSD Translates USD to Interactive Graphics, DeepSearch LLM Enables Semantic 3D …

Categories
Misc

Extended Cut: NVIDIA Expands Maxine for Video Editing, Showcases 3D Virtual Conferencing Research

Professionals, teams, creators and others can tap into the power of AI to create high-quality audio and video effects — even using standard microphones and webcams — with the help of NVIDIA Maxine. The suite of GPU-accelerated software development kits and cloud-native microservices lets users deploy AI features that enhance audio, video and augmented-reality effects.