NVIDIA Jetson Project of the Month: This Autonomous Soccer Robot Can Aim, Shoot, and Score


Soccer is considered one of the most popular sports around the world. And with good reason: the action is often intense, and the game combines both physicality and skill from the players that can be thrilling to watch. So it should come as no surprise that there are folks out there who are working to teach robots the finer points of the game, including how to gather the ball, line up a shot, pass, and score a goal. 

In fact, an entire competition is devoted to this very idea. The RoboCup Small Size League (SSL) Vision Blackout Technical Challenge encourages teams to “explore local sensing and processing rather than the typical approach of an off-board computer and a global set of cameras sensing the environment.” Student João Guilherme, his instructor Edna Barros, and other SSL teammates from the Federal University of Pernambuco in Recife, Brazil built an omnidirectional robot powered by the NVIDIA Jetson Nano Developer Kit to execute soccer tasks autonomously. 

The team built their omnidirectional robot with a monocular camera that can autonomously perform the following tasks:

  • Localization
  • Soccer ball detection and grabbing
  • Coordinate calculation
  • Passing the ball to other team robots
  • Scoring on an empty goal

The team built the robot with an AI software pipeline running at an average processing speed of 30 FPS, with the hardware consuming only around 10.8 W of power.

The robot has a kicking device on its front and is a four-wheeled omnidirectional robot. Figure 1 shows the geometry of the robot.

Figure 1. The movement capabilities of the omnidirectional robot powered by the NVIDIA Jetson Nano Developer Kit to execute soccer tasks autonomously

“We evaluate our system on three soccer tasks: grabbing a ball, scoring a goal, and passing the ball, achieving 80%, 80%, and 46.7% success rates, respectively,” the team explains in Towards an Autonomous RoboCup Small Size League Robot.

During tournament play, teams use off-field computers to execute most of the computation, receiving the position of the ball, field geometry information, and referee commands. The matches are played between teams of six (division B) or 11 (division A) robots, and the robots receive navigation commands through RF communication with minimal bandwidth. The diameter and height of the robots are limited to 180 millimeters and 150 millimeters, respectively, hence the name Small Size League. 

The SSL RoboCup competitions include four stages:

  1. Grab a stationary ball somewhere on the field
  2. Score with the ball on an empty goal
  3. Move the robot to specific coordinates
  4. Score an indirect goal (two robots required) 

In addition, this challenge requires the robot to detect objects in the field, estimate their position, compute navigation paths, and keep records of past trajectories.

“SSL matches are highly dynamic environments with extremely resource-constrained robots, requiring solutions to consider size, power consumption, accuracy, and processing speed trade-offs. This work presents an architecture that enables these robots to execute basic soccer tasks autonomously, that is, without receiving any external information,” according to Guilherme and his teammates in Towards an Autonomous RoboCup Small Size League Robot.

Project hardware

The team used the following hardware in their project: 

  • A Jetson Nano Developer Kit, to perform embedded vision and decision making 
  • An omnidirectional robot
  • A Logitech C922 camera, to provide monocular vision 
  • Inertial sensors, to implement odometry estimation 
  • An STM32F767ZI microcontroller unit (MCU), to receive target relative positions and navigation flags from the Nano and execute low-level control and trajectory estimation using inertial odometry
Figure 2. The AI detection pipeline and movement planning of the soccer robot

For more information about the hardware used, see RobôCIn 2020 Team Description Paper.

Technical challenges 

During the competition’s Vision Blackout Challenge, the winning robot must be able to complete a variety of soccer-based skills, including grabbing a stationary ball, scoring on an empty goal, moving to specific coordinates, and scoring an indirect goal (passing to another robot). 

The robot must be able to perform these skills using only embedded sensing and processing. There are no height restrictions for this challenge, so the team added an onboard camera, the Jetson Nano, and a power supply board on top of their typical robot. 

Figure 3. The team’s soccer-playing robot modified for the Vision Blackout Challenge (left) and their original robot (right) 

In addition, this challenge requires the robot to detect objects in the field, estimate their position, compute navigation paths, and keep records of past trajectories. The SSL soccer matches make use of external cameras and offboard computers for perceiving the environment and sending commands to the robots. 

According to the researchers, the SSL Vision architecture “presents limitations such as the camera’s field-of-view, color segmentation, software latency, and communication dropouts, forcing teams to develop solutions for dealing with complex conditions. For example, one common problem during matches is ball occlusion, which occurs when a robot’s projection on the camera image overlaps the ball. Another issue is that the ball and robot position flicks, occasionally not detecting or falsely detecting them.”

In the SSL contests, the robots and balls reach velocities of up to 3.7 m/s and 6.5 m/s, respectively, resulting in a fast-moving game that requires high-throughput solutions. Additionally, the size limitations, coupled with using a battery as a power source, require solutions with low power consumption. Precise kicks and passes over long distances are also performed during matches, requiring accurate position estimations.

The team also noted the importance of accurate motor control, so the robot can move across the soccer field and keep its measured position accurate. The team needed a way to reduce the rate at which the robot’s internal understanding of its position diverges from its actual physical position. For more details, see Towards an Autonomous RoboCup Small Size League Robot.

Figure 4. The soccer robot’s camera aids object detection along with field of vision for decision making and path planning

Project software and AI

The team used OpenCV calibration and pose computation techniques to extract the “intrinsic and extrinsic parameters” of the monocular camera fixed to the robot. They used SSD MobileNet v2 to detect objects’ 2D bounding boxes on camera frames, then applied linear regression to the bounding box coordinates, together with the precalibrated camera parameters, to map each detection to the point on the field at the object’s bottom center. This yields the object’s position relative to the camera, and therefore to the robot. 
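
As a rough sketch of this idea (not the team’s actual pipeline), the following snippet assumes a precalibrated pinhole camera with placeholder intrinsics, a fixed assumed camera height and tilt, and an object resting on the floor. It projects a detection’s bottom-center pixel onto the floor plane to recover a relative (x, y) position.

import numpy as np
import cv2

# Hypothetical precalibrated intrinsics and distortion (placeholder values)
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
DIST = np.zeros(5)

CAM_HEIGHT = 0.15           # camera height above the floor, in meters (assumed)
PITCH = np.deg2rad(20.0)    # camera tilted down by 20 degrees (assumed)

# Map camera axes (x right, y down, z forward) to robot axes (x forward, y left, z up)
CAM_TO_ROBOT = np.array([[0.0, 0.0, 1.0],
                         [-1.0, 0.0, 0.0],
                         [0.0, -1.0, 0.0]])
TILT = np.array([[np.cos(PITCH), 0.0, np.sin(PITCH)],
                 [0.0, 1.0, 0.0],
                 [-np.sin(PITCH), 0.0, np.cos(PITCH)]])
R = TILT @ CAM_TO_ROBOT

def bbox_to_relative_position(bbox):
    """Project a detection's bottom-center pixel onto the floor plane (z = 0)."""
    x_min, y_min, x_max, y_max = bbox
    pixel = np.array([[[(x_min + x_max) / 2.0, y_max]]], dtype=np.float32)

    # Undistort to normalized image coordinates and form a ray in the robot frame
    nx, ny = cv2.undistortPoints(pixel, K, DIST).reshape(2)
    ray = R @ np.array([nx, ny, 1.0])

    # Intersect the ray, starting at the camera center, with the floor plane
    t = -CAM_HEIGHT / ray[2]
    return t * ray[0], t * ray[1]   # object (x, y) relative to the robot, in meters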

Results 

The team is pleased with how their robot played in this year’s challenge. Highlights include: 

  • Grabbing a stationary ball: In 12 out of 15 attempts, the robot was able to stop with the ball touching its dribbler, an 80% success rate. 
  • Scoring a goal: A goal was scored in 12 of the 15 runs.
  • Passing: The robot passed the ball in 7 of the 15 tries, resulting in a 46.7% success rate. 

Visit RoboCup 2023 Results to see the full list of results. The team has participated in the RoboCup Small Size League since 2019, winning their first world title in 2022 (Division B). They are currently a three-time Latin American champion. The RobôCIn Small Size League Extended Team Description Paper for RoboCup 2023 presents the improvements that helped the team take first place in SSL division B at RoboCup 2023, held in Bordeaux, France in late July.

Figure 5. The robot grabbing a stationary ball (left) and scoring a goal (right)

Future plans

Guilherme shared some insights about challenges their team encountered in competition, and opportunities for improvement for future events. He noted that most of the failures were due to false-positive detections from objects outside the field. “We are working on a solution for detecting the field boundaries and applying a mask to discard those objects,” he said. 

The team needs faster object detection solutions. “Even though we are able to execute basic skills so far, 30 FPS is still a low processing speed for the SSL environment. At the main competition, cameras usually operate at 70 FPS,” he said. 

The robot’s skills were implemented using only relative positions from detected objects; that is, without knowledge of the robot’s self-localization on the field. “We believe this information might be useful for optimizing our performance in the soccer tasks, while also allowing us to avoid penalties,” Guilherme noted. For example, the robot should not enter the goalkeeper’s area. “We are working on a self-localization algorithm based on Monte Carlo Localization (MCL) and will share it in the coming months.”

The team plans to add more features to the robot’s system in the future (such as field line detection, localization algorithms, and path planning), and they will be working to optimize each part of the system for those needs. 

In addition, the team continues to work on solutions for detecting field boundaries and lines, and estimating the robot’s self-localization. They also plan to replace the Jetson Nano with a Jetson Orin Nano so they can achieve faster processing speeds with their robot. That upgrade should help the team compete more effectively in league play. 

To learn more about the team’s original project, visit the Developer Forum and GitHub. Explore Jetson Community Projects for more ideas and inspiration from your fellow robotics developers.


Pro Tips for Building Multilingual Recommender Systems


Picture this: You’re browsing through an online store, looking for the perfect pair of running shoes. But with thousands of options available, where do you even begin? Suddenly, a section catches your eye: “Recommended for You.” Intrigued, you click and, within seconds, a curated list of running shoes tailored to your unique preferences appears. It’s as if the website understands your tastes, needs, and style.

Welcome to the world of recommendation systems, where cutting-edge technology combines data analysis, artificial intelligence (AI), and a touch of magic to transform our digital experiences.

This post dives deep into the fascinating realm of recommendation systems and explores the modeling approach for building a two-stage candidate reranker. I provide pro tips on how to overcome data scarcity in underrepresented languages, along with a technical walkthrough of how to implement these best practices.

Overview of building a two-stage candidate reranker

For each user, a recommender system must predict a few items that the user will be interested in from possibly millions of items. This is a daunting task. A powerful modeling approach is called the two-stage candidate reranker.

Figure 1 shows the two stages. In the first stage, the model identifies hundreds of candidate items that the user may be interested in. In the second stage, the model ranks this list from most likely to least likely. Finally, the model suggests the most likely items to the user.

Figure 1. Flow of a two-stage candidate reranker recommendation system

Stage 1: Candidate generation

There are many ways to generate candidates, including statistical methods and deep learning methods. One statistical technique to generate candidates is building a co-visitation matrix. You iterate through all user historical sessions and maintain a cumulative tally of how often each pair of items coexists within user sessions. As a result, you know the top 100 items that are frequently paired with each item.

Now, given a specific user, you can generate candidate items by iterating through their user history and combining all top 100 lists associated with each item in their history. Many items appear multiple times. The candidates are the most common items in this concatenated list of hundreds of items.
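
As a plain-Python sketch of that step, here top100 is a hypothetical dict that maps each item to its 100 most frequently co-visited items:

from collections import Counter

def generate_candidates(user_history, top100, n_candidates=100):
    """Concatenate the top-100 co-visited lists of every item in the user's
    history and keep the items that appear most often across those lists."""
    tally = Counter()
    for item in user_history:
        tally.update(top100.get(item, []))
    return [item for item, _ in tally.most_common(n_candidates)]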

Stage 2: Ranking

Using candidates from stage 1, build a tabular dataframe (Figure 2), which you use to train a reranker. Imagine that stage 1 produces 100 candidates per user. Then your tabular dataframe has 100 rows for each user in the training data. One column is the user and another column is the candidate item. Add a third column for the target: each row whose candidate item is a correct match for that row’s user has a 1 in the target column, and 0 otherwise.

Figure 2. Reranker dataframe table

Next, add columns that describe the user sessions and items called feature columns. These feature columns are what the reranker uses to learn patterns and predict the target column. You train your reranker with either a binary classification objective or a pairwise or listwise ranking objective. Afterward, you use this trained model to predict items for unseen test user sessions.
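
A minimal cuDF sketch of this dataframe construction follows, with hypothetical parquet file names for the stage 1 candidates and the ground-truth interactions:

import cudf

# Hypothetical inputs: one row per (user, candidate item) pair from stage 1,
# and one row per (user, item) ground-truth interaction
candidates = cudf.read_parquet('stage1_candidates.parquet')   # columns: user, item
truth = cudf.read_parquet('ground_truth.parquet')             # columns: user, item
truth['target'] = 1

# A candidate row gets target=1 when it matches a ground-truth item for that user
df = candidates.merge(truth, on=['user', 'item'], how='left')
df['target'] = df['target'].fillna(0).astype('int8')

Feature columns describing users, items, and user-item interactions are then added before training with a binary or ranking objective.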

Data scarcity in underrepresented languages

The two-stage candidate reranker approach (and any other approach) requires a large amount of training data to train the machine learning or deep learning model properly. Popular languages typically have lots of existing data, but this is not true for historically underrepresented languages.

Advocating for underserved languages is crucial for several reasons, such as promoting inclusivity, increasing global reach, and improving online user engagement and satisfaction.

To build recommender systems for underrepresented languages, I recommend using transfer learning. By leveraging datasets for common languages, models can recognize existing patterns and apply these learnings to support languages that are not widely spoken. This helps you overcome small dataset challenges and create a more inclusive digital world.

Pro tips for developing multilingual recommendation systems

To overcome data scarcity, use transfer learning to apply information from one language to another for stages 1 and 2. Many items have equivalents in multiple languages. Therefore, user-item interaction behavior in one language can be translated to another language.

Here are the top tips for speeding up the development process for multilingual recommendation engines.

Tips for candidate generation

  • First, create co-visitation matrices for underrepresented languages by using user histories that exist in both popular languages and underrepresented languages.
  • Be sure to represent items with pretrained multilingual large language model (LLM) embeddings. Then, use cosine similarity to find candidate items in underrepresented languages.
  • Initialize NN embeddings with pretrained multilingual LLM embeddings. Then, fine-tune and use cosine similarity between user and item embeddings to find candidate items in the underrepresented languages.

Tips for ranking

  • You can use item features from popular languages as item features for underrepresented languages in the tabular dataframe for the reranker.
  • Create user-item interaction features by transferring user-item patterns learned from popular languages to underrepresented languages.
  • Finally, train an underrepresented language’s reranker using user-item dataframe rows from popular languages.

Tutorial: Multilingual recommender system

To help you test these methods out, I walk you through an optimized process for building a multilingual recommender system.

Candidate generation implementation

The goal of candidate generation is to generate hundreds of item suggestions per user. Two popular techniques are using co-visitation matrices and using representation learning. Using transfer learning with co-visitation matrices is straightforward.

Earlier in this post, I discussed how co-visitation candidate generation is based on counting the coexisting pairs of product IDs within user histories. As many product IDs exist in multiple languages, you can use pairs from a German user’s history as counts in a Spanish co-visitation matrix. In Figure 3, the top German row is from the training data. You then “translate” it to Spanish, shown in the bottom row.

Figure 3. Transfer learning process

The procedure is as follows.

  • Given a pair of Spanish product IDs, you can iterate through users from the other five languages: English, German, Japanese, Italian, and French.
  • Whenever you observe the pair of Spanish product IDs in one of these users’ histories, add 1 to the count for this Spanish item pair. Or you can use a different weight, such as adding 0.5 to the count.
  • After you accumulate counts for all Spanish item pairs, continue to generate candidates as before by applying the new co-visitation matrix to each Spanish user’s history to generate candidates for the Spanish user.

The fastest and most efficient way to create co-visitation matrices is to use RAPIDS cuDF. To follow along, see the Candidate ReRank Model using Handcrafted Rules Jupyter notebook with example code.

By merging a dataframe that contains all user histories (that is, a dataframe with columns user and history item) to itself on the key user, you create all historical pairs. Then group by item pairs and aggregate the counts.

import cudf

# All user histories, one row per (user, item) interaction
df = cudf.DataFrame(ALL_USER_HISTORIES)

# Self-merge on user to enumerate every pair of items that co-occur in a session
df = df.merge(df, on='user')

# Tally how often each (item_x, item_y) pair co-occurs across all users
df['wgt'] = 1
df = df.groupby(['item_x','item_y']).wgt.sum()

Representation learning, LLMs, and deep learning embeddings are hot and current topics. Besides co-visitation matrices, an alternative to generating candidate items for each user is to create meaningful distance embeddings. If you have meaningful distance embeddings for each item, then you could use a model that predicts an embedding for each user. Next, find the 100 closest (through cosine similarity) embeddings to this predicted embedding and use these as your candidates (Figure 4).

Figure 4. Compute distance between embeddings
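
A minimal NumPy sketch of that nearest-neighbor search, assuming item_embeddings is an array of shape (num_items, dim) and item_ids is the matching list of product IDs:

import numpy as np

def nearest_candidates(user_embedding, item_embeddings, item_ids, k=100):
    """Return the k items whose embeddings have the highest cosine similarity
    to the predicted user embedding."""
    items = item_embeddings / np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    user = user_embedding / np.linalg.norm(user_embedding)
    scores = items @ user                  # cosine similarity per item
    top = np.argsort(-scores)[:k]
    return [item_ids[i] for i in top]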

The process of training meaningful distance embeddings for items is called representation learning. Embeddings are N dimensional vectors in N dimensional space. During training, embeddings of similar items are modified to be closer together (through some distance metric) while embeddings of dissimilar items are modified to have at least a predefined gap distance (margin) between them.

One way to use transfer learning during representation learning is to pre-initialize the embeddings with multilingual sentence embeddings. Each item has a title, whether it’s in English, German, Japanese, Spanish, Italian, or French. You can pre-initialize each item with its title embedding from Hugging Face’s stsb-xlm-r-multilingual model, for example. This model has been trained on many different languages and transfers learning from all of them. Afterward, you can fine-tune the embeddings using your training data with the model shown in Figure 5.

Figure 5. Representation learning
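
As a rough sketch of that pre-initialization (assuming the sentence-transformers package and a hypothetical item_titles list ordered by item ID), the title embeddings can seed a trainable embedding table:

import torch
from sentence_transformers import SentenceTransformer

# Encode every item title with a pretrained multilingual sentence model
encoder = SentenceTransformer('sentence-transformers/stsb-xlm-r-multilingual')
title_embeddings = encoder.encode(item_titles, convert_to_numpy=True,
                                  show_progress_bar=True)

# Pre-initialize a trainable item embedding table from the title embeddings,
# then fine-tune it with the user-history training data
item_embedding = torch.nn.Embedding.from_pretrained(
    torch.tensor(title_embeddings, dtype=torch.float32), freeze=False)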

Fine-tune your model using all train data user histories. Every three consecutive history items are paired with one positive item target, which is the next consecutive item. Each triplet is paired with 4096 negative item targets, which are randomly chosen items. Backpropagation maximizes cosine similarity between the predicted embedding and positive target. And it minimizes cosine similarity between the predicted embedding and negative target. Afterward, you have meaningful distance embeddings for each item and a predicted embedding for each user.

A quick and easy way to create transformer-based, session-aware recommender systems that can use pretrained embeddings is to use the NVIDIA Merlin framework. For more information, see the Session-Based Next Item Prediction for Fashion E-Commerce and Training With Pretrained Embeddings Jupyter notebooks.

You can also feed your models with NVIDIA Merlin Dataloader.

Ranking implementation

The goal of stage 2 is to train a reranker that predicts the likelihood of each candidate item being correct among all possible candidate items for each user. To train a model successfully, you need feature columns in addition to a user, item, and target column. There are three types of feature columns:

  • Item features
  • User features
  • User-item interaction features
Figure 6. Reranker dataframe with features

Item features describe items. For example, you can add an item price feature. Then, every row in your reranker dataframe with item A has a corresponding price A in the item price column (Figure 6).

Using transfer learning on item features is easy. To transfer learning from German to Spanish, you can create item features from the German user history data and then merge it to Spanish items.

For example, for each item product ID, count how often it appears in all German user histories. Then every row in your reranker dataframe with Spanish item A has a corresponding German popularity A in the German item popularity column. This works because many item product IDs exist in both German and Spanish. If a certain Spanish product ID does not exist in German, then you insert NaN in the German item popularity column.
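
A small cuDF sketch of that cross-language item feature, with hypothetical file and column names:

import cudf

# German user histories: one row per (user, item) interaction
german = cudf.read_parquet('sessions_DE.parquet')

# Popularity of each product ID in the German data
de_popularity = (german.groupby('item')
                 .agg({'user': 'count'})
                 .rename(columns={'user': 'de_item_popularity'}))

# Spanish reranker rows inherit the German popularity of the same product ID;
# Spanish-only product IDs get NaN in this column
df = df.merge(de_popularity, left_on='item', right_index=True, how='left')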

User feature columns and item feature columns are generally created with dataframe groupby commands. Create a property for each user or item and then merge it into your dataframe. The quickest and most efficient method is to use RAPIDS cuDF.

import cudf

# Item features: popularity (row count), unique users, and price per item
item_features = data.groupby('item').agg(
    {'user': ['count', 'nunique'], 'price': 'first'})
item_features.columns = ['item_count', 'item_unique_users', 'item_price']
df = df.merge(item_features, left_on='item',
              right_index=True, how='left')

# User features: activity (row count) and unique items per user
user_features = data.groupby('user').agg({'item': ['count', 'nunique']})
user_features.columns = ['user_count', 'user_unique_items']
df = df.merge(user_features, left_on='user',
              right_index=True, how='left')

User-item interaction features describe the relationship between a row’s candidate item and that row’s user. These features have a different value for each row. A common way to generate user-item interaction features is to describe the relationship between a user’s last history item and their candidate item.

One way to use transfer learning from popular languages to underrepresented languages is to create meaningful distance embeddings for all items using multilingual information. Then a user-item interaction feature can be the cosine similarity score between a user’s last history item and candidate item based on the embeddings.
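
Here is a short sketch of that interaction feature, assuming the reranker dataframe holds integer item IDs in hypothetical item and last_history_item columns and item_embeddings is the shared multilingual embedding matrix:

import numpy as np

# Normalize the shared multilingual item embeddings once
emb = item_embeddings / np.linalg.norm(item_embeddings, axis=1, keepdims=True)

# For every reranker row, score the candidate item against the user's last history item
last = emb[df['last_history_item'].to_numpy()]
cand = emb[df['item'].to_numpy()]
df['last_item_similarity'] = (last * cand).sum(axis=1)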

Figure 7 shows extracting item embeddings from a multilingual LLM. You concatenate all the text for each item and input it into your LLM. Extract the last hidden layer activations as your embedding.

Figure 7. Large language model embeddings

A third way to use information from popular languages to improve underrepresented language recommendation is to train the underrepresented language’s GBT reranker using dataframe rows from popular languages. First, use the same feature columns for all language dataframes, and then merge all the dataframes into one. The resulting dataframe is large.

The best way to train GBT with millions of rows is to use RAPIDS Dask cuDF XGB, which uses multiple GPUs! For more information, see the KDD cup solution code.

The key lines of the code are as follows:

import xgboost as xgb
import dask_cudf
from dask.distributed import Client

# Connect to an existing Dask CUDA cluster (for example, a LocalCUDACluster)
client = Client(cluster)

# Load the merged multi-language reranker dataframe across all GPUs
df = dask_cudf.read_parquet(FILES).persist()

# Build a quantized DMatrix and train XGBoost with multiple GPUs
dtrain = xgb.dask.DaskQuantileDMatrix(client, df[FEATURES], df[TARGET])
xgb.dask.train(client, xgb_parms, dtrain)

Conclusion

When browsing online, recommendation systems may seem magical but, as you learned throughout this post, the inner workings of a multilingual recommendation engine are deterministic and understandable.

In this post, I shared techniques that the Kaggle Grandmasters of NVIDIA and NVIDIA Merlin teams used to win the recent KDD cup 2023 Multilingual Recommender System competition hosted by Amazon.

I also introduced the two-stage candidate reranker technique for recommendation systems. This is a powerful technique that helps solve many recommender system needs. Next, I gave you pro tips to help train recommendation systems for underrepresented languages. I shared how RAPIDS and NVIDIA Merlin frameworks can help you build recommender systems.

I hope that you can use some of these ideas in your next recommender system project. By improving online recommender systems for underrepresented languages, we can all make the Internet more inclusive, extend global reach, and improve user engagement and satisfaction.


Selecting Large Language Model Customization Techniques


Large language models (LLMs) are becoming an integral tool for businesses to improve their operations, customer interactions, and decision-making processes. However, off-the-shelf LLMs often fall short in meeting the specific needs of enterprises due to industry-specific terminology, domain expertise, or unique requirements.

This is where custom LLMs come into play.

Enterprises need custom models to tailor the language processing capabilities to their specific use cases and domain knowledge. Custom LLMs enable a business to generate and understand text more efficiently and accurately within a certain industry or organizational context.

Custom models empower enterprises to create personalized solutions that align with their brand voice, optimize workflows, provide more precise insights, and deliver enhanced user experiences, ultimately driving a competitive edge in the market.

This post covers various model customization techniques and when to use them. NVIDIA NeMo supports many of the methods.

NVIDIA NeMo is an end-to-end, cloud-native framework to build, customize, and deploy generative AI models anywhere. It includes training and inferencing frameworks, guardrail toolkits, data curation tools, and pretrained models, offering an easy, cost-effective, and fast way to adopt generative AI.

Selecting an LLM customization technique

You can categorize techniques by the trade-offs between dataset size requirements and the level of training effort during customization compared to the downstream task accuracy requirements.

Figure 1. LLM customization techniques available with NVIDIA NeMo

Figure 1 shows the following popular customization techniques:

  • Prompt engineering: Manipulates the prompt sent to the LLM but doesn’t alter the parameters of the LLM in any way. It is light in terms of data and compute requirements.
  • Prompt learning: Uses prompt and completion pairs imparting task-specific knowledge to LLMs through virtual tokens. This process requires more data and compute but provides better accuracy than prompt engineering.
  • Parameter-efficient fine-tuning (PEFT): Introduces a small number of parameters or layers to existing LLM architecture and is trained with use-case–specific data, providing higher accuracy than prompt engineering and prompt learning, while requiring more training data and compute.
  • Fine-tuning: Involves updating the pretrained LLM weights unlike the three types of customization techniques outlined earlier that keep these weights frozen. This means fine-tuning also requires the most amount of training data and compute as compared to these other techniques. However, it provides the most accuracy for specific use cases, justifying the cost and complexity.

For more information, see An Introduction to Large Language Models: Prompt Engineering and P-Tuning.

Prompt engineering

Prompt engineering involves customization at inference time with show-and-tell examples. An LLM is provided with example prompts and completions, detailed instructions that are prepended to a new prompt to generate the desired completion. The parameters of the model are not changed.

Few-shot prompting: This approach requires prepending a few sample prompt and completion pairs to the prompt, so that the LLM learns how to generate responses for a new unseen prompt. While few-shot prompting requires relatively little data compared to other customization techniques and does not require fine-tuning, it does add to inference latency.
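
For illustration only, a few-shot prompt might look like the following sketch (a generic sentiment-classification example, not tied to any particular model or to NeMo):

few_shot_prompt = """\
Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: Positive

Review: It stopped working after a week and support never replied.
Sentiment: Negative

Review: Setup took two minutes and it just works.
Sentiment:"""

# The examples are prepended to the new, unseen prompt; the LLM is expected to
# continue the pattern and complete the final "Sentiment:" line.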

Chain-of-thought reasoning: Just as humans decompose bigger problems into smaller ones and apply chain of thought to solve problems effectively, chain-of-thought reasoning is a prompt engineering technique that helps LLMs improve their performance on multi-step tasks. It involves breaking a problem down into simpler steps with each of the steps requiring slow and deliberate reasoning. This approach works well for logical, arithmetic, and deductive reasoning tasks.

System prompting: This approach involves adding a system-level prompt in addition to the user prompt to provide specific and detailed instructions to the LLMs to behave as intended. The system prompt can be thought of as input to the LLM to generate its response. The quality and specificity of the system prompt can have a significant impact on the relevance and accuracy of the LLM’s response.

Prompt learning

Prompt learning is an efficient customization method that makes it possible to use pretrained LLMs on many downstream tasks without needing to tune the pretrained model’s full set of parameters. It includes two variations with subtle differences called p-tuning and prompt tuning; both methods are collectively referred to as prompt learning.

Prompt learning enables adding new tasks to LLMs without overwriting or disrupting previous tasks for which the model has already been pretrained. Because the original model parameters are frozen and never altered, prompt learning also avoids catastrophic forgetting issues often encountered when fine-tuning models. Catastrophic forgetting occurs when LLMs learn new behavior during the fine-tuning process at the cost of foundational knowledge gained during LLM pretraining.

Figure 2. Prompt learning applied to LLMs

Instead of selecting discrete text prompts in a manual or automated fashion, prompt tuning and p-tuning use virtual prompt embeddings that you can optimize by gradient descent. These virtual token embeddings exist in contrast to the discrete, hard, or real tokens that do make up the model’s vocabulary. Virtual tokens are purely 1D vectors with dimensionality equal to that of each real token embedding. In training and inference, continuous token embeddings are inserted among discrete token embeddings according to a template provided in the model’s config.

Prompt tuning: For a pretrained LLM, soft prompt embeddings are initialized as a 2D matrix of size total_virtual_tokens × hidden_size. Each task that the model is prompt-tuned to perform has its own associated 2D embedding matrix. Tasks do not share any parameters during training or inference. The NeMo framework prompt tuning implementation is based on The Power of Scale for Parameter-Efficient Prompt Tuning.
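
A minimal PyTorch sketch of the idea (not the NeMo implementation): a trainable matrix of virtual token embeddings is prepended to the frozen model’s real token embeddings.

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable 2D matrix of size total_virtual_tokens x hidden_size that is
    prepended to the real (frozen) token embeddings of a batch."""
    def __init__(self, total_virtual_tokens, hidden_size):
        super().__init__()
        self.prompt = nn.Parameter(
            torch.randn(total_virtual_tokens, hidden_size) * 0.02)

    def forward(self, token_embeddings):          # (batch, seq_len, hidden_size)
        batch_size = token_embeddings.size(0)
        virtual = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([virtual, token_embeddings], dim=1)

# Only the prompt parameter receives gradients during prompt tuning;
# all base LLM parameters stay frozen.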

P-tuning: An LSTM or MLP model called prompt_encoder is used to predict virtual token embeddings. prompt_encoder parameters are randomly initialized at the start of p-tuning. All base LLM parameters are frozen, and only the prompt_encoder weights are updated at each training step. When p-tuning completes, prompt-tuned virtual tokens from prompt_encoder are automatically moved to prompt_table where all prompt-tuned and p-tuned soft prompts are stored. prompt_encoder is then removed from the model. This enables you to preserve previously p-tuned soft prompts while still maintaining the ability to add new p-tuned or prompt-tuned soft prompts in the future.

prompt_table uses the task name as a key to look up the correct virtual tokens for a specified task. The NeMo framework p-tuning implementation is based on GPT Understands, Too.

Parameter-efficient fine-tuning

Parameter-efficient fine-tuning (PEFT) techniques use clever optimizations to selectively add and update few parameters or layers to the original LLM architecture. Using PEFT, model parameters are trained for specific use cases. Pretrained LLM weights are kept frozen and significantly fewer parameters are updated during PEFT using domain and task-specific datasets. This enables LLMs to reach high accuracy on trained tasks.

There are several popular parameter-efficient alternatives to fine-tuning pretrained language models. Unlike prompt learning, these methods do not insert virtual prompts into the input. Instead, they introduce trainable layers into the transformer architecture for task-specific learning. This helps attain strong performance on downstream tasks while reducing the number of trainable parameters by several orders of magnitude (closer to 10,000x fewer parameters) compared to fine-tuning.

  • Adapter Learning
  • Infused Adapter by Inhibiting and Amplifying Inner Activations (IA3)
  • Low-Rank Adaptation (LoRA)

Adapter Learning: Introduces small feed-forward layers in between the layers of the core transformer architecture. Only these layers (adapters) are trained at fine-tuning time for specific downstream tasks. The adapter layer generally uses a down-projection with W_down to project the input h to a lower-dimensional space, followed by a nonlinear activation function f, and an up-projection with W_up. A residual connection adds the output of this to the input, leading to a final form:

$h \leftarrow h + f(hW_{down})W_{up}$

Adapter modules are usually initialized such that the initial output of the adapter is always zeros to prevent degradation of the original model’s performance due to the addition of such modules. The NeMo framework adapter implementation is based on Parameter-Efficient Transfer Learning for NLP.
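
A minimal PyTorch sketch of such an adapter block (not the NeMo implementation; the bottleneck size is an illustrative assumption):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: h <- h + f(h W_down) W_up, with a residual connection."""
    def __init__(self, hidden_size, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.activation = nn.GELU()
        # Zero-init the up-projection so the adapter starts as an identity mapping
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(self.activation(self.down(h)))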

IA3: Adds even fewer parameters than adapters; it simply scales the hidden representations in the transformer layer using learned vectors. These scaling parameters can be trained for specific downstream tasks. The learned vectors l_k, l_v, and l_ff rescale the keys and values in the attention mechanisms and the inner activations in the position-wise feed-forward networks, respectively. This technique also makes mixed-task batches possible because each sequence of activations in the batch can be separately and cheaply multiplied by its associated learned task vector. The NeMo framework IA3 implementation is based on Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning.

Figure 3. LoRA for parameter-efficient fine-tuning

LoRA: Injects trainable low-rank matrices into transformer layers to approximate weight updates. Instead of updating the full pretrained weight matrix W, LoRA updates its low-rank decomposition, reducing the number of trainable parameters by 10,000x and the GPU memory requirements by 3x compared to fine-tuning. This update is applied to the query and value projection weight matrices in the multi-head attention sub-layer. Applying updates to the low-rank decomposition instead of the entire matrix has been shown to be on par with or better than fine-tuning in model quality, enabling higher training throughput with no additional inference latency.

The NeMo framework LoRA implementation is based on Low-Rank Adaptation of Large Language Models. For more information about how to apply LoRA to an extractive QA task, see the LoRA tutorial notebook.
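
A minimal PyTorch sketch of the LoRA idea applied to a single linear layer (not the NeMo implementation; the rank and scaling values are illustrative):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Freezes the pretrained weight W and learns a low-rank update B @ A
    that is added to the layer's output."""
    def __init__(self, base_linear: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad_(False)              # keep pretrained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base_linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base_linear.out_features, rank))
        self.scaling = alpha / rank              # B is zero-init: no change at start

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)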

Fine-tuning

When data and compute resources have no hard constraints, customization techniques such as supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) are great alternative approaches to PEFT and prompt engineering. Fine-tuning can help achieve the best accuracy on a range of use cases as compared to other customization approaches.

Supervised fine-tuning: SFT is the process of fine-tuning all the model’s parameters on labeled data of inputs and outputs that teaches the model domain-specific terms and how to follow user-specified instructions. It is typically done after model pretraining. Using pretrained models enables many benefits that include the use of state-of-the-art models without having to train from scratch, reduced computation costs, and reduced data collection needs as compared to the pretraining stage. A form of SFT is referred to as instruction tuning because it involves fine-tuning language models on a collection of datasets described through instructions.

Figure 4. Supervised fine-tuning with labeled instruction-following data

SFT with instructions leverages the intuition that NLP tasks can be described through natural language instructions, such as “Summarize the following article into three sentences.” or “Write an email in Spanish about an upcoming school festival.” This method successfully combines the strengths of fine-tuning and prompting paradigms to improve LLM zero-shot performance at inference time.

The instruction tuning process involves performing fine-tuning on the pretrained model on a mixture of several NLP datasets expressed through natural language instructions that are blended in varying proportions. At inference time, the fine-tuned model is evaluated on unseen tasks and this process is known to substantially improve zero-shot performance on unseen tasks. SFT is also an important intermediary step in the process of improving LLM capabilities using reinforcement learning, which we describe next.

Reinforcement learning with human feedback: Reinforcement learning with human feedback (RLHF) is a customization technique that enables LLMs to achieve better alignment with human values and preferences. It uses reinforcement learning to enable the model to adapt its behavior based on the feedback it receives. It involves a three-stage fine-tuning process that uses human preference as the loss function. The SFT model fine-tuned with instructions as described in the earlier section is considered the first stage in the RLHF technique.

Figure 5. Aligning LLM behavior with human preferences using reinforcement learning

The SFT model is trained as a reward model (RM) in stage 2 of RLHF. A dataset consisting of prompts with multiple responses ranked by humans is used to train the RM to predict human preference.

After the RM is trained, stage 3 of RLHF focuses on fine-tuning the initial policy model against the RM using reinforcement learning with a proximal policy optimization (PPO) algorithm. These three stages of RLHF performed iteratively enable LLMs to generate outputs that are more aligned with human preferences and can follow instructions more effectively.

While RLHF results in powerful LLMs, the downside is that this method can be misused and exploited to generate undesirable or harmful content. The NeMo method uses the PPO value network as a critic model to guide the LLMs away from generating harmful content. There are other approaches being actively explored in the research community to steer the LLMs towards appropriate behavior and reduce toxic generation or hallucinations where LLMs make up facts.

Customize your LLMs

This post covered various model customization techniques and when to use them. Many of those methods are supported by NVIDIA NeMo.

NeMo provides an accelerated workflow for training with 3D parallelism techniques. It offers a choice of several customization techniques and is optimized for at-scale inference of large-scale models for language and image applications, with multi-GPU and multi-node configurations.

Download the NeMo framework today and customize pretrained LLMs on your preferred on-premises and cloud platforms.


Just Released: NVIDIA Modulus 23.08


NVIDIA Modulus is now part of the NVIDIA AI Enterprise suite, supporting PyTorch 2.0, CUDA 12, and new samples.


Advances in document understanding

The last few years have seen rapid progress in systems that can automatically process complex business documents and turn them into structured objects. A system that can automatically extract data from documents, e.g., receipts, insurance quotes, and financial statements, has the potential to dramatically improve the efficiency of business workflows by avoiding error-prone, manual work. Recent models, based on the Transformer architecture, have shown impressive gains in accuracy. Larger models, such as PaLM 2, are also being leveraged to further streamline these business workflows. However, the datasets used in academic literature fail to capture the challenges seen in real-world use cases. Consequently, academic benchmarks report strong model accuracy, but these same models do poorly when used for complex real-world applications.

In “VRDU: A Benchmark for Visually-rich Document Understanding”, presented at KDD 2023, we announce the release of the new Visually Rich Document Understanding (VRDU) dataset that aims to bridge this gap and help researchers better track progress on document understanding tasks. We list five requirements for a good document understanding benchmark, based on the kinds of real-world documents for which document understanding models are frequently used. Then, we describe how most datasets currently used by the research community fail to meet one or more of these requirements, while VRDU meets all of them. We are excited to announce the public release of the VRDU dataset and evaluation code under a Creative Commons license.

Benchmark requirements

First, we compared state-of-the-art model accuracy (e.g., with FormNet and LayoutLMv2) on real-world use cases to academic benchmarks (e.g., FUNSD, CORD, SROIE). We observed that state-of-the-art models did not match academic benchmark results and delivered much lower accuracy in the real world. Next, we compared typical datasets for which document understanding models are frequently used with academic benchmarks and identified five dataset requirements that allow a dataset to better capture the complexity of real-world applications:

  • Rich Schema: In practice, we see a wide variety of rich schemas for structured extraction. Entities have different data types (numeric, strings, dates, etc.) that may be required, optional, or repeated in a single document or may even be nested. Extraction tasks over simple flat schemas like (header, question, answer) do not reflect typical problems encountered in practice.
  • Layout-Rich Documents: The documents should have complex layout elements. Challenges in practical settings come from the fact that documents may contain tables, key-value pairs, switch between single-column and double-column layout, have varying font-sizes for different sections, include pictures with captions and even footnotes. Contrast this with datasets where most documents are organized in sentences, paragraphs, and chapters with section headers — the kinds of documents that are typically the focus of classic natural language processing literature on long inputs.
  • Diverse Templates: A benchmark should include different structural layouts or templates. It is trivial for a high-capacity model to extract from a particular template by memorizing the structure. However, in practice, one needs to be able to generalize to new templates/layouts, an ability that the train-test split in a benchmark should measure.
  • High-Quality OCR: Documents should have high-quality Optical Character Recognition (OCR) results. Our aim with this benchmark is to focus on the VRDU task itself and to exclude the variability brought on by the choice of OCR engine.
  • Token-Level Annotation: Documents should contain ground-truth annotations that can be mapped back to corresponding input text, so that each token can be annotated as part of the corresponding entity. This is in contrast with simply providing the text of the value to be extracted for the entity. This is key to generating clean training data where we do not have to worry about incidental matches to the given value. For instance, in some receipts, the ‘total-before-tax’ field may have the same value as the ‘total’ field if the tax amount is zero. Having token level annotations prevents us from generating training data where both instances of the matching value are marked as ground-truth for the ‘total’ field, thus producing noisy examples.

VRDU datasets and tasks

The VRDU dataset is a combination of two publicly available datasets, Registration Forms and Ad-Buy forms. These datasets provide examples that are representative of real-world use cases, and satisfy the five benchmark requirements described above.

The Ad-buy Forms dataset consists of 641 documents with political advertisement details. Each document is either an invoice or receipt signed by a TV station and a campaign group. The documents use tables, multi-columns, and key-value pairs to record the advertisement information, such as the product name, broadcast dates, total price, and release date and time.

The Registration Forms dataset consists of 1,915 documents with information about foreign agents registering with the US government. Each document records essential information about foreign agents involved in activities that require public disclosure. Contents include the name of the registrant, the address of related bureaus, the purpose of activities, and other details.

We gathered a random sample of documents from the public Federal Communications Commission (FCC) and Foreign Agents Registration Act (FARA) sites, and converted the images to text using Google Cloud’s OCR. We discarded a small number of documents that were several pages long and for which processing did not complete in under two minutes. This also allowed us to avoid sending very long documents for manual annotation, a task that can take over an hour for a single document. Then, we defined the schema and corresponding labeling instructions for a team of annotators experienced with document-labeling tasks.

The annotators were also provided with a few sample labeled documents that we labeled ourselves. The task required annotators to examine each document, draw a bounding box around every occurrence of an entity from the schema for each document, and associate that bounding box with the target entity. After the first round of labeling, a pool of experts were assigned to review the results. The corrected results are included in the published VRDU dataset. Please see the paper for more details on the labeling protocol and the schema for each dataset.

Existing academic benchmarks (FUNSD, CORD, SROIE, Kleister-NDA, Kleister-Charity, DeepForm) fall short on one or more of the five requirements we identified for a good document understanding benchmark. VRDU satisfies all of them. See our paper for background on each of these datasets and a discussion on how they fail to meet one or more of the requirements.

We built four different model training sets with 10, 50, 100, and 200 samples respectively. Then, we evaluated the VRDU datasets using three tasks (described below): (1) Single Template Learning, (2) Mixed Template Learning, and (3) Unseen Template Learning. For each of these tasks, we included 300 documents in the testing set. We evaluate models using the F1 score on the testing set.

  • Single Template Learning (STL): This is the simplest scenario where the training, testing, and validation sets only contain a single template. This simple task is designed to evaluate a model’s ability to deal with a fixed template. Naturally, we expect very high F1 scores (0.90+) for this task.
  • Mixed Template Learning (MTL): This task is similar to the task that most related papers use: the training, testing, and validation sets all contain documents belonging to the same set of templates. We randomly sample documents from the datasets and construct the splits to make sure the distribution of each template is not changed during sampling.
  • Unseen Template Learning (UTL): This is the most challenging setting, where we evaluate if the model can generalize to unseen templates. For example, in the Registration Forms dataset, we train the model with two of the three templates and test the model with the remaining one. The documents in the training, testing, and validation sets are drawn from disjoint sets of templates. To our knowledge, previous benchmarks and datasets do not explicitly provide such a task designed to evaluate the model’s ability to generalize to templates not seen during training.

The objective is to be able to evaluate models on their data efficiency. In our paper, we compared two recent models using the STL, MTL, and UTL tasks and made three observations. First, unlike with other benchmarks, VRDU is challenging and shows that models have plenty of room for improvement. Second, we show that few-shot performance for even state-of-the-art models is surprisingly low, with even the best models achieving an F1 score below 0.60. Third, we show that models struggle to deal with structured repeated fields and perform particularly poorly on them.

Conclusion

We release the new Visually Rich Document Understanding (VRDU) dataset that helps researchers better track progress on document understanding tasks. We describe why VRDU better reflects practical challenges in this domain. We also present experiments showing that VRDU tasks are challenging and that recent models have substantial headroom for improvement, in contrast to the datasets typically used in the literature, where F1 scores of 0.90+ are typical. We hope the release of the VRDU dataset and evaluation code helps research teams advance the state of the art in document understanding.

Acknowledgements

Many thanks to Zilong Wang, Yichao Zhou, Wei Wei, and Chen-Yu Lee, who co-authored the paper along with Sandeep Tata. Thanks to Marc Najork, Riham Mansour and numerous partners across Google Research and the Cloud AI team for providing valuable insights. Thanks to John Guilyard for creating the animations in this post.


Speed Up GPU Crash Debugging with NVIDIA Nsight Aftermath


NVIDIA Nsight Developer Tools provide comprehensive access to NVIDIA GPUs and graphics APIs for performance analysis, optimization, and debugging activities. When using advanced rendering techniques like ray tracing or path tracing, Nsight tools are your companion for creating a smooth and polished experience. 

At SIGGRAPH 2023, NVIDIA hosted a lab exploring how to use NVIDIA Nsight Tools to debug and profile ray tracing applications. New versions of the NVIDIA Nsight Aftermath SDK, NVIDIA Nsight Graphics, and NVIDIA Nsight Systems are also available. For more information on Nsight Tools released at SIGGRAPH, check out the latest video on NVIDIA Graphics Tools.

This post explores how Nsight Aftermath SDK 2023.2 speeds up GPU crash debugging with improved event marker performance.  

Nsight Aftermath SDK GPU crash postmortem analysis

Few issues are as pressing as a GPU crash, which can abruptly block development progress until resolved. Developers and end users alike find these crashes frustrating, especially when they can’t capture useful debugging information from the GPU pipeline at the moment of failure. To shed light on hidden exceptions, the Nsight Aftermath SDK opens a window into the GPU at the moment a game fails. This helps pinpoint the source of the issue and guides the developer in resolving it.

The Nsight Aftermath SDK generates GPU crash dump files that load into NVIDIA Nsight Graphics to visualize the GPU state—revealing MMU fault information, warp details, problematic shader source, and more. Integrating Aftermath into existing crash reporters also provides more granular pipeline dumps from end-users’ machines, providing actionable reports. Today’s update to the Nsight Aftermath SDK improves the contextual data provided through low-overhead, application-specific markers.

Event marker performance has been enhanced in the Nsight Aftermath SDK for DirectX 12 applications. You can insert these markers into your CPU code at desired intervals, and the significantly reduced overhead makes them usable in shipping applications. Markers are written to the Aftermath crash dump file, indicating where in the application’s frame a GPU exception occurred. With this information, you can determine the workload executing on the GPU and view what shaders were in use at the time of the crash.

The 2023.2 version of the Nsight Aftermath SDK also supports collecting and displaying shader register values to aid in debugging streaming multiprocessor (SM) exceptions. On the SM, registers store the results of instructions as they are executing. This data is particularly relevant to determining the source of a crash if a shader workload triggered the failure. After being written to an Nsight Aftermath dump file, you can inspect the register values for faulting threads in Nsight Graphics. This helps you determine where and why the shader execution failed.

Figure 1. Nsight Aftermath SDK exposes shader register values that correspond to the line of shader source code that caused an exception

SM register data is now available for DirectX 12 and Vulkan applications. Note that viewing this data requires NVIDIA Nsight Graphics Pro. Coordinate with your NVIDIA Developer Technologies or Developer Relations contact, or reach out, to request access.

Nsight Aftermath is also now compatible with the latest applications using cutting-edge DirectX 12 features through the DirectX Agility SDK.

Getting Started with Nsight Aftermath SDK and event markers

Getting started with the SDK is easy. Here are some tips to help you use GPU crash dumps and event markers. More information is included in the Read Me section of the download.

  1. Download Nsight Aftermath SDK 2023.2.
  2. Enable GPU crash dump creation by calling GFSDK_Aftermath_EnableGpuCrashDumps. Note that crash dumps are not generated for devices created before that call, so make sure it’s enabled first.
  3. Set the Nsight Aftermath options to control what information is captured. 
    For example, you can enable the shader debug information and runtime shader error reporting flags when you initialize Nsight Aftermath for the device. A minimal initialization sketch follows this list.

    Tip: To use event markers, make sure that the event marker flag is enabled at this step. You can also use the Nsight Aftermath Monitor application to enable SM register collection.

Screen grab of the Nsight Aftermath Monitor.
Figure 2. The Nsight Aftermath Monitor, included in both the SDK and Nsight Graphics, is the command center for collecting crash information
  4. When your GPU crash dump has been collected, open it in Nsight Graphics for rich data visualization. Nsight Graphics will help you analyze the crash and determine how to resolve it.
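To make steps 2 and 3 concrete, here is a minimal sketch of how the calls might look in a DirectX 12 application. The callback body, feature flag choices, and file name are illustrative; the parameter list of GFSDK_Aftermath_EnableGpuCrashDumps (including the optional callbacks passed as nullptr here) and the flag names follow recent SDK headers and should be verified against the headers shipped with your SDK version.

#include <cassert>
#include <cstdint>
#include <cstdio>
#include <d3d12.h>
#include "GFSDK_Aftermath.h"
#include "GFSDK_Aftermath_GpuCrashDump.h"

// Invoked by Aftermath when a GPU crash dump is ready; write it to disk so it
// can be opened in Nsight Graphics later.
static void OnGpuCrashDump(const void* dump, const uint32_t dumpSize, void* /*userData*/)
{
    if (std::FILE* file = std::fopen("app.nv-gpudmp", "wb"))
    {
        std::fwrite(dump, 1, dumpSize, file);
        std::fclose(file);
    }
}

// Step 2: enable crash dump collection before the D3D12 device is created.
void EnableAftermathCrashDumps()
{
    const GFSDK_Aftermath_Result result = GFSDK_Aftermath_EnableGpuCrashDumps(
        GFSDK_Aftermath_Version_API,
        GFSDK_Aftermath_GpuCrashDumpWatchedApiFlags_DX,
        GFSDK_Aftermath_GpuCrashDumpFeatureFlags_Default,
        OnGpuCrashDump,  // GPU crash dump callback
        nullptr,         // shader debug info callback (optional)
        nullptr,         // crash dump description callback (optional)
        nullptr,         // marker resolve callback (optional)
        nullptr);        // user data
    assert(GFSDK_Aftermath_SUCCEED(result));
}

// Step 3: after device creation, choose the features you need, including
// event markers and shader error reporting.
void InitializeAftermathForDevice(ID3D12Device* device)
{
    const uint32_t flags = GFSDK_Aftermath_FeatureFlags_EnableMarkers |
                           GFSDK_Aftermath_FeatureFlags_GenerateShaderDebugInfo |
                           GFSDK_Aftermath_FeatureFlags_EnableShaderErrorReporting;
    const GFSDK_Aftermath_Result result =
        GFSDK_Aftermath_DX12_Initialize(GFSDK_Aftermath_Version_API, flags, device);
    assert(GFSDK_Aftermath_SUCCEED(result));
}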

Tip: The Aftermath API provides a simple and lightweight solution for inserting event markers on the GPU timeline. To keep CPU overhead to a minimum, you can set dataSize=0, which instructs Aftermath to store only the marker pointer value and rely on the application to manage and resolve the marker data itself.
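A minimal sketch of that pattern, assuming the application keeps its own ID-to-name table (the table, helper name, and single-threaded bookkeeping are illustrative and not part of the SDK):

#include <cstdint>
#include <map>
#include <string>
#include "GFSDK_Aftermath.h"

// Hand Aftermath only an integer ID (size 0, so nothing is copied) and keep
// the ID-to-name mapping in the application. When a crash dump is decoded,
// the stored pointer value is this ID, which the application translates back
// into a human-readable name.
static std::map<std::uint64_t, std::string> g_markerNames;
static std::uint64_t g_nextMarkerId = 1;

void SetLightweightMarker(GFSDK_Aftermath_ContextHandle context, const std::string& name)
{
    const std::uint64_t id = g_nextMarkerId++;
    g_markerNames[id] = name;

    // dataSize = 0: Aftermath records only the pointer value; no copy is made.
    GFSDK_Aftermath_SetEventMarker(
        context, reinterpret_cast<const void*>(id), 0);
}

If your SDK version also provides the optional marker resolve callback, the same lookup can be performed there instead of at dump-decoding time.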

Download NVIDIA Nsight Developer Tools

Download all of the new Nsight Developer Tools announced at SIGGRAPH.

Dive deeper or ask questions in the Developer Tools forums, or learn more about graphics development with Nsight Tools at SIGGRAPH 2023.

Categories
Misc

Scale XR Workflows with NVIDIA CloudXR Suite

NVIDIA is providing developers with an advanced platform to create scalable, branded, custom extended reality (XR) products with the new NVIDIA CloudXR Suite….

NVIDIA is providing developers with an advanced platform to create scalable, branded, custom extended reality (XR) products with the new NVIDIA CloudXR Suite.

Built on a new architecture, NVIDIA CloudXR Suite is a major step forward in scaling the XR ecosystem. It provides a platform for developers, professionals, and enterprise teams to flexibly orchestrate and scale XR workloads across operating systems, including Windows virtual machines and Linux-based systems such as containers.

With the NVIDIA CloudXR streaming stack, users can build flexible, high-performance cloud solutions capable of streaming the most demanding immersive experiences. Teams can also access NVIDIA streaming technology to effectively manage the quality of the streaming across large public and private networks, including the internet.

 Figure shows a diagram with different components labeled CloudXR Essentials, CloudXR Server Extensions, and CloudXR Client Extensions. These components are part of the new architecture for the CloudXR Suite.
Figure 1. NVIDIA CloudXR Suite consists of three components: CloudXR Essentials, CloudXR Server Extensions, and CloudXR Client Extensions

Immersive content developers face a challenge in supporting both tethered devices, driven by high-powered graphics cards, and mobile devices with limited graphics power.

Using NVIDIA CloudXR, developers can create high-quality versions of applications—built to take advantage of powerful GPUs—and still target users with mobile XR devices using NVIDIA CloudXR streaming. 

Additionally, cloud service providers (CSPs), orchestrators, and system integrators can extend GPU services with interactive graphics to support next-generation XR applications. 

NVIDIA CloudXR Suite is composed of three components: CloudXR Essentials, CloudXR Server Extensions, and CloudXR Client Extensions. 

CloudXR Essentials provides the underlying streaming layer, complete with new improvements such as 5G L4S optimizations, QoS algorithms, and enhanced logging tools. Essentials also includes the SteamVR plug-in, along with sample clients and a new server-side API that can be directly integrated into XR applications. This removes the need for a separate XR runtime.

CloudXR Server Extensions extend server-side interfaces with a source code addition to Collabora's Monado OpenXR runtime. Together, the new CloudXR Server API contained in CloudXR Essentials and the OpenXR API represent the gateway to scaling XR distribution for orchestration partners.

CloudXR Client Extensions empower users to build custom CloudXR client applications. The first offering is a Unity plug-in for CloudXR. With this plug-in, Unity app developers can more easily build applications with branded custom interfaces and lobbies before connecting to the CloudXR streaming server. 

“The new CloudXR Server Extensions bring more opportunities for software developers to use Monado’s OpenXR runtime to build next generation immersive experiences,” said Frédéric Plourde, XR Lead at Collabora.

High-quality XR streaming for clients

PureWeb helps organizations across industries adopt real-time 3D technology to improve operations. Using NVIDIA CloudXR, PureWeb shows customers how to deliver complex, immersive XR workloads at scale, on their current devices.

“We want to provide our customers with access to the GPU resources and streaming technologies they need, so they can share immersive experiences without worrying about not having enough computing resources,” said Chris Jarabek, VP of Product Development at PureWeb. “With this new advancement through CloudXR Suite, we can better scale with OpenXR and build with that.” 

The team at Innoactive has integrated NVIDIA CloudXR into their VR application deployment platform, Innoactive Portal. Many Innoactive customers are using the platform to provide high-quality immersive training to users, wherever they are.

“Many of our customers are building applications and plan to stream them to all-in-one headsets or other mobile XR devices,” said Daniel Seidl, CEO of Innoactive. “With the Unity plug-in, our customers can now stream from AWS and Microsoft Azure with CloudXR, making XR streaming even more accessible from the cloud.”

Learn more and download NVIDIA CloudXR Suite

Categories
Offsites

AdaTape: Foundation model with adaptive computation and dynamic read-and-write

Adaptive computation refers to the ability of a machine learning system to adjust its behavior in response to changes in the environment. While conventional neural networks have a fixed function and computation capacity, i.e., they spend the same number of FLOPs for processing different inputs, a model with adaptive and dynamic computation modulates the computational budget it dedicates to processing each input, depending on the complexity of the input.

Adaptive computation in neural networks is appealing for two key reasons. First, the mechanism that introduces adaptivity provides an inductive bias that can play a key role in solving some challenging tasks. For instance, enabling different numbers of computational steps for different inputs can be crucial in solving arithmetic problems that require modeling hierarchies of different depths. Second, it gives practitioners the ability to tune the cost of inference through greater flexibility offered by dynamic computation, as these models can be adjusted to spend more FLOPs processing a new input.

Neural networks can be made adaptive by using different functions or computation budgets for various inputs. A deep neural network can be thought of as a function that outputs a result based on both the input and its parameters. To implement adaptive function types, a subset of parameters are selectively activated based on the input, a process referred to as conditional computation. Adaptivity based on the function type has been explored in studies on mixture-of-experts, where the sparsely activated parameters for each input sample are determined through routing.

Another area of research in adaptive computation involves dynamic computation budgets. Unlike in standard neural networks, such as T5, GPT-3, PaLM, and ViT, whose computation budget is fixed for different samples, recent research has demonstrated that adaptive computation budgets can improve performance on tasks where transformers fall short. Many of these works achieve adaptivity by using dynamic depth to allocate the computation budget. For example, the Adaptive Computation Time (ACT) algorithm was proposed to provide an adaptive computational budget for recurrent neural networks. The Universal Transformer extends the ACT algorithm to transformers by making the computation budget dependent on the number of transformer layers used for each input example or token. Recent studies, like PonderNet, follow a similar approach while improving the dynamic halting mechanisms.

In the paper “Adaptive Computation with Elastic Input Sequence”, we introduce a new model that utilizes adaptive computation, called AdaTape. This model is a Transformer-based architecture that uses a dynamic set of tokens to create elastic input sequences, providing a unique perspective on adaptivity in comparison to previous works. AdaTape uses an adaptive tape reading mechanism to determine a varying number of tape tokens that are added to each input based on the input's complexity. AdaTape is very simple to implement, provides an effective knob to increase accuracy when needed, and is much more efficient than other adaptive baselines because it injects adaptivity directly into the input sequence instead of the model depth. Finally, AdaTape offers better performance on standard tasks, like image classification, as well as algorithmic tasks, while maintaining a favorable quality and cost tradeoff.

Adaptive computation transformer with elastic input sequence

AdaTape uses both the adaptive function types and a dynamic computation budget. Specifically, for a batch of input sequences after tokenization (e.g., a linear projection of non-overlapping patches from an image in the vision transformer), AdaTape uses a vector representing each input to dynamically select a variable-sized sequence of tape tokens.

AdaTape uses a bank of tokens, called a “tape bank”, to store all the candidate tape tokens that interact with the model through the adaptive tape reading mechanism. We explore two different methods for creating the tape bank: an input-driven bank and a learnable bank.

The general idea of the input-driven bank is to extract a bank of tokens from the input while employing a different approach than the original model tokenizer for mapping the raw input to a sequence of input tokens. This enables dynamic, on-demand access to information from the input that is obtained using a different point of view, e.g., a different image resolution or a different level of abstraction.

In some cases, tokenization at a different level of abstraction is not possible, so an input-driven tape bank is not feasible, for example, when it's difficult to further split each node in a graph transformer. To address this issue, AdaTape offers a more general approach for generating the tape bank by using a set of trainable vectors as tape tokens. This approach is referred to as the learnable bank and can be viewed as an embedding layer from which the model can dynamically retrieve tokens based on the complexity of the input example. The learnable bank enables AdaTape to adjust its computation budget dynamically based on the complexity of each input example: more complex examples retrieve more tokens from the bank, which lets the model not only use the knowledge stored in the bank, but also spend more FLOPs processing it, since the input is now larger.

Finally, the selected tape tokens are appended to the original input and fed to the following transformer layers. For each transformer layer, the same multi-head attention is used across all input and tape tokens. However, two different feed-forward networks (FFN) are used: one for all tokens from the original input and the other for all tape tokens. We observed slightly better quality by using separate feed-forward networks for input and tape tokens.

An overview of AdaTape. For different samples, we pick a variable number of different tokens from the tape bank. The tape bank can be driven from input, e.g., by extracting some extra fine-grained information or it can be a set of trainable vectors. Adaptive tape reading is used to recursively select different sequences of tape tokens, with variable lengths, for different inputs. These tokens are then simply appended to inputs and fed to the transformer encoder.
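To make the mechanism in the figure concrete, here is a small, self-contained toy sketch written in plain C++. The dot-product scoring, softmax-style normalization, and cumulative budget threshold are illustrative stand-ins for the paper's adaptive tape reading, not its exact formulation; in the real model, the selection is learned jointly with the rest of the network.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

static float Dot(const Vec& a, const Vec& b)
{
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Returns the input sequence with a variable number of tape tokens appended.
// Inputs whose scores are spread over many bank tokens pull in more of them,
// so "harder" examples end up with longer sequences and more compute.
std::vector<Vec> AppendTapeTokens(const std::vector<Vec>& inputTokens,
                                  const Vec& query,          // one vector per example
                                  const std::vector<Vec>& tapeBank,
                                  float budget)              // e.g., 0.9
{
    // Turn query-bank scores into a normalized weight per bank token.
    std::vector<float> weights(tapeBank.size());
    float maxScore = -1e30f, sum = 0.0f;
    for (const Vec& token : tapeBank) maxScore = std::max(maxScore, Dot(query, token));
    for (std::size_t i = 0; i < tapeBank.size(); ++i)
    {
        weights[i] = std::exp(Dot(query, tapeBank[i]) - maxScore);
        sum += weights[i];
    }

    // Greedily append the highest-weight tape tokens until the budget is covered.
    std::vector<std::size_t> order(tapeBank.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return weights[a] > weights[b]; });

    std::vector<Vec> output = inputTokens;
    float covered = 0.0f;
    for (std::size_t idx : order)
    {
        if (covered >= budget) break;
        output.push_back(tapeBank[idx]);
        covered += weights[idx] / sum;
    }
    return output;
}

With an input-driven bank, tapeBank would hold tokens extracted from the input itself (for example, patches at a different resolution); with a learnable bank, it would hold trainable vectors shared across examples.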

AdaTape provides helpful inductive bias

We evaluate AdaTape on parity, a very challenging task for the standard Transformer, to study the effect of inductive biases in AdaTape. In the parity task, given a sequence of 1s, 0s, and -1s, the model has to predict the evenness or oddness of the number of 1s in the sequence. For example, the sequence [1, 0, -1, 1] contains two 1s, so its label is "even". Parity is the simplest non-counter-free or periodic regular language, but, perhaps surprisingly, the task is unsolvable by the standard Transformer.

Evaluation on the parity task. The standard Transformer and Universal Transformer were unable to perform this task, both showing performance at the level of a random guessing baseline.

Despite being evaluated on short, simple sequences, both the standard Transformer and the Universal Transformer were unable to perform the parity task, as they cannot maintain a counter within the model. However, AdaTape outperforms all baselines because it incorporates a lightweight recurrence within its input selection mechanism, providing an inductive bias that enables the implicit maintenance of a counter, which is not possible in standard Transformers.

Evaluation on image classification

We also evaluate AdaTape on the image classification task. To do so, we trained AdaTape on ImageNet-1K from scratch. The figure below shows the accuracy of AdaTape and the baseline methods, including A-ViT and the Universal Transformer ViT (UViT and U2T), versus their speed (measured as the number of images processed per core per second). In terms of the quality and cost tradeoff, AdaTape performs much better than the alternative adaptive transformer baselines. In terms of efficiency, larger AdaTape models (in terms of parameter count) are faster than smaller baselines. Such results are consistent with findings from previous work showing that adaptive model-depth architectures are not well suited for many accelerators, such as TPUs.

We evaluate AdaTape by training on ImageNet from scratch. For A-ViT, we not only report their results from the paper but also re-implement A-ViT by training from scratch, i.e., A-ViT(Ours).

A study of AdaTape’s behavior

In addition to its performance on the parity task and ImageNet-1K, we also evaluated the token selection behavior of AdaTape with an input-driven bank on the JFT-300M validation set. To better understand the model’s behavior, we visualized the token selection results on the input-driven bank as heatmaps, where lighter colors mean that position is more frequently selected. The heatmaps reveal that AdaTape more frequently picks the central patches. This aligns with our prior knowledge, as central patches are typically more informative — especially in the context of datasets with natural images, where the main object is in the middle of the image. This result highlights the intelligence of AdaTape, as it can effectively identify and prioritize more informative patches to improve its performance.

We visualize the tape token selection heatmap of AdaTape-B/32 (left) and AdaTape-B/16 (right). The hotter / lighter color means the patch at this position is more frequently selected.

Conclusion

AdaTape is characterized by elastic sequence lengths generated by the adaptive tape reading mechanism. This also introduces a new inductive bias that enables AdaTape to have the potential to solve tasks that are challenging for both standard transformers and existing adaptive transformers. By conducting comprehensive experiments on image recognition benchmarks, we demonstrate that AdaTape outperforms standard transformers and adaptive architecture transformers when computation is held constant.

Acknowledgments

One of the authors of this post, Mostafa Dehghani, is now at Google DeepMind.

Categories
Misc

Shutterstock Brings Generative AI to 3D Scene Backgrounds With NVIDIA Picasso

Picture this: Creators can quickly create and customize 3D scene backgrounds with the help of generative AI, thanks to cutting-edge tools from Shutterstock. The visual-content provider is building services using NVIDIA Picasso — a cloud-based foundry for developing generative AI models for visual design. The work incorporates Picasso’s latest feature — announced today during NVIDIA…

Categories
Misc

Content Creation ‘In the NVIDIA Studio’ Gets Boost From New Professional GPUs, AI Tools, Omniverse and OpenUSD Collaboration Features

AI and accelerated computing were in the spotlight at SIGGRAPH, the world's largest gathering of computer graphics experts, as NVIDIA founder and CEO Jensen Huang used his keynote address to announce updates to NVIDIA Omniverse, a platform for building and connecting 3D tools and applications, as well as acceleration for Universal Scene Description (known as OpenUSD), the open and extensible ecosystem for 3D worlds.