Categories
Misc

Access the Latest in Vision AI Model Development Workflows with NVIDIA TAO Toolkit 5.0

NVIDIA TAO Toolkit provides a low-code AI framework to accelerate vision AI model development suitable for all skill levels, from novice beginners to expert data scientists. With the TAO Toolkit, developers can use the power and efficiency of transfer learning to achieve state-of-the-art accuracy and production-class throughput in record time with adaptation and optimization.  

NVIDIA released TAO Toolkit 5.0, bringing groundbreaking features to enhance AI model development. The new features include an open source architecture, transformer-based pretrained models, AI-assisted data annotation, and the capability to deploy models on any platform.

Release highlights include:

  • Model export in open ONNX format to support deployment on GPUs, CPUs, MCUs, neural accelerators, and more.
  • Advanced Vision Transformer training for better accuracy and robustness against image corruption and noise.
  • New AI-assisted data annotation, accelerating labeling tasks for segmentation masks.
  • Support for new computer vision tasks and pretrained models for optical inspection, such as optical character detection and Siamese Network models.
  • Open source availability for customizable solutions, faster development, and integration.

Get started

This post has been revised from its original version to provide accurate information reflecting the TAO Toolkit 5.0 release.

Graphic of the NVIDIA TAO Toolkit pipeline for an AI model: data, train, and deploy.
Figure 1. NVIDIA TAO Toolkit workflow diagram

Deploy NVIDIA TAO models on any platform, anywhere 

NVIDIA TAO Toolkit 5.0 supports model export in ONNX. This makes it possible to deploy a model trained with NVIDIA TAO Toolkit on any computing platform—GPUs, CPUs, MCUs, DLAs, FPGAs—at the edge or in the cloud. NVIDIA TAO Toolkit simplifies the model training process and optimizes the model for inference throughput, powering AI across hundreds of billions of devices.  
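
To make the ONNX path concrete, the snippet below runs an exported model with ONNX Runtime on a CPU. This is a minimal sketch, not TAO Toolkit code: the file name, input resolution, and output handling are placeholder assumptions you would replace with the values from your own exported model.

```python
import numpy as np
import onnxruntime as ort

# Load the exported model on CPU; swap in CUDAExecutionProvider to target a GPU.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

# Dummy NCHW batch; replace with real preprocessed frames at the model's resolution.
dummy = np.random.rand(1, 3, 544, 960).astype(np.float32)
outputs = session.run(None, {input_name: dummy})
print([o.shape for o in outputs])
```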

TAO Toolkit architecture workflow diagram.
Figure 2. NVIDIA TAO Toolkit architecture

Edge Impulse, a platform for building, refining, and deploying machine learning models and algorithms, integrated TAO Toolkit into their edge AI platform. With this integration, Edge Impulse now offers advanced vision AI capabilities and models that complement its current offerings. Developers can use the platform to build production AI with TAO for any edge device. Learn more about the integration in a blog post from Edge Impulse.

Video 1. Training an AI model with the Edge Impulse platform leveraging NVIDIA TAO and deployed on a Cortex-M7 MCU

STMicroelectronics, a global leader in embedded microcontrollers, integrated NVIDIA TAO Toolkit into its STM32Cube AI developer workflow. This puts the latest AI capabilities into the hands of millions of STMicroelectronics developers. It provides, for the first time, the ability to integrate sophisticated AI into widespread IoT and edge use cases powered by the STM32Cube. 

Now, with NVIDIA TAO Toolkit, even the most novice AI developers can optimize and quantize AI models to run on STM32 MCU within the microcontroller’s compute and memory budget. Developers can also bring their own models and fine-tune using TAO Toolkit. More information about this work is captured in the following demo. Learn more about the project on the STMicroelectronics GitHub page.

Video 2. Learn how to deploy a model optimized with TAO Toolkit on an STM microcontroller

While TAO Toolkit models can run on any platform, these models achieve the highest throughput on NVIDIA GPUs using TensorRT for inference. On CPUs, these models use ONNX-RT for inference. The script and recipe to replicate these numbers will be provided once the software becomes available.

Model | NVIDIA Jetson Orin Nano 8 GB | NVIDIA Jetson AGX Orin 64 GB | T4 | A2 | A100 | L4 | H100
PeopleNet | 112 | 679 | 429 | 242 | 3,264 | 797 | 7,062
DINO – FAN-S | 3.1 | 11.2 | 20.4 | 11.7 | 121 | 44 | 213
SegFormer – MiT | 1.3 | 4.8 | 9.4 | 5.8 | 62.2 | 17.8 | 108
OCRNet | 935 | 3,876 | 3,649 | 2,094 | 28,300 | 8,036 | 55,700
EfficientDet | 61 | 227 | 303 | 184 | 1,521 | 522 | 2,428
2D Body Pose | 136 | 557 | 593 | 295 | 4,140 | 1,010 | 7,812
3D Action Recognition | 52 | 212 | 269 | 148 | 1,658 | 529 | 2,708
Table 1. Performance comparison (in FPS) of several NVIDIA TAO Toolkit vision models, including new Vision Transformer models on NVIDIA GPUs

AI-assisted data annotation and management 

Data annotation remains an expensive and time-consuming process for all AI projects. This is especially true for CV tasks like segmentation, which require generating a pixel-level segmentation mask around the object. Generally, annotating segmentation masks costs 10x more than annotating for object detection or classification.

It is faster and less expensive to annotate segmentation masks with new AI-assisted annotation capabilities using TAO Toolkit 5.0. Now you can use the weakly supervised segmentation architecture, Mask Auto Labeler (MAL) to aid in segmentation annotation and in fixing and tightening bounding boxes for object detection. Loose bounding boxes around an object in ground truth data can lead to suboptimal detection results. But, with AI-assisted annotation, you can tighten your bounding boxes over objects, leading to more accurate models.  

GIF showing the concept of auto-labeling of cars and a person holding a grocery basket, using box-cropped images as inputs, and generating the segmentation mask.
Figure 3. TAO Toolkit auto labeling

MAL is a transformer-based, mask auto-labeling framework for instance segmentation using only box annotations. MAL takes box-cropped images as inputs and conditionally generates the mask pseudo-labels. It uses COCO annotation format for both input and output labels.  

MAL significantly reduces the gap between auto labeling and human annotation for mask quality. Instance segmentation models trained using the MAL-generated masks can nearly match the performance of the fully supervised counterparts, retaining up to 97.4% performance of fully supervised models.  

Architecture diagram for Mask Auto Labeler (MAL).
Figure 4. MAL network architecture

When training the MAL network, a task network and a teacher network (sharing the same transformer structure) work together to achieve class-agnostic self-training. This enables refining the prediction masks with conditional random field (CRF) loss and multi-instance learning (MIL) loss.    

TAO Toolkit uses MAL in both the auto-labeling pipeline and data augmentation pipeline. Specifically, users can generate pseudo-masks on the spatially augmented images (sheared or rotated, for example), and refine and tighten the corresponding bounding boxes using the generated masks. 
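
The box-tightening step can be illustrated with a few lines of NumPy: given a binary mask (for example, a MAL pseudo-label rendered to pixels), the tight axis-aligned box is simply the extent of the foreground pixels. This is a hedged sketch of the idea, not the TAO Toolkit pipeline itself.

```python
import numpy as np

def tighten_bbox(mask: np.ndarray):
    """Return the (x_min, y_min, x_max, y_max) box that tightly encloses the
    foreground pixels of a binary mask, or None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Toy example: a small blob; the tight box shrinks a loose hand-drawn label.
mask = np.zeros((100, 100), dtype=np.uint8)
mask[30:60, 40:70] = 1
print(tighten_bbox(mask))  # (40, 30, 69, 59)
```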

State-of-the-art Vision Transformers 

Transformers have become the standard architecture in NLP, largely because of self-attention. They have also gained popularity for a range of vision AI tasks. In general, transformer-based models can outperform traditional CNN-based models due to their robustness, generalizability, and ability to perform parallelized processing of large-scale inputs. All of this increases training efficiency, provides better robustness against image corruption and noise, and generalizes better on unseen objects.  

TAO Toolkit 5.0 features several state-of-the-art (SOTA) Vision Transformers for popular CV tasks, as detailed below.

Fully Attentional Network  

Fully Attentional Network (FAN) is a transformer-based family of backbones from NVIDIA Research that achieves SOTA in robustness against various corruptions. This family of backbones can easily generalize to new domains and be more robust to noise, blur, and more. 

A key design behind the FAN block is the attentional channel processing module that leads to robust representation learning. FAN can be used for image classification tasks as well as downstream tasks such as object detection and segmentation.  

Activation heat maps for a corrupted image (left), computed with ResNet50 (center) and FAN-Small (right).
Figure 5. An activation heat map on a corrupted image for ResNet50 (center) compared to FAN-Small (right)

The FAN family supports four backbones, as shown in Table 2.

Model  # of parameters/FLOPs  Accuracy 
FAN-Tiny  7 M/3.5 G  71.7 
FAN-Small  26 M/6.7 G  77.5 
FAN-Base  50 M/11.3 G  79.1 
FAN-Large  77 M/16.9 G  81.0 
Table 2. FAN backbones with size and accuracy

Global Context Vision Transformer 

Global Context Vision Transformer (GC-ViT) is a novel architecture from NVIDIA Research that achieves very high accuracy and compute efficiency. GC-ViT addresses the lack of inductive bias in Vision Transformers. It achieves better results on ImageNet with a smaller number of parameters through the use of local self-attention. 

Local self-attention paired with global context self-attention can effectively and efficiently model both long and short-range spatial interactions. Figure 6 shows the GC-ViT model architecture. For more details, see Global Context Vision Transformers.

Diagram of the architecture for GC-ViT.
Figure 6. GC-ViT model architecture

As shown in Table 3, the GC-ViT family contains six backbones, ranging from GC-ViT-xxTiny (compute efficient) to GC-ViT-Large (very accurate). GC-ViT-Large models can achieve Top-1 accuracy of 85.6 on the ImageNet-1K dataset for image classification tasks. This architecture can also be used as the backbone for other CV tasks like object detection and semantic and instance segmentation.  

Model  # of parameters/FLOPs  Accuracy 
GC-ViT-xxTiny  12 M/2.1 G  79.6 
GC-ViT-xTiny  20 M/2.6 G  81.9 
GC-ViT-Tiny  28 M/4.7 G  83.2 
GC-ViT-Small  51 M/8.5 G  83.9 
GC-ViT-Base  90 M/14.8 G  84.4 
GC-ViT-Large  201 M/32.6 G  85.6 
Table 3. GC-ViT backbones with size and accuracy 

DINO 

DINO (DETR with improved denoising anchor boxes) is the newest generation of detection transformers (DETR). It achieves a faster training convergence time than its predecessors: Deformable-DETR (D-DETR) requires at least 50 epochs to converge, while DINO can converge in 12 epochs on the COCO dataset. It also achieves higher accuracy than D-DETR.

DINO achieves faster convergence through the use of denoising during training, which helps the bipartite matching process at the proposal generation stage. The training convergence of DETR-like models is slow due to the instability of bipartite matching. Bipartite matching removed the need for handcrafted and compute-heavy NMS operations. However, it often required much more training because incorrect ground truths were matched to the predictions during bipartite matching. 

To remedy such a problem, DINO introduced noised positive ground-truth boxes and negative ground-truth boxes to handle “no object” scenarios. As a result, training converges very quickly for DINO. For more information, see DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection.

Diagram of DINO architecture, including multi-scale features, positional embeddings, query selection, encoder layers, matching, CDN, and decoder layers.
Figure 7. DINO architecture

DINO in TAO Toolkit is flexible and can be combined with various backbones, from traditional CNNs such as ResNets to transformer-based backbones like FAN and GC-ViT. Table 4 shows accuracy on the COCO dataset for various versions of DINO alongside the popular YOLOv7. For more details, see YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors.

Model | Backbone | AP | AP50 | AP75 | APS | APM | APL | Params
YOLOv7 | N/A | 51.2 | 69.7 | 55.5 | 35.2 | 56.0 | 66.7 | 36.9M
DINO | ResNet50 | 48.8 | 66.9 | 53.4 | 31.8 | 51.8 | 63.4 | 46.7M
DINO | FAN-Small | 53.1 | 71.6 | 57.8 | 35.2 | 56.4 | 68.9 | 48.3M
DINO | GC-ViT-Tiny | 50.7 | 68.9 | 55.3 | 33.2 | 54.1 | 65.8 | 46.9M
Table 4. Accuracy of DINO (with various backbones) and YOLOv7 on the COCO dataset

SegFormer 

SegFormer is a lightweight, transformer-based semantic segmentation model. Its decoder is made of lightweight MLP layers. It avoids using positional encoding (commonly used by transformers), which makes inference efficient at different resolutions.

Adding a FAN backbone to the SegFormer MLP decoder results in a highly robust and efficient semantic segmentation model. FAN-Base hybrid + SegFormer was the winning architecture at the Robust Vision Challenge 2022 for semantic segmentation.

An image of a man walking across the street as a noisy input (left) and the same image with SegFormer with FAN (right).
Figure 8. SegFormer with FAN prediction (right) on a noisy input image (left)

Model   Dataset  Mean IOU (%)  Retention rate (robustness) (%) 
PSPNet  Cityscapes validation  78.8  43.8  
SegFormer – FAN-S-Hybrid  Cityscapes validation  81.5  81.5 
Table 5. Robustness of SegFormer compared to PSPNet

See how SegFormer generates robust semantic segmentation while maintaining high efficiency for accelerated autonomous vehicle development in the following video. 

Video 3. NVIDIA DRIVE Labs episode, Enhancing AI Segmentation Models for Autonomous Vehicle Safety

CV tasks beyond object detection and segmentation 

NVIDIA TAO Toolkit accelerates a wide range of CV tasks beyond traditional object detection and segmentation. The new character detection and recognition models in TAO Toolkit 5.0 enable developers to extract text from images and documents. This automates document conversion and accelerates use cases in industries like insurance and finance.  

Detecting anomalies in images is useful when the object being classified varies greatly, such that training with all the variations is impossible. In industrial inspection, for example, a defect can come in any form. Using a simple classifier could result in many missed defects if the defect has not been previously seen by the training data. 

For such use cases, comparing the test object directly against a golden reference would result in better accuracy. TAO Toolkit 5.0 features a Siamese neural network in which the model calculates the difference between the object under test and a golden reference to classify if the object is defective.  
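
The comparison step at the heart of this approach can be sketched in a few lines: embed the unit under test and the golden reference, then flag a defect when their feature distance exceeds a threshold. The embeddings, distance metric, and threshold below are illustrative assumptions, not the TAO Toolkit Siamese model itself.

```python
import numpy as np

def is_defective(test_embedding: np.ndarray, golden_embedding: np.ndarray,
                 threshold: float = 1.0) -> bool:
    """Compare feature embeddings of the unit under test and the golden
    reference; a large distance is flagged as a defect."""
    distance = np.linalg.norm(test_embedding - golden_embedding)
    return bool(distance > threshold)

# Toy embeddings standing in for the outputs of the two Siamese branches.
golden = np.array([0.9, 0.1, 0.4, 0.7])
good_unit = golden + np.random.normal(scale=0.02, size=4)
scratched_unit = golden + np.array([0.0, 1.5, -0.8, 0.3])

print(is_defective(good_unit, golden))       # False: close to the reference
print(is_defective(scratched_unit, golden))  # True: far from the reference
```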

Automate training using AutoML for hyperparameter optimization 

Automated machine learning (AutoML) automates the manual task of finding the best models and hyperparameters for the desired KPI on a given dataset. It can algorithmically derive the best model and abstract away much of the complexity of AI model creation and optimization.  

AutoML in TAO Toolkit is fully configurable for automatically optimizing the hyperparameters of a model. It caters to both AI experts and nonexperts. For nonexperts, the guided Jupyter Notebook provides a simple, efficient way to create an accurate AI model. 

For experts, TAO Toolkit gives you full control of which hyperparameters to tune and which algorithm to use for sweeps. TAO Toolkit currently supports two optimization algorithms: Bayesian and Hyperband optimization. These algorithms can sweep across a range of hyperparameters to find the best combination for a given dataset. 
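
To make the idea of a sweep concrete, here is a minimal random-search sketch over a toy search space. It is not TAO Toolkit's Bayesian or Hyperband implementation; the search space, trial budget, and scoring function are placeholder assumptions.

```python
import random

# Hypothetical search space for illustration only.
search_space = {
    "learning_rate": [1e-4, 5e-4, 1e-3, 5e-3],
    "batch_size": [8, 16, 32],
    "backbone": ["fan_small", "gcvit_tiny", "resnet50"],
}

def train_and_evaluate(config):
    """Placeholder: run a short training job and return a validation metric."""
    return random.random()  # stand-in for mAP or accuracy

best_config, best_score = None, float("-inf")
for _ in range(20):  # fixed trial budget
    config = {name: random.choice(values) for name, values in search_space.items()}
    score = train_and_evaluate(config)
    if score > best_score:
        best_config, best_score = config, score

print("best configuration:", best_config, "score:", best_score)
```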

AutoML is supported for a wide range of CV tasks, including several new Vision Transformers such as DINO, D-DETR, SegFormer, and more. Table 6 shows the full list of supported networks (bold items are new to TAO Toolkit 5.0). 

Image classification: FAN, GC-ViT, ResNet, EfficientNet, DarkNet, MobileNet
Object detection: DINO, D-DETR, YoloV3/V4/V4-Tiny, EfficientDet, RetinaNet, FasterRCNN, DetectNet_v2, SSD/DSSD
Segmentation: SegFormer, UNET, MaskRCNN
Other: LPRNet
Table 6. Models supported by AutoML in TAO Toolkit, including several new Vision Transformer models (bold items are new to TAO Toolkit 5.0)

REST APIs for workflow integration 

TAO Toolkit is modular and cloud-native, meaning it is available as containers and can be deployed and managed using Kubernetes. TAO Toolkit can be deployed as a self-managed service on any public or private cloud, DGX, or workstation. TAO Toolkit provides well-defined REST APIs, making it easy to integrate into your development workflow. Developers can call the API endpoints for all training and optimization tasks. These API endpoints can be called from any application or user interface, which can trigger training jobs remotely. 
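
The snippet below sketches what driving such a service from Python might look like with the requests library. The base URL, endpoint paths, and payload fields are hypothetical placeholders for illustration only; consult the TAO Toolkit API documentation for the actual endpoints.

```python
import requests

BASE_URL = "http://tao-service.example.com/api/v1"  # hypothetical host and prefix

# Hypothetical: create a training job for a detection model.
resp = requests.post(
    f"{BASE_URL}/jobs",
    json={"task": "object_detection", "model": "dino", "dataset_id": "my-dataset"},
    timeout=30,
)
resp.raise_for_status()
job_id = resp.json()["id"]  # hypothetical response field

# Hypothetical: poll the job status until it completes.
status = requests.get(f"{BASE_URL}/jobs/{job_id}", timeout=30).json()["status"]
print(job_id, status)
```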

REST APIs workflow in TAO Toolkit
Figure 9. TAO Toolkit architecture for cloud native deployment

Better inference optimization 

To simplify productization and increase inference throughput, TAO Toolkit provides several turnkey performance optimization techniques. These include model pruning, lower precision quantization, and TensorRT optimization, which can combine to deliver a 4x to 8x performance boost, compared to a comparable model from public model zoos. 

Chart showing performance comparison between TAO Toolkit optimized and public models on a wide range of GPUs
Figure 10. Performance comparison between TAO Toolkit optimized and public models on a wide range of GPUs

Open and flexible, with better support 

An AI model predicts output based on complex algorithms. This can make it difficult to understand how the system arrived at its decision and challenging to debug, diagnose, and fix errors. Explainable AI (XAI) aims to address these challenges by providing insights into how AI models arrive at their decisions. This helps humans understand the reasoning behind the AI output and makes it easier to diagnose and fix errors. This transparency can help to build trust in AI systems. 

To help with transparency and explainability, TAO Toolkit will now be available as open source. Developers will be able to view feature maps from internal layers, as well as plot activation heat maps to better understand the reasoning behind AI predictions. In addition, access to the source code gives developers the flexibility to create customized AI, improve debugging, and increase trust in their models.  

NVIDIA TAO Toolkit is enterprise-ready and available through NVIDIA AI Enterprise (NVAIE). NVAIE provides companies with business-critical support, access to NVIDIA AI experts, and priority security fixes. Join NVAIE to get support from AI experts.  

Integration with cloud services 

NVIDIA TAO Toolkit 5.0 is integrated into various AI services that you might already use, such as Google Vertex AI, AzureML, Azure Kubernetes service, Google GKE, and Amazon EKS. 

Graphic showing the TAO framework for various cloud services, including Google Cloud, Microsoft Azure, and more.
Figure 11. TAO Toolkit 5.0 is integrated with various AI services

Summary 

TAO Toolkit offers a platform for any developer, in any service, and on any device to easily transfer-learn their custom models, perform quantization and pruning, manage complex training workflows, and perform AI-assisted annotation with no coding requirements.

Categories
Misc

Improve Accuracy and Robustness of Vision AI Apps with Vision Transformers and NVIDIA TAO

Vision Transformers (ViTs) are taking computer vision by storm, offering incredible accuracy, robust solutions for challenging real-world scenarios, and improved generalizability. The algorithms are playing a pivotal role in boosting computer vision applications and NVIDIA is making it easy to integrate ViTs into your applications using NVIDIA TAO Toolkit and NVIDIA L4 GPUs.

How ViTs are different

ViTs are machine learning models that apply transformer architectures, originally designed for natural language processing, to visual data. They have several advantages over their CNN-based counterparts, including the ability to perform parallelized processing of large-scale inputs. While CNNs use local operations that lack a global understanding of an image, ViTs capture long-range dependencies and global context. They do this effectively by processing images in a parallel, self-attention-based manner, enabling interactions between all image patches. 

Figure 1 shows the processing of an image in a ViT model: the input image is divided into smaller fixed-size patches, which are flattened and transformed into sequences of tokens. These tokens, along with positional encodings, are then fed into a transformer encoder, which consists of multiple layers of self-attention and feed-forward neural networks. 

ViT encoding workflow with an image split in patches, showing positional embeddings and feeds into a transformer encoder.
Figure 1. Processing an image in a ViT model that includes positional embedding and encoder (inspired by the study Transformers for Image Recognition at Scale)

With the self-attention mechanism, each token or patch of an image interacts with other tokens to decide which tokens are important. This helps the model capture relationships and dependencies between tokens and learns which ones are considered important over others. 

For example, with an image of a bird, the model pays more attention to important features, such as the eyes, beak, and feathers rather than the background. This translates into increased training efficiency, enhanced robustness against image corruption and noise, and superior generalization on unseen objects.
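
The patch-and-attend flow can be sketched in NumPy: split the image into fixed-size patches, flatten them into tokens, add a stand-in positional signal, and let every token attend to every other token with scaled dot-product attention. This is a single-head, projection-free toy assuming a 224×224 input and 16×16 patches, not a full ViT.

```python
import numpy as np

def patchify(image, patch):
    """Split an HxWxC image into flattened, non-overlapping patches (tokens)."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    grid = image[:rows * patch, :cols * patch].reshape(rows, patch, cols, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(rows * cols, patch * patch * c)

def self_attention(x):
    """Single-head scaled dot-product self-attention over the token sequence."""
    q, k, v = x, x, x                                  # identity projections for brevity
    scores = q @ k.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over all patches
    return weights @ v                                 # each patch mixes in every other patch

image = np.random.rand(224, 224, 3).astype(np.float32)
tokens = patchify(image, patch=16)                     # (196, 768) for 16x16 patches
tokens += np.random.rand(*tokens.shape) * 0.01         # stand-in positional encoding
print(self_attention(tokens).shape)                    # (196, 768)
```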

Why ViTs are critical for computer vision applications

Real-world environments have diverse and complex visual patterns. The scalability and adaptability of ViTs enable them to handle a wide variety of tasks without the need for task-specific architecture adjustments, unlike CNNs. 

Various examples of noise and imperfections in real-world data such as occlusion, poor lighting, weather conditions (rain, fog), image corruption, etc.
Figure 2.  Different types of imperfect and noisy real-world data create challenges for image analysis.

In the following video, we compare noisy videos running both on a CNN-based model and ViT-based model. In every case, ViTs outperform CNN-based models.

Video 1. Learn about SegFormer, a ViT model that generates robust semantic segmentation while maintaining high efficiency

Integrating ViTs with TAO Toolkit 5.0

TAO, a low-code AI toolkit to build and accelerate vision AI models, now makes it easy to build and integrate ViTs into your applications and AI workflow. Users can quickly get started with a simple interface and config files to train ViTs, without requiring in-depth knowledge of model architectures. 

The TAO Toolkit 5.0 features several advanced ViTs for popular computer vision tasks including the following.

Fully Attentional Network (FAN)

As a transformer-based family of backbones from NVIDIA Research, FAN achieves SOTA robustness against various corruptions as highlighted in Table 1. This family of backbones can easily generalize to new domains, fighting noise and blur. Table 1 shows the accuracy of all FAN models on the ImageNet-1K dataset for both clean and corrupted versions. 

Model # of Params Accuracy (Clean/Corrupted)
FAN-Tiny-Hybrid 7.4M 80.1/57.4
FAN-Small-Hybrid 26.3M 83.5/64.7
FAN-Base-Hybrid 50.4M 83.9/66.4
FAN-Large-Hybrid 76.8M 84.3/68.3
Table 1: Size and accuracy for FAN models

Global Context Vision Transformer (GC-ViT) 

GC-ViT is a novel architecture from NVIDIA Research that achieves very high accuracy and compute efficiency. It addresses the lack of inductive bias in vision transformers. It also achieves better results on ImageNet with a smaller number of parameters through the use of local self-attention, which combined with global self-attention can give much better local and global spatial interactions. 

Model # of Params Accuracy
GC-ViT-xxTiny 12M 79.9
GC-ViT-xTiny 20M 82.0
GC-ViT-Tiny 28M 83.5
GC-ViT-Small 51M 84.3
GC-ViT-Base 90M 85.0
GC-ViT-Large 201M 85.7
Table 2: Size and accuracy for GC-ViT models

Detection transformer with improved denoising anchor (DINO) 

DINO is the newest generation of detection transformers (DETR) with faster training convergence compared to other ViTs and CNNs. DINO in the TAO Toolkit is flexible and can be combined with various backbones from traditional CNNs, such as ResNets, and transformer-based backbones like FAN and GC-ViT. 

The graph shows that a Vision Transformer-based model, DINO, provides much better accuracy compared to CNN-based models.
Figure 3. Comparing the accuracy of DINO with other models

SegFormer

SegFormer is a lightweight and robust transformer-based semantic segmentation model. The decoder is made of lightweight MLP (multilayer perceptron) layers. It avoids using positional encoding (mostly used by transformers), which makes inference efficient at different resolutions. 

Powering efficient transformers with NVIDIA L4 GPUs

NVIDIA L4 GPUs are built for the next wave of vision AI workloads. They’re powered by the NVIDIA Ada Lovelace architecture, which is designed to accelerate transformative AI technologies. 

L4 GPUs are well suited to running ViT workloads, with a high compute capability of 485 TFLOPS of FP8 (with sparsity). FP8 reduces memory pressure compared to larger precisions and dramatically accelerates AI throughput. 

The versatile, energy-efficient L4, with its single-slot, low-profile form factor, is ideal for vision AI deployments, including at edge locations.

Watch this Metropolis Developer Meetup on-demand to learn more about ViTs, NVIDIA TAO Toolkit 5.0, and L4 GPUs.

Categories
Misc

Fin-tastic: 3D Artist Dives Into AI-Powered Oceanic Work This Week ‘In the NVIDIA Studio’

We’re gonna need a bigger boat this week In the NVIDIA Studio as Alessandro Mastronardi, senior artist and programmer at BBC Studios, shares heart-stopping shark videos and renders.

Categories
Misc

NVIDIA DGX Cloud Now Available to Supercharge Generative AI Training

NVIDIA DGX Cloud — which delivers tools that can turn nearly any company into an AI company — is now broadly available, with thousands of NVIDIA GPUs online on Oracle Cloud Infrastructure, as well as NVIDIA infrastructure located in the U.S. and U.K. DGX Cloud was unveiled at NVIDIA's GTC conference in March.

Categories
Misc

NVIDIA Names Melissa Lora to Board of Directors

SANTA CLARA, Calif., July 24, 2023 (GLOBE NEWSWIRE) — NVIDIA today announced that it has named to its board of directors Melissa Lora, who spent three decades as an executive at Taco Bell …

Categories
Misc

Realizing the Power of Real-Time Network Processing with NVIDIA DOCA GPUNetIO

Real-time processing of network traffic can leverage the high degree of parallelism GPUs offer. Optimizing packet acquisition or transmission in these types of applications avoids bottlenecks and enables the overall execution to keep up with high-speed networks. In this context, DOCA GPUNetIO promotes the GPU as an independent component that can perform network and compute tasks without CPU intervention.

This post provides a list of GPU packet processing applications focusing on different and unrelated contexts where NVIDIA DOCA GPUNetIO has been integrated to lower latency and maximize performance.

NVIDIA DOCA GPUNetIO API

NVIDIA DOCA GPUNetIO is one of the new libraries released with the NVIDIA DOCA software framework. The DOCA GPUNetIO library enables direct communication between the NIC and the GPU through one or more CUDA kernels. This removes the CPU from the critical path. 

Using the CUDA device functions in the DOCA GPUNetIO library, a CUDA kernel can send and receive packets directly to and from the GPU without the need for CPU cores or memory. Key features of this library include:

  • GPUDirect Async Kernel-Initiated Network (GDAKIN): Communications over Ethernet; the GPU (CUDA kernel) can directly interact with the network card to send or receive packets in GPU memory (GPUDirect RDMA) without CPU intervention.
  • GPU memory exposure: Combines, within a single function, the basic CUDA memory allocation feature with the GDRCopy library to expose a GPU memory buffer for direct access (read or write) from the CPU without using the CUDA API.
  • Accurate Send Scheduling: From the GPU, it’s possible to schedule the transmission of a burst of packets in the future, associate a timestamp with it, and provide this information to the network card, which in turn takes care of sending the packets at the right time.
  • Semaphores: A useful message-passing object to share information and synchronize across different CUDA kernels or between a CUDA kernel and a CPU thread.

For a deep dive into the principles and benefits of DOCA GPUNetIO, see Inline GPU Packet Processing with NVIDIA DOCA GPUNetIO. For more details about DOCA GPUNetIO API, refer to the DOCA GPUNetIO SDK Programming Guide.

Diagram showing a GPU-centric application in which the GPU can execute both network tasks (receive and send) and processing tasks. The CPU is no longer required.
Figure 1. Layout of the receive process in an NVIDIA DOCA GPUNetIO application. No CPU is involved, as the GPU can independently receive and process network packets

Along with the library, the following NVIDIA DOCA application and NVIDIA DOCA sample show how to use functions and features offered by the library. 

  • NVIDIA DOCA application: A GPU packet processing application that can detect, manage, filter, and analyze UDP, TCP, and ICMP traffic. The application also implements an HTTP over TCP server. With a simple HTTP client (curl or wget, for example), it’s possible to establish a TCP three-way handshake connection and ask for simple HTML pages through HTTP GET requests to the GPU.
  • NVIDIA DOCA sample: GPU send-only example that shows how to use the Accurate Send Scheduling feature (system configuration, functions to use).

DOCA GPUNetIO in real-world applications

DOCA GPUNetIO has been used to enable the NVIDIA Aerial SDK to send and receive using the GPU, removing the CPU from the loop. For more details, see Inline GPU Packet Processing with NVIDIA DOCA GPUNetIO. The sections below provide new examples that successfully use DOCA GPUNetIO to perform GPU packet acquisition with the GDAKIN technique.

NVIDIA Morpheus AI

NVIDIA Morpheus is a performance-oriented application framework that enables cybersecurity developers to create fully optimized AI pipelines for filtering, processing, and classifying large volumes of real-time data. The framework abstracts GPU and CPU parallelism and concurrency through an accessible programming model consisting of Python and C++ APIs. 

Leveraging this framework, developers can quickly construct arbitrary data pipelines composed of stages that acquire, mutate, or publish data for a downstream consumer. You can apply Morpheus in different contexts including malware detection, phishing/spear phishing detection, ransomware detection, and many others. Its flexibility and high performance are ideal for real-time network traffic analysis.

For the network monitoring use case, the NVIDIA Morpheus team recently integrated the DOCA framework to implement a high-speed, low-latency GPU packet acquisition source stage that feeds real-time packets to an AI pipeline responsible for analyzing packet contents. For more details, visit Morpheus on GitHub.

A diagram showing the point of connection between DOCA GPUNetIO and the Morpheus AI pipeline. If packets pass the filter, their information is stored in a GPU memory buffer; once enough packets accumulate, AI processing is triggered.
Figure 2. DOCA GPUNetIO and NVIDIA Morpheus AI pipeline are connected by a CUDA kernel that receives, filters, and analyzes incoming packets

As shown in Figure 2, GPU packet acquisition happens in real time. Through DOCA Flow, flow steering rules are applied to the Ethernet receive queues, meaning queues can only receive specific types of packets (TCP, for example). Morpheus launches a CUDA kernel, which performs the following steps in a loop:

  1. Receive packets using the DOCA GPUNetIO receive function
  2. Filter and analyze the packets in GPU memory in parallel
  3. Copy relevant packet information into a list of GPU memory buffers
  4. Set the related DOCA GPUNetIO semaphore item to READY once a buffer has accumulated enough packet information
  5. The CUDA kernel in front of the AI pipeline polls the semaphore item
  6. When the item is READY, the AI pipeline is unblocked, as the packet information is ready in the buffer (a simplified analogy of this handoff is sketched below)
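
The sketch below mirrors the handoff in steps 4-6 with ordinary Python threads and a queue standing in for the DOCA GPUNetIO semaphore and GPU buffer list. It is only a CPU-side analogy of the control flow; the real pattern runs as persistent CUDA kernels operating on GPU memory.

```python
import queue
import threading
import time

ready_buffers = queue.Queue()  # stands in for the semaphore plus buffer list

def receive_and_filter():
    """Producer: accumulate filtered packet info, then mark the buffer READY."""
    for batch_id in range(3):
        time.sleep(0.01)  # stand-in for packet accumulation on the wire
        packets = [f"pkt-{batch_id}-{i}" for i in range(4)]  # filtered packet info
        ready_buffers.put(packets)  # "set the semaphore item to READY"
    ready_buffers.put(None)  # shutdown signal

def ai_pipeline():
    """Consumer: poll for READY buffers and run the analysis stage."""
    while True:
        packets = ready_buffers.get()  # "poll the semaphore item"
        if packets is None:
            break
        print("AI stage processing", len(packets), "packets")

producer = threading.Thread(target=receive_and_filter)
consumer = threading.Thread(target=ai_pipeline)
producer.start(); consumer.start()
producer.join(); consumer.join()
```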

The GTC session Defensive Cyber Operations (DCO) on Edge Networks presents a concrete example in which this architecture is leveraged to deploy a high-performance, AI-enabled SPAN/Network TAP solution. This solution was motivated by the daunting data rates in information technology (IT) and operational technology (OT) networks, heterogeneity of Layer 7 application data, and edge computing size, weight, and power (SWaP) constraints. 

In the case of edge computing, many organizations are unable to “burst to cloud” when the compute demand increases, especially on disconnected edge networks. This scenario requires designing an architecture for I/O and compute challenges that deliver performance across the SWaP spectrum.

This DCO example addresses these constraints through the lens of a common cybersecurity problem, identifying leaked data (leaked passwords, secret keys, and PII, for example) in unencrypted TCP traffic, and represents an extension of the Morpheus SID demo. Identifying and remediating these vulnerabilities reduces the attack surface and increases the security posture of organizations.

In this example, the DCO solution receives packets into a heterogeneous Morpheus pipeline (GPU and concurrent CPU stages written in a mix of Python and C++) that applies a transformer model to detect leaked sensitive data in Layer 7 application data. It integrates outputs with the ELK stack, including an intuitive visualization for Security Operations Center (SOC) analysts to exploit (Figures 3 and 4).

Example Kibana dashboard showcasing results of DOCA GPUNetIO plus Morpheus Sensitive Information Detections including total detections of each type, a pairwise network map, and the distribution of packet sizes.
Figure 3. Kibana dashboard showcasing results of DOCA GPUNetIO plus Morpheus Sensitive Information Detections including total detections of each type, a pairwise network map, and the distribution of packet sizes
Screenshot of Kibana dashboard showcasing results of DOCA GPUNetIO plus Morpheus Sensitive Information Detections, including filtered and processed network packets indexed at up to 50 K packets per second and a table of payloads with leaked secret keys.
Figure 4. Kibana dashboard showcasing results of DOCA GPUNetIO plus Morpheus Sensitive Information Detections, including filtered and processed network packets indexed at up to 50 K packets per second and a table of payloads with leaked secret keys

The experimental setup included cloud-native UDP multicast and REST applications running on VMs with a 100 Gbps NVIDIA BlueField-2 DPU. These applications communicated through a SWaP-efficient NVIDIA Spectrum SN2100 Ethernet switch. Packet generators injected sensitive data into packets transmitted by these applications. Network packets were aggregated and mirrored off a SPAN port on the NVIDIA Spectrum SN2100 and sent to an NVIDIA A30X converged accelerator powering the Morpheus packet inspection pipeline achieving impressive throughput results.

  • This pipeline includes several components from I/O, packet filtering, packet processing, and indexing in a third-party SIEM platform (Elasticsearch). Focusing only on the I/O aspects, DOCA GPUNetIO enables Morpheus to receive packets into GPU memory at up to 100 Gbps with a single receive queue, removing a critical bottleneck in cyber packet processing applications.
  • Leveraging stage level concurrency, the pipeline demonstrated a 60% boost in Elasticsearch indexing throughput.
  • Running the end-to-end data pipeline on the NVIDIA A30X converged accelerator generated enriched packets at ~50% capacity of the Elasticsearch indexer. Using twice as many A30Xs would fully saturate the indexer, providing a convenient scaling heuristic.

This figure depicts the end-to-end accelerated sensitive information detection packet processing application. SOC teams who leverage other SIEM/SOAR solutions (such as Splunk) are empowered to exchange Morpheus sink stages in their implementations.
Figure 5. The end-to-end packet processing application accelerates the detection of sensitive information

While this use case demonstrates a specific application of Morpheus, it represents the foundational components for cyber packet processing applications. Morpheus plus DOCA GPUNetIO together provide the performance and extensibility for a huge number of latency-sensitive and compute-intensive packet processing applications.

Line-rate radar signal processing

This section walks through an example in which a radar detection application ingests downconverted I/Q samples from a simulated range-only radar system at a line rate of 100 Gbps, performing all the signal processing needed to convert the received I/Q RF samples into object detections in real time.

Remote sensing applications such as radar, lidar, and optical platforms rely on signal processing algorithms to turn the raw data collected from the environment they are measuring into actionable information. These algorithms are often highly parallelizable and require a high computational load, making them ideal for GPU-based processing.

Additionally, input sensors generate enormous quantities of raw data, meaning the ingress/egress capability of the processing solution must be able to handle very high bandwidths at low latencies. 

Further complicating the problem, many edge-based sensor systems have strict SWaP constraints, restricting the number and power of available CPU cores that might be used in other high-throughput networking approaches, such as DPDK-based GPUDirect RDMA.

DOCA GPUNetIO enables the GPU to directly handle the networking load as well as the signal processing required to make a real-time sensor streaming application successful.

Commonly used signal processing algorithms were used in the radar detection application. The flowchart in Figure 6 shows a graphical representation of the signal processing pipeline being used to convert the I/Q samples into detections.

Flowchart depicting the building blocks of a signal processing pipeline for computing detections from a reflected RF waveform in a range-only radar system. Stages are MTI filtering, pulse compression with FFT and inverse FFT, and finally CFAR detection.
Figure 6. The signal processing pipeline for computing detections from a reflected RF waveform in a range-only radar system

MTI filtering is a common technique used to eliminate stationary background clutter, such as the ground or buildings, from the reflected RF waveform in radar systems. The approach used here is known as the Three-Pulse Canceler, which is simply a convolution of the I/Q data in the pulse dimension with the filter coefficients ‘[+1, -2, +1].’

Pulse Compression maximizes the signal-to-noise ratio (SNR) of the received waveform with respect to the presence of targets. It is performed by computing the cross-correlation of the received RF data with the transmitted waveform.

Constant False Alarm Rate (CFAR) detectors compute an empirical estimate of the noise, localized to each range bin of the filtered data. The power of each bin is then compared to the noise and declared a detection if it is statistically likely given the noise estimate and distribution.
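
The three stages can be prototyped end to end in NumPy on synthetic data: a three-pulse canceler along the pulse dimension, pulse compression by cross-correlation with the transmitted waveform, and a cell-averaging CFAR over range bins. This is a toy sketch with made-up sizes, Doppler phase, and thresholds, not the GPU implementation described in this section.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy range-only radar data: (n_pulses, n_samples) of complex I/Q samples.
n_pulses, n_samples = 3, 512
tx_waveform = np.exp(1j * np.pi * np.linspace(0, 1, 64) ** 2)  # toy chirp
echo = 0.1 * (rng.standard_normal((n_pulses, n_samples))
              + 1j * rng.standard_normal((n_pulses, n_samples)))
for p in range(n_pulses):
    # Target return at bins 200-263 with a pulse-to-pulse phase so it survives MTI.
    echo[p, 200:264] += tx_waveform * np.exp(1j * np.pi / 2 * p)

# 1) MTI filtering: three-pulse canceler, i.e. a convolution with [+1, -2, +1]
#    along the pulse dimension (here it collapses the three pulses into one).
mti = echo[0] - 2 * echo[1] + echo[2]

# 2) Pulse compression: cross-correlate the filtered return with the
#    transmitted waveform to concentrate target energy at its range bin.
power = np.abs(np.correlate(mti, tx_waveform, mode="same")) ** 2

# 3) Cell-averaging CFAR: estimate local noise from neighboring range bins
#    and declare a detection when a bin exceeds that estimate by a margin.
guard, train, scale = 4, 16, 8.0
detections = []
for i in range(train + guard, len(power) - train - guard):
    neighbors = np.r_[power[i - guard - train:i - guard],
                      power[i + guard + 1:i + guard + train + 1]]
    if power[i] > scale * neighbors.mean():
        detections.append(i)
print("detected range bins:", detections)
```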

A 3D buffer of size (# Waveforms) x (# Channels) x (# Samples) is used to hold the organized RF data being received (note that applying the MTI filter upon packet receipt reduces the size of the pulse dimension to 1). No ordering is assumed for the incoming UDP data stream, except that it arrives roughly in order of ascending waveform ID. Around 500 complex samples are transmitted per packet, and each sample's location in the 3D buffer is determined by the waveform ID, channel ID, and sample index.

This application runs two CUDA kernels and one CPU core persistently. The first CUDA kernel is responsible for using the DOCA GPUNetIO API to read packets from the NIC into GPU memory. The second CUDA kernel places the packet data into the correct memory location based on metadata in the packet headers and applies the MTI filter. The CPU core is responsible for launching the CUDA kernels that handle pulse compression and CFAR. The FFTs were performed using the cuFFT library.

Figure 7 shows a graphical representation of the application.

Figure depicts the GPU-based signal processing pipeline with two CUDA kernels: one using DOCA GPUNetIO to receive packets in GPU memory and a second to analyze packets.
Figure 7. Graphical representation of how work is distributed for the GPU-based signal processing pipeline

The throughput of the radar detection pipeline is >100 Gbps. Running at the line rate of 100 Gbps for 1 million 16-channel waveforms, no packets were dropped and the signal processing never fell behind the throughput of the data stream. The latency, measured from when the last data packet for an independent waveform ID was received, was on the order of 3 milliseconds. An NVIDIA ConnectX-6 Dx SmartNIC and an NVIDIA A100 80 GB GPU were used. Data was sent through UDP packets over Ethernet.

Future work will evaluate the performance of this architecture when running exclusively on a BlueField DPU with an integrated GPU.

Real-time DSP services over GPU

Analog signals are everywhere, both artificial (Wi-Fi radio, for example) and natural (solar radiation and earthquakes, for example). To capture analog data digitally, sound waves must be converted using an analog-to-digital (A/D) converter, controlled by parameters such as sample rate and sample bit depth. Digital audio and video can be processed with the FFT, allowing sound designers to use tools such as an equalizer (EQ) to alter the general characteristics of the signal.

This example explains how NVIDIA products and SDKs were used to perform real-time audio DSP with a GPU over the network. To do so, the team built a client that parses a WAV file, frames the data into multiple Ethernet packets, and sends them over the network to a server application. This application is responsible for receiving the packets, applying the FFT, manipulating the audio signal, and finally sending back the modified data.

The client is responsible for deciding which portion of the signal is sent to the server for the signal processing chain, and for handling the processed samples when they are received back from the server. This approach supports multiple DSP algorithms, such as overlap-add, and various sample window selections.
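
A minimal version of the client's framing step might look like the following: read the WAV samples and split them into fixed-size chunks suitable for packet payloads. The file name, frame size, and 16-bit mono assumption are illustrative; the actual client also builds network headers and handles reassembly, which are omitted here.

```python
import wave

import numpy as np

FRAME_SAMPLES = 512  # samples per packet payload (an assumption for this sketch)

with wave.open("input.wav", "rb") as wav:  # placeholder file name
    sample_rate = wav.getframerate()
    raw = wav.readframes(wav.getnframes())

# Assumes 16-bit mono PCM; adjust dtype/channel handling for other formats.
samples = np.frombuffer(raw, dtype=np.int16)

# Pad so the stream divides evenly into frames, then split into packet payloads.
pad = (-len(samples)) % FRAME_SAMPLES
samples = np.pad(samples, (0, pad))
packets = samples.reshape(-1, FRAME_SAMPLES)
print(len(packets), "packets of", FRAME_SAMPLES, "samples at", sample_rate, "Hz")
```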

The server application uses DOCA GPUNetIO to receive packets in GPU memory from a CUDA kernel. When a subset of packets has been received, the CUDA kernel applies an FFT, through the cuFFTDx library, to each packet's payload in parallel. Also in parallel, a different CUDA thread applies a frequency filter to each packet, reducing the amplitude of low or high frequencies; in effect, a low-pass or high-pass filter.

Diagram depicting the client-server architecture where the client splits a WAV file into multiple Ethernet packets and sends them to the server. On the server, a CUDA kernel in a continuous loop receives those packets, applies frequency filters and then sends back the modified packets.
Figure 8. Client-server architecture built to demonstrate how to do real-time DSP services using a GPU over the network

An inverse FFT is applied to each packet. Through DOCA GPUNetIO, the CUDA kernel sends the modified packets back to the client. The client reorders and rebuilds the packets to recreate an audible, reproducible WAV audio file with the sound effects applied.
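
The per-packet processing chain (FFT, frequency-domain filtering, inverse FFT) can be sketched with NumPy on a toy signal. It stands in for the cuFFTDx-based GPU kernels only conceptually; the sample rate, packet size, and cutoff are arbitrary assumptions.

```python
import numpy as np

def lowpass_packet(payload: np.ndarray, sample_rate: int, cutoff_hz: float):
    """FFT a packet's audio samples, zero the bins above the cutoff, and
    inverse-FFT back to the time domain (a crude brick-wall low-pass filter)."""
    spectrum = np.fft.rfft(payload)
    freqs = np.fft.rfftfreq(len(payload), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(payload))

# Toy "packet": 512 samples of a 440 Hz tone plus an 8 kHz tone at 48 kHz.
sample_rate = 48_000
t = np.arange(512) / sample_rate
packet = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 8_000 * t)

filtered = lowpass_packet(packet, sample_rate, cutoff_hz=2_000)
print(filtered.shape)  # (512,) with the 8 kHz component suppressed
```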

Using the client, the team could tweak parameters to optimize performance and the quality of the audio output. It is possible to separate the flows and multiplex streams into their own processing chains, offloading many complex computations to the GPU. This only scratches the surface of this solution's potential, which could open new market opportunities for cloud DSP service providers.

Summary

The DOCA GPUNetIO library promotes a generic, GPU-centric approach to both packet acquisition and transmission in network applications performing real-time traffic analysis. This post demonstrates how the library can be adopted in a wide range of applications from different contexts, providing large improvements in latency, throughput, and system resource utilization.

To learn more about GPU packet processing and GPUNetIO, see the following resources:

Categories
Misc

Event: Jensen Huang NVIDIA Keynote at SIGGRAPH 2023

On Aug. 8, Jensen Huang will feature new NVIDIA technologies and award-winning research for content creation.

Categories
Offsites

Google at ICML 2023

Groups across Google actively pursue research in the field of machine learning (ML), ranging from theory to application. We build ML systems to solve deep scientific and engineering challenges in areas of language, music, visual processing, algorithm development, and more. We aim to build a more collaborative ecosystem with the broader ML research community through open-sourcing tools and datasets, publishing our work, and actively participating in conferences.

Google is proud to be a Diamond Sponsor of the 40th International Conference on Machine Learning (ICML 2023), a premier annual conference, which is being held this week in Honolulu, Hawaii. As a leader in ML research, Google has a strong presence at this year’s conference with over 120 accepted papers and active involvement in a number of workshops and tutorials. Google is also proud to be a Platinum Sponsor for both the LatinX in AI and Women in Machine Learning workshops. We look forward to sharing some of our extensive ML research and expanding our partnership with the broader ML research community.

Registered for ICML 2023? We hope you’ll visit the Google booth to learn more about the exciting work, creativity, and fun that goes into solving a portion of the field’s most interesting challenges. Visit the @GoogleAI Twitter account to find out about Google booth activities (e.g., demos and Q&A sessions). See Google DeepMind’s blog to learn about their technical participation at ICML 2023.

Take a look below to learn more about the Google research being presented at ICML 2023 (Google affiliations in bold).

Board and Organizing Committee

Board Members include: Corinna Cortes, Hugo Larochelle

Tutorial Chairs include: Hanie Sedghi

Google Research booth activities

Presenters: Bryan Perozzi, Anton Tsitsulin, Brandon Mayer

Title: Unsupervised Graph Embedding @ Google (paper, EXPO workshop)

Tuesday, July 25th at 10:30 AM HST

Presenters: Zheng Xu

Title: Federated Learning of Gboard Language Models with Differential Privacy (paper 1, paper 2, blog post)

Tuesday, July 25th at 3:30 PM HST

Presenters: Thomas Kipf

Title: Self-supervised scene understanding (paper 1, paper 2)

Wednesday, July 26th at 10:30 AM HST

Presenters: Johannes von Oswald, Max Vladymyrov

Title: Transformers learn in-context by gradient descent (paper)

Wednesday, July 26th at 3:30 PM HST

Accepted papers

Scaling Vision Transformers to 22 Billion Parameters (see blog post)

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, Neil Houlsby

Fast Inference from Transformers via Speculative Decoding

Yaniv Leviathan, Matan Kalman, Yossi Matias

Best of Both Worlds Policy Optimization

Christoph Dann, Chen-Yu Wei, Julian Zimmert

Inflow, Outflow, and Reciprocity in Machine Learning

Mukund Sundararajan, Walid Krichene

Transformers Learn In-Context by Gradient Descent

Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, Max Vladymyrov

Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models

Luke Vilnis, Yury Zemlyanskiy, Patrick Murray*, Alexandre Passos*, Sumit Sanghai

Differentially Private Hierarchical Clustering with Provable Approximation Guarantees (see blog post)

Jacob Imola*, Alessandro Epasto, Mohammad Mahdian, Vincent Cohen-Addad, Vahab Mirrokni

Multi-Epoch Matrix Factorization Mechanisms for Private Machine Learning

Christopher A. Choquette-Choo, H. Brendan McMahan, Keith Rush, Abhradeep Thakurta

Random Classification Noise Does Not Defeat All Convex Potential Boosters Irrespective of Model Choice

Yishay Mansour, Richard Nock, Robert Williamson

Simplex Random Features

Isaac Reid, Krzysztof Choromanski, Valerii Likhosherstov, Adrian Weller

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova

Mu2SLAM: Multitask, Multilingual Speech and Language Models

Yong Cheng, Yu Zhang, Melvin Johnson, Wolfgang Macherey, Ankur Bapna

Robust Budget Pacing with a Single Sample

Santiago Balseiro, Rachitesh Kumar*, Vahab Mirrokni, Balasubramanian Sivan, Di Wang

A Statistical Perspective on Retrieval-Based Models

Soumya Basu, Ankit Singh Rawat, Manzil Zaheer

Approximately Optimal Core Shapes for Tensor Decompositions

Mehrdad Ghadiri, Matthew Fahrbach, Gang Fu, Vahab Mirrokni

Efficient List-Decodable Regression Using Batches

Abhimanyu Das, Ayush Jain*, Weihao Kong, Rajat Sen

Efficient Training of Language Models Using Few-Shot Learning

Sashank J. Reddi, Sobhan Miryoosefi, Stefani Karp, Shankar Krishnan, Satyen Kale, Seungyeon Kim, Sanjiv Kumar

Fully Dynamic Submodular Maximization Over Matroids

Paul Duetting, Federico Fusco, Silvio Lattanzi, Ashkan Norouzi-Fard, Morteza Zadimoghaddam

GFlowNet-EM for Learning Compositional Latent Variable Models

Edward J Hu, Nikolay Malkin, Moksh Jain, Katie Everett, Alexandros Graikos, Yoshua Bengio

Improved Online Learning Algorithms for CTR Prediction in Ad Auctions

Zhe Feng, Christopher Liaw, Zixin Zhou

Large Language Models Struggle to Learn Long-Tail Knowledge

Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, Colin Raffel

Multi-channel Autobidding with Budget and ROI Constraints

Yuan Deng, Negin Golrezaei, Patrick Jaillet, Jason Cheuk Nam Liang, Vahab Mirrokni

Multi-layer Neural Networks as Trainable Ladders of Hilbert Spaces

Zhengdao Chen

On User-Level Private Convex Optimization

Badih Ghazi, Pritish Kamath, Ravi Kumar, Raghu Meka, Pasin Manurangsi, Chiyuan Zhang

PAC Generalization via Invariant Representations

Advait U Parulekar, Karthikeyan Shanmugam, Sanjay Shakkottai

Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice

Toshinori Kitamura, Tadashi Kozuno, Yunhao Tang, Nino Vieillard, Michal Valko, Wenhao Yang, Jincheng Mei, Pierre Menard, Mohammad Gheshlaghi Azar, Remi Munos, Olivier Pietquin, Matthieu Geist, Csaba Szepesvari, Wataru Kumagai, Yutaka Matsuo

Speeding Up Bellman Ford via Minimum Violation Permutations

Silvio Lattanzi, Ola Svensson, Sergei Vassilvitskii

Statistical Indistinguishability of Learning Algorithms

Alkis Kalavasis, Amin Karbasi, Shay Moran, Grigoris Velegkas

Test-Time Adaptation with Slot-Centric Models

Mihir Prabhudesai, Anirudh Goyal, Sujoy Paul, Sjoerd van Steenkiste, Mehdi S. M. Sajjadi, Gaurav Aggarwal, Thomas Kipf, Deepak Pathak, Katerina Fragkiadaki

Algorithms for Bounding Contribution for Histogram Estimation Under User-Level Privacy

Yuhan Liu*, Ananda Theertha Suresh, Wennan Zhu, Peter Kairouz, Marco Gruteser

Bandit Online Linear Optimization with Hints and Queries

Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, Manish Purohit

CLUTR: Curriculum Learning via Unsupervised Task Representation Learning

Abdus Salam Azad, Izzeddin Gur, Jasper Emhoff, Nathaniel Alexis, Aleksandra Faust, Pieter Abbeel, Ion Stoica

CSP: Self-Supervised Contrastive Spatial Pre-training for Geospatial-Visual Representations

Gengchen Mai, Ni Lao, Yutong He, Jiaming Song, Stefano Ermon

Ewald-Based Long-Range Message Passing for Molecular Graphs

Arthur Kosmala, Johannes Gasteiger, Nicholas Gao, Stephan Günnemann

Fast (1+ε)-Approximation Algorithms for Binary Matrix Factorization

Ameya Velingker, Maximilian Vötsch, David Woodruff, Samson Zhou

Federated Linear Contextual Bandits with User-Level Differential Privacy

Ruiquan Huang, Huanyu Zhang, Luca Melis, Milan Shen, Meisam Hejazinia, Jing Yang

Investigating the Role of Model-Based Learning in Exploration and Transfer

Jacob C Walker, Eszter Vértes, Yazhe Li, Gabriel Dulac-Arnold, Ankesh Anand, Theophane Weber, Jessica B Hamrick

Label Differential Privacy and Private Training Data Release

Robert Busa-Fekete, Andres Munoz, Umar Syed, Sergei Vassilvitskii

Lifelong Language Pretraining with Distribution-Specialized Experts

Wuyang Chen*, Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, Claire Cui

Multi-User Reinforcement Learning with Low Rank Rewards

Dheeraj Mysore Nagaraj, Suhas S Kowshik, Naman Agarwal, Praneeth Netrapalli, Prateek Jain

Multi-View Masked World Models for Visual Robotic Manipulation

Younggyo Seo, Junsu Kim, Stephen James, Kimin Lee, Jinwoo Shin, Pieter Abbeel

PaLM-E: An Embodied Multimodal Language Model (see blog post)

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence

Private Federated Learning with Autotuned Compression

Enayat Ullah*, Christopher A. Choquette-Choo, Peter Kairouz, Sewoong Oh

Refined Regret for Adversarial MDPs with Linear Function Approximation

Yan Dai, Haipeng Luo, Chen-Yu Wei, Julian Zimmert

Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory

Justin Cui, Ruoche Wan, Si Si, Cho-Jui Hsieh

SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance

Amit Attia, Tomer Koren

The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation

Mark Rowland, Yunhao Tang, Clare Lyle, Rémi Munos, Marc G. Bellemare, Will Dabney

Unveiling The Mask of Position-Information Pattern Through the Mist of Image Features

Chieh Hubert Lin, Hung-Yu Tseng, Hsin-Ying Lee, Maneesh Kumar Singh, Ming-Hsuan Yang

User-Level Private Stochastic Convex Optimization with Optimal Rates

Raef Bassily, Ziteng Sun

A Simple Zero-Shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models

James Urquhart Allingham*, Jie Ren, Michael W Dusenberry, Xiuye Gu, Yin Cui, Dustin Tran, Jeremiah Zhe Liu, Balaji Lakshminarayanan

Can Large Language Models Reason About Program Invariants?

Kexin Pei, David Bieber, Kensen Shi, Charles Sutton, Pengcheng Yin

Concurrent Shuffle Differential Privacy Under Continual Observation

Jay Tenenbaum, Haim Kaplan, Yishay Mansour, Uri Stemmer

Constant Matters: Fine-Grained Error Bound on Differentially Private Continual Observation

Hendrik Fichtenberger, Monika Henzinger, Jalaj Upadhyay

Cross-Entropy Loss Functions: Theoretical Analysis and Applications

Anqi Mao, Mehryar Mohri, Yutao Zhong

Efficient Rate Optimal Regret for Adversarial Contextual MDPs Using Online Function Approximation

Orin Levy, Alon Cohen, Asaf Cassel, Yishay Mansour

Fairness in Streaming Submodular Maximization Over a Matroid Constraint

Marwa El Halabi, Federico Fusco, Ashkan Norouzi-Fard, Jakab Tardos, Jakub Tarnawski

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning (see blog post)

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, Adam Roberts

Graph Reinforcement Learning for Network Control via Bi-level Optimization

Daniele Gammelli, James Harrison, Kaidi Yang, Marco Pavone, Filipe Rodrigues, Francisco C. Pereira

Learning-Augmented Private Algorithms for Multiple Quantile Release

Mikhail Khodak*, Kareem Amin, Travis Dick, Sergei Vassilvitskii

LegendreTron: Uprising Proper Multiclass Loss Learning

Kevin H Lam, Christian Walder, Spiridon Penev, Richard Nock

Measuring the Impact of Programming Language Distribution

Gabriel Orlanski*, Kefan Xiao, Xavier Garcia, Jeffrey Hui, Joshua Howland, Jonathan Malmaud, Jacob Austin, Rishabh Singh, Michele Catasta*

Multi-task Differential Privacy Under Distribution Skew

Walid Krichene, Prateek Jain, Shuang Song, Mukund Sundararajan, Abhradeep Thakurta, Li Zhang

Muse: Text-to-Image Generation via Masked Generative Transformers

Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, José Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, Dilip Krishnan

On the Convergence of Federated Averaging with Cyclic Client Participation

Yae Jee Cho, Pranay Sharma, Gauri Joshi, Zheng Xu, Satyen Kale, Tong Zhang

Optimal Stochastic Non-smooth Non-convex Optimization Through Online-to-Non-convex Conversion

Ashok Cutkosky, Harsh Mehta, Francesco Orabona

Out-of-Domain Robustness via Targeted Augmentations

Irena Gao, Shiori Sagawa, Pang Wei Koh, Tatsunori Hashimoto, Percy Liang

Polynomial Time and Private Learning of Unbounded Gaussian Mixture Models

Jamil Arbas, Hassan Ashtiani, Christopher Liaw

Pre-computed Memory or On-the-Fly Encoding? A Hybrid Approach to Retrieval Augmentation Makes the Most of Your Compute

Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Joshua Ainslie, Sumit Sanghai, Fei Sha, William W. Cohen

Scalable Adaptive Computation for Iterative Generation

Allan Jabri*, David J. Fleet, Ting Chen

Scaling Spherical CNNs

Carlos Esteves, Jean-Jacques Slotine, Ameesh Makadia

STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition

Yucheng Lu, Shivani Agrawal, Suvinay Subramanian, Oleg Rybakov, Christopher De Sa, Amir Yazdanbakhsh

Stratified Adversarial Robustness with Rejection

Jiefeng Chen, Jayaram Raghuram, Jihye Choi, Xi Wu, Yingyu Liang, Somesh Jha

When Does Privileged Information Explain Away Label Noise?

Guillermo Ortiz-Jimenez*, Mark Collier, Anant Nawalgaria, Alexander D’Amour, Jesse Berent, Rodolphe Jenatton, Effrosyni Kokiopoulou

Adaptive Computation with Elastic Input Sequence

Fuzhao Xue*, Valerii Likhosherstov, Anurag Arnab, Neil Houlsby, Mostafa Dehghani, Yang You

Can Neural Network Memorization Be Localized?

Pratyush Maini, Michael C. Mozer, Hanie Sedghi, Zachary C. Lipton, J. Zico Kolter, Chiyuan Zhang

Controllability-Aware Unsupervised Skill Discovery

Seohong Park, Kimin Lee, Youngwoon Lee, Pieter Abbeel

Efficient Learning of Mesh-Based Physical Simulation with Bi-Stride Multi-Scale Graph Neural Network

Yadi Cao, Menglei Chai, Minchen Li, Chenfanfu Jiang

Federated Heavy Hitter Recovery Under Linear Sketching

Adria Gascon, Peter Kairouz, Ziteng Sun, Ananda Theertha Suresh

Graph Generative Model for Benchmarking Graph Neural Networks

Minji Yoon, Yue Wu, John Palowitch, Bryan Perozzi, Russ Salakhutdinov

H-Consistency Bounds for Pairwise Misranking Loss Surrogates

Anqi Mao, Mehryar Mohri, Yutao Zhong

Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation

Uri Sherman, Tomer Koren, Yishay Mansour

Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames

Ondrej Biza*, Sjoerd van Steenkiste, Mehdi S. M. Sajjadi, Gamaleldin Fathy Elsayed, Aravindh Mahendran, Thomas Kipf

Multi-task Off-Policy Learning from Bandit Feedback

Joey Hong, Branislav Kveton, Manzil Zaheer, Sumeet Katariya, Mohammad Ghavamzadeh

Optimal No-Regret Learning for One-Sided Lipschitz Functions

Paul Duetting, Guru Guruganesh, Jon Schneider, Joshua Ruizhi Wang

Policy Mirror Ascent for Efficient and Independent Learning in Mean Field Games

Batuhan Yardim, Semih Cayci, Matthieu Geist, Niao He

Regret Minimization and Convergence to Equilibria in General-Sum Markov Games

Liad Erez, Tal Lancewicki, Uri Sherman, Tomer Koren, Yishay Mansour

Reinforcement Learning Can Be More Efficient with Multiple Rewards

Christoph Dann, Yishay Mansour, Mehryar Mohri

Reinforcement Learning with History-Dependent Dynamic Contexts

Guy Tennenholtz, Nadav Merlis, Lior Shani, Martin Mladenov, Craig Boutilier

User-Defined Event Sampling and Uncertainty Quantification in Diffusion Models for Physical Dynamical Systems

Marc Anton Finzi*, Anudhyan Boral, Andrew Gordon Wilson, Fei Sha, Leonardo Zepeda-Nunez

Discrete Key-Value Bottleneck

Frederik Träuble, Anirudh Goyal, Nasim Rahaman, Michael Curtis Mozer, Kenji Kawaguchi, Yoshua Bengio, Bernhard Schölkopf

DSGD-CECA: Decentralized SGD with Communication-Optimal Exact Consensus Algorithm

Lisang Ding, Kexin Jin, Bicheng Ying, Kun Yuan, Wotao Yin

Exphormer: Sparse Transformers for Graphs

Hamed Shirzad, Ameya Velingker, Balaji Venkatachalam, Danica J. Sutherland, Ali Kemal Sinop

Fast, Differentiable and Sparse Top-k: A Convex Analysis Perspective

Michael Eli Sander*, Joan Puigcerver, Josip Djolonga, Gabriel Peyré, Mathieu Blondel

Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation

Aditya Mate, Bryan Wilder, Aparna Taneja, Milind Tambe

In Search for a Generalizable Method for Source Free Domain Adaptation

Malik Boudiaf*, Tom Denton, Bart van Merrienboer, Vincent Dumoulin, Eleni Triantafillou

Learning Rate Schedules in the Presence of Distribution Shift

Matthew Fahrbach, Adel Javanmard, Vahab Mirrokni, Pratik Worah

Not All Semantics Are Created Equal: Contrastive Self-Supervised Learning with Automatic Temperature Individualization

Zi-Hao Qiu, Quanqi Hu, Zhuoning Yuan, Denny Zhou, Lijun Zhang, Tianbao Yang

On the Relationship Between Explanation and Prediction: A Causal View

Amir-Hossein Karimi*, Krikamol Muandet, Simon Kornblith, Bernhard Schölkopf, Been Kim

On the Role of Attention in Prompt-Tuning

Samet Oymak, Ankit Singh Rawat, Mahdi Soltanolkotabi, Christos Thrampoulidis

PLay: Parametrically Conditioned Layout Generation Using Latent Diffusion

Chin-Yi Cheng, Forrest Huang, Gang Li, Yang Li

The Power of Learned Locally Linear Models for Nonlinear Policy Optimization

Daniel Pfrommer, Max Simchowitz, Tyler Westenbroek, Nikolai Matni, Stephen Tu

Relevant Walk Search for Explaining Graph Neural Networks

Ping Xiong, Thomas Schnake, Michael Gastegger, Grégoire Montavon, Klaus-Robert Müller, Shinichi Nakajima

Repository-Level Prompt Generation for Large Language Models of Code

Disha Shrivastava, Hugo Larochelle, Daniel Tarlow

Robust and Private Stochastic Linear Bandits

Vasileios Charisopoulos*, Hossein Esfandiari, Vahab Mirrokni

Simple Diffusion: End-to-End Diffusion for High Resolution Images

Emiel Hoogeboom, Jonathan Heek, Tim Salimans

Tied-Augment: Controlling Representation Similarity Improves Data Augmentation

Emirhan Kurtulus, Zichao Li, Yann Dauphin, Ekin D. Cubuk

Why Is Public Pre-Training Necessary for Private Model Training?

Arun Ganesh, Mahdi Haghifam*, Milad Nasr, Sewoong Oh, Thomas Steinke, Om Thakkar, Abhradeep Guha Thakurta, Lun Wang

A Connection Between One-Step RL and Critic Regularization in Reinforcement Learning

Benjamin Eysenbach, Matthieu Geist, Sergey Levine, Ruslan Salakhutdinov

Beyond Uniform Lipschitz Condition in Differentially Private Optimization

Rudrajit Das*, Satyen Kale, Zheng Xu, Tong Zhang, Sujay Sanghavi

Efficient Graph Field Integrators Meet Point Clouds

Krzysztof Choromanski, Arijit Sehanobish, Han Lin, Yunfan Zhao, Eli Berger, Tetiana Parshakova, Alvin Pan, David Watkins, Tianyi Zhang, Valerii Likhosherstov, Somnath Basu Roy Chowdhury, Avinava Dubey, Deepali Jain, Tamas Sarlos, Snigdha Chaturvedi, Adrian Weller

Fast as CHITA: Neural Network Pruning with Combinatorial Optimization

Riade Benbaki, Wenyu Chen, Xiang Meng, Hussein Hazimeh, Natalia Ponomareva, Zhe Zhao, Rahul Mazumder

Jump-Start Reinforcement Learning (see blog post)

Ikechukwu Uchendu*, Ted Xiao, Yao Lu, Banghua Zhu, Mengyuan Yan, Joséphine Simon, Matthew Bennice, Chuyuan Fu, Cong Ma, Jiantao Jiao, Sergey Levine, Karol Hausman

Learning in POMDPs is Sample-Efficient with Hindsight Observability

Jonathan Lee, Alekh Agarwal, Christoph Dann, Tong Zhang

Low-Variance Gradient Estimation in Unrolled Computation Graphs with ES-Single

Paul Vicol

Masked Trajectory Models for Prediction, Representation, and Control

Philipp Wu, Arjun Majumdar, Kevin Stone, Yixin Lin, Igor Mordatch, Pieter Abbeel, Aravind Rajeswaran

Overcoming Simplicity Bias in Deep Networks Using a Feature Sieve

Rishabh Tiwari, Pradeep Shenoy

Pairwise Ranking Losses of Click-Through Rates Prediction for Welfare Maximization in Ad Auctions

Boxiang Lyu, Zhe Feng, Zachary Robertson, Sanmi Koyejo

Predictive Flows for Faster Ford-Fulkerson

Sami Davies, Benjamin Moseley, Sergei Vassilvitskii, Yuyan Wang

Scaling Laws for Multilingual Neural Machine Translation

Patrick Fernandes, Behrooz Ghorbani, Xavier Garcia, Markus Freitag, Orhan Firat

Sequential Monte Carlo Learning for Time Series Structure Discovery

Feras Saad, Brian Patton, Matthew Douglas Hoffman, Rif A. Saurous, Vikash Mansinghka

Stochastic Gradient Succeeds for Bandits

Jincheng Mei, Zixin Zhong, Bo Dai, Alekh Agarwal, Csaba Szepesvari, Dale Schuurmans

Subset-Based Instance Optimality in Private Estimation

Travis Dick, Alex Kulesza, Ziteng Sun, Ananda Theertha Suresh

The Unreasonable Effectiveness of Few-Shot Learning for Machine Translation

Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Melvin Johnson, Orhan Firat

Tutorials

Self-Supervised Learning in Vision: from Research Advances to Best Practices

Xinlei Chen, Ishan Misra, Randall Balestriero, Mathilde Caron, Christoph Feichtenhofer, Mark Ibrahim

How to DP-fy ML: A Practical Tutorial to Machine Learning with Differential Privacy (see blog post)

Sergei Vassilvitskii, Natalia Ponomareva, Zheng Xu

Recent Advances in the Generalization Theory of Neural Networks

Tengyu Ma, Alex Damian

EXPO Day workshops

Graph Neural Networks in Tensorflow: A Practical Guide

Workshop Organizers include: Bryan Perozzi, Anton Tsitsulin, Brandon Mayer, Jonathan Halcrow

Google sponsored affinity workshops

LatinX in AI (LAXAI)

Platinum Sponsor

Keynote Speaker: Monica Ribero

Panelist: Yao Qin

Women in Machine Learning (WiML)

Platinum Sponsor

Panelists: Yao Qin

Workshops

Federated Learning and Analytics in Practice: Algorithms, Systems, Applications, and Opportunities

Organizer: Peter Kairouz, Zheng Xu

Speaker: Brendan McMahan

Interpretable Machine Learning in Healthcare (IMLH)

Organizer: Ramin Zabih

Knowledge and Logical Reasoning in the Era of Data-Driven Learning

Organizer: Beliz Günel

The Many Facets of Preference-Based Learning (MFPL)

Organizer: Robert Busa-Fekete, Mohammad Ghavamzadeh

The Synergy of Scientific and Machine Learning Modelling (SynS & ML)

Speaker: Sercan Arik

Theory of Mind in Communicating Agents

Organizer: Pei Zhou

Artificial Intelligence & Human Computer Interaction

Organizer: Yang Li, Forrest Huang

Data-Centric Machine Learning Research (DMLR)

Organizer: Alicia Parrish, Najoung Kim

Speaker: Peter Mattson

Neural Compression: from Information Theory to Applications

Speaker: Johannes Ballé

Panelist: George Toderici

Neural Conversational AI Workshop – What’s Left to TEACH (Trustworthy, Enhanced, Adaptable, Capable and Human-centric) Chatbots?

Organizer: Ahmad Beirami

Spurious Correlations, Invariance and Stability (SCIS)

Organizer: Amir Feder


* Work done while at Google

Categories
Offsites

Using societal context knowledge to foster the responsible application of AI

AI-related products and technologies are constructed and deployed in a societal context: that is, a dynamic and complex collection of social, cultural, historical, political and economic circumstances. Because societal contexts by nature are dynamic, complex, non-linear, contested, subjective, and highly qualitative, they are challenging to translate into the quantitative representations, methods, and practices that dominate standard machine learning (ML) approaches and responsible AI product development practices.

The first phase of AI product development is problem understanding, and this phase has tremendous influence over how problems (e.g., increasing cancer screening availability and accuracy) are formulated for ML systems to solve as well many other downstream decisions, such as dataset and ML architecture choice. When the societal context in which a product will operate is not articulated well enough to result in robust problem understanding, the resulting ML solutions can be fragile and even propagate unfair biases.

When AI product developers lack access to the knowledge and tools necessary to effectively understand and consider societal context during development, they tend to abstract it away. This abstraction leaves them with a shallow, quantitative understanding of the problems they seek to solve, while product users and society stakeholders — who are proximate to these problems and embedded in related societal contexts — tend to have a deep qualitative understanding of those same problems. This qualitative–quantitative divergence in ways of understanding complex problems that separates product users and society from developers is what we call the problem understanding chasm.

This chasm has repercussions in the real world: for example, it was the root cause of the racial bias discovered in a widely used healthcare algorithm intended to select patients with the most complex healthcare needs for special programs. Incomplete understanding of the societal context in which the algorithm would operate led system designers to form incorrect and oversimplified causal theories about the key problem factors. Critical socio-structural factors, including lack of access to healthcare, lack of trust in the healthcare system, and underdiagnosis due to human bias, were left out, while healthcare spending was highlighted as the predictor of complex health needs.

To bridge the problem understanding chasm responsibly, AI product developers need tools that put community-validated and structured knowledge of societal context about complex societal problems at their fingertips — starting with problem understanding, but also throughout the product development lifecycle. To that end, Societal Context Understanding Tools and Solutions (SCOUTS) — part of the Responsible AI and Human-Centered Technology (RAI-HCT) team within Google Research — is a dedicated research team focused on the mission to “empower people with the scalable, trustworthy societal context knowledge required to realize responsible, robust AI and solve the world’s most complex societal problems.” SCOUTS is motivated by the significant challenge of articulating societal context, and it conducts innovative foundational and applied research to produce structured societal context knowledge and to integrate it into all phases of the AI-related product development lifecycle. Last year we announced that Jigsaw, Google’s incubator for building technology that explores solutions to threats to open societies, leveraged our structured societal context knowledge approach during the data preparation and evaluation phases of model development to scale bias mitigation for their widely used Perspective API toxicity classifier. Going forward SCOUTS’ research agenda focuses on the problem understanding phase of AI-related product development with the goal of bridging the problem understanding chasm.

Bridging the AI problem understanding chasm

Bridging the AI problem understanding chasm requires two key ingredients: 1) a reference frame for organizing structured societal context knowledge and 2) participatory, non-extractive methods to elicit community expertise about complex problems and represent it as structured knowledge. SCOUTS has published innovative research in both areas.

An illustration of the problem understanding chasm.

A societal context reference frame

An essential ingredient for producing structured knowledge is a taxonomy for creating the structure to organize it. SCOUTS collaborated with other RAI-HCT teams (TasC, Impact Lab), Google DeepMind, and external system dynamics experts to develop a taxonomic reference frame for societal context. To contend with the complex, dynamic, and adaptive nature of societal context, we leverage complex adaptive systems (CAS) theory to propose a high-level taxonomic model for organizing societal context knowledge. The model pinpoints three key elements of societal context and the dynamic feedback loops that bind them together: agents, precepts, and artifacts.

  • Agents: These can be individuals or institutions.
  • Precepts: The preconceptions — including beliefs, values, stereotypes and biases — that constrain and drive the behavior of agents. An example of a basic precept is that “all basketball players are over 6 feet tall.” That limiting assumption can lead to failures in identifying basketball players of smaller stature.
  • Artifacts: Agent behaviors produce many kinds of artifacts, including language, data, technologies, societal problems and products.

The relationships between these entities are dynamic and complex. Our work hypothesizes that precepts are the most critical element of societal context. In particular, the problems people perceive and the causal theories they hold about why those problems exist are especially influential precepts that are core to understanding societal context. For example, in the case of racial bias in the medical algorithm described earlier, the causal theory precept held by the designers was that complex health problems would cause healthcare expenditures to go up for all populations. That incorrect precept directly led to the choice of healthcare spending as the proxy variable for the model to predict complex healthcare need. In turn, the model was biased against Black patients who, on average, do not spend more on healthcare when they have complex healthcare needs, due to societal factors such as lack of access to healthcare and underdiagnosis driven by bias. A key open question remains: how can we ethically and equitably elicit causal theories from the people and communities most proximate to problems of inequity, and transform them into useful structured knowledge?

Illustrative version of societal context reference frame.
Taxonomic version of societal context reference frame.

Working with communities to foster the responsible application of AI to healthcare

Since its inception, SCOUTS has worked to build capacity in historically marginalized communities to articulate the broader societal context of the complex problems that matter to them using a practice called community based system dynamics (CBSD). System dynamics (SD) is a methodology for articulating causal theories about complex problems, both qualitatively as causal loop and stock and flow diagrams (CLDs and SFDs, respectively) and quantitatively as simulation models. The inherent support of visual qualitative tools, quantitative methods, and collaborative model building makes it an ideal ingredient for bridging the problem understanding chasm. CBSD is a community-based, participatory variant of SD specifically focused on building capacity within communities to collaboratively describe and model the problems they face as causal theories, directly without intermediaries. With CBSD we’ve witnessed community groups learn the basics and begin drawing CLDs within 2 hours.

Data 4 Black Lives community members learning system dynamics.

There is a huge potential for AI to improve medical diagnosis. But the safety, equity, and reliability of AI-related health diagnostic algorithms depends on diverse and balanced training datasets. An open challenge in the health diagnostic space is the dearth of training sample data from historically marginalized groups. SCOUTS collaborated with the Data 4 Black Lives community and CBSD experts to produce qualitative and quantitative causal theories for the data gap problem. The theories include critical factors that make up the broader societal context surrounding health diagnostics, including cultural memory of death and trust in medical care.

The figure below depicts the causal theory generated during the collaboration described above as a CLD. It hypothesizes that trust in medical care influences all parts of this complex system and is the key lever for increasing screening, which in turn generates data to overcome the data diversity gap.

Causal loop diagram of the health diagnostics data gap

These community-sourced causal theories are a first step to bridge the problem understanding chasm with trustworthy societal context knowledge.

Conclusion

As discussed in this blog, the problem understanding chasm is a critical open challenge in responsible AI. SCOUTS conducts exploratory and applied research in collaboration with other teams within Google Research and with external community and academic partners across multiple disciplines to make meaningful progress toward solving it. Going forward, our work will focus on three key elements, guided by our AI Principles:

  1. Increase awareness and understanding of the problem understanding chasm and its implications through talks, publications, and training.
  2. Conduct foundational and applied research for representing and integrating societal context knowledge into AI product development tools and workflows, from conception to monitoring, evaluation and adaptation.
  3. Apply community-based causal modeling methods to the AI health equity domain to realize impact and build society’s and Google’s capability to produce and leverage global-scale societal context knowledge to realize responsible AI.

SCOUTS flywheel for bridging the problem understanding chasm.

Acknowledgments

Thank you to John Guilyard for graphics development, everyone in SCOUTS, and all of our collaborators and sponsors.

Categories
Misc

A Comprehensive Guide on Interaction Terms in Time Series Forecasting

Modeling time series data can be challenging (and fascinating) due to its inherent complexity and unpredictability. For example, long-term trends in a time series can change drastically because of certain events. Recall the beginning of the global pandemic, when businesses such as airlines or brick-and-mortar shops saw a quick decline in the number of customers and sales. In contrast, e-commerce businesses continued to operate with less disruption.

Interaction terms can help model such patterns. They capture complex relationships between variables and, as a result, lead to more accurate predictions.

This post explores:

  • Interaction terms in the context of time series forecasting
  • Benefits of interaction terms when modeling complex relationships
  • How to effectively implement interaction terms in your models 

Overview of interaction terms

Interaction terms enable you to investigate whether the relationship between the target and a feature changes depending on the value of another feature. For more details, see my previous post, A Comprehensive Guide to Interaction Terms in Linear Regression.

Figure 1 shows a scatterplot that represents the relationship between miles per gallon (target) and the weight of a vehicle (feature). The relationship is quite different depending on the transmission type (another feature).

Line plot showing the best fit lines for vehicle transmission types. They clearly have different slopes.
Figure 1. Best fit lines for vehicle transmission type, including interaction terms

Improving linear model accuracy

Without interaction terms, a linear model would not be able to capture such a complex relationship. Effectively, it would assign the same coefficient to the weight feature, regardless of the transmission type. Figure 1 shows that the coefficients (the slopes of the lines) for the weight feature are drastically different for the two transmission types.

To overcome this limitation and make the linear model more flexible, add interaction terms. In general, an interaction term is a multiplication of the original features. By adding these new variables to the regression model, you can measure the effect of the interaction between the features on the target.
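
To make this concrete, here is a minimal sketch of an interaction term in the setting of Figure 1. The column names and values below are made up for illustration; they are not the dataset behind Figure 1.

import pandas as pd
from sklearn.linear_model import LinearRegression

# made-up data mirroring Figure 1: weight, transmission type (1 = manual), and mpg
cars = pd.DataFrame({
    "wt":  [2.3, 3.2, 2.8, 3.9, 1.9, 3.4],
    "am":  [1, 0, 1, 0, 1, 0],
    "mpg": [27.0, 19.5, 24.0, 15.8, 30.1, 17.2],
})

# the interaction term is simply the product of the two original features
cars["wt_x_am"] = cars["wt"] * cars["am"]

X = cars[["wt", "am", "wt_x_am"]]
y = cars["mpg"]

lm = LinearRegression().fit(X, y)

# the coefficient of "wt" is the slope for automatic cars (am == 0);
# the coefficient of "wt_x_am" is how much that slope changes for manual cars
print(dict(zip(X.columns, lm.coef_.round(3))))

With this single extra column, the model can fit a different slope per transmission type, which is exactly the behavior shown in Figure 1.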

Interaction terms in time series forecasting

Interaction terms make linear models more flexible. The following example shows how they work in the context of time series forecasting.

Prerequisites

First, load the required libraries:

import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression

import seaborn as sns
import matplotlib.pyplot as plt

Dataset generation

Then, generate some artificial time series data with the following characteristics:

  • 10 years of daily data
  • Repeating patterns (seasonality) present in the time series
  • A decreasing trend over the first 7 years
  • No trend in the last 3 years
  • Random noise, added as the last step

# for reproducibility
np.random.seed(42)

# generate the DataFrame with dates
range_of_dates = pd.date_range(
    start="2010-01-01",
    end="2019-12-30"
)
df = pd.DataFrame(index=range_of_dates)

# create a sequence of day numbers
df["linear_trend"] = range(len(df))

# decreasing trend over the first 7 years, then a flat (no-trend) segment
df["trend"] = 0.004 * df["linear_trend"].values[::-1]
df.loc["2017-01-01":, "trend"] = 4

# generate the components of the target: yearly seasonality and random noise
signal_1 = 10 + 4 * np.sin(df["linear_trend"] / 365 * 2 * np.pi)
noise = np.random.normal(0, 0.85, len(df))

# combine them to get the target series
df["target"] = signal_1 + noise + df["trend"]

# plot
df["target"].plot(title="Generated time series");

Figure 2 shows the generated time series, which includes all the desired characteristics.

A time series plot including all the desired characteristics, with a slight downward trend over time and containing multiple increases and decreases
Figure 2. The generated time series

Training the benchmark model

Now train a linear model and inspect the best fit line. For this step, create very simple models with a few features. This enables you to visually inspect the impact of the interaction term on the model’s fit.

The simplest model possible contains one feature — an indicator of the passage of time. The linear_trend column created for the time series is effectively the row number of the DataFrame (ordered by date).

X = df[["linear_trend"]]
y = df[["target"]]

lm = LinearRegression()
lm.fit(X, y)

df["model_1"] = lm.predict(X)

df[["target", "model_1"]].plot(title="Linear trend");

Note that the point here is not to properly evaluate the forecasts with separate train and test sets, but rather to explain the impact of the interaction terms on the model’s fit. It is easier to observe that impact by inspecting the fitted values (predictions on the training set) and comparing them to the original time series.

Figure 3 shows that the linear model identified a decreasing trend for the entire time series. At the same time, the fit seems off for the last 3 years of data, as there is no trend there.

The plot shows that the fitted line is the same for the entire dataset, thus not capturing the pattern change in the last 3 years.
Figure 3. Best fit line obtained from a linear model using linear trend as a feature

Add a breakpoint

Next, try to make the model learn the new pattern (trend change) using feature engineering. To do so, create a breakpoint, which is a placeholder variable indicating whether a given observation is after January 1, 2017. In this case, the exact point in time when the trend change happened is known. 

Next, train another linear model, this time with two features:

df["after_2017_breakpoint"] = np.where(df.index >= pd.Timestamp('2017-01-01'), 1, 0)

X = df[["linear_trend", "after_2017_breakpoint"]]
y = df[["target"]]

lm = LinearRegression()
lm.fit(X, y)

df["model_2"] = lm.predict(X)

df[["target", "model_2"]].plot(title="Linear trend + breakpoint");

After introducing the breakpoint, there was a vertical jump in the fitted line, but the slope before/after is the same.
Figure 4. Best fit line obtained from a linear model using linear trend and a breakpoint as features

Figure 4 shows a few important changes, as listed below:

  • The fitted line displays a vertical jump, which corresponds to the coefficient by the new Boolean feature.
  • The vertical jump occurs exactly on the first date when the feature becomes active (a value of 1 instead of 0).
  • The slope of the line is the same before and after the introduced breakpoint.
  • The model is trying to compensate for the incorrect slope by adding a fixed amount to the predictions after the breakpoint.

There is no trend in the last 3 years of data, so ideally the line should be close to flat after January 1, 2017. 

Adding an interaction term

To change the slope after the breakpoint, add a more complex dependency on the timestamp (represented by the linear trend). That is exactly what an interaction term does: it is a multiplication of the linear trend and the placeholder variable.

df["interaction_term"] = df["after_2017_breakpoint"] * df["linear_trend"]

X = df[["linear_trend", "after_2017_breakpoint", "interaction_term"]]
y = df[["target"]]

lm = LinearRegression()
lm.fit(X, y)

df["model_3"] = lm.predict(X)

df[["target", "model_3"]].plot(title="Linear trend + breakpoint + interaction term");

After introducing the interaction term, the slope clearly changes after the breakpoint. It is not as steep anymore.
Figure 5. Best fit line obtained from a linear model using a linear trend, a breakpoint, and an interaction term as features

Figure 5 shows the impact of having the interaction term in the model. Compared to Figure 4, the slope of the best fit line is different after the breakpoint. 

To be more precise, the difference in slope is exactly the value of the coefficient of the interaction term. While the new line did not flatten out completely, it is less steep than in the earlier part of the time series.
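
One way to verify this, not shown in the original snippets, is to print the fitted coefficients of the third model; treat the exact numbers as illustrative:

# X and lm here are from the third model (trend + breakpoint + interaction term);
# y was a DataFrame, so coef_ is 2-D and needs to be flattened
for name, coef in zip(X.columns, lm.coef_.ravel()):
    print(f"{name}: {coef:.4f}")

# before the breakpoint, the slope of the fitted line is the "linear_trend" coefficient;
# after it, the slope is the sum of the "linear_trend" and "interaction_term" coefficients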

Introducing the breakpoint together with an interaction term increased the model’s ability to capture the trend of the time series. In turn, that should increase the predictive performance of the model.

Summary

Using interaction terms can make the specification of a linear model more flexible (different slopes for different lines), which can result in a better fit to the data and better predictive performance. You can add interaction terms as a multiplication of the original features. In the context of time series, you can use interaction terms to better capture any changes to the trend.

Find the code used in this post in A Comprehensive Guide on Interaction Terms in Time Series Forecasting on GitHub. Additionally, the code in the notebook shows how to leverage cuDF and cuML to train your models using GPU acceleration. As always, feedback is welcome. You can reach out to me on Twitter or in the comments.
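
For reference, the GPU-accelerated path can look roughly like the sketch below. This assumes a working RAPIDS installation with cuDF and cuML; the linked notebook contains the authoritative version.

import cudf
from cuml.linear_model import LinearRegression as cuLinearRegression

# copy the pandas DataFrame (with the engineered features) to GPU memory
gdf = cudf.DataFrame.from_pandas(df)

X_gpu = gdf[["linear_trend", "after_2017_breakpoint", "interaction_term"]]
y_gpu = gdf["target"]

# same workflow as before, but the fit and predict steps run on the GPU
gpu_lm = cuLinearRegression()
gpu_lm.fit(X_gpu, y_gpu)
gdf["model_3_gpu"] = gpu_lm.predict(X_gpu)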