Categories
Misc

Programming the Quantum-Classical Supercomputer

Heterogeneous computing architectures—those that incorporate a variety of processor types working in tandem—have proven extremely valuable in the continued scalability of computational workloads in AI, machine learning (ML), quantum physics, and general data science. 

Critical to this development has been the ability to abstract away the heterogeneous architecture and promote a framework that makes designing and implementing such applications more efficient. The best-known programming model that accomplishes this is the CUDA Toolkit, which enables offloading work to thousands of GPU cores in parallel following a single-instruction, multiple-data model.

Recently, a new form of node-level coprocessor technology has been attracting the attention of the computational science community: the quantum computer, which relies on the non-intuitive laws of quantum physics to process information using principles such as superposition, entanglement, and interference. This unique accelerator technology may prove useful in very specific applications and is poised to work in tandem with CPUs and GPUs, ushering in an era of computational advances previously deemed unfeasible. 

The question then becomes: If you enhance an existing classically heterogeneous compute architecture with quantum coprocessors, how would you program it in a manner fit for computational scalability?

NVIDIA is answering this question with CUDA Quantum, an open-source programming model extending both C++ and Python with quantum kernels intended for compilation and execution on quantum hardware. 

This post introduces CUDA Quantum, highlights its unique features, and demonstrates how researchers can leverage it to gather momentum in day-to-day quantum algorithmic research and development. 

CUDA Quantum: Hello quantum world 

To get a first look at the CUDA Quantum programming model, create a two-qubit GHZ state (a Bell state) with the Pythonic interface. This will familiarize you with its syntax.

import cudaq

# Create the CUDA Quantum Kernel
kernel = cudaq.make_kernel()

# Allocate 2 qubits
qubits = kernel.qalloc(2)

# Prepare the Bell state
kernel.h(qubits[0]) 
kernel.cx(qubits[0], qubits[1])

# Sample the final state generated by the kernel 
result = cudaq.sample(kernel, shots_count = 1000) 

print(result) 

{11:487, 00:513}
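
To make the sampled counts less mysterious, the following is a minimal sketch of what the two gates do to a state vector, written in plain Python without CUDA Quantum. The helper names (apply_h, apply_cx, sample) and the choice of qubit 0 as the leftmost bit are assumptions made for this illustration only; they are not part of the cudaq API.

```python
import math
import random
from collections import Counter

def apply_h(state, target, n):
    """Apply a Hadamard gate to qubit `target` of an n-qubit state vector."""
    s = 1 / math.sqrt(2)
    new = [0j] * len(state)
    mask = 1 << (n - 1 - target)  # qubit 0 is the leftmost bit (assumption)
    for idx, amp in enumerate(state):
        bit = 1 if idx & mask else 0
        # |0> -> (|0> + |1>)/sqrt(2), |1> -> (|0> - |1>)/sqrt(2)
        new[idx] += s * amp * (-1 if bit else 1)
        new[idx ^ mask] += s * amp
    return new

def apply_cx(state, control, target, n):
    """Apply a controlled-X: flip `target` wherever `control` is 1."""
    new = list(state)
    c_mask = 1 << (n - 1 - control)
    t_mask = 1 << (n - 1 - target)
    for idx in range(len(state)):
        if idx & c_mask:
            new[idx ^ t_mask] = state[idx]
    return new

def sample(state, shots, n, seed=None):
    """Draw measurement outcomes with probability |amplitude|^2."""
    rng = random.Random(seed)
    probs = [abs(a) ** 2 for a in state]
    outcomes = rng.choices(range(len(state)), weights=probs, k=shots)
    return Counter(format(o, f"0{n}b") for o in outcomes)

n = 2
state = [1 + 0j, 0j, 0j, 0j]          # start in |00>
state = apply_h(state, 0, n)          # superposition on qubit 0
state = apply_cx(state, 0, 1, n)      # entangle qubit 1 with qubit 0
counts = sample(state, shots=1000, n=n, seed=0)
print(dict(counts))
```

Only the outcomes 00 and 11 can appear, each with probability one half, which is why the counts reported above split roughly evenly between them.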

The language specification borrows concepts that have proven successful in CUDA; specifically, the separation of host and device code at the function boundary level. The code snippet below demonstrates this functionality on a GHZ state preparation example in C++.

#include <cudaq.h>

int main() {
  // Define the CUDA Quantum kernel as a C++ lambda
  auto ghz = [](int numQubits) __qpu__ {
    // Allocate a vector of qubits
    cudaq::qvector q(numQubits);

    // Prepare the GHZ state, leverage standard
    // control flow, specify the x operation
    // is controlled.
    h(q[0]);
    for (int i = 0; i < numQubits - 1; ++i)
      x<cudaq::ctrl>(q[i], q[i + 1]);
  };

  // Sample the final state generated by the kernel
  auto results = cudaq::sample(ghz, 15);
  results.dump();

  return 0;
}

CUDA Quantum enables the definition of quantum code as stand-alone kernel expressions. These expressions can be any callable in C++ (a lambda, which is an implicitly typed callable, is shown here) but must be annotated with the __qpu__ attribute, which enables the nvq++ compiler to compile them separately. Kernel expressions can take classical input by value (here, the number of qubits) and leverage standard C++ control flow, such as for loops and if statements.

The utility of GPUs

The experimental efforts to scale up QPUs and move them out of research labs and host them on the cloud for general access have been phenomenal. However, current QPUs are noisy and small-scale, hindering advancement of algorithmic research. To aid this, circuit simulation techniques are answering the pressing requirements to advance research frontiers. 

Desktop CPUs can simulate small-scale qubit statistics; however, the memory requirements of the state vector grow exponentially with the number of qubits. A typical desktop computer possesses 8 GB of RAM, enabling one to sluggishly simulate approximately 15 qubits. The latest NVIDIA DGX H100 enables you to surpass the 35-qubit mark with unparalleled speed.
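
The exponential growth can be made concrete with a short calculation. The sketch below assumes a full state vector stored as double-precision complex numbers (16 bytes per amplitude); it counts only the raw amplitudes, so practical limits are lower once working memory and runtime are taken into account. The function name statevector_bytes is introduced here for illustration.

```python
def statevector_bytes(num_qubits, bytes_per_amplitude=16):
    """Memory for a full state vector: 2**n complex amplitudes.

    The default of 16 bytes per amplitude assumes complex128 storage.
    """
    return (2 ** num_qubits) * bytes_per_amplitude

for n in (20, 30, 35):
    gib = statevector_bytes(n) / 2 ** 30
    print(f"{n} qubits: {gib:,.3f} GiB")
```

Each additional qubit doubles the requirement, which is why a handful of extra qubits moves a simulation from a laptop to a multi-GPU system.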

Figure 1 compares CUDA Quantum on CPU and GPU backends for a typical variational algorithmic workflow. The need for GPUs is evident: the speedup at 14 qubits is 425x and increases with qubit count. Extrapolating to 30 qubits, the estimated runtime is 13 years on the CPU, compared to 2 days on the GPU. This enables researchers to go beyond small-scale proof-of-concept results and implement algorithms closer to real-world applications.
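
As a quick sanity check on the extrapolated figures, the runtimes quoted for 30 qubits can be turned into a speedup factor. This is only arithmetic on the numbers stated above, assuming a 365.25-day year; it is not an additional measurement.

```python
DAYS_PER_YEAR = 365.25  # assumption used only for this conversion

def speedup(cpu_days, gpu_days):
    """Ratio of CPU runtime to GPU runtime."""
    return cpu_days / gpu_days

extrapolated = speedup(13 * DAYS_PER_YEAR, 2)
print(f"Implied speedup at 30 qubits: about {extrapolated:,.0f}x")
```

The implied factor is several times larger than the 425x measured at 14 qubits, consistent with the statement that the speedup grows with qubit count.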

Bar graph showing performance improvements in execution time between a CPU and GPU as a function of number of qubits. At 14 qubits, the GPU is 425 times faster than the CPU.
Figure 1. Performance comparison between CPU and GPU for a typical quantum neural network workflow as a function of qubit count 

Along with CUDA Quantum, NVIDIA has developed cuQuantum, a library enabling lightning-fast simulation of a quantum computer using both state vector and tensor network methods through hand-optimized CUDA kernels. Memory allocation and processing happen entirely on GPUs, resulting in dramatic increases in performance and scale. CUDA Quantum in combination with cuQuantum forms a powerful platform for hybrid algorithm research.

Figure 2 compares CUDA Quantum with a leading quantum computing SDK, both leveraging the NVIDIA cuQuantum backend to optimally offload circuit simulation onto NVIDIA GPUs. In this case, the benefits of using CUDA Quantum are isolated and yield a 5x performance improvement on average compared to a leading framework. 

 Line plot showing the execution time for a typical quantum neural network workflow as a function of number of qubits for CUDA Quantum and a leading framework. CUDA Quantum is on average 5x faster. Since both frameworks were executed on GPUs, we are isolating the performance benefits of using CUDA Quantum.
Figure 2. GPU-to-GPU comparison between CUDA Quantum and a leading framework, both offloading circuit simulation to NVIDIA GPUs, with CUDA Quantum on average 5x faster

Enabling multi-QPU workflows of the future

CUDA Quantum is not limited to consideration of current cloud-based quantum execution models, but is fully anticipating tightly coupled, system-level quantum acceleration. Moreover, CUDA Quantum enables application developers to envision workflows for multi-QPU architectures with multi-GPU backends. 

For the preceding quantum neural network (QNN) example, you can use the multi-GPU functionality to run a forward pass over the dataset, emulating the multi-QPU workflows of the future. Figure 3 shows results for distributing the QNN workflow across two GPUs. Using two GPUs makes the overall workflow twice as fast compared to a single GPU, demonstrating strong scaling and effective usage of all GPU compute resources.

Line plot showing the execution time of a typical quantum neural network workflow as a function of the number of qubits. The execution time is approximately half when two GPUs are used in comparison to a single GPU.
Figure 3. Results for distributing the QNN forward pass workload to multiple QPUs enabled by the multi-GPU backend

Another common workflow that benefits from multi-QPU parallelization is the Variational Quantum Eigensolver (VQE). This requires the expectation value of a composite Hamiltonian made up of multiple single Pauli tensor product terms. The CUDA Quantum observe call, shown below, automatically batches terms (Figure 4), and offloads to multiple GPUs or QPUs if available, demonstrating strong scaling (Figure 5). 

# ansatz and parameters are assumed to be defined earlier
numQubits, numTerms = 30, 100000
hamiltonian = cudaq.SpinOperator.random(numQubits, numTerms)
cudaq.observe(ansatz, hamiltonian, parameters)
Image showing a Hamiltonian composed of many terms being batched into four groups and offloaded to four GPUs.
Figure 4. Automatic batching of Hamiltonian terms across multiple NVIDIA A100 GPUs
Bar graph showing speedup in execution time gained by automatically batching a Hamiltonian composed of multiple terms into four batches and executing on four GPUs. The speedups gained demonstrate strong scaling.
Figure 5. Speedups gained due to an optimized software stack supporting the hardware available to the user, GPUs or QPUs
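
The batching idea itself is independent of any particular library. The sketch below splits a list of Hamiltonian terms into nearly equal batches, evaluates each batch on a separate worker, and sums the partial results. It is an illustration only: Python threads stand in for GPUs or QPUs, the Pauli strings are restricted to I and Z acting on a computational basis state so each term can be evaluated by hand, and all function names are hypothetical. It does not describe how cudaq.observe is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_batches(terms, num_batches):
    """Split terms into contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(terms), num_batches)
    batches, start = [], 0
    for b in range(num_batches):
        size = base + (1 if b < extra else 0)
        batches.append(terms[start:start + size])
        start += size
    return batches

def term_expectation(term, bits):
    """<bits| c * P |bits> for a Pauli string P made of 'I' and 'Z' only."""
    coeff, pauli = term
    value = coeff
    for op, bit in zip(pauli, bits):
        if op == "Z" and bit == "1":
            value = -value  # Z has eigenvalue -1 on |1>
    return value

def batch_expectation(batch, bits):
    """Partial expectation value contributed by one batch of terms."""
    return sum(term_expectation(t, bits) for t in batch)

def observe_batched(terms, bits, num_devices=4):
    """Evaluate each batch on its own worker, then combine the partial sums."""
    batches = split_into_batches(terms, num_devices)
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        partials = list(pool.map(lambda b: batch_expectation(b, bits), batches))
    return sum(partials)

hamiltonian = [(0.5, "ZII"), (-1.0, "IZZ"), (0.25, "ZZZ"), (2.0, "III"), (1.5, "IIZ")]
serial = batch_expectation(hamiltonian, "101")
batched = observe_batched(hamiltonian, bits="101")
print(serial, batched)
```

Because the expectation value is linear in the Hamiltonian terms, the batched result matches the serial one; the terms are independent, so the work divides cleanly across devices.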

GPU-QPU workflows 

This post has so far explored using GPUs for scaling quantum circuit simulation beyond what is possible on CPUs, as well as multi-QPU workflows. The following sections dive into true heterogeneous computing with a hybrid quantum neural network example using PyTorch and CUDA Quantum.

As shown in Figure 6, a hybrid quantum neural network encompasses a quantum circuit as a layer within the overall neural network architecture. Hybrid quantum neural networks are an active area of research and may prove advantageous in certain areas, for example by reducing generalization error.

Image showing layers of neural network nodes, the output of which acts as the input to a quantum circuit, which is measured to generate the loss function. This workflow enables one to integrate PyTorch layers with CUDA Quantum.
Figure 6. Hybrid quantum neural network architecture accelerated by GPUs made possible by CUDA Quantum

Evidently, it is advantageous to run the classical neural network layers on GPUs and the quantum circuits on QPUs. Accelerating the whole workflow with CUDA Quantum is made possible by setting the following: 

quantum_device = cudaq.set_target('ion-trap')
classical_device = torch.cuda.set_device(gpu0)

The utility of this is profound. CUDA Quantum enables offloading relevant kernels suited for QPUs and GPUs in a tightly integrated, seamless fashion. In addition to hybrid applications, workflows involving error correction, real-time optimal control, and error mitigation through Clifford data regression would all benefit from tightly coupled compute architectures. 
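
To illustrate how a quantum circuit can sit inside a trainable network, here is a minimal sketch in plain Python with no PyTorch or CUDA Quantum dependency. It assumes a one-parameter classical layer feeding a single-qubit circuit RY(theta) applied to |0>, whose Z expectation value is cos(theta), and it differentiates the quantum layer with the parameter-shift rule. The function names, the toy loss, and the learning rate are assumptions chosen for this example.

```python
import math

def quantum_layer(theta):
    """<Z> after RY(theta) acts on |0>; analytically this is cos(theta)."""
    return math.cos(theta)

def parameter_shift_grad(f, theta, shift=math.pi / 2):
    """Gradient of an expectation value from two shifted circuit evaluations."""
    return (f(theta + shift) - f(theta - shift)) / 2

def loss_and_grad(x, target, weight):
    theta = weight * x                      # classical layer: one linear weight
    pred = quantum_layer(theta)             # quantum layer: expectation value
    loss = (pred - target) ** 2
    dloss_dpred = 2 * (pred - target)
    dpred_dtheta = parameter_shift_grad(quantum_layer, theta)
    dtheta_dweight = x
    # Chain rule across the classical and quantum parts
    return loss, dloss_dpred * dpred_dtheta * dtheta_dweight

weight = 0.3
for step in range(200):
    loss, grad = loss_and_grad(x=1.0, target=0.0, weight=weight)
    weight -= 0.1 * grad                    # plain gradient descent
print(f"weight={weight:.3f}, loss={loss:.6f}")
```

On real hardware the two shifted evaluations would be circuit executions on the QPU, while the chain-rule bookkeeping stays on the classical side, which is the division of labor described above.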

QPU hardware providers 

The foundational unit of information in the CUDA Quantum programming paradigm is the qudit, a quantum system with d accessible states. A qubit is the specific case where d = 2. By using qudits, CUDA Quantum can efficiently target diverse quantum computing architectures, including superconducting circuits, ion traps, neutral atoms, diamond-based systems, photonic systems, and more.

You can conveniently develop workflows, and the nvq++ compiler automatically compiles and executes the program on the designated architecture. Figure 7 shows the compilation speedups that the novel compiler yields. Compilation involves circuit optimization, decomposition into the native gate sets supported by the hardware, and qubit routing. The nvq++ compiler used by CUDA Quantum is on average 2.4x faster compared to its competition.

Line graph showing how the compilation time scales with number of qubits for CUDA Quantum and a leading framework. The novel ‌compiler used by CUDA Quantum is on average 2.4x faster and its rate of increase (gradient) is also much shallower in comparison.
Figure 7. Compilation time scaling with the number of qubits for CUDA Quantum and a leading framework

To accommodate the desired backend, you can simply modify the set_target() flag. Figure 8 shows an example of how you can seamlessly switch between the simulated backend and the Quantinuum H1 ion trap system. The top shows the syntax to set the desired backend in Python and the bottom in C++. 

Image showing a heatmap of the cost landscape generated by a VQE workflow being executed on the cuQuantum simulated backend and the Quantinuum H1 processor. The ease with which users can change the backend and the syntax enabling this in Python and C++ is highlighted.
Figure 8. VQE landscape plots demonstrating execution on simulated or QPU hardware

Getting started with CUDA Quantum

This post has just briefly touched on some of the features of the CUDA Quantum programming model. Reach out to the CUDA Quantum community on GitHub and get started with some example code snippets. We are excited to see the research CUDA Quantum enables for you. 

Categories
Misc

Sailing Seas of Data: Startup Charts Autonomous Oceanic Monitoring

Saildrone is making a splash in autonomous oceanic monitoring. The startup’s nautical data collection technology has tracked hurricanes up close in the North Atlantic, discovered a 3,200-foot underwater mountain in the Pacific Ocean and begun to help map the entirety of the world’s ocean floor. Based in the San Francisco Bay Area, the company develops…

Categories
Offsites

SimPer: Simple self-supervised learning of periodic targets

Learning from periodic data (signals that repeat, such as a heart beat or the daily temperature changes on Earth’s surface) is crucial for many real-world applications, from monitoring weather systems to detecting vital signs. For example, in the environmental remote sensing domain, periodic learning is often needed to enable nowcasting of environmental changes, such as precipitation patterns or land surface temperature. In the health domain, learning from video measurement has been shown to extract (quasi-)periodic vital signs such as atrial fibrillation and sleep apnea episodes.

Approaches like RepNet highlight the importance of these types of tasks, and present a solution that recognizes repetitive activities within a single video. However, these are supervised approaches that require a significant amount of data to capture repetitive activities, all labeled to indicate the number of times an action was repeated. Labeling such data is often challenging and resource-intensive, requiring researchers to manually capture gold-standard temporal measurements that are synchronized with the modality of interest (e.g., video or satellite imagery).

Alternatively, self-supervised learning (SSL) methods (e.g., SimCLR and MoCo v2), which leverage a large amount of unlabeled data to learn representations that capture periodic or quasi-periodic temporal dynamics, have demonstrated success in solving classification tasks. However, they overlook the intrinsic periodicity (i.e., the ability to identify if a frame is part of a periodic process) in data and fail to learn robust representations that capture periodic or frequency attributes. This is because periodic learning exhibits characteristics that are distinct from prevailing learning tasks.

Feature similarity is different in the context of periodic representations as compared to static features (e.g., images). For example, videos that are offset by short time delays or are reversed should be similar to the original sample, whereas videos that have been upsampled or downsampled by a factor x should be different from the original sample by a factor of x.

To address these challenges, in “SimPer: Simple Self-Supervised Learning of Periodic Targets”, published at the eleventh International Conference on Learning Representations (ICLR 2023), we introduced a self-supervised contrastive framework for learning periodic information in data. Specifically, SimPer leverages the temporal properties of periodic targets using temporal self-contrastive learning, where positive and negative samples are obtained through periodicity-invariant and periodicity-variant augmentations from the same input instance. We propose periodic feature similarity that explicitly defines how to measure similarity in the context of periodic learning. Moreover, we design a generalized contrastive loss that extends the classic InfoNCE loss to a soft regression variant that enables contrasting over continuous labels (frequency). Next, we demonstrate that SimPer effectively learns periodic feature representations compared to state-of-the-art SSL methods, highlighting its intriguing properties including better data efficiency, robustness to spurious correlations, and generalization to distribution shifts. Finally, we are excited to release the SimPer code repo to the research community.

The SimPer framework

SimPer introduces a temporal self-contrastive learning framework. Positive and negative samples are obtained through periodicity-invariant and periodicity-variant augmentations from the same input instance. For temporal video examples, periodicity-invariant changes are cropping, rotation or flipping, whereas periodicity-variant changes involve increasing or decreasing the speed of a video.
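
The two kinds of augmentation can be sketched on a one-dimensional signal. In the example below, reversal and circular shifting stand in for periodicity-invariant changes, and resampling by a speed factor stands in for a periodicity-variant change. The function names and the use of linear interpolation are assumptions made for illustration; the released SimPer code operates on video and may differ.

```python
import math

def make_signal(cycles, n_samples=64):
    """A sine wave with `cycles` full periods over n_samples points."""
    return [math.sin(2 * math.pi * cycles * t / n_samples) for t in range(n_samples)]

def change_speed(signal, speed):
    """Periodicity-variant: resample so the signal plays `speed` times faster."""
    out = []
    for i in range(len(signal)):
        pos = i * speed
        lo = int(pos)
        if lo + 1 >= len(signal):
            break  # ran out of source samples
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[lo + 1])
    return out

def reverse(signal):
    """Periodicity-invariant: the period is unchanged when played backwards."""
    return signal[::-1]

def circular_shift(signal, k):
    """Periodicity-invariant: a time offset does not change the period."""
    k %= len(signal)
    return signal[k:] + signal[:k]

def make_negative_views(signal, speeds=(0.5, 1.0, 2.0)):
    """Pair each speed-altered view with its pseudo speed label."""
    return [(s, change_speed(signal, s)) for s in speeds]

x = make_signal(cycles=4)
negatives = make_negative_views(x)
positives = [reverse(x), circular_shift(x, 5)]
print([(label, len(view)) for label, view in negatives])
```

The speed factor attached to each negative view acts as a pseudo-label even though the true frequency of x is never used. The views here come out with different lengths; a real pipeline would crop or pad them to a fixed size.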

To explicitly define how to measure similarity in the context of periodic learning, SimPer proposes periodic feature similarity. This construction allows us to formulate training as a contrastive learning task. A model can be trained with data without any labels and then fine-tuned if necessary to map the learned features to specific frequency values.

Given an input sequence x, we know there’s an underlying associated periodic signal. We then transform x to create a series of speed- or frequency-altered samples, which changes the underlying periodic target, thus creating different negative views. Although the original frequency is unknown, we effectively devise pseudo speed or frequency labels for the unlabeled input x.

Conventional similarity measures such as cosine similarity emphasize strict proximity between two feature vectors, and are sensitive to index-shifted features (which represent different time stamps), reversed features, and features with changed frequencies. In contrast, periodic feature similarity should be high for samples with small temporal shifts or reversed indexes, while capturing a continuous similarity change when the feature frequency varies. This can be achieved via a similarity metric in the frequency domain, such as the distance between two Fourier transforms.
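
A small experiment shows why a frequency-domain measure behaves differently from plain cosine similarity. The sketch below compares the two on a sine wave and on its shifted, reversed, and faster variants. Using the cosine similarity between magnitude spectra is one possible frequency-domain choice, picked here as an assumption for illustration; it is not necessarily the exact measure used in the paper.

```python
import cmath
import math

def cosine_similarity(a, b):
    """Conventional cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def magnitude_spectrum(signal):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) version)."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        mags.append(abs(coeff))
    return mags

def periodic_similarity(a, b):
    """Compare signals by their magnitude spectra instead of raw samples."""
    return cosine_similarity(magnitude_spectrum(a), magnitude_spectrum(b))

def sine(cycles, n=64):
    return [math.sin(2 * math.pi * cycles * t / n) for t in range(n)]

base = sine(4)
shifted = base[5:] + base[:5]   # small temporal shift
backwards = base[::-1]          # reversed indexes
faster = sine(8)                # doubled frequency

for name, other in [("shifted", shifted), ("reversed", backwards), ("faster", faster)]:
    print(name,
          round(cosine_similarity(base, other), 3),
          round(periodic_similarity(base, other), 3))
```

The magnitude spectrum discards phase, so shifts and reversals leave the frequency-domain similarity high while plain cosine similarity drops, and a change in frequency moves the spectral peak and lowers the frequency-domain similarity.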

To harness the intrinsic continuity of augmented samples in the frequency domain, SimPer designs a generalized contrastive loss that extends the classic InfoNCE loss to a soft regression variant that enables contrasting over continuous labels (frequency). This makes it suitable for regression tasks, where the goal is to recover a continuous signal, such as a heart beat.
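
One way to picture a soft variant of InfoNCE is as a cross-entropy between two distributions: one derived from feature similarities, the other from label similarities. The sketch below follows that reading. The exact weighting in the paper may differ, so treat the function soft_contrastive_loss, the label similarity (negative absolute difference of speed labels), and the temperature value as assumptions for illustration.

```python
import math

def log_softmax(xs):
    """Numerically stable log-softmax."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]

def softmax(xs):
    return [math.exp(v) for v in log_softmax(xs)]

def soft_contrastive_loss(feature_sims, label_sims, temperature=0.1):
    """Cross-entropy between soft label targets and feature-similarity scores.

    feature_sims[j]: similarity between the anchor's features and view j.
    label_sims[j]:   similarity between the anchor's label and view j's label.
    With a one-hot target this reduces to the classic InfoNCE form.
    """
    targets = softmax(label_sims)
    log_probs = log_softmax([s / temperature for s in feature_sims])
    return -sum(t * lp for t, lp in zip(targets, log_probs))

speeds = [0.5, 1.0, 2.0]        # pseudo speed labels of three views
anchor_speed = 1.0
label_sims = [-abs(anchor_speed - s) for s in speeds]

# Features that rank the views like the labels do, versus features that do not
aligned = soft_contrastive_loss([0.5, 1.0, 0.0], label_sims)
misaligned = soft_contrastive_loss([0.0, 0.0, 1.0], label_sims)
print(f"aligned={aligned:.3f}, misaligned={misaligned:.3f}")
```

The loss is lower when views with closer speed labels also have more similar features, which is the continuous-label behavior that a hard positive/negative split cannot express.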

SimPer constructs negative views of data through transformations in the frequency domain. The input sequence x has an underlying associated periodic signal. SimPer transforms x to create a series of speed or frequency altered samples, which changes the underlying periodic target, thus creating different negative views. Although the original frequency is unknown, we effectively devise pseudo speed or frequency labels for unlabeled input x (periodicity-variant augmentations τ). SimPer takes transformations that do not change the identity of the input and defines these as periodicity-invariant augmentations σ, thus creating different positive views of the sample. Then, it sends these augmented views to the encoder f, which extracts corresponding features.

Results

To evaluate SimPer’s performance, we benchmarked it against state-of-the-art SSL schemes (e.g., SimCLR, MoCo v2, BYOL, CVRL) on a set of six diverse periodic learning datasets for common real-world tasks in human behavior analysis, environmental remote sensing, and healthcare. Specifically, below we present results on heart rate measurement and exercise repetition counting from video. The results show that SimPer outperforms the state-of-the-art SSL schemes across all six datasets, highlighting its superior performance in terms of data efficiency, robustness to spurious correlations, and generalization to unseen targets.

Here we show quantitative results on two representative datasets, with models pre-trained using various SSL methods and fine-tuned on the labeled data. First, we pre-train SimPer using the Univ. Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy (UBFC) dataset, a human photoplethysmography and heart rate prediction dataset, and compare its performance to state-of-the-art SSL methods. We observe that SimPer outperforms SimCLR, MoCo v2, BYOL, and CVRL methods. The results on the human action counting dataset, Countix, further confirm the benefits of SimPer over other methods, as it notably outperforms the supervised baseline. For the feature evaluation results and performance on other datasets, please refer to the paper.

Results of SimCLR, MoCo v2, BYOL, CVRL and SimPer on the Univ. Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy (UBFC) and Countix datasets. Heart rate and repetition count performance is reported as mean absolute error (MAE).

Conclusion and applications

We present SimPer, a self-supervised contrastive framework for learning periodic information in data. We demonstrate that by combining a temporal self-contrastive learning framework, periodicity-invariant and periodicity-variant augmentations, and continuous periodic feature similarity, SimPer provides an intuitive and flexible approach for learning strong feature representations for periodic signals. Moreover, SimPer can be applied to various fields, ranging from environmental remote sensing to healthcare.

Acknowledgements

We would like to thank Yuzhe Yang, Xin Liu, Ming-Zher Poh, Jiang Wu, Silviu Borac, and Dina Katabi for their contributions to this work.

Categories
Misc

Advanced API Performance: Pipeline State Objects

A graphic of a computer sending code to multiple stacks.

Pipeline state objects (PSOs) define how input data is interpreted and rendered by the hardware when submitting work to the GPUs. Proper management of PSOs is essential for optimal usage of system resources and smooth gameplay.

Recommended:

  • Create PSOs on worker threads asynchronously.
    • PSO creation is where shader compilation and related stalls happen.
  • Start with generic PSOs with generic shaders that compile quickly and generate specializations later.
    • This gets you up and running faster even if you are not running the most optimal PSO or shader yet.
    • Shaders shared between PSOs will only compile once.
  • Avoid runtime PSO compilations as they most likely will lead to stalls.
    • The driver-managed shader disk cache may come to the rescue.
  • Use PSO libraries.
  • Use identical sensible defaults for “don’t care” fields wherever possible.
    • This allows for more possibilities for PSO reuse.
  • Use the /all_resources_bound / D3DCOMPILE_ALL_RESOURCES_BOUND compile flag if possible.
    • The compiler can do a better job at optimizing texture accesses. 
  • Arrange draw calls by PSO & tessellation usage.
  • Remember that PSO creation is where shaders are compiled and stalls are introduced.
    • It is really important to create PSOs asynchronously and early enough before they are used.
    • Tread carefully with thread priorities for PSO compilation threads.
    • Use Idle priority if there is no ‘hurry’ to prevent slowdowns for game threads.
    • Consider temporarily boosting priorities when there is a ‘hurry’.
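
The asynchronous-creation pattern recommended above can be sketched independently of any graphics API. The example below uses Python threads and a stand-in compile function to show three of the ideas: kick off creation early on worker threads, share one compilation between identical requests, and fall back to a generic object until the specialized one is ready. Everything here (compile_pso, AsyncPsoCache, the description strings) is hypothetical; a real engine would call its graphics API from native worker threads and would also manage thread priorities, which this sketch does not model.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def compile_pso(description):
    """Stand-in for an expensive PSO creation call (hypothetical)."""
    time.sleep(0.05)  # simulate shader compilation time
    return f"compiled<{description}>"

class AsyncPsoCache:
    """Creates PSOs on worker threads; intended to be driven from one thread."""

    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._pending = {}

    def request(self, description):
        """Start compilation early; identical descriptions compile only once."""
        if description not in self._pending:
            self._pending[description] = self._pool.submit(compile_pso, description)

    def get(self, description, fallback):
        """Return the PSO if it is ready, otherwise a fallback, without blocking."""
        future = self._pending.get(description)
        if future is not None and future.done():
            return future.result()
        return fallback

    def wait_all(self):
        """Block until every requested PSO exists, e.g. behind a loading screen."""
        for future in self._pending.values():
            future.result()

    def close(self):
        self._pool.shutdown(wait=True)

cache = AsyncPsoCache()
cache.request("generic_opaque")
cache.request("specialized_foliage")
cache.request("generic_opaque")  # deduplicated: no second compilation

early = cache.get("specialized_foliage", fallback="generic placeholder")
cache.wait_all()
late = cache.get("specialized_foliage", fallback="generic placeholder")
cache.close()
print(early, "->", late)
```

The render loop never waits on a compilation: it keeps drawing with the fallback and picks up the specialized object on a later frame.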

Not recommended:

  • Toggling between compute and graphics on the same command queue more than necessary.
    • This is still a heavyweight switch to make.
  • Toggling tessellation on/off more than necessary.
    • This is also a heavyweight switch to make.
  • Using FXC to generate DXBC in DX12.
    • This causes extra DXBC to DXIL translation, increasing compilation time and PSO library size.
  • Serializing very large (hundreds of thousands) numbers of PSOs to disk in PSO libraries at once.
    • This may significantly bloat the usage of system memory.
    • Use the “miss and update the PSO library” strategy instead.

This post covers best practices when working with pipeline state objects on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips.

Acknowledgments

Thanks to Patrick Neil and Dhiraj Kumar for their advice and assistance.

Categories
Misc

Developing a Pallet Detection Model Using OpenUSD and Synthetic Data

Stacked pallets

Imagine you are a robotics or machine learning (ML) engineer tasked with developing a model to detect pallets so that a forklift can manipulate them. ‌You are familiar with traditional deep learning pipelines, you have curated manually annotated datasets, and you have trained successful models. 

You are ready for the next challenge, which comes in the form of large piles of densely stacked pallets. You might wonder, where should I begin? ‌Is 2D bounding box detection or instance segmentation most useful for this task? ‌Should I do 3D bounding box detection and, if so, how will I annotate it? ‌Would it be best to use a monocular camera, stereo camera, or lidar for detection? ‌Given the sheer quantity of pallets that occur in natural warehouse scenes, manual annotation will not be an easy endeavor. And if I get it wrong, it could be costly.

This is what I wondered when faced with a similar situation. Fortunately, I had an easy way to get started with relatively low commitment: synthetic data.

Overview of synthetic data

Synthetic Data Generation (SDG) is a technique for generating data to train neural networks using rendered images rather than real-world images. ‌The advantage of using synthetically rendered data is that you implicitly know the full shape and location of objects in the scene and can generate annotations like 2D bounding boxes, keypoints, 3D bounding boxes, segmentation masks, and more. ‌

Synthetic data can be a great way to bootstrap a deep learning project, as it enables you to rapidly iterate on ideas before committing to large manual data annotation efforts or in cases where data is limited, restricted, or simply does not exist. For such cases, you might find that synthetic data with domain randomization works very well for your application out of the box on the first try. And voilà, you save time.

Alternatively, you might find that you need to redefine the task or use a different sensor modality.  Using synthetic data, you can experiment with these decisions without committing to a costly annotation effort.  

In many cases, you may still benefit from using some real-world data. ‌The nice part is, by experimenting with synthetic data you will have more familiarity with the problem, and can invest your annotation effort where it counts the most. Each ML task presents its own challenges, so it is difficult to determine exactly how synthetic data will fit in, whether you will need to use real-world data, or a mix of synthetic and real data.  

Using synthetic data to train a pallet segmentation model

When considering how to use synthetic data to train a pallet detection model, our team started small. Before we considered 3D box detection or anything complex, we first wanted to see if we could detect anything at all using a model trained with synthetic data. To do so, we rendered a simple dataset of scenes containing just one or two pallets with a box on top. ‌We used this data to train a semantic segmentation model.  

We chose to train a semantic segmentation model because the task is well defined and the model architectures are relatively simple. It is also possible to visually identify where the model is failing (the incorrectly segmented pixels).
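
Locating the incorrectly segmented pixels is straightforward once a prediction and a reference mask are available. The sketch below builds a per-pixel error mask and also computes intersection over union (IoU), a common summary score that is added here as an assumption; this post does not state which metric the team used. The tiny hand-written masks are illustrative only.

```python
def error_mask(pred, truth):
    """1 where the predicted label differs from the reference, else 0."""
    return [[int(p != t) for p, t in zip(pred_row, truth_row)]
            for pred_row, truth_row in zip(pred, truth)]

def iou(pred, truth):
    """Intersection over union of two binary masks."""
    pairs = [(p, t) for pred_row, truth_row in zip(pred, truth)
             for p, t in zip(pred_row, truth_row)]
    intersection = sum(p & t for p, t in pairs)
    union = sum(p | t for p, t in pairs)
    return intersection / union if union else 1.0

# 1 marks pallet pixels, 0 marks background
truth = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
pred = [[0, 0, 0, 0],
        [0, 1, 1, 1],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]

errors = error_mask(pred, truth)
print("wrong pixels:", sum(map(sum, errors)), "IoU:", iou(pred, truth))
```

Overlaying the error mask on the input image makes failure regions visible at a glance, which is the kind of visual inspection described above.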

To train the segmentation model, the team first rendered coarse synthetic scenes (Figure 1).

A rendering of two pallets with a box on top. ‌The rendering is coarse, and the box is a uniform gray color.
Figure 1. A coarse synthetic rendering of two pallets with a box on top

The team suspected that these rendered images alone would lack the diversity to train a meaningful pallet detection model. We therefore decided to experiment with augmenting the synthetic renderings using generative AI to produce more realistic images. Applying generative AI to the images before training added variation that we believed would improve the ability of the model to generalize to the real world.

This was done using a depth conditioned generative model, which roughly preserved the pose of objects in the rendered scene. Note that using generative AI is not required when working with SDG. You could also try using traditional domain randomization, like varying the synthetic textures, colors, location, and orientation of the pallets. ‌You may find that traditional domain randomization by varying the rendered textures is sufficient for the application.
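
Traditional domain randomization amounts to sampling scene parameters from chosen ranges before each render. The sketch below draws textures, color jitter, positions, and orientations for a few pallets per scene. The parameter names, texture labels, and ranges are hypothetical placeholders; in a real pipeline these values would be handed to the renderer rather than printed.

```python
import random

# Hypothetical texture labels; a real pipeline would reference actual assets
TEXTURES = ["wood_light", "wood_dark", "plastic_blue", "plastic_black"]

def sample_scene(rng, max_pallets=3):
    """Draw one randomized scene description."""
    pallets = []
    for _ in range(rng.randint(1, max_pallets)):
        pallets.append({
            "texture": rng.choice(TEXTURES),
            "color_jitter": rng.uniform(-0.2, 0.2),
            "position_m": (rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)),
            "yaw_deg": rng.uniform(0.0, 360.0),
        })
    return {"pallets": pallets, "light_intensity": rng.uniform(0.5, 1.5)}

rng = random.Random(42)  # fixed seed so a dataset can be regenerated
scenes = [sample_scene(rng) for _ in range(5)]
for scene in scenes:
    print(len(scene["pallets"]), "pallets, light", round(scene["light_intensity"], 2))
```

Seeding the generator keeps the dataset reproducible, and widening a range or adding a texture is a one-line change when testing reveals a gap in the training distribution.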

An image of the synthetically rendered scene augmented using generative AI.  The augmented image looks photorealistic, and the uniform gray box is replaced with a plastic wrapped box.
Figure 2. The synthetic rendering, augmented using generative AI

After rendering about 2,000 of these synthetic images, we trained a ResNet-18-based U-Net segmentation model using PyTorch. The results quickly showed great promise on real-world images (Figure 3).

An image showing a single pallet with a box on top. ‌The pallet is highlighted in green to show the semantic segmentation result.
Figure 3. Real-world pallet image, tested with segmentation model 

The model could accurately segment the pallet. Based on this result, we developed more confidence in the workflow, but the challenge was far from over. Up to this point, the team’s approach did not distinguish between instances of pallets, and it did not detect pallets that were not placed on the floor. ‌For images like the one shown in Figure 4, the results were barely usable. This likely meant that we needed to adjust our training distribution.

An image showing the semantic segmentation results on a warehouse scene with pallets and stacked boxes.  The segmentation model fails to detect pallets that aren't on the floor.
Figure 4. Semantic segmentation model fails to detect stacked pallets

Iteratively increasing the data diversity to improve accuracy

To improve the accuracy of the segmentation model, the team added more images of a wider variety of pallets stacked in different random configurations. We added about 2,000 more images to our dataset, bringing the total to about 4,000 images. ‌We created the stacked pallet scenes using the USD Scene Construction Utilities open-source project. 

USD Scene Construction Utilities was used to position pallets relative to each other in configurations that reflect the distribution you might see in the real world. ‌We used Universal Scene Description (OpenUSD) SimReady Assets, which offered a large diversity of pallet models to choose from.

Images of stacked pallets rendered using Omniverse Replicator.  The pallets vary in type, color and orientation.
Figure 5. Structured scenes created using the USD Python API and USD Scene Construction Utilities, and further randomized and rendered with Omniverse Replicator

Training with the stacked pallets, and with a wider variety of viewpoints, we were able to improve the accuracy of the model for these cases.

If adding this data helped the model, why generate only 2,000 images if there is no added annotation cost? We did not start with many images because we were sampling from the same synthetic distribution. ‌Adding more images would not necessarily add much diversity to our dataset. Instead, we might just be adding many similar images without‌ improving the model’s real-world accuracy.  

Starting small enabled the team to quickly train the model, see where it failed, and adjust the SDG pipeline and add more data. ‌For example, after noticing the model had a bias towards specific colors and shapes of pallets, we added more synthetic data to address these failure cases.

A rendering of scenes containing plastic pallets in many different colors.
Figure 6. ‌A rendering of plastic pallets in various colors

These data variations improved the model’s ability to handle the failure scenarios it encountered (plastic and colored pallets).

If data variation is good, why not just go all-out and add a lot of variation at once? Until our team began testing on real-world data, it was difficult to tell what variance might be required. ‌We might have missed important factors needed to make the model work well. Or, we might have overestimated the importance of other factors, exhausting our effort unnecessarily. ‌By iterating, we better understood what data was needed for the task.

Extending the model for pallet side face center detection

Once we had some promising results with segmentation, the next step was to adjust the task from semantic segmentation to something more practical. ‌We decided that the simplest next task to evaluate was detecting the center of the pallet side faces. 

An image showing a rendered sample with a heat map overlaid on top of the center of the pallet’s side faces.
Figure 7. Example data for the pallet side face center detection task

The pallet side face center points are where a forklift would center itself when manipulating the pallet. ‌While more information may be necessary in practice to manipulate the pallet (such as the distance and angle at this point), we considered this point a simple next step in this process that enables the team to assess how useful our data is for any downstream application.  

Detecting these points could be done with heat map regression, which, like segmentation, is done in the image domain, is easy to implement, and simple to visually interpret. ‌By training a model for this task, we could quickly assess how useful our synthetic dataset is at training a model to detect important key points for manipulation.
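As a rough illustration of heat map regression targets, the sketch below renders one Gaussian peak per key point and then recovers the points as local maxima. The image size, the `sigma` value, the threshold, and the function names are assumptions made for the example, not details of the pallet model's training code.

```python
import numpy as np

def make_heatmap(height, width, centers, sigma=2.0):
    """Render one Gaussian peak per (x, y) center into a single-channel heat map."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float64)
    for cx, cy in centers:
        peak = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, peak)  # overlapping peaks keep their own maxima
    return heatmap

def find_peaks(heatmap, threshold=0.5):
    """Return sorted (x, y) pixels above the threshold that are 3x3 local maxima."""
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Stack the 9 shifted views of the map so each pixel sees its 3x3 neighborhood.
    neighbors = np.stack([
        padded[dy:dy + h, dx:dx + w]
        for dy in range(3) for dx in range(3)
    ])
    is_max = heatmap >= neighbors.max(axis=0)
    ys, xs = np.nonzero(is_max & (heatmap > threshold))
    return sorted(zip(xs.tolist(), ys.tolist()))

# Hypothetical face-center locations in a 64 x 64 image.
target = make_heatmap(64, 64, centers=[(16, 20), (40, 45)])
peaks = find_peaks(target)
print(peaks)
```

A network trained against targets like `target` can be inspected by overlaying its output on the input image, which is what makes this formulation easy to interpret visually.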

The results after training were promising, as shown in Figure 8.

Multiple images showing the heat maps of the pallet side face detection model in multiple scenarios. ‌The scenarios include pallets side by side on the floor, pallets stacked neatly on top of each other, and pallets stacked with boxes.
Figure 8. Real-world detection results for the pallet side face detection model

The team confirmed the ability to detect the pallet side faces using synthetic data, even with closely stacked pallets. We continued to iterate on the data, model, and training pipeline to improve the model for this task. 

Extending the model for corner detection

‌When we reached a satisfactory point for the side face center detection model, we explored taking the task to the next level: detecting the corners of the box.  The initial approach was to use a heat map for each corner, similar to the approach for the pallet side face centers.

An image showing heat map detection for the corners of a pallet with a box on top. The heat maps for the occluded corners are blurry, indicating the difficulty the model has in predicting the precise location of these points.
Figure 9. ‌Pallet corner detection model using heat maps

However, this approach quickly presented a challenge. Because the object for detection had unknown dimensions, it was difficult for the model to precisely infer where the corner of the pallet should be if it was not directly visible. Using heat maps, if the peak values are inconsistent, it is difficult to parse them reliably.

So, instead of using heat maps, we chose to regress the corner locations after detecting the face center peak. We trained a model to infer a vector field that contains the offset of the corners from a given pallet face center. ‌This approach quickly showed promise for this task, and we could provide meaningful estimates of corner locations, even with large occlusions.
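A minimal sketch of this decoding step follows, assuming the network outputs a face-center heat map of shape (H, W) and an offset field of shape (H, W, 4, 2) that stores a (dx, dy) vector per corner at every pixel. That tensor layout, the threshold, and the function name are assumptions for illustration; the source does not describe the model's actual output format.

```python
import numpy as np

def decode_corners(heatmap, offsets, threshold=0.5):
    """Find the strongest face-center peak and add the corner offsets stored there.

    heatmap: (H, W) face-center confidence.
    offsets: (H, W, 4, 2) per-pixel (dx, dy) from the face center to each corner.
    Returns a (4, 2) array of corner (x, y) positions, or None if no peak is found.
    """
    cy, cx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    if heatmap[cy, cx] < threshold:
        return None
    center = np.array([cx, cy], dtype=np.float64)
    return center + offsets[cy, cx]

# Toy prediction: one face center at (x=12, y=10) with hand-written offsets.
heatmap = np.zeros((32, 32))
heatmap[10, 12] = 0.9
offsets = np.zeros((32, 32, 4, 2))
offsets[10, 12] = [[-5, -3], [5, -3], [5, 3], [-5, 3]]

corners = decode_corners(heatmap, offsets)
print(corners)
```

Because the offsets are read at the detected center, a corner can still be estimated when it is occluded, as long as the face center itself produces a clear peak.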

An image showing four pallets in a cluttered scene. The pallets are detected and their shape is approximately determined. This shows the ability of the regression model to handle the heat map model’s failure case.
Figure 10. ‌The pallet detection results using face center heat map and vector field-based corner regression

Now that the team had a promising working pipeline, we iterated and scaled this process to address different failure cases that arose. In total, our final model was trained on roughly 25,000 rendered images. Trained at a relatively low resolution (256 x 256 pixels), our model was capable of detecting small pallets by running inference at higher resolutions. In the end, we were able to detect challenging scenes, like the one above, with relatively high accuracy.

This was something we could use, all created with synthetic data. This is where our pallet detection model stands today.

An image showing nearly 100 pallets, some of varied shape, stacked in a warehouse.  The model detects each pallet except a few in the background.
Figure 11. The final pallet model detection results, with only the front face of the detection shown for ease of visualization

A GIF of the pallet detection model running in real time, detecting a single black plastic pallet. The video is shaky and blurry, demonstrating the ability of the model to detect the pallet even under adverse conditions.
Figure 12. The pallet detection model running in real time

Get started building your own model with synthetic data

By iteratively developing with synthetic data, our team developed a pallet detection model that works on real-world images. Further progress may be possible with more iteration. Beyond this point, our task might benefit from the addition of real-world data. However, without synthetic data generation, we could not have iterated as quickly, as each change we made would have required new annotation efforts.

If you are interested in trying this model, or are working on an application that could use a pallet detection model, you can find both the model and inference code by visiting SDG Pallet Model on GitHub. The repo includes the pretrained ONNX model as well as instructions to optimize the model with TensorRT and run inference on an image. The model can run in real time on NVIDIA Jetson AGX Orin, so you will be able to run it at the edge. 

You can also check out the recently open-sourced project, USD Scene Construction Utilities, which contains examples and utilities for building USD scenes using the USD Python API. 

We hope our experience inspires you to explore how you can use synthetic data to bootstrap your AI application. If you’d like to get started with synthetic data generation, NVIDIA offers a suite of tools to simplify the process. These include:

  1. Universal Scene Description (OpenUSD): Described as the HTML of the metaverse, USD is a framework for fully describing 3D worlds. Not only does USD include primitives like 3D object meshes, but it can also describe materials, lighting, cameras, physics, and more.
  2. NVIDIA Omniverse Replicator: A core extension of the NVIDIA Omniverse platform, Replicator enables developers to generate large and diverse synthetic training data to bootstrap perception model training. With features such as easy-to-use APIs, domain randomization, and multi-sensor simulation, Replicator can address the lack of data challenge and accelerate the model training process. 
  3. SimReady Assets: Simulation-ready assets are physically accurate 3D objects that encompass accurate physical properties, behavior, and connected data streams to represent the real world in simulated digital worlds. NVIDIA offers a collection of realistic assets and materials that can be used out-of-the-box for constructing 3D scenes. This includes a variety of assets related to warehouse logistics, like pallets, hand trucks, and cardboard boxes. To search, display, inspect, and configure SimReady assets before adding them to an active stage, you can use the SimReady Explorer extension. Each SimReady asset has its own predefined semantic label, making it easier to generate annotated data for segmentation or object detection models. 

If you have questions about the pallet model, synthetic data generation with NVIDIA Omniverse, or inference with NVIDIA Jetson, reach out on GitHub or visit the NVIDIA Omniverse Synthetic Data Generation Developer Forum and the NVIDIA Jetson Orin Nano Developer Forum.

Explore what’s next in AI at SIGGRAPH

Join us at SIGGRAPH 2023 for a powerful keynote by NVIDIA CEO Jensen Huang. You’ll get an exclusive look at some of our newest technologies, including award-winning research, OpenUSD developments, and the latest AI-powered solutions for content creation.

Get started with NVIDIA Omniverse by downloading the standard license free, or learn how Omniverse Enterprise can connect your team. If you’re a developer, get started building your first extension or developing a Connector with Omniverse resources. Stay up-to-date on the platform by subscribing to the newsletter, and following NVIDIA Omniverse on Instagram, Medium, and Twitter. For resources, check out our forums, Discord server, Twitch, and YouTube channels.

Categories
Misc

Research Unveils Breakthrough Deep Learning Tool for Understanding Neural Activity and Movement Control

A black and white GIF of a mouse walking on a wheel.

A primary goal in the field of neuroscience is understanding how the brain controls movement. By improving pose estimation, neurobiologists can more precisely quantify natural movement and in turn, better understand the neural activity that drives it. This enhances scientists’ ability to characterize animal intelligence, social interaction, and health. 

Columbia University researchers recently developed a video-centric deep learning package that tracks animal movement more robustly from video, which helps: 

  • obtain reliable pose predictions in the face of occlusions and dataset shifts. 
  • train on images and videos simultaneously, while significantly shortening training time.
  • simplify the software engineering needed to train models, form predictions, and visualize the results.

Named Lightning Pose, the tool trains deep learning models in PyTorch Lightning on both labeled images and unlabeled videos, which are decoded and processed on the GPU using NVIDIA DALI.

In this blog post, you’ll see how contemporary computer vision architectures benefit from open-source, GPU-accelerated video processing. 

Deep learning algorithms for automatic pose tracking in video have recently garnered much attention in neuroscience. ‌The standard approach involves training a convolutional network in a fully supervised approach on a set of annotated images. ‌

Most convolutional architectures are built for handling single images and don’t use the useful temporal information hidden in videos. ‌By tracking each keypoint individually, these networks may generate nonsensical poses or ones that are inconsistent across multiple cameras.‌ Despite its wide adoption and success, the prevailing approach tends to overfit the training set and struggles to generalize to unseen animals or laboratories.

An efficient approach to animal pose tracking

The Lightning Pose package, represented in Figure 1, is a set of deep learning models for animal pose tracking, implemented in PyTorch Lightning. It takes a video-centric and semi-supervised approach to training of the pose estimation models. ‌In addition to training on a set of labeled frames, it trains on many unlabeled video clips and penalizes itself when its sequences of pose predictions are incoherent (that is, violate basic spatiotemporal constraints). ‌The unlabeled videos are decoded and processed on the fly directly on a GPU using DALI.
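To make the idea of penalizing incoherent pose sequences concrete, here is a small sketch of a temporal smoothness penalty on unlabeled video predictions. It charges a cost only when a key point moves more than `epsilon` pixels between consecutive frames. The function name, the threshold value, and the exact form of the penalty are assumptions for illustration and are not taken from the Lightning Pose code base.

```python
import numpy as np

def temporal_loss(keypoints, epsilon=20.0):
    """Mean excess frame-to-frame displacement beyond epsilon pixels.

    keypoints: (T, K, 2) predicted (x, y) for K key points over T frames.
    """
    # Euclidean distance each key point travels between consecutive frames: (T-1, K).
    jumps = np.linalg.norm(np.diff(keypoints, axis=0), axis=-1)
    return float(np.maximum(jumps - epsilon, 0.0).mean())

# A smooth trajectory: one key point moving 1 pixel per frame along x.
smooth = np.zeros((5, 1, 2))
smooth[:, 0, 0] = [0, 1, 2, 3, 4]

# The same trajectory with one implausible prediction in the middle frame.
jumpy = smooth.copy()
jumpy[2, 0, 0] = 102

print(temporal_loss(smooth), temporal_loss(jumpy))
```

No labels are needed to compute this quantity, which is why a term of this kind can be evaluated on unlabeled video clips alongside a supervised loss on the labeled frames.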

The three-layered approach to pose estimation. The PyTorch Lightning layer (0) covers the data loaders, the architecture, and loss calculation. The second layer (1) covers the model design. The third layer (2) is where Hydra handles configuration and hyperparameter sweeps.
Figure 1: The structure of the Lightning Pose package. Data loading (including DALI video readers), models, and a loss factory are wrapped inside a PyTorch Lightning trainer and a Hydra configurator

During training, videos are randomly modified, or augmented, in various ways by DALI. This exposes the network to a wider range of training examples and prepares it better for unexpected systematic variations in the data it may encounter when deployed.

Its semi-supervised architecture, shown in Figure 2, learns from both labeled and unlabeled frames.

Lightning Pose consists of a backbone that consumes a few labeled frames and many unlabeled videos. The results are transferred to the head that predicts keypoints for both labeled and unlabeled frames. When labels are available, a supervised loss is applied. For unlabeled videos, Lightning Pose applies a set of unsupervised losses.
Figure 2. The Lightning Pose architecture diagram combining supervised learning (top) with unsupervised learning (bottom)

Lightning Pose results in more accurate and precise tracking compared to standard supervised networks, across different species (mice, fish, and so on) and tasks (full-body locomotion, eye tracking, and so on). The traditional fully supervised approach requires extensive image labeling and struggles to generalize to new videos. It often produces noisy outputs that hinder downstream analyses.

Its new pose estimation networks generalize better to unseen videos and provide smoother and more reliable pose trajectories. The tool also enhances robustness and usability. ‌Through semi-supervised learning, Bayesian ensembling, and cloud-native open-source tools, models have lower pixel errors compared to DeepLabCut (with as few as 75 labeled frames). Lightning Pose estimation improves by 40, lowering pixel error and average keypoint pixel error across frames (DeepLabCut 14.60±4).

The clearest gains were seen in a mouse pupil tracking dataset from the International Brain Lab, where, even with over 3,000 labeled frames, the predictions were more accurate, and led to more reliable scientific analyses. 

Prediction comparison of mouse pupil tracking between DeepLabCut model and Lightning Pose, and Lightning Pose combined with Ensemble Kalman Smoothing
Figure 3. Visualization of a mouse pupil tracking 

Figure 3 shows tracking of the top, bottom, left, and right corners of a mouse’s pupil during a neuroscience experiment. On the left, the DeepLabCut model places a significant number of predictions in implausible parts of the image (red boxes).

The center shows Lightning Pose predictions, and the right combines Lightning Pose with the authors’ Ensemble Kalman smoothing approach. Both Lightning Pose approaches track the four points well and predict them in plausible areas.

Improved pupil tracking in turn exposes stronger correlations with neural activity. The authors performed a regression between neural activity and tracked pupil diameter across 66 neuroscience experiments, and found that the model outputs were decoded more reliably from brain activity. 

Pupil diameter value comparison. Blue values are those extracted by Lightning Pose tracking (+Ensemble Kalman Smoothing) compared to the prediction of a decoder trained on neural data (ridge regression).
Figure 4. Pupil diameter extracted from the model compared to ‌neural data

Figure 4 shows ‌pupil diameter decoding from brain recordings. The left side of Figure 4 graphs pupil diameter time series derived from a Lightning Pose model (LP+EKS; blue), and the predictions from applying linear regression to neural data (orange). 

The right side of Figure 4 shows R² goodness-of-fit values quantifying how well pupil diameter can be decoded from neural activity. As shown, Lightning Pose and the ensemble version produce significantly better results (DLC: R² = 0.27±0.02; LP: 0.33±0.02; LP+EKS: 0.35±0.02).
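For readers unfamiliar with this kind of decoding analysis, the sketch below fits a ridge regression from synthetic "neural activity" to a synthetic "pupil diameter" signal and reports R² on held-out samples. All data here is randomly generated, and the regularization strength, array sizes, and function names are assumptions; it does not reproduce the authors' analysis or their reported values.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge weights (no intercept; inputs are assumed roughly centered)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
neural = rng.normal(size=(500, 20))                         # synthetic neural features
true_w = rng.normal(size=20)                                # hidden linear relationship
pupil = neural @ true_w + rng.normal(scale=0.5, size=500)   # synthetic noisy target

train, held_out = slice(0, 400), slice(400, 500)
weights = ridge_fit(neural[train], pupil[train])
score = r_squared(pupil[held_out], neural[held_out] @ weights)
print(round(score, 3))
```

In this framing, a noisier tracked signal leaves more unexplained variance and lowers R², which is why cleaner pose estimates can show up as better decoding scores.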

The following video shows the robustness of the predictions for a mouse running on a treadmill.

Video 1: Example prediction of the mouse leg position (blue: Lightning Pose, red: supervised baseline model)

Improving the image-centric approach to convolutional architectures with DALI 

Applying convolutional networks to videos presents a unique challenge: these networks typically operate on individual images. Despite the growing computational power of deep learning accelerators, such as new GPU generations, Tensor Cores, and CUDA Graphs, this image-centric approach has remained largely unchanged. Current architectures require videos to be split into individual frames during pre-processing, where they are often saved to disk for later loading. These frames are then augmented and transformed on the CPU before being fed to the network waiting on the GPU.

Lightning Pose leverages DALI for GPU-accelerated decoding and processing of videos. This stands in contrast to most computer vision deep learning architectures, such as ResNets and Transformers, that typically operate only on single images. When applied sequentially to videos, these architectures (and the popular neuroscience tools DeepLabCut and SLEAP that are based on them) often form discontinuous predictions that violate the laws of physics. For example, an object might appear to jump from one corner of a room to another across two consecutive video frames.

DALI stack showing how it takes data from storage (image, video, or audio), uses GPU acceleration to decode and transform it, and makes it ready for training or inference in the deep learning framework.
Figure 5: DALI functional flow

DALI offers an efficient solution for Lightning Pose, by:

  1. reading the videos. 
  2. handling the decoding process (thanks to the NVIDIA Video Codec SDK).
  3. applying various augmentations (rotation, resize, brightness, and contrast adjustment, or even adding shot noise). 

Using DALI, Lightning Pose increases training throughput for video data and maintains the desired performance of the whole solution by fully using GPUs.

DALI can also be combined with additional data loaders working in parallel. The International Brain Laboratory, a consortium of 16 different neuroscience labs, is currently integrating DALI loaders to predict poses in 30,000 neuroscience experiments.

The benefit of open-source cooperation

The research is a great example of value created by the cooperation of the open-source community. DALI and Lightning Pose, both open-source projects, are highly responsive to community feedback and inquiries on GitHub. The collaboration between these projects began in mid-2021 when Dan Biderman, a community member, started evaluating DALI technology. Dan’s proactive engagement and the DALI team’s swift responses fostered a productive dialogue, which led to its integration into Lightning Pose.

Download and try DALI and Lightning Pose; you can reach out to contacts for both directly through their GitHub pages.

Read the study, Improved animal pose estimation through semi-supervised learning, Bayesian ensembling, and cloud-native open-source tools.

Categories
Misc

Reborn, Remastered and Remixed: ‘Portal: Prelude RTX’ Rejuvenates Legendary Gaming Mod

The “Portal: Prelude RTX” gaming mod — a remastering of the popular unofficial “Portal” prequel — comes with full ray tracing, DLSS 3 and RTX IO technology for cutting-edge, AI-powered graphics that rejuvenate the legendary mod for gamers, creators, developers and others to experience it anew.

Categories
Misc

New Video: Visualizing Census Data with RAPIDS cuDF and Plotly Dash

A US map showing different colors representing data visualization.

Gathering business insights can be a pain, especially when you’re dealing with countless data points. 

It’s no secret that GPUs can be a time-saver for data scientists. Rather than wait for a single query to run, GPUs help speed up the process and get you the insights you need quickly.

In this video, Allan Enemark, RAPIDS data visualization lead, uses a US Census dataset with over 300 million data points to demo running queries uninterrupted during the analysis process when using RAPIDS cuDF and Plotly Dash.

Key takeaways

  • Using cuDF over pandas for millions of data points results in significant performance benefits, with each query taking less than 1 second to run.
  • There are several advantages to using integrated accelerated visualization frameworks, such as faster analysis iterations.
  • Replacing CPU-based libraries with the pandas-like RAPIDS GPU-accelerated libraries (such as cuDF) helps data scientists swiftly go through the EDA process, as data sizes increase between 2 and 10 GB.
  • Visualization compute and render times are brought down to interactive sub-second speeds, unblocking the insight discovery process.

Video 1. Visualizing Census Data with RAPIDS cuDF and Plotly Dash

Summary

Swapping pandas for a RAPIDS library like cuDF can help speed up data analytics workflows, making the analysis process more effective and enjoyable. Additionally, the RAPIDS libraries make it easy to chart all kinds of data (such as time series, geospatial, and graphs) by using simple Python code.
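Because cuDF mirrors much of the pandas API, the swap can often be as small as changing an import. The sketch below uses cuDF when it is installed and falls back to pandas otherwise, so it runs on a machine without a GPU. The table is made-up sample data, not the census dataset from the video, and the `xdf` alias is just a naming choice for this example.

```python
try:
    import cudf as xdf      # GPU DataFrame library with a pandas-like API
except ImportError:
    import pandas as xdf    # CPU fallback so the sketch still runs without a GPU

# Tiny made-up table standing in for census records.
df = xdf.DataFrame({
    "state": ["CA", "CA", "NY", "NY", "TX"],
    "population": [100, 250, 80, 120, 300],
})

# The same groupby/aggregate call works with either library.
totals = df.groupby("state")["population"].sum().sort_index()

# cuDF objects expose to_pandas(); pandas objects are used as they are.
result = totals.to_pandas() if hasattr(totals, "to_pandas") else totals
print(result.to_dict())
```

Not every pandas feature has a cuDF equivalent, so a fallback import like this is best treated as a convenience for exploration rather than a guarantee of identical behavior.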

To learn more about speeding up your traditional GPU data science workflows, visit these resources: 

Categories
Misc

GPUs for ETL? Run Faster, Less Costly Workloads with NVIDIA RAPIDS Accelerator for Apache Spark and Databricks

Stylized image of a computer chip.

We were stuck. Really stuck. With a hard delivery deadline looming, our team needed to figure out how to process a complex extract-transform-load (ETL) job on trillions of point-of-sale transaction records in a few hours. The results of this job would feed a series of downstream machine learning (ML) models that would make critical retail assortment allocation decisions for a global retailer. Those models needed to be tested and validated on real transactional data.

However, up to that point, not a single ETL job ran to completion. Each test run took several days of processing time and all had to be terminated before completion.

Using NVIDIA RAPIDS Accelerator for Apache Spark, we observed significantly faster run times with additional cost savings when compared to a conventional approach using Spark on CPUs. Let us back up a bit.

Getting unstuck: ETL for a global retailer

The Artificial Intelligence & Analytics practice at Capgemini is a data science team that provides bespoke, platform- and language-agnostic solutions that span the data science continuum, from data engineering to data science to ML engineering and MLOps. We are a team with deep technical experience and knowledge, with 100+ North America-based data science consultants and a global team of 1,600+ data scientists.

For this project, we were tasked with providing an end-to-end solution for an international retailer with the following deliverables:

  • Creating the foundational ETL
  • Building a series of ML models
  • Creating an optimization engine
  • Designing a web-based user interface to visualize and interpret all data science and data engineering work

This work ultimately provided an optimal retail assortment allocation solution for each retail store. What made the project more complex was the state-space explosion that occurs after we begin to incorporate halo effects, such as interaction effects across departments. For example, if we allocated shelf space to fruit, what effect does that have on KPIs associated with allocating further shelf space to vegetables, and how can we jointly optimize those interaction effects?

But none of that ML, optimization, or front end would matter without the foundational ETL. So here we were, stuck. We were operating in an Azure cloud environment, using Databricks and Spark SQL, and even then, we were not observing the results we needed in the timeframe required by the downstream models.

Spurred by a sense of urgency, we explored potential variations that might enable us to significantly speed up our ETL process.

Accelerating ETL

Was the code inefficiently written? Did it maximize compute speed? Did it have to be refactored?

We rewrote the code several times and tested various cluster configurations, only to observe marginal gains. Cost limitations restricted our options to scale up, and none of the options we had provided the horsepower we needed to make significant gains. Remember cramming for final exams when time was just a little too tight, with that pit in your stomach getting deeper by the minute? We were quickly running out of options and time. We needed help. Now.

With the Databricks Runtime 9.1 LTS, Databricks released a native vectorized query engine named Photon. Photon is a C++ runtime environment that can run faster and be more configurable than its traditional Java runtime environment. Databricks support assisted us for several weeks in configuring a Photon runtime for our ETL application.

We also reached out to our partners at NVIDIA, who recently updated the RAPIDS suite of accelerated software libraries. Built on CUDA-X AI, RAPIDS executes data science and analytics pipelines entirely on GPUs with APIs that look and feel like the most popular open-source libraries. They include a plug-in that integrates with Spark’s query planner to speed up Spark jobs.
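For orientation, the plug-in is enabled through Spark configuration properties. The sketch below lists two such properties and renders them as `--conf` arguments. It is a minimal illustration only: the helper function is hypothetical, this is not the cluster configuration used in the project, and a real setup also needs the RAPIDS Accelerator jar on the classpath and GPU resources configured for the cluster.

```python
# Spark properties that turn on the RAPIDS Accelerator SQL plug-in.
RAPIDS_SPARK_CONF = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.rapids.sql.enabled": "true",
}

def to_spark_submit_args(conf):
    """Hypothetical helper: render a config dict as spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args

print(" ".join(to_spark_submit_args(RAPIDS_SPARK_CONF)))
```

Because the plug-in hooks into the query planner, the Spark SQL itself does not need to change when properties like these are applied.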

With support from both Databricks and NVIDIA over the course of the following month, we developed both solutions in parallel, getting previously untenable run times down to sub-two hours, an amazing jump in speed!

This was the target speed that we needed to hit for the downstream ML and optimization models. The pressure was off, and—owing solely to having solved the ETL problem with Photon a few days earlier than we did with RAPIDS—the Databricks Photon solution was put into production.

Having emerged from the haze of anxiety surrounding the tight deadlines around the ETL processes, we collected our thoughts and results and conducted a posthoc analysis. Which solution was the fastest to implement? Which solution provided the fastest ETL? The cheapest ETL? Which solution would we implement for similar future projects?

Experimental results

To evaluate our hypotheses, we created a set of experiments. We ran these experiments on Azure using two approaches:

  1. Databricks Photon would be run on third-generation Intel Xeon Platinum 8370C (Ice Lake) CPUs in a hyper-threaded configuration. This is what was ultimately put into production for the client.
  2. RAPIDS Accelerator for Apache Spark would be run on NVIDIA GPUs.

We would run the same ETL jobs on both, using two different data sets. The data sets were five and 10 columns of mixed numeric and unstructured (text) data, each with 20 million rows that measured 156 and 565 terabytes, respectively. The number of workers was maximized as permitted by infrastructure spending limits. Each individual experiment was run three times.

The experimental parameters are summarized in Table 1.

Worker type        Driver type        Number of workers   Platform   Number of columns   Data size
Standard_NC6s_v3   Standard_NC6s_v3   12                  RAPIDS     10                  565
Standard_E20s_v5   Standard_E16s_v5   6                   PHOTON     10                  565
Standard_NC6s_v3   Standard_NC6s_v3   16                  RAPIDS     10                  565
Standard_NC6s_v3   Standard_NC6s_v3   14                  RAPIDS     10                  565
Standard_NC6s_v3   Standard_NC6s_v3   14                  RAPIDS     5                   157
Standard_E20s_v5   Standard_E16s_v5   6                   PHOTON     5                   148
Table 1. ETL experimentation parameters

We first examined raw run times. The experimental results demonstrated that run times across all combinations of worker types, driver types, number of workers, platform, number of columns, and data set size were remarkably consistent and statistically and practically indistinguishable, at an average of 4 min 37 sec per run, with minimum and maximum run times of 4 min 28 sec and 4 min 54 sec, respectively.

We had a DBU/hour infrastructure spending limit and, as a result, a limit on the varying workers per cluster tested. In response, we developed a composite metric that enabled the most balanced evaluation of results, which we named adjusted DBUs per minute (ADBUs). DBUs are Databricks units, a proprietary Databricks unit of computational cost. ADBUs are computed as follows:

$$\text{\emph{Adjusted DBUs per Minute}} = \frac{\text{\emph{Runtime (mins)}}}{\text{\emph{Cluster DBUs Cost per Hour}}}$$

In the aggregate, we observed a 6% decrease in ADBUs by using RAPIDS Accelerator for Apache Spark when compared to running Spark on the Photon runtime, when accounting for the cloud platform cost. This meant we could achieve similar run times using RAPIDS at a lower cost.
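The metric is simple enough to compute by hand, but a short sketch makes the comparison explicit. It implements the ADBU formula exactly as stated above (run time in minutes divided by the cluster's hourly DBU cost) and adds a helper for the relative difference between two values. The numbers in the example are hypothetical placeholders, not measurements from these experiments.

```python
def adjusted_dbus(runtime_minutes, cluster_dbus_per_hour):
    """ADBUs as defined in the post: run time (mins) / cluster DBU cost per hour."""
    if cluster_dbus_per_hour <= 0:
        raise ValueError("cluster_dbus_per_hour must be positive")
    return runtime_minutes / cluster_dbus_per_hour

def percent_change(baseline, candidate):
    """Relative difference of candidate versus baseline, in percent."""
    return 100.0 * (candidate - baseline) / baseline

# Hypothetical inputs: a 4.5-minute run on a cluster billed at 30 DBUs per hour.
example = adjusted_dbus(4.5, 30.0)

# Hypothetical ADBU values for two platforms, used only to show the comparison.
change = percent_change(baseline=0.200, candidate=0.188)
print(example, round(change, 1))
```

Averaging the metric over the repeated runs of each configuration before comparing platforms would follow the same pattern.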

Considerations

Other considerations include the ease of implementation and the need for rewriting code, both of which were similar for RAPIDS and Photon. A first-time implementation of either is not for the faint of heart.

Having done it one time, we are quite certain we can replicate the required cluster configuration tasks in a matter of hours for each. Moreover, neither RAPIDS nor Photon required us to refactor the Spark SQL code, which was a huge time savings.

The limitations of this experiment were the small number of replications, the limited number of worker and driver types, and the number of worker combinations, all owing to infrastructure cost limitations.

What’s next?

In the end, combining Databricks with RAPIDS Accelerator for Apache Spark helped expand the breadth of our data engineering toolkit, and demonstrated a new and viable paradigm for ETL processing on GPUs.

For more information, see RAPIDS Accelerator for Apache Spark.

Categories
Offsites

Symbol tuning improves in-context learning in language models

A key feature of human intelligence is that humans can learn to perform new tasks by reasoning using only a few examples. Scaling up language models has unlocked a range of new applications and paradigms in machine learning, including the ability to perform challenging reasoning tasks via in-context learning. Language models, however, are still sensitive to the way that prompts are given, indicating that they are not reasoning in a robust manner. For instance, language models often require heavy prompt engineering or phrasing tasks as instructions, and they exhibit unexpected behaviors such as performance on tasks being unaffected even when shown incorrect labels.

In “Symbol tuning improves in-context learning in language models”, we propose a simple fine-tuning procedure that we call symbol tuning, which can improve in-context learning by emphasizing input–label mappings. We experiment with symbol tuning across Flan-PaLM models and observe benefits across various settings.

  • Symbol tuning boosts performance on unseen in-context learning tasks and is much more robust to underspecified prompts, such as those without instructions or without natural language labels.
  • Symbol-tuned models are much stronger at algorithmic reasoning tasks.
  • Finally, symbol-tuned models show large improvements in following flipped-labels presented in-context, meaning that they are more capable of using in-context information to override prior knowledge.
An overview of symbol tuning, where models are fine-tuned on tasks where natural language labels are replaced with arbitrary symbols. Symbol tuning relies on the intuition that when instruction and relevant labels are not available, models must use in-context examples to learn the task.

Motivation

Instruction tuning is a common fine-tuning method that has been shown to improve performance and allow models to better follow in-context examples. One shortcoming, however, is that models are not forced to learn to use the examples because the task is redundantly defined in the evaluation example via instructions and natural language labels. For example, on the left in the figure above, although the examples can help the model understand the task (sentiment analysis), they are not strictly necessary since the model could ignore the examples and just read the instruction that indicates what the task is.

In symbol tuning, the model is fine-tuned on examples where the instructions are removed and natural language labels are replaced with semantically-unrelated labels (e.g., “Foo,” “Bar,” etc.). In this setup, the task is unclear without looking at the in-context examples. For example, on the right in the figure above, multiple in-context examples would be needed to figure out the task. Because symbol tuning teaches the model to reason over the in-context examples, symbol-tuned models should have better performance on tasks that require reasoning between in-context examples and their labels.
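The transformation described above can be sketched in a few lines. This is a minimal illustration and not the authors' data pipeline: the `Input:`/`Output:` prompt template, the function name `build_symbol_tuned_prompt`, and the example reviews are all assumptions made for demonstration.

```python
import random

def build_symbol_tuned_prompt(examples, query, symbols, rng):
    """Format in-context examples with no instruction and arbitrary labels."""
    # Collect the natural language labels in a stable order.
    labels = sorted({label for _, label in examples})
    # Give each original label its own arbitrary symbol.
    mapping = dict(zip(labels, rng.sample(symbols, len(labels))))
    blocks = [f"Input: {text}\nOutput: {mapping[label]}" for text, label in examples]
    # The query is left unlabeled for the model to complete.
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks), mapping

# Hypothetical sentiment examples; the template is an assumed format.
examples = [
    ("This movie is great.", "positive"),
    ("Worst film I have ever seen.", "negative"),
    ("I loved every minute.", "positive"),
]
prompt, mapping = build_symbol_tuned_prompt(
    examples, "A dull, lifeless plot.", ["Foo", "Bar"], random.Random(0)
)
print(prompt)
```

Because the prompt contains neither an instruction nor the original label names, the only way to recover the task is to read the input–label pairs.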

Datasets and task types used for symbol tuning.

Symbol-tuning procedure

We selected 22 publicly available natural language processing (NLP) datasets for our symbol-tuning procedure. These tasks have been widely used in the past, and we only chose classification-type tasks since our method requires discrete labels. We then remap each label to a random label from a set of ~30K arbitrary labels selected from one of three categories: integers, character combinations, and words.

For our experiments, we symbol tune Flan-PaLM, the instruction-tuned variants of PaLM. We use three different sizes of Flan-PaLM models: Flan-PaLM-8B, Flan-PaLM-62B, and Flan-PaLM-540B. We also tested Flan-cont-PaLM-62B (Flan-PaLM-62B at 1.3T tokens instead of 780B tokens), which we abbreviate as 62B-c.

We use a set of ∼300K arbitrary symbols from three categories (integers, character combinations, and words). ∼30K symbols are used during tuning and the rest are held out for evaluation.
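As a rough sketch of how such a symbol pool might be built and split, the snippet below draws symbols from the three categories and reserves a fraction for tuning. The helper names, the tiny pool size, the five-letter character combinations, and the word list are hypothetical. Only the three categories and the roughly 10% tuning share (∼30K of ∼300K) come from the text.

```python
import random
import string

def build_symbol_pool(n_per_category, words, rng):
    """Create arbitrary symbols from integers, character combinations, and words."""
    integers = [str(n) for n in rng.sample(range(100_000), n_per_category)]
    combos = set()
    while len(combos) < n_per_category:
        combos.add("".join(rng.choices(string.ascii_lowercase, k=5)))
    chosen_words = rng.sample(words, min(n_per_category, len(words)))
    # dict.fromkeys removes duplicates while keeping order.
    return list(dict.fromkeys(integers + sorted(combos) + chosen_words))

def split_pool(pool, tuning_fraction, rng):
    """Shuffle the pool and split it into tuning and held-out symbols."""
    shuffled = pool[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * tuning_fraction)
    return shuffled[:cut], shuffled[cut:]

rng = random.Random(0)
# Hypothetical word list; a real pool would be far larger.
pool = build_symbol_pool(10, ["apple", "river", "cloud", "stone"], rng)
tuning_symbols, held_out_symbols = split_pool(pool, 0.1, rng)
```

Keeping the two sets disjoint means that labels seen at evaluation time were never seen during tuning.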

Experimental setup

We want to evaluate a model’s ability to perform unseen tasks, so we cannot evaluate on tasks used in symbol tuning (22 datasets) or used during instruction tuning (1.8K tasks). Hence, we choose 11 NLP datasets that were not used during fine-tuning.

In-context learning

In the symbol-tuning procedure, models must learn to reason with in-context examples in order to successfully perform tasks because prompts are modified to ensure that tasks cannot simply be learned from relevant labels or instructions. Symbol-tuned models should perform better in settings where tasks are unclear and require reasoning between in-context examples and their labels. To explore these settings, we define four in-context learning settings that vary the amount of reasoning required between inputs and labels in order to learn the task (based on the availability of instructions and relevant labels).

Depending on the availability of instructions and relevant natural language labels, models may need to do varying amounts of reasoning with in-context examples. When these features are not available, models must reason with the given in-context examples to successfully perform the task.
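The four settings can be enumerated as the combinations of two switches: whether an instruction is present and whether the labels are relevant natural language labels. The sketch below is illustrative only. The instruction text, the label names, the stand-in symbols, and the prompt template are assumptions and are not taken from the evaluation setup.

```python
from itertools import product

INSTRUCTION = "Classify the sentiment of the review."  # hypothetical instruction
RELEVANT = {"positive": "Positive", "negative": "Negative"}
ARBITRARY = {"positive": "Foo", "negative": "Bar"}  # stand-in arbitrary symbols

def render(examples, query, use_instruction, use_relevant_labels):
    """Render one prompt for a given combination of the two switches."""
    names = RELEVANT if use_relevant_labels else ARBITRARY
    parts = [INSTRUCTION] if use_instruction else []
    parts += [f"Input: {text}\nOutput: {names[label]}" for text, label in examples]
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

examples = [("Great acting.", "positive"), ("Terrible pacing.", "negative")]
# Keys are (instruction present, relevant labels present).
settings = {
    (instr, labels): render(examples, "A forgettable sequel.", instr, labels)
    for instr, labels in product([True, False], repeat=2)
}
```

The `(False, False)` entry is the hardest case: with no instruction and no relevant labels, the task can only be inferred from the examples.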

Symbol tuning improves performance across all settings for models 62B and larger, with small improvements in settings with relevant natural language labels (+0.8% to +4.2%) and substantial improvements in settings without relevant natural language labels (+5.5% to +15.5%). Strikingly, when relevant labels are unavailable, symbol-tuned Flan-PaLM-8B outperforms Flan-PaLM-62B, and symbol-tuned Flan-PaLM-62B outperforms Flan-PaLM-540B. This performance difference suggests that symbol tuning can allow much smaller models to perform as well as large models on these tasks (effectively saving ∼10x inference compute).

Large-enough symbol-tuned models are better at in-context learning than baselines, especially in settings where relevant labels are not available. Performance is shown as average model accuracy (%) across eleven tasks.

Algorithmic reasoning

We also experiment on algorithmic reasoning tasks from BIG-Bench. There are two main groups of tasks: 1) list functions, which require identifying a transformation function (e.g., remove the last element in a list) between input and output lists containing non-negative integers; and 2) simple Turing concepts, which require reasoning with binary strings to learn the concept that maps an input to an output (e.g., swapping 0s and 1s in a string).
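To make the two task groups concrete, here is a small sketch using the two example transformations mentioned above. The `->` prompt format and the helper `few_shot_prompt` are assumptions for illustration and do not reproduce the actual BIG-Bench task format.

```python
def remove_last(items):
    """List function example: drop the last element of a list."""
    return items[:-1]

def swap_bits(bits):
    """Simple Turing concept example: swap 0s and 1s in a binary string."""
    return bits.translate(str.maketrans("01", "10"))

def few_shot_prompt(fn, inputs, query):
    """Show input/output pairs for a hidden function, then an unanswered query."""
    shots = [f"{x} -> {fn(x)}" for x in inputs]
    shots.append(f"{query} ->")
    return "\n".join(shots)

list_prompt = few_shot_prompt(remove_last, [[3, 1, 4], [7, 7], [0, 2, 5, 9]], [8, 6, 2])
turing_prompt = few_shot_prompt(swap_bits, ["0110", "111", "1000"], "0101")
print(list_prompt)
print(turing_prompt)
```

In both cases the function is never named, so the model has to infer it from the input–output pairs alone.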

On the list function and simple Turing concept tasks, symbol tuning results in an average performance improvement of 18.2% and 15.3%, respectively. Additionally, Flan-cont-PaLM-62B with symbol tuning outperforms Flan-PaLM-540B on the list function tasks on average, which is equivalent to a ∼10x reduction in inference compute. These improvements suggest that symbol tuning strengthens the model’s ability to learn in-context for unseen task types, as symbol tuning did not include any algorithmic data.

Symbol-tuned models achieve higher performance on list function tasks and simple Turing concept tasks. (A–E): categories of list function tasks. (F): simple Turing concepts task.

Flipped labels

In the flipped-label experiment, labels of in-context and evaluation examples are flipped, meaning that prior knowledge and input-label mappings disagree (e.g., sentences containing positive sentiment labeled as “negative sentiment”), thereby allowing us to study whether models can override prior knowledge. Previous work has shown that while pre-trained models (without instruction tuning) can, to some extent, follow flipped labels presented in-context, instruction tuning degraded this ability.
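The scoring logic of this setup can be sketched as follows. Since the evaluation labels are flipped along with the in-context labels, a prediction counts as correct when it matches the flipped target. The binary sentiment labels and the two hypothetical models below are illustrative assumptions and not results from the experiments.

```python
FLIP = {"positive": "negative", "negative": "positive"}

def flipped_accuracy(predictions, true_labels):
    """Score predictions against flipped targets."""
    targets = [FLIP[label] for label in true_labels]
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

true_labels = ["positive", "negative", "positive", "negative"]

# A hypothetical model that follows the flipped in-context mapping.
follows_context = [FLIP[label] for label in true_labels]
# A hypothetical model that ignores the context and relies on prior knowledge.
follows_prior = list(true_labels)

print(flipped_accuracy(follows_context, true_labels))
print(flipped_accuracy(follows_prior, true_labels))
```

Under this scoring, higher accuracy means the model is overriding its prior knowledge in favor of the mapping shown in context.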

We see a similar trend across all model sizes: symbol-tuned models are much more capable of following flipped labels than instruction-tuned models. We found that after symbol tuning, Flan-PaLM-8B sees an average improvement across all datasets of 26.5%, Flan-PaLM-62B sees an improvement of 33.7%, and Flan-PaLM-540B sees an improvement of 34.0%. Additionally, symbol-tuned models achieve average performance similar to or better than that of pre-training–only models.

Symbol-tuned models are much better at following flipped labels presented in-context than instruction-tuned models are.

Conclusion

We presented symbol tuning, a new method of tuning models on tasks where natural language labels are remapped to arbitrary symbols. Symbol tuning is based on the intuition that when models cannot use instructions or relevant labels to determine a presented task, they must do so by instead learning from in-context examples. We tuned four language models using our symbol-tuning procedure, utilizing a tuning mixture of 22 datasets and approximately 30K arbitrary symbols as labels.

We first showed that symbol tuning improves performance on unseen in-context learning tasks, especially when prompts do not contain instructions or relevant labels. We also found that symbol-tuned models were much better at algorithmic reasoning tasks, despite the lack of numerical or algorithmic data in the symbol-tuning procedure. Finally, in an in-context learning setting where inputs have flipped labels, symbol tuning (for some datasets) restores the ability to follow flipped labels that was lost during instruction tuning.

Future work

Through symbol tuning, we aim to increase the degree to which models can examine and learn from input–label mappings during in-context learning. We hope that our results encourage further work towards improving language models’ ability to reason over symbols presented in-context.

Acknowledgements

The authors of this post are now part of Google DeepMind. This work was conducted by Jerry Wei, Le Hou, Andrew Lampinen, Xiangning Chen, Da Huang, Yi Tay, Xinyun Chen, Yifeng Lu, Denny Zhou, Tengyu Ma, and Quoc V. Le. We would like to thank our colleagues at Google Research and Google DeepMind for their advice and helpful discussions.