Categories
Offsites

Locked-image Tuning: Adding Language Understanding to Image Models

The ability to classify images into categories has been transformed by deep learning. It has also been significantly accelerated by transfer learning, whereby models are first pre-trained on large datasets, like ImageNet, to learn visual representations that are then transferred via fine-tuning to a new task with less data (e.g., classifying animals). Previous works such as BiT and ViT employed these methods to achieve state-of-the-art performance on a wide range of classification tasks, such as the VTAB benchmark.

However, fine-tuning has some downsides: though pre-training is done only once, fine-tuning must be repeated for every new task, and it requires task-specific labeled data. Multimodal contrastive learning is an alternative, recently popularized paradigm (e.g., CLIP, ALIGN) that overcomes these issues by instead learning how to match free-form text with images. These models can then solve new tasks by reformulating them as image-text matching problems, without extra data (referred to as “zero-shot” learning). Contrastive learning is flexible and easy to adapt to new tasks, but has its own limitations, namely the need for a lot of paired image-text data and weaker performance than transfer learning approaches.

With those limitations in mind, we propose “LiT: Zero-Shot Transfer with Locked-image Text Tuning”, to appear at CVPR 2022. LiT models learn to match text to an already pre-trained image encoder. This simple yet effective setup provides the best of both worlds: strong image representations from pre-training, plus flexible zero-shot transfer to new tasks via contrastive learning. LiT achieves state-of-the-art zero-shot classification accuracy, significantly closing the gap between the two styles of learning. We think the best way to understand is to try it yourself, so we’ve included a demo of LiT models at the end of this post.

Fine-tuning (left) requires task-specific data and training to adapt a pre-trained model to a new task. A LiT model (right) can be used with any task, without further data or adaptation.

Contrastive Learning on Image-Text Data
Contrastive learning models learn representations from “positive” and “negative” examples, such that representations for “positive” examples are similar to each other but different from “negative” examples.

Multimodal contrastive learning applies this to pairs of images and associated texts. An image encoder computes representations from images, and a text encoder does the same for texts. Each image representation is encouraged to be close to the representation of its associated text (“positive”), but distinct from the representation of other texts (“negatives”) in the data, and vice versa. This has typically been done with randomly initialized models (“from scratch”), meaning the encoders have to simultaneously learn representations and how to match them.
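
Conceptually, the objective is a symmetric cross-entropy over the pairwise image-text similarities in a batch. The following is a minimal sketch of that loss in PyTorch; the embeddings and temperature value are placeholders rather than the exact LiT or CLIP implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    # Normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Entry (i, j) compares image i with text j; matching pairs lie on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)

    loss_img_to_txt = F.cross_entropy(logits, targets)
    loss_txt_to_img = F.cross_entropy(logits.t(), targets)
    return (loss_img_to_txt + loss_txt_to_img) / 2
```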

Multimodal contrastive learning trains models to produce similar representations for closely matched images and texts.

This training can be done on noisy, loosely aligned pairs of image and text, which naturally occur on the web. This circumvents the need for manual labeling, and makes data scaling easy. Furthermore, the model learns much richer visual concepts — it’s not constrained to what’s defined in the classification label space. Instead of classifying an image as “coffee”, it can understand whether it’s “a small espresso in a white mug” or “a large latte in a red flask”.

Once trained, a model that aligns image and text can be used in many ways. For zero-shot classification, we compare image representations to text representations of the class names. For example, a “wombat vs jaguar” classifier can be built by computing the representations of the texts “jaguar” and “wombat”, and classifying an image as a jaguar if its representation better matches the former. This approach scales to thousands of classes and makes it very easy to solve classification tasks without the extra data necessary for fine-tuning. Another application of contrastive models is image search (a.k.a. image-text retrieval), by finding the image whose representation best matches that of a given text, or vice versa.
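
As a rough sketch of how such a zero-shot classifier can be assembled, the snippet below compares an image embedding against the embeddings of the class names; `encode_image` and `encode_text` are hypothetical stand-ins for whatever trained encoders are available, not a specific published API.

```python
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text):
    """Return the class name whose text embedding best matches the image embedding."""
    text_emb = F.normalize(encode_text(class_names), dim=-1)   # (num_classes, dim)
    image_emb = F.normalize(encode_image(image), dim=-1)       # (dim,)

    scores = text_emb @ image_emb          # cosine similarity per class name
    return class_names[scores.argmax().item()]

# Usage with hypothetical encoders:
# zero_shot_classify(img, ["wombat", "jaguar"], encode_image, encode_text)
```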

The Best of Both Worlds with Locked-image Tuning
As mentioned earlier, transfer learning achieves state-of-the-art accuracy, but requires per-task labels, datasets, and training. On the other hand, contrastive models are flexible, scalable, and easily adaptable to new tasks, but fall short in performance. To compare, at the time of writing, the state of the art on ImageNet classification using transfer learning is 90.94%, but the best contrastive zero-shot models achieve 76.4%.

LiT tuning bridges this gap: we contrastively train a text model to compute representations well aligned with the powerful ones available from a pre-trained image encoder. Importantly, for this to work well, the image encoder should be “locked”, that is: it should not be updated during training. This may be unintuitive since one usually expects the additional information from further training to increase performance, but we find that locking the image encoder consistently leads to better results.
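
In code, “locking” simply means the image tower receives no gradient updates while the text tower is trained. The sketch below illustrates this under the assumption of generic PyTorch-style encoder modules and data loader; it is not the authors' actual training code, and in practice the image embeddings can even be precomputed once and cached.

```python
import torch

def lit_tune(image_encoder, text_encoder, loader, contrastive_loss, lr=1e-4):
    """Train only the text encoder against a locked (frozen) pre-trained image encoder."""
    image_encoder.requires_grad_(False)   # lock: the image tower receives no updates
    image_encoder.eval()
    optimizer = torch.optim.Adam(text_encoder.parameters(), lr=lr)

    for images, texts in loader:
        with torch.no_grad():              # features come from the frozen image tower
            image_emb = image_encoder(images)
        text_emb = text_encoder(texts)     # only the text tower is learned

        loss = contrastive_loss(image_emb, text_emb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```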

LiT-tuning contrastively trains a text encoder to match a pre-trained image encoder. The text encoder learns to compute representations that align to those from the image encoder.

This can be considered an alternative to the classic fine-tuning stage, where the image encoder is separately adapted to every new classification task; instead we have one stage of LiT-tuning, after which the model can classify any data. LiT-tuned models achieve 84.5% zero-shot accuracy on ImageNet classification, showing significant improvements over previous methods that train models from scratch, and halving the performance gap between fine-tuning and contrastive learning.

Left: LiT-tuning significantly closes the gap between the best contrastive models and the best models fine-tuned with labels. Right: Using a pre-trained image encoder is always helpful, but locking it is surprisingly a key part of the recipe to success; unlocked image models (dashed) yield significantly worse performance.

An impressive benefit of contrastive models is increased robustness — they retain high accuracy on datasets that typically fool fine-tuned models, such as ObjectNet and ImageNet-C. Similarly, LiT-tuned models have high performance across various challenging versions of ImageNet, for example achieving a state-of-the-art 81.1% accuracy on ObjectNet.

LiT-tuning has other advantages. While prior contrastive works require large amounts of data and train for a very long time, the LiT approach is much less data hungry. LiT models trained on 24M publicly available image-text pairs rival the zero-shot classification performance of prior models trained on 400M image-text pairs of private data. The locked image encoder also leads to faster training with a smaller memory footprint. On larger datasets, image representations can be pre-computed; not running the image model during training further improves efficiency and also unlocks much larger batch sizes, which increases the number of “negatives” the model sees and is key to high-performance contrastive learning. The method works well with varied forms of image pre-training (e.g., including self-supervised learning), and with many publicly available image models. We hope that these benefits make LiT a great testbed for researchers.

Conclusion
We present Locked-image Tuning (LiT), which contrastively trains a text encoder to match image representations from a powerful pre-trained image encoder. This simple method is data and compute efficient, and substantially improves zero-shot classification performance compared to existing contrastive learning approaches.

Want to try it yourself?

A preview of the demo: use it to match free-form text descriptions to images and build your own zero-shot classifier!

We have prepared a small interactive demo to try some LiT-tuned models. We also provide a Colab with more advanced use cases and larger models, which are a great way to get started.

Acknowledgments
We would like to thank Xiaohua Zhai, Xiao Wang, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer who have co-authored the LiT paper and been involved in all aspects of its development, as well as the Brain team in Zürich. We also would like to thank Tom Small for creating the animations used in this blogpost.

Categories
Misc

Metropolis Spotlight: Bluecity Combines Vision AI and Lidar for Real-Time Road Safety and Traffic Congestion

Cities now have access to real-time, multimodal traffic data that improves road safety and reduces traffic congestion.

Bluecity, an NVIDIA Metropolis partner, recently launched a new traffic management solution for safer roads and shorter commutes. The technology combines vision AI and lidar technology to better understand round-the-clock traffic data, providing information that could help city planning departments identify problem intersections, reduce congestion, plan smarter, and lower emissions. 

Road safety and congestion are a priority for city and transportation planners, but sparse data has limited their ability to address traffic issues, especially in areas with ever-expanding populations. While video cameras are used to capture information, like the number of cars at a particular intersection, poor lighting and bad weather conditions can interfere with capturing this data accurately. Studies have also shown that accidents are most likely to happen during those times when visibility is low.

New technologies that can overcome these obstacles and collect multimodal data about drivers, vehicle speed, and trajectories can help make roads safer, especially when the data can be collected in real time.

Bluecity is solving this problem by combining vision AI and lidar technology to understand and evaluate traffic data. 

Bluecity developed IndiGO, its computer vision and traffic data platform, to provide real-time traffic data and analytics.
Figure 1. Bluecity analytics platform

Similar to radar, lidar sensors emit pulsed light waves into the environment and sense objects when the pulses bounce off them. Lidar uses lasers with a shorter wavelength than radar and, as a result, can detect smaller objects and offer precise measurement data. Lidar can do this even in poor lighting or weather conditions, and it captures data anonymously. Each lidar sensor provides 360-degree coverage within a radius of up to 120 meters (400 feet).

The Bluecity system employs the powerful capabilities of the NVIDIA Jetson edge AI platform—which provides GPU-accelerated computing in a compact and energy-efficient module—along with NVIDIA TensorRT to accelerate its application’s inference throughput at the edge. The edge computing system runs proprietary 3D perception software and powers the traffic management solution, processing up to 50 lidar frames per second in real time to detect all road users.

The platform provides information on which turning directions and intersections are the riskiest, along with near misses, time-to-collision, and the speeds of the vehicles involved. It can also classify road users, giving important insight not only into driver behavior, but also into the behavior of cyclists and pedestrians.

The ability to collect data regardless of lighting or weather conditions helps city planners make data-driven decisions about things like road design and traffic-light timing. The startup’s AI component turns raw data into valuable information that guides practical and timely decision-making. AI powers better visualization, while conflict analyses help identify dangerous intersections before accidents occur. For instance, in Repentigny, Quebec, Bluecity installed its solution to provide multimodal traffic data so an engineering firm could better understand mobility in a region where it is updating a bridge.

Using Bluecity’s easy-to-understand dashboards, subscribers can view, select, filter, and download the data to improve their traffic planning.

Bluecity solutions are deployed in several cities across Canada, the U.S., and Europe, including Irvine, Austin, Boca Raton, Trois-Rivières, and Helsinki. The startup’s vision is to provide better multimodal data that makes city intersections safer, improves road safety, and reduces carbon emissions for smart cities.

Categories
Misc

GFN Thursday Gears Up With More Electronic Arts Games on GeForce NOW

This GFN Thursday delivers more gr-EA-t games as two new titles from Electronic Arts join the GeForce NOW library. Gamers can now enjoy Need for Speed HEAT and Plants vs. Zombies Garden Warfare 2 streaming from GeForce NOW to underpowered PCs, Macs, Chromebooks, SHIELD TV and mobile devices. It’s all part of the eight total…


Categories
Misc

Creating a Two-Hand Pose Classifier

I am trying to create a hand pose classifier on TensorFlow.js using JS. I am familiar with the libraries that make hand pose classification possible, but they all support only one hand. I want to train a neural network using landmarks from both hands (or classify each hand separately) and ideally save and use that dataset on a web-based project. Can anyone point me in the right direction/tutorials to implement this idea?

submitted by /u/SumusSolis

Categories
Misc

TensorFlow Object Detection API

I am running Faster R-CNN on a custom dataset of 250 images for the object detection task. I downloaded the TFRecords from Roboflow and started training a Faster R-CNN with Inception ResNet v2 640×640. However, after 200 iterations, the loss on some tasks becomes 0:

https://preview.redd.it/5radkzb1fgt81.png?width=889&format=png&auto=webp&s=64a86d2903fab990d942517a31fbd51ff31ea4c1

What could be the cause of this problem?

submitted by /u/giakou4

Categories
Misc

Fast Fine-Tuning of AI Transformers Using RAPIDS Machine Learning

Find out how RAPIDS and the cuML support vector machine can achieve faster training time and maximum accuracy when fine-tuning transformers.

In recent years, transformers have emerged as a powerful deep neural network architecture that has been proven to beat the state of the art in many application domains, such as natural language processing (NLP) and computer vision.

This post uncovers how you can achieve maximum accuracy with the fastest training time possible when fine-tuning transformers. We demonstrate how the cuML support vector machine (SVM) algorithm, from the RAPIDS Machine Learning library, can dramatically accelerate this process. cuML SVM on GPU is up to 500x faster than the CPU-based implementation. This approach uses an SVM head instead of the conventional multi-layer perceptron (MLP) head, making it possible to fine-tune with precision and ease.

What is fine-tuning and why do you need it?

A transformer is a deep learning model consisting of many multi-head, self-attention, and feedforward fully connected layers. It is mainly used for sequence-to-sequence tasks, including NLP tasks, such as machine translation and question-answering, and computer vision tasks, such as object detection and more.

Training a transformer from scratch is a compute-intensive process, often taking days or even weeks. In practice, fine-tuning is the most efficient way of applying pretrained transformers to new tasks, thereby reducing training time.

MLP head for fine-tuning transformers

As shown in Figure 1, transformers have two distinct components:

  • The backbone, which contains multiple blocks of self-attention and feedforward layers.
  • The head, where final predictions take place for either classification or regression tasks.

During fine-tuning, the backbone network of the transformer is frozen while only the lightweight head module is trained for the new task. The most common choice for the head module is a multi-layer perceptron (MLP) for both classification and regression tasks.
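
As a rough illustration of this conventional setup, the sketch below freezes a pretrained backbone and attaches a small trainable MLP head; the module and dimension names are hypothetical, not a specific library API.

```python
import torch
import torch.nn as nn

class FrozenBackboneWithMLPHead(nn.Module):
    """A pretrained backbone kept frozen, with a small trainable MLP head on top."""

    def __init__(self, backbone, embed_dim, hidden_dim=256, num_outputs=1):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False        # the backbone is not updated during fine-tuning
        self.head = nn.Sequential(         # only these parameters are trained
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_dim, num_outputs),
        )

    def forward(self, x):
        with torch.no_grad():
            emb = self.backbone(x)         # embedding vector from the frozen backbone
        return self.head(emb)
```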

During fine-tuning of transformers, the pretrained backbone is frozen and only the head module is trained for the new task. In this post, we show that NVIDIA cuML SVM is both faster and more accurate than MLP as the head module.
Figure 1. Using cuML SVM as the head speeds up the fine-tuning of transformers

As it turns out, implementing and tuning an MLP can be much harder than it looks. Why is that?

  • There are multiple hyperparameters to tune: number of layers, dropout, learning rate, regularization, types of optimizers, and more. Which hyperparameters to tune depends on the problem that you are trying to solve. For example, standard techniques such as dropout and batch norm can degrade performance on regression problems.
  • Additional efforts must be made to prevent overfitting. The transformer’s output is often a long embedding vector, with a length ranging from hundreds to thousands. Overfitting is common when the training data size is not large enough.
  • Performance in terms of execution time is typically not optimized. Users must write boilerplate code for data processing and training. Batch generation and data movement from CPU to GPU can also become a bottleneck for performance.

Advantages of SVM heads for fine-tuning transformers

Support vector machines (SVMs) are one of the most popular supervised learning methods and most potent when there are meaningful, predictive features available. This is especially true with high-dimensional data due to SVM’s robustness against overfitting.

Yet, data scientists are sometimes hesitant to try SVMs for several reasons: 

  • They require handcrafted feature engineering, which can be difficult to implement.
  • SVMs are traditionally slow.

RAPIDS cuML revives interest in revisiting this classic model by providing a speedup of up to 500x on GPU. With RAPIDS cuML, SVM is gaining popularity again in the data science community.

For example, RAPIDS cuML SVM notebooks have been used frequently in several Kaggle competitions.

As transformers have already learned to extract meaningful representations in the form of long embedding vectors, cuML SVM is an ideal candidate for the head classifier or regressor.

When compared to the MLP head, cuML SVM has the following advantages:

  • Easy to tune. In practice, we have found in most instances that tuning just one parameter, C, is enough for SVM.
  • Speed. cuML moves all data to the GPU at once, before processing on the GPU.
  • Diversity. The predictions of SVM are statistically different from the MLP predictions, rendering it useful in ensembles.
  • Simple API. cuML SVM API provides scikit-learn style fit and predict functions.
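
To make the last point concrete, here is a minimal sketch of using cuML's SVM as a regression head on embeddings produced by a frozen backbone. The embedding arrays are random stand-ins, and the choice of C is only illustrative.

```python
import numpy as np
from cuml.svm import SVR   # RAPIDS cuML, GPU-accelerated

# Random stand-ins for embeddings produced by a frozen transformer backbone.
train_emb = np.random.rand(8000, 768).astype(np.float32)
train_target = (np.random.rand(8000) * 100).astype(np.float32)   # e.g., pawpularity in [0, 100]
test_emb = np.random.rand(2000, 768).astype(np.float32)

# In practice, C is often the only hyperparameter that needs tuning.
svm_head = SVR(C=10.0, kernel="rbf")
svm_head.fit(train_emb, train_target)
predictions = svm_head.predict(test_emb)
```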

Case study: PetFinder.my Pawpularity Contest

This proposed fine-tuning methodology with SVM heads applies to both NLP and computer vision tasks. To demonstrate it, we looked at the PetFinder.my Pawpularity Contest, a Kaggle data science competition to predict the popularity of shelter pets based on their photos.

The dataset used for this project consists of 10,000 hand-labeled images, each with a target pawpularity that we aimed to predict. With pawpularity values ranging from 0 to 100, we used regression to solve this problem.

As there are only 10,000 labeled images, it is impractical to train a deep neural network to achieve high accuracy from scratch. Instead, we approached this by using a pretrained swin transformer backbone and then fine-tuning it with the labeled pet images.

Three steps to fine-tune transformers with RAPIDS cuML SVM. Step 1: train both the backbone and the MLP head with BCE loss. Step 2: freeze the backbone and train the RAPIDS SVM head. Step 3: infer with both heads and average their predictions to achieve the best accuracy.
Figure 2. How to use a cuML SVM head in fine-tuning.

As shown in Figure 2, our approach requires three steps:

  1. First, a regression head using an MLP is added to the backbone Swin Transformer, and the backbone and head are fine-tuned together. One interesting finding is that the binary cross-entropy (BCE) loss outperforms the more common mean squared error (MSE) loss due to the distribution of the target.
  2. Next, the backbone is frozen, and the MLP head is replaced with the cuML SVM head. The SVM head is then trained with the regular MSE loss.
  3. To achieve the best prediction accuracy, we averaged the predictions of the MLP head and the SVM head. The evaluation metric, root mean squared error (RMSE), improves from 18 to 17.8, which is significant for this dataset.

It is worth noting that steps 1 and 3 are optional and have been implemented here to optimize the model’s score for this competition. Step 2 alone is the most common scenario for fine-tuning. For this reason, we measured the run time at step 2 and compared three options: cuML SVM (GPU), sklearn SVM (CPU), and PyTorch MLP (GPU). The results are shown in Figure 3.

Runtime comparison between cuML SVM on GPU, sklearn SVM on CPU and MLP on GPU. cuML SVM is up to 15x faster and 28x faster than its CPU counterpart for fit and inference, respectively.
Figure 3. Runtime comparison

The runtimes are normalized to sklearn SVM; cuML SVM achieved a 15x speedup for training and a 28.18x speedup for inference. It is noteworthy that cuML SVM is also faster than the PyTorch MLP due to high GPU utilization. The notebook can be found on Kaggle.

Key takeaways on transformer fine-tuning

Transformers are revolutionary deep learning models, but training them is time-consuming. Fast fine-tuning of transformers on a GPU can benefit many applications by providing significant speedup. RAPIDS cuML SVM can also be used as a drop-in replacement of the classic MLP head, as it is both faster and more accurate. 

GPU acceleration infuses new energy into classic ML models like SVM. With RAPIDS, it is possible to combine the best of the two worlds: classic machine learning (ML) models and cutting-edge deep learning (DL) models. In RAPIDS cuML, you will find more lightning-fast and easy-to-use models.

Postscript

By the time of writing and editing this post, the PetFinder.my Pawpularity Contest had concluded. NVIDIA KGMON Gilberto Titericz won first place by using RAPIDS SVM. His winning solution was to concatenate embeddings from transformers and other deep CNNs, and use RAPIDS SVM as the regression head. For more information, see his winning solution write-up.

Categories
Offsites

Simple and Effective Zero-Shot Task-Oriented Dialogue

Modern conversational agents need to integrate with an ever-increasing number of services to perform a wide variety of tasks, from booking flights and finding restaurants, to playing music and telling jokes. Adding this functionality can be difficult — for each new task, one needs to collect new data and retrain the models that power the conversational agent. This is because most task-oriented dialogue (TOD) models are trained on a single task-specific ontology. An ontology is generally represented as a list of possible user intents (e.g., if the user wants to book a flight, if the user wants to play some music, etc.) and possible parameter slots to extract from the conversation (e.g., the date of the flight, the name of a song, and so on). A rigid ontology can be limiting, preventing the model from generalizing to new tasks or domains. For instance, a TOD model trained on a certain ontology only knows the intents in that ontology, and lacks the ability to generalize its knowledge to unseen intents. This is true even for new ontologies that overlap with ones already known to the agent — for example, if an agent already knows how to book train tickets, adding the ability to book airline tickets would require training on completely new data. Ideally, the agent should be able to leverage its existing knowledge from one ontology, and apply it to new ones.

New benchmarks, such as the Schema Guided Dialogue (SGD) dataset, have been designed to evaluate the ability to generalize to unseen tasks, by distilling each ontology into a schema of slots and intents. In the SGD setting, TOD models are trained on multiple schemas, and evaluated on how well they generalize to unseen ones — instead of how well they overfit to a single ontology. However, recent work shows the top models still have room for improvement.

To address this problem, we introduce two different sequence-to-sequence approaches toward zero-shot transfer for dialogue modeling, presented in the papers “Description-Driven Task-Oriented Dialogue” and “Show, Don’t Tell: Demonstrations Outperform Descriptions for Schema-Guided Task-Oriented Dialogue”. Both models condition on additional contextual information, either slot and intent descriptions, or single demonstrative examples. Results obtained on multiple dialogue state tracking benchmarks show that by doing away with the fixed schemas and ontologies, these new approaches lead to state-of-the-art results on the dialogue state tracking task with more efficient models. The source code for the described approaches can be found here.

Background: Dialogue State Tracking
To address the challenge of zero-shot transfer for dialogue models, we focus on the problem of Dialogue State Tracking (DST). DST is a fundamental problem for conversational agents, in which a model predicts the belief state of a conversation, i.e., the agent’s understanding of the user’s indicated preferences. The belief state is typically modeled as an assignment of values to slots for which the user has indicated a preference in the conversation. An example is shown below.

An example conversation and its ground truth slots and intents for dialogue state tracking. Here, the active user intent is “Book a train”, and pertinent information for booking this train is recorded in the slot values.

Description-Driven Task-Oriented Dialogue
In our first paper, we introduce Description-Driven Dialogue State Tracking (D3ST), a DST model that leverages slot and intent descriptions when making predictions about the belief state. D3ST is built on top of the T5 sequence-to-sequence language model, which was shown in previous work to be pretrained effectively for DST problems.

D3ST prompts the input sequence with slot and intent descriptions, allowing the T5 model to attend to both this contextual information and the conversation. Its ability to generalize comes from the formulation of these descriptions. Instead of using a name for each slot, we assign a random index for every slot. For categorical slots (i.e., slots that only take values from a small, predefined set), possible values are also arbitrarily enumerated and then listed. The same is done with intents, and together these descriptions form the schema representation to be included in the input string. This is concatenated with the conversation text and fed into the T5 model. The target output is the belief state and user intent, again identified by their assigned indices. An example is shown below.

An example of the D3ST input and output format. The red text contains slot descriptions, while the blue text contains intent descriptions. The yellow text contains the conversation utterances.

This forces the model to predict conversation contexts using a slot’s index, and not that specific slot. By randomizing the index we assign to each slot between different examples, we prevent the model from learning specific schema information. The slot with index 0 could be the “Train Departure” slot in one example, and the “Train Destination” in another — as such, the model is encouraged to use the slot description given in index 0 to find the correct value, and discouraged from overfitting to a specific schema. With this setup, a model that sees enough different tasks or domains will learn to generalize the action of belief state tracking and intent prediction.
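
A rough sketch of how such an index-randomized input string might be assembled is shown below; the exact separator tokens and formatting used in the paper may differ, so treat this purely as an illustration of the idea.

```python
import random

def build_d3st_input(slot_descriptions, intent_descriptions, conversation):
    """Assemble a D3ST-style prompt: indexed slot/intent descriptions plus the dialogue text."""
    slots = list(slot_descriptions)
    intents = list(intent_descriptions)
    # Randomize order so a given index is not tied to the same slot across training examples.
    random.shuffle(slots)
    random.shuffle(intents)

    slot_part = " ".join(f"{i}: {desc}" for i, desc in enumerate(slots))
    intent_part = " ".join(f"i{i}: {desc}" for i, desc in enumerate(intents))
    return f"{slot_part} {intent_part} {conversation}"

# Illustrative example (hypothetical descriptions and utterance):
prompt = build_d3st_input(
    ["departure city of the train", "destination city of the train"],
    ["book a train ticket", "find a restaurant"],
    "[user] I need a train from Cambridge to London on Friday.",
)
```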

Show Don’t Tell
In our subsequent paper, “Show, Don’t Tell: Demonstrations Outperform Descriptions for Schema-Guided Task-Oriented Dialogue”, we employ a single annotated dialogue example that demonstrates the possible slots and values in a conversation, instead of relying on slot descriptions. In this sense, we “show” the semantics of the schema rather than “tell” the model through descriptions — hence the name “Show Don’t Tell” (SDT). SDT is also built on T5, and improves zero-shot performance beyond D3ST.

An example of the SDT input and output format. The text in red contains the demonstrative example, while the text in blue contains its ground truth belief state. The actual conversation for the model to predict is in yellow. While the D3ST prompt relies entirely on slot descriptions, the SDT prompt contains a concise example dialogue followed by the expected dialogue state annotations, resulting in more direct supervision.

The rationale for SDT’s single example demonstration is simple: there can still be ambiguities that are not fully captured in a slot or intent description, and require a concrete example to demonstrate. Moreover, from a developer’s standpoint, creating short dialogue examples to describe a schema can often be easier than writing descriptions that fully capture the meaning behind each slot and intent.

Benchmark Results
We evaluate both D3ST and SDT on a number of benchmarks, most notably the SGD dataset, which tests zero-shot generalization to unseen schemas in its test set. We evaluate our state tracking models on joint goal accuracy (JGA), the fraction of dialogue turns for which the model predicts an exactly correct belief state.
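
For reference, JGA can be computed with a few lines of Python, assuming each turn's belief state is represented as a dictionary of slot-value pairs (a simplifying assumption for illustration):

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted belief state exactly matches the gold belief state."""
    correct = sum(pred == gold for pred, gold in zip(predicted_states, gold_states))
    return correct / len(gold_states)

# e.g., joint_goal_accuracy([{"train-day": "friday"}], [{"train-day": "friday"}]) == 1.0
```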

Both of our models either match or outperform existing state-of-the-art baselines (T5DST and paDST) at comparable model sizes, as shown below. In general, SDT performs slightly better than D3ST. Note that our models can be trained on different sizes of the underlying T5 language model. In addition, while the baseline models can only make predictions for one slot per forward pass, both our models can decode the entire dialogue state in a single forward pass — a much more efficient method in both training and inference.

Joint Goal Accuracy on the SGD dataset plotted against model size for existing baselines and our proposed models D3ST and SDT. Note that paDST* includes additional data augmentation.

Additional metrics are reported in both papers. D3ST exhibits state-of-the-art quality on the MultiWOZ dataset, with 75.9% JGA on MultiWOZ 2.4. Both D3ST and SDT show state-of-the-art performance in the MultiWOZ cross-domain leave-one-out setting. In addition, both D3ST and SDT were evaluated using the SGD-X dataset, and demonstrated strong robustness to linguistic variations in schema. These benchmarks all indicate that D3ST and SDT are state-of-the-art TOD models, with the ability to generalize to unseen tasks and domains.

Zero-Shot Capability
D3ST and SDT sometimes demonstrate a surprising ability to generalize to unseen tasks, and we saw many interesting examples when trying completely new dialogues with the model. We’ve included one such example below:

A D3ST model trained on the SGD dataset makes predictions (right) for an unseen meta conversation (left) about creating this blog post. The model predicts a completely correct belief state, even though it is not fine-tuned on anything related to blogs, authors or NLP.

Future Work
These papers demonstrate the feasibility of a zero-shot TOD system that can generalize to unseen tasks or domains. However, we’ve limited ourselves to the DST problem for now — we plan to extend this research to enable zero-shot dialogue policy modeling, allowing TOD systems to take actions following arbitrary instructions. In addition, the current input format can often lead to long input sequences, which can be slow for inference — we’re exploring new and more efficient methods to encode schema information.

Acknowledgements
This post reflects the combined work of Jeffrey Zhao, Raghav Gupta, Harrison Lee, Mingqiu Wang, Dian Yu, Yuan Cao, and Abhinav Rastogi. We’d like to thank Yonghui Wu and Izhak Shafran for their continued advice and guidance.

Categories
Misc

MLCommons’ David Kanter, NVIDIA’s David Galvez on Improving AI with Publicly Accessible Datasets

In deep learning and machine learning, having a large enough dataset is key to training a system and getting it to produce results. So what does an ML researcher do when there just isn’t enough publicly accessible data? Enter the MLCommons Association, a global engineering consortium with the aim of making ML better for everyone.


Categories
Offsites

Lidar-Camera Deep Fusion for Multi-Modal 3D Detection

LiDAR and visual cameras are two types of complementary sensors used for 3D object detection in autonomous vehicles and robots. LiDAR, which is a remote sensing technique that uses light in the form of a pulsed laser to measure ranges, provides low-resolution shape and depth information, while cameras provide high-resolution shape and texture information. While the features captured by LiDAR and cameras should be merged together to provide optimal 3D object detection, it turns out that most state-of-the-art 3D object detectors use LiDAR as the only input. The main reason is that to develop robust 3D object detection models, most methods need to augment and transform the data from both modalities, making the accurate alignment of the features challenging.

Existing algorithms for fusing LiDAR and camera outputs, such as PointPainting, PointAugmenting, EPNet, 4D-Net and ContinuousFusion, generally follow two approaches — input-level fusion where the features are fused at an early stage, decorating points in the LiDAR point cloud with the corresponding camera features, or mid-level fusion where features are extracted from both sensors and then combined. Despite realizing the importance of effective alignment, these methods struggle to efficiently process the common scenario where features are enhanced and aggregated before fusion. This indicates that effectively fusing the signals from both sensors might not be straightforward and remains challenging.

In our CVPR 2022 paper, “DeepFusion: LiDAR-Camera Deep Fusion for Multi-Modal 3D Object Detection”, we introduce a fully end-to-end multi-modal 3D detection framework called DeepFusion that applies a simple yet effective deep-level feature fusion strategy to unify the signals from the two sensing modalities. Unlike conventional approaches that decorate raw LiDAR point clouds with manually selected camera features, our method fuses the deep camera and deep LiDAR features in an end-to-end framework. We begin by describing two novel techniques, InverseAug and LearnableAlign, that improve the quality of feature alignment and are applied to the development of DeepFusion. We then demonstrate state-of-the-art performance by DeepFusion on the Waymo Open Dataset, one of the largest datasets for automotive 3D object detection.

InverseAug: Accurate Alignment under Geometric Augmentation
To achieve good performance on existing 3D object detection benchmarks for autonomous cars, most methods require strong data augmentation during training to avoid overfitting. However, the necessity of data augmentation poses a non-trivial challenge in the DeepFusion pipeline. Specifically, the data from the two modalities use different augmentation strategies, e.g., rotating along the z-axis for 3D point clouds combined with random flipping for 2D camera images, often resulting in alignment that is inaccurate. Then the augmented LiDAR data has to go through a voxelization step that converts the point clouds into volume data stored in a three dimensional array of voxels. The voxelized features are quite different compared to the raw data, making the alignment even more difficult. To address the alignment issue caused by geometry-related data augmentation, we introduce Inverse Augmentation (InverseAug), a technique used to reverse the augmentation before fusion during the model’s training phase.

In the example below, we demonstrate the difficulties in aligning the augmented LiDAR data with the camera data. In this case, the LiDAR point cloud is augmented by rotation with the result that a given 3D key point, which could be any 3D coordinate, such as a LiDAR data point, cannot be easily aligned in 2D space simply through use of the original LiDAR and camera parameters. To make the localization feasible, InverseAug first stores the augmentation parameters before applying the geometry-related data augmentation. At the fusion stage, it reverses all data augmentation to get the original coordinate for the 3D key point, and then finds its corresponding 2D coordinates in the camera space.
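
A toy sketch of this bookkeeping for a single z-axis rotation is shown below; a real pipeline stores and inverts every geometric augmentation that was applied, and the camera projection function here is a hypothetical placeholder.

```python
import numpy as np

def rotate_z(points, angle_rad):
    """Rotate an (N, 3) array of points around the z-axis, a common LiDAR augmentation."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ rot.T

def inverse_aug_to_camera(aug_keypoint, stored_angle_rad, project_to_camera):
    """Undo the stored augmentation, then project the original 3D point into image space."""
    original = rotate_z(aug_keypoint[None, :], -stored_angle_rad)[0]   # reverse the rotation
    return project_to_camera(original)   # uses the original (unaugmented) camera calibration

# Usage sketch: the rotation angle is drawn and *stored* when the point cloud is augmented,
# then reused here at the fusion stage to recover accurate 2D correspondences.
```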

During training, InverseAug resolves the inaccurate alignment from geometric augmentation.
Left: Alignment without InverseAug. Right: Alignment quality improvement with InverseAug.

LearnableAlign: A Cross-Modality-Attention Module to Learn Alignment
We also introduce Learnable Alignment (LearnableAlign), a cross-modality-attention–based feature-level alignment technique, to improve the alignment quality. For input-level fusion methods, such as PointPainting and PointAugmenting, given a 3D LiDAR point, only the corresponding camera pixel can be exactly located as there is a one-to-one mapping. In contrast, when fusing deep features in the DeepFusion pipeline, each LiDAR feature represents a voxel containing a subset of points, and hence, its corresponding camera pixels are in a polygon. So the alignment becomes the problem of learning the mapping between a voxel cell and a set of pixels.

A naïve approach is to average over all pixels corresponding to the given voxel. However, intuitively, and as supported by our visualized results, these pixels are not equally important because the information from the LiDAR deep feature unequally aligns with every camera pixel. For example, some pixels may contain critical information for detection (e.g., the target object), while others may be less informative (e.g., consisting of backgrounds such as roads, plants, occluders, etc.).

LearnableAlign leverages a cross-modality attention mechanism to dynamically capture the correlations between two modalities. Here, the input contains the LiDAR features in a voxel cell, and all its corresponding camera features. The output of the attention is essentially a weighted sum of the camera features, where the weights are collectively determined by a function of the LiDAR and camera features. More specifically, LearnableAlign uses three fully-connected layers to respectively transform the LiDAR features to a vector (ql), and camera features to vectors (kc) and (vc). For each vector (ql), we compute the dot products between (ql) and (kc) to obtain the attention affinity matrix that contains correlations between the LiDAR features and the corresponding camera features. Normalized by a softmax operator, the attention affinity matrix is then used to calculate weights and aggregate the vectors (vc) that contain camera information. The aggregated camera information is then processed by a fully-connected layer, and concatenated (Concat) with the original LiDAR feature. The output is then fed into any standard 3D detection framework, such as PointPillars or CenterPoint for model training.
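
The following is a simplified sketch of that attention computation for one LiDAR voxel feature attending over its corresponding camera features; dimensions and module structure are illustrative rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LearnableAlignSketch(nn.Module):
    """One LiDAR voxel feature attends over its corresponding camera pixel features."""

    def __init__(self, lidar_dim, cam_dim, attn_dim):
        super().__init__()
        self.to_q = nn.Linear(lidar_dim, attn_dim)   # LiDAR feature -> query (ql)
        self.to_k = nn.Linear(cam_dim, attn_dim)     # camera features -> keys (kc)
        self.to_v = nn.Linear(cam_dim, attn_dim)     # camera features -> values (vc)
        self.out = nn.Linear(attn_dim, attn_dim)     # fully connected layer on the aggregate

    def forward(self, lidar_feat, cam_feats):
        # lidar_feat: (batch, lidar_dim); cam_feats: (batch, num_pixels, cam_dim)
        q = self.to_q(lidar_feat).unsqueeze(1)                    # (batch, 1, attn_dim)
        k = self.to_k(cam_feats)                                  # (batch, num_pixels, attn_dim)
        v = self.to_v(cam_feats)

        affinity = torch.softmax(q @ k.transpose(1, 2), dim=-1)   # attention over camera pixels
        cam_agg = self.out((affinity @ v).squeeze(1))             # weighted sum of camera values
        return torch.cat([lidar_feat, cam_agg], dim=-1)           # concat with the LiDAR feature
```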

LearnableAlign leverages the cross-attention mechanism to align LiDAR and camera features.

DeepFusion: A Better Way to Fuse Information from Different Modalities
Powered by our two novel feature alignment techniques, we develop DeepFusion, a fully end-to-end multi-modal 3D detection framework. In the DeepFusion pipeline, the LiDAR points are first fed into an existing feature extractor (e.g., pillar feature net from PointPillars) to obtain LiDAR features (e.g., pseudo-images). In the meantime, the camera images are fed into a 2D image feature extractor (e.g., ResNet) to obtain camera features. Then, InverseAug and LearnableAlign are applied in order to fuse the camera and LiDAR features together. Finally, the fused features are processed by the remaining components of the selected 3D detection model (e.g., the backbone and detection head from PointPillars) to obtain the detection results.

The pipeline of DeepFusion.

Benchmark Results
We evaluate DeepFusion on the Waymo Open Dataset, one of the largest 3D detection challenges for autonomous cars, using the Average Precision with Heading (APH) metric under difficulty level 2, the default metric to rank a model’s performance on the leaderboard. Among the 70 participating teams all over the world, the DeepFusion single and ensemble models achieve state-of-the-art performance in their corresponding categories.

The single DeepFusion model achieves new state-of-the-art performance on Waymo Open Dataset.
The Ensemble DeepFusion model outperforms all other methods on Waymo Open Dataset, ranking No. 1 on the leaderboard.

The Impact of InverseAug and LearnableAlign
We also conduct ablation studies on the effectiveness of the proposed InverseAug and LearnableAlign techniques. We demonstrate that both InverseAug and LearnableAlign individually contribute to a performance gain over the LiDAR-only model, and combining both can further yield an even more significant boost.

Ablation studies on InverseAug (IA) and LearnableAlign (LA) measured in average precision (AP) and APH. Combining both techniques contributes to the best performance gain.

Conclusion
We demonstrate that late-stage deep feature fusion can be more effective when features are aligned well, but aligning features from two different modalities can be challenging. To address this challenge, we propose two techniques, InverseAug and LearnableAlign, to improve the quality of alignment among multimodal features. By integrating these techniques into the fusion stage of our proposed DeepFusion method, we achieve state-of-the-art performance on the Waymo Open Dataset.

Acknowledgements:
Special thanks to co-authors Tianjian Meng, Ben Caine, Jiquan Ngiam, Daiyi Peng, Junyang Shen, Bo Wu, Yifeng Lu, Denny Zhou, Quoc Le, Alan Yuille, Mingxing Tan.

Categories
Misc

Capture 6x Better Temporal Resolution Cardiac Imaging at Any Heart Rate with Fujifilm Healthcare Cardio StillShot

Using NVIDIA GPUs, Fujifilm Healthcare developed Cardio StillShot to capture cardiac imaging at any heart rate, with 6x better temporal resolution of cardiac CT images.

Capturing clear diagnostic images of the heart and its vasculature is challenging in cardiac computed tomography (CT) imaging because the heart is always moving, and the resulting images can be blurry. When a heart is beating quickly (above 75 beats per minute) or irregularly, good image resolution is almost impossible.

Global diagnostic imaging leader Fujifilm Healthcare developed Cardio StillShot software, which uses NVIDIA GPUs and integrates with their existing whole-body X-ray CT system SCENARIA View, for precise cardiac imaging at any heart rate. This software improves diagnostic imaging without a high-speed rotation scanner. Also, Cardio StillShot achieves over 6x better temporal resolution than conventional image reconstruction methods by detecting cardiac motion and preventing image blurring through motion correction. 

Clear cardiac CT images help clinical teams visualize structures such as coronary arteries, aortic valves, and myocardium noninvasively and diagnose heart problems such as heart failure, cardiomyopathy, and structural abnormalities.

Cardiovascular disease rates and noninvasive diagnostic tools

Cardiovascular disease (CVD) is the leading cause of death globally. According to the WHO, an estimated 17.9 million people died from CVDs in 2019, representing 32% of all global deaths. Of those deaths, 85% were due to heart attack and stroke. Imaging techniques such as coronary computed tomography angiography (CCTA) are widely available noninvasive diagnostic tools for assessing a patient’s cardiovascular disease risk early.

CCTA helps identify plaque deposits in the coronary arteries, which supply oxygen and nutrients to the heart. Plaque is the buildup of fats, cholesterol, and other substances in artery walls, which constricts blood flow to the heart.

Identifying plaque buildup early can help prevent heart attacks. In ECG-gated cardiac CT, X-ray images are obtained during the cardiac phase with little cardiac motion, or image reconstruction is performed using multiple samples to create a static image of the coronary artery.

SCENARIA View, a whole-body X-ray CT system
Figure 1: Fujifilm Healthcare’s latest model of SCENARIA View, pictured above, will offer Cardio StillShot as a software enablement option along with an RTX A6000 GPU console.

Difficulties with imaging during high heart rates

Patients with high heart rates or irregular heart rates need to be scanned just like every other patient. Unfortunately, it is hard for scanners to get clear diagnostic images under these conditions. At heart rates of 60-75 beats per minute (BPM), there is adequate time to take images between heartbeats. But, when the heart rates rise above 75 BPM, the imaging time window becomes too short, leading to blurry images. Detailed imaging of the coronary arteries requires high temporal resolution. 

Cardio StillShot was developed to achieve high temporal resolution by detecting and correcting motions in the heart even when the patient’s heart rate is high, without using beta-blockers or other medications to lower heart rate.

Transitioning from CPUs to GPUs to develop Cardio StillShot

Cardio StillShot image reconstruction software addresses the conventional issues of temporal resolution. Previously, Fujifilm Healthcare was using CPUs to reconstruct images and remove blurriness. However, CPUs are no longer a viable option for Cardio StillShot due to a 10x increase in the number of calculations required for each image. Fujifilm Healthcare transitioned to NVIDIA GPUs and NVIDIA software to develop Cardio StillShot. The adoption of NVIDIA RTX A6000 GPUs, with 77 TFLOPS of compute performance, helps calculate the motion vector field (MVF), resulting in clear images for clinical use. Fujifilm Healthcare also used the NVIDIA software stack and tools, including the NVIDIA Optical Flow SDK to estimate pixel-level motion, CUDA for accelerated calculations, and NVIDIA Nsight Compute to optimize performance.

Exploring 4D motion vector fields to improve image clarity

Fujifilm Healthcare used a 4D MVF to estimate the motion in CCTA images. The MVF approach automatically tracks and corrects the heart’s motion, resulting in sharper images. The improvement is a 6.25x gain in temporal resolution, from 175 msec in a standard reconstruction to 28 msec with the Cardio StillShot software. With NVIDIA GPUs, clear views of the heart can be reconstructed in as little as 30 seconds.

Workflow of Motion Vector Field Synthesis
Figure 2: Motion Vector Field Synthesis from CT Scan.

Accelerated compute adds premium capabilities to existing scanners

For Fujifilm Healthcare, using accelerated compute shifted the system performance and cost of the CT design. Usually, high-performance features require costly design and manufacturing upgrades. Fujifilm Healthcare broke this trend with NVIDIA GPUs to add premium capabilities to scanners via a software enhancement. Adding GPU acceleration to the StillShot image reconstruction software improved cardiac image quality of an existing CT scanner with over 6x temporal resolution. The Cardio StillShot software runs on Fujifilm Healthcare’s latest model of SCENARIA View, which is available in Japan today and offered worldwide soon. Fujifilm Healthcare will be demonstrating the Cardio StillShot software and SCENARIA View CT scanner at the International Technical Exhibition of Medical Imaging 2022 Conference held in Yokohama, Japan from April 15 to 17.