Categories
Offsites

Using Deep Learning to Annotate the Protein Universe

Proteins are essential molecules found in all living things. They play a central role in our bodies’ structure and function, and they are also featured in many products that we encounter every day, from medications to household items like laundry detergent. Each protein is a chain of amino acid building blocks, and just as an image may include multiple objects, like a dog and a cat, a protein may also have multiple components, which are called protein domains. Understanding the relationship between a protein’s amino acid sequence — for example, its domains — and its structure or function is a long-standing challenge with far-reaching scientific implications.

An example of a protein with known structure, TrpCF from E. coli, for which areas used by a model to predict function are highlighted (green). This protein produces tryptophan, which is an essential part of a person’s diet.

Many are familiar with recent advances in computationally predicting protein structure from amino acid sequences, as seen with DeepMind’s AlphaFold. Similarly, the scientific community has a long history of using computational tools to infer protein function directly from sequences. For example, the widely-used protein family database Pfam contains numerous highly-detailed computational annotations that describe a protein domain’s function, e.g., the globin and trypsin families. While existing approaches have been successful at predicting the function of hundreds of millions of proteins, there are still many more with unknown functions — for example, at least one-third of microbial proteins are not reliably annotated. As the volume and diversity of protein sequences in public databases continue to increase rapidly, the challenge of accurately predicting function for highly divergent sequences becomes increasingly pressing.

In “Using Deep Learning to Annotate the Protein Universe”, published in Nature Biotechnology, we describe a machine learning (ML) technique to reliably predict the function of proteins. This approach, which we call ProtENN, has enabled us to add about 6.8 million entries to Pfam’s well-known and trusted set of protein function annotations, about equivalent to the sum of progress over the last decade, which we are releasing as Pfam-N. To encourage further research in this direction, we are releasing the ProtENN model and a Distill-like interactive article where researchers can experiment with our techniques. This interactive tool allows the user to enter a sequence and get results for a predicted protein function in real time, in the browser, with no setup required. In this post, we’ll give an overview of this achievement and how we’re making progress toward revealing more of the protein universe.

The Pfam database is a large collection of protein families and their sequences. Our ML model ProtENN helped annotate 6.8 million more protein regions in the database.

Protein Function Prediction as a Classification Problem
In computer vision, it’s common to first train a model for image classification tasks, like CIFAR-100, before extending it to more specialized tasks, like object detection and localization. Similarly, we develop a protein domain classification model as a first step towards future models for classification of entire protein sequences. We frame the problem as a multi-class classification task in which we predict a single label out of 17,929 classes — all classes contained in the Pfam database — given a protein domain’s sequence of amino acids.

Models that Link Sequence to Function
While there are a number of models currently available for protein domain classification, one drawback of the current state-of-the-art methods is that they are based on the alignment of linear sequences and don’t consider interactions between amino acids in different parts of protein sequences. But proteins don’t just stay as a line of amino acids; they fold in on themselves, such that nonadjacent amino acids have strong effects on each other.

Aligning a new query sequence to one or more sequences with known function is a key step of current state-of-the-art methods. This reliance on sequences with known function makes it challenging to predict a new sequence’s function if it is highly dissimilar to any sequence with known function. Furthermore, alignment-based methods are computationally intensive, and applying them to large datasets, such as the metagenomic database MGnify, which contains >1 billion protein sequences, can be cost prohibitive.

To address these challenges, we propose to use dilated convolutional neural networks (CNNs), which should be well-suited to modeling non-local pairwise amino-acid interactions and can be run on modern ML hardware like GPUs. We train 1-dimensional CNNs to predict the classification of protein sequences, a model we call ProtCNN, as well as an ensemble of independently trained ProtCNN models, which we call ProtENN. Our goal for using this approach is to add knowledge to the scientific literature by developing a reliable ML approach that complements traditional alignment-based methods. To demonstrate this, we developed an evaluation methodology that rigorously measures our method’s accuracy.
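To make the flavor of this architecture concrete, here is a minimal sketch of a dilated 1-D CNN classifier over amino-acid sequences in TensorFlow/Keras. It is not the published ProtCNN code; the vocabulary size, layer widths, kernel sizes, and residual structure are placeholder assumptions.

import tensorflow as tf

NUM_CLASSES = 17929  # number of Pfam classes in the classification task
VOCAB_SIZE = 25      # 20 standard amino acids plus a few special tokens (assumed)

def residual_dilated_block(x, filters, dilation):
    # Dilated convolutions let the model relate amino acids that are far
    # apart in the linear sequence without an extremely deep network.
    y = tf.keras.layers.Conv1D(filters, kernel_size=9, dilation_rate=dilation,
                               padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv1D(filters, kernel_size=1, padding="same")(y)
    return tf.keras.layers.Add()([x, y])

inputs = tf.keras.Input(shape=(None,), dtype=tf.int32)  # variable-length domain sequences
x = tf.keras.layers.Embedding(VOCAB_SIZE, 128)(inputs)
x = tf.keras.layers.Conv1D(128, kernel_size=9, padding="same", activation="relu")(x)
for dilation in (1, 2, 4, 8):  # exponentially growing receptive field
    x = residual_dilated_block(x, 128, dilation)
x = tf.keras.layers.GlobalMaxPooling1D()(x)  # pool over the sequence length
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

prot_cnn_sketch = tf.keras.Model(inputs, outputs)
prot_cnn_sketch.compile(optimizer="adam",
                        loss="sparse_categorical_crossentropy",
                        metrics=["accuracy"])

An ensemble in the spirit of ProtENN would then average the predicted class probabilities of several such models trained independently.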

Evaluation with Evolution in Mind
Similar to well-known classification problems in other fields, the challenge in protein function prediction is less in developing a completely new model for the task, and more in creating fair training and test sets to ensure that the models will make accurate predictions for unseen data. Because proteins have evolved from shared common ancestors, different proteins often share a substantial fraction of their amino acid sequence. Without proper care, the test set could be dominated by samples that are highly similar to the training data, which could lead to the models performing well by simply “memorizing” the training data, rather than learning to generalize more broadly from it.

We create a test set that requires ProtENN to generalize well on data far from its training set.

To guard against this, it is essential to evaluate model performance using multiple separate setups. For each evaluation, we stratify model accuracy as a function of similarity between each held-out test sequence and the nearest sequence in the train set.

The first evaluation includes a clustered split training and test set, consistent with prior literature. Here, protein sequence samples are clustered by sequence similarity, and entire clusters are placed into either the train or test sets. As a result, every test example is at least 75% different from every training example. Strong performance on this task demonstrates that a model can generalize to make accurate predictions for out-of-distribution data.
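As a minimal sketch of what a clustered split looks like in practice (the cluster assignments would come from a separate sequence-similarity clustering step; the data structures here are illustrative):

import random
from collections import defaultdict

def clustered_split(sequence_ids, cluster_of, test_fraction=0.2, seed=0):
    # Group sequences by similarity cluster, then assign whole clusters to
    # either train or test, so no test sequence has a close homolog in train.
    clusters = defaultdict(list)
    for seq_id in sequence_ids:
        clusters[cluster_of[seq_id]].append(seq_id)

    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)
    num_test_clusters = int(len(cluster_ids) * test_fraction)
    test_clusters = set(cluster_ids[:num_test_clusters])

    train_ids, test_ids = [], []
    for cid, members in clusters.items():
        (test_ids if cid in test_clusters else train_ids).extend(members)
    return train_ids, test_ids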

For the second evaluation, we use a randomly split training and test set, where we stratify examples based on an estimate of how difficult they will be to classify. These measures of difficulty include: (1) the similarity between a test example and the nearest training example, and (2) the number of training examples from the true class (it is much more difficult to accurately predict function given just a handful of training examples).
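A small sketch of the first of those difficulty measures in use, bucketing held-out accuracy by sequence identity to the nearest training example (both inputs are hypothetical):

def accuracy_by_similarity(correct, nearest_train_identity,
                           bin_edges=(0.25, 0.5, 0.75, 1.0)):
    # Bucket held-out examples by similarity to their nearest training
    # sequence, then report accuracy within each bucket.
    buckets = {edge: [] for edge in bin_edges}
    for is_correct, identity in zip(correct, nearest_train_identity):
        for edge in bin_edges:
            if identity <= edge:
                buckets[edge].append(is_correct)
                break
    return {edge: sum(hits) / len(hits) for edge, hits in buckets.items() if hits}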

To place our work in context, we evaluate the most widely used baseline models and evaluation setups, in particular: (1) BLAST, a nearest-neighbor method that uses sequence alignment to measure distance and infer function, and (2) profile hidden Markov models (TPHMM and phmmer). For each of these, we include the stratification of model performance based on sequence alignment similarity mentioned above. We compare these baselines against ProtCNN and the ensemble of CNNs, ProtENN.

We measure each model’s ability to generalize, from the hardest examples (left) to the easiest (right).

Reproducible and Interpretable Results
We also worked with the Pfam team to test whether our methodological proof of concept could be used to label real-world sequences. We demonstrated that ProtENN learns complementary information to alignment-based methods, and created an ensemble of the two approaches to label more sequences than either method could by itself. We publicly released the results of this effort, Pfam-N, a set of 6.8 million new protein sequence annotations.

After seeing the success of these methods and classification tasks, we inspected these networks to understand whether the embeddings were generally useful. We built a tool that enables users to explore the relation between the model predictions, embeddings, and input sequences, which we have made available through our interactive manuscript, and we found that similar sequences were clustered together in embedding space. Furthermore, the network architecture that we selected, a dilated CNN, allows us to employ previously-discovered interpretability methods like class activation mapping (CAM) and sufficient input subsets (SIS) to identify the sub-sequences responsible for the neural network predictions. With this approach, we find that our network generally focuses on the relevant elements of a sequence to predict its function.
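As an illustration of how class activation mapping carries over to a 1-D sequence model, here is a generic CAM sketch (not the paper's exact procedure); it assumes a Keras model whose last convolutional features feed a global pooling layer and a final dense classifier, and the layer names are placeholders:

import numpy as np
import tensorflow as tf

def sequence_cam(model, sequence, class_index,
                 last_conv_layer_name, dense_layer_name):
    # Weight the last convolutional feature maps by the dense-layer weights
    # of the chosen class, giving a per-position relevance score.
    conv_output = model.get_layer(last_conv_layer_name).output
    feature_model = tf.keras.Model(model.inputs, conv_output)
    features = feature_model(sequence[None, :])[0].numpy()  # (length, channels)

    class_weights = model.get_layer(dense_layer_name).get_weights()[0][:, class_index]
    cam = features @ class_weights                             # (length,)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-9)  # normalize to [0, 1]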

Conclusion and Future Work
We’re excited about the progress we’ve seen by applying ML to the understanding of protein structure and function over the last few years, which has been reflected in contributions from the broader research community, from AlphaFold and CAFA to the multitude of workshops and research presentations devoted to this topic at conferences. As we look to build on this work, we think that continuing to collaborate with scientists across the field who’ve shared their expertise and data, combined with advances in ML will help us further reveal the protein universe.

Acknowledgments
We’d like to thank all of the co-authors of the manuscripts, Maysam Moussalem, Jamie Smith, Eli Bixby, Babak Alipanahi, Shanqing Cai, Cory McLean, Abhinay Ramparasad, Steven Kearnes, Zack Nado, and Tom Small.

Categories
Misc

Validate Applications for Secure Edge Deployments with the Expanded NVIDIA Metropolis Partner Program

The Metropolis Partner Program expanded to include a certification that ensures partner applications can be securely deployed to any location with NVIDIA Fleet Command.

Building production-ready AI applications is hard, especially when starting from scratch. That’s why NVIDIA created Metropolis, a suite of tools to help developers build and bring to market vision AI applications. 

Deploying these applications in production, especially outside of the data center, can be just as difficult. For many organizations, determining the best way to deploy an application at customer sites is a task started after customer conversations are underway. The Metropolis Partner Program is addressing this with a certification that ensures partner applications can be securely deployed to any location with NVIDIA Fleet Command.

Fleet Command—a cloud service that centrally connects systems at edge locations—helps organizations securely deploy, manage, and scale AI applications from one dashboard. It’s the best way to orchestrate AI across hundreds or even thousands of devices covering vast physical distances. Now, Metropolis partners can use Fleet Command for free to deploy and scale their applications in production environments. 

A cloud representing Fleet Command is connected to several edge locations spanning several industries, including healthcare, retail, warehouse, manufacturing, and transportation.
Figure 1: Centrally manage AI application deployments across all of your edge locations with Fleet Command.

Deploying an application using Fleet Command gives partners a platform to easily conduct POCs at customer sites. This can be done without building custom tools to get applications operational in unique customer environments. Additionally, after the evaluation is complete, the partner has all of the necessary infrastructure to easily scale an application from an evaluation environment to the entire production environment. Saving time gives partners the freedom to focus all their resources on building valuable AI applications.

Once an application is ready for deployment on Fleet Command, partners have free access to the platform for a year. The platform includes cloud access to servers with the latest NVIDIA GPUs, and NVIDIA experts on AI and optimizations, amounting to over a $100,000 value. 

Partners looking to demonstrate the value of their application to customers have access to NVIDIA LaunchPad. LaunchPad provides enterprises immediate, short-term access to all of the features and functionality of Fleet Command including provisioning edge infrastructure, deploying and managing applications, and monitoring their edge fleet. Partners can use Fleet Command on LaunchPad for customers to experience the application in an isolated environment before moving to a full evaluation in a production setting. 

Over 50 Metropolis Partner Program members are integrating their applications to be deployed and managed on Fleet Command. Partners include Milestone for AI-enabled video management software, OSARO for efficient pick and place robotics solutions, and IronYun for intelligent video analytics in smart buildings.

“With NVIDIA Fleet Command, we can easily deploy and manage our vision AI apps across the edge infrastructure in minutes rather than days,” said Paul Sun, CEO of IronYun.  “Our sales and testing cycles are drastically reduced.”

The Metropolis Partner Program enables AI developers with tools to streamline every stage of the process to acquire and keep new customers. This ranges from application development on Metropolis to demos and evaluations on LaunchPad, to POCs and deployments on Fleet Command.

Get access to Fleet Command and LaunchPad today by applying to join the Metropolis Partner Program. If you’re already a member, reach out to your representative to learn how to get started.

Categories
Misc

Podsplainer: What’s a Recommender System? NVIDIA’s Even Oldridge Breaks It Down

The very thing that makes the internet so useful to so many people — the vast quantity of information that’s out there — can also make going online frustrating. There’s so much available that the sheer volume of choices can be overwhelming. That’s where recommender systems come in, explains NVIDIA AI Podcast host Noah Kravitz. Read article >

The post Podsplainer: What’s a Recommender System? NVIDIA’s Even Oldridge Breaks It Down appeared first on NVIDIA Blog.

Categories
Misc

Conversion of SVC with RBF Kernel to Tensorflow model?

Is there a way to convert a support vector classifier with an RBF kernel to a TF model?

I am aware of converting a support vector classifier with a linear kernel, because there is coef_, where we can find the parameters and assign them to a TF model. I got this idea from "how to convert saved model from sklearn into tensorflow/lite".

However, coef_ won't be there for RBF, so I am not sure how to convert this model to a TF model.

Any suggestions are highly helpful. Thanks
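One possibility worth sketching (not from the thread, and assuming a fitted binary sklearn.svm.SVC with kernel="rbf"): the RBF decision function only needs the support vectors, dual coefficients, intercept, and gamma, all of which the fitted estimator exposes, so it can be re-implemented with TensorFlow ops:

import tensorflow as tf

def rbf_svc_decision_fn(svc):
    # Rebuild a fitted binary sklearn SVC(kernel="rbf") as a TF function from
    # its support vectors, dual coefficients, intercept, and resolved gamma.
    support_vectors = tf.constant(svc.support_vectors_, dtype=tf.float32)
    dual_coef = tf.constant(svc.dual_coef_[0], dtype=tf.float32)
    intercept = tf.constant(svc.intercept_[0], dtype=tf.float32)
    gamma = tf.constant(svc._gamma, dtype=tf.float32)  # private attribute: numeric gamma after fit

    @tf.function
    def decision(x):
        # Squared Euclidean distances between inputs and every support vector.
        sq_dist = tf.reduce_sum((x[:, None, :] - support_vectors[None, :, :]) ** 2, axis=-1)
        kernel = tf.exp(-gamma * sq_dist)
        return tf.linalg.matvec(kernel, dual_coef) + intercept

    return decision

Class labels then come from the sign of the decision values (mapped through svc.classes_); the multiclass case would also need the one-vs-one vote aggregation that SVC performs internally, and the same constants could feed a Keras layer if a SavedModel or TFLite export is the end goal.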

submitted by /u/Mother-Beyond9493
[visit reddit] [comments]

Categories
Misc

[TFLite] Mozilla + Coqui Speech Technology Hackathon

submitted by /u/josh-r-meyer
[visit reddit] [comments]

Categories
Misc

Tensorflow-GPU=2.6.0 and cudatoolkit >=11 with conda

I have a problem, which I could not solve yet.

I need to create a conda environment with tensorflow=2.6.0, python=3.8, and cudatoolkit>=11 (using an Ampere GPU).

I cannot use the conda default channel. So far I have tried the conda-forge and nvidia channels.

I have not succeeded in creating a conda environment with the above requirements. Anyone got a hint?

The machine runs CentOS with NVIDIA driver 495 and CUDA 11.5.
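Not from the thread, but one workaround people commonly try when no channel ships a matching TensorFlow build: take only the CUDA libraries from conda-forge and install TensorFlow itself with pip inside the environment (TF 2.6 targets CUDA 11.2 / cuDNN 8.1; the package names and versions below are assumptions to verify against the channel):

conda create -n tf26 -c conda-forge python=3.8 cudatoolkit=11.2 cudnn=8.1
conda activate tf26
pip install tensorflow==2.6.0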

submitted by /u/alex_bababu
[visit reddit] [comments]

Categories
Misc

(beginner) Am I doing something wrong, or is it supposed to take this long to import TensorFlow into an IDE?

I am using DataSpell and have a Jupyter notebook set up inside DataSpell. I’ve done simple imports such as

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text

but it has been like 2 hours and it still hasn’t finished loading. I’ve made sure that I already installed the packages using terminal pip install tensorflow, etc.

submitted by /u/ratmanreturns265
[visit reddit] [comments]

Categories
Misc

classify known images and reject unknown images; Resnet50?

I am trying to build a sorting system to sort parts by taking a picture of the part, identifying the part, and then telling a robot to move the part into the appropriate bucket. 99% of the parts are known and can be trained for. 1% of the parts are not known. I would like the known parts to be put in their buckets, and all of the unknown parts to be rejected into a single bucket called unknown.

I naively thought I would be able to do this with Resnet50 by looking at the weights returned in the prediction array. I thought the predictor would have uncertainty when presented with an unknown part. However, I have discovered that Resnet50 (and perhaps all image classifiers) will force any image into one of its trained-for buckets with a high level of confidence.

Because only 1% of the parts are unknown, I can’t realistically gather enough of them to train for. Furthermore, I don’t know when and where they will show up.

Does anyone know of a technique I could use to sort images of known parts into their buckets, and reject unknown parts?
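Not from the thread, but the simplest baseline worth sketching is confidence-based rejection: keep the classifier's prediction only when the top softmax probability clears a threshold, and send everything else to the unknown bucket (the threshold value and model interface here are placeholders):

import numpy as np

UNKNOWN = -1  # bucket index for rejected parts

def classify_with_rejection(model, images, threshold=0.8):
    # Predict class probabilities, then reject low-confidence predictions.
    # Plain softmax confidence is an imperfect novelty signal, so the
    # threshold should be tuned on held-out images, ideally including a few
    # genuinely unknown parts.
    probs = model.predict(images)          # shape: (batch, num_known_classes)
    top_class = probs.argmax(axis=1)
    top_prob = probs.max(axis=1)
    return np.where(top_prob >= threshold, top_class, UNKNOWN)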

submitted by /u/marlon1492
[visit reddit] [comments]

Categories
Misc

Saving Time and Money in the Cloud with the Latest NVIDIA-Powered Instances

The greater performance delivered by current-generation NVIDIA GPU-accelerated instances more than outweighs the per-hour pricing differences of prior-generation GPUs.

AI is transforming every industry, enabling powerful new applications and use cases that simply weren’t possible with traditional software. As AI continues to proliferate, and with the size and complexity of AI models on the rise, significant advances in AI compute performance are required to keep up.

That’s where the NVIDIA platform comes in.

With a full-stack approach spanning chips, systems, software, and even the entire data center, NVIDIA delivers both the highest performance and the greatest versatility for all AI workloads, including AI training. NVIDIA demonstrated this in the MLPerf Training v1.1, the latest edition of an industry-standard, peer-reviewed benchmark suite that measures ML training performance across a wide range of networks. Systems powered by the NVIDIA A100 Tensor Core GPU, including the Azure NDm A100 v4 cloud instance, delivered chart-topping results, set new records, and were the only ones to complete all eight MLPerf Training tests.

All major cloud service providers offer NVIDIA GPU-accelerated instances powered by the A100, making the public cloud a great place to tap into the performance and capabilities of the NVIDIA platform.  In this post, I show how a strategy of selecting current-generation instances based on the A100 not only delivers the fastest time to train AI models in the cloud but is also the most cost-effective.

NVIDIA A100 turbocharges AI training

The NVIDIA A100 is based on the Ampere architecture, which incorporates a host of innovations that speed up AI training compared to the prior-generation NVIDIA V100, such as third-generation Tensor Cores, a new generation of NVLink, and much greater memory bandwidth. These enhancements deliver a giant performance leap, enabling the reduction in the time to train a wide range of AI networks.

In this post, I use ResNet-50 to represent image classification, BERT Large for natural language processing, and DLRM for recommender systems.

Chart shows the performance speed-ups of the NVIDIA A100 40GB vs the NVIDIA V100 32GB in DLRM, BERT Large fine tuning, and ResNet-50. They are 2X, 2.6X, and 1.5X, respectively.
Figure 1. The NVIDIA A100 dramatically reduces the time to train AI models compared to the NVIDIA V100

GPU Server: Dual socket AMD EPYC 7742 @ 2.25GHz w/ 8x NVIDIA A100 SXM4-40GB and Dual socket Intel Xeon E5-2698 v4 @ 2.2GHz w/ 8x NVIDIA V100 SXM2-32GB.  Frameworks: TensorFlow for ResNet-50 v1.5, PyTorch for BERT-Large and DLRM; Precision: Mixed+XLA for ResNet-50 v1.5, Mixed for BERT-Large and DLRM.  NVIDIA Driver: 465.19.01; Dataset: ImageNet2012 for ResNet-50 v1.5, SQuaD v1.1 for BERT Large Fine Tuning, Criteo Terabyte Dataset for DLRM, Batch sizes for ResNet-50: A100, V100 = 256; Batch sizes for BERT Large: A100 = 32, V100 = 10; Batch sizes for DLRM: A100, V100 = 65536.

Faster training times speed time to insight, maximizing the productivity of an organization’s data science teams and getting the trained network deployed sooner. There’s also another important benefit: lower costs!

Cloud instances are commonly priced per unit of time, with hourly pricing typical for on-demand usage. The cost to train a model is the product of both hourly instance pricing and the time required to train a model.

Although it can be tempting to select the instances with the lowest hourly price, this might not lead to the lowest cost to train. An instance might be slightly cheaper on a per-hour basis but take significantly longer to train a model. The total cost to train is higher than it would be with the higher-priced instance that gets the job done more quickly. In addition, there’s the time lost waiting for the slower instance to complete the training run.
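A quick worked example with made-up numbers (not real instance prices) illustrates why the cheaper hourly rate can still lose:

# Hypothetical prices and training times, for illustration only.
v100_price_per_hour, v100_hours_to_train = 24.0, 10.0
a100_price_per_hour, a100_hours_to_train = 32.0, 4.0

v100_cost = v100_price_per_hour * v100_hours_to_train  # 240.0
a100_cost = a100_price_per_hour * a100_hours_to_train  # 128.0
print(f"V100 run: ${v100_cost:.0f}, A100 run: ${a100_cost:.0f}")

Despite the higher hourly rate, the faster instance finishes the hypothetical job at roughly half the cost, and six hours sooner.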

As the performance numbers shown earlier indicate, the NVIDIA A100 trains models much more quickly than the NVIDIA V100, almost 3x as fast in the case of BERT Large Fine Tuning. At the same time, A100-based instances from major cloud providers are often priced at only a modest premium over their prior-generation, V100-based counterparts.

In this post, I discuss how using A100-based cloud instances enables you to save time and money while training AI models, compared to V100-based cloud instances.

Translating performance into savings

Given the immense computational demands of AI training, it is common to train models using multiple GPUs working in concert to reduce training times significantly.

The NVIDIA platform has been designed to deliver industry-leading per-accelerator performance and achieve the best performance and highest ROI at scale, thanks to technologies like NVLink and NVSwitch. That’s why, in this post, I estimate the cost savings that instances with eight NVIDIA A100 GPUs can deliver compared to instances with eight NVIDIA V100 GPUs.

For this analysis, I estimate the relative costs to train ResNet-50, fine tune BERT Large, and train DLRM on V100- and A100-based instances from three major cloud service providers: Amazon Web Services, Google Cloud Platform, and Microsoft Azure.

CSP                     Instance              GPU Configuration
Amazon Web Services     p4d.24xlarge          8x NVIDIA A100 40GB
                        p3dn.24xlarge         8x NVIDIA V100 32GB
                        p3.16xlarge           8x NVIDIA V100 16GB
Google Cloud Platform   a2-highgpu-8g         8x NVIDIA A100 40GB
                        n1-highmem-96         8x NVIDIA V100 16GB
Microsoft Azure         Standard_ND96asr_v4   8x NVIDIA A100 40GB
                        ND40rs v2             8x NVIDIA V100 32GB
Table 1. NVIDIA GPU-accelerated instances from AWS, GCP, and Microsoft Azure.

Estimate methodology

To estimate the training performance of the cloud instances, I used measured time-to-train data on NVIDIA DGX systems with GPU configurations that correspond to those in the instances. As a result of the deep engineering collaboration with these cloud partners, the performance of NVIDIA-powered cloud instances should be similar to the performance achievable on the DGX systems.

Then, with the measured time-to-train data, I used on-demand, per-hour instance pricing to estimate the cost to train ResNet-50, fine tune BERT Large, and train DLRM. 

Estimated cost savings

The following charts all tell a similar story: no matter which cloud service provider you choose, selecting instances based on the latest NVIDIA A100 GPUs can translate into significant cost savings when training a range of AI models. This is even though, on a per-hour basis, instances based on the NVIDIA A100 are more expensive than instances using prior-generation V100 GPUs.

Amazon Web Services:

Chart shows the relative estimated cost to train ResNet-50, BERT Large Fine Tuning, and DLRM on AWS instances based on NVIDIA GPUs: p3.16xlarge (V100 16GB), p3dn.24xlarge (V100 32GB), and p4d.24xlarge (A100 40GB). Estimated savings with A100 = 41% for ResNet-50, 60% for BERT Large Fine Tuning, and 47% for DLRM.
Figure 2. Estimated cost savings for training models using A100 instances on AWS compared to V100 (16GB and 32GB) instances

GPU Server: Dual socket AMD EPYC 7742 @ 2.25GHz w/ 8x NVIDIA A100 SXM4-40GB, Dual socket Intel Xeon E5-2698 v4 @ 2.2GHz w/ 8x NVIDIA V100 SXM2-32GB, and Dual socket Intel Xeon E5-2698 v4 @ 2.2GHz w/ 8x NVIDIA V100 SXM2-16GB.  Frameworks: TensorFlow for ResNet-50 v1.5, PyTorch for BERT-Large and DLRM; Precision: Mixed+XLA for ResNet-50 v1.5, Mixed for BERT-Large and DLRM.  NVIDIA Driver: 465.19.01; Dataset: Imagenet2012 for ResNet-50 v1.5, SQuaD v1.1 for BERT Large Fine Tuning, Criteo Terabyte Dataset for DLRM, Batch sizes for ResNet-50: A100, V100 = 256; Batch sizes for BERT Large: A100 = 32, V100 = 10; Batch sizes for DLRM: A100, V100 = 65536; Cost estimated using performance data run on the earlier configurations as well as on-demand instance pricing as of 2/8/2022.

Google Cloud Platform:

Chart shows the relative estimated cost to train ResNet-50, BERT Large Fine Tuning, and DLRM on GCP instances based on NVIDIA GPUs: a2-highgpu-8g (A100 40GB), n1-highmem-96 (V100 16GB), and n1-highmem-96 with a sustained use discount applied. Estimated savings with A100 = 38% for ResNet-50 and 53% for BERT Large Fine Tuning.
Figure 3. Estimated cost savings for training models using A100 instances on GCP compared to V100 (16GB) instances

GPU Server: Dual socket AMD EPYC 7742 @ 2.25GHz w/ 8x NVIDIA A100 SXM4-40GB, Dual socket Intel Xeon E5-2698 v4 @ 2.2GHz w/ 8x NVIDIA V100 SXM2-32GB, and Dual socket Intel Xeon E5-2698 v4 @ 2.2GHz w/ 8x NVIDIA V100 SXM2-16GB.  Frameworks: TensorFlow for ResNet-50 v1.5, PyTorch for BERT-Large and DLRM; Precision: Mixed+XLA for ResNet-50 v1.5, Mixed for BERT-Large and DLRM.  NVIDIA Driver: 465.19.01; Dataset: ImageNet2012 for ResNet-50 v1.5, SQuaD v1.1 for BERT Large Fine Tuning, Criteo Terabyte Dataset for DLRM, Batch sizes for ResNet-50: A100, V100 = 256; Batch sizes for BERT Large: A100 = 32, V100 = 10; Batch sizes for DLRM: A100, V100 = 65536; Cost estimated using performance data run on the earlier configurations as well as on-demand instance pricing as of 2/8/2022.

Microsoft Azure:

Chart shows the relative estimated cost to train ResNet-50, BERT Large Fine Tuning, and DLRM on Azure instances based on NVIDIA GPUs: ND96asr A100 v4 (A100 40GB) and ND40rs v2 (V100 32GB). Estimated savings with A100 = 31% for ResNet-50, 53% for BERT Large Fine Tuning, and 37% for DLRM.
Figure 4. Estimated cost savings for training models using A100 instances on Microsoft Azure compared to V100 (32GB) instances

GPU Server: Dual socket AMD EPYC 7742 @ 2.25GHz w/ 8x NVIDIA A100 SXM4-40GB, Dual socket Intel Xeon E5-2698 v4 @ 2.2GHz w/ 8x NVIDIA V100 SXM2-32GB, and Dual socket Intel Xeon E5-2698 v4 @ 2.2GHz w/ 8x NVIDIA V100 SXM2-16GB.  Frameworks: TensorFlow for ResNet-50 v1.5, PyTorch for BERT-Large and DLRM; Precision: Mixed+XLA for ResNet-50 v1.5, Mixed for BERT-Large and DLRM.  NVIDIA Driver: 465.19.01; Dataset: ImageNet2012 for ResNet-50 v1.5, SQuaD v1.1 for BERT Large Fine Tuning, Criteo Terabyte Dataset for DLRM, Batch sizes for ResNet-50: A100, V100 = 256; Batch sizes for BERT Large: A100 = 32, V100 = 10; Batch sizes for DLRM: A100, V100 = 65536; Costs estimated using performance data run on the earlier configurations as well as on-demand instance pricing as of 2/8/2022.

In addition to delivering lower training costs and saving users a significant amount of time, there’s another benefit to using current-generation instances:  they enable fundamentally new AI use cases. For example, AI-based recommendation engines are becoming increasingly popular and NVIDIA GPUs are commonly used to train them. Figure 5 summarizes the cost and time savings that A100-instances deliver across different cloud providers:

Table summarizes the estimated time and cost benefits of using cloud instances with the NVIDIA A100 40GB GPU compared to instances with NVIDIA V100 16GB and V100 32GB GPUs.
Figure 5. Time and cost savings of A100-based instances compared to V100 counterparts. Based on on-demand instance pricing as of 2/8/2022.

Higher performance also means higher savings

The results presented here show that the much greater performance delivered by current-generation NVIDIA GPU-accelerated instances more than outweighs the per-hour pricing differences compared to older instances that use prior-generation GPUs.

Instances based on the latest NVIDIA A100 GPUs not only maximize the productivity of your data science teams by minimizing training time, but they’re also the most cost-effective way to train your models in the cloud.

To learn more about the many options for using NVIDIA acceleration in the cloud, see Cloud Computing.

Categories
Offsites

Co-training Transformer with Videos and Images Improves Action Recognition

Action recognition has become a major focus area for the research community because many applications can benefit from improved modeling, such as video retrieval, video captioning, video question-answering, etc. Transformer-based approaches have recently demonstrated state-of-the-art performance on several benchmarks. While Transformer models require more data than ConvNets to learn good visual priors, action recognition datasets are relatively small in scale. Large Transformer models are typically first trained on image datasets and later fine-tuned on a target action recognition dataset.

While the current pre-training and fine-tuning action recognition paradigm is straightforward and manifests strong empirical results, it may be overly restrictive for building general-purpose action-recognition models. Compared to a dataset like ImageNet that covers a large range of object recognition classes, action recognition datasets like Kinetics and Something-Something-v2 (SSv2) pertain to limited topics. For example, Kinetics includes object-centric actions like “cliff diving” and “ice climbing”, while SSv2 contains object-agnostic activities like “pretending to put something onto something else.” As a result, we observed poor performance when adapting an action recognition model that has been fine-tuned on one dataset to another, disparate dataset.

Differences in objects and video backgrounds among datasets further exacerbate learning a general-purpose action recognition classification model. Despite the fact that video datasets may be increasing in size, prior work suggests significant data augmentation and regularization is necessary to achieve strong performance. This latter finding may indicate the model quickly overfits on the target dataset, and as a result, hinders its capacity to generalize to other action recognition tasks.

In “Co-training Transformer with Videos and Images Improves Action Recognition”, we propose a training strategy, named CoVeR, that leverages both image and video data to jointly learn a single general-purpose action recognition model. Our approach is buttressed by two main findings. First, disparate video datasets cover a diverse set of activities, and training them together in a single model could lead to a model that excels at a wide range of activities. Second, video is a perfect source for learning motion information, while images are great for exploiting structural appearance. Leveraging a diverse distribution of image examples may be beneficial in building robust spatial representations in video models. Concretely, CoVeR first pre-trains the model on an image dataset, and during fine-tuning, it simultaneously trains a single model on multiple video and image datasets to build robust spatial and temporal representations for a general-purpose video understanding model.

Architecture and Training Strategy
We applied the CoVeR approach to the recently proposed spatial-temporal video transformer, called TimeSFormer, that contains 24 layers of transformer blocks. Each block contains one temporal attention, one spatial attention, and one multilayer perceptron (MLP) layer. To learn from multiple video and image datasets, we adopt a multi-task learning paradigm and equip the action recognition model with multiple classification heads. We pre-train all non-temporal parameters on the large-scale JFT dataset. During fine-tuning, a batch of videos and images are sampled from multiple video and image datasets. The sampling rate is proportional to the size of the datasets. Each sample within the batch is processed by the TimeSFormer and then distributed to the corresponding classifier to get the predictions.
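A schematic of this multi-dataset fine-tuning setup (not the authors' implementation; the backbone, head sizes, and batching are placeholders) showing size-proportional dataset sampling and per-dataset classification heads sharing one backbone:

import numpy as np
import tensorflow as tf

def build_heads(num_classes_per_dataset):
    # One classification head per dataset, all fed by a shared backbone.
    return {name: tf.keras.layers.Dense(n, name=f"{name}_head")
            for name, n in num_classes_per_dataset.items()}

def sample_dataset(dataset_sizes, rng):
    # Sampling rate proportional to dataset size.
    names = list(dataset_sizes)
    sizes = np.array([dataset_sizes[n] for n in names], dtype=np.float64)
    return names[rng.choice(len(names), p=sizes / sizes.sum())]

def cotrain_step(backbone, heads, optimizer, batches_by_dataset):
    # Mix batches from several datasets in one step; each batch is encoded by
    # the shared backbone and routed to its dataset's classifier, and the
    # summed loss is backpropagated through everything.
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    with tf.GradientTape() as tape:
        total_loss = 0.0
        for name, (inputs, labels) in batches_by_dataset.items():
            features = backbone(inputs, training=True)
            logits = heads[name](features)
            total_loss += loss_fn(labels, logits)
    variables = backbone.trainable_variables + [
        v for head in heads.values() for v in head.trainable_variables]
    grads = tape.gradient(total_loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return total_loss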

Compared with the standard training strategy, CoVeR has two advantages. First, as the model is directly trained on multiple datasets, the learned video representations are more general and can be directly evaluated on those datasets without additional fine-tuning. Second, Transformer-based models may easily overfit to a smaller video distribution, thus degrading the generalization of the learned representations. Training on multiple datasets mitigates this challenge by reducing the risk of overfitting.

CoVeR adopts a multi-task learning strategy trained on multiple datasets, each with their own classifier.

Benchmark Results
We evaluate the CoVeR approach by training on the Kinetics-400 (K400), Kinetics-600 (K600), Kinetics-700 (K700), SomethingSomething-V2 (SSv2), and Moments-in-Time (MiT) datasets. Compared with other approaches — such as TimeSFormer, Video SwinTransformer, TokenLearner, ViViT, MoViNet, VATT, VidTr, and OmniSource — CoVeR establishes a new state of the art on multiple datasets (shown below). Unlike previous approaches that train a dedicated model for one single dataset, a model trained by CoVeR can be directly applied to multiple datasets without further fine-tuning.

Model             Pretrain               Finetune              K400 Accuracy
VATT              AudioSet+Videos        K400                  82.1
Omnisource        IG-Kinetics-65M        K400                  83.6
ViViT             JFT-300M               K400                  85.4
Video SwinTrans   ImageNet21K+external   K400                  86.8
CoVeR             JFT-3B                 K400+SSv2+MiT+ImNet   87.2
Accuracy comparison on Kinetics-400 (K400) dataset.

Model             Pretrain               Finetune              SSv2 Accuracy
TimeSFormer       ImageNet21k            SSv2                  62.4
VidTr             ImageNet21k            SSv2                  63.0
ViViT             ImageNet21k            SSv2                  65.9
Video SwinTrans   ImageNet21K+external   SSv2                  69.6
CoVeR             JFT-3B                 K400+SSv2+MiT+ImNet   70.9
Accuracy comparison on SomethingSomething-V2 (SSv2) dataset.

Model             Pretrain               Finetune              MiT Accuracy
ViViT             ImageNet21k            MiT                   38.5
VidTr             ImageNet21k            SSv2                  41.1
CoVeR             JFT-3B                 K400+SSv2+MiT+ImNet   46.1
Accuracy comparison on Moments-in-Time (MiT) dataset.

Transfer Learning
We use transfer learning to further verify the video action recognition performance and compare it with co-training on multiple datasets; results are summarized below. Specifically, we train on the source datasets, then fine-tune and evaluate on the target dataset.

We first consider K400 as the target dataset. CoVeR co-trained on SSv2 and MiT improves the top-1 accuracy on K400→K400 (where the model is trained on K400 and then fine-tuned on K400) by 1.3%, SSv2→K400 by 1.7%, and MiT→K400 by 0.4%. Similarly, we observe that by transferring to SSv2, CoVeR achieves 2%, 1.8%, and 1.1% improvement over SSv2→SSv2, K400→SSv2, and MiT→SSv2, respectively. The 1.2% and 2% performance improvement on K400 and SSv2 indicates that CoVeR co-trained on multiple datasets could learn better visual representations than the standard training paradigm, which is useful for downstream tasks.

Comparison of transfer learning the representation learned by CoVeR and standard training paradigm. A→B means the model is trained on dataset A and then fine-tuned on dataset B.

Conclusion
In this work, we present CoVeR, a training paradigm that jointly learns action recognition and object recognition tasks in a single model for the purpose of constructing a general-purpose action recognition framework. Our analysis indicates that it may be beneficial to integrate many video datasets into one multi-task learning paradigm. We highlight the importance of continuing to learn on image data during fine-tuning to maintain robust spatial representations. Our empirical findings suggest CoVeR can learn a single general-purpose video understanding model which achieves impressive performance across many action recognition datasets without an additional stage of fine-tuning on each downstream application.

Acknowledgements
We would like to thank Christopher Fifty, Wei Han, Andrew M. Dai, Ruoming Pang, and Fei Sha for preparation of the CoVeR paper, Yue Zhao, Hexiang Hu, Zirui Wang, Zitian Chen, Qingqing Huang, Claire Cui, and Yonghui Wu for helpful discussions and feedback, and others on the Brain Team for support throughout this project.