Categories
Offsites

Do Modern ImageNet Classifiers Accurately Predict Perceptual Similarity?

The task of determining the similarity between images is an open problem in computer vision and is crucial for evaluating the realism of machine-generated images. Though there are a number of straightforward methods of estimating image similarity (e.g., low-level metrics that measure pixel differences, such as FSIM and SSIM), in many cases, the measured similarity differences do not match the differences perceived by a person. However, more recent work has demonstrated that intermediate representations of neural network classifiers, such as AlexNet, VGG and SqueezeNet trained on ImageNet, exhibit perceptual similarity as an emergent property. That is, Euclidean distances between encoded representations of images by ImageNet-trained models correlate much better with a person’s judgment of differences between images than estimating perceptual similarity directly from image pixels.
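
To make the idea concrete, the following minimal sketch (not the exact setup from the paper) computes a deep-feature perceptual distance: intermediate activations of an ImageNet-trained AlexNet from torchvision are unit-normalized along the channel dimension and compared with a squared Euclidean distance averaged over spatial positions. The learned LPIPS metric of Zhang et al. additionally applies per-channel weights and averages over several layers; this unweighted variant only illustrates the mechanism.

import torch
import torch.nn.functional as F
from torchvision import models

# ImageNet-trained AlexNet; keep only the convolutional trunk.
alexnet = models.alexnet(weights="IMAGENET1K_V1").features.eval()

def feature_distance(img0, img1, layer=5):
    # img0, img1: (N, 3, H, W) tensors, already ImageNet-normalized.
    with torch.no_grad():
        f0 = alexnet[:layer](img0)
        f1 = alexnet[:layer](img1)
    # Unit-normalize along channels, then average the squared Euclidean
    # distance over all spatial positions.
    f0, f1 = F.normalize(f0, dim=1), F.normalize(f1, dim=1)
    return ((f0 - f1) ** 2).sum(dim=1).mean(dim=(1, 2))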

Two sets of sample images from the BAPPS dataset. Trained networks agree more with human judgements as compared to low-level metrics (PSNR, SSIM, FSIM). Image source: Zhang et al. (2018).

In “Do better ImageNet classifiers assess perceptual similarity better?”, published in Transactions on Machine Learning Research, we contribute an extensive experimental study of the relationship between the accuracy of ImageNet classifiers and their emergent ability to capture perceptual similarity. To evaluate this emergent ability, we follow previous work and measure the perceptual score (PS), which is roughly the degree to which a model’s image-similarity judgments agree with human preferences on the BAPPS dataset. While prior work studied the first generation of ImageNet classifiers, such as AlexNet, SqueezeNet and VGG, we significantly increase the scope of the analysis, incorporating modern classifiers, such as ResNets and Vision Transformers (ViTs), across a wide range of hyperparameters.
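
For readers unfamiliar with the metric, the snippet below sketches how a 2AFC-style perceptual score can be computed: for each (reference, patch 0, patch 1) triplet, the model earns credit equal to the fraction of human raters who agree with its choice of the more similar patch. This is a simplified version for illustration; the exact aggregation used on BAPPS may differ in its details.

import numpy as np

def two_afc_score(d0, d1, human_prob):
    # d0, d1: model distances from each reference patch to patch 0 and patch 1.
    # human_prob: fraction of raters who judged patch 1 as more similar (in [0, 1]).
    d0, d1, human_prob = map(np.asarray, (d0, d1, human_prob))
    model_prefers_1 = (d1 < d0).astype(float)
    # Credit equals the fraction of humans agreeing with the model's choice.
    credit = model_prefers_1 * human_prob + (1.0 - model_prefers_1) * (1.0 - human_prob)
    return float(credit.mean())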

Relationship Between Accuracy and Perceptual Similarity
It is well established that features learned by training on ImageNet transfer well to a number of downstream tasks, making ImageNet pre-training a standard recipe. Further, better accuracy on ImageNet usually implies better performance on a diverse set of downstream tasks, such as robustness to common corruptions, out-of-distribution generalization, and transfer learning on smaller classification datasets. Contrary to this prevailing evidence that models with high ImageNet validation accuracy are likely to transfer better to other tasks, we surprisingly find that representations from underfit ImageNet models with modest validation accuracy achieve the best perceptual scores.

Plot of perceptual scores (PS) on the 64 × 64 BAPPS Dataset (y-axis) against the ImageNet 64 × 64 validation accuracies (x-axis). Each blue dot represents an ImageNet classifier. Better ImageNet classifiers achieve better PS up to a certain point (dark blue), beyond which improving the accuracy lowers the PS. The best PS are attained by classifiers with moderate accuracy (20.0–40.0).


We study the variation of perceptual scores as a function of neural network hyperparameters: width, depth, number of training steps, weight decay, label smoothing and dropout. For each hyperparameter, there exists an optimal accuracy up to which improving accuracy improves PS. This optimum is fairly low and is attained quite early in the hyperparameter sweep. Beyond this point, improved classifier accuracy corresponds to worse PS.

As an illustration, we present the variation of PS with respect to two hyperparameters: training steps in ResNets and width in ViTs. The PS of ResNet-50 and ResNet-200 peaks very early, within the first few epochs of training, and beyond the peak, the PS of the more accurate classifiers decreases more sharply. ResNets are trained with a learning rate schedule that causes a stepwise increase in accuracy as a function of training steps; interestingly, after the peak, they also exhibit a stepwise decrease in PS that mirrors this stepwise increase in accuracy.

Early-stopped ResNets attain the best PS across different depths of 6, 50 and 200.

ViTs consist of a stack of transformer blocks applied to the input image. The width of a ViT model is the number of output neurons of a single transformer block. Increasing the width is an effective way to improve accuracy. Here, we vary the width of two ViT variants, B/8 and L/4 (i.e., Base and Large ViT models with patch sizes 8 and 4, respectively), and evaluate both accuracy and PS. Similar to our observations with early-stopped ResNets, narrower ViTs with lower accuracies perform better than the default widths. Surprisingly, the optimal widths of ViT-B/8 and ViT-L/4 are only 6% and 12% of their default widths. For a more comprehensive list of experiments involving other hyperparameters such as width, depth, number of training steps, weight decay, label smoothing and dropout across both ResNets and ViTs, check out our paper.

Narrow ViTs attain the best PS.

Scaling Down Models Improves Perceptual Scores
Our results prescribe a simple strategy to improve an architecture’s PS: scale down the model to reduce its accuracy until it attains the optimal perceptual score. The table below summarizes the improvements in PS obtained by scaling down each model across every hyperparameter. Except for ViT-L/4, early stopping yields the highest improvement in PS, regardless of architecture. In addition, early stopping is the most efficient strategy as there is no need for an expensive grid search.

Model      | Default | Width | Depth | Weight Decay | Central Crop | Train Steps | Best
ResNet-6   | 69.1    | –     | +0.4  | +0.3         | 0.0          | +0.5        | 69.6
ResNet-50  | 68.2    | –     | +0.4  | +0.7         | +0.7         | +1.5        | 69.7
ResNet-200 | 67.6    | –     | +0.2  | +1.3         | +1.2         | +1.9        | 69.5
ViT B/8    | 67.6    | +1.1  | +1.0  | +1.3         | +0.9         | +1.1        | 68.9
ViT L/4    | 67.9    | +0.4  | +0.4  | -0.1         | -1.1         | +0.5        | 68.4
Perceptual Score improves by scaling down ImageNet models. Each value denotes the improvement obtained by scaling down a model across a given hyperparameter over the model with default hyperparameters.

Global Perceptual Functions
In prior work, the perceptual similarity function was computed using Euclidean distances across the spatial dimensions of the image. This assumes a direct correspondence between pixels, which may not hold for warped, translated or rotated images. Instead, we adopt two perceptual functions that rely on global representations of images, namely the style-loss function from the Neural Style Transfer work that captures stylistic similarity between two images, and a normalized mean pool distance function. The style-loss function compares the inter-channel cross-correlation matrix between two images while the mean pool function compares the spatially averaged global representations.
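
The sketch below illustrates the two global functions on feature maps of shape (N, C, H, W); the exact normalization used in the paper may differ.

import torch
import torch.nn.functional as F

def gram_matrix(feat):
    # feat: (N, C, H, W) feature map; returns the (N, C, C) channel
    # cross-correlation (Gram) matrix used by the style loss.
    n, c, h, w = feat.shape
    flat = feat.reshape(n, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)

def style_distance(f0, f1):
    # Spatial layout is discarded, so warped or translated content can
    # still be judged similar.
    return ((gram_matrix(f0) - gram_matrix(f1)) ** 2).mean(dim=(1, 2))

def mean_pool_distance(f0, f1):
    # Compare spatially averaged, unit-normalized global representations.
    g0 = F.normalize(f0.mean(dim=(2, 3)), dim=1)
    g1 = F.normalize(f1.mean(dim=(2, 3)), dim=1)
    return ((g0 - g1) ** 2).sum(dim=1)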

Global perceptual functions consistently improve PS across both networks trained with default hyperparameters (top) and ResNet-200 as a function of train epochs (bottom).

We probe a number of hypotheses to explain the relationship between accuracy and PS and come away with a few additional insights. For example, the accuracy of models without commonly used skip connections also inversely correlates with PS, and layers close to the input have, on average, lower PS than layers close to the output. For further exploration involving distortion sensitivity, ImageNet class granularity, and spatial frequency sensitivity, check out our paper.

Conclusion
In this paper, we explore the question of whether improving classification accuracy yields better perceptual metrics. We study the relationship between accuracy and PS on ResNets and ViTs across many different hyperparameters and observe that PS exhibits an inverse-U relationship with accuracy: PS improves with accuracy up to a certain point and then declines as accuracy increases further. Finally, in our paper, we discuss in detail a number of explanations for the observed relationship between accuracy and PS, involving skip connections, global similarity functions, distortion sensitivity, layerwise perceptual scores, spatial frequency sensitivity and ImageNet class granularity. While the exact explanation for the observed tradeoff between ImageNet accuracy and perceptual similarity remains a mystery, we are excited that our paper opens the door for further research in this area.

Acknowledgements
This is joint work with Neil Houlsby and Nal Kalchbrenner. We would additionally like to thank Basil Mustafa, Kevin Swersky, Simon Kornblith, Johannes Balle, Mike Mozer, Mohammad Norouzi and Jascha Sohl-Dickstein for useful discussions.

Categories
Misc

Changing Cybersecurity with Natural Language Processing

If you’ve used a chatbot, predictive text to finish a thought in an email, or pressed “0” to speak to an operator, you’ve come across natural language processing (NLP). As more enterprises adopt NLP, the sub-field is developing beyond those popular use cases of machine-human communication to machines interpreting both human and non-human language. This creates an exciting opportunity for organizations to stay ahead of evolving cybersecurity threats.

This post was originally published on CIO.com

NLP combines linguistics, computer science, and AI to support machine learning of human language. Human language is astonishingly complex. Relying on structured rules leaves machines with an incomplete understanding of it.

NLP enables machines to contextualize and learn instead of relying on rigid encoding so that they can adapt to different dialects, new expressions, or questions that the programmers never anticipated.

NLP research has driven the evolution of AI tech, like neural networks that are instrumental to machine learning across various fields and use cases. NLP has been primarily leveraged across machine-to-human communication to simplify interactions for enterprises and consumers.

NLP for cybersecurity

NLP was designed to enable machines to learn to communicate like humans, with humans. Many services we use today rely on machines communicating with each other, or on machine output being translated into something intelligible to humans. Cybersecurity is a prime example of such a field, where IT analysts can feel like they speak to more machines than people.

NLP can be leveraged in cybersecurity workflows to assist in breach protection, identification, and scale and scope analysis.

Phishing

In the short term, NLP can be easily leveraged to enhance and simplify breach protection from phishing attempts.

In the context of phishing, NLP can be leveraged to understand bot or spam behavior in email text sent by a machine posing as a human. It can also be used to understand the internal structure of the email itself to identify patterns of spammers and the types of messages they send.

This example is the first extension of NLP, originally designed to understand just human language and now being applied to understand the combination of human language mixed with machine-level headers.
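
As a purely illustrative sketch (not NVIDIA’s production pipeline), a phishing classifier over email text can be as simple as a TF-IDF representation feeding a linear model; `emails` and `labels` are assumed to be an existing labeled corpus.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# `emails` is a list of raw message bodies (optionally with selected headers
# appended as text) and `labels` marks each one as phishing (1) or benign (0).
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
clf.fit(emails, labels)

# Probability that a new message is a phishing attempt.
phishing_prob = clf.predict_proba(["Your account is locked, verify now"])[0, 1]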

Log parsing

In the medium term, NLP can be leveraged to parse logs, a cyBERT use case.

In the current rules-based system, the mechanisms and systems required to parse raw logs and make them ready for analysts are brittle and need significant development and maintenance resources.

Using NLP, parsing raw logs becomes more flexible and less prone to breaking when the log generators and sensors change.

Going further, the neural networks used for parsing can generalize beyond the logs they were exposed to during training, creating methods to transform raw data into rich content ready for an analyst without the need to write explicit rules for these new or changed log types. 

As a result, NLP models are more accurate at parsing logs than traditional rules while being more flexible and fault-tolerant.
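
For illustration only, log parsing can be framed as token classification: each token in a raw log line is tagged with the field it belongs to. The sketch below uses the Hugging Face transformers pipeline with a hypothetical fine-tuned checkpoint; the model name is a placeholder, and this is not the cyBERT implementation itself.

from transformers import pipeline

# Placeholder checkpoint name: a BERT-style model assumed to be fine-tuned to
# tag log fields (timestamp, host, process, message, ...) as entities.
log_parser = pipeline(
    "token-classification",
    model="my-org/log-field-tagger",   # hypothetical, for illustration only
    aggregation_strategy="simple",
)

raw_log = "Oct 18 04:12:31 host42 sshd[1021]: Failed password for admin from 10.0.0.7"
for field in log_parser(raw_log):
    # Each entry carries a predicted field label, the matched text span, and a
    # confidence score, ready to be assembled into a structured record.
    print(field["entity_group"], "->", field["word"])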

Synthetic languages

In the longer term, entirely synthetic languages can be created that represent machine-to-machine and human-to-machine communications.

If two machines can create an entirely new language, that language can then be analyzed using NLP techniques to identify errors in grammar, syntax, and composition. All these can be interpreted as anomalies and contextualized for analysts.

This new development can help identify known issues or attacks when they occur, and can also identify completely unknown misconfigurations and attacks, which helps analysts be more efficient and effective.

Summary

The phishing protection, log parsing, and synthetic language applications are just the beginning for NLP. To learn more about AI and cybersecurity, see Learn About the Latest Developments with AI-Powered Cybersecurity, one of many on-demand sessions from NVIDIA GTC.

Categories
Misc

Achieving 100x Faster Single-Cell Modality Prediction with NVIDIA RAPIDS cuML

Single-cell measurement technologies have advanced rapidly, revolutionizing the life sciences. We have scaled from measuring dozens to millions of cells and from one modality to multiple high dimensional modalities. The vast amounts of information at the level of individual cells present a great opportunity to train machine learning models to help us better understand the intrinsic link of cell modalities, which could be transformative for synthetic biology and drug target discovery.

This post introduces modality prediction and explains how we accelerated the winning solution of the NeurIPS Single-Cell Multi-Modality Prediction Challenge by drop-in replacing the CPU-based TSVD and kernel ridge regression (KRR), implemented in scikit-learn, with NVIDIA GPU-based RAPIDS cuML implementations.

Using cuML and changing only six lines of code, we accelerated the scikit-learn–based winning solution, reducing the training time from 69 minutes to 40 seconds: a 103.5x speedup! Even when compared to sophisticated deep learning models developed in PyTorch, we observed that the cuML solution is both faster and more accurate for this prediction challenge.

Challenges of single-cell modality prediction

Figure 1. Overview of the single-cell modality prediction problem: DNA is transcribed into RNA, and RNA is translated into protein; the latter step is the focus of this post.

Thanks to single-cell technology, we can measure multiple modalities within the same single cell such as DNA accessibility (ATAC), mRNA gene expression (GEX), and protein abundance (ADT). Figure 1 shows that these modalities are intrinsically linked. Only accessible DNA can produce mRNA, which in turn is used as a template to produce protein.

The problem of modality prediction arises naturally where it is desirable to predict one modality from another. In the 2021 NeurIPS challenge, we were asked to predict the flow of information from ATAC to GEX and from GEX to ADT.

If a machine learning model can make good predictions, it must have learned intricate states of the cell and it could provide a deeper insight into cellular biology. Extending our understanding of these regulatory processes is also transformative for drug target discovery.

The modality prediction is a multi-output regression problem, and it presents unique challenges:

  • High cardinality. For example, GEX and ADT information are described in vectors of length 13953 and 134, respectively.
  • Strong bias. The data is collected from 10 diverse donors and four sites. Training and test data come from different sites. Both donor and site strongly influence the distribution of the data.
  • Sparsity, redundancy, and non-linearity. The modality data is sparse, and the columns are highly correlated.

In this post, we focus on the task of GEX to ADT predictions to demonstrate the efficiency of a single-GPU solution. Our methods can be extended to other single-cell modality prediction tasks with larger data size and higher cardinality using multi-node multi-GPU architectures.

Using TSVD and KRR algorithms for multi-target regression

As our baseline, we used the first-place solution for the “GEX to ADT” task of the NeurIPS Modality Prediction Challenge, from Kaiwen Deng of the University of Michigan. The workflow of the core model is shown in Figure 2. The training data includes both GEX and ADT information, while the test data has only GEX information.

The task is to predict the ADT of the test data given its GEX. To address the sparsity and redundancy of the data, we applied truncated singular value decomposition (TSVD) to reduce the dimension of both GEX and ADT.

In particular, two TSVD models fit GEX and ADT separately:

  • For GEX, TSVD fits the concatenated data of both training and testing.
  • For ADT, TSVD only fits the training data.

In Deng’s solution, dimensionality is reduced aggressively from 13953 to 300 for GEX and from 134 to 70 for ADT.

The numbers of principal components, 300 and 70, are hyperparameters of the model, obtained through cross-validation and tuning. The reduced versions of the training GEX and ADT data are then fed into KRR with the RBF kernel. Matching Deng’s approach, at inference time, we used the trained KRR model to perform the following tasks:

  • Predict the reduced version of ADT of the test data.
  • Apply the inverse transform of TSVD.
  • Recover the ADT prediction of the test data (a short sketch of this inverse step follows Figure 2).
Figure 2. Model overview. The blocks represent input and output data and the numbers beside the blocks represent the dimensions.
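
As a small aside, the inverse-transform step is just a matrix product with the stored TSVD components, which is why the implementation later in this post recovers the ADT prediction with a single `@ adt_comp`. A quick sketch with random stand-in data:

import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
adt_train = rng.random((1000, 134))            # stand-in for the real ADT matrix

tsvd_adt = TruncatedSVD(n_components=70).fit(adt_train)
reduced_pred = rng.random((5, 70))             # stand-in for KRR predictions

# Multiplying by the stored components is exactly TSVD's inverse transform.
recovered = reduced_pred @ tsvd_adt.components_
assert np.allclose(recovered, tsvd_adt.inverse_transform(reduced_pred))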

Generally, TSVD is the most popular choice to perform dimension reduction for sparse data, typically used during feature engineering. In this case, TSVD is used to reduce the dimension of both the features (GEX) and the targets (ADT). Dimension reduction of the targets makes it much easier for the downstream multi-output regression model because the TSVD outputs are more independent across the columns.

KRR is chosen as the multi-output regression model. Compared to SVM, KRR computes all the columns of the output concurrently, whereas SVM predicts one column at a time; as a result, KRR captures nonlinearity like SVM while being much faster.
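
A toy comparison with random stand-in data (sizes and regularization values are illustrative only) shows the difference in how the two estimators handle multiple outputs in scikit-learn:

import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X, Y = rng.random((500, 300)), rng.random((500, 70))   # reduced GEX -> reduced ADT

# KRR fits all 70 output columns in a single closed-form solve.
krr = KernelRidge(alpha=0.1, kernel="rbf").fit(X, Y)

# An SVM regressor handles one target at a time, so 70 separate models are trained.
svr = MultiOutputRegressor(SVR(kernel="rbf")).fit(X, Y)

print(krr.predict(X[:2]).shape, svr.predict(X[:2]).shape)   # both (2, 70)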

Implementing a GPU-accelerated solution with cuML

cuML is one of the RAPIDS libraries. It contains a suite of GPU-accelerated machine learning algorithms that provide many highly optimized models, including both TSVD and KRR. You can quickly adapt the baseline model from a scikit-learn implementation to a cuML implementation.

In the following code example, we only needed to change six lines of code, three of which are imports. For simplicity, much of the preprocessing and utility code is omitted.

Baseline sklearn implementation:

from sklearn.decomposition import TruncatedSVD
from sklearn.gaussian_process.kernels import RBF
from sklearn.kernel_ridge import KernelRidge

tsvd_gex = TruncatedSVD(n_components=300)
tsvd_adt = TruncatedSVD(n_components=70)

gex_train_test = tsvd_gex.fit_transform(gex_train_test)
gex_train, gex_test = split(gex_train_test)
adt_train = tsvd_adt.fit_transform(adt_train)
adt_comp = tsvd_adt.components_

y_pred = 0
for seed in seeds:
    gex_tr,_,adt_tr,_=train_test_split(gex_train, 
                                       adt_train,
                                       train_size=0.5, 
                                       random_state=seed)
    kernel = RBF(length_scale = scale)
    krr = KernelRidge(alpha=alpha, kernel=kernel)
    krr.fit(gex_tr, adt_tr)
    y_pred += (krr.predict(gex_test) @ adt_comp)
y_pred /= len(seeds)

RAPIDS cuML implementation:

from cuml.decomposition import TruncatedSVD
from cuml.kernel_ridge import KernelRidge
import gc

tsvd_gex = TruncatedSVD(n_components=300)
tsvd_adt = TruncatedSVD(n_components=70)

gex_train_test = tsvd_gex.fit_transform(gex_train_test)
gex_train, gex_test = split(gex_train_test)
adt_train = tsvd_adt.fit_transform(adt_train)
adt_comp = tsvd_adt.components_.to_output('cupy')

y_pred = 0
for seed in seeds:
    gex_tr,_,adt_tr,_=train_test_split(gex_train, 
                                       adt_train,
                                       train_size=0.5, 
                                       random_state=seed)
    krr = KernelRidge(alpha=alpha,kernel='rbf')
    krr.fit(gex_tr, adt_tr)
    gc.collect()
    y_pred += (krr.predict(gex_test) @ adt_comp)
y_pred /= len(seeds)

The syntax of cuML kernels is slightly different from scikit-learn. Instead of creating a standalone kernel object, we specify the kernel type in the KernelRidge constructor, because cuML does not yet support the Gaussian process kernel objects that scikit-learn provides.

Another difference is that explicit garbage collection is needed with the current version of the cuML implementation: reference cycles are created in this particular loop, and the objects are not freed automatically without garbage collection. For more information, see the complete notebooks in the /daxiongshu/rapids_nips_blog GitHub repo.

Results

We compared the cuML implementation of TSVD+KRR against the CPU baseline and other top solutions in the challenge. The GPU solutions run on a single V100 GPU and the CPU solutions run on dual 20-core Intel Xeon CPUs. The metric for the competition is root mean square error (RMSE).

We found that the cuML implementation of TSVD+KRR is 103x faster than the CPU baseline with a slight degradation of the score due to the randomness in the pipeline. However, the score is still better than any other models in the competition.

We also compared our solution with two deep learning models:

  • A multilayer perceptron (MLP) implemented in PyTorch
  • A graph neural network (GNN) implemented in PyTorch

Both deep learning models run on a single V100 GPU. They have many layers with millions of parameters to train and hence are prone to overfitting on this dataset. In comparison, TSVD+KRR has to train fewer than 30K parameters. Figure 4 shows that the cuML TSVD+KRR model is both faster and more accurate than the deep learning models, thanks to its simplicity.

Chart compares RMSE and training time between the proposed TSVD+KRR cuML GPU and three baseline solutions: TSVD+KRR CPU, MLP PyTorch GPU, and GNN PyTorch GPU. The proposed TSVD+KRR cuML GPU is at least 100x faster than the baselines and only slightly worse RMSE than the best baseline.
Figure 4. Performance and training time comparison. The horizontal axis is with a logarithmic scale.

Figure 5 shows a detailed speedup analysis, where we present timings for the two stages of the algorithm: TSVD and KRR. cuML TSVD and KRR are 15x and 103x faster than the CPU baseline, respectively.

Bar chart shows running time breakdown for cuML GPU over sklearn CPU. The TSVD running time is reduced from 120 seconds with sklearn to 8 seconds with cuML. The KRR running time is reduced from 4,140 seconds with sklearn to 40 seconds with cuML.
Figure 5. Run time comparison


Conclusion

Due to its lightning speed and user-friendly API, RAPIDS cuML is incredibly useful for accelerating the analysis of single-cell data. With a few minor code changes, you can boost your existing scikit-learn workflows.

In addition, when dealing with single-cell modality prediction, we recommend starting with cuML TSVD to reduce the dimension of data and KRR for the downstream tasks to achieve the best speedup.

Try out this RAPIDS cuML implementation with the code on the /daxiongshu/rapids_nips_blog GitHub repo.

Categories
Misc

Upcoming Event: Why GPUs Are Important to AI

Join us on October 20 to learn how NVIDIA GPUs can dramatically accelerate your machine learning workloads.


Categories
Misc

NVIDIA, Oracle CEOs in Fireside Chat Light Pathways to Enterprise AI

Speeding adoption of enterprise AI and accelerated computing, Oracle CEO Safra Catz and NVIDIA founder and CEO Jensen Huang discussed their companies’ expanding collaboration in a fireside chat live streamed today from Oracle CloudWorld in Las Vegas. Oracle and NVIDIA announced plans to bring NVIDIA’s full accelerated computing stack to Oracle Cloud Infrastructure (OCI).


Categories
Offsites

Table Tennis: A Research Platform for Agile Robotics

Robot learning has been applied to a wide range of challenging real world tasks, including dexterous manipulation, legged locomotion, and grasping. It is less common to see robot learning applied to dynamic, high-acceleration tasks requiring tight-loop human-robot interactions, such as table tennis. There are two complementary properties of the table tennis task that make it interesting for robotic learning research. First, the task requires both speed and precision, which puts significant demands on a learning algorithm. At the same time, the problem is highly-structured (with a fixed, predictable environment) and naturally multi-agent (the robot can play with humans or another robot), making it a desirable testbed to investigate questions about human-robot interaction and reinforcement learning. These properties have led to several research groups developing table tennis research platforms [1, 2, 3, 4].

The Robotics team at Google has built such a platform to study problems that arise from robotic learning in a multi-player, dynamic and interactive setting. In the rest of this post we introduce two projects, Iterative-Sim2Real (to be presented at CoRL 2022) and GoalsEye (IROS 2022), which illustrate the problems we have been investigating so far. Iterative-Sim2Real enables a robot to hold rallies of over 300 hits with a human player, while GoalsEye enables learning goal-conditioned policies that match the precision of amateur humans.

Iterative-Sim2Real policies playing cooperatively with humans (top) and a GoalsEye policy returning balls to different locations (bottom).

Iterative-Sim2Real: Leveraging a Simulator to Play Cooperatively with Humans
In this project, the goal for the robot is cooperative in nature: to carry out a rally with a human for as long as possible. Since it would be tedious and time-consuming to train directly against a human player in the real world, we adopt a simulation-based (i.e., sim-to-real) approach. However, because it is hard to simulate human behavior accurately, applying sim-to-real learning to tasks that require tight, closed-loop interaction with a human participant is challenging.

In Iterative-Sim2Real (i-S2R), we present a method for learning human behavior models for human-robot interaction tasks, and instantiate it on our robotic table tennis platform. We have built a system that can achieve rallies of up to 340 hits with an amateur human player (shown below).

A 340-hit rally lasting over 4 minutes.

Learning Human Behavior Models: a Chicken and Egg Problem
The central problem in learning accurate human behavior models for robotics is the following: if we do not have a good-enough robot policy to begin with, then we cannot collect high-quality data on how a person might interact with the robot. But without a human behavior model, we cannot obtain robot policies in the first place. An alternative would be to train a robot policy directly in the real world, but this is often slow, cost-prohibitive, and poses safety-related challenges, which are further exacerbated when people are involved. i-S2R, visualized below, is a solution to this chicken and egg problem. It uses a simple model of human behavior as an approximate starting point and alternates between training in simulation and deploying in the real world. In each iteration, both the human behavior model and the policy are refined.

i-S2R Methodology.
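
In pseudocode, the alternation looks roughly like the sketch below; the helper callables are placeholders introduced for illustration, not functions from the paper or its codebase.

def iterative_sim2real(train_in_sim, rally_with_human, fit_human_model,
                       initial_human_model, num_iterations=5):
    # Structural sketch of the i-S2R alternation; the helper callables are
    # supplied by the caller and their names are not from the paper's code.
    human_model, policy, human_data = initial_human_model, None, []
    for _ in range(num_iterations):
        # 1. Train (or fine-tune) the policy in simulation against the
        #    current approximation of human play.
        policy = train_in_sim(policy, human_model)
        # 2. Deploy on the real robot, rally with a person, and log the
        #    human's actual returns.
        human_data += rally_with_human(policy)
        # 3. Refine the human behavior model with the newly collected data.
        human_model = fit_human_model(human_data)
    return policy, human_model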

Results
To evaluate i-S2R, we repeated the training process five times with five different human opponents and compared it with a baseline approach of ordinary sim-to-real plus fine-tuning (S2R+FT). When aggregated across all players, the i-S2R rally length is about 9% higher than that of S2R+FT (below on the left). The histogram of rally lengths for i-S2R and S2R+FT (below on the right) shows that a large fraction of the rallies for S2R+FT are short (i.e., fewer than 5 hits), while i-S2R achieves longer rallies more frequently.

Summary of i-S2R results. Boxplot details: The white circle is the mean, the horizontal line is the median, box bounds are the 25th and 75th percentiles.

We also break down the results by player type: beginner (40% of players), intermediate (40% of players), and advanced (20% of players). We see that i-S2R significantly outperforms S2R+FT for both beginner and intermediate players (80% of players).

i-S2R Results by player type.

More details on i-S2R can be found on our preprint, website, and also in the following summary video.

GoalsEye: Learning to Return Balls Precisely on a Physical Robot
While we focused on sim-to-real learning in i-S2R, it is sometimes desirable to learn using only real-world data — closing the sim-to-real gap in this case is unnecessary. Imitation learning (IL) provides a simple and stable approach to learning in the real world, but it requires access to demonstrations and cannot exceed the performance of the teacher. Collecting expert human demonstrations of precise goal-targeting in high speed settings is challenging and sometimes impossible (due to limited precision in human movements). While reinforcement learning (RL) is well-suited to such high-speed, high-precision tasks, it faces a difficult exploration problem (especially at the start), and can be very sample inefficient. In GoalsEye, we demonstrate an approach that combines recent behavior cloning techniques [5, 6] to learn a precise goal-targeting policy, starting from a small, weakly-structured, non-targeting dataset.

Here we consider a different table tennis task with an emphasis on precision. We want the robot to return the ball to an arbitrary goal location on the table, e.g., “hit the back left corner” or “land the ball just over the net on the right side” (see the left video below). Further, we want a method that can be applied directly in our real-world table tennis environment with no simulation involved. We found that the synthesis of two existing imitation learning techniques, Learning from Play (LFP) and Goal-Conditioned Supervised Learning (GCSL), scales to this setting. It is safe and sample efficient enough to train a policy on a physical robot that is as accurate as amateur humans at the task of returning balls to specific goals on the table.

 
GoalsEye policy aiming at a 20cm diameter goal (left). Human player aiming at the same goal (right).

The essential ingredients of success are:

  1. A minimal, but non-goal-directed “bootstrap” dataset of the robot hitting the ball to overcome an initial difficult exploration problem.
  2. Hindsight relabeled goal conditioned behavioral cloning (GCBC) to train a goal-directed policy to reach any goal in the dataset.
  3. Iterative self-supervised goal reaching. The agent improves continuously by setting random goals and attempting to reach them using the current policy. All attempts are relabeled and added into a continuously expanding training set. This self-practice, in which the robot expands the training data by setting and attempting to reach goals, is repeated iteratively (a structural sketch of the full loop follows the methodology figure below).
GoalsEye methodology.
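
Putting the three ingredients together, the training loop looks roughly like the following sketch; as with the i-S2R sketch above, the helper callables and field names are placeholders rather than code from the paper.

def goals_eye_training(bootstrap_episodes, train_gcbc, run_on_robot,
                       sample_goal, num_self_practice_rounds=10000):
    # Structural sketch; helper callables and field names are placeholders,
    # not code from the paper.
    # 1. Hindsight relabeling: the spot where the ball actually landed becomes
    #    the "goal" for that episode, turning untargeted play into
    #    goal-conditioned demonstrations.
    dataset = [(ep["obs"], ep["actions"], ep["landing_point"])
               for ep in bootstrap_episodes]
    policy = train_gcbc(dataset)

    # 2. Iterative self-supervised goal reaching: set a random goal, attempt it,
    #    relabel the attempt with the landing point actually achieved, and retrain.
    for _ in range(num_self_practice_rounds):
        goal = sample_goal()
        episode = run_on_robot(policy, goal)
        dataset.append((episode["obs"], episode["actions"], episode["landing_point"]))
        policy = train_gcbc(dataset)
    return policy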

Demonstrations and Self-Improvement Through Practice Are Key
The synthesis of techniques is crucial. The policy’s objective is to return a variety of incoming balls to any location on the opponent’s side of the table. A policy trained on only the initial 2,480 demonstrations accurately reaches within 30 cm of the goal just 9% of the time. However, after the policy has self-practiced for ~13,500 attempts, goal-reaching accuracy rises to 43% (below on the right). This improvement is clearly visible in the videos below. Yet if a policy only self-practices, training fails completely in this setting. Interestingly, increasing the number of demonstrations improves the efficiency of subsequent self-practice, albeit with diminishing returns. This indicates that demonstration data and self-practice could be traded off depending on the relative time and cost of gathering demonstrations compared with self-practice.

Self-practice substantially improves accuracy. Left: simulated training. Right: real robot training. The demonstration datasets contain ~2,500 episodes, both in simulation and the real world.
 
Visualizing the benefits of self-practice. Left: policy trained on initial 2,480 demonstrations. Right: policy after an additional 13,500 self-practice attempts.

More details on GoalsEye can be found in the preprint and on our website.

Conclusion and Future Work
We have presented two complementary projects using our robotic table tennis research platform. i-S2R learns RL policies that are able to interact with humans, while GoalsEye demonstrates that learning from real-world unstructured data combined with self-supervised practice is effective for learning goal-conditioned policies in a precise, dynamic setting.

One interesting research direction to pursue on the table tennis platform would be to build a robot “coach” that could adapt its play style according to the skill level of the human participant to keep things challenging and exciting.

Acknowledgements
We thank our co-authors, Saminda Abeyruwan, Alex Bewley, Krzysztof Choromanski, David B. D’Ambrosio, Tianli Ding, Deepali Jain, Corey Lynch, Pannag R. Sanketi, Pierre Sermanet and Anish Shankar. We are also grateful for the support of many members of the Robotics Team who are listed in the acknowledgement sections of the papers.

Categories
Misc

Meta’s Grand Teton Brings NVIDIA Hopper to Its Data Centers

Meta today announced its next-generation AI platform, Grand Teton, including NVIDIA’s collaboration on design. Compared to the company’s previous-generation Zion EX platform, the Grand Teton system packs in more memory, network bandwidth and compute capacity, said Alexis Bjorlin, vice president of Meta Infrastructure Hardware, at the 2022 OCP Global Summit, an Open Compute Project event.


Categories
Misc

Oracle and NVIDIA Partner to Speed AI Adoption for Enterprises

Expanding their longstanding alliance, Oracle and NVIDIA today announced a multi-year partnership to help customers solve business challenges with accelerated computing and AI. The collaboration aims to bring the full NVIDIA accelerated computing stack — from GPUs to systems to software — to Oracle Cloud Infrastructure (OCI).

Categories
Misc

Adobe MAX Kicks Off With Creative App Updates and 3D Artist Anna Natter Impresses This Week ‘In the NVIDIA Studio’

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. In the coming weeks, we’ll be deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically improve creative workflows.


Categories
Misc

Souped-Up Auto Quotes: ProovStation Delivers GPU-Driven AI Appraisals

Vehicle appraisals are getting souped up with a GPU-accelerated AI overhaul. ProovStation, a four-year-old startup based in Lyon, France, is taking on the ambitious computer-vision quest of automating vehicle inspection and repair estimates, aiming AI-driven super-high-resolution stations at businesses worldwide. It recently launched three of its state-of-the-art vehicle inspection scanners at locations of French retail giant Carrefour.
