Categories
Misc

Dealing with Outliers Using Three Robust Linear Regression Models

Photo by Ricardo Gomez Angel on UnsplashLinear regression is one of the simplest machine learning models out there. It is often the starting point not only for learning about data science but also for building quick and…Photo by Ricardo Gomez Angel on Unsplash

Linear regression is one of the simplest machine learning models out there. It is often the starting point not only for learning about data science but also for building quick and simple minimum viable products (MVPs), which then serve as benchmarks for more complex algorithms.

In general, linear regression fits a line (in two dimensions) or a hyperplane (in three and more dimensions) that best describes the linear relationship between the features and the target value. The algorithm also assumes that the probability distributions of the features are well-behaved; for example, they follow the Gaussian distribution.

Outliers are values that are located far outside of the expected distribution. They cause the distributions of the features to be less well-behaved. As a consequence, the model can be skewed towards the outlier values, which, as I’ve already established, are far away from the central mass of observations. Naturally, this leads to the linear regression finding a worse and more biased fit with inferior predictive performance.

It is important to remember that the outliers can be found both in the features and the target variable, and all the scenarios can worsen the performance of the model.

There are many possible approaches to dealing with outliers: removing them from the observations, treating them (capping the extreme observations at a reasonable value, for example), or using algorithms that are well-suited for dealing with such values on their own. This post focuses on these robust methods.

Setup

I use fairly standard libraries: numpy, pandas, scikit-learn. All the models I work with here are imported from the linear_model module of scikit-learn.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import datasets
from sklearn.linear_model import (LinearRegression, HuberRegressor,
                              	RANSACRegressor, TheilSenRegressor)

Data

Given that the goal is to show how different robust algorithms deal with outliers, the first step is to create a tailor-made dataset to show clearly the differences in the behavior. To do so, use the functionalities available in scikit-learn

Start with creating a dataset of 500 observations, with one informative feature. With only one feature and the target, plot the data, together with the models’ fits. Also, specify the noise (standard deviation applied to the output) and create a list containing the coefficient of the underlying linear model; that is, what the coefficient would be if the linear regression model was fit to the generated data. In this example, the value of the coefficient is 64.6. Extract those coefficients for all the models and use them to compare how well they fit the data.

Next, replace the first 25 observations (5% of the observations) with outliers, far outside of the mass of generated observations. Bear in mind that the coefficient stored earlier comes from the data without outliers. Including them makes a difference.

N_SAMPLES = 500
N_OUTLIERS = 25

X, y, coef = datasets.make_regression(
	n_samples=N_SAMPLES,
	n_features=1,
	n_informative=1,
	noise=20,
	coef=True,
	random_state=42
)
coef_list = [["original_coef", float(coef)]]

# add outliers          	 
np.random.seed(42)
X[:N_OUTLIERS] = 10 + 0.75 * np.random.normal(size=(N_OUTLIERS, 1))
y[:N_OUTLIERS] = -15 + 20 * np.random.normal(size=N_OUTLIERS)

plt.scatter(X, y);
Graph showing the generated data, together with the outliers, which are far away from the main bulk of the observations.
Figure 1. The generated data and the outliers that have been manually added

Linear regression

Start with the good old linear regression model, which is likely highly influenced by the presence of the outliers. Fit the model to the data using the following example:

lr = LinearRegression().fit(X, y)
coef_list.append(["linear_regression", lr.coef_[0]])

Then prepare an object to use for plotting the fits of the models. The plotline_X object is a 2D array containing evenly spaced values within the interval dictated by the generated data set. Use this object for getting the fitted values for the models. It must be a 2D array, given it is the expected input of the models in scikit-learn. Then create a fit_df DataFrame in which to store the fitted values, created by fitting the models to the evenly spaced values.

plotline_X = np.arange(X.min(), X.max()).reshape(-1, 1)

fit_df = pd.DataFrame(
	index = plotline_X.flatten(),
	data={"linear_regression": lr.predict(plotline_X)}
)

Having prepared the DataFrame, plot the fit of the linear regression model to the data with outliers.

fix, ax = plt.subplots()
fit_df.plot(ax=ax)
plt.scatter(X, y, c="k")
plt.title("Linear regression on data with outliers");

Figure 2 shows the significant impact that outliers have on the linear regression model.

Graph showing the impact of the outliers on the linear regression model.
Figure 2. The fit of the linear regression model to the data with outliers

The benchmark model has been obtained using linear regression. Now it is time to move toward robust regression algorithms.

Huber regression

Huber regression is an example of a robust regression algorithm that assigns less weight to observations identified as outliers. To do so, it uses the Huber loss in the optimization routine. Here’s a better look at what is actually happening in this model.

Huber regression minimizes the following loss function:

minlimits_{omega,sigma}sumlimits_{i=1}^{n}(sigma+H_{epsilon}(frac{X_iomega-y_i}{sigma})sigma)+alpha|omega|2^2

Where sigma denotes the standard deviation, X_i represents the set of features, y_i is the regression’s target variable, omega is a vector of the estimated coefficients and alpha is the regularization parameter. The formula also indicates that outliers are treated differently from the regular observations according to the Huber loss:

H_{epsilon}(z)=begin{cases} z^2, & text{if}|z|<epsilon \ 2epsilon|z|-epsilon^2, & text{otherwise}end{cases}

The Huber loss identifies outliers by considering the residuals, denoted by z. If the observation is considered to be regular (because the absolute value of the residual is smaller than some threshold epsilon, then apply the squared loss function. Otherwise, the observation is considered to be an outlier and you apply the absolute loss. Having said that, Huber loss is basically a combination of the squared and absolute loss functions.

An inquisitive reader might notice that the first equation is similar to Ridge regression, that is, including the L2 regularization. The difference between Huber regression and Ridge regression lies in the treatment of outliers.

You might recognize this approach to loss functions from analyzing the differences between two of the popular regression evaluation metrics: mean squared error (MSE) and mean absolute error (MAE). Similar to what the Huber loss implies, I recommend using MAE when you are dealing with outliers, as it does not penalize those observations as heavily as the squared loss does. 

Connected to the previous point is the fact that optimizing the squared loss results in an unbiased estimator around the mean, while the absolute difference leads to an unbiased estimator around the median. The median is much more robust to outliers than the mean, so expect this to provide a less biased estimate.

Use the default value of 1.35 for epsilon, which determines the regression’s sensitivity to outliers. Huber (2004) shows that when the errors follow a normal distribution with sigma = 1 and epsilon = 1.35, an efficiency of 95% is achieved relative to the OLS regression.

For your own use cases, I recommend tuning the hyperparameters alpha and epsilon, using a method such as grid search. 

Fit the Huber regression to the data using the following example:

huber = HuberRegressor().fit(X, y)
fit_df["huber_regression"] = huber.predict(plotline_X)
coef_list.append(["huber_regression", huber.coef_[0]])

Figure 3 presents the fitted model’s best fit line.

Graph showing the fit of the Huber regression model to the data with outliers.
Figure 3. The fit of the Huber regression model to the data with outliers

RANSAC regression

Random sample consensus (RANSAC) regression is a non-deterministic algorithm that tries to separate the training data into inliers (which may be subject to noise) and outliers. Then, it estimates the final model only using the inliers.

RANSAC is an iterative algorithm in which iteration consists of the following steps:

  1. Select a random subset from the initial data set.
  2. Fit a model to the selected random subset. By default, that model is a linear regression model; however, you can change it to other regression models.
  3. Use the estimated model to calculate the residuals for all the data points in the initial data set. All observations with absolute residuals smaller than or equal to the selected threshold are considered inliers and create the so-called consensus set. By default, the threshold is defined as the median absolute deviation (MAD) of the target values.
  4. The fitted model is saved as the best one if sufficiently many points have been classified as part of the consensus set. If the current estimated model has the same number of inliers as the current best one, it is only considered to be better if it has a better score.

The steps are performed iteratively either a maximum number of times or until a special stop criterion is met. Those criteria can be set using three dedicated hyperparameters. As I mentioned earlier, the final model is estimated using all inlier samples.

Fit the RANSAC regression model to the data.

ransac = RANSACRegressor(random_state=42).fit(X, y)
fit_df["ransac_regression"] = ransac.predict(plotline_X)
ransac_coef = ransac.estimator_.coef_
coef_list.append(["ransac_regression", ransac.estimator_.coef_[0]])

As you can see, the procedure for recovering the coefficient is a bit more complex, as it’s first necessary to access the final estimator of the model (the one trained using all the identified inliers) using estimator_. As it is a LinearRegression object, proceed to recover the coefficient as you did earlier. Then, plot the fit of the RANSAC regression (Figure 4).

Graph showing the fit of the RANSAC regression model to the data with outliers.
Figure 4. The fit of the RANSAC regression model to the data with outliers

With RANSAC regression, you can also inspect the observations that the model considered to be inliers and outliers. First, check how many outliers the model identified in total and then how many of those that were manually introduced overlap with the models’ decision. The first 25 observations of the training data are all the outliers that have been introduced.

inlier_mask = ransac.inlier_mask_
outlier_mask = ~inlier_mask
print(f"Total outliers: {sum(outlier_mask)}")
print(f"Outliers you added yourself: {sum(outlier_mask[:N_OUTLIERS])} / {N_OUTLIERS}")

Running the example prints the following summary:

Total outliers: 51
Outliers you added yourself: 25 / 25

Roughly 10% of data was identified as outliers and all the observations introduced were correctly classified as outliers. It’s then possible to quickly visualize the inliers compared to outliers to see the remaining 26 observations flagged as outliers.

plt.scatter(X[inlier_mask], y[inlier_mask], color="blue", label="Inliers")
plt.scatter(X[outlier_mask], y[outlier_mask], color="red", label="Outliers")
plt.title("RANSAC - outliers vs inliers");

Figure 5 shows that the observations located farthest from the hypothetical best-fit line of the original data are considered outliers.

Graph showing inliers compared to outliers as identified by the RANSAC algorithm
Figure 5. Inliers compared to outliers as identified by the RANSAC algorithm

Theil-Sen regression

The last of the robust regression algorithms available in scikit-learn is the Theil-Sen regression. It is a non-parametric regression method, which means that it makes no assumption about the underlying data distribution. In short, it involves fitting multiple regression models on subsets of the training data and then aggregating the coefficients at the last step.

Here’s how the algorithm works. First, it calculates the least square solutions (slopes and intercepts) on subsets of size p (hyperparameter n_subsamples) created from all the observations in the training set X. If you calculate the intercept (it is optional), then the following condition must be satisfied p >= n_features + 1. The final slope of the line (and possibly the intercept) is defined as the (spatial) median of all the least square solutions.

A possible downside of the algorithm is its computational complexity, as it can consider a total number of least square solutions equal to n_samples choose n_subsamples, where n_samples is the number of observations in X. Given that this number can quickly explode in size, there are a few things that can be done:

  • Use the algorithm only for small problems in terms of the number of samples and features. However, for obvious reasons, this might not always be feasible.
  • Tune the n_subsamples hyperparameter. A lower value leads to higher robustness to outliers at the cost of lower efficiency, while a higher value leads to lower robustness and higher efficiency.
  • Use the max_subpopulation hyperparameter. If the total value of n_samples choose n_subsamples is larger than max_subpopulation, the algorithm only considers a stochastic subpopulation of a given maximal size. Naturally, using only a random subset of all the possible combinations leads to the algorithm losing some of its mathematical properties.

Also, be aware that the estimator’s robustness decreases quickly with the dimensionality of the problem. To see how that works out in practice, estimate the Theil-Sen regression using the following example:

theilsen = TheilSenRegressor(random_state=42).fit(X, y)
fit_df["theilsen_regression"] = theilsen.predict(plotline_X)
coef_list.append(["theilsen_regression", theilsen.coef_[0]])
Graph showing the Theil-Sen regression results in a similar fit to the RANSAC model.
Figure 6. The fit of the Theil-Sen regression model to the data with outliers

Comparison of the models

So far, three robust regression algorithms have been fitted to the data containing outliers and the individual best fit lines have been identified. Now it is time for a comparison.

Start with the visual inspection of Figure 7. To show too many lines, the fit line of the original data is not printed. However, it is quite easy to imagine what it looks like, given the direction of the majority of the data points. Clearly, the RANSAC and Theil-Sen regressions have resulted in the most accurate best fit lines.

Graph showing a comparison of all the considered regression models.
Figure 7. Comparison of all the considered regression models

To be more precise, look at the estimated coefficients. Table 1 shows that the RANSAC regression results in the fit closest to the one of the original data. It is also interesting to see how big of an impact the 5% of outliers had on the regular linear regression’s fit.

model coefficient
0 original_coef 64.59
1 linear_regression 8.77
2 huber_regression 37.52
3 ransac_regression 62.85
4 theilsen_regression 59.49
Table 1. The comparison of the coefficients of the different models fitted to the data with outliers

You might ask which robust regression algorithm is the best? As is often the case, the answer is, “It depends.” Here are some guidelines that might help you find the right model for your specific problem:

  • In general, robust fitting in a high-dimensional setting is difficult. 
  • In contrast to Theil-Sen and RANSAC, Huber regression is not trying to completely filter out the outliers. Instead, it lessens their effect on the fit.
  • Huber regression should be faster than RANSAC and Theil-Sen, as the latter ones fit on smaller subsets of the data. 
  • Theil-Sen and RANSAC are unlikely to be as robust as the Huber regression using the default hyperparameters.
  • RANSAC is faster than Theil-Sen and it scales better with the number of samples.
  • RANSAC should deal better with large outliers in the y-direction, which is the most common scenario.

Taking all the preceding information into consideration, you might also empirically experiment with all three robust regression algorithms and see which one fits your data best. 

You can find the code used in this post in my /erykml GitHub repo. I look forward to hearing from you in the comments.

Categories
Misc

Lucid Motors’ Mike Bell on Software-Defined Innovation for the Luxury EV Brand

AI and electric vehicle technology breakthroughs are transforming the automotive industry. These developments pave the way for new innovators, attracting technical prowess and design philosophies from Silicon Valley. Mike Bell, senior vice president of digital at Lucid Motors, sees continuous innovation coupled with over-the-air updates as key to designing sustainable, award-winning intelligent vehicles that provide Read article >

The post Lucid Motors’ Mike Bell on Software-Defined Innovation for the Luxury EV Brand appeared first on NVIDIA Blog.

Categories
Offsites

Simplified Transfer Learning for Chest Radiography Model Development

Every year, nearly a billion chest X-ray (CXR) images are taken globally to aid in the detection and management of health conditions ranging from collapsed lungs to infectious diseases. Generally, CXRs are cheaper and more accessible than other forms of medical imaging. However, existing challenges continue to impede the optimal use of CXRs. For example, in some areas, trained radiologists that can accurately interpret CXR images are in short supply. In addition, interpretation variability between experts, workflow differences between institutions, and the presence of rare conditions familiar only to subspecialists all contribute to making high-quality CXR interpretation a challenge.

Recent research has leveraged machine learning (ML) to explore potential solutions for some of these challenges. There is significant interest and effort devoted to building deep learning models that detect abnormalities in CXRs and improve access, accuracy, and efficiency to identify diseases and conditions that affect the heart and lungs. However, building robust CXR models requires large labeled training datasets, which can be prohibitively expensive and time-consuming to create. In some cases, such as working with underrepresented populations or studying rare medical conditions, only limited data are available. Additionally, CXR images vary in quality across populations, geographies, and institutions, making it difficult to build robust models that perform well globally.

In “Simplified Transfer Learning for Chest Radiography Models Using Less Data”, published in the journal Radiology, we describe how Google Health utilizes advanced ML methods to generate pre-trained “CXR networks” that can convert CXR images to embeddings (i.e., information-rich numerical vectors) to enable the development of CXR models using less data and fewer computational resources. We demonstrate that even with less data and compute, this approach has enabled performance comparable to state-of-the-art deep learning models across various prediction tasks. We are also excited to announce the release of CXR Foundation, a tool that utilizes our CXR-specific network to enable developers to create custom embeddings for their CXR images. We believe this work will help accelerate the development of CXR models, aiding in disease detection and contributing to more equitable health access throughout the world.

Developing a Chest X-ray Network
A common approach to building medical ML models is to pre-train a model on a generic task using non-medical datasets and then refine the model on a target medical task. This process of transfer learning may improve the target task performance or at least speed up convergence by applying the understanding of natural images to medical images. However, transfer learning may still require large labeled medical datasets for the refinement step.

Expanding on this standard approach, our system supports modeling CXR-specific tasks through a three-step model training setup composed of (1) generic image pre-training similar to traditional transfer learning, (2) CXR-specific pre-training, and (3) task-specific training. The first and third steps are common in ML: first pre-training on a large dataset and labels that are not specific to the desired task, and then fine-tuning on the task of interest.

We built a CXR-specific image classifier that employs supervised contrastive learning (SupCon). SupCon pulls together representations of images that have the same label (e.g., abnormal) and pushes apart representations of images that have a different label (e.g., one normal image and one abnormal image). We pre-trained this model on de-identified CXR datasets of over 800,000 images generated in partnership with Northwestern Medicine and Apollo Hospitals in the US and India, respectively. We then leveraged noisy abnormality labels from natural language processing of radiology reports to build our “CXR-specific” network.

This network creates embeddings (i.e., information-rich numerical vectors that can be used to distinguish classes from each other) that can more easily train models for specific medical prediction tasks, such as image finding (e.g., airspace opacity), clinical condition (e.g., tuberculosis), or patient outcome (e.g., hospitalization). For example, the CXR network can generate embeddings for every image in a given CXR dataset. For these images, the generated embeddings and the labels for the desired target task (such as tuberculosis) are used as examples to train a small ML model.

Left: Training a CXR model for a given task generally requires a large number of labeled images and a significant amount of computational resources to create a foundation of neural network layers. Right: With the CXR network and tool providing this foundation, each new task requires only a fraction of the labeled images, computational resources, and neural network parameters compared to rebuilding the entire network from scratch.

Effects of CXR Pre-training
We visualized these embedding layers at each step of the process using airspace opacity as an example (see the figure below). Before SupCon-based pre-training, there was poor separation of normal and abnormal CXR embeddings. After SupCon-based pre-training, the positive examples were grouped more closely together, and the negative examples more closely together as well, indicating that the model had identified that images from each category resembled themselves.

Visualizations of the t-distributed stochastic neighbor embedding for generic vs. CXR-specific network embeddings. Embeddings are information-rich numerical vectors that alone can distinguish classes from each other, in this case, airspace opacity positive vs. negative.

Our research suggests that adding the second stage of pre-training enables high-quality models to be trained with up to 600-fold less data in comparison to traditional transfer learning approaches that leverage pre-trained models on generic, non-medical datasets. We found this to be true regardless of model architecture (e.g., ResNet or EfficientNet) or dataset used for natural image pre-training (e.g., ImageNet or JFT-300M). With this approach, researchers and developers can significantly reduce dataset size requirements.

Top: In a deep learning model, the neural network contains multiple layers of artificial neurons, with the first layer taking the CXR image as input, intermediate layers doing additional computation, and the final layer making the classification (e.g., airspace opacity: present vs. absent). The embedding layer is usually one of the last layers. Bottom left: The traditional transfer learning approach involves a two-step training setup where a generic pre-trained network is optimized directly on a prediction task of interest. Our proposed three-step training setup generates a CXR network using a SupCon ML technique (step 2) before optimization for prediction tasks of interest (step 3). Bottom right: Using the embeddings involves either training smaller models (the first two strategies) or fine-tuning the whole network if there are sufficient data (strategy 3).

Results
After training the initial model, we measured performance using the area under the curve (AUC) metric with both linear and non-linear models applied to CXR embeddings; and a non-linear model produced by fine-tuning the entire network. On public datasets, such as ChestX-ray14 and CheXpert, our work substantially and consistently improved the data-accuracy tradeoff for models developed across a range of training dataset sizes and several findings. For example, when evaluating the tool’s ability to develop tuberculosis models, data efficiency gains were more striking: models trained on the embeddings of just 45 images achieved non-inferiority to radiologists in detecting tuberculosis on an external validation dataset. For both tuberculosis and severe COVID-19 outcomes, we show that non-linear classifiers trained on frozen embeddings outperformed a model that was fine-tuned on the entire dataset.

Comparing CXR-specific networks for transfer learning (red), with a baseline transfer learning approach (blue) across a variety of CXR abnormalities (top left), tuberculosis (bottom left), and COVID-19 outcomes (bottom right). This approach improves performance at the same dataset size, or reduces the dataset size required to reach the same performance. Interestingly, using the CXR network with simpler ML models that are faster to train (red) performs better than training the full network (black) at dataset sizes up to 85 images.

Conclusion and Future Work
To accelerate CXR modeling efforts with low data and computational requirements, we are releasing our CXR Foundation tool, along with scripts to train linear and nonlinear classifiers. Via these embeddings, this tool will allow researchers to jump-start CXR modeling efforts using simpler transfer learning methods. This approach can be particularly useful for predictive modeling using small datasets, and for adapting CXR models when there are distribution shifts in patient populations (whether over time or across different institutions). We are excited to continue working with partners, such as Northwestern Medicine and Apollo Hospitals, to explore the impact of this technology further. By enabling researchers with limited data and compute to develop CXR models, we’re hoping more developers can solve the most impactful problems for their populations.

Acknowledgements
Key contributors to this project at Google include Christina Chen, Yun Liu, Dilip Krishnan, Zaid Nabulsi, Atilla Kiraly, Arnav Agharwal, Eric Wu, Yuanzhen Li, Aaron Maschinot, Aaron Sarna, Jenny Huang, Marilyn Zhang, Charles Lau, Neeral Beladia, Daniel Tse, Krish Eswaran, and Shravya Shetty. Significant contributions and input were also made by collaborators Sreenivasa Raju Kalidindi, Mozziyar Etemadi, Florencia Garcia-Vicente, and David Melnick. For the ChestX-ray14 dataset, we thank the NIH Clinical Center for making it publicly available. The authors would also like to acknowledge many members of the Google Health Radiology and labeling software teams. Sincere appreciation also goes to the radiologists who enabled this work with their image interpretation and annotation efforts throughout the study; Jonny Wong for coordinating the imaging annotation work; Craig Mermel and Akinori Mitani for providing feedback on the manuscript; Nicole Linton and Lauren Winer for feedback on the blogpost; and Tom Small for the animation.

Categories
Offsites

Google at ICML 2022

Google is a leader in machine learning (ML) research with groups innovating across virtually all aspects of the field, from theory to application. We build machine learning systems to solve deep scientific and engineering challenges in areas of language, music, visual processing, algorithm development, and more. Core to our approach is to actively engage with the broader research community by open-sourcing datasets and models, publishing our discoveries, and actively participating in leading conferences.

Google is proud to be a Diamond Sponsor of the thirty-ninth International Conference on Machine Learning (ICML 2022), a premier annual conference, which is being held this week in Baltimore, Maryland. Google has a strong presence at this year’s conference with over 100 accepted publications and active involvement in a number of workshops and tutorials. We look forward to sharing some of our extensive ML research and expanding our partnership with the broader ML research community.

Registered for ICML 2022? We hope you’ll visit the Google booth to learn more about the exciting work, creativity, and fun that goes into solving a portion of the field’s most interesting challenges. Take a look below to learn more about the Google research being presented at ICML 2022 (Google affiliations in bold).

Organizing Committee

Tutorial Chairs include: Hanie Sedghi

Emeritus Members include: Andrew McCallum

Board Members include: Hugo Larochelle, Csaba Szepesvari, Corinna Cortes

Publications

Individual Preference Stability for Clustering
Saba Ahmadi, Pranjal Awasthi, Samir Khuller, Matthäus Kleindessner, Jamie Morgenstern, Pattara Sukprasert, Ali Vakilian

Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning
Utku Evci, Vincent Dumoulin, Hugo Larochelle, Michael Mozer

H-Consistency Bounds for Surrogate Loss Minimizers
Pranjal Awasthi, Anqi Mao, Mehryar Mohri, Yutao Zhong

Cooperative Online Learning in Stochastic and Adversarial MDPs
Tal Lancewicki, Aviv Rosenberg, Yishay Mansour

Do More Negative Samples Necessarily Hurt in Contrastive Learning?
Pranjal Awasthi, Nishanth Dikkala, Pritish Kamath

Deletion Robust Submodular Maximization Over Matroids
Paul Dütting, Federico Fusco*, Silvio Lattanzi, Ashkan Norouzi-Fard, Morteza Zadimoghaddam

Tight and Robust Private Mean Estimation with Few Users
Hossein Esfandiari, Vahab Mirrokni, Shyam Narayanan*

Generative Trees: Adversarial and Copycat
Richard Nock, Mathieu Guillame-Bert

Agnostic Learnability of Halfspaces via Logistic Loss
Ziwei Ji*, Kwangjun Ahn*, Pranjal Awasthi, Satyen Kale, Stefani Karp

Adversarially Trained Actor Critic for Offline Reinforcement Learning
Ching-An Cheng, Tengyang Xie, Nan Jiang, Alekh Agarwal

Unified Scaling Laws for Routed Language Models
Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, George van den Driessche, Eliza Rutherford, Tom Hennigan, Matthew Johnson, Albin Cassirer, Chris Jones, Elena Buchatskaya, David Budden, Laurent Sifre, Simon Osindero, Oriol Vinyals, Marc’Aurelio Ranzato, Jack Rae, Erich Elsen, Koray Kavukcuogu, Karen Simonyan

Large Batch Experience Replay
Thibault Lahire, Matthieu Geist, Emmanuel Rachelson

Robust Training of Neural Networks Using Scale Invariant Architectures
Zhiyuan Li*, Srinadh Bhojanapalli, Manzil Zaheer, Sashank J. Reddi, Sanjiv Kumar

The Poisson Binomial Mechanism for Unbiased Federated Learning with Secure Aggregation
Wei-Ning Chen, Ayfer Ozgur, Peter Kairouz

Global Optimization Networks
Sen Zhao, Erez Louidor, Maya Gupta

A Joint Exponential Mechanism for Differentially Private Top-k
Jennifer Gillenwater, Matthew Joseph, Andres Munoz Medina, Mónica Ribero

On the Practicality of Deterministic Epistemic Uncertainty
Janis Postels, Mattia Segu, Tao Sun, Luc Van Gool, Fisher Yu, Federico Tombari

Balancing Discriminability and Transferability for Source-Free Domain Adaptation
Jogendra Nath Kundu, Akshay Kulkarni, Suvaansh Bhambri, Deepesh Mehta, Shreyas Kulkarni, Varun Jampani, Venkatesh Babu Radhakrishnan

Transfer and Marginalize: Explaining Away Label Noise with Privileged Information
Mark Collier, Rodolphe Jenatton, Efi Kokiopoulou, Jesse Berent

In Defense of Dual-Encoders for Neural Ranking
Aditya Menon, Sadeep Jayasumana, Ankit Singh Rawat, Seungyeon Kim, Sashank Jakkam Reddi, Sanjiv Kumar

Surrogate Likelihoods for Variational Annealed Importance Sampling
Martin Jankowiak, Du Phan

Translatotron 2: High-Quality Direct Speech-to-Speech Translation with Voice Preservation (see blog post)
Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, Roi Pomerantz

Differentially Private Approximate Quantiles
Haim Kaplan, Shachar Schnapp, Uri Stemmer

Continuous Control with Action Quantization from Demonstrations
Robert Dadashi, Léonard Hussenot, Damien Vincent, Sertan Girgin, Anton Raichuk, Matthieu Geist, Olivier Pietquin

Data Scaling Laws in NMT: The Effect of Noise and Architecture
Yamini Bansal*, Behrooz Ghorbani, Ankush Garg, Biao Zhang, Maxim Krikun, Colin Cherry, Behnam Neyshabur, Orhan Firat

Debiaser Beware: Pitfalls of Centering Regularized Transport Maps
Aram-Alexandre Pooladian, Marco Cuturi, Jonathan Niles-Weed

A Context-Integrated Transformer-Based Neural Network for Auction Design
Zhijian Duan, Jingwu Tang, Yutong Yin, Zhe Feng, Xiang Yan, Manzil Zaheer, Xiaotie Deng

Algorithms for the Communication of Samples
Lucas Theis, Noureldin Yosri

Being Properly Improper
Tyler Sypherd, Richard Nock, Lalitha Sankar

Guarantees for Epsilon-Greedy Reinforcement Learning with Function Approximation
Chris Dann, Yishay Mansour, Mehryar Mohri, Ayush Sekhari, Karthik Sridharan

Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error
Scott Fujimoto, David Meger, Doina Precup, Ofir Nachum, Shixiang Shane Gu

Public Data-Assisted Mirror Descent for Private Model Training
Ehsan Amid, Arun Ganesh*, Rajiv Mathews, Swaroop Ramaswamy, Shuang Song, Thomas Steinke, Vinith M. Suriyakumar*, Om Thakkar, Abhradeep Thakurta

Deep Hierarchy in Bandits
Joey Hong, Branislav Kveton, Sumeet Katariya, Manzil Zaheer, Mohammad Ghavamzadeh

Scalable Deep Reinforcement Learning Algorithms for Mean Field Games
Mathieu Lauriere, Sarah Perrin, Sertan Girgin, Paul Muller, Ayush Jain, Theophile Cabannes, Georgios Piliouras, Julien Perolat, Romuald Elie, Olivier Pietquin, Matthieu Geist

Faster Privacy Accounting via Evolving Discretization
Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi

HyperPrompt: Prompt-Based Task-Conditioning of Transformers
Yun He*, Huaixiu Steven Zheng, Yi Tay, Jai Gupta, Yu Du, Vamsi Aribandi, Zhe Zhao, YaGuang Li, Zhao Chen, Donald Metzler, Heng-Tze Cheng, Ed H. Chi

Blocks Assemble! Learning to Assemble with Large-Scale Structured Reinforcement Learning
Seyed Kamyar, Seyed Ghasemipour, Daniel Freeman, Byron David, Shixiang Shane Gu, Satoshi Kataoka, Igor Mordatch

Latent Diffusion Energy-Based Model for Interpretable Text Modelling
Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu, Ying Nian Wu

On the Optimization Landscape of Neural Collapse Under MSE Loss: Global Optimality with Unconstrained Features
Jinxin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, Zhihui Zhu

Efficient Reinforcement Learning in Block MDPs: A Model-Free Representation Learning Approach
Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Alekh Agarwal, Wen Sun

Robust Training Under Label Noise by Over-Parameterization
Sheng Liu, Zhihui Zhu, Qing Qu, Chong You

FriendlyCore: Practical Differentially Private Aggregation
Eliad Tsfadia, Edith Cohen, Haim Kaplan, Yishay Mansour, Uri Stemmer

Adaptive Data Analysis with Correlated Observations
Aryeh Kontorovich, Menachem Sadigurschi,Uri Stemmer

A Resilient Distributed Boosting Algorithm
Yuval Filmus, Idan Mehalel, Shay Moran

On Learning Mixture of Linear Regressions in the Non-Realizable Setting
Avishek Ghosh, Arya Mazumdar,Soumyabrata Pal, Rajat Sen

Online and Consistent Correlation Clustering
Vincent Cohen-Addad, Silvio Lattanzi, Andreas Maggiori, Nikos Parotsidis

From Block-Toeplitz Matrices to Differential Equations on Graphs: Towards a General Theory for Scalable Masked Transformers
Krzysztof Choromanski, Han Lin, Haoxian Chen, Tianyi Zhang, Arijit Sehanobish, Valerii Likhosherstov, Jack Parker-Holder, Tamas Sarlos, Adrian Weller, Thomas Weingarten

Parsimonious Learning-Augmented Caching
Sungjin Im, Ravi Kumar, Aditya Petety, Manish Purohit

General-Purpose, Long-Context Autoregressive Modeling with Perceiver AR
Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour, Jean-Baptiste Alayrac, Joao Carreira, Jesse Engel

Conformal Prediction Sets with Limited False Positives
Adam Fisch, Tal Schuster, Tommi Jaakkola, Regina Barzilay

Dialog Inpainting: Turning Documents into Dialogs
Zhuyun Dai, Arun Tejasvi Chaganty, Vincent Zhao, Aida Amini, Qazi Mamunur Rashid, Mike Green, Kelvin Guu

Benefits of Overparameterized Convolutional Residual Networks: Function Approximation Under Smoothness Constraint
Hao Liu, Minshuo Chen, Siawpeng Er, Wenjing Liao, Tong Zhang, Tuo Zhao

Congested Bandits: Optimal Routing via Short-Term Resets
Pranjal Awasthi, Kush Bhatia, Sreenivas Gollapudi, Kostas Kollias

Provable Stochastic Optimization for Global Contrastive Learning: Small Batch Does Not Harm Performance
Zhuoning Yuan, Yuexin Wu, Zihao Qiu, Xianzhi Du, Lijun Zhang, Denny Zhou, Tianbao Yang

Examining Scaling and Transfer of Language Model Architectures for Machine Translation
Biao Zhang*, Behrooz Ghorbani, Ankur Bapna, Yong Cheng, Xavier Garcia, Jonathan Shen, Orhan Firat

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (see blog post)
Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathy Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui

How to Leverage Unlabeled Data in Offline Reinforcement Learning?
Tianhe Yu, Aviral Kumar, Yevgen Chebotar, Karol Hausman, Chelsea Finn, Sergey Levine

Distributional Hamilton-Jacobi-Bellman Equations for Continuous-Time Reinforcement Learning
Harley Wiltzer, David Meger, Marc G. Bellemare

On the Robustness of CountSketch to Adaptive Inputs
Edith Cohen, Xin Lyu, Jelani Nelson, Tamás Sarlós, Moshe Shechner, Uri Stemmer

Model Selection in Batch Policy Optimization
Jonathan N. Lee, George Tucker, Ofir Nachum, Bo Dai

The Fundamental Price of Secure Aggregation in Differentially Private Federated Learning
Wei-Ning Chen, Christopher A. Choquette-Choo, Peter Kairouz, Ananda Theertha Suresh

Linear-Time Gromov Wasserstein Distances Using Low Rank Couplings and Costs
Meyer Scetbon, Gabriel Peyré, Marco Cuturi*

Active Sampling for Min-Max Fairness
Jacob Abernethy, Pranjal Awasthi, Matthäus Kleindessner, Jamie Morgenstern, Chris Russell, Jie Zhang

Making Linear MDPs Practical via Contrastive Representation Learning
Tianjun Zhang, Tongzheng Ren, Mengjiao Yang, Joseph E. Gonzalez, Dale Schuurmans, Bo Dai

Achieving Minimax Rates in Pool-Based Batch Active Learning
Claudio Gentile, Zhilei Wang, Tong Zhang

Private Adaptive Optimization with Side Information
Tian Li, Manzil Zaheer, Sashank J. Reddi, Virginia Smith

Self-Supervised Learning With Random-Projection Quantizer for Speech Recognition
Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, Yonghui Wu

Wide Bayesian Neural Networks Have a Simple Weight Posterior: Theory and Accelerated Sampling
Jiri Hron, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein

The State of Sparse Training in Deep Reinforcement Learning
Laura Graesser, Utku Evci, Erich Elsen, Pablo Samuel Castro

Constrained Discrete Black-Box Optimization Using Mixed-Integer Programming
Theodore P. Papalexopoulos, Christian Tjandraatmadja, Ross Anderson, Juan Pablo Vielma, David Belanger

Massively Parallel k-Means Clustering for Perturbation Resilient Instances
Vincent Cohen-Addad, Vahab Mirrokni, Peilin Zhong

What Language Model Architecture and Pre-training Objective Works Best for Zero-Shot Generalization?
Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, Colin Raffel

Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy Without Increasing Inference Time
Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, Ludwig Schmidt

Synergy and Symmetry in Deep Learning: Interactions Between the Data, Model, and Inference Algorithm
Lechao Xiao, Jeffrey Pennington

Fast Finite Width Neural Tangent Kernel
Roman Novak, Jascha Sohl-Dickstein, Samuel S. Schoenholz

The Combinatorial Brain Surgeon: Pruning Weights that Cancel One Another in Neural Networks
Xin Yu, Thiago Serra, Srikumar Ramalingam, Shandian Zhe

Bayesian Imitation Learning for End-to-End Mobile Manipulation
Yuqing Du, Daniel Ho, Alexander A. Alemi, Eric Jang, Mohi Khansari

HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning
Andrey Zhmoginov, Mark Sandler, Max Vladymyrov

Marginal Distribution Adaptation for Discrete Sets via Module-Oriented Divergence Minimization
Hanjun Dai, Mengjiao Yang, Yuan Xue, Dale Schuurmans, Bo Dai

Correlated Quantization for Distributed Mean Estimation and Optimization
Ananda Theertha Suresh, Ziteng Sun, Jae Hun Ro, Felix Yu

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
Wenlong Huang, Pieter Abbeel, Deepak Pathak, Igor Mordatch

Only Tails Matter: Average-Case Universality and Robustness in the Convex Regime
Leonardo Cunha, Gauthier Gidel, Fabian Pedregosa, Damien Scieur, Courtney Paquette

Learning Iterative Reasoning through Energy Minimization
Yilun Du, Shuang Li, Josh Tenenbaum, Igor Mordatch

Interactive Correlation Clustering with Existential Cluster Constraints
Rico Angell, Nicholas Monath, Nishant Yadav, Andrew McCallum

Building Robust Ensembles via Margin Boosting
Dinghuai Zhang, Hongyang Zhang, Aaron Courville, Yoshua Bengio, Pradeep Ravikumar, Arun Sai Suggala

Probabilistic Bilevel Coreset Selection
Xiao Zhou, Renjie Pi, Weizhong Zhang, Yong Lin, Tong Zhang

Model Agnostic Sample Reweighting for Out-of-Distribution Learning
Xiao Zhou, Yong Lin, Renjie Pi, Weizhong Zhang, Renzhe Xu, Peng Cui, Tong Zhang

Sparse Invariant Risk Minimization
Xiao Zhou, Yong Lin, Weizhong Zhang, Tong Zhang

RUMs from Head-to-Head Contests
Matteo Almanza, Flavio Chierichetti, Ravi Kumar, Alessandro Panconesi, Andrew Tomkins

A Parametric Class of Approximate Gradient Updates for Policy Optimization
Ramki Gummadi, Saurabh Kumar, Junfeng Wen, Dale Schuurmans

On Implicit Bias in Overparameterized Bilevel Optimization
Paul Vico, Jonathan Lorraine, Fabian Pedregosa, David Duvenaud, Roger Grosse

Feature and Parameter Selection in Stochastic Linear Bandits
Ahmadreza Moradipari, Berkay Turan, Yasin Abbasi-Yadkori, Mahnoosh Alizadeh, Mohammad Ghavamzadeh

Neural Network Poisson Models for Behavioural and Neural Spike Train Data
Moein Khajehnejad, Forough Habibollahi, Richard Nock, Ehsan Arabzadeh, Peter Dayan and Amir Dezfouli

Deep Equilibrium Networks are Sensitive to Initialization Statistics
Atish Agarwala, Samuel Schoenholz

A Regret Minimization Approach to Multi-Agent Control
Udaya Ghai, Udari Madhushani, Naomi Leonard, Elad Hazan

Transformer Quality in Linear Time
Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc V. Le

Workshops

Shift Happens: Crowdsourcing Metrics and Test Datasets Beyond ImageNet
Organizing Committee includes: Roland S. Zimmerman
Invited Speakers include: Chelsea Finn, Lucas Beyer

Machine Learning for Audio Synthesis
Organizing Committee includes: Yu Zhang
Invited Speakers include: Chris Donahue

New Frontiers in Adversarial Machine Learning
Organizing Committee includes: Sanmi Koyejo

Spurious Correlations, Invariance, and Stability (SIC)
Organizing Committee includes: Victor Veitch

DataPerf: Benchmarking Data for Data-Centric AI
Organizing Committee includes: Lora Aroyo, Peter Mattson, Praveen Paritosh
DataPerf Speakers include: Lora Aroyo, Peter Mattson, Praveen Paritosh
Invited Speakers include: Jordi Pont-Tuset

Machine Learning for Astrophysics
Invited Speakers include: Dustin Tran

Dynamic Neural Networks
Organizing Committee includes: Carlos Riquelme
Panel Chairs include: Neil Houlsby

Interpretable Machine Learning in Healthcare (IMLH)
Organizing Committee includes: Ramin Zabih
Invited Speakers include: Been Kim

Human-Machine Collaboration and Teaming
Invited Speakers include: Fernanda Viégas, Martin Wattenberg, Yuhuai (Tony) Wu

Pre-training: Perspectives, Pitfalls, and Paths Forward
Organizing Committee includes: Hugo Larochelle, Chelsea Finn
Invited Speakers include: Hanie Sedgh, Charles Sutton

Responsible Decision Making in Dynamic Environments
Invited Speakers include: Craig Boutilier

Principles of Distribution Shift (PODS)
Organizing Committee includes: Hossein Mobahi

Hardware-Aware Efficient Training (HAET)
Invited Speakers include: Tien-Ju Yang

Updatable Machine Learning
Invited Speakers include: Chelsea Finn, Nicolas Papernot
Organizing Committee includes: Ananda Theertha Suresh, Badih Ghazi, Chiyuan Zhang, Kate Donahue, Peter Kairouz, Ziteng Sun

Knowledge Retrieval and Language Models
Invited Speakers include: Fernando Diaz, Quoc Le, Kenton Lee, Ellie Pavlick
Organizing Committee includes: Urvashi Khandelwal, Chiyuan Zhang

Theory and Practice of Differential Privacy
Organizing Committee includes: Badih Ghazi, Matthew Joseph, Peter Kairouz, Om Thakkar, Thomas Steinke, Ziteng Sun

Beyond Bayes: Paths Towards Universal Reasoning Systems
Invited Speakers include: Charles Sutton
Spotlight Talk: Language Model Cascades | David Dohan, Winnie Xu, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A. Saurous, Jascha Sohl-dickstein, Kevin Murphy, Charles Sutton

Safe Learning for Autonomous Driving (SL4AD)
Invited Speakers include: Chelsea Finn



*Work done while at Google.  

Categories
Misc

Living on the Edge: New Features for NVIDIA Fleet Command Deliver All-in-One Edge AI Management, Maintenance for Enterprises

NVIDIA Fleet Command — a cloud service for deploying, managing and scaling AI applications at the edge — today introduced new features that enhance the seamless management of edge AI deployments around the world. With the scale of edge AI deployments, organizations can have up to thousands of independent edge locations that must be managed Read article >

The post Living on the Edge: New Features for NVIDIA Fleet Command Deliver All-in-One Edge AI Management, Maintenance for Enterprises appeared first on NVIDIA Blog.

Categories
Misc

Running Multiple Applications on the Same Edge Devices

Smart spaces are one of the most prolific edge AI use cases. From smart retail stores to autonomous factories, organizations are quick to see the value in this innovative…

Smart spaces are one of the most prolific edge AI use cases. From smart retail stores to autonomous factories, organizations are quick to see the value in this innovative technology. However, building and scaling a smart space requires many different pieces of technology, including multiple applications. Operating multiple applications at edge locations can be tricky.

To do this, organizations may add new hardware to a location so that each application has dedicated compute resources, but the costs associated with purchasing and installing new hardware with every new application can be substantial. Many organizations deploy multiple applications on the same device.

While that is a solution for scale, it can present different challenges.

Many organizations rely on the performance of GPUs to power applications at the edge. Even with high-performance GPU-accelerated systems, having two or more AI applications running concurrently on the same device using time slicing inevitably leads to higher latency with minimal hardware optimization.

With multiple applications running on the same device, the device time-slices the applications in a queue so that applications are run sequentially as opposed to concurrently. There is always a delay in results while the device switches from processing data for one application to another. The amount of delay varies per deployment, but it could be as much as 8ms. That could present serious concerns for applications powering high-speed operations, such as a manufacturing production line. 

Because applications are running sequentially, the GPU is only ever used as much as each individual application needs while it is running. For instance, if there are three applications operating sequentially on a GPU and each application requires 60% of the GPU’s resources, then at any given time less than 60% of the GPU is used. The GPU utilization would be 0% during each context switch.

There are a few ways organizations can avoid time-slicing and better utilize their GPU resources.

NVIDIA Multi-Instance GPU

NVIDIA Multi-Instance GPU (MIG) is a feature that enables you to partition GPUs into multiple instances, each with their own compute cores enabling the full computing power of a GPU. MIG alleviates the issue of applications competing for resources by isolating applications and dedicating resources to each. MIG also allows for better resource optimization and low latency.

By providing up to seven distinct partitions, you can support every workload, from the smallest to the largest, with the exact amount of compute power needed to effectively operate each deployed application.

In addition to performance, MIG provides added security and resilience by dedicating a set of hardware resources for compute, memory, and cache for each instance. MIG provides fault isolation for workloads, where a fault caused by an application running in one instance does not impact applications running on other instances. If one workload fails, all other workloads continue operating uninterrupted since instances and workloads run in parallel while remaining separate and isolated.

MIG works equally well with containers or virtual machines (VMs). When using VMs, it is easy to virtualize GPUs using NVIDIA vGPU, which can be configured to employ either time slicing or MIG.

MIG for edge AI

When deploying edge AI, optimizing for cost, power, and space are all important considerations, especially if you want to replicate to thousands of edge nodes. By allowing organizations to run multiple applications on the same GPU, MIG eliminates the need for installing a dedicated GPU for each workload, significantly reducing resource requirements.

More than resource optimization, MIG helps ensure predictable application performance. Without MIG, different jobs running on the same GPU, such as different AI inference requests, compete for the same resources such as memory and bandwidth. Due to the competition for resources inherent in time slicing, the performance of one application can be affected by activity in another. For edge AI environments, unpredictable performance can have serious consequences.

For example, a computer vision application monitoring a production line to detect product defects has to be able to react instantaneously to its dynamic environment. It must be able to inspect products quickly, and also to communicate with other machinery to stop the production line in the case of a defective product. For safety and efficiency, organizations must know that the AI applications powering their production lines are running correctly and predictably all of the time.

Jobs running simultaneously with different resources result in predictable performance with quality of service and maximum GPU utilization, making MIG an essential addition to every edge deployment.

A GPU is spliced into four MIG instances representing quality inspection, sorting, and safety workload. Each instance has its own dedicated GPU and GPU memory resources.
Figure 1. Each MIG instance can handle an independent workload, optimizing environments where multiple use cases need to operate simultaneously

MIG on NVIDIA Fleet Command

Fleet Command is a cloud service that centrally connects systems at edge locations to securely deploy, manage, and scale AI applications from one dashboard. Purpose-built for edge AI, Fleet Command is the best way to orchestrate AI across hundreds or even thousands of devices. 

From the Fleet Command cloud platform, administrators have complete control over MIG for edge AI deployments with minimal configuration needed. Using MIG on Fleet Command enables you to make resource utilization decisions across hundreds or even thousands of devices with just a few clicks. You can easily add new MIG partitions, scale down existing partitions, and create custom deployments all from one dashboard.

The combination of MIG and Fleet Command provides organizations with all the functionality needed to have full control over edge AI deployments, leading to better used and more effective workloads. For more information about the entire workflow for using MIG on Fleet Command, see the following video.

Video 1. How to Run Multiple Applications on the Same Edge Device with Fleet Command

Try Fleet Command yourself with NVIDIA LaunchPad, a free program that provides you with short-term access to a large catalog of hands-on labs. You walk through the entire flow for deploying and managing applications on Fleet Command, including using MIG and other key features. Get started now.

Sign up for Edge AI News to stay up to date with the latest trends, customer use cases, and technical walkthroughs.

Categories
Misc

Remotely Operating Systems and Applications at the Edge

A recent poll during the Edge Computing 101 webinar revealed that many IT professionals interested in edge AI are still just learning the basics about the technology and…

A recent poll during the Edge Computing 101 webinar revealed that many IT professionals interested in edge AI are still just learning the basics about the technology and considerations for production deployments.

One key consideration for production edge AI is how administrators will manage ongoing maintenance for applications and systems post-deployment, sometimes referred to as Day-2 operations. Remote management is critical functionality that enables you to easily manage dozens or even thousands of remote sites.

Remote management is essential for edge AI

The process for bringing an edge AI proof of concept (POC) into a production environment at-scale requires that you have full access to both edge systems and applications at distributed locations.

Without complete and painless access, the ability to progress and scale quickly is limited by the time it takes to manually troubleshoot issues at the remote edge site. That process can be quite time-consuming and expensive as installing and scaling new technology always presents unpredictable issues. 

Traditional VPN connections lack security

After setup, you want to deploy and scale new applications on existing hardware, update existing applications, troubleshoot bugs, and validate new configurations. Having remote management capabilities that are secure is critical, as production deployments contain important data and insights that you will want to keep safe. 

But the traditional process of accessing machines and systems through VPN is not secure enough for the changing security landscape that edge deployments present.

First, most VPN connections do not have the ability to set time limits or restrictions. Administrators could (and often do) forget to close out a VPN session, leaving an avenue open for malicious actors.

Second, VPN connections do not easily provide the access controls needed for securely deploying and managing edge AI given the number of different partners, vendors, contractors, and other actors that might need access to parts of the deployment solution. 

To successfully operate edge deployments, you need remote management features with advanced functionality and security like just-in-time (JIT) access, clearly defined access controls, and timed sessions. 

To ensure this functionality, NVIDIA Fleet Command has two features to provide full remote management of both systems and applications. 

Remote system access

Remote console on Fleet Command provides secure, remote access to systems at the edge without needing physical access to the system or the network. You can view system information or data, navigate directories, view logs, and more.

Having an on-demand remote console eliminates the need for additional ports and traditional VPN connections and provides peace of mind. You’ll know that you can troubleshoot and remediate unexpected problems at remote edge locations. 

Another unique aspect of the remote console on Fleet Command is concurrent remote access to multiple edge nodes in an organization. To ensure the highest security across nodes, Fleet Command infrastructure isolates each of the open nodes in separate sessions and ensures that any issues on one system do not affect other systems. 

Remote application access

In addition to system-level access, you also have access to the applications. Fleet Command remote application access allows for web-based access to applications running on remote edge systems, eliminating the need for manual connection to the system and network through VPN to where applications are running.

Remote application access gives you visibility to the application services, providing full access to all features and functionality of the web applications running on the edge devices. Using remote application access, you can remotely access the application UI and configure applications, ensure that applications are running successfully, and troubleshoot any issues without compromising the security posture of your edge deployments. 

For added security, remote application access also features a configurable time allowance that automatically ends remote access sessions. This greatly simplifies resource management and frees up available remote sessions for other services.

Like remote console, Fleet Command remote application access enables multiple sessions to be open at the same time, so that multiple users from multiple locations can operate simultaneously. 

Secure remote management

A key aspect of remote management on Fleet Command is the security benefits of using these features. Access controls on remote console and remote application access mean that you can grant role-based usage capabilities to partners, customers, contractors, and others, ensuring limited exposure to the solution and network. 

Additionally, both features provide just-in-time (JIT) security, so sessions and privileges are granted by administrators and are time-limited. Time-limited sessions eliminate the possibility of perpetually open VPN sessions that provide backdoor access for malicious actors. 

Get started with remote management

Organizations are increasingly adopting edge AI solutions to power innovative new use cases. With any new technology, new approaches must ensure optimum functionality and safety, especially for production solutions dealing with critical or sensitive data. 

Remote management with Fleet Command provides everything you need to fully access edge systems and applications. It provides a layer of security that traditional VPN connections lack. 

To walk through the entire process of using remote console and remote application access on Fleet Command, see the Remotely Operate Systems and Applications with Fleet Command [NEEDS LINK] demo. 

Try Fleet Command yourself with NVIDIA LaunchPad, a free program that provides short-term access to a large catalog of hands-on labs. You can walk through the entire flow for deploying and managing applications on Fleet Command, including using remote management and other key features. Get started now

Sign up for Edge AI News to stay up to date with the latest trends, customer use cases, and technical walkthroughs.

Categories
Misc

CORSAIR Integrates NVIDIA Broadcast’s Audio, Video AI Features in iCUE and Elgato Software This Week ‘In the NVIDIA Studio’

Technology company CORSAIR and streaming partner BigCheeseKIT step In the NVIDIA Studio this week. A leader in high-performance gear and systems for gamers, content creators and PC enthusiasts, CORSAIR has integrated NVIDIA Broadcast technologies into its hardware and iCUE software. Similar AI enhancements have also been added to Elgato’s audio and video software, Wave Link and Camera Hub.

The post CORSAIR Integrates NVIDIA Broadcast’s Audio, Video AI Features in iCUE and Elgato Software This Week ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.

Categories
Misc

Meet the Omnivore: Animator Entertains and Explains With NVIDIA Omniverse

Australian animator Marko Matosevic is taking jokes from a children’s school dads’ group and breathing them into animated life with NVIDIA Omniverse, a virtual world simulation and collaboration platform for 3D workflows.

The post Meet the Omnivore: Animator Entertains and Explains With NVIDIA Omniverse appeared first on NVIDIA Blog.

Categories
Offsites

Towards Reliability in Deep Learning Systems

Deep learning models have made impressive progress in vision, language, and other modalities, particularly with the rise of large-scale pre-training. Such models are most accurate when applied to test data drawn from the same distribution as their training set. However, in practice, the data confronting models in real-world settings rarely match the training distribution. In addition, the models may not be well-suited for applications where predictive performance is only part of the equation. For models to be reliable in deployment, they must be able to accommodate shifts in data distribution and make useful decisions in a broad array of scenarios.

In “Plex: Towards Reliability Using Pre-trained Large Model Extensions”, we present a framework for reliable deep learning as a new perspective about a model’s abilities; this includes a number of concrete tasks and datasets for stress-testing model reliability. We also introduce Plex, a set of pre-trained large model extensions that can be applied to many different architectures. We illustrate the efficacy of Plex in the vision and language domains by applying these extensions to the current state-of-the-art Vision Transformer and T5 models, which results in significant improvement in their reliability. We are also open-sourcing the code to encourage further research into this approach.

Uncertainty — Dog vs. Cat classifier: Plex can say “I don’t know” for inputs that are neither cat nor dog.
Robust Generalization — A naïve model is sensitive to spurious correlations (“destination”), whereas Plex is robust.
Adaptation — Plex can actively choose the data from which it learns to improve performance more quickly.

Framework for Reliability
First, we explore how to understand the reliability of a model in novel scenarios. We posit three general categories of requirements for reliable machine learning (ML) systems: (1) they should accurately report uncertainty about their predictions (“know what they don’t know”); (2) they should generalize robustly to new scenarios (distribution shift); and (3) they should be able to efficiently adapt to new data (adaptation). Importantly, a reliable model should aim to do well in all of these areas simultaneously out-of-the-box, without requiring any customization for individual tasks.

  • Uncertainty reflects the imperfect or unknown information that makes it difficult for a model to make accurate predictions. Predictive uncertainty quantification allows a model to compute optimal decisions and helps practitioners recognize when to trust the model’s predictions, thereby enabling graceful failures when the model is likely to be wrong.
  • Robust Generalization involves an estimate or forecast about an unseen event. We investigate four types of out-of-distribution data: covariate shift (when the input distribution changes between training and application and the output distribution is unchanged), semantic (or class) shift, label uncertainty, and subpopulation shift.
    Types of distribution shift using an illustration of ImageNet dogs.
  • Adaptation refers to probing the model’s abilities over the course of its learning process. Benchmarks typically evaluate on static datasets with pre-defined train-test splits. However, in many applications, we are interested in models that can quickly adapt to new datasets and efficiently learn with as few labeled examples as possible.
Reliability framework. We propose to simultaneously stress-test the “out-of-the-box” model performance (i.e., the predictive distribution) across uncertainty, robust generalization, and adaptation benchmarks, without any customization for individual tasks.

We apply 10 types of tasks to capture the three reliability areas — uncertainty, robust generalization, and adaptation — and to ensure that the tasks measure a diverse set of desirable properties in each area. Together the tasks comprise 40 downstream datasets across vision and natural language modalities: 14 datasets for fine-tuning (including few-shot and active learning–based adaptation) and 26 datasets for out-of-distribution evaluation.

Plex: Pre-trained Large Model Extensions for Vision and Language
To improve reliability, we develop ViT-Plex and T5-Plex, building on large pre-trained models for vision (ViT) and language (T5), respectively. A key feature of Plex is more efficient ensembling based on submodels that each make a prediction that is then aggregated. In addition, Plex swaps each architecture’s linear last layer with a Gaussian process or heteroscedastic layer to better represent predictive uncertainty. These ideas were found to work very well for models trained from scratch at the ImageNet scale. We train the models with varying sizes up to 325 million parameters for vision (ViT-Plex L) and 1 billion parameters for language (T5-Plex L) and pre-training dataset sizes up to 4 billion examples.

The following figure illustrates Plex’s performance on a select set of tasks compared to the existing state-of-the-art. The top-performing model for each task is usually a specialized model that is highly optimized for that problem. Plex achieves new state-of-the-art on many of the 40 datasets. Importantly, Plex achieves strong performance across all tasks using the out-of-the-box model output without requiring any custom designing or tuning for each task.

The largest T5-Plex (top) and ViT-Plex (bottom) models evaluated on a highlighted set of reliability tasks compared to specialized state-of-the-art models. The spokes display different tasks, quantifying metric performance on various datasets.

<!–

The largest T5-Plex (top) and ViT-Plex (bottom) models evaluated on a highlighted set of reliability tasks compared to specialized state-of-the-art models. The spokes display different tasks, quantifying metric performance on various datasets.

–>

Plex in Action for Different Reliability Tasks
We highlight Plex’s reliability on select tasks below.

Open Set Recognition
We show Plex’s output in the case where the model must defer prediction because the input is one that the model does not support. This task is known as open set recognition. Here, predictive performance is part of a larger decision-making scenario where the model may abstain from making certain predictions. In the following figure, we show structured open set recognition: Plex returns multiple outputs and signals the specific part of the output about which the model is uncertain and is likely out-of-distribution.

Structured open set recognition enables the model to provide nuanced clarifications. Here, T5-Plex L can recognize fine-grained out-of-distribution cases where the request’s vertical (i.e., coarse-level domain of service, such as banking, media, productivity, etc.) and domain are supported but the intent is not.

Label Uncertainty
In real-world datasets, there is often inherent ambiguity behind the ground truth label for each input. For example, this may arise due to human rater ambiguity for a given image. In this case, we’d like the model to capture the full distribution of human perceptual uncertainty. We showcase Plex below on examples from an ImageNet variant we constructed that provides a ground truth label distribution.

Plex for label uncertainty. Using a dataset we construct called ImageNet ReaL-H, ViT-Plex L demonstrates the ability to capture the inherent ambiguity (probability distribution) of image labels.

Active Learning
We examine a large model’s ability to not only learn over a fixed set of data points, but also participate in knowing which data points to learn from in the first place. One such task is known as active learning, where at each training step, the model selects promising inputs among a pool of unlabeled data points on which to train. This procedure assesses an ML model’s label efficiency, where label annotations may be scarce, and so we would like to maximize performance while minimizing the number of labeled data points used. Plex achieves a significant performance improvement over the same model architecture without pre-training. In addition, even with fewer training examples, it also outperforms the state-of-the-art pre-trained method, BASE, which reaches 63% accuracy at 100K examples.

Active learning on ImageNet1K. ViT-Plex L is highly label efficient compared to a baseline that doesn’t leverage pre-training. We also find that active learning’s data acquisition strategy is more effective than uniformly selecting data points at random.

Learn more
Check out our paper here and an upcoming contributed talk about the work at the ICML 2022 pre-training workshop on July 23, 2022. To encourage further research in this direction, we are open-sourcing all code for training and evaluation as part of Uncertainty Baselines. We also provide a demo that shows how to use a ViT-Plex model checkpoint. Layer and method implementations use Edward2.

Acknowledgements
We thank all the co-authors for contributing to the project and paper, including Andreas Kirsch, Clara Huiyi Hu, Du Phan, D. Sculley, Honglin Yuan, Jasper Snoek, Jeremiah Liu, Jie Ren, Joost van Amersfoort, Karan Singhal, Kehang Han, Kelly Buchanan, Kevin Murphy, Mark Collier, Mike Dusenberry, Neil Band, Nithum Thain, Rodolphe Jenatton, Tim G. J. Rudner, Yarin Gal, Zachary Nado, Zelda Mariet, Zi Wang, and Zoubin Ghahramani. We also thank Anusha Ramesh, Ben Adlam, Dilip Krishnan, Ed Chi, Neil Houlsby, Rif A. Saurous, and Sharat Chikkerur for their helpful feedback, and Tom Small and Ajay Nainani for helping with visualizations.