I am using an NVIDIA GeForce GT 1030 with driver version 460.89. I have tried installing CUDA 11, CUDA 10.1, and CUDA 9, following all the instructions, but I still get the error “CUDA_ERROR_LAUNCH_FAILED”. Please help!
So I’m quite new to the world of Artificial Intelligence. I’m currently working on building a classification algorithm for raw data. This model will have 4 inputs, each a rank-1 tensor, but of different sizes. I could potentially make them all the same size if needed, but I would like to avoid this if possible. The output of the model would be a group number prediction based on softmax activation. My question is: would it be better in terms of model accuracy to use the 4 arrays as separate inputs to a Functional model, or should I manipulate the arrays to all be the same size and create a Sequential model with a single rank-2 tensor as its input?
I apologize in advance if anything is poorly worded, or if there is necessary information missing. Please let me know if I can clarify anything.
In XGBoost 1.0, we introduced a new, official Dask interface to support efficient distributed training. Fast-forwarding to XGBoost 1.4, the interface is now feature-complete. If you are new to the XGBoost Dask interface, look at the first post for a gentle introduction. In this post, we look at simple code examples, showing how to maximize the benefits of GPU acceleration.
Our examples focus on the HIGGS dataset, a moderately sized classification problem from the UCI Machine Learning repository. In the following sections, we start with basic data loading and preprocessing using GPU-accelerated Dask and dask-ml. Then, we train an XGBoost model on the returned data with different configurations and share some new features along the way. After that, we showcase how to compute SHAP values on a GPU cluster and the speedup we can obtain. Lastly, we share some optimization techniques for inference.
The following examples need to be run on a machine with at least one NVIDIA GPU, which can be a laptop or a cloud instance. One of the advantages of Dask is its flexibility: users can test their code on a laptop and scale the computation up to a cluster with a minimal amount of code changes. To set up the environment, we need the xgboost==1.4, dask, dask-ml, dask-cuda, and dask-cudf Python packages, available from the RAPIDS conda channels.
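For example, something like the following (the channel list and pinned versions may need adjusting for your setup):
conda install -c rapidsai -c nvidia -c conda-forge \
    dask dask-ml dask-cuda dask-cudf xgboost=1.4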
First we download the dataset into the data directory.
mkdir data
curl http://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz --output ./data/HIGGS.csv.gz
Then set up the GPU cluster using dask-cuda:
import os
from time import time
from typing import Tuple
from dask import dataframe as dd
from dask_cuda import LocalCUDACluster
from distributed import Client, wait
import dask_cudf
from dask_ml.model_selection import train_test_split
import xgboost as xgb
from xgboost import dask as dxgb
import numpy as np
import argparse
def main(client):
    # … main content to be inserted here in the following sections
    ...


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_workers", type=int, required=True)
    args = parser.parse_args()
    with LocalCUDACluster(args.n_workers) as cluster:
        print("dashboard:", cluster.dashboard_link)
        with Client(cluster) as client:
            main(client)
Given a cluster, we start loading the data into GPUs. Because the data is loaded multiple times during parameter tuning, we convert the CSV file into Parquet format for better performance. This can be easily done using dask_cudf:
def to_parquet() -> str:
    """Convert the HIGGS.csv file to parquet files."""
    dirpath = "./data"
    parquet_path = os.path.join(dirpath, "HIGGS.parquet")
    if os.path.exists(parquet_path):
        return parquet_path
    csv_path = os.path.join(dirpath, "HIGGS.csv")
    colnames = ["label"] + ["feature-%02d" % i for i in range(1, 29)]
    df = dask_cudf.read_csv(csv_path, header=None, names=colnames, dtype=np.float32)
    df.to_parquet(parquet_path)
    return parquet_path
After data loading, we prepare the training/validation splits:
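An illustrative sketch (the helper name load_higgs and the split parameters are assumptions for this example, not taken from the original code):
def load_higgs(client, path):
    """Load the HIGGS Parquet data and create the training/validation splits."""
    df = dask_cudf.read_parquet(path)
    y = df["label"]
    X = df[df.columns.difference(["label"])]
    # dask-ml's train_test_split works directly on dask_cudf collections
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.33, random_state=42
    )
    # Persist the splits in GPU memory so later steps don't recompute them
    X_train, X_valid, y_train, y_valid = client.persist(
        [X_train, X_valid, y_train, y_valid]
    )
    wait([X_train, X_valid, y_train, y_valid])
    return X_train, X_valid, y_train, y_valid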
In the preceding example, we use dask-cudf for loading data from the disk, and the train_test_split function from dask-ml for splitting up the dataset. Most of the time, the GPU backend of dask works seamlessly with utilities in dask-ml and we can accelerate the entire ML pipeline.
Training with early stopping
One of the most frequently requested features is early stopping support for the Dask interface. In the XGBoost 1.4 release, not only can we specify the number of stopping rounds, but we can also develop customized early stopping strategies. For the simplest case, providing stopping rounds to the train function enables early stopping:
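Something along these lines (the objective, metric, and hyperparameter values are illustrative, not tuned settings):
def fit_model_es(client, X, y, X_valid, y_valid) -> xgb.Booster:
    """Train on GPUs with early stopping on a validation set."""
    early_stopping_rounds = 5
    # DaskDeviceQuantileDMatrix avoids extra copies when the data is already on GPU
    Xy = dxgb.DaskDeviceQuantileDMatrix(client, X, y)
    Xy_valid = dxgb.DaskDMatrix(client, X_valid, y_valid)
    booster = dxgb.train(
        client,
        {
            "objective": "binary:logistic",
            "eval_metric": "error",
            "tree_method": "gpu_hist",
        },
        Xy,
        evals=[(Xy_valid, "Valid")],
        num_boost_round=1000,
        # Stop once the validation metric hasn't improved for this many rounds
        early_stopping_rounds=early_stopping_rounds,
    )["booster"]
    return booster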
There are two things to notice in the preceding snippet. Firstly, we specify the number of rounds that triggers early stopping: XGBoost will stop the training process once the validation metric fails to improve for X consecutive rounds, where X is the number of rounds specified for early stopping. Secondly, we use a data type called DaskDeviceQuantileDMatrix for training but DaskDMatrix for validation. DaskDeviceQuantileDMatrix is a drop-in replacement for DaskDMatrix for GPU-based training inputs that avoids extra data copies.
DaskDeviceQuantileDMatrix can save a considerable amount of memory when used with gpu_hist and the input data is already on the GPU. Figure 1 depicts the construction of DaskDeviceQuantileDMatrix. Data partitions no longer need to be copied and concatenated; instead, a summary generated by the sketching algorithm is used as a proxy for the real data.
Inside XGBoost, early stopping is implemented as a callback function. The new callback interface can be used to implement more advanced early stopping strategies. The following code shows an alternative implementation of early stopping, with an additional parameter asking XGBoost to return only the best model instead of the full model:
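For example (again with illustrative hyperparameters; save_best=True asks XGBoost to keep only the best model):
def fit_model_customized_es(client, X, y, X_valid, y_valid) -> xgb.Booster:
    # The EarlyStopping callback replaces the early_stopping_rounds parameter
    es = xgb.callback.EarlyStopping(rounds=5, save_best=True)
    Xy = dxgb.DaskDeviceQuantileDMatrix(client, X, y)
    Xy_valid = dxgb.DaskDMatrix(client, X_valid, y_valid)
    booster = dxgb.train(
        client,
        {
            "objective": "binary:logistic",
            "eval_metric": "error",
            "tree_method": "gpu_hist",
        },
        Xy,
        evals=[(Xy_valid, "Valid")],
        num_boost_round=1000,
        callbacks=[es],
    )["booster"]
    return booster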
In the preceding example, the EarlyStopping callback is provided as an argument to train instead of using the early_stopping_rounds parameter. To provide a customized early stopping strategy, exploring other parameters of EarlyStopping or subclassing this callback is a great starting point.
Customized objective and evaluation metric
XGBoost is designed to be extensible through customized objective functions and metrics. In 1.4, this feature is brought to the Dask interface. The requirement is exactly the same as for the single node interface:
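An illustrative sketch for binary logistic regression (the helper names logit and error and the metric name CustomErr are made up for this example):
def fit_model_customized_objective(client, X, y, X_valid, y_valid) -> xgb.Booster:
    def logit(predt: np.ndarray, Xy: xgb.DMatrix) -> Tuple[np.ndarray, np.ndarray]:
        """Custom objective: gradient and hessian of the logistic loss."""
        predt = 1.0 / (1.0 + np.exp(-predt))
        labels = Xy.get_label()
        grad = predt - labels
        hess = predt * (1.0 - predt)
        return grad, hess

    def error(predt: np.ndarray, Xy: xgb.DMatrix) -> Tuple[str, float]:
        """Custom metric: classification error computed on the raw margin output."""
        labels = Xy.get_label()
        predt = 1.0 / (1.0 + np.exp(-predt))
        r = np.zeros(predt.shape)
        r[predt > 0.5] = 1 - labels[predt > 0.5]
        r[predt <= 0.5] = labels[predt <= 0.5]
        return "CustomErr", float(np.average(r))

    # metric_name tells the callback which metric drives early stopping
    es = xgb.callback.EarlyStopping(
        rounds=5, save_best=True, data_name="Valid", metric_name="CustomErr"
    )
    Xy = dxgb.DaskDeviceQuantileDMatrix(client, X, y)
    Xy_valid = dxgb.DaskDMatrix(client, X_valid, y_valid)
    booster = dxgb.train(
        client,
        {"tree_method": "gpu_hist"},
        Xy,
        evals=[(Xy_valid, "Valid")],
        num_boost_round=1000,
        obj=logit,    # custom objective
        feval=error,  # custom evaluation metric
        callbacks=[es],
    )["booster"]
    return booster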
In the preceding function, we use the custom objective function and metric to implement a logistic regression model along with early stopping. Note that the function returns both gradient and hessian, which XGBoost uses to optimize the model. Also, the parameter named metric_name needs to be specified in our callback. It is used to inform XGBoost that the custom error function should be used for evaluating early stopping criteria.
Explaining the model
After obtaining our first model, we might want to explain predictions using SHAP. SHAP (SHapley Additive exPlanations) is a game theoretic approach to explaining the output of machine learning models based on Shapley values. For details about the algorithm, please refer to the papers. As XGBoost now has support for GPU-accelerated Shapley values, we extend this feature to the Dask interface. Now, users can compute SHAP values on distributed GPU clusters. This is enabled by the significantly improved predict function and the GPUTreeShap library:
def explain(client, model, X):
    # Use an array instead of a dataframe in case the output dimension is greater than 2.
    X_array = X.values
    contribs = dxgb.predict(
        client, model, X_array, pred_contribs=True, validate_features=False
    )
    # Use the result for further analysis
    return contribs
The performance of XGBoost computing SHAP values with multiple GPUs is shown in Figure 2.
The benchmark is performed on an NVIDIA DGX-1 server with eight V100 GPUs and two 20-core Xeon E5-2698 v4 CPUs, with one round of training, SHAP value computation, and inference.
The resulting SHAP values can be used for visualization, tuning the column sampling with feature weights or for other data engineering purposes.
Running inference
After some tuning, we arrive at the final model for performing inference on new data. In older versions, prediction with the XGBoost Dask interface was less efficient and memory hungry. In 1.4, we revised the predict function and added support for in-place prediction. Normal prediction uses the same interface as the SHAP value computation:
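For example (a minimal sketch):
def predict(client, model, X):
    # Standard prediction on a dask collection; the result is a dask Series
    predt = dxgb.predict(client, model, X)
    assert isinstance(predt, dd.Series)
    return predt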
The standard predict function provides a general interface accepting both DaskDMatrix and dask collections (DataFrame or Array), but it is not optimized for memory usage. Here, we replace it with in-place prediction, which supports the basic inference task and doesn't require copying the data into internal data structures of XGBoost:
def inplace_predict(client, model, X):
    # Use inplace_predict instead of standard predict.
    predt = dxgb.inplace_predict(client, model, X)
    assert isinstance(predt, dd.Series)
    return predt
The memory savings vary depending on the size of each chunk and the input types. When running inference multiple times with the same model, another potential optimization is prescattering the model. By default, XGBoost transfers the model to the workers every time predict is called, incurring significant overhead. The good news is that Dask functions accept a future object as a proxy for the finished model. We can then transfer the model ahead of time, overlapping the transfer with other computations such as loading and persisting data on workers.
def inplace_predict_multi_parts(client, model, X_train, X_valid):
    """Simulate the scenario where we need to run prediction on multiple datasets,
    using the train and valid splits. In the real world, the number of datasets is
    unlimited.
    """
    # prescatter the model onto workers
    model_f = client.scatter(model)
    predictions = []
    for X in [X_train, X_valid]:
        # Use inplace_predict instead of standard predict.
        predt = dxgb.inplace_predict(client, model_f, X)
        assert isinstance(predt, dd.Series)
        predictions.append(predt)
    return predictions
In the preceding snippet, we pass a future model to XGBoost instead of the real one. This way we avoid repeated transfers during prediction, or we can parallelize the model transfer with other operations like loading data, as suggested in the comments.
Putting it all together
In the previous sections, we demonstrated early stopping, SHAP value computation, customized objectives, and finally inference. The following chart shows the end-to-end speedup for a GPU cluster with a varying number of workers.
As before, the benchmark is performed on an NVIDIA DGX-1 server with eight V100 GPUs and two 20-core Xeon E5-2698 v4 CPUs, with one round of training, SHAP value computation, and inference. Also, we have shared two optimizations for memory usage, and the overall memory usage comparison is depicted in Figure 4.
The left two columns show the memory usage of training with a 64-bit data type, while the right two columns show training with a 32-bit data type. Standard means training with the normal DaskDMatrix and the predict function. Efficient means using DaskDeviceQuantileDMatrix along with inplace_predict.
Scikit-learn wrapper
The previous sections cover basic model training with the ‘functional’ interface; however, there is also a scikit-learn estimator-like interface, which is easier to use but comes with more constraints. In XGBoost 1.4, this interface has feature parity with the single node implementation. Users can choose different estimators like DaskXGBClassifier for classification and DaskXGBRanker for ranking. Check out the reference for a complete list of available estimators: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.dask.
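An illustrative sketch (the parameter values and the use of fit-time early stopping here are examples, not required settings):
clf = dxgb.DaskXGBClassifier(tree_method="gpu_hist")
clf.client = client  # attach the Dask client to the estimator
clf.fit(
    X_train,
    y_train,
    eval_set=[(X_valid, y_valid)],
    early_stopping_rounds=5,
)
predt = clf.predict(X_valid)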
Summary
We have walked through an example of accelerating XGBoost on a GPU cluster with RAPIDS libraries, showing that modernizing your XGBoost code can help maximize training efficiency. With the XGBoost Dask interface along with RAPIDS, users can achieve significant speedups with an easy-to-use API. Even though the XGBoost Dask interface has reached feature parity with the single-node API, development is continuing on better integration with other libraries for new features like hyperparameter tuning. For new feature requests relating to the Dask interface, you can open an issue on XGBoost's GitHub repository.
Researchers from the University of Washington and Facebook used deep learning to convert still images into realistic animated looping videos.
Their approach, which will be presented at the upcoming Conference on Computer Vision and Pattern Recognition (CVPR), imitates continuous fluid motion — such as flowing water, smoke and clouds — to turn still images into short videos that loop seamlessly.
“What’s special about our method is that it doesn’t require any user input or extra information,” said Aleksander Hołyński, University of Washington doctoral student in computer science and engineering and lead author on the project. “All you need is a picture. And it produces as output a high-resolution, seamlessly looping video that quite often looks like a real video.”
The team created a method known as “symmetric splatting” to predict the past and future motion from a still image, combining that data to create a seamless animation.
“When we see a waterfall, we know how the water should behave. The same is true for fire or smoke. These types of motions obey the same set of physical laws, and there are usually cues in the image that tell us how things should be moving,” Hołyński said. “We’d love to extend our work to operate on a wider range of objects, like animating a person’s hair blowing in the wind. I’m hoping that eventually the pictures that we share with our friends and family won’t be static images. Instead, they’ll all be dynamic animations like the ones our method produces.”
To teach their neural network to estimate motion, the team trained the model on more than 1,000 videos of fluid motion such as waterfalls, rivers and oceans. Given only the first frame of the video, the system would predict what should happen in future frames, and compare its prediction with the original video. This comparison helped the model improve its predictions of whether and how each pixel in an image should move.
The researchers used the NVIDIA Pix2PixHD GAN model for motion estimation network training, as well as FlowNet2 and PWC-Net. NVIDIA GPUs were used for both training and inference of the model. The training data included 1196 unique videos: 1096 for training, 50 for validation, and 50 for testing.
But an overview is that I am unable to get proper output from my model because of what I believe to be issues with my input and my lack of understanding regarding the shape of the input vs. the shape of a tensor. If there is anything I can provide to give a better idea of my problem, let me know. I appreciate any help I can get.
Yifeng Jiang, Research Intern and Jie Tan, Research Scientist, Robotics at Google
Simulation empowers various engineering disciplines to quickly prototype with minimal human effort. In robotics, physics simulations provide a safe and inexpensive virtual playground for robots to acquire physical skills with techniques such as deep reinforcement learning (DRL). However, as the hand-derived physics in simulations does not match the real world exactly, control policies trained entirely within simulation can fail when tested on real hardware — a challenge known as the sim-to-real gap or the domain adaptation problem. The sim-to-real gap for perception-based tasks (such as grasping) has been tackled using RL-CycleGAN and RetinaGAN, but there is still a gap caused by the dynamics of robotic systems. This prompts us to ask, can we learn a more accurate physics simulator from a handful of real robot trajectories? If so, such an improved simulator could be used to refine the robot controller using standard DRL training, so that it succeeds in the real world.
In our ICRA 2021 publication “SimGAN: Hybrid Simulator Identification for Domain Adaptation via Adversarial Reinforcement Learning”, we propose to treat the physics simulator as a learnable component that is trained by DRL with a special reward function that penalizes discrepancies between the trajectories (i.e., the movement of the robots over time) generated in simulation and a small number of trajectories that are collected on real robots. We use generative adversarial networks (GANs) to provide such a reward, and formulate a hybrid simulator that combines learnable neural networks and analytical physics equations, to balance model expressiveness and physical correctness. On robotic locomotion tasks, our method outperforms multiple strong baselines, including domain randomization.
A Learnable Hybrid Simulator
A traditional physics simulator is a program that solves differential equations to simulate the movement or interactions of objects in a virtual world. For this work, it is necessary to build different physical models to represent different environments: if a robot walks on a mattress, the deformation of the mattress needs to be taken into account (e.g., with the finite element method). However, due to the diversity of the scenarios that robots could encounter in the real world, it would be tedious (or even impossible) to apply such environment-specific modeling techniques to every case, which is why it is useful to instead take an approach based on machine learning. Although simulators can be learned entirely from data, if the training data does not include a wide enough variety of situations, the learned simulator might violate the laws of physics (i.e., deviate from the real-world dynamics) if it needs to simulate situations for which it was not trained. As a result, a robot that is trained in such a limited simulator is more likely to fail in the real world.
To overcome this complication, we construct a hybrid simulator that combines both learnable neural networks and physics equations. Specifically, we replace what are often manually-defined simulator parameters — contact parameters (e.g., friction and restitution coefficients) and motor parameters (e.g., motor gains) — with a learnable simulation parameter function because the unmodeled details of contact and motor dynamics are major causes of the sim-to-real gap. Unlike conventional simulators in which these parameters are treated as constants, in the hybrid simulator they are state-dependent — they can change according to the state of the robot. For example, motors can become weaker at higher speed. These typically unmodeled physical phenomena can be captured using the state-dependent simulation parameter functions. Moreover, while contact and motor parameters are usually difficult to identify and subject to change due to wear-and-tear, our hybrid simulator can learn them automatically from data. For example, rather than having to manually specify the parameters of a robot’s foot against every possible surface it might contact, the simulation learns these parameters from training data.
Comparison between a conventional simulator and our hybrid simulator.
The other part of the hybrid simulator is made up of physics equations that ensure the simulation obeys fundamental laws of physics, such as conservation of energy, making it a closer approximation to the real world and thus reducing the sim-to-real gap.
In our earlier mattress example, the learnable hybrid simulator is able to mimic the contact forces from the mattress. Because the learned contact parameters are state-dependent, the simulator can modulate contact forces based on the distance and velocity of the robot’s feet relative to the mattress, mimicking the effect of the stiffness and damping of a deformable surface. As a result, we do not need to analytically devise a model specifically for deformable surfaces.
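To make the idea concrete, here is a rough conceptual sketch (not the paper's implementation): a small network maps the robot state to contact and motor parameters, which are then fed into an ordinary analytical simulation step. The analytic_step callable is a hypothetical stand-in for a conventional simulator step that accepts such parameters.
import torch
import torch.nn as nn

class SimParamNet(nn.Module):
    """Map the robot state to state-dependent simulation parameters
    (e.g., friction, restitution, motor gains)."""
    def __init__(self, state_dim: int, n_params: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, n_params), nn.Sigmoid(),  # keep parameters in a bounded range
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

def hybrid_sim_step(state, action, param_net, analytic_step):
    """One step of the hybrid simulator: learned, state-dependent parameters
    plugged into analytical physics equations (analytic_step is hypothetical)."""
    params = param_net(state)
    return analytic_step(state, action, params)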
Using GANs for Simulator Learning
Successfully learning the simulation parameter functions discussed above would result in a hybrid simulator that can generate similar trajectories to the ones collected on the real robot. The key that enables this learning is defining a metric for the similarity between trajectories. GANs, initially designed to generate synthetic images that share the same distribution, or “style,” with a small number of real images, can be used to generate synthetic trajectories that are indistinguishable from real ones. GANs have two main parts, a generator that learns to generate new instances, and a discriminator that evaluates how similar the new instances are to the training data. In this case, the learnable hybrid simulator serves as the GAN generator, while the GAN discriminator provides the similarity scores.
The GAN discriminator provides the similarity metric that compares the movements of the simulated and the real robot.
Fitting parameters of simulation models to data collected in the real world, a process called system identification (SysID), has been a common practice in many engineering fields. For example, the stiffness parameter of a deformable surface can be identified by measuring the displacements of the surface under different pressures. This process is typically manual and tedious, but using GANs can be much more efficient. For example, SysID often requires a hand-crafted metric for the discrepancy between simulated and real trajectories. With GANs, such a metric is automatically learned by the discriminator. Furthermore, to calculate the discrepancy metric, conventional SysID requires pairing each simulated trajectory to a corresponding real-world one that is generated using the same control policy. Since the GAN discriminator takes only one trajectory as the input and calculates the likelihood that it is collected in the real world, this one-to-one pairing is not needed.
Using Reinforcement Learning (RL) to Learn the Simulator and Refine the Policy
Putting everything together, we formulate simulation learning as an RL problem. A neural network learns the state-dependent contact and motor parameters from a small number of real-world trajectories. The neural network is optimized to minimize the error between the simulated and the real trajectories. Note that it is important to minimize this error over an extended period of time — a simulation that accurately predicts a more distant future will lead to a better control policy. RL is well suited to this because it optimizes the accumulated reward over time, rather than just optimizing a single-step reward.
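As a rough sketch of the idea (again, not the paper's code): the discriminator scores simulated trajectory segments, and that score serves as the reward that RL maximizes when training the simulation parameter functions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryDiscriminator(nn.Module):
    """Score a trajectory segment; higher means it looks like a real-robot trajectory."""
    def __init__(self, segment_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(segment_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        return self.net(segment)

def gan_reward(discriminator: TrajectoryDiscriminator, sim_segment: torch.Tensor) -> torch.Tensor:
    """Reward for simulator learning: the discriminator's log-likelihood that the
    simulated segment came from the real robot."""
    with torch.no_grad():
        return F.logsigmoid(discriminator(sim_segment))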
After the hybrid simulator is learned and becomes more accurate, we use RL again to refine the robot’s control policy within the simulation (e.g., walking across a surface, shown below).
Following the arrows clockwise: (upper left) recording a small number of robot’s failed attempts in the target domain (e.g., a real-world proxy in which the leg in red is modified to be much heavier than the source domain); (upper right) learning the hybrid simulator to match trajectories collected in the target domain; (lower right) refining control policies in this learned simulator; (lower left) testing the refined controller directly in the target domain.
Evaluation
Due to limited access to real robots during 2020, we created a second and different simulation (target domain) as a proxy of the real world. The change of dynamics between the source and the target domains is large enough to approximate different sim-to-real gaps (e.g., making one leg heavier, walking on deformable surfaces instead of hard floor). We assessed whether our hybrid simulator, with no knowledge of these changes, could learn to match the dynamics in the target domain, and whether the refined policy in this learned simulator could be successfully deployed in the target domain.
Qualitative results below show that simulation learning with less than 10 minutes of data collected in the target domain (where the floor is deformable) is able to generate a refined policy that performs much better for two robots with different morphologies and dynamics.
Comparison of performance between the initial and refined policy in the target domain (deformable floor) for the hopper and the quadruped robot.
Quantitative results below show that SimGAN outperforms multiple state-of-the-art baselines, including domain randomization (DR) and direct finetuning in target domains (FT).
Comparison of policy performance using different sim-to-real transfer methods in three different target domains for the Quadruped robot: locomotion on deformable surface, with weakened motors, and with heavier bodies.
Conclusion
The sim-to-real gap is one of the key bottlenecks that prevents robots from tapping into the power of reinforcement learning. We tackle this challenge by learning a simulator that can more faithfully model real-world dynamics, while using only a small amount of real-world data. The control policy that is refined in this simulator can be successfully deployed. To achieve this, we augment a classical physics simulator with learnable components, and train this hybrid simulator using adversarial reinforcement learning. To date we have tested its application to locomotion tasks; we hope to build on this general framework by applying it to other robot learning tasks, such as navigation and manipulation.
So I apply preprocessing to my dataset outside the model, because that way I can generate new data and do normalization.
Now it's time to save the model so someone else can use it, and now I need the normalization.
I want to attach this commented-out layer before saving the model. I didn't need it for training, but now
that the model is trained I want to have normalization in the network.
The tutorial mentions that this is possible, but I don't see how to do it.
In this case the preprocessing layers will not be exported with the model when you call model.save. You will need to attach them to your model before saving it or reimplement them server-side. After training, you can attach the preprocessing layers before export.
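One way to do this, as a minimal sketch using the Keras functional API (this assumes TF 2.6+ where the Normalization layer lives under tf.keras.layers, or the experimental preprocessing equivalent in older versions; raw_train_features and model are placeholders for your unnormalized training data and your trained model):
import tensorflow as tf

# Adapt a normalization layer on the raw (unnormalized) training features.
normalizer = tf.keras.layers.Normalization()
normalizer.adapt(raw_train_features)  # placeholder for your raw training data

# Wrap the trained model so that raw inputs are normalized before they reach it.
inputs = tf.keras.Input(shape=model.input_shape[1:])  # `model` is your trained model
outputs = model(normalizer(inputs))
export_model = tf.keras.Model(inputs, outputs)

# The saved model now includes the normalization step.
export_model.save("exported_model")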