It is widely accepted that the enormous growth of deep reinforcement learning research, which combines traditional reinforcement learning with deep neural networks, began with the publication of the seminal DQN algorithm. This paper demonstrated the potential of the combination, producing agents capable of playing a number of Atari 2600 games very effectively. Since then, several approaches have built on and improved the original DQN. The popular Rainbow algorithm combined a number of these recent advances to achieve state-of-the-art performance on the ALE benchmark. This advance, however, came at a very high computational cost, which has the unfortunate side effect of widening the gap between those with ample access to computational resources and those without.
In “Revisiting Rainbow: Promoting more Insightful and Inclusive Deep Reinforcement Learning Research”, to be presented at ICML 2021, we revisit this algorithm on a set of small- and medium-sized tasks. We first discuss the computational cost associated with the Rainbow algorithm. We explore how the same conclusions regarding the benefits of combining the various algorithmic components can be reached with smaller-scale experiments, and further generalize that idea to how research done on a smaller computational budget can provide valuable scientific insights.
The Cost of Rainbow
A major reason for the computational cost of Rainbow is that the standards in academic publishing often require evaluating new algorithms on large benchmarks like ALE, which consists of 57 Atari 2600 games that reinforcement learning agents may learn to play. For a typical game, it takes roughly five days to train a model using a Tesla P100 GPU. Furthermore, if one wants to establish meaningful confidence bounds, it is common to perform at least five independent runs. Thus, training Rainbow on the full suite of 57 games required around 34,200 GPU hours (or 1,425 GPU days) to provide convincing empirical performance statistics. In other words, such experiments are only feasible if one is able to train on multiple GPUs in parallel, which can be prohibitive for smaller research groups.
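The arithmetic behind those figures is easy to verify:

```python
# Cost of a full Rainbow evaluation on ALE, from the numbers above.
games, runs, days_per_run, hours_per_day = 57, 5, 5, 24
gpu_hours = games * runs * days_per_run * hours_per_day
gpu_days = gpu_hours / hours_per_day
print(gpu_hours, gpu_days)  # 34200 GPU hours, 1425 GPU days
```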
Revisiting Rainbow
As in the original Rainbow paper, we evaluate the effect of adding the following components to the original DQN algorithm: double Q-learning, prioritized experience replay, dueling networks, multi-step learning, distributional RL, and noisy nets.
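To make the two ablation sweeps concrete before the results below, here is a minimal plain-Python sketch (component names are ours) that enumerates the configurations: each component added to DQN on its own, and each removed from the full Rainbow agent.

```python
# Components evaluated on top of DQN (names are illustrative).
COMPONENTS = ["double_q", "prioritized_replay", "dueling",
              "multi_step", "distributional", "noisy_nets"]

# Sweep 1: add each component to plain DQN independently.
add_one = [frozenset([c]) for c in COMPONENTS]

# Sweep 2: remove each component from the full Rainbow agent.
drop_one = [frozenset(set(COMPONENTS) - {c}) for c in COMPONENTS]

for config in add_one + drop_one:
    print(sorted(config))  # each configuration corresponds to one training run
```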
We evaluate on a set of four classic control environments, which can be fully trained in 10-20 minutes (compared to five days for ALE games):
Upper left: In CartPole, the task is to balance a pole on a cart that the agent can move left and right. Upper right: In Acrobot, there are two arms and two joints, where the agent applies force to the joint between the two arms in order to raise the lower arm above a threshold. Lower left: In LunarLander, the agent is meant to land the spaceship between the two flags. Lower right: In MountainCar, the agent must build up momentum between two hills to drive to the top of the rightmost hill.
We investigated the effect of both independently adding each of the components to DQN and removing each from the full Rainbow algorithm. As in the original Rainbow paper, we find that, in aggregate, the addition of each of these algorithms does improve learning over the base DQN. However, we also found some important differences, such as the fact that distributional RL — commonly thought to be a beneficial addition — does not always yield improvements on its own. Indeed, in contrast to the ALE results in the Rainbow paper, in the classic control environments, distributional RL only yields an improvement when combined with another component.
Each plot shows the training progress when adding the various components to DQN. The x-axis is training steps; the y-axis is performance (higher is better).
Each plot shows the training progress when removing the various components from Rainbow. The x-axis is training steps; the y-axis is performance (higher is better).
We also re-ran the Rainbow experiments on the MinAtar environment, which consists of a set of five miniaturized Atari games, and found qualitatively similar results. The MinAtar games are roughly 10 times faster to train than the regular Atari 2600 games on which the original Rainbow algorithm was evaluated, but still share some interesting aspects with them, such as game dynamics and pixel-based inputs to the agent. As such, they provide a challenging mid-level environment, in between the classic control environments and the full Atari 2600 games.
When viewed in aggregate, we find our results to be consistent with those of the original Rainbow paper — the impact resulting from each algorithmic component can vary from environment to environment. If we were to suggest a single agent that balances the tradeoffs of the different algorithmic components, our version of Rainbow would likely be consistent with the original, in that combining all components produces a better overall agent. However, there are important details in the variations of the different algorithmic components that merit a more thorough investigation.
Beyond the Rainbow
When DQN was introduced, it made use of the Huber loss and the RMSProp optimizer. It has been common practice for researchers to use these same choices when building on DQN, as most of their effort is spent on other algorithmic design decisions. In the spirit of reassessing these assumptions, we revisited the loss function and optimizer used by DQN on the lower-cost, small-scale classic control and MinAtar environments. We ran some initial experiments using the Adam optimizer, which has lately been the most popular optimizer choice, combined with a simpler loss function, the mean-squared error loss (MSE). Since the selection of optimizer and loss function is often overlooked when developing a new algorithm, we were surprised to observe a dramatic improvement on all the classic control and MinAtar environments.
We thus decided to evaluate the different ways of combining the two optimizers (RMSProp and Adam) with the two losses (Huber and MSE) on the full ALE suite (60 Atari 2600 games). We found that Adam+MSE is a superior combination to RMSProp+Huber.
Measuring the improvement Adam+MSE gives over the default DQN settings (RMSProp+Huber); higher is better.
Additionally, when comparing the various optimizer-loss combinations, we find that when using RMSProp, the Huber loss tends to perform better than MSE (illustrated by the gap between the solid and dotted orange lines).
Normalized scores aggregated over all 60 Atari 2600 games, comparing the different optimizer-loss combinations.
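As a minimal sketch of what swapping these choices looks like in TF2-style code (the network, targets, and learning rates below are illustrative placeholders, not the paper's implementation):

```python
import tensorflow as tf

# The default DQN choices versus the combination found to work best above.
huber = tf.keras.losses.Huber()
mse = tf.keras.losses.MeanSquaredError()
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=2.5e-4)  # illustrative
adam = tf.keras.optimizers.Adam(learning_rate=6.25e-5)       # illustrative

def td_update(q_network, optimizer, loss_fn, states, actions, targets):
    """One TD update; only optimizer and loss_fn change between variants."""
    with tf.GradientTape() as tape:
        q_values = q_network(states)                        # [batch, actions]
        chosen = tf.gather(q_values, actions, batch_dims=1)
        loss = loss_fn(targets, chosen)
    grads = tape.gradient(loss, q_network.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_network.trainable_variables))
    return loss

# td_update(q_network, adam, mse, ...)       # Adam + MSE
# td_update(q_network, rmsprop, huber, ...)  # default DQN settings
```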
Conclusion
On a limited computational budget, we were able to reproduce, at a high level, the findings of the Rainbow paper and uncover new and interesting phenomena. Evidently, it is much easier to revisit something than to discover it in the first place. Our intent with this work, however, was to argue for the relevance and significance of empirical research on small- and medium-scale environments. We believe that these less computationally intensive environments lend themselves well to a more critical and thorough analysis of the performance, behaviors, and intricacies of new algorithms.
We are by no means calling for less emphasis to be placed on large-scale benchmarks. We are simply urging researchers to consider smaller-scale environments as a valuable tool in their investigations, and reviewers to avoid dismissing empirical work that focuses on smaller-scale environments. By doing so, in addition to reducing the environmental impact of our experiments, we will both get a clearer picture of the research landscape and reduce the barriers for researchers from diverse and often under-resourced communities, which can only make our community and scientific advances stronger.
Acknowledgments
Thank you to Johan, the first author of this paper, for his hard work and persistence in seeing this through! We would also like to thank Marlos C. Machado, Sara Hooker, Matthieu Geist, Nino Vieillard, Hado van Hasselt, Eleni Triantafillou, and Brian Tanner for their insightful comments on this work.
Developers can use the new viisights Wise application powered by GPU-accelerated AI technologies to help organizations and municipalities worldwide avoid operational hazards and risk in their most valuable spaces.
With conventional imaging and analytics solutions, AI vision is used to detect the presence of people and objects and their basic attributes to better understand how a physical space is being used. For example, is there a pedestrian, cyclist, or vehicle on a roadway? While understanding what’s happening on a road or in a retail store is important, predicting what might happen next is critical to taking this vision to the next level and building even smarter spaces.
viisights — an NVIDIA Metropolis partner — helps cities and enterprises make better decisions faster by offering an AI-enabled platform that analyzes complex behavior within video content. viisights uses the NVIDIA DeepStream SDK to accelerate development of its video analytics pipeline, cutting development time in half. DeepStream provides a path to highly optimized, real-time video analysis pipelines, including video decoding, multi-stream processing, and accelerated image transformations, all critical for high-performance video analytics applications. For deployment, viisights uses NVIDIA TensorRT with INT8 calibration to further accelerate the application's inference throughput at the edge.
The viisights Wise application recognizes and classifies over 200 different types of objects and dozens of different types of actions and events. The ability to recognize that a person is running, falling, or throwing an object provides a far richer and more holistic understanding of how spaces are being used. By using viisights Wise, operations teams can gain deeper insights and identify behavioral patterns that can predict what might happen next. viisights Wise can process over 20 video streams on a single GPU in near real-time.
viisights technology protects public privacy: it analyzes only the general behavior patterns of individuals and groups and does not identify specifics such as faces or license plates. Because the application analyzes only behavior patterns, it is also easy to deploy and use, further increasing ROI.
Real-world situations and use cases for viisights Wise include:
Hazard Detection: Detect operational hazards in real-time. These include large objects that might block human or vehicle mobility in both outdoor and indoor surroundings such as streets, sidewalks, production floors, and operational areas in transportation hubs. It can even detect environmental hazards like smoke or fire in outdoor areas where smoke detectors cannot be installed.
Traffic Management: Cities can make informed decisions on how to best modify roadways, traffic patterns, public transit infrastructure, and more. Real-time data also lets city traffic agencies take immediate action to mitigate traffic congestion and address other issues such as vehicles colliding, occupying restricted areas, driving the wrong direction, and blocking roads or junctions.
Crowd Insights: Large groups of people can behave in a wide variety of ways. viisights Wise can recognize the action of a crowd and determine whether it's moving, running, gathering, or dispersing. This real-time detection of escalated behavior can help officials predict how quickly an event is growing and decide if there's a reason for concern or a need to take action. In addition to analyzing behavior, viisights can identify three classifications of human groupings — "groups", "gatherings", and "crowds" — based on a user-specified size for each category. These user-configurable classifications reduce false alarms by tuning grouping alerts to suit the context. For example, a small group of six to eight people will be categorized as ordinary in the context of a playground but as a cause for concern in an airport setting.
Group Density Determination: Determine group density (classified as "low", "medium", or "high") and count people crossing a line or entering and exiting a zone. Cities can use this insight to enforce social distancing mandates during public health emergencies and pandemics. Retail stores or entertainment facilities like stadiums or cinemas can also use the technology to estimate queue wait times and determine spacing between people in queues to preserve public health. This helps businesses, cities, and authorities assess staff allocation and work more efficiently. Counting people can also be used in operational scenarios, such as at train stations, where train operators can improve train line scheduling in real time based on the number of people entering and exiting the trains.
Sense Body Movements: The AI-enabled video analytics solution detects whether people have fallen or are lying on the floor, which can indicate medical distress, accidents, or even violence. Real-time behavioral recognition analytics can alert operations teams in situations where every second of latency matters.
Abnormal Behavior: Identify actions that warrant operational team attention, such as running where walking is the norm (for example, department store aisles or shopping malls) or climbing and jumping over fences. It can also understand contextual loitering and detect whether individuals are spending excessive amounts of time near ATMs, parks, or building lobbies, or are moving irregularly from object to object (for example, from car door to car door in a parking garage). viisights behavioral analytics apply the intelligence and power of AI-driven contextual understanding to detect and identify these incidents as they unfold, so they can be addressed accordingly, and to issue alerts in real time.
As one example, viisights is helping to manage operations in the city of Eilat, one of Israel's largest tourist destinations. The application has helped identify and manage events in densely populated entertainment areas and has helped maintain transportation safety. In one incident, a motorcycle was detected driving in parks where only bicycles are allowed, and the situation was quickly handled and safety maintained. In another example, the city of Ashdod is using viisights to manage traffic efficiently. viisights traffic management capabilities monitor several dozen intersections in real time for congestion and accidents. The data collected by viisights enables a quick response to ease traffic jams and helps improve traffic light scheduling at intersections. The value and impact of behavior analytics are significant across many industries.
Want to learn more about computer vision and video analytics? Read more Metropolis Spotlight posts.
Robotics researchers from NVIDIA and the University of Southern California presented DiSECt, the first differentiable simulator for robotic cutting, at the 2021 Robotics: Science and Systems (RSS) conference. The simulator accurately predicts the forces acting on a knife as it presses and slices through natural soft materials, such as fruits and vegetables.
Building robots that are intelligent and adaptive enough to generalize cutting behavior, whether with a kitchen butter knife or in a surgical resection, remains a difficult problem for researchers.
As it turns out, cutting with feedback requires adapting to the stiffness of the object, the force applied during the cut, and often a sawing motion to cut through. To achieve this, researchers use a family of techniques that leverage feedback to guide controller adaptation. However, such controller adaptation requires very careful parameter tuning for each instance of the same problem. While these techniques are successful in industrial settings, no two cucumbers (or tomatoes) are the same, which renders this family of algorithms ineffective in a more generic setting.
In contrast, recent research has focused on building differentiable algorithms for control problems, which in simpler terms means that the sensitivity of the output with respect to the input can be evaluated without excessive sampling. Efficient solutions to control problems are achievable when the simulated dynamics are differentiable [1,2,3], but the process of simulating cutting has not been differentiable so far!
Differentiable simulation of cutting poses a challenge, since cutting is naturally a discontinuous process in which crack formation and fracture propagation prohibit the calculation of gradients. We tackle this problem by proposing a novel way of simulating cutting that represents the process of crack propagation and damage mechanics in a continuous manner.
DiSECt implements the commonly used Finite Element Method (FEM) to simulate deformable materials, such as foodstuffs. The object to be cut is represented by a 3D mesh which consists of tetrahedral elements. Along the cutting surface we slice the mesh following the Virtual Node Algorithm [4]. This algorithm duplicates the mesh elements that intersect the cutting surface, and adds additional, so-called “virtual” vertices on the edges where these elements are cut. The virtual nodes add extra degrees of freedom to accurately simulate the contact dynamics of the knife when it presses and slices through the mesh.
Next, DiSECt inserts springs connecting the virtual nodes on either side of the cutting surface. These cutting springs allow us to simulate damage mechanics and crack propagation in a continuous manner, by weakening them in proportion to the contact force the knife exerts on the mesh. This continuous treatment allows us to differentiate through the dynamics in order to compute gradients for the parameters defining the properties of the material or the trajectory of the knife. For example, given the gradients for the vertical and sideways velocity of the knife, we can efficiently determine an energy-minimizing yet fast cutting motion through gradient-based optimization.
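As a rough sketch of the continuous damage idea (our own simplification, not DiSECt's actual update rule), each cutting spring can be softened smoothly in proportion to the knife's contact force, so no discrete fracture event breaks differentiability:

```python
import numpy as np

def weaken_springs(stiffness, contact_force, rate, dt):
    """Continuously soften cutting springs; illustrative, names ours.

    stiffness:     (n_springs,) current spring stiffnesses
    contact_force: (n_springs,) knife contact force near each spring
    rate:          damage coefficient (a material parameter)
    """
    # Smooth, differentiable decay instead of a discrete fracture event.
    return stiffness * np.exp(-rate * contact_force * dt)

k = np.full(4, 1e4)
forces = np.array([5.0, 2.0, 0.5, 0.0])
for _ in range(100):
    k = weaken_springs(k, forces, rate=1e-3, dt=1e-3)
print(k)  # springs under higher contact force have weakened the most
```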
We leverage reverse-mode automatic differentiation to efficiently compute gradients for hundreds of simulation parameters. Our simulator uses source code transformation, which automatically generates efficient CUDA kernels for the forward and backward passes of all our simulation routines, such as the FEM or contact model. Such an approach allows us to implement complex simulation routines that are parallelized on the GPU, while the gradients of the outputs of such routines with respect to their inputs are automatically derived by analyzing the abstract syntax tree (AST) of the simulation code.
Through gradient-based optimization algorithms, we can automatically tune the simulation parameters to achieve a close match between the simulator and real-world measurements. In one of our experiments, we leverage an existing dataset [5] of knife force profiles measured by a real-world robot while cutting various foodstuffs. We set up our simulator with the corresponding mesh and its material properties, and optimize the remaining parameters to reduce the discrepancy between the simulated and the real knife force profile. Within 150 gradient evaluations, our simulator closely predicts the knife force profile, as we demonstrate on the examples of cutting an actual apple and a potato. As shown in the figure, the initial parameter guess yielded a force profile that was far from the real observation, and our approach automatically found an accurate fit. We present further results in our accompanying paper that demonstrate that the found parameters generalize to different conditions, such as the downward velocity of the knife or the length of the reference trajectory window.
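As a toy illustration of such a calibration loop, here is a sketch that fits a two-parameter stand-in force model with PyTorch autodiff (the real system differentiates the full simulator through generated CUDA kernels; all names and values here are ours):

```python
import torch

# Toy differentiable stand-in for the simulator: a two-parameter knife-force
# model fit to a synthetic "measured" profile.
t = torch.linspace(0.0, 1.0, 200)
measured = 3.0 * t * torch.exp(-1.5 * t)               # pretend robot data

params = torch.tensor([1.0, 0.5], requires_grad=True)  # initial guess
opt = torch.optim.Adam([params], lr=0.05)

for step in range(150):                                # ~150 gradient evaluations
    simulated = params[0] * t * torch.exp(-params[1] * t)
    loss = torch.mean((simulated - measured) ** 2)     # force-profile discrepancy
    opt.zero_grad()
    loss.backward()                                    # reverse-mode gradients
    opt.step()

print(params.detach())  # approaches the true values (3.0, 1.5)
```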
We also generate additional data using a highly established commercial simulator, which allows us to precisely control the experimental setup, such as object shape and material properties. Given such data, we can also leverage the motion of the mesh vertices as an additional ground-truth signal. After optimizing the simulation parameters, DiSECt is able to predict the vertex positions and velocities, as well as the force profile, much more accurately.
Aside from parameter inference, the gradients of our differentiable cutting simulator can also be used to optimize the cutting motion of the knife. In our full cutting simulation, we represent a trajectory by keyframes, where each frame prescribes the downward velocity, as well as the frequency and amplitude of a sinusoidal sideways velocity. At the start of the optimization, the initial motion is a straight downward-pressing motion.
We optimize this trajectory with the objective to minimize the mean force on the knife and penalize the time it takes to cut the object. Studies have shown that humans perform sawing motions when cutting biomaterials in order to reduce the required force. Such behavior emerges from our optimization as well.
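In our notation, a sketch of this parameterization and objective might look as follows (illustrative, not the paper's code):

```python
import numpy as np

def knife_velocity(t, v_down, amp, freq):
    """Knife velocity within one keyframe segment (sideways, downward)."""
    sideways = amp * np.sin(2.0 * np.pi * freq * t)   # sawing motion
    downward = -np.full_like(t, v_down)               # pressing motion
    return np.stack([sideways, downward])

def objective(mean_force, duration, time_weight=0.1):
    """Minimize mean knife force while penalizing slow cuts."""
    return mean_force + time_weight * duration

t = np.linspace(0.0, 1.0, 100)
v = knife_velocity(t, v_down=0.05, amp=0.02, freq=4.0)  # one keyframe
```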
After 50 iterations with the Adam optimizer, we see a reduction in average knife force of 15 percent. However, the knife slices sideways further than its blade length. Therefore, we add a hard constraint to keep the lateral motion within valid limits and perform constrained optimization. Thanks to the end-to-end differentiability of DiSECt, accurate gradients for such constraints are available and lead to a valid knife motion that requires only 0.3 percent more force than the unconstrained result.
Cutting food items multiple times results in slightly different force profiles for each instance, depending on the geometry of such materials. We additionally present results to transfer simulation parameters between different meshes corresponding to the same material. Our approach leverages optimal transport to find correspondences between simulation parameters of a source mesh and a target mesh (e.g., local stiffnesses) based on the location of the virtual nodes. As shown in the following figure, the 2D positions of these nodes along the cutting surface allow us to map simulation parameters (shown here is the softness of the cutting springs) to topologically different target geometries.
In our ongoing research, we are bringing our differentiable simulation approach to real-world robotic cutting. We investigate a closed-loop control system where the simulator is updated online from force measurements, while the robot is cutting foodstuffs. Through model-predictive planning and optimal control, we aim to find time- and energy-efficient cutting actions that apply to the physical system.
We thank Yan-Bin Jia and Prajjwal Jamdagni for kindly providing us a dataset of real-world cutting trajectories which we used throughout our experiments. Be sure to also check out their research on robotic cutting!
DiSECt is a finalist for the best (student) paper award at RSS 2021. Visit the team's project webpage to learn more.
DiSECt Research Paper: DiSECt: A Differentiable Simulation Engine for Autonomous Robotic Cutting
Eric Heiden, Miles Macklin, Yashraj S Narang, Dieter Fox, Animesh Garg, Fabio Ramos
Robotics: Science and Systems (RSS) 2021.
Hello. I converted weights from a PyTorch model to a TF2 model. To verify the result, I exported the feature maps after the conv2d for both the original PyTorch model and the TF2 model. However, the feature map from the TF2 model has black bars on the top and left-hand side, whereas the PyTorch feature map does not. The PyTorch conv2d is nn.Conv2d(1, 56, kernel_size=5, padding=5//2), and the TensorFlow layer is tensorflow.keras.layers.Conv2D(56, 5, padding="same")(inp). Following the advice in https://stackoverflow.com/questions/68165375/manualy-convert-pytorch-weights-to-tf-keras-weights-for-convolutional-layer#, I even added a zero-padding layer to the TensorFlow model:

```python
x = tensorflow.keras.layers.ZeroPadding2D(padding=((2,2)))(inp)
x2 = tensorflow.keras.layers.Conv2D(56, 5, padding="valid")(x)
```

However, the black bars are still there. I have no clue about this problem > <. The conversion from PyTorch weights to TensorFlow weights is as follows:

```python
onnx_1_w_num = onnx_l.weight.data.permute(2,3,1,0).numpy()
onnx_1_b_num = onnx_l.bias.data.numpy()
tf_l.set_weights([onnx_1_w_num,onnx_1_b_num])
```

submitted by /u/mrgreen3325
The NVIDIA NGC team is hosting a webinar with live Q&A to dive into this Jupyter notebook available from the NGC catalog. Learn how to use these resources to kickstart your AI journey. Register now: NVIDIA NGC Jupyter Notebook Day: Medical Imaging Segmentation.
Image segmentation partitions a digital image into multiple segments by changing the representation into something more meaningful and easier to analyze. In the field of medical imaging, image segmentation can be used to help identify organs and anomalies, measure them, classify them, and even uncover diagnostic information. It does this by using data gathered from x-rays, magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and other formats.
To achieve state-of-the-art models that deliver the desired accuracy and performance for a use case, you must set up the right environment, train with the ideal hyperparameters, and optimize the model. All of this can be time-consuming, so data scientists and developers need the right set of tools to quickly overcome tedious tasks. That's why we built the NGC catalog.
The NGC catalog is a hub of GPU-optimized AI and HPC applications and tools. NGC provides easy access to performance-optimized containers, shortens model development time with pretrained models, and provides industry-specific SDKs to help build complete AI solutions and speed up AI workflows. These diverse assets can be used for a variety of use cases, ranging from computer vision and speech recognition to language understanding. The potential solutions span industries such as automotive, healthcare, manufacturing, and retail.
![NGC catalog page shows cards for GPU-optimized HPC and AI containers, pretrained models, and industry SDKs that help accelerate workflows](https://developer-blogs.nvidia.com/wp-content/uploads/2021/07/NGC-catalog-625x381.png)
3D medical image segmentation with U-Net
In this post, we show how you can use the Medical 3D Image Segmentation notebook to predict brain tumors in MRI images. This post is suitable for anyone who is new to AI and has a particular interest in image segmentation as it applies to medical imaging. 3D U-Net enables the seamless segmentation of 3D volumes, with high accuracy and performance. It can be adapted to solve many different segmentation problems.
Figure 2 shows that 3D U-Net consists of a contracting (left) and an expanding (right) path. The contracting path repeatedly applies unpadded convolutions followed by max pooling for downsampling.
In deep learning, a convolutional neural network (CNN) is a class of deep neural networks used mostly in image recognition and image processing. CNNs can perform both generative and descriptive tasks, and they also appear in machine vision, recommender systems, and natural language processing.
Padding in CNNs refers to the number of pixels added to an image when it is processed by the kernel of a CNN. An unpadded convolution adds no pixels, so each convolution shrinks the spatial size of its output.
Pooling is a downsampling approach in CNNs, and max pooling is one of the most common pooling methods, summarizing the most activated presence of a feature. Every step in the expanding path consists of an upsampling of the feature map and a concatenation with the correspondingly cropped feature map from the contracting path.
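As a minimal Keras sketch of one contracting and one expanding step (illustrative layer sizes, not the exact 3D U-Net configuration used in the notebook), note how the unpadded convolutions shrink the volume, which is why the skip connection must be cropped before concatenation:

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = tf.keras.Input(shape=(32, 32, 32, 1))       # a small 3D volume

# Contracting path: unpadded 3x3x3 convolutions, then max pooling.
c1 = layers.Conv3D(16, 3, activation="relu")(inp)  # 32^3 -> 30^3
p1 = layers.MaxPooling3D(2)(c1)                    # 30^3 -> 15^3
c2 = layers.Conv3D(32, 3, activation="relu")(p1)   # 15^3 -> 13^3

# Expanding path: upsample and concatenate the cropped skip connection.
u1 = layers.UpSampling3D(2)(c2)                    # 13^3 -> 26^3
skip = layers.Cropping3D(cropping=2)(c1)           # 30^3 -> 26^3, matches u1
m = layers.Concatenate()([u1, skip])
out = layers.Conv3D(1, 1, activation="sigmoid")(m)

model = tf.keras.Model(inp, out)
model.summary()
```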
Requirements
This resource contains a Dockerfile that extends the TensorFlow NGC container and encapsulates some dependencies.
Aside from these dependencies, you also need the following components:
- NVIDIA Docker
- Latest TensorFlow container from NGC
- NVIDIA Ampere Architecture, NVIDIA Turing, or NVIDIA Volta GPUs
To train your model using mixed or TF32 precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the 3D U-Net model on the Brain Tumor Segmentation 2019 dataset.
Download the resource
Download the resource manually by clicking the three dots at the top-right corner of the resource page.
You could also use the following wget command:
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/unet3d_medical_for_tensorflow/versions/20.06.0/zip -O unet3d_medical_for_tensorflow_20.06.0.zip
Build the U-Net TensorFlow NGC container
This command uses the Dockerfile to create a Docker image named unet3d_tf, downloading all the required components automatically.
docker build -t unet3d_tf .
Download the dataset
Data can be obtained by registering on the Brain Tumor Segmentation 2019 dataset website. The data should be downloaded and placed where /data in the container is mounted.
Run the container
To start an interactive session in the NGC container to run preprocessing, training, and inference, run the following command. It launches the container, mounts the ./data directory as a volume to the /data directory inside the container, and mounts the ./results directory to the /results directory in the container.
The advantage of using a container is that it packages all the necessary libraries and dependencies into a single, isolated environment. This way you don’t have to worry about the complex install process.
```bash
mkdir data
mkdir results
# Mount ./data and ./results into the container and expose the Jupyter port.
docker run --runtime=nvidia -it --shm-size=1g --ulimit memlock=-1 \
  --ulimit stack=67108864 --rm --ipc=host \
  -v ${PWD}/data:/data -v ${PWD}/results:/results \
  -p 8888:8888 unet3d_tf:latest /bin/bash
```
Start the notebook inside the container
Use this command to start a Jupyter notebook inside the container:
jupyter notebook --ip 0.0.0.0 --port 8888 --allow-root
Move the dataset to the /data directory inside the container. Download the notebook with the following command:
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/med_3dunet/versions/1/zip -O med_3dunet_1.zip
Then, upload the downloaded notebook into JupyterLab and run the cells of the notebook to preprocess the dataset and train, benchmark, and test the model.
Jupyter notebook
By running the cells of this Jupyter notebook, you first check the downloaded dataset and view the brain tumor images. Next, you run the data preprocessing command to prepare the data for training. The following step trains the model and uses the checkpoints of the training process for prediction. Finally, you check the output of the predict function visually.
Checking the images
To check the dataset, you can use nibabel, which is a package that provides read/write access to some common medical and neuroimaging file formats.
By running the next three cells, you can install nibabel using pip, choose an image from the dataset, and plot it using matplotlib. You can check other dataset images by changing the image path in the code.
```python
import nibabel as nib
import matplotlib.pyplot as plt

# Load one FLAIR volume from the BraTS training set.
img_arr = nib.load('/data/MICCAI_BraTS_2019_Data_Training/HGG/BraTS19_2013_10_1/BraTS19_2013_10_1_flair.nii.gz').get_data()

def show_plane(ax, plane, cmap="gray", title=None):
    ax.imshow(plane, cmap=cmap)
    ax.axis("off")
    if title:
        ax.set_title(title)

# Show the middle slice along each of the three axes.
(n_plane, n_row, n_col) = img_arr.shape
_, (a, b, c) = plt.subplots(ncols=3, figsize=(15, 5))
show_plane(a, img_arr[n_plane // 2], title=f'Plane = {n_plane // 2}')
show_plane(b, img_arr[:, n_row // 2, :], title=f'Row = {n_row // 2}')
show_plane(c, img_arr[:, :, n_col // 2], title=f'Column = {n_col // 2}')
```
The result is something like Figure 4.
Data preprocessing
The dataset/preprocess_data.py script converts the raw data into the TFRecord format used for training and evaluation. This dataset, from the 2019 BraTS challenge, contains over 3 TB of multi-institutional, routine, clinically acquired, preoperative, multimodal MRI scans of glioblastoma (GBM/HGG) and lower-grade glioma (LGG), with pathologically confirmed diagnoses. When available, overall survival (OS) data for the patient is also included. The data is structured into training, validation, and testing datasets.
The format of the images is nii.gz; NIfTI is a file format for neuroimaging. You can preprocess the downloaded dataset by running the following command:
python dataset/preprocess_data.py -i /data/MICCAI_BraTS_2019_Data_Training -o /data/preprocessed -v
The final format of the processed images is tfrecord. To help you read data efficiently, serialize your data and store it in a set of files (~100 to 200 MB each) that can each be read linearly. This is especially true if the data is being streamed over a network. It can also be useful for caching any data preprocessing. The TFRecord format is a simple format for storing a sequence of binary records, which speeds up the data loading process considerably.
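As a generic TensorFlow sketch of the format (not the notebook's preprocess_data.py), writing and reading a record looks like this:

```python
import tensorflow as tf

# Write one serialized example (a flattened volume and its label).
example = tf.train.Example(features=tf.train.Features(feature={
    "volume": tf.train.Feature(float_list=tf.train.FloatList(value=[0.1, 0.2])),
    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[1])),
}))
with tf.io.TFRecordWriter("sample.tfrecord") as writer:
    writer.write(example.SerializeToString())

# Read it back as a streaming dataset.
def parse(record):
    spec = {"volume": tf.io.VarLenFeature(tf.float32),
            "label": tf.io.FixedLenFeature([1], tf.int64)}
    return tf.io.parse_single_example(record, spec)

ds = tf.data.TFRecordDataset("sample.tfrecord").map(parse)
for ex in ds:
    print(ex["label"].numpy())
```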
Training using default parameters
After the Docker container is launched, you can start the training of a single fold (fold 0) with the default hyperparameters (for example, {1 to 8} GPUs {TF-AMP/FP32/TF32}):
bash examples/unet3d_train_single{_TF-AMP}.sh
For example, to run with 32-bit precision (FP32 or TF32) with batch size 2 on one GPU, run the following command:
bash examples/unet3d_train_single.sh 1 /data/preprocessed /results 2
To train a single fold with mixed precision (TF-AMP) on eight GPUs with batch size 2 per GPU, run the following command:
bash examples/unet3d_train_single_TF-AMP.sh 8 /data/preprocessed /results 2
Training performance benchmarking
The training performance can be evaluated by running benchmarking scripts:
bash examples/unet3d_{train,infer}_benchmark{_TF-AMP}.sh
This script runs the model and reports the performance. For example, to benchmark training with TF-AMP with batch size 2 on four GPUs, run the following command:
bash examples/unet3d_train_benchmark_TF-AMP.sh 4 /data/preprocessed /results 2
Predict
You can use the test dataset and predict as the exec-mode to test the model. The result is saved in the model_dir directory, and data_dir is the path to the dataset:
python main.py --model_dir /results --exec_mode predict --data_dir /data/preprocessed_test
Plotting the result of prediction
In the following code example, you plot one of the results from the results folder:
```python
import numpy as np
import matplotlib.pyplot as plt

# Load one predicted volume saved by the predict run.
data = np.load('/results/vol_0.npy')

def show_plane(ax, plane, cmap="gray", title=None):
    ax.imshow(plane, cmap=cmap)
    ax.axis("off")
    if title:
        ax.set_title(title)

(n_plane, n_row, n_col) = data.shape
_, (a, b, c) = plt.subplots(ncols=3, figsize=(15, 5))
show_plane(a, data[n_plane // 2], title=f'Plane = {n_plane // 2}')
show_plane(b, data[:, n_row // 2, :], title=f'Row = {n_row // 2}')
show_plane(c, data[:, :, n_col // 2], title=f'Column = {n_col // 2}')
```
Advanced options
For those of you looking to explore the advanced features built into this notebook, the full list of available options for main.py can be seen using -h or --help. By running the next cell, you can see how to change the execution mode and other parameters of this script. You can perform model training, prediction, evaluation, and inference with customized hyperparameters using this script.
python main.py --help
The main.py parameters can be changed to perform different tasks, including training, evaluation, and prediction.
You can also train the model using default hyperparameters. By running the python main.py --help command, you can see the list of arguments that you can change, including the training hyperparameters. For example, in training mode, you can change the learning rate from the default 0.0002 to 0.001 and the training steps from 16000 to 1000 using the following command:
python main.py --model_dir /results --exec_mode train --data_dir /data/preprocessed_test --learning_rate 0.001 --max_steps 1000
You can run the other execution modes available in main.py. For example, in this post, we used the prediction execution mode of main.py by running the following command:
python main.py --model_dir /results --exec_mode predict --data_dir /data/preprocessed_test
Summary and next steps
In this post, we showed how you can get started with a medical imaging model using a simple Jupyter notebook from the NGC catalog. As you make the transition from this Jupyter notebook to building your own medical imaging workflows, consider using NVIDIA Clara Train. Clara Train includes AI-Assisted Annotation APIs and an Annotation Server that can be seamlessly integrated into any medical viewer, making it AI-capable. The training framework includes decentralized learning techniques, such as federated learning and transfer learning, for your AI workflows.
To learn how to use these resources and kickstart your AI journey, register for the upcoming webinar with live Q&A, NVIDIA NGC Jupyter Notebook Day: Medical Imaging Segmentation.
Your robotaxi is arriving soon. Self-driving startup AutoX took the wraps off its full self-driving system — dubbed "Gen5" — this week. The autonomous driving platform, which is specifically designed for robotaxis, uses NVIDIA DRIVE automotive-grade GPUs to reach up to 2,200 trillion operations per second (TOPS) of AI compute performance.
Despite unprecedented progress in NLP, many state-of-the-art models are available in English only. NVIDIA has developed tools to enable the development of even the largest language models. This post describes the challenges associated with building and deploying large scale models and the solutions to enable global organizations to achieve NLP leadership in their regions.
Despite substantial progress in natural language processing (NLP) research over the last two years and its commercial success, little effort has been devoted to adapting this capability to other significant languages, such as Hindi, Arabic, Portuguese, or Spanish. Obviously, catering to the entire human population, with more than 6,500 languages, is challenging. At the same time, supporting just 40 languages addresses the NLP needs of more than 60% of the human population.
Figure 2 shows that, even across the most frequently used languages, the performance of language models varies tremendously. Bear in mind that this comparison is not perfect, as those languages do have different language entropy. More importantly, the research on most capable large-scale language models seems to be limited to only a handful of high resource languages (languages with a high number of documents available publicly), such as English or Chinese.
The situation is even more complex when you account for domain-specific languages (such as medical, technical, or legal jargon), where besides English only a few high-quality models exist. This is regrettable, as those domain-specific language models are currently transforming the way that clinicians, engineers, researchers, and other experts access information.
- Models such as GatorTron are changing the way that clinicians access medical notes.
- Models such as BioMegatron are changing the way that medical researchers identify drug or chemical interactions affecting the human body.
- Models such as SciBERT are changing the way that engineers find and interact with scientific information.
Unfortunately, there is a limited number of equivalent models outside of English. Fortunately, replicating the success of English-language models across other languages is no longer a research task but predominantly an engineering activity. It no longer requires inventing new models and training approaches, but instead calls for systematic and iterative dataset engineering, model training, and continuous validation.
This does not mean that engineering those models is trivial. First, because of the model and dataset sizes used in modern NLP, the training process requires a substantial amount of computing power. Second, to use large models, you must collect large textual datasets. Third, because of the sheer size of the models used, new approaches to training and inference are required.
NVIDIA has extensive experience not only in building large-scale language models (ranging from 1 billion to 175 billion parameters) but also in deploying them to production. The goal of this post is to share our knowledge around project organization, infrastructure requirements, and budgeting and to support projects in this area.
Dawn of large models
As hypothesized in Deep Learning Scaling is Predictable, Empirically, the NLP model performance seems to follow a power law with respect to both the model size and the volume of data used for training. As you make models and datasets bigger, the performance continues to improve. The following diagram from Scaling Laws for Neural Language Models demonstrates not only that this relationship holds but more importantly, it holds across nine orders of magnitude of compute.
In the NLP scaling law, despite the models at the far right reaching as much as 175 billion parameters (more than 500 times larger than BERT Large), this relationship does not show signs of stopping. This suggests that even further improvement can be expected from larger models. Indeed, Switch Transformers, when scaled to 1.6 trillion parameters (roughly 5,000 times larger than BERT Large), continue to demonstrate the previously mentioned behavior.
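The power law referenced above can be written compactly. The approximate fits reported in Scaling Laws for Neural Language Models take the form (constants quoted from memory and worth verifying):

$$
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095,
$$

where $L$ is the test loss, $N$ the number of non-embedding parameters, and $D$ the dataset size in tokens.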
More importantly, the large NLP models seem to generate much more robust features capable of solving complex problems even without large-scale fine-tuning datasets. Figure 4 shows this capability across three orders of magnitude of models.
Due to this capability, and despite the relatively high cost of their development, large NLP models are likely not only to continue to dominate the NLP landscape but also to continue to grow, by at least another order of magnitude, approaching trillions of parameters.
This relationship between model size, dataset size, and model performance is not unique to NLP. I see the same behavior in automatic speech recognition and computer vision models, and across many other disciplines that are the backbone of conversational AI.
At the same time, a limited amount of work has been devoted to the development of both large-scale datasets and models for other languages. Indeed, the majority of the work that focuses on languages other than English takes advantage of smaller models and less curated datasets; for example, many rely on subsets of general datasets such as raw Common Crawl.
Even less effort is devoted to supporting any of the following:
- More localized models, such as Arabic dialects and variants or Mandarin subgroups and dialects.
- Domain-specific models, such as those used for clinical note taking, scientific publications, financial report analysis, interpretation of engineering documents, parsing legal texts, or even the variants of language used on platforms such as Twitter.
The current status creates an opportunity for local companies willing to invest in model training to lead the development of NLP technologies in the region.
Building your local language model
Building large-scale language models is not trivial for many reasons. First, large-scale datasets are easy to obtain in raw form but far from trivial to curate. Second, the infrastructure required to train these huge models requires substantial systems knowledge to set up. Finally, they require extensive research expertise to train and optimize.
What is less widely understood is that training such large models requires software engineering effort. Most interesting models are larger than the memory capacity of not only individual GPUs but also of many multi-GPU servers. The number of mathematical operations required to train them can also make training times unmanageable, measured in months even on fairly sizable systems. Approaches such as model and pipeline parallelism overcome some of those challenges. However, applying them in a naive way could lead to scaling issues, exacerbating an already long training time.
Together with organizations such as Microsoft and Stanford University, NVIDIA has worked towards developing tools that streamline the development process of the largest language models and provide computational efficiency and scalability to allow cost-effective training. As a consequence, a wide range of tools abstracting the complexity of large model development are now available, including the following:
- NVIDIA Megatron LM—A computationally efficient model and pipeline parallel implementation of self-attention and multilayer perceptron together with implementations of models such as GPT-3, T5, or even Vision Transformers. In particular, the GPT-3 code was tested with up to 1 trillion parameters. For more information, see Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism and Efficient Large-Scale Language Model Training on GPU Clusters.
- Microsoft DeepSpeed—A library that not only supports Megatron LM as a foundation for model and pipeline parallelism but also further optimizes the implementation with Zero Redundancy Optimizer (ZeRO), 1-bit Adam and LAMB, or sparse attention.
- Facebook FairScale—A library that addresses large-scale training through the implementation of model parallel training (based on NVIDIA Megatron LM), sharded training, or an AdaScale optimizer.
As a result of those efforts, I've seen a substantial reduction in the training times of large models. Indeed, the GPT-3 model with 175 billion parameters, trained on 300 billion tokens using 1024 NVIDIA A100 Tensor Core GPUs, can be trained today in 34 days (as shown in Efficient Large-Scale Language Model Training on GPU Clusters). Based on experimentation, NVIDIA estimates that 1 trillion parameter models can be trained in approximately 84 days with 3072 A100 GPUs. Although the training cost of those models is high, it is not beyond the reach of most large organizations. With further software advances, it is likely to decrease further.
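A rough sanity check of those numbers is possible with the common approximation that training compute is about 6 times parameters times tokens; the approximation, not the quoted schedule, is our assumption:

```python
params = 175e9                      # GPT-3 parameter count
tokens = 300e9                      # training tokens
gpus, days = 1024, 34               # A100 count and wall-clock time quoted above

flops = 6 * params * tokens         # ~3.15e23 training FLOPs (6*N*D rule of thumb)
gpu_seconds = gpus * days * 86400
print(flops / gpu_seconds / 1e12)   # ~105 sustained TFLOP/s per GPU, plausible for A100
```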
Because the development of large language models requires scalable infrastructure, NVIDIA has also consolidated knowledge from building the internal Selene cluster (used for internal NLP research and to deliver record-breaking performance in MLPerf training and inference benchmarks) into a fully packaged product called NVIDIA DGX SuperPOD. This cluster is more than just a system reference design. In fact, it can be bought in its entirety together with software and support from NVIDIA data scientists and applied researchers, similar to the NLP-focused deployment by Naver Clova.
Such an approach has already had a substantial impact on the NLP landscape, as it enables organizations with extensive NLP expertise to scale out their efforts fast. More importantly, it enables organizations with limited systems, HPC, or large-scale NLP workload expertise to start iterating in weeks, rather than months or years.
Production deployment
The ability to build large language models is just an academic achievement if it's impossible to take advantage of the results of your work by deploying your models to production. The challenge of deploying models such as GPT-3 stems from their sheer size, which exceeds the memory capacity of a GPU, and from their computational complexity. Both factors contribute to decreased throughput and high inference latency. This is a widely understood problem, and a range of tools and solutions currently exist to make serving the largest language models simple and cost-effective.
- NVIDIA Triton Inference Server
- NVIDIA TensorRT and other model optimization techniques
NVIDIA Triton Inference Server
NVIDIA Triton Inference Server is an open-source cloud and edge inferencing solution optimized for both CPUs and GPUs. It can be used to host distributed models effectively. To deploy a large model using pipeline parallelism, the model must be split into several parts, for example, by manipulating the ONNX graph with tools such as ONNX GraphSurgeon. Each of the parts must be small enough to fit into the memory space of a single GPU.
After the model is subdivided, it can be distributed across multiple GPUs without the need to develop any code. You create an NVIDIA Triton YAML configuration file defining how individual parts of the model should be connected.
The traffic between individual model parts and their load balancing can be managed automatically by Triton Inference Server. The communication overheads are also kept to a minimum, as Triton takes advantage of the latest NVIDIA NVSwitch and third-generation NVIDIA NVLink technology, providing 600 GB/s of GPU-to-GPU direct bandwidth, 10X higher than PCIe Gen4. This means that you can efficiently deploy not only medium-scale models of multiple billions of parameters but even the largest of models, including GPT-3 with trillions of parameters.
For more information, see Megatron GPT-3 Large Model Inference with Triton and ONNX Runtime (GTC21 session).
NVIDIA TensorRT and other model optimization techniques
Beyond the ability to host trained large models, it is important to look at optimization techniques. Such techniques can reduce the memory footprint of the models through quantization and pruning, substantially accelerate execution, and reduce latency by optimizing memory access and taking advantage of Tensor Cores or sparsity acceleration.
Utilities such as TensorRT provide a wide range of optimized kernels for execution of transformer-based architectures. They can automatically do half precision (FP16) or, in certain cases, INT8 quantization. TensorRT also supports quantization-aware training and provides early support for hardware-accelerated sparsity.
The NVIDIA FasterTransformer library specializes in the inference of the transformer neural networks and can be used with models such as BERT or GPT-2/3. This library includes a tensor-parallel inference backend that provides the ability to do the inference of the huge GPT-3 models in parallel on multiple GPUs within the DGX A100 system. This enables you to reduce inference latency by as much as 1.2–3x, depending on model size. With FasterTransformer, you can deploy the largest of Megatron Models with a single line of code.
The Microsoft DeepSpeed library has a number of features focused on inference, including support for Mixture-of-Quantization (MoQ), high-performance INT8 kernels, or DeepFusion.
Thanks to all of those advances, large language models are no longer limited to academic research as they are making headway into commercial AI-based products.
Sizing the challenge
Correct sizing of the challenge is critical for the success of your NLP initiative. The amount of engineering and research staff needed as well as training and inference infrastructure significantly affects your business case. The following factors have significant impact on the overall cost of the development:
- Number of supported languages and dialects
- Number of applications being developed (named entity recognition, translation, and so on)
- Development schedule (to train faster, you need more compute)
- Requirements towards model accuracy (driven by dataset size and model size)
After the fundamental business questions are addressed, it is possible to estimate the effort and compute required for development. When you understand how good your model must be to enable the product or service, you can estimate the model size needed. The relationship between the performance of language models and the amount of data and model size is widely understood (Figure 9).
After you understand the size of the model and dataset that you need, you can estimate the amount of infrastructure required and training time. For more information, see Efficient Large-Scale Language Model Training on GPU Clusters. Furthermore, the scaling of large language models is superlinear, meaning that the training performance does not degrade with the increasing model size but actually increases (Figure 10).
Here are the key factors to consider for initial infrastructure sizing:
- Number of supported languages, dialects, and domain-specific variants
- Number of applications per language
- Project timelines
- Performance metrics and targets that must be achieved by models to create minimal viable products
- Number of data scientists, researchers, and engineers actively involved in training per application or language
- Minimal turnaround time and training frequencies, both for pretraining and fine-tuning
- Rough understanding of inference demand of the application, the average or maximal number of requests per day or hour and its seasonality
- Training pipeline performance, or how long a single training cycle takes and how far from theoretical maximum performance is your pipeline implementation (Figure 10 shows a Megatron GPT-3 based example)
Large models are the present and the future of NLP
Large language models have appealing properties and will help expand the availability of NLP around the globe. They are more performant across a wide range of NLP tasks, and they are also much more sample efficient. They are what are known as few-shot learners, and in certain ways they are easier to design, as their exact hyperparameter configuration seems unimportant in comparison to their size. As a consequence, NLP models are likely to continue to grow. I see empirical evidence justifying at least one, if not two, orders of magnitude of growth.
Fortunately, the technology to build and deploy them to production has matured considerably. The software required to train them has also matured considerably and is broadly available, such as the NVIDIA open source Megatron-based implementation of GPT-3. Quality is continuing to improve, driving down the training times. The infrastructure required to train models in this space is also well understood and commercially available (DGX SuperPOD). It is now possible to deploy the largest of NLP models to production using tools such as Triton Inference Server. As a consequence, big NLP models are in reach of everyone with the will to pursue them.
Summary
NVIDIA actively supports customers in the scoping and delivery of large training and inference systems, as well as supporting them in establishing NLP training capability. If you are working towards building your NLP capability, reach out to your local NVIDIA account team.
You can also join one of our Deep Learning Institute NLP classes. During the course, you learn how to work with modern NLP models, optimize them with TensorRT, and deploy them for cost-effective production with Triton Inference Server.
For more information, see any of the following NLP-related GTC presentations:
- Natural Language Processing: The New Opportunity [E32293]
- Megatron GPT-3 Large Model Inference with Triton and ONNX Runtime [S31578]
- Accelerating AI at-scale with Selene DGXA100 SuperPOD and Parallel Filesystem Storage [S31522]
- GatorTron: The Biggest Clinical Language Model [S32030]
- Designing and Optimizing Deep Neural Networks for High-Throughput and Low-Latency Production Deployment [E31970]
TL;DR
Combining Plotly's Dash, RAPIDS, and Datashader allows users to build visualization dashboards that both render datasets of 300 million+ rows and remain highly interactive, without the need for precomputed aggregations.
Using RAPIDS cuDF and Plotly Dash for real-time, interactive visual analytics on GPUs
Dash is an open-source framework from Plotly for building interactive web application-based dashboards using Python. In addition, the RAPIDS suite of open-source software (OSS) libraries gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. Combining these two projects enables real-time, interactive visual analytics of multi-gigabyte datasets, even on a single GPU.
This census visualization uses the Dash API for generating charts and their callback functions, while RAPIDS cuDF accelerates those callbacks for real-time aggregation and query operations.
Using a modified version of 2010 Census data combined with 2006-2010 American Community Survey data (sourced with permission from the fantastic IPUMS.org), we mapped every individual in the United States to a single point randomly located within the equivalent of a city block. Each person has unique demographic attributes associated with them, enabling fine-grained filtering and data discovery not previously possible. The code, installation details, and data caveats are publicly available on our GitHub.
Part 1: data preparation for the visualization
While not the most current dataset, the 2010 Census was chosen due to its high geospatial resolution, large size, and availability. With some modification, the final dataset of 308 million rows by seven columns (type int8) was large enough to dramatically illustrate the benefits of GPU acceleration.
Census 2010 SF1+ shape file data
Having decided to focus on the census dataset, the obvious starting point was the census.gov website, which includes numerous tabulation files for download. The most applicable one, Summary File 1, had a population count section with attributes including sex, age, race, and so on. However, this dataset is tabulated at the census-block level and not at the individual level (for various privacy reasons). The result was just 211,267 rows, one for each block, with sex, age, and race counts.
We chose to use the census-block boundary shapefiles to expand the row count to equal the population counts for all blocks, assign a random lat-long within each block's boundary, and create a unique row for each person (a sketch of this expansion follows below). The script to do this for each state can be found in Plotly-dash-rapids-census-demo. Switching to the more user-friendly dataset files (both SF1 and TIGER boundary files) provided by the data finder tool on the IPUMS NHGIS site sped up the process.
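A minimal sketch of that expansion step, assuming a hypothetical blocks table with a population count and a bounding box per block (the real script samples points within the actual block polygons rather than bounding boxes):

```python
import numpy as np
import pandas as pd

# Hypothetical per-block aggregates: a population count plus a bounding box.
blocks = pd.DataFrame({
    'block_id': [0, 1],
    'population': [35, 12],
    'min_lon': [-122.42, -122.41], 'max_lon': [-122.41, -122.40],
    'min_lat': [37.77, 37.77],     'max_lat': [37.78, 37.78],
})

# One row per person: repeat each block row by its population count...
people = blocks.loc[blocks.index.repeat(blocks['population'])]
people = people.reset_index(drop=True)

# ...then assign each person a random point inside the block's bounds.
rng = np.random.default_rng(42)
u, v = rng.random(len(people)), rng.random(len(people))
people['lon'] = people['min_lon'] + u * (people['max_lon'] - people['min_lon'])
people['lat'] = people['min_lat'] + v * (people['max_lat'] - people['min_lat'])
people = people[['block_id', 'lon', 'lat']]
```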
Throughout the data munging process, apart from double-checking that the aggregations per block matched, we used our own cuxfilter for rapid prototyping and visual accuracy checks. In this case, it was as simple as the following snippet to get an interactive geo-scatter plot for the full 308 million rows:

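A minimal cuxfilter sketch along those lines; the parquet path and the coordinate column names x and y are assumptions for illustration:

```python
import cudf
import cuxfilter
from cuxfilter import charts

# Load all 308M rows into GPU memory
df = cudf.read_parquet('census_data.parquet')   # hypothetical path

# Wrap the cuDF DataFrame and attach a Datashader-backed scatter chart
cux_df = cuxfilter.DataFrame.from_dataframe(df)
scatter = charts.scatter(x='x', y='y')          # assumed column names

dashboard = cux_df.dashboard([scatter])
dashboard.show()
```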
ACS 2006–2010 data
Curious if we could add more interesting attributes to cross-filter on, such as income, education, and class of worker, we added the 5-Year 2006–2010 American Community Survey (ACS) dataset. This dataset is aggregated over census block groups (one level larger than census blocks). Thus, we decided to take the aggregations over the block groups and arbitrarily distribute them over each individual while still maintaining block-group-level aggregate values (a sketch of this scheme follows the caveats below). The modified dataset included:
- Sex by Age.
- Sex by Educational Attainment for the Population 25 Years and Over.
- Sex by Earnings in the Past 12 Months (in 2010 Inflation-Adjusted Dollars) for the Population 16 Years and Over with Earnings in the Past 12 Months.
- Sex by Class of Worker for the Civilian Employed Population 16 Years and Over.
The common column is Sex, which is used to merge all datasets. However, while this approach provides additional interesting attributes to filter on, there are a few caveats as a result:
- Cross filtering geographically or on a single column will produce accurate counts for all the other columns.
- Cross filtering multiple non-geographic columns simultaneously will not necessarily produce realistic counts.
- The attributes associated with an individual are only statistical and do not reflect a real person. However, they are accurate when aggregated to the census-block-group level and greater.
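To make that distribution scheme concrete, here is a small sketch under assumed data: category counts for one block group are expanded into one label per person and shuffled, so any individual assignment is arbitrary while the block-group aggregate stays exact.

```python
import numpy as np
import pandas as pd

# Hypothetical block-group aggregate: education counts for 825 residents
agg = {'high_school': 420, 'college': 310, 'graduate': 95}

# One category label per person, shuffled so individual assignments are
# arbitrary but the block-group totals are preserved exactly.
labels = np.repeat(list(agg.keys()), list(agg.values()))
rng = np.random.default_rng(0)
rng.shuffle(labels)

people = pd.DataFrame({'person_id': np.arange(len(labels)),
                       'education': labels})
assert people['education'].value_counts()['college'] == 310
```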
The notebooks that execute the process can be found on plotly-dash-rapids-census-demo. The final dataset looks something like this:

Here is a quick cuxfilter dashboard to verify the dataset values:

Resource links
- Final Modified Dataset (~2.9 GB tar parquet file)
- All Data Prep Code (GitHub)
Part 2: building the interactive dashboard using Plotly Dash
Dash supports adding individual Plotly chart objects to a dashboard, along with individual callbacks for each object's figure, selection, and layout, using just Python (a minimal callback sketch follows the chart descriptions below). For example, the charts in the dashboard based on the dataset above are:
Scattermapbox: Population Distribution of Individuals
This chart consists of two layers:
- Scattermapbox layer.
- A Datashader-generated image rendered on top of it.

Chart update callbacks are triggered on:
- ‘relayoutData’ events (zoom in, zoom out, mouse pan), where the Datashader image is re-rendered for the new zoom level so that the resolution stays constant.
- Dropdown selection of ‘Color by’.
- Box selection on Education, Income, Class of Worker, and Age charts.
- Box selection on Map.
Bar charts: Education, Income, Class of Worker, Age

Chart update callbacks are triggered on:
- Box selection on Education, Income, Class of Worker, and Age charts.
- Box selection on Map.
- Dropdown selection of ‘Color by’.
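To illustrate the callback pattern, here is a minimal sketch of one such cross-filter: a box selection on the Education bar chart re-aggregates the Income bar chart on the GPU via cuDF. The parquet path and the education/income column names are assumptions, and the real dashboard wires up many more inputs and outputs:

```python
import cudf
import plotly.graph_objects as go
from dash import Dash, dcc, html, Input, Output

gdf = cudf.read_parquet('census_data.parquet')   # hypothetical path

def bar_figure(series):
    counts = series.value_counts().sort_index()  # aggregation runs on the GPU
    return go.Figure(go.Bar(x=counts.index.to_pandas(),
                            y=counts.to_pandas()))

app = Dash(__name__)
app.layout = html.Div([
    dcc.Graph(id='education-bar', figure=bar_figure(gdf['education'])),
    dcc.Graph(id='income-bar', figure=bar_figure(gdf['income'])),
])

@app.callback(Output('income-bar', 'figure'),
              Input('education-bar', 'selectedData'))
def update_income(selection):
    df = gdf
    if selection and selection.get('points'):
        # Box selection reports the selected bar categories as point x-values
        picked = [p['x'] for p in selection['points']]
        df = gdf[gdf['education'].isin(picked)]  # GPU filter with cuDF
    return bar_figure(df['income'])

if __name__ == '__main__':
    app.run_server(debug=True)
```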
Where GPUs fit and how they can help
Each chart in this dashboard benefits from GPU acceleration through cuDF. In GPU-accelerated mode, a filtering or zooming interaction generally takes around 0.2–2 seconds on a 24GB NVIDIA TITAN RTX. Running on a high-end CPU with 64GB of system memory, the same interaction generally takes 10–80 seconds. Typically, the cuDF GPU mode is over 20x faster than the pandas CPU mode per chart. That 20x difference is what transforms a reporting dashboard into an interactive visual analytics application.
Data visualization is an iterative design process

Even with a dataset as well documented and usable as the Census, learning about the data and making a visualization to interact with it effectively always seems to take longer than any preliminary look at the dataset might suggest.
As with all data visualization, the end result often depends on finding the appropriate balance between the data and charts you have available, the story you are trying to communicate, and the medium (and hardware) you are communicating through. For instance, we spent several iterations on column formats to ensure that GPU usage reliably stayed under our 24GB single-GPU limit while still allowing for smooth interaction between multiple charts.
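As an example of that kind of iteration, downcasting column dtypes is a quick way to check and shrink the on-GPU footprint. The path and the assumption that every attribute column fits in the int8 range are illustrative:

```python
import cudf

df = cudf.read_parquet('census_data.parquet')   # hypothetical path

# 308M rows x 7 columns is roughly 17 GB as int64 but only ~2 GB as int8,
# leaving headroom on a 24GB GPU for filtering and chart aggregations.
df = df.astype('int8')                          # assumes values fit in int8
print(df.memory_usage().sum() / 1e9, 'GB on the GPU')
```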
Working with data is complex, and working with large datasets makes it more so, but combining Plotly Dash with RAPIDS accelerates the capabilities of analysts and data scientists. These libraries let users work in a familiar environment and produce bigger, faster, and more interactive visualization applications that are ready for production out of the box, pushing the boundaries of traditional visual analytics into the realm of high-performance computing.
GTC Digital live webinar
Hear Josh Patterson discuss more on the RAPIDS and Plotly collaboration during the upcoming GTC Digital live webinar State of RAPIDS: Bridging the GPU Data Science Ecosystem [S22181] on May 28th at 9AM PDT.
Community and next steps
Plotly and RAPIDS thrive on their open source communities, and we want to see them grow together. In the future, we are looking towards tighter integration with Dash and robust multi-GPU support with dask_cudf.
Finally, we would like to see developers try out the tools and create awesome-looking demos on the many large-scale datasets out there, so let us know how RAPIDS made it possible to work with these datasets! Who knows, maybe we will have an even better 2020 Census visualization app ready in a year!
NVIDIA Inception
Plotly is also a Premier member of NVIDIA Inception, a virtual accelerator designed to support startups using GPUs for AI and data science applications. The program is open for all to apply and provides access to NVIDIA engineering support and hardware. If interested, apply here.
Helpful Links
- RAPIDS Plotly community webpage with demos
- RAPIDS Plotly Dash census demo GitHub repo
- A recent blog on Plotly Medium covering RAPIDS integration and partnership
This post was originally published on the RAPIDS AI blog here.