
Object Detection Dependencies take hours to install

I’m kind of a noob at TensorFlow; I’ve built a few classifiers, but that’s it. I’m now trying to use the TensorFlow Object Detection API, but the command to install its dependencies takes hours to finish.

This would be fine if it weren’t for the fact that I’m in a research setting that requires me to run in a Docker container. I could create a custom Docker image and solve it that way, but does anyone know how to make this process go faster?



QuickDraw – an online game developed by Google, combined with AirGesture – a simple gesture recognition application


Reducing the Computational Cost of Deep Reinforcement Learning Research

It is widely accepted that the enormous growth of deep reinforcement learning research, which combines traditional reinforcement learning with deep neural networks, began with the publication of the seminal DQN algorithm. This paper demonstrated the potential of this combination, showing that it could produce agents that could play a number of Atari 2600 games very effectively. Since then, there have been several approaches that have built on and improved the original DQN. The popular Rainbow algorithm combined a number of these recent advances to achieve state-of-the-art performance on the ALE benchmark. This advance, however, came at a very high computational cost, which has the unfortunate side effect of widening the gap between those with ample access to computational resources and those without.

In “Revisiting Rainbow: Promoting more Insightful and Inclusive Deep Reinforcement Learning Research”, to be presented at ICML 2021, we revisit this algorithm on a set of small- and medium-sized tasks. We first discuss the computational cost associated with the Rainbow algorithm. We explore how the same conclusions regarding the benefits of combining the various algorithmic components can be reached with smaller-scale experiments, and further generalize that idea to how research done on a smaller computational budget can provide valuable scientific insights.

The Cost of Rainbow
A major reason for the computational cost of Rainbow is that the standards in academic publishing often require evaluating new algorithms on large benchmarks like ALE, which consists of 57 Atari 2600 games that reinforcement learning agents may learn to play. For a typical game, it takes roughly five days to train a model using a Tesla P100 GPU. Furthermore, if one wants to establish meaningful confidence bounds, it is common to perform at least five independent runs. Thus, training Rainbow on the full suite of 57 games required around 34,200 GPU hours (57 games × 5 runs × 5 days ≈ 1,425 GPU-days) in order to provide convincing empirical performance statistics. In other words, such experiments are only feasible if one is able to train on multiple GPUs in parallel, which can be prohibitive for smaller research groups.

Revisiting Rainbow
As in the original Rainbow paper, we evaluate the effect of adding the following components to the original DQN algorithm: double Q-learning, prioritized experience replay, dueling networks, multi-step learning, distributional RL, and noisy nets.

We evaluate on a set of four classic control environments, which can be fully trained in 10-20 minutes (compared to five days for ALE games):

Upper left: In CartPole, the task is to balance a pole on a cart that the agent can move left and right. Upper right: In Acrobot, there are two arms and two joints, where the agent applies force to the joint between the two arms in order to raise the lower arm above a threshold. Lower left: In LunarLander, the agent is meant to land the spaceship between the two flags. Lower right: In MountainCar, the agent must build up momentum between two hills to drive to the top of the rightmost hill.

We investigated the effect of both independently adding each of the components to DQN, as well as removing each from the full Rainbow algorithm. As in the original Rainbow paper, we find that, in aggregate, the addition of each of these algorithms does improve learning over the base DQN. However, we also found some important differences, such as the fact that distributional RL — commonly thought to be a positive addition on its own — does not always yield improvements on its own. Indeed, in contrast to the ALE results in the Rainbow paper, in the classic control environments, distributional RL only yields an improvement when combined with another component.

Each plot shows the training progress when adding the various components to DQN. The x-axis is training steps, the y-axis is performance (higher is better).
Each plot shows the training progress when removing the various components from Rainbow. The x-axis is training steps, the y-axis is performance (higher is better).

We also re-ran the Rainbow experiments on the MinAtar environment, which consists of a set of five miniaturized Atari games, and found qualitatively similar results. The MinAtar games are roughly 10 times faster to train than the regular Atari 2600 games on which the original Rainbow algorithm was evaluated, but still share some interesting aspects, such as game dynamics and having pixel-based inputs to the agent. As such, they provide a challenging mid-level environment, in between the classic control and the full Atari 2600 games.

When viewed in aggregate, we find our results to be consistent with those of the original Rainbow paper — the impact resulting from each algorithmic component can vary from environment to environment. If we were to suggest a single agent that balances the tradeoffs of the different algorithmic components, our version of Rainbow would likely be consistent with the original, in that combining all components produces a better overall agent. However, there are important details in the variations of the different algorithmic components that merit a more thorough investigation.

Beyond the Rainbow
When DQN was introduced, it made use of the Huber loss and the RMSProp optimizer. It has been common practice for researchers to use these same choices when building on DQN, as most of their effort is spent on other algorithmic design decisions. In the spirit of reassessing these assumptions, we revisited the loss function and optimizer used by DQN on the lower-cost, small-scale classic control and MinAtar environments. We ran some initial experiments using the Adam optimizer, which has lately been the most popular optimizer choice, combined with a simpler loss function, the mean squared error (MSE). Since the selection of optimizer and loss function is often overlooked when developing a new algorithm, we were surprised to observe a dramatic improvement on all the classic control and MinAtar environments.
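
To make the comparison concrete, the following is a minimal sketch, not the code used in the paper, of where the loss and optimizer choice enters a standard DQN update in TensorFlow; the network definitions, replay-batch format, and learning rates are assumptions for illustration only.

import tensorflow as tf

def dqn_loss(q_net, target_net, batch, gamma=0.99, use_huber=True):
    """Standard one-step DQN TD loss; only the loss function is swapped between Huber and MSE."""
    states, actions, rewards, next_states, dones = batch
    # Q-values of the actions that were actually taken.
    q_taken = tf.gather(q_net(states), actions, batch_dims=1)
    # TD target: r + gamma * max_a' Q_target(s', a'), cut off at terminal states.
    next_q = tf.reduce_max(target_net(next_states), axis=1)
    targets = tf.stop_gradient(rewards + gamma * (1.0 - dones) * next_q)
    loss_fn = tf.keras.losses.Huber() if use_huber else tf.keras.losses.MeanSquaredError()
    return loss_fn(targets, q_taken)

# Default DQN setting vs. the alternative discussed above (learning rates are illustrative).
default_optimizer = tf.keras.optimizers.RMSprop(learning_rate=2.5e-4)
alternative_optimizer = tf.keras.optimizers.Adam(learning_rate=6.25e-5)

Passing use_huber=False and stepping with the Adam optimizer corresponds to the Adam+MSE configuration discussed above; everything else in the agent stays the same.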

We thus decided to evaluate the different ways of combining the two optimizers (RMSProp and Adam) with the two losses (Huber and MSE) on the full ALE suite (60 Atari 2600 games). We found that Adam+MSE is a superior combination to RMSProp+Huber.

Measuring the improvement Adam+MSE gives over the default DQN settings (RMSProp + Huber); higher is better.

Additionally, when comparing the various optimizer-loss combinations, we find that when using RMSProp, the Huber loss tends to perform better than MSE (illustrated by the gap between the solid and dotted orange lines).

Normalized scores aggregated over all 60 Atari 2600 games, comparing the different optimizer-loss combinations.

Conclusion
On a limited computational budget we were able to reproduce, at a high level, the findings of the Rainbow paper and uncover new and interesting phenomena. Evidently, it is much easier to revisit something than to discover it in the first place. Our intent with this work, however, was to argue for the relevance and significance of empirical research on small- and medium-scale environments. We believe that these less computationally intensive environments lend themselves well to a more critical and thorough analysis of the performance, behaviors, and intricacies of new algorithms.

We are by no means calling for less emphasis to be placed on large-scale benchmarks. We are simply urging researchers to consider smaller-scale environments as a valuable tool in their investigations, and reviewers to avoid dismissing empirical work that focuses on smaller-scale environments. By doing so, in addition to reducing the environmental impact of our experiments, we will both get a clearer picture of the research landscape and reduce the barriers for researchers from diverse and often under-resourced communities, which can only help make our community and scientific advances stronger.

Acknowledgments
Thank you to Johan, the first author of this paper, for his hard work and persistence in seeing this through! We would also like to thank Marlos C. Machado, Sara Hooker, Matthieu Geist, Nino Vieillard, Hado van Hasselt, Eleni Triantafillou, and Brian Tanner for their insightful comments on this work.


Metropolis Spotlight: Viisights Uses AI to Understand Behavior and Predict What Might Happen Next

Developers can use the new viisights wise application powered by GPU-accelerated AI technologies to help organizations and municipalities worldwide avoid operational hazards and risk in their most valuable spaces.


With conventional imaging and analytics solutions, AI vision is used to detect the presence of people and objects and their basic attributes to better understand how a physical space is being used. For example, is there a pedestrian, cyclist, or vehicle on a roadway? While understanding what’s happening on a road or in a retail store is important, predicting what might happen next is critical to taking this vision to the next level and building even smarter spaces. 

viisights, an NVIDIA Metropolis partner, helps cities and enterprises make better decisions faster by offering an AI-enabled platform that analyzes complex behavior within video content. viisights uses the NVIDIA DeepStream SDK to accelerate the development of its video analytics pipeline, cutting development time by half. DeepStream provides a path to building highly optimized video analysis pipelines, including real-time video decoding, multi-stream processing, and accelerated image transformations, all of which are critical for high-performance video analytics applications. For deployment, viisights uses NVIDIA TensorRT with INT8 calibration to further accelerate the application’s inference throughput at the edge.

The viisights Wise application recognizes and classifies over 200 different types of objects and dozens of different types of actions and events. The ability to recognize that a person is running, falling, or throwing an object provides a far richer and more holistic understanding of how spaces are being used. By using viisights Wise, operations teams can gain deeper insights and identify behavioral patterns that can predict what might happen next. viisights Wise can process over 20 video streams on a single GPU in near real-time.

viisights technology protects public privacy, as it only analyzes general behavior patterns of individuals and groups and does not identify specifics like faces or license plates. Because the application analyzes only behavior patterns, it is also straightforward to deploy and use, further increasing ROI.

Real-world situations and use cases for viisights Wise include: 

Hazard Detection: Detect operational hazards in real-time. These include large objects that might block human or vehicle mobility in both outdoor and indoor surroundings such as streets, sidewalks, production floors, and operational areas in transportation hubs. It can even detect environmental hazards like smoke or fire in outdoor areas where smoke detectors cannot be installed.

Traffic Management: Cities can make informed decisions on how to best modify roadways, traffic patterns, public transit infrastructure, and more. Real-time data also lets city traffic agencies take immediate action to mitigate traffic congestion and address other issues such as vehicles colliding, occupying restricted areas, driving the wrong direction, and blocking roads or junctions.

Crowd Insights: Large groups of people can behave in a wide variety of ways. viisights Wise can recognize the action of a crowd and determine whether it’s moving, running, gathering, or dispersing. This real-time detection of escalated behavior can help officials predict how quickly an event is growing and decide if there’s a reason for concern or a need to take action. In addition to analyzing behavior, viisights can identify three classifications of human groupings—“groups”, “gatherings”, and “crowds”—based on a user-specified size for each category. These user-configurable classifications reduce false alarms by configuring grouping alerts to suit the context. For example, a small group of six to eight people will be categorized as ordinary in the context of a playground but as a cause for concern in an airport setting. 

Group Density Determination: Determine group density (classified as “low”, “medium”, or “high”) and count people crossing a line or entering and exiting a zone. Cities can use this insight to enforce social distancing mandates during public health emergencies and pandemics. Retail stores or entertainment facilities like stadiums or cinemas can also use the technology to estimate queue wait times and determine spacing between people in queues to preserve public health. This helps businesses, cities, and authorities assess staff allocation and work more efficiently. Counting people can also be used in operational scenarios, such as at train stations, where train operators can improve train line scheduling in real time based on the number of people entering and exiting the trains.

Sense Body Movements: The AI-enabled video analytics solution detects when people fall or are lying on the floor, which can indicate medical distress, an accident, or even violence. Real-time behavioral recognition analytics can alert operations teams in situations where every second of latency matters.

Abnormal Behavior: Identify actions that warrant operational team attention, such as running where walking is normal (for example, department store aisles or shopping malls) or climbing or jumping over fences. It can also understand contextual loitering and detect whether individuals are spending excessive amounts of time near ATMs, parks, or building lobbies, or are moving irregularly from object to object (for example, from car door to car door in a parking garage). viisights behavioral analytics apply AI-driven contextual understanding to detect and identify these incidents as they unfold, issuing alerts in real time so they can be addressed accordingly.

As one example, viisights is helping to manage operations in the city of Eilat, one of Israel’s largest tourist destinations. The application has helped identify and manage events in densely populated entertainment areas and has helped maintain transportation safety. In one incident, a motorcycle was detected driving in parks where only bicycles are allowed, and the situation was quickly handled and safety maintained. In another example, the city of Ashdod is using viisights to manage traffic efficiently. viisights traffic management capabilities monitor several dozen intersections in real time for congestion and accidents. The data collected by viisights enables a quick response to ease traffic jams and helps improve traffic light scheduling at intersections. The value and impact of behavior analytics are significant across many industries.

Want to learn more about computer vision and video analytics? Click here to read more Metropolis Spotlight posts.


NVIDIA Research: DiSECt – A Differentiable Simulation Engine for Autonomous Robotic Cutting


Robotics researchers from NVIDIA and the University of Southern California presented DiSECt, the first differentiable simulator for robotic cutting, at the 2021 Robotics: Science and Systems (RSS) conference. The simulator accurately predicts the forces acting on a knife as it presses and slices through natural soft materials, such as fruits and vegetables.

Building robots that are intelligent and adaptive enough to generalize cutting behavior, whether with a kitchen butter knife or a surgical resection tool, remains a difficult problem for researchers.


As it turns out, cutting with feedback requires adapting to the stiffness of the object, the force applied during the cut, and often a sawing motion to cut through. To achieve this, researchers use a family of techniques that leverage feedback to guide controller adaptation. However, such controller adaptation requires very careful parameter tuning for each instance of the same problem. While these techniques are successful in industrial settings, no two cucumbers (or tomatoes) are the same, rendering this family of algorithms ineffective in a more generic setting.

In contrast, the recent focus in research has been on building differentiable algorithms for control problems, which in simpler terms means that the sensitivity of the output with respect to the input can be evaluated without excessive sampling. Efficient solutions for control problems are achievable when the simulated dynamics are differentiable [1,2,3], but the process of simulating cutting has not been differentiable so far!

Differentiable simulation for cutting poses a challenge, since cutting is naturally a discontinuous process in which crack formation and fracture propagation prohibit the calculation of gradients. We tackle this problem by proposing a novel way of simulating cutting that represents the process of crack propagation and damage mechanics in a continuous manner.

DiSECt implements the commonly used Finite Element Method (FEM) to simulate deformable materials, such as foodstuffs. The object to be cut is represented by a 3D mesh which consists of tetrahedral elements. Along the cutting surface we slice the mesh following the Virtual Node Algorithm [4]. This algorithm duplicates the mesh elements that intersect the cutting surface, and adds additional, so-called “virtual” vertices on the edges where these elements are cut. The virtual nodes add extra degrees of freedom to accurately simulate the contact dynamics of the knife when it presses and slices through the mesh.

Next, DiSECt inserts springs connecting the virtual nodes on either side of the cutting surface. These cutting springs allow us to simulate damage mechanics and crack propagation in a continuous manner, by weakening them in proportion to the contact force the knife exerts on the mesh. This continuous treatment allows us to differentiate through the dynamics in order to compute gradients for the parameters defining the properties of the material or the trajectory of the knife. For example, given the gradients for the vertical and sideways velocity of the knife, we can efficiently determine an energy-minimizing yet fast cutting motion through gradient-based optimization.

Video: Deformable objects are simulated via the Finite Element Method (FEM). To simulate crack propagation and damage, we insert springs (shown in cyan) between the cut elements of the mesh which are weakened as the knife exerts contact forces.

We leverage reverse-mode automatic differentiation to efficiently compute gradients for hundreds of simulation parameters. Our simulator uses source code transformation which automatically generates efficient CUDA kernels for the forward and backward passes of all our simulation routines, such as the FEM or contact model. Such an approach allows us to implement complex simulation routines which are parallelized on the GPU, while the gradients of the inputs with respect to the outputs of such routines are automatically derived from analyzing the abstract syntax tree (AST) of the simulation code.

Through gradient-based optimization algorithms, we can automatically tune the simulation parameters to achieve a close match between the simulator and real-world measurements. In one of our experiments, we leverage an existing dataset [5] of knife force profiles measured by a real-world robot while cutting various foodstuffs. We set up our simulator with the corresponding mesh and its material properties, and optimize the remaining parameters to reduce the discrepancy between the simulated and the real knife force profile. Within 150 gradient evaluations, our simulator closely predicts the knife force profile, as we demonstrate on the examples of cutting an actual apple and a potato. As shown in the figure, the initial parameter guess yielded a force profile that was far from the real observation, and our approach automatically found an accurate fit. We present further results in our accompanying paper that demonstrate that the found parameters generalize to different conditions, such as the downward velocity of the knife or the length of the reference trajectory window.
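
The following is a conceptual sketch of this calibration loop only, not of DiSECt itself: a made-up scalar spring-damage model stands in for the FEM simulator and the “measured” profile is synthetic, but it shows the pattern of fitting simulation parameters to a force measurement with reverse-mode automatic differentiation and a gradient-based optimizer (here Adam in PyTorch).

import torch

def simulated_force(stiffness, damage_rate, depths):
    """Toy differentiable cutting model: springs resist the knife in proportion to
    depth and are progressively weakened (continuous damage) as force accumulates."""
    forces, damage = [], torch.zeros(())
    for depth in depths:
        force = stiffness * (1.0 - damage) * depth
        damage = torch.clamp(damage + damage_rate * force, 0.0, 1.0)
        forces.append(force)
    return torch.stack(forces)

depths = torch.linspace(0.0, 0.02, 50)            # knife depth over time (m)
measured = 300.0 * depths * (1.0 - 5.0 * depths)  # synthetic stand-in for a real force profile

stiffness = torch.tensor(100.0, requires_grad=True)
damage_rate = torch.tensor(0.001, requires_grad=True)
optimizer = torch.optim.Adam([{"params": [stiffness], "lr": 1.0},
                              {"params": [damage_rate], "lr": 1e-4}])

for step in range(150):  # mirrors the roughly 150 gradient evaluations mentioned above
    optimizer.zero_grad()
    loss = torch.mean((simulated_force(stiffness, damage_rate, depths) - measured) ** 2)
    loss.backward()      # reverse-mode autodiff through the entire forward simulation
    optimizer.step()

In DiSECt, the same pattern is applied to hundreds of parameters of the FEM and contact models, with the forward and backward passes implemented as automatically generated CUDA kernels.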

Evolution of the knife force profile while the simulation parameters are calibrated via gradient-based optimization. The ground-truth trajectory has been obtained from a real force sensor attached to a robot cutting a real apple.

We also generate additional data using a highly established commercial simulator, which allows us to precisely control the experimental setup, such as object shape and material properties. Given such data, we can also leverage the motion of the mesh vertices as an additional ground-truth signal. After optimizing the simulation parameters, DiSECt is able to predict the vertex positions and velocities, as well as the force profile, much more accurately.

Aside from parameter inference, the gradients of our differentiable cutting simulator can also be used to optimize the cutting motion of the knife. In our full cutting simulation, we represent a trajectory by keyframes, where each frame prescribes the downward velocity, as well as the frequency and amplitude of a sinusoidal sideways velocity. At the start of the optimization, the initial motion is a straight downward-pressing motion.
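
As a rough illustration of this parameterization (the names, units, and values below are assumptions for illustration, not taken from the paper), each keyframe can be thought of as prescribing a downward velocity plus a sinusoidal sideways velocity:

import numpy as np

def knife_velocity(t, keyframes):
    """keyframes: list of dicts with keys 't', 'v_down', 'amplitude', and 'frequency'."""
    # Use the most recent keyframe whose start time has passed.
    active = max((k for k in keyframes if k["t"] <= t), key=lambda k: k["t"])
    v_down = active["v_down"]
    v_side = active["amplitude"] * np.sin(2.0 * np.pi * active["frequency"] * t)
    return v_down, float(v_side)

# Initial motion at the start of the optimization: straight downward pressing
# (zero sideways amplitude); the optimizer is free to change these values per keyframe.
initial_trajectory = [{"t": 0.0, "v_down": 0.05, "amplitude": 0.0, "frequency": 0.0}]
print(knife_velocity(0.5, initial_trajectory))  # (0.05, 0.0)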

We optimize this trajectory with the objective to minimize the mean force on the knife and penalize the time it takes to cut the object. Studies have shown that humans perform sawing motions when cutting biomaterials in order to reduce the required force. Such behavior emerges from our optimization as well.

After 50 iterations with the Adam optimizer, we see a reduction in average knife force by 15 percent. However, the knife slices sideways further than its blade length. Therefore, we add a hard constraint to keep the lateral motion within valid limits and perform constrained optimization. Thanks to the end-to-end differentiability of DiSECt, accurate gradients for such constraints are available and lead to a valid knife motion which requires only 0.3 percent more force than the unconstrained result.

Cutting food items multiple times results in slightly different force profiles for each instance, depending on the geometry of such materials. We additionally present results to transfer simulation parameters between different meshes corresponding to the same material. Our approach leverages optimal transport to find correspondences between simulation parameters of a source mesh and a target mesh (e.g., local stiffnesses) based on the location of the virtual nodes. As shown in the following figure, the 2D positions of these nodes along the cutting surface allow us to map simulation parameters (shown here is the softness of the cutting springs) to topologically different target geometries.

In our ongoing research, we are bringing our differentiable simulation approach to real-world robotic cutting. We investigate a closed-loop control system where the simulator is updated online from force measurements, while the robot is cutting foodstuffs. Through model-predictive planning and optimal control, we aim to find time- and energy-efficient cutting actions that apply to the physical system.

We thank Yan-Bin Jia and Prajjwal Jamdagni for kindly providing us a dataset of real-world cutting trajectories which we used throughout our experiments. Be sure to also check out their research on robotic cutting!

DiSECt is a finalist at 2021 RSS for the best (student) paper. Visit the team’s project webpage to learn more.

DiSECt Research Paper: DiSECt: A Differentiable Simulation Engine for Autonomous Robotic Cutting
Eric Heiden, Miles Macklin, Yashraj S Narang, Dieter Fox, Animesh Garg, Fabio Ramos
Robotics: Science and Systems (RSS) 2021.


Converting pytorch weights to tensorflow weights for Conv2d


Hello. I’m converting weights from a PyTorch model to a TF2 model. To verify the result, I export the feature maps after the conv2d layer for both the original PyTorch model and the TF2 model. However, the feature map from the TF2 model has black bars on the top and left-hand side of the map, whereas the PyTorch feature map does not have these black bars.

https://preview.redd.it/hyo378nslxa71.png?width=961&format=png&auto=webp&s=34613e37989d3c13fb2dd2c89d28b45df845b3a1

The PyTorch conv2d code is:

nn.Conv2d(1, 56, kernel_size=5, padding=5//2) 

The TensorFlow layer is:

tensorflow.keras.layers.Conv2D(56, 5, padding="same")(inp) 

Following the advice in https://stackoverflow.com/questions/68165375/manualy-convert-pytorch-weights-to-tf-keras-weights-for-convolutional-layer#, I even added a ZeroPadding2D layer to the TensorFlow model:

x = tensorflow.keras.layers.ZeroPadding2D(padding=(2, 2))(inp)
x2 = tensorflow.keras.layers.Conv2D(56, 5, padding="valid")(x)

However, the black bar is still there. I have no clue what is causing this problem > <. The process for converting the PyTorch weights to TensorFlow weights is as follows:

onnx_1_w_num = onnx_l.weight.data.permute(2,3,1,0).numpy()
onnx_1_b_num = onnx_l.bias.data.numpy()
tf_l.set_weights([onnx_1_w_num, onnx_1_b_num])



Guide to Reinforcement Learning with Python and TensorFlow


Building Medical 3D Image Segmentation Using Jupyter Notebooks from the NGC Catalog


The NVIDIA NGC team is hosting a webinar with live Q&A to dive into this Jupyter notebook available from the NGC catalog. Learn how to use these resources to kickstart your AI journey. Register now: NVIDIA NGC Jupyter Notebook Day: Medical Imaging Segmentation.

Image segmentation partitions a digital image into multiple segments by changing the representation into something more meaningful and easier to analyze. In the field of medical imaging, image segmentation can be used to help identify organs and anomalies, measure them, classify them, and even uncover diagnostic information. It does this by using data gathered from x-rays, magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and other formats.

To achieve state-of-the-art models that deliver the desired accuracy and performance for a use case, you must set up the right environment, train with the ideal hyperparameters, and optimize it to achieve the desired accuracy. All of this can be time-consuming. Data scientists and developers need the right set of tools to quickly overcome tedious tasks. That’s why we built the NGC catalog.

The NGC catalog is a hub of GPU-optimized AI and HPC applications and tools. NGC provides easy access to performance-optimized containers, shortens model development time with pretrained models, and provides industry-specific SDKs to help build complete AI solutions and speed up AI workflows. These diverse assets can be used for a variety of use cases, ranging from computer vision and speech recognition to language understanding. The potential solutions span industries such as automotive, healthcare, manufacturing, and retail.

Figure 1. The NGC catalog, a hub for GPU-optimized AI and HPC containers, pretrained models, and industry SDKs.

3D medical image segmentation with U-Net

In this post, we show how you can use the Medical 3D Image Segmentation notebook to predict brain tumors in MRI images. This post is suitable for anyone who is new to AI and has a particular interest in image segmentation as it applies to medical imaging. 3D U-Net enables the seamless segmentation of 3D volumes, with high accuracy and performance. It can be adapted to solve many different segmentation problems.

Figure 2. Overview of a typical U-Net model, consisting of an encoder (left) and a decoder (right).

Figure 2 shows that 3D U-Net consists of a contractive (left) and expanding (right) path. It repeatedly applies unpadded convolutions followed by max pooling for downsampling.

In deep learning, a convolutional neural network (CNN) is a subset of deep neural networks, mostly used in image recognition and image processing. CNNs use deep learning to perform both generative and descriptive tasks, often using machine vision along with recommender systems and natural language processing.

Padding in CNNs refers to the number of pixels added to an image when it is processed by the kernel of a CNN. Unpadded CNNs means that no pixels are added to the image.

Pooling is a downsampling approach in CNNs. Max pooling is one of the most common pooling methods; it summarizes the most activated presence of a feature. Every step in the expanding path consists of an upsampling of the feature map and a concatenation with the correspondingly cropped feature map from the contractive path.
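
The following small example (not part of the notebook; the shapes and filter counts are arbitrary) illustrates the difference between padded and unpadded convolutions and the effect of max pooling on a 3D volume in Keras:

import tensorflow as tf

x = tf.random.normal((1, 64, 64, 64, 1))                       # one 3D volume with a single channel
conv_valid = tf.keras.layers.Conv3D(8, 3, padding="valid")(x)  # unpadded: each spatial dim 64 -> 62
conv_same = tf.keras.layers.Conv3D(8, 3, padding="same")(x)    # padded: spatial dims stay at 64
pooled = tf.keras.layers.MaxPool3D(pool_size=2)(conv_valid)    # max pooling downsamples: 62 -> 31
print(conv_valid.shape, conv_same.shape, pooled.shape)
# (1, 62, 62, 62, 8) (1, 64, 64, 64, 8) (1, 31, 31, 31, 8)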

Requirements

This resource contains a Dockerfile that extends the TensorFlow NGC container and encapsulates some dependencies. The resource can be downloaded using the following command:

wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/unet3d_medical_for_tensorflow/versions/20.06.0/zip -O unet3d_medical_for_tensorflow_20.06.0.zip

Aside from these dependencies, you also need the following components:

To train your model using mixed or TF32 precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the 3D U-Net model on the Brain Tumor Segmentation 2019 dataset.

Download the resource

Download the resource manually by clicking the three dots at the top-right corner of the resource page.

Figure 3. Resource landing page for the 3D U-Net resource on the NGC catalog, listing the pull instructions, setup documentation, and performance metrics for the model.

You could also use the following wget command:

wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/unet3d_medical_for_tensorflow/versions/20.06.0/zip -O unet3d_medical_for_tensorflow_20.06.0.zip

Build the U-Net TensorFlow NGC container

The following command uses the Dockerfile to create a Docker image named unet3d_tf, downloading all the required components automatically:

docker build -t unet3d_tf .

Download the dataset

The data can be obtained by registering on the Brain Tumor Segmentation 2019 dataset website. It should be downloaded and placed in the directory that is mounted to /data in the container.

Run the container

To start an interactive session in the NGC container to run preprocessing, training, and inference, run the following command. It launches the container and mounts the ./data directory as a volume to the /data directory inside the container, and the ./results directory to the /results directory in the container.

The advantage of using a container is that it packages all the necessary libraries and dependencies into a single, isolated environment. This way you don’t have to worry about the complex install process.

mkdir data
mkdir results
  
docker run --runtime=nvidia -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm --ipc=host -v ${PWD}/data:/data -v ${PWD}/results:/results  -p 8888:8888 unet3d_tf:latest /bin/bash 

Start the notebook inside the container

Use this command to start a Jupyter notebook inside the container:

jupyter notebook --ip 0.0.0.0 --port 8888 --allow-root

Move the dataset to the /data directory inside the container. Download the notebook with the following command:

 wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/med_3dunet/versions/1/zip -O med_3dunet_1.zip

Then, upload the downloaded notebook into JupyterLab and run the cells of the notebook to preprocess the dataset and train, benchmark, and test the model.

Jupyter notebook

By running the cells of this Jupyter notebook, you can first check the downloaded dataset and see the brain tumor images. After that, see the data preprocessing command and prepare the data for training. The next step is training the model and using the checkpoints of the training process for the predicting step. Finally, check the output of the predict function visually.

Checking the images

To check the dataset, you can use nibabel, which is a package that provides read/write access to some common medical and neuroimaging file formats.

By running the next three cells, you install nibabel with pip, choose an image from the dataset, and plot three orthogonal planes of the chosen image using matplotlib. You can check other dataset images by changing the image path in the code.

import nibabel as nib
import matplotlib.pyplot as plt

# Load one FLAIR volume from the BraTS 2019 training set as a NumPy array.
img_arr = nib.load('/data/MICCAI_BraTS_2019_Data_Training/HGG/BraTS19_2013_10_1/BraTS19_2013_10_1_flair.nii.gz').get_data()

def show_plane(ax, plane, cmap="gray", title=None):
    # Display a single 2D slice without axis ticks.
    ax.imshow(plane, cmap=cmap)
    ax.axis("off")
    if title:
        ax.set_title(title)

# Plot the middle slice along each of the three axes.
(n_plane, n_row, n_col) = img_arr.shape
_, (a, b, c) = plt.subplots(ncols=3, figsize=(15, 5))

show_plane(a, img_arr[n_plane // 2], title=f'Plane = {n_plane // 2}')
show_plane(b, img_arr[:, n_row // 2, :], title=f'Row = {n_row // 2}')
show_plane(c, img_arr[:, :, n_col // 2], title=f'Column = {n_col // 2}')

The result is something like Figure 4.

Figure 4. A sample output showing three slices of the brain and tumor from an image in the dataset included with the Jupyter notebook.

Data preprocessing

The dataset/preprocess_data.py script converts the raw data into the TFRecord format used for training and evaluation. This dataset, from the 2019 BraTS challenge, contains over 3 TB of multi-institutional, routine, clinically acquired, preoperative, multimodal MRI scans of glioblastoma (GBM/HGG) and lower-grade glioma (LGG), with pathologically confirmed diagnoses. When available, overall survival (OS) data for the patient is also included. The data is structured into training, validation, and testing datasets.

The images are in nii.gz format. NIfTI is a file format for neuroimaging. You can preprocess the downloaded dataset by running the following command:

python dataset/preprocess_data.py -i /data/MICCAI_BraTS_2019_Data_Training -o /data/preprocessed -v

The final format of the processed images is TFRecord. To help you read data efficiently, serialize your data and store it in a set of files (~100 to 200 MB each) that can each be read linearly. This is especially important if the data is being streamed over a network, and it can also be useful for caching any data preprocessing. The TFRecord format is a simple format for storing a sequence of binary records, which speeds up the data loading process considerably.
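
As a minimal illustration of the format (this snippet is not from the notebook, and the feature name is made up), a TFRecord file is just a sequence of serialized protocol-buffer records that can be written and streamed back like this:

import tensorflow as tf

# Write a few serialized records to a single TFRecord file.
with tf.io.TFRecordWriter("/tmp/example.tfrecord") as writer:
    for i in range(3):
        example = tf.train.Example(features=tf.train.Features(feature={
            "volume_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[i])),
        }))
        writer.write(example.SerializeToString())

# Read the records back as a streaming tf.data pipeline.
dataset = tf.data.TFRecordDataset("/tmp/example.tfrecord")
for raw_record in dataset:
    parsed = tf.io.parse_single_example(
        raw_record, {"volume_id": tf.io.FixedLenFeature([], tf.int64)})
    print(parsed["volume_id"].numpy())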

Training using default parameters

After the Docker container is launched, you can start the training of a single fold (fold 0) with the default hyperparameters (for example, {1 to 8} GPUs {TF-AMP/FP32/TF32}):

bash examples/unet3d_train_single{_TF-AMP}.sh

For example, to run with 32-bit precision (FP32 or TF32) with batch size 2 on one GPU, run the following command:

bash examples/unet3d_train_single.sh 1 /data/preprocessed /results 2

To train a single fold with mixed precision (TF-AMP) with on eight GPUs and batch size 2 per GPU, run the following command:

bash examples/unet3d_train_single_TF-AMP.sh 8 /data/preprocessed /results 2

Training performance benchmarking

The training performance can be evaluated by running benchmarking scripts:

bash examples/unet3d_{train,infer}_benchmark{_TF-AMP}.sh

This script makes the model run and reports the performance. For example, to benchmark training with TF-AMP with batch size 2 on four GPUs, run the following command:

bash examples/unet3d_train_benchmark_TF-AMP.sh 4 /data/preprocessed /results 2

Predict

You can test the model by running predict as the execution mode on the test dataset. The result is saved in the model_dir directory, and data_dir is the path to the dataset:

python main.py --model_dir /results --exec_mode predict --data_dir /data/preprocessed_test

Plotting the result of prediction

In the following code example, you plot one of the chosen results from the results folder:

import numpy as np
import matplotlib.pyplot as plt

# Load one predicted volume from the results folder.
data = np.load('/results/vol_0.npy')

def show_plane(ax, plane, cmap="gray", title=None):
    # Display a single 2D slice without axis ticks.
    ax.imshow(plane, cmap=cmap)
    ax.axis("off")
    if title:
        ax.set_title(title)

# Plot the middle slice along each of the three axes.
(n_plane, n_row, n_col) = data.shape
_, (a, b, c) = plt.subplots(ncols=3, figsize=(15, 5))
show_plane(a, data[n_plane // 2], title=f'Plane = {n_plane // 2}')
show_plane(b, data[:, n_row // 2, :], title=f'Row = {n_row // 2}')
show_plane(c, data[:, :, n_col // 2], title=f'Column = {n_col // 2}')

Figure 5. The results predicted by the model, corresponding to the actual images shown in Figure 4; the predicted tumor location and size are shown in white.

Advanced options

For those of you looking to explore the advanced features built into this notebook, you can see the full list of available options for main.py using -h or --help. By running the next cell, you can see how to change the execution mode and other parameters of this script. You can use this script to perform model training, prediction, evaluation, and inference with customized hyperparameters.

python main.py --help

The main.py parameters can be changed to perform different tasks, including training, evaluation, and prediction.

You can also train the model using default hyperparameters. By running the python main.py --help command, you can see the list of arguments that you can change, including training hyperparameters. For example, in training mode, you can change the learning rate from the default 0.0002 to 0.001 and the training steps from 16000 to 1000 using the following command:

python main.py --model_dir /results --exec_mode train --data_dir /data/preprocessed_test --learning_rate 0.001 --max_steps 1000

You can run other execution modes available in main.py. For example, in this post, we used the predict execution mode of main.py by running the following command:

python main.py --model_dir /results --exec_mode predict --data_dir /data/preprocessed_test

Summary and next steps

In this post, we showed how you can get started with a medical imaging model using a simple Jupyter notebook from the NGC catalog. As you make the transition from this Jupyter notebook to building your own medical imaging workflows, consider using NVIDIA Clara Train. Clara Train includes AI-Assisted Annotation APIs and an Annotation server that can be seamlessly integrated into any medical viewer, making it AI-capable. The training framework includes decentralized learning techniques, such as federated learning and transfer learning, for your AI workflows.

To learn how to use these resources and kickstart your AI journey, register for the upcoming webinar with live Q&A, NVIDIA NGC Jupyter Notebook Day: Medical Imaging Segmentation.


AutoX Unveils Full Self-Driving System Powered by NVIDIA DRIVE

Your robotaxi is arriving soon. Self-driving startup AutoX took the wraps off its full self-driving system, dubbed “Gen5”, this week. The autonomous driving platform, which is specifically designed for robotaxis, uses NVIDIA DRIVE automotive-grade GPUs to reach up to 2,200 trillion operations per second (TOPS) of AI compute performance.



Applying Natural Language Processing Across the World’s Languages

Despite unprecedented progress in NLP, many state-of-the-art models are available in English only. NVIDIA has developed tools to enable the development of even the largest language models. This post describes the challenges associated with building and deploying large scale models and the solutions to enable global organizations to achieve NLP leadership in their regions.

Despite substantial progress in natural language processing (NLP) research over the last two years and its commercial success, little effort has been devoted to adapting this capability to other significant languages, such as Hindi, Arabic, Portuguese, or Spanish. Obviously, catering to the entire human population with more than 6,500 languages is challenging. At the same time, supporting just 40 languages addresses the NLP needs of more than 60% of the human population.

Figure 1. Top 40 languages by total number of native speakers (percentage of world population)

Figure 2 shows that, even across the most frequently used languages, the performance of language models varies tremendously. Bear in mind that this comparison is not perfect, as those languages do have different language entropy. More importantly, the research on the most capable large-scale language models seems to be limited to only a handful of high-resource languages (languages with a large number of publicly available documents), such as English or Chinese.

Figure 2. Substantial differences in the performance of NLP models across languages

The situation is even more complex when you account for domain-specific languages (such as medical, technical, or legal jargon), where besides English only a few high-quality models exist. This is regrettable, as those domain-specific language models are currently transforming the way that clinicians, engineers, researchers, and other experts access information.

  • Models such as GatorTron are changing the way that clinicians access medical notes.
  • Models such as BioMegatron are changing the way that medical researchers identify drug or chemical interactions affecting the human body.
  • Models such as SciBERT are changing the way that engineers find and interact with scientific information.

Unfortunately, there is a limited number of equivalent models outside of English. Fortunately, replicating the success of English-language models across other languages is no longer a research task but predominantly an engineering activity. It no longer requires inventing new models and training approaches, but instead systematic and iterative dataset engineering, model training, and its continuous validation. 

This does not mean that engineering those models is trivial. First, because of the model and dataset sizes used in modern NLP, the training process requires a substantial amount of computing power. Secondly, to use large models, you must collect large textual datasets. Thirdly, because of the sheer size of the models used, new approaches to training and inference are required.

NVIDIA has extensive experience not only in building large-scale language models (ranging from 1 billion to 175 billion parameters) but also in deploying them to production. The goal of this post is to share our knowledge around project organization, infrastructure requirements, and budgeting and to support projects in this area.

Dawn of large models

As hypothesized in Deep Learning Scaling is Predictable, Empirically, the NLP model performance seems to follow a power law with respect to both the model size and the volume of data used for training. As you make models and datasets bigger, the performance continues to improve. The following diagram from Scaling Laws for Neural Language Models demonstrates not only that this relationship holds but more importantly, it holds across nine orders of magnitude of compute.

Figure 3. Deep learning scaling law (left) and NLP scaling law (right); even with very large models there is no evidence of approaching an irreducible error region.
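
In its simplest form, that scaling-laws work expresses the relationship as a power law; the exact fitted constants depend on the setup, so only the general form is reproduced here:

L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N}

where L is the test loss, N is the number of model parameters, and N_c and \alpha_N are empirically fitted constants; analogous power laws hold for dataset size and training compute.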

In the NLP scaling law, despite the models at the far right reaching as much as 175 billion parameters (more than 500 times larger than BERT Large), this relationship shows no signs of stopping. This suggests that even further improvement can be expected from larger models. Indeed, Switch Transformers, when scaled to 1.6 trillion parameters (roughly 5,000x larger than BERT Large), continue to demonstrate the previously mentioned behavior.

More importantly, the large NLP models seem to generate much more robust features capable of solving complex problems even without large-scale fine-tuning datasets. Figure 4 shows this capability across three orders of magnitude of models.

Figure 4. Large language models are few-shot learners; larger models are more efficient at learning and produce better features.

Due to this capability, and despite the relatively high cost of their development, large NLP models are likely to not only continue to dominate the NLP processing landscape but also continue to grow, at least by another order of magnitude approaching trillions of parameters.

This relationship between model size, dataset size, and model performance is not unique to NLP. I see the same behavior in automatic speech recognition and computer vision models, and across many other disciplines that are the backbone of conversational AI.

At the same time, a limited amount of work has been devoted to the development of both large-scale datasets and models for other languages. Indeed, the majority of the work that focuses on languages other than English takes advantage of smaller models and less curated datasets, for example, NLP models trained on subsets of general datasets such as raw Common Crawl.

Even less effort is devoted to supporting any of the following:

  • More localized models, such as Arabic dialects and variants or Mandarin subgroups and dialects.
  • Domain-specific models, such as those used for clinical note taking, scientific publications, financial report analysis, interpretation of engineering documents, parsing through legal texts, or even the variants of language used on platforms such as Twitter.

The current status creates an opportunity for local companies willing to invest in model training to lead the development of NLP technologies in the region.

Building your local language model

Building large-scale language models is not trivial for many reasons. First, the large-scale datasets are not trivial to curate; in raw format, the data is actually quite easy to obtain, but cleaning and curating it is not. Second, the infrastructure required to train these huge models requires substantial systems knowledge to set up. Finally, they require extensive research expertise to train and optimize.

What is less widely understood is that training such large models requires software engineering effort. Most interesting models are larger than the memory capacity of not only individual GPUs but also of many multi-GPU servers. The number of mathematical operations required to train them can also make training times unmanageable, measured in months even on fairly sizable systems. Approaches such as model and pipeline parallelism overcome some of those challenges. However, applying them in a naive way could lead to scaling issues, exacerbating an already long training time. 

Together with organizations such as Microsoft and Stanford University, NVIDIA has worked towards developing tools that streamline the development process of the largest language models and provide computational efficiency and scalability to allow cost-effective training. As a consequence, a wide range of tools abstracting the complexity of large model development are now available, including NVIDIA Megatron-LM and Microsoft DeepSpeed.

As a result of those efforts, I’ve seen a substantial reduction in the training times of large models. Indeed, the GPT-3 model with 175 billion parameters, trained on 300 billion tokens using 1024 NVIDIA A100 Tensor Core GPUs, can be trained today in 34 days (as shown in Efficient Large-Scale Language Model Training on GPU Clusters). Based on experimentation, NVIDIA estimates that a 1 trillion parameter model could be trained in approximately 84 days with 3072 A100 GPUs. Despite the training cost of those models being high, it is not beyond the reach of most large organizations. With further software advances, it is likely to decrease further.
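
As a back-of-the-envelope check of what those figures imply in raw GPU time (simple arithmetic on the numbers quoted above, not a cost model):

gpt3_gpu_hours = 1024 * 34 * 24        # 1024 A100 GPUs for 34 days
trillion_gpu_hours = 3072 * 84 * 24    # 3072 A100 GPUs for roughly 84 days
print(f"GPT-3 175B:            ~{gpt3_gpu_hours:,} A100 GPU-hours")
print(f"1 trillion parameters: ~{trillion_gpu_hours:,} A100 GPU-hours")
# GPT-3 175B:            ~835,584 A100 GPU-hours
# 1 trillion parameters: ~6,193,152 A100 GPU-hours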

Because the development of large language models requires scalable infrastructure, NVIDIA has also consolidated knowledge from building the internal Selene cluster (used for internal NLP research and to deliver record-breaking performance in MLPerf training and inference benchmarks) into a fully packaged product called NVIDIA DGX SuperPOD. This cluster is more than just a system reference design. In fact, it can be bought in its entirety, together with software and the support of NVIDIA data scientists and applied researchers, similar to the NLP-focused deployment by Naver Clova.

Such an approach has already had a substantial impact on the NLP landscape, as it enables organizations with extensive NLP expertise to scale out their efforts fast. More importantly, it enables organizations with limited systems, HPC, or large-scale NLP workload expertise to start iterating in weeks, rather than months or years. 

Figure 5. Some of the key NVIDIA DGX-based clusters and their ranks on the supercomputing Top 500 list; 14 systems in the Top 500 are built on NVIDIA DGX SuperPOD.

Production deployment

The ability to build large language models is just an academic achievement if it’s impossible to take advantage of the results of your work by deploying your models to production. The challenge of deploying models such as GPT-3 stems from their sheer size, which exceeds the memory capacity of a single GPU, and from their computational complexity. Both factors contribute to decreased throughput and high inference latency. This is a widely understood problem, and a range of tools and solutions currently exist to make serving the largest language models simple and cost-effective.

  • NVIDIA Triton Inference Server
  • NVIDIA TensorRT and other model optimization techniques

NVIDIA Triton Inference Server

NVIDIA Triton Inference Server is an open-source cloud and edge inferencing solution optimized for both CPUs and GPUs. It can be used to host distributed models effectively. To deploy a large model using pipeline parallelism, the model must be split into several parts, for example, by manipulating the ONNX graph with tools such as ONNX Graph Surgeon. Each of the parts must be small enough to fit into the memory space of a single GPU.

Figure 6. Large language models can be partitioned using tools such as ONNX Graph Surgeon and further optimized with TensorRT

After the model is subdivided, it can be distributed across multiple GPUs without the need to develop any code. You create an NVIDIA Triton YAML configuration file defining how individual parts of the model should be connected.

Figure 7. Individual parts of the graph can be brought together into a single inference pipeline with Triton Inference Server.

The traffic between individual model parts and their load balancing can be managed automatically by Triton Inference Server. The communication overheads are also kept to a minimum, as Triton takes advantage of the latest NVIDIA NVSwitch and third-generation NVIDIA NVLink technology, providing 600 GB/sec of GPU-to-GPU direct bandwidth, which is 10X higher than PCIe Gen4. This means that you can efficiently deploy not only medium-scale models with billions of parameters but even the largest models, such as the 175-billion-parameter GPT-3, and future models with trillions of parameters.

Figure 8. Inference performance of GPT-3 with 18B parameters on DGX A100 using Triton, with and without TensorRT optimization.

For more information, see Megatron GPT-3 Large Model Inference with Triton and ONNX Runtime (GTC21 session).

NVIDIA TensorRT and other model optimization techniques

Beyond the ability to host trained large models, it is important to look at optimization techniques. Such techniques can reduce the memory footprint of the models through quantization and pruning, substantially accelerate execution, and reduce latency by optimizing memory access and taking advantage of Tensor Cores or sparsity acceleration.

Utilities such as TensorRT provide a wide range of optimized kernels for execution of transformer-based architectures. They can automatically do half precision (FP16) or, in certain cases, INT8 quantization. TensorRT also supports quantization-aware training and provides early support for hardware-accelerated sparsity.

The NVIDIA FasterTransformer library specializes in the inference of transformer neural networks and can be used with models such as BERT or GPT-2/3. This library includes a tensor-parallel inference backend that provides the ability to run inference of the huge GPT-3 models in parallel on multiple GPUs within the DGX A100 system. This enables you to reduce inference latency by as much as 1.2–3x, depending on model size. With FasterTransformer, you can deploy the largest Megatron models with a single line of code.

The Microsoft DeepSpeed library has a number of features focused on inference, including support for Mixture-of-Quantization (MoQ), high-performance INT8 kernels, or DeepFusion.

Thanks to all of those advances, large language models are no longer limited to academic research as they are making headway into commercial AI-based products.

Sizing the challenge

Correct sizing of the challenge is critical for the success of your NLP initiative. The amount of engineering and research staff needed as well as training and inference infrastructure significantly affects your business case. The following factors have significant impact on the overall cost of the development:

  • Number of supported languages and dialects
  • Number of applications being developed (named entity recognition, translation, and so on)
  • Development schedule (to train faster, you need more compute)
  • Requirements towards model accuracy (driven by dataset size and model size)

After the fundamental business questions are addressed, it is possible to estimate the effort and compute required for development. When you understand how good your model must be to enable the product or service, you can estimate the model size needed. The relationship between the performance of language models and the amount of data and model size is widely understood (Figure 9).

Figure 9. Overview of the methodology that can be used to establish the sizing of datasets and models required to reach a given model performance.

After you understand the size of the model and dataset that you need, you can estimate the amount of infrastructure required and training time. For more information, see Efficient Large-Scale Language Model Training on GPU Clusters. Furthermore, the scaling of large language models is superlinear, meaning that the training performance does not degrade with the increasing model size but actually increases (Figure 10).

Figure 10. Scaling of the training of large language models (in this case, GPT-3) is superlinear; training performance does not degrade with increasing model size but actually increases, opening a route to even larger NLP models.

Here are the key factors to consider for initial infrastructure sizing:

  • Number of supported languages, dialects, and domain-specific variants
  • Number of applications per language
  • Project timelines
  • Performance metrics and targets that must be achieved by models to create minimal viable products
  • Number of data scientists, researchers, and engineers actively involved in training per application or language
  • Minimal turnaround time and training frequencies, both for pretraining and fine-tuning
  • Rough understanding of inference demand of the application, the average or maximal number of requests per day or hour and its seasonality
  • Training pipeline performance, or how long a single training cycle takes and how far from theoretical maximum performance is your pipeline implementation (Figure 10 shows a Megatron GPT-3 based example) 

Large models are the present and the future of NLP

Large language models have appealing properties and will help expand the availability of NLP around the globe. They are not only more performant across a wide range of NLP tasks, but they are also much more sample efficient. They are what are known as few-shot learners, and in certain ways they are easier to design, as their exact hyperparameter configuration seems unimportant in comparison to their size. As a consequence, NLP models are likely to continue to grow. I see empirical evidence justifying at least one, if not two, orders of magnitude of growth.

Fortunately, the technology to build and deploy them to production has matured considerably. The software required to train them has also matured considerably and is broadly available, such as the NVIDIA open source Megatron-based implementation of GPT-3. Quality is continuing to improve, driving down the training times. The infrastructure required to train models in this space is also well understood and commercially available (DGX SuperPOD). It is now possible to deploy the largest of NLP models to production using tools such as Triton Inference Server.  As a consequence, big NLP models are in reach of everyone with the will to pursue them.

Summary

NVIDIA actively supports customers in the scoping and delivery of large training and inference systems, as well as supporting them in establishing NLP training capability. If you are working towards building your NLP capability, reach out to your local NVIDIA account team. 

You can also join one of our Deep Learning Institute NLP classes. During the course, you learn how to work with modern NLP models, optimize them with TensorRT, and deploy them for cost-effective production with Triton Inference Server.

For more information, see any of the following NLP-related GTC presentations: