We currently have a decent machine learning pipeline going, but new projects and demands are coming and we’re now lacking the computational power to run everything in time. Hardware prices have skyrocketed since the beginning of the pandemic, so a local upgrade would be very expensive, and we’re looking for alternatives.
Can anyone suggest some of these cloud GPU services and comment on your experience with them? We’re running trainings 24/7 on our local machines and more “horsepower” would be appreciated. We’re looking into Paperspace Gradient (it seems a bit of overkill, but still cheap) and AWS.
As you can see, the results (0.2 and -1.17) are completely different. The same saved_model was used in all cases.
Note: TF-Serving runs under docker-compose, and the “tf-serving” service uses the official tensorflow/serving image.
I would appreciate any hints and testing suggestions.
If there are better solutions for running multiple TF models in production than TF-Serving (other than cloud-based solutions like SageMaker), please share some knowledge with me.
Notes:
Batching is disabled by the default settings, but I disabled it explicitly as well
There are 4 different models that produce different types of output (single value, set of values, bitmap data). All 4 give different results when served by TF-Serving versus plain TF. The dimensions of the output tensors are fine; the values they contain are “wrong”.
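For concreteness, the kind of side-by-side check I’m running boils down to the sketch below (model name, path, input key, and input shape are placeholders for my actual setup; saved_model_cli show --dir <path> --all prints the real signature names):

    import json
    import numpy as np
    import requests
    import tensorflow as tf

    # Placeholder input shape; replace with whatever your signature expects
    x = np.random.rand(1, 224, 224, 3).astype(np.float32)

    # 1) Plain TF: load the SavedModel and call its serving signature directly
    loaded = tf.saved_model.load("/models/my_model/1")            # placeholder path
    infer = loaded.signatures["serving_default"]
    direct = {k: v.numpy() for k, v in infer(input_1=tf.constant(x)).items()}  # "input_1" is a placeholder key

    # 2) TF-Serving: send the identical tensor over the REST API
    resp = requests.post(
        "http://tf-serving:8501/v1/models/my_model:predict",      # service name from docker-compose
        data=json.dumps({"instances": x.tolist()}),
    )
    served = np.array(resp.json()["predictions"])

    print("direct:", {k: v.ravel()[:5] for k, v in direct.items()})
    print("served:", served.ravel()[:5])

Things I still plan to rule out with this: preprocessing that happens on only one of the two paths, a float64 vs. float32 cast, and TF-Serving loading a different model version from the model directory than the one loaded locally.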
cuFFTMp is a multi-node, multi-process extension to cuFFT that enables scientists and engineers to solve challenging problems on exascale platforms.
Today, NVIDIA announces the release of cuFFTMp for Early Access (EA). cuFFTMp is a multi-node, multi-process extension to cuFFT that enables scientists and engineers to solve challenging problems on exascale platforms.
FFTs (Fast Fourier Transforms) are widely used in a variety of fields, ranging from molecular dynamics, signal processing, and computational fluid dynamics (CFD) to wireless multimedia and machine learning applications. With cuFFTMp, NVIDIA now supports not only multiple GPUs within a single system, but also many GPUs across multiple nodes.
Figure 1 shows cuFFTMp reaching over 1.8 PFlop/s, more than 70% of the peak machine bandwidth for a transform of that scale.
Figure 1. cuFFTMp (weak scaling) performances on the Selene cluster
In Figure 2, the problem size is kept unchanged but the number of GPUs is increased from 8 to 2048. You can see that cuFFTMp successfully strong-scales the problem, bringing the single-precision time from 78ms with 8 GPUs (1 node) to 4ms with 2048 GPUs (256 nodes).
Figure 2. cuFFTMp (strong scaling) performances on the Selene cluster
Figures 1 and 2 were produced on the Selene cluster. Selene is made of NVIDIA DGX A100 nodes, each with 8x A100-80GB GPUs connected by NVSwitch (300 GB/s/GPU, bidirectional) and Mellanox InfiniBand HDR (200 GB/s/node, bidirectional). Tests were run using CUDA 11.4 and the NVIDIA HPC SDK 21.9 Docker container, available at nvcr.io/nvidia/nvhpc:21.9-runtime-cuda11.4-ubuntu20.04. GPU application clocks were set to the maximum.
Performance and scalability
Distributed 3D FFTs are well known to be communication-bound because of the global collective communications they require (of the MPI_Alltoallv type). MPI_Alltoallv is the main bottleneck for distributed FFTs because internode bandwidth is low relative to compute capability, and because accelerator-aware MPI implementations of all-to-all communication vary in quality.
cuFFTMp uses NVSHMEM, a new communication library based on the OpenSHMEM standard and designed for NVIDIA GPUs, which provides kernel-initiated communication. NVSHMEM creates a global address space that includes the memory of all GPUs in the cluster. Performing communication from inside CUDA kernels enables fine-grained, remote data access that reduces synchronization cost and takes advantage of the massive parallelism of the GPU to hide communication overheads.
Figure 3 shows that cuFFTMp is able to maintain roughly 75% of peak as the number of GPUs is doubled.
Figure 3. Weak scalings of cuFFTMp on the Selene cluster, displayed as a fraction of the peak performances
The peak-performance model assumes 2,000 GB/s per GPU of bidirectional global memory bandwidth, 300 GB/s per GPU of bidirectional NVLink bandwidth, and 25 GB/s per GPU of InfiniBand bandwidth.
Let N be the 1D transform size and G the number of GPUs. Every GPU owns N³/G elements (8 or 16 bytes each), and the model assumes that N³/G elements are read/written six times to or from global memory and N³/G² elements are sent once from every GPU to every other GPU. On 4096 GPUs, the time spent in non-InfiniBand communications accounts for less than 10% of the total time.
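As a rough illustration of this model (a back-of-the-envelope sketch, not the exact formula used for the figures), the element counts above combined with the per-GPU bandwidth assumptions can be turned into a per-GPU time estimate:

    def estimated_fft_time(N, G, bytes_per_elem=8, gpus_per_node=8,
                           mem_bw=2000e9, nvlink_bw=300e9, ib_bw=25e9):
        """Rough bandwidth model for a distributed 3D FFT: six global-memory
        sweeps of N^3/G elements per GPU, plus N^3/G^2 elements exchanged
        between every pair of GPUs (NVLink in-node, InfiniBand off-node).
        Bandwidths default to the per-GPU figures quoted for Selene."""
        elems_per_gpu = N**3 / G
        t_mem = 6 * elems_per_gpu * bytes_per_elem / mem_bw
        per_peer_bytes = (N**3 / G**2) * bytes_per_elem
        in_node_peers = min(G, gpus_per_node) - 1
        off_node_peers = G - 1 - in_node_peers
        t_nvlink = in_node_peers * per_peer_bytes / nvlink_bw
        t_ib = off_node_peers * per_peer_bytes / ib_bw
        return t_mem + t_nvlink + t_ib

    # e.g. a 4096^3 single-precision (8-byte complex) transform on 128 GPUs (16 nodes)
    print(estimated_fft_time(4096, 128))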
MPI portability and multi-architecture support
As mentioned earlier, the performance of cuFFTMp does not depend on the MPI implementation. For portability, cuFFTMp only requires MPI to launch the application and to manage data distribution on the CPUs.
Currently cuFFTMp statically links to NVSHMEM. NVSHMEM uses a small MPI “bootstrap plugin” (nvshmem_bootstrap_mpi.so), which is built using MPI and automatically loaded at runtime. This bootstrap targets the OpenMPI version included in the HPC SDK. For user applications that depend on another MPI implementation, the EA package includes helper scripts to build a bootstrap targeting a different MPI.
cuFFTMp supports both Linux x86_64 and IBM POWER architectures. You can download the EA package for either architecture. Figure 4 shows that, using 1536 V100 GPUs in 256 nodes, cuFFTMp can reach over 50 TFlop/s transforming 4096³ complex data points with only 5% of the Summit system.
Figure 4. cuFFTMp (FP32) performances on the Summit cluster
Figure 5 shows that, using 1536 V100 GPUs in 256 nodes, cuFFTMp can reach over 40 TFlop/s transforming 4096³ complex data points with only 5% of the Summit system.
Figure 5. cuFFTMp (FP64) performances on the Summit cluster
Easy transition to cuFFTMp
cuFFTMp is simply an extension to the current multi-GPU cuFFT library. Most existing multi-GPU functions apply to cuFFTMp. As a distributed, multiprocess library, cuFFTMp requires MPI to be bootstrapped (“launched”) and expects that data is distributed among MPI processes. The following table shows the code required to convert an application from using multi-GPU cuFFT to cuFFTMp.
Multi-GPU, single-process cuFFT:

    #include <cufftXt.h>

    // a single process drives all GPUs; no MPI setup needed

    // host buffer h_f, size NX*NY*NZ

    cufftHandle plan_c2c;
    cufftCreate(&plan_c2c);
    for (auto i = 0; i < NGPUS; ++i) whichGPUs[i] = i;
    cufftXtSetGPUs(plan_c2c, NGPUS, whichGPUs);

cuFFTMp:

    #include <cufftMp.h>
    #include <mpi.h>

    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    cudaSetDevice(my_device);

    // each rank owns a slab of the X dimension
    size_t my_NX = (NX / size) + (rank < NX % size ? 1 : 0);

    // host buffer h_f, size my_NX*NY*NZ

    cufftHandle plan_c2c;
    cufftCreate(&plan_c2c);
Slab, pencil, and block decompositions are typical names of data distribution methods in multidimensional FFT algorithms, used to parallelize the computation across nodes. cuFFTMp EA only supports optimized slab (1D) decompositions, and provides helper functions, for example cufftXtSetDistribution and cufftMpReshape, to help users redistribute from any other data distribution to cuFFTMp’s slab data distribution.
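For illustration only (this is not one of the cuFFTMp helper functions), the default slab ownership implied by the my_NX expression in the comparison above works out to:

    def slab_extent(NX, size, rank):
        # X-planes of an NX x NY x NZ grid owned by `rank` under a 1D slab
        # decomposition: the first NX % size ranks get one extra plane.
        base, rem = divmod(NX, size)
        my_nx = base + (1 if rank < rem else 0)
        start = rank * base + min(rank, rem)
        return start, start + my_nx

    # e.g. NX = 10 split over 4 ranks -> (0, 3), (3, 6), (6, 8), (8, 10)
    print([slab_extent(10, 4, r) for r in range(4)])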
The cuFFTMp EA package includes C++ and Fortran samples that cover a range of use cases: C2C, R2C/C2R, different plans sharing workspace, and shuffling data from one distribution to the other or redistributing across GPUs. cuFFTMp provides full support for Fortran applications, using the HPC SDK 21.7+ compilers and wrappers included in the EA package.
To understand turbulent flow behavior, a research team at the Tata Institute of Fundamental Research (TIFR), Hyderabad, India, developed Fluid3D, a CFD package applying direct numerical simulation (DNS) of the Navier-Stokes equations with pseudo-spectral methods. By porting Fluid3D to cuFFTMp and CUDA, the team can now simulate higher Reynolds number flows on thousands of GPUs within a few hours, an impossible task with the MPI CPU version.
As shown in Figure 6, turbulent flows consist of vortices of different scales, with energy transferred from the larger scales of motion down to the smaller ones. It is important to simulate and understand the isotropic behavior of the smallest turbulent structures in large DNS runs.
Figure 6. DNS turbulence flow simulation using Fluid3D
DNS is a key tool to improve the understanding of turbulence flows, and pseudo-spectral methods are commonly used because of their computational efficiency and accuracy.
The challenge of turbulent flow simulation is the need to attain high Reynolds (Re) numbers. To maintain computational stability, the attainable Re number is limited by the grid resolution: the simulated Reynolds numbers in Table 1 grow only as roughly N^(2/3), where N is the number of grid points in each dimension. Therefore, simulating high-Re turbulent flows requires numerical resolutions that can be computationally costly or even prohibitive.
Table 1 shows the grid resolutions required for the maximum Re numbers and the memory requirements for the simulations.
Grid resolution | Simulated Reynolds number | Memory requirement (GB)
1024³           | 199.2                     | 88
2048³           | 316.2                     | 704
4096³           | 501.9                     | 5,632
8192³           | 796.8                     | 45,056
12288³          | 1044.1                    | 152,064
16384³          | 1264.8                    | 360,448
Table 1. The Turbulence DNS package Fluid3D requires large numerical resolution and system memory to simulate high Reynolds number flow
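Reading the rows of Table 1 against each other (the exact constant of proportionality is not given here, so treat this as the trend rather than a precise bound), the attainable Reynolds number and the memory footprint scale roughly as

\[
\mathrm{Re} \propto N^{2/3}, \qquad \text{memory} \propto N^{3},
\]

so doubling the resolution in each dimension raises the attainable Reynolds number by only a factor of 2^(2/3) ≈ 1.59 while multiplying the memory footprint by 8.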
Fluid3D uses a second-order exponential Adams-Bashforth time-stepping method in the Fourier space. The simulations are typically integrated over tens of thousands of time steps, computing nine 3D-FFTs per time step. FFTs dominate the overall simulation runtime. The elapsed wall time per time step is an important metric to gauge whether the time to solution for a particular configuration of numerical experiment is reasonable.
Figure 7 shows that the wall time per time step of Fluid3D is under 5 seconds at a resolution of 8192³, using 1024 A100 GPUs (128 nodes) on Selene. The CPU version with FFTW-MPI takes 23.9 seconds per time iteration for a 1024³ problem using 64 MPI ranks on a single 64-core CPU node. Compared to the wall time for the same 1024³ problem on two A100 GPUs, the speedup of Fluid3D from a CPU node to a single A100 is more than 20x.
Figure 7. Wall time per time step of Fluid3D DNS running on Selene
Get started with cuFFTMp
Interested in trying out cuFFTMp to transition your application to run on multiple nodes? Head over to the Getting Started page of cuFFTMp EA. After downloading cuFFTMp, play with the samples and see how similar they are to the multi-GPU version and how they scale over multiple nodes.
We are continuing to improve cuFFTMp, including adding batched APIs and data compression to minimize communication. If you have questions or new feature requests, contact product manager Matthew Nicely.
Acknowledgments
Special thanks to Prasad Perlekar’s team at Tata Institute of Fundamental Research in Hyderabad, India, for giving us access to the multiphase turbulence flow code Fluid3D and becoming the first adopter of cuFFTMp.
We also thank the entire NVSHMEM team at NVIDIA for their help supporting the development of cuFFTMp.
Nsight Compute kernel profiler now includes Range Replay, Memory Analysis, and Guided Analysis enhancements.
NVIDIA Nsight Compute is an interactive kernel profiler for CUDA applications. It provides detailed performance metrics and API debugging through a user interface and a command-line tool. Nsight Compute 2022.1 brings updates to improve data collection modes enabling new use cases and options for performance profiling.
This release of Nsight Compute extends the existing replay modes with the highly requested Range Replay feature. Range Replay captures and replays complete ranges of CUDA API calls and kernel launches within the profiled application. Metrics are associated with the entire range rather than with individual kernels. This allows the tool to execute kernels without serialization and to support profiling kernels that need to run concurrently for correctness or performance reasons. A range consists of a start and an end marker and includes all CUDA API calls and kernels launched between these markers from any CPU thread.
Figure 1. Visualization of Range Replay: After capturing the range, each pass collects performance information for the entire range.
Memory Analysis
When profiling on A100, a new L2 Cache Eviction Policies table in the Memory Analysis section helps you understand the number of accesses and achieved hit rates by the various cache eviction policies. In the same section, the L2 Cache table now has a new ECC row to show traffic created from enabling hardware Error Correction Code on the GPU.
Figure 2. Improvements to the Memory Workload Analysis tables: ECC and L2 cache eviction policy information.
Guided Analysis
Nsight Compute now makes it easier to select initial analysis targets in multiresult collection by dynamically selecting between the Summary and Details pages when opening a report. Rules were extended to detect non-fused floating-point instructions as an optimization opportunity. Last, but not least, when the Uncoalesced Memory Access rules are triggered, they show a table of the five most valuable instances, making it easier to inspect and resolve them on the Source page.
Figure 3. Opening multiresult reports now shows the Summary page, allowing you to sort results and decide on the optimization order.
Figure 4. Both Uncoalesced Memory Access rules present results in a more concise and sorted format.
Additional improvements
Further improvements include an Occupancy Calculator auto-update. There is also a new ‘Thread Instructions Executed’ metric and register name tooltips for the Register Dependency columns in the Source page, as well as NVLink updates.
At GTC in November 2021, we released several assets showcasing Nsight tools capabilities.
With NVIDIA Jetson embedded platforms, teams at the DARPA SubT Challenge detected objects with both high accuracy and high throughput.
Performing real-time inference with high accuracy is a challenging task, especially in a poor-visibility environment. With NVIDIA Jetson embedded platforms, teams at the recently concluded Defense Advanced Research Projects Agency (DARPA) Subterranean (SubT) Challenge were able to detect objects of interest with both high accuracy and high throughput. In this post, we will cover the results, systems, and challenges faced by teams in the final leg of the systems competition.
The SubT Challenge is an international robotics competition organized and coordinated by DARPA. The competition encourages researchers to develop new approaches for robots to map, navigate, and search environments that pose various challenges such as poor visibility, presence of hazards, unknown maps, or poor communication infrastructure.
The challenge consists of three preliminary circuit events: Tunnel Circuit, Urban Circuit, and Cave Circuit (canceled due to the COVID-19 pandemic), as well as a final integrated challenge course. Each circuit and the final event were held in different environments with various types of terrain. According to the event organizers, the competition ran over three years in different phases, with the final event held in September 2021 in Louisville, KY.
Competitors in the SubT Challenge leveraged NVIDIA technology for both their hardware and software needs. Teams used desktop/server GPUs to train models that were deployed on robots using the NVIDIA Jetson embedded platform for real-time detection of artifacts and objects of interest, the main criterion used to determine the winning team. Five of the seven competitors used the Jetson platform to perform real-time object detection.
The SubT Challenge
The SubT Challenge is inspired by real-world scenarios faced by first responders during search and rescue operations or disaster response.
The state-of-the-art methods developed through this competition will help reduce the risk of casualties among search and rescue personnel and first responders as they explore unknown underground environments. Additionally, autonomous robots will assist personnel in exploring the environment to find survivors and objects of interest, and in accessing locations that are otherwise too risky for humans.
Figure 1. The DARPA Subterranean Challenge explores innovative approaches and new technologies to map, navigate, and search complex underground environments. — Image courtesy of DARPA.
Technical challenges
The competition incorporates various technical challenges such as dealing with unknown, unstructured, and uneven terrain that some robots might not be able to maneuver easily.
These environments typically lack any infrastructure for communicating with central command. From a perception perspective, they offer poor visibility, in which the robots must still find artifacts and objects of interest.
The competing teams were tasked with addressing these challenges by developing novel sensor fusion methods as well as developing new or modifying existing robotic platforms with different capabilities to locate and detect objects of interest.
Team CERBERUS
Team CERBERUS (CollaborativE walking and flying RoBots for autonomous ExploRation in Underground Settings) is a joint consortium between several universities and industrial organizations worldwide.
Figure 2. Team CERBERUS members with their multicopter drones, ANYmal C quadruped robots, and Super Mega Bot.
The team participated in the competition with four quadruped robots called ANYmal, five primarily in-house-built drones with variable size and payload capacity, and a rover robot in the form of Super Mega Bot. In the competition finals, the team ended up using four ANYmal robots and the Super Mega Bot for exploration and artifact detection.
Each ANYmal robot was equipped with two CPU-based computers and an NVIDIA Jetson AGX Xavier. The rover robot was equipped with an NVIDIA GTX 1070 GPU.
The CERBERUS team used a modified version of the You Only Look Once (YOLO) model for object detection. The model was trained on 40,000 labeled images using two NVIDIA RTX 3090 GPUs.
The trained model was further optimized using TensorRT before being deployed on Jetson for real-time inference. The Jetson AGX Xavier was able to perform inference at a collective rate of 20 Hz. In the competition finals, the CERBERUS team was the first to detect 23 of the 40 artifacts located in the environment, clinching the number one spot.
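The team’s exact conversion pipeline is not described here, but a common route for this step is to export the trained detector to ONNX and build a TensorRT engine for the target device. A sketch using the TensorRT Python API (file names are placeholders; the trtexec command-line tool performs the same job):

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    # Parse the ONNX export of the trained detector (placeholder file name)
    with open("detector.onnx", "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(str(parser.get_error(0)))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # FP16 is a typical choice on Jetson AGX Xavier

    # Engines are tuned for the GPU they are built on, so this step is
    # usually run on the Jetson itself; serialize the result for deployment.
    engine = builder.build_serialized_network(network, config)
    with open("detector_fp16.engine", "wb") as f:
        f.write(engine)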
The CERBERUS team also used GPUs for elevation mapping of the terrain and for training the locomotion policy controller of the ANYmal quadruped robot. The elevation mapping was done in real time using the Jetson AGX Xavier. The ANYmal robot’s locomotion policy training for rough terrain was done offline using desktop GPUs.
Team Co-STAR
Led by researchers at NASA’s Jet Propulsion Laboratory (JPL) in Southern California along with other universities and industrial collaborators, team Collaborative SubTerranean Autonomous Robots (Co-STAR) was the winner of the 2020 competition focused on exploring complex underground urban environments.
Figure 3. Members of team Co-STAR with their robots (Spot, Husky, and drones).
They also successfully participated in the 2021 competition in mixed artificial and natural environments, placing fifth. The Co-STAR team entered the competition with four Spots, four Husky robots, and two drones.
Following an unexpected hardware issue in the final run, the team ended up using one Spot and three Husky robots. Each robot was equipped with a CPU-based computer along with one NVIDIA Jetson AGX Xavier.
For object detection, team Co-STAR used RGB and thermal images. They used the medium variant of the YOLO v5 model to process high-resolution images for real-time inference. The team trained two different models to perform inference on captured RGB and thermal images.
The image-based model was trained using approximately 54,000 labeled frames, whereas the thermal model was trained using about 2,400 labeled images. To train the models on their customized dataset, team Co-STAR started from a YOLO v5 model pretrained on the COCO dataset and performed transfer learning using the NVIDIA Transfer Learning Toolkit (now known as TAO Toolkit).
The models were trained using two on-premises NVIDIA A100 GPUs and an AWS instance with eight V100 GPUs. Before deploying the models on the Jetson AGX Xavier, the team optimized them with TensorRT.
Using this setup, team Co-STAR was able to perform inference at 28 Hz with RGB images received from five RealSense cameras and images received from one thermal camera. In the final run, the robots were able to detect all 13 artifacts present in the designated areas. The exploration time was limited due to the delayed deployment caused by unexpected hardware issues at the deployment site.
Equipped with the NVIDIA Jetson platform and NVIDIA GPU hardware, teams competing in the DARPA SubT event were able to effectively train models for real-time inference, addressing the challenge posed by underground environments with accurate object detection.
Posted by Abhijit Guha Roy, Research Software Engineer and Jie Ren, Research Scientist, Google Research
Deep machine learning (ML) systems have achieved considerable success in medical image analysis in recent years. One major contributing factor is access to abundant labeled datasets, which are used to train highly effective supervised deep learning models. However, in the real world, these models may encounter samples exhibiting rare conditions that are individually too infrequent to support per-condition classification. Nevertheless, such conditions can be collectively common because they follow a long-tail distribution and, taken together, can represent a significant portion of cases — e.g., in a recent deep learning dermatological study, hundreds of rare conditions made up around 20% of the cases encountered by the model at test time.
To prevent models from generating erroneous outputs on rare samples at test time, there remains a considerable need for deep learning systems that can recognize when a sample is not a condition they can identify. Detecting previously unseen conditions can be thought of as an out-of-distribution (OOD) detection task. By successfully identifying OOD samples, preventive measures can be taken, like abstaining from prediction or deferring to a human expert.
Traditional computer vision OOD detection benchmarks work to detect dataset distribution shifts. For example, a model may be trained on CIFAR images but be presented with street view house numbers (SVHN) as OOD samples, two datasets with very different semantic meanings. Other benchmarks seek to detect slight differences in semantic information, e.g., between images of a truck and a pickup truck, or two different skin conditions. The semantic distribution shifts in such near-OOD detection problems are more subtle in comparison to dataset distribution shifts, and thus, are harder to detect.
The Near-OOD Dermatology Dataset
We curated a near-OOD dermatology dataset that includes 26 inlier conditions, each of which is represented by at least 100 samples, and 199 rare conditions considered to be outliers. Outlier conditions can have as few as one sample per condition. The separation criterion between inlier and outlier conditions can be specified by the user; here the cutoff sample size between inlier and outlier was 100, consistent with our previous study. The outliers are further split into training, validation, and test sets that are intentionally mutually exclusive, to mimic real-world scenarios in which rare conditions seen at test time may not have been seen in training.
Long-tail distribution of the dermatological conditions in our dataset: the 26 inlier conditions with at least 100 samples (blue) and the remaining 199 rare outlier conditions (orange). Outlier conditions can have as few as one sample per condition.
                  | Train set        | Validation set   | Test set
                  | Inlier | Outlier | Inlier | Outlier | Inlier | Outlier
Number of classes | 26     | 68      | 26     | 66      | 26     | 65
Number of samples | 8854   | 1111    | 1251   | 1082    | 1192   | 937
Inlier and outlier conditions in our benchmark dataset and detailed dataset split statistics. The outliers are further split into mutually exclusive train, validation, and test sets.
Hierarchical Outlier Detection Loss
We propose to use “known outlier” samples during training to aid the detection of “unknown outlier” samples at test time. Our novel hierarchical outlier detection (HOD) loss performs a fine-grained classification over the individual inlier and outlier classes and, in parallel, a coarse-grained binary classification of inliers vs. outliers in a hierarchical setup (see the figure below). Our experiments confirmed that HOD is more effective than performing a coarse-grained classification followed by a fine-grained classification, which could create a bottleneck that hurts the performance of the fine-grained classifier.
We use the sum of the predictive probabilities of the outlier classes as the OOD score. As the primary OOD detection metric we use the area under the receiver operating characteristic (AUROC) curve, which ranges between 0 and 1 and measures the separability between inliers and outliers; a perfect OOD detector, which separates all inliers from outliers, has an AUROC of 1. A popular baseline method, called reject bucket, separates each inlier individually from the outliers, which are grouped into a single dedicated abstention class. In addition to a fine-grained classification of the individual inlier and outlier classes, the HOD loss–based approach separates the inliers collectively from the outliers with a coarse-grained prediction loss, resulting in better generalization. We demonstrate that our HOD loss–based approach outperforms other baseline methods that leverage outlier data during training, achieving an AUROC of 79.4% on the benchmark, a significant improvement over the 75.6% achieved by the reject bucket.
Our model architecture and the HOD loss. The encoder (green) represents the wide ResNet 101×3 model pre-trained with different representation learning models (ImageNet, BiT, SimCLR, and MICLe; see below). The output of the encoder is sent to the HOD loss where fine-grained and coarse-grained predictions for inliers (blue) and outliers (orange) are obtained. The coarse predictions are obtained by summing over the fine-grained probabilities as indicated in the figure. The OOD score is defined as the sum of the probabilities of outlier classes.
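Reduced to a single example, the two levels of the HOD loss and the OOD score described above look roughly like the following sketch (NumPy; the relative weight of the coarse term is a tunable hyperparameter here, not the value used in the paper):

    import numpy as np

    def hod_loss_and_ood_score(logits, label, n_inlier, alpha=0.1):
        """logits: fine-grained scores over all n_inlier + n_outlier classes.
        label: index of the ground-truth fine-grained class.
        alpha: placeholder weight for the coarse term."""
        p = np.exp(logits - logits.max())
        p /= p.sum()                           # fine-grained softmax over all classes

        fine_loss = -np.log(p[label] + 1e-12)  # fine-grained cross-entropy

        p_out = p[n_inlier:].sum()             # coarse probabilities by summing fine ones
        is_outlier = label >= n_inlier
        coarse_loss = -np.log((p_out if is_outlier else 1.0 - p_out) + 1e-12)

        ood_score = p_out                      # OOD score = total outlier probability
        return fine_loss + alpha * coarse_loss, ood_score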
Representation Learning and the Diverse Ensemble Strategy
We also investigate how different types of representation learning help OOD detection in conjunction with HOD, by pretraining with ImageNet, BiT-L, SimCLR, and MICLe. We observe that including the HOD loss improves OOD performance over the reject bucket baseline for all four representation learning methods.
Representation learning method | AUROC with reject bucket | AUROC with HOD loss
ImageNet                       | 74.7%                    | 77%
BiT-L                          | 75.6%                    | 79.4%
SimCLR                         | 75.2%                    | 77.2%
MICLe                          | 76.7%                    | 78.8%
OOD detection performance for different representation learning models with reject bucket and with HOD loss.
An orthogonal approach to improving OOD detection performance and accuracy is the deep ensemble, which aggregates outputs from multiple independently trained models to produce a final prediction. We build on the deep ensemble, but instead of using a fixed architecture with fixed pretraining, we combine different representation learning approaches (ImageNet, BiT-L, SimCLR, and MICLe) with different objective loss functions (HOD and reject bucket). We call this a diverse ensemble strategy, which we demonstrate outperforms the deep ensemble in both OOD performance and inlier accuracy.
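As a sketch of the aggregation step (the member interface below is hypothetical), the diverse ensemble simply averages the class probabilities and OOD scores of members that differ in pretraining and in training loss:

    import numpy as np

    def diverse_ensemble_predict(members, x):
        # Each member pairs a representation-learning backbone (ImageNet, BiT-L,
        # SimCLR, or MICLe) with either the HOD loss or the reject bucket, and is
        # assumed to expose predict(x) -> (class_probs, ood_score).
        probs, scores = zip(*(m.predict(x) for m in members))
        return np.mean(probs, axis=0), float(np.mean(scores))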
Downstream Clinical Trust Analysis
While we mainly focus on improving the performance for OOD detection, the ultimate goal for our dermatology model is to have high accuracy in predicting inlier and outlier conditions. We go beyond traditional performance metrics and introduce a “penalty” matrix that jointly evaluates inlier and outlier predictions for model trust analysis to approximate downstream impact. For a fixed confidence threshold, we count the following types of mistakes: (i) incorrect inlier predictions (i.e., mistaking inlier condition A as inlier condition B); (ii) incorrect abstention of inliers (i.e., abstaining from making a prediction for an inlier); and (iii) incorrect prediction for outliers as one of the inlier classes.
To account for the asymmetrical consequences of the different types of mistakes, penalties can be 0, 0.5, or 1. Both incorrect inlier and outlier-as-inlier predictions can potentially erode user trust in the model and were penalized with a score of 1. Incorrect abstention of an inlier as an outlier was penalized with a score of 0.5, indicating that potential model users should seek additional guidance given the model-expressed uncertainty or abstention. For correct decisions no cost is incurred, indicated by a score of 0.
Ground truth | Action of the model: prediction as inlier                       | Action of the model: abstain
Inlier       | 0 (correct class) or 1 (incorrect class; a mistake that may erode trust) | 0.5 (incorrect; abstains on an inlier)
Outlier      | 1 (incorrect; a mistake that may erode trust)                   | 0 (correct)
The penalty matrix is designed to capture the potential impact of different types of model errors.
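As a sketch of how such a matrix turns into the estimated downstream cost discussed next (the data structures are illustrative, not the code used in the paper):

    # Penalty matrix from the table above, keyed by (ground truth, model action)
    PENALTY = {
        ("inlier", "correct_inlier"):  0.0,
        ("inlier", "wrong_inlier"):    1.0,
        ("inlier", "abstain"):         0.5,
        ("outlier", "predict_inlier"): 1.0,
        ("outlier", "abstain"):        0.0,
    }

    def estimated_cost(examples, tau):
        """Average penalty over a test set for an abstention threshold tau on the
        OOD score. `examples` is a list of (truth, predicted_class, correct_class,
        ood_score) tuples; for outliers the class fields are unused."""
        total = 0.0
        for truth, pred, correct, score in examples:
            if score >= tau:                    # model abstains
                action = "abstain"
            elif truth == "outlier":
                action = "predict_inlier"
            else:
                action = "correct_inlier" if pred == correct else "wrong_inlier"
            total += PENALTY[(truth, action)]
        return total / len(examples)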
Because real-world scenarios are more complex and contain a variety of unknown variables, the numbers used here represent simplifications to enable qualitative approximations for the downstream impact on user trust of outlier detection models, which we refer to as “cost”. We use the penalty matrix to estimate a downstream cost on the test set and compare our method against the baseline, thereby making a stronger case for its effectiveness in real-world scenarios. As shown in the plot below, our proposed solution incurs a much lower estimated cost in comparison to baseline over all possible operating points.
Trust analysis comparing our proposed method to the baseline (reject bucket) for a range of outlier recall rates, indicated by 𝛕. We show that our method reduces downstream estimated cost, potentially reflecting improved downstream impact.
Conclusion
In real-world deployment, medical ML models may encounter conditions that were not seen in training, and it’s important that they accurately identify when they do not know a specific condition. Detecting those OOD inputs is an important step to improving safety. We develop an HOD loss that leverages outlier data during training, and combine it with pre-trained representation learning models and a diverse ensemble to further boost performance, significantly outperforming the baseline approach on our new dermatology benchmark dataset. We believe that our approach, aligned with our AI Principles, can aid successful translation of ML algorithms into real-world scenarios. Although we have primarily focused on OOD detection for dermatology, most of our contributions are fairly generic and can be easily incorporated into OOD detection for other applications.
Acknowledgements
We would like to thank Shekoofeh Azizi, Aaron Loh, Vivek Natarajan, Basil Mustafa, Nick Pawlowski, Jan Freyberg, Yuan Liu, Zach Beaver, Nam Vo, Peggy Bui, Samantha Winter, Patricia MacWilliams, Greg S. Corrado, Umesh Telang, Yun Liu, Taylan Cemgil, Alan Karthikesalingam, Balaji Lakshminarayanan, and Jim Winkens for their contributions. We would also like to thank Tom Small for creating the post animation.
Learn how edge computing is powering efficient energy operations, protecting worker health and safety, and improving power grid resiliency.
Each day, energy flows throughout our lives – from the fuel that powers cars and planes, to the gas used for stove top cooking, to the electricity that keeps the lights on in homes and businesses. Oil, gas, and electricity are mature commodity markets, but AI is transforming the processes used to produce, transport, and deliver these resources.
Enter AI deployed at the edge: on oil rigs, within power plants, riding along utility trucks, even embedded in smart buildings. Oil and gas enterprises and utilities are using AI and edge computing to improve operational efficiency, protect worker health and safety, integrate renewable energy, increase grid resiliency, and provide more reliable and affordable sources of energy to consumers.
Figure 1. Noteworthy AI, a member of NVIDIA Inception, put smart cameras on FirstEnergy’s trucks in a pilot that showed how edge computing can monitor millions of pole-mounted assets. Image courtesy of Noteworthy AI.
As companies and countries race to decarbonize and meet net-zero emissions goals, edge AI will play a key role managing distributed energy resources such as electric vehicles, home batteries, solar panels, and wind farms to enhance power grid resiliency and accelerate the energy transition. The following examples highlight the top AI use cases across the energy industry, including:
Software-defined smart grids: Future smart meters will use edge computing to optimize power flow, detect grid anomalies, deliver more reliable energy at a lower cost, and unlock opportunities for new energy applications. Utilidata, a leading grid-edge software company, is developing a software-defined smart grid chip with NVIDIA that will power next-generation smart meters to increase grid resiliency, decarbonization, and consumer value.
Autonomous operations: Industrial sites, such as oil rigs and power plants, require extensive monitoring for efficiency and safety because liquid, steam, or oil leakages can be catastrophic, costly, and wasteful. Global energy leaders, such as Siemens Energy, are using AI and machine learning to deliver a path to autonomous power plants. The company trains AI models using thousands of images and video streams from millions of onsite cameras and sensors to detect process anomalies. These models are deployed at the edge in power plants and use real-time inferencing to identify leaks. Rig operators are using computer vision, deep learning, and intelligent video analytics (IVA) to monitor heavy machinery, detect potential hazards, and alert workers in real-time to protect their health and safety, prevent accidents, and assign repair technicians for maintenance.
Pipeline optimization: Oil and gas enterprises rely on finding the best-fit routes to transfer oil to refineries and eventually fuel stations. Edge AI can calculate the optimal flow of oil to ensure reliability of production and protect long-term pipeline health. Using IVA, these companies can inspect pipelines for defects that could lead to dangerous failures and automatically alert pipeline operators. Further downstream, NVIDIA ReOpt uses GPU-accelerated solvers for logistics and route optimization, which can efficiently route fuel to fueling stations.
Power grid maintenance: With proactive maintenance, utilities can accurately detect defects and reduce unplanned outages to better serve customers. FirstEnergy worked with Noteworthy AI, an NVIDIA Inception member, on a pilot project to automate utility pole inspections. Fixed camera systems powered by NVIDIA Jetson were secured to the roofs of service trucks and collected standardized, high-resolution images of utility poles, power lines, and pole-mounted assets. The images were analyzed at the edge to determine whether repairs or vegetation management were needed. Edge computing can help monitor the estimated 185 million utility poles in the United States and reduce the tens of millions of dollars spent each year by utilities to manually track and maintain poles.
Power grid simulation: Intelligent forecasting using GPU-accelerated grid simulations combined with historical data on energy usage and weather can inform more efficient generation, distribution, and management of energy resources to consumers. AI helps manage the bidirectional flow of power in a grid, delivering reliable energy to residents and enterprises while automating the process for consumers to sell their additional energy back to the grid.
Thanks to edge AI, the future of energy is more sustainable than ever. Explore how NVIDIA is building an ecosystem to accelerate the energy transition.
GeForce NOW is taking cloud gaming to new heights. This GFN Thursday delivers an upgraded streaming experience as part of an update that is now available to all members. It includes new resolution upscaling options to make members’ gaming experiences sharper, plus the ability to customize streaming settings in session.