Categories
Misc

Fully Vectorized Conv2D Implementation

Hey guys.

I wrote a post describing in detail a fully vectorized implementation of the convolution operation in NumPy: https://lucasdavid.github.io/vectorization/

I would appreciate it if you could give me any notes. I’m also trying to translate this to TensorFlow, but it’s not as trivial as I initially thought, since indexing works very differently (my implementation relies on selecting multiple regions at once with `image[…, r, c]`, where `r` and `c` are two index matrices).
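For concreteness, here is a minimal sketch (with made-up shapes) of the indexing pattern I rely on in NumPy and one possible TensorFlow equivalent via `tf.gather_nd`:

```
import numpy as np
import tensorflow as tf

# Toy example: a single 6x6 feature map and the index matrices of one 3x3 region.
image = np.arange(36, dtype=np.float32).reshape(6, 6)
r = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2]])  # row indices of the region
c = np.array([[0, 1, 2], [0, 1, 2], [0, 1, 2]])  # column indices of the region

patch_np = image[..., r, c]                      # NumPy fancy indexing

# One possible TensorFlow equivalent: stack the index matrices and gather.
indices = tf.stack([r, c], axis=-1)              # shape (3, 3, 2)
patch_tf = tf.gather_nd(tf.constant(image), indices)

assert np.allclose(patch_np, patch_tf.numpy())
```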

Any ideas on this would be greatly appreciated!
Have a great day. 🙂

submitted by /u/deepdipship

Categories
Misc

From Experimentation to Products: The Production Machine Learning Journey + Google’s experience with TensorFlow Extended (TFX)

submitted by /u/mto96
Categories
Misc

Building Real-time Dermatology Classification with NVIDIA Clara AGX


The most commonly diagnosed cancer in the US today is skin cancer. There are three main variants: melanoma, basal cell carcinoma (BCC), and squamous cell carcinoma (SCC). Though melanoma accounts for only roughly 1% of all skin cancers, it is the most fatal, metastasizing rapidly without early detection and treatment. This makes early detection critical, as numerous studies show significantly better survival rates when the disease is detected in its earliest stages.

The current diagnosis procedure is a visual examination by a dermatologist, followed by a biopsy to confirm any suspected pathology. This manual examination depends on human judgment and is therefore error-prone at a concerning rate. When a primary care physician looks for skin cancer, their sensitivity, or ability to correctly identify a patient with the disease, is only 0.45, while a dermatologist has a sensitivity of 0.97.

In recent years, the use of deep learning to perform medical diagnostics has become a quickly growing field. In this post, we discuss developing an end-to-end example of how deep learning could lead to an automated dermatology exam system free of human bias, using the recently announced NVIDIA Clara AGX development kit.

Datasets and models

This reference application is the pairing of two deep learning models:

  • An object detection model (YOLOv4) that looks for moles on the body through a camera. This model was trained on an original dataset created by annotating images of body moles.
  • A classification model (EfficientNet) that receives mole crops from the object detection model and determines whether each is benign, unknown, or melanoma. The classification model was trained on the SIIM-ISIC melanoma Kaggle challenge dataset.

Figure 1 shows the workflow of the algorithm using a single video frame. The application can use a high-definition webcam or IP camera as input to the models, or even run on a previously captured video.

A 3-step diagram showing the workflow for skin mole detection and classification.  Starting from an input, moving to the YOLOv4 model for detection, and ending with an EfficientNet model for final classification.
Figure 1. Skin mole detection and classification workflow.
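Although the application code is not included in this post, the two-stage flow in Figure 1 can be sketched roughly as follows. The `yolo_detect` and `efficientnet_classify` functions are hypothetical stand-ins for the actual model wrappers, and the 224×224 crop size is an assumption:

```
import cv2
import numpy as np

LABELS = ["benign", "unknown", "melanoma"]

def process_frame(frame, yolo_detect, efficientnet_classify):
    """Detect moles in one video frame, then classify each detected crop."""
    results = []
    for (x, y, w, h) in yolo_detect(frame):           # bounding boxes of candidate moles
        crop = frame[y:y + h, x:x + w]
        crop = cv2.resize(crop, (224, 224))           # assumed classifier input size
        probs = efficientnet_classify(crop[np.newaxis] / 255.0)
        results.append(((x, y, w, h), LABELS[int(np.argmax(probs))]))
    return results
```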

Clara AGX development kit

This reference application was built using the NVIDIA Clara AGX development kit, a high-end performance workstation built with medical applications in mind. The system includes an RTX 6000 GPU, delivering 200+ INT8 AI TOPs of peak performance and 24 GB of VRAM, leaving plenty of overhead for running multiple models.

A rendered image of the Clara AGX Developer Kit showing the inside the case with key components being highlighted.  The three main components are the NVIDIA Jetson AGX Xavier, NVIDIA Mellanox ConnectX-6, and an NVIDIA RTX 6000 GPU.
Figure 2. Clara AGX Developer Kit.

In addition, the AGX platform offers support for high bandwidth sensors through 100G Ethernet and an NVIDIA ConnectX-6 network interface card (NIC). NVIDIA partners are currently using the NVIDIA Clara AGX development kit to develop applications in ultrasound, genomics, and endoscopy.

The Clara AGX Developer Kit is currently available exclusively for members of the NVIDIA Clara Developer Partner Program. After you register, we’ll be in touch.

Summary

We’ve provided a research prototype of a dermatology application, but what would it take to transform this into a real application?

  • Commercially usable data. The SIIM-ISIC dataset is strictly for non-commercial use.
  • A much larger object detection dataset. The dataset that we used consisted of only a few hundred annotated images, which led to a larger-than-desired number of false positives.
  • Run the models at the “speed of light” (SOL). SOL often entails training models with mixed precision and then converting them to work with the NVIDIA TensorRT framework (see the sketch after this list). TensorRT is designed to optimize model inference on NVIDIA GPUs and works with common frameworks such as PyTorch and TensorFlow. These steps would help ensure that your application pipeline runs in real time.
  • FDA clearance. Any developed medical application must be cleared by the FDA. Today, there are over 70 FDA-cleared AI applications, and the FDA has been active in soliciting feedback from developers in this area. This is typically a long (18 months) and arduous process, but a necessary one.
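As a rough illustration of the TensorRT step (not the reference application’s actual code), a TensorFlow SavedModel can be converted with TF-TRT along these lines; the directory names are placeholders, and exact parameters vary between TensorFlow versions:

```
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert a SavedModel so eligible graph segments run through TensorRT at FP16 precision.
params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="saved_model_dir",      # placeholder input path
    conversion_params=params)
converter.convert()
converter.save("saved_model_trt")                 # placeholder output path
```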

For more information, see the dermatology reference Docker container on NGC.

Categories
Misc

NVIDIA Research: Fast Uncertainty Quantification for Deep Object Pose Estimation


Researchers from NVIDIA, University of Texas at Austin and Caltech developed a simple, efficient, and plug-and-play uncertainty quantification method for the 6-DoF (degrees of freedom) object pose estimation task, using an ensemble of K pre-trained estimators with different architectures and/or training data sources.

The researchers presented their paper, “Fast Uncertainty Quantification (FastUQ) for Deep Object Pose Estimation,” at the 2021 International Conference on Robotics and Automation (ICRA 2021).

FastUQ focuses on uncertainty quantification for deep object pose estimation. In deep learning-based object pose estimation (see NVIDIA DOPE), a big challenge is that pose estimators can be overconfident in their pose predictions.

For example, the two figures below are the pose estimation results for the “Ketchup” object from a DOPE model in a manipulation task. Both results are very confident, but the left one is incorrect.

Another challenge addressed is the sim2real gap. Typically, deep learning-based pose estimators are trained on synthetic datasets (generated by NVIDIA’s ray-tracing renderer, NViSII), but we want to apply these estimators in the real world and quantify their uncertainty. For example, the left figure is from the synthetic NViSII dataset, and the right one is from the real world.

In this project, we propose an ensemble-based method for fast uncertainty quantification of deep learning-based pose estimators. The idea is demonstrated in the following two figures: in the left one, the deep models in the ensemble disagree with each other, which implies more uncertainty; in the right one, the models agree with each other, which reflects less uncertainty.
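The exact FastUQ metric is not described in this post, but the ensemble-disagreement idea can be sketched as follows; the estimator interface and the disagreement measure (translation spread plus pairwise quaternion distance) are illustrative assumptions rather than the paper’s formulation:

```
import numpy as np

def ensemble_uncertainty(estimators, image):
    """Run K pose estimators and score their disagreement (higher = more uncertain)."""
    poses = [est(image) for est in estimators]        # each returns (translation, unit quaternion)
    translations = np.stack([t for t, _ in poses])    # shape (K, 3)
    quaternions = np.stack([q for _, q in poses])     # shape (K, 4)

    trans_spread = translations.std(axis=0).mean()    # disagreement in position
    dots = np.abs(quaternions @ quaternions.T)        # |q_i . q_j| for all estimator pairs
    k = len(poses)
    rot_spread = (1.0 - dots[np.triu_indices(k, 1)]).mean()  # angular disagreement
    return trans_spread + rot_spread
```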

This research is interdisciplinary, and the problem was solved through the joint efforts of different research teams at NVIDIA:

  • The AI Algorithms team led by Anima Anandkumar, and the NVIDIA AI Robotics Research Lab in Seattle working on the uncertainty quantification methods
  • The Learning and Perception Research team led by Jan Kautz for training the deep object pose estimation models, and providing photorealistic synthetic data from NVIDIA’s ray-tracing renderer, NViSII

For training the deep estimators and generating the high-fidelity photorealistic synthetic datasets, the team used NVIDIA V100 GPUs and NVIDIA OptiX (C++/CUDA back-end) for acceleration.

FastUQ is a novel fast uncertainty quantification method for deep object pose estimation, which is efficient, plug-and-play, and supports a general class of pose estimation tasks. This research has potentially significant impacts in autonomous driving and general autonomy, including more robust and safe perception, and uncertainty-aware control and planning.

To learn more about the research, visit the FastUQ project website.

Categories
Misc

NVIDIA RTX GPUs and KeyShot Accelerate Rendering for Caustics


NVIDIA RTX ray tracing has transformed graphics and rendering. With powerful software applications like Luxion KeyShot, more users can take advantage of RTX technology to speed up graphic workflows — like rendering caustics.

Caustics are formed as light refracts through or reflects off specular surfaces. Examples of caustics include the light focusing through a glass, the shimmering light at the bottom of a swimming pool, and even the beams of light from windows into a dusty environment.

When it comes to rendering caustics, photon mapping is an important factor. Photon mapping works by tracing photons from the light sources into the scene and storing these photons as they interact with the surfaces or volumes in the scene.
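As a toy illustration of that idea (not KeyShot’s GPU implementation), the photon-tracing pass can be sketched in a few lines; `scene_intersect` is a hypothetical intersection routine:

```
import numpy as np

rng = np.random.default_rng(0)

def trace_photons(light_pos, light_power, scene_intersect, num_photons=10_000):
    """Emit photons from a point light and store each surface hit in a photon map."""
    photon_map = []                                   # (hit position, incoming direction, power)
    for _ in range(num_photons):
        direction = rng.normal(size=3)
        direction /= np.linalg.norm(direction)        # uniform random emission direction
        power = light_power / num_photons
        hit = scene_intersect(light_pos, direction)   # assumed to return a hit point or None
        if hit is not None:
            photon_map.append((hit, direction, power))
    return photon_map
```

The stored photons are later looked up around each shaded point (density estimation) to reconstruct the caustic lighting.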

KeyShot implements a full progressive photon-mapping algorithm on the GPU that’s capable of rendering caustics, including reflections of caustics.

With the combination of RTX technology and the CUDA programming framework, users can now run ray tracing, photon mapping, and shading entirely on the GPU. The power of RTX GPUs accelerates full rendering of caustics, resulting in detailed, interactive images with reflections and refractions.

Image courtesy of David Merz III at Vyzdom.

When developing KeyShot 10, the team analyzed the GPU implementation and decided to see if they could improve the photon map implementation and obtain faster caustics. The result is a caustics algorithm that is able to handle thousands of lights, quickly render highly detailed caustics up close, and run significantly faster on the new NVIDIA Ampere RTX GPUs.

“Using the new caustics algorithm in KeyShot 10, we started getting details and fine structures in the caustics that normally would not be seen due to the time it takes to get to this detail level,” said Dr. Henrik Wann Jensen, chief scientist at Luxion.

All images are rendered on the GPU, can be manipulated interactively in KeyShot 10, and the caustics update with any changes made to the scene.

Image courtesy of David Merz III at Vyzdom.

“The additional memory in NVIDIA RTX A6000 is exactly what I need for working with geometry nodes in KeyShot, and I even tested how high I could push the VRAM with the ice cube scene,” said David Merz, founder and chief creative at Vyzdom. “I’m hooked on the GDDR6 memory, and I could definitely get used to this 48GB ceiling. Not needing to worry about VRAM limitations allows me to get into a carefree creative rhythm, turning a process from frustrating to fluid artistic execution, which benefits both the client and me.”

With NVIDIA RTX GPUs powering caustics algorithms in KeyShot 10, users working with transparent or reflective products can easily render high-quality images with photorealistic details.

Read the KeyShot blog to learn more about NVIDIA RTX and caustics rendering.

Categories
Misc

How I used the OpenCV AI Kit to Control a Drone

submitted by /u/AugmentedStartups
Categories
Misc

A Theoretical and Practical Guide to Probabilistic Graphical Models with Tensorflow

submitted by /u/OB_two
Categories
Misc

What Is Synthetic Data?

Data is the new oil in today’s age of AI, but only a lucky few are sitting on a gusher. So, many are making their own fuel, one that’s both inexpensive and effective. It’s called synthetic data. What Is Synthetic Data? Synthetic data is annotated information that computer simulations or algorithms generate as an alternative … Read article >

The post What Is Synthetic Data? appeared first on The Official NVIDIA Blog.

Categories
Misc

Oh, Canada: NuPort Brings Autonomous Trucking to Toronto Roads with NVIDIA DRIVE

Autonomous truck technology is making its way across the Great White North. Self-driving trucking startup NuPort Robotics is leveraging NVIDIA DRIVE to develop autonomous driving systems for middle-mile short-haul routes. The Canada-based company is working with the Ontario government as well as Canadian Tire on a two-year pilot project to accelerate the commercial deployment of … Read article >

The post Oh, Canada: NuPort Brings Autonomous Trucking to Toronto Roads with NVIDIA DRIVE appeared first on The Official NVIDIA Blog.

Categories
Misc

tf2/keras: Custom RoI pool layer is not being evaluated correctly on each step

Trying to write a custom Keras layer to perform RoI pooling quickly. I have an implementation that relies on repeated tf.map_fn() calls, but it is painfully slow. I’ve seen some implementations that use ordinary Python for-loops, and I thought I’d try my own. Executed on its own, using a model that consists only of this custom layer, it works just fine. However, when used in a training loop (where I repeatedly call Model.predict_on_batch and Model.train_on_batch), it produces bizarre results.

It’s quite difficult to figure out what exactly is going on, because reading the layer output is non-trivial and I suspect it is giving me a result different from what Keras sees during training.

So I’ve inserted a print statement and noticed that during training, it produces numerical tensors on some steps, e.g.:

tf.Tensor( [8.87275487e-02 0.00000000e+00 0.00000000e+00 0.00000000e+00 6.44880116e-01 0.00000000e+00 2.37839603e+00 0.00000000e+00 0.00000000e+00 2.50582743e+00 0.00000000e+00 0.00000000e+00 4.21218348e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 4.73125458e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 9.98033524e-01 0.00000000e+00 0.00000000e+00 0.00000000e+00 4.39077109e-01 0.00000000e+00 0.00000000e+00 0.00000000e+00 1.72268832e+00 0.00000000e+00 1.20860779e+00 0.00000000e+00 0.00000000e+00 2.05427575e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 2.32518530e+00 0.00000000e+00 8.84961128e-01 0.00000000e+00 0.00000000e+00 1.05681539e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 3.33451724e+00 1.71899879e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 9.97039509e+00 

But on most steps I see this:

Tensor("model_3/roi_pool/map/while/Max:0", shape=(512,), dtype=float32) Tensor("model_3/roi_pool/map/while/Max_1:0", shape=(512,), dtype=float32) Tensor("model_3/roi_pool/map/while/Max_2:0", shape=(512,), dtype=float32) Tensor("model_3/roi_pool/map/while/Max_3:0", shape=(512,), dtype=float32) Tensor("model_3/roi_pool/map/while/Max_4:0", shape=(512,), dtype=float32) Tensor("model_3/roi_pool/map/while/Max_5:0", shape=(512,), dtype=float32) 

I believe these are tensor objects being used to construct a compute graph for deferred execution? I’m not sure why it is choosing to do this on most steps but not all.
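For reference, this matches the usual behavior of Python print() inside a traced tf.function: it runs only while the graph is being traced and shows symbolic tensors, whereas tf.print() is embedded in the graph and prints concrete values on every call. A minimal illustration, unrelated to my layer:

```
import tensorflow as tf

@tf.function
def f(x):
    print("trace time:", x)    # symbolic tensor, shown only when the function is (re)traced
    tf.print("run time:", x)   # concrete values, shown on every execution
    return x * 2

f(tf.constant([1.0, 2.0]))
f(tf.constant([3.0, 4.0]))     # same signature: no retrace, so only tf.print fires
```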

This is causing training to fail to progress because it behaves similarly to returning a tensor full of zeros.

My layer code:

```
import tensorflow as tf
from tensorflow.keras.layers import Layer

class RoIPoolingLayer(Layer):
    """
    Input shape:
        Two tensors [x_maps, x_rois] each with shape:
            x_maps: (samples, height, width, channels), representing the feature maps
                    for this batch, of type tf.float32
            x_rois: (samples, num_rois, 4), where RoIs have the ordering (y, x, height, width),
                    all tf.int32
    Output shape:
        (samples, num_rois, pool_size, pool_size, channels)
    """
    def __init__(self, pool_size, **kwargs):
        self.pool_size = pool_size
        super().__init__(**kwargs)

    def get_config(self):
        config = {
            "pool_size": self.pool_size,
        }
        base_config = super(RoIPoolingLayer, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

    def compute_output_shape(self, input_shape):
        map_shape, rois_shape = input_shape
        assert len(map_shape) == 4 and len(rois_shape) == 3 and rois_shape[2] == 4
        assert map_shape[0] == rois_shape[0]  # same number of samples
        num_samples = map_shape[0]
        num_channels = map_shape[3]
        num_rois = rois_shape[1]
        return (num_samples, num_rois, self.pool_size, self.pool_size, num_channels)

    def call(self, inputs):
        return tf.map_fn(
            fn = lambda input_pair: RoIPoolingLayer._compute_pooled_rois(
                feature_map = input_pair[0], rois = input_pair[1], pool_size = self.pool_size),
            elems = inputs,
            fn_output_signature = tf.float32  # absolutely required, else the fn type inference seems to fail spectacularly
        )

    @staticmethod
    def _compute_pooled_rois(feature_map, rois, pool_size):
        num_channels = feature_map.shape[2]
        num_rois = rois.shape[0]
        pools = []
        for roi_idx in range(num_rois):
            region_y = rois[roi_idx, 0]
            region_x = rois[roi_idx, 1]
            region_height = rois[roi_idx, 2]
            region_width = rois[roi_idx, 3]
            region_of_interest = tf.slice(feature_map, [region_y, region_x, 0],
                                          [region_height, region_width, num_channels])
            x_step = tf.cast(region_width, dtype = tf.float32) / tf.cast(pool_size, dtype = tf.float32)
            y_step = tf.cast(region_height, dtype = tf.float32) / tf.cast(pool_size, dtype = tf.float32)
            for y in range(pool_size):
                for x in range(pool_size):
                    pool_y_start = y
                    pool_x_start = x
                    pool_y_start_int = tf.cast(pool_y_start, dtype = tf.int32)
                    pool_x_start_int = tf.cast(pool_x_start, dtype = tf.int32)
                    y_start = tf.cast(pool_y_start * y_step, dtype = tf.int32)
                    x_start = tf.cast(pool_x_start * x_step, dtype = tf.int32)
                    y_end = tf.cond((pool_y_start_int + 1) < pool_size,
                                    lambda: tf.cast((pool_y_start + 1) * y_step, dtype = tf.int32),
                                    lambda: region_height)
                    x_end = tf.cond((pool_x_start_int + 1) < pool_size,
                                    lambda: tf.cast((pool_x_start + 1) * x_step, dtype = tf.int32),
                                    lambda: region_width)
                    # If the RoI is smaller than the pool area, y_end - y_start can be 0;
                    # we want to sample at least one cell.
                    y_size = tf.math.maximum(y_end - y_start, 1)
                    x_size = tf.math.maximum(x_end - x_start, 1)
                    pool_cell = tf.slice(region_of_interest, [y_start, x_start, 0],
                                         [y_size, x_size, num_channels])
                    pooled = tf.math.reduce_max(pool_cell, axis = (1, 0))  # keep channels independent
                    print(pooled)
                    pools.append(pooled)
        return tf.reshape(tf.stack(pools, axis = 0),
                          shape = (num_rois, pool_size, pool_size, num_channels))
```

Note the print statement in the loop.

Strangely, if I build a simple test model consisting of only this layer (i.e., in my unit test), I can verify that it does work:

input_map = Input(shape = (9, 8, num_channels))            # input map size
input_rois = Input(shape = (num_rois, 4), dtype = tf.int32)  # N RoIs, each of length 4 (y, x, h, w)
output = RoIPoolingLayer(pool_size = pool_size)([input_map, input_rois])
model = Model([input_map, input_rois], output)

I can then call model.predict() on some sample input and I get valid output.

But in a training loop, where I perform a prediction followed by a training step on the trainable layers, it’s not clear what it is doing. My reference implementation works fine (it does not use for-loops).

How can I debug this further?

Thank you 🙂

submitted by /u/BartTrzy