Categories
Misc

NVIDIA Research: Fast Uncertainty Quantification for Deep Object Pose Estimation

Researchers from NVIDIA, the University of Texas at Austin, and Caltech developed a simple, efficient, and plug-and-play uncertainty quantification method for the 6-DoF (degrees of freedom) object pose estimation task, using an ensemble of K pre-trained estimators with different architectures and/or training data sources.

The researchers presented their paper, “Fast Uncertainty Quantification for Deep Object Pose Estimation” (FastUQ), at the 2021 International Conference on Robotics and Automation (ICRA 2021).

FastUQ focuses on uncertainty quantification for deep object pose estimation. In deep learning-based object pose estimation (see NVIDIA DOPE), a key challenge is that pose estimators can be overconfident in their pose predictions.

For example, the two figures below show pose estimation results for the “Ketchup” object from a DOPE model in a manipulation task. Both predictions are highly confident, but the left one is incorrect.

Another challenge addressed is the sim-to-real gap. Deep learning-based pose estimators are typically trained on synthetic datasets (rendered with NVIDIA’s ray-tracing renderer, NViSII), but we want to apply these estimators in the real world and quantify their uncertainty there. For example, the left figure below is from the synthetic NViSII dataset, and the right one is from the real world.

In this project, we propose an ensemble-based method for fast uncertainty quantification of deep learning-based pose estimators. The idea is demonstrated in the following two figures: in the left one, the deep models in the ensemble disagree with each other, which implies higher uncertainty; in the right one, the models agree, which reflects lower uncertainty.
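A minimal sketch of the ensemble-disagreement idea is shown below (illustrative only; the average pairwise rotation/translation spread and the `estimate_pose` helper are assumptions, not the exact FastUQ metric):

```python
import numpy as np

def rotation_angle_deg(R1, R2):
    """Geodesic distance (in degrees) between two 3x3 rotation matrices."""
    cos_theta = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def ensemble_disagreement(poses):
    """
    poses: list of (R, t) tuples from K independently trained estimators,
           where R is a 3x3 rotation matrix and t a 3-vector translation.
    Returns the average pairwise rotation (deg) and translation spread.
    """
    K = len(poses)
    rot_d, trans_d = [], []
    for i in range(K):
        for j in range(i + 1, K):
            Ri, ti = poses[i]
            Rj, tj = poses[j]
            rot_d.append(rotation_angle_deg(Ri, Rj))
            trans_d.append(np.linalg.norm(np.asarray(ti) - np.asarray(tj)))
    return np.mean(rot_d), np.mean(trans_d)

# Hypothetical usage: query K pre-trained estimators on the same image and
# treat large disagreement as high predictive uncertainty.
# poses = [estimate_pose(model_k, image) for model_k in ensemble]
# rot_spread, trans_spread = ensemble_disagreement(poses)
```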

This research is interdisciplinary, and it was made possible by the joint efforts of different research teams at NVIDIA:

  • The AI Algorithms team, led by Anima Anandkumar, and the NVIDIA AI Robotics Research Lab in Seattle, which worked on the uncertainty quantification methods
  • The Learning and Perception Research team, led by Jan Kautz, which trained the deep object pose estimation models and provided photorealistic synthetic data from NVIDIA’s ray-tracing renderer, NViSII

For training the deep estimators and generating the high-fidelity photorealistic synthetic datasets, the team used NVIDIA V100 GPUs and NVIDIA OptiX (C++/CUDA back-end) for acceleration.

FastUQ is a novel fast uncertainty quantification method for deep object pose estimation, which is efficient, plug-and-play, and supports a general class of pose estimation tasks. This research has potentially significant impacts in autonomous driving and general autonomy, including more robust and safe perception, and uncertainty-aware control and planning.

To learn more about the research, visit the FastUQ project website.

Categories
Offsites

Using Variational Transformer Networks to Automate Document Layout Design

Information in a written document is not only conveyed by the meaning of the words contained in it, but also by the overall document layout. Layouts are commonly used to direct the order in which the reader parses a document to enable a better understanding (e.g., with columns or paragraphs), to provide helpful summaries (e.g., with titles) or for aesthetic purposes (e.g., when displaying advertisements).

While these design rules are easy to follow, it is difficult to explicitly define them without quickly needing to include exceptions or encountering ambiguous cases. This makes the automation of document design difficult, as any system with a hardcoded set of production rules will either be overly simplistic and thus incapable of producing original layouts (causing a lack of diversity in the layout of synthesized data), or too complex, with a large set of rules and their accompanying exceptions. In an attempt to solve this challenge, some have proposed machine learning (ML) techniques to synthesize document layouts. However, most ML-based solutions for automatic document design do not scale to a large number of layout components, or they rely on additional information for training, such as the relationships between the different components of a document.

In “Variational Transformer Networks for Layout Generation”, to be presented at CVPR 2021, we create a document layout generation system that scales to an arbitrarily large number of elements and does not require any additional information to capture the relationships between design elements. We use self-attention layers as building blocks of a variational autoencoder (VAE), which is able to model document layout design rules as a distribution, rather than using a set of predetermined heuristics, increasing the diversity of the generated layouts. The resulting Variational Transformer Network (VTN) model is able to extract meaningful relationships between the layout elements (paragraphs, tables, images, etc.), resulting in realistic synthetic documents (e.g., better alignment and margins). We show the effectiveness of this combination across different domains, such as scientific papers, UI layouts, and even furniture arrangements.

VAEs for Layout Generation
The ultimate goal of this system is to infer the design rules for a given type of layout from a collection of examples. If one considers these design rules as the distribution underlying the data, it is possible to use probabilistic models to discover it. We propose doing this with a VAE (widely used for tasks like image generation or anomaly detection), an autoencoder architecture that consists of two distinct subparts, the encoder and decoder. The encoder learns to compress the input to fewer dimensions, retaining only the necessary information to reconstruct the input, while the decoder learns to undo this operation. The compressed representation (also called the bottleneck) can be forced to behave like a known distribution (e.g., a uniform Gaussian). Feeding samples from this a priori distribution to the decoder segment of the network results in outputs similar to the training data.
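For illustration, here is a minimal VAE sketch (arbitrary layer sizes, not the VTN configuration) showing the encoder, the Gaussian bottleneck with the reparameterization trick, and sampling from the prior at generation time:

```python
import tensorflow as tf

class TinyVAE(tf.keras.Model):
    """Minimal VAE: encoder -> Gaussian bottleneck -> decoder."""
    def __init__(self, input_dim=256, latent_dim=32):
        super().__init__()
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(2 * latent_dim),  # mean and log-variance
        ])
        self.decoder = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(input_dim),
        ])
        self.latent_dim = latent_dim

    def call(self, x):
        z_mean, z_log_var = tf.split(self.encoder(x), 2, axis=-1)
        # Reparameterization trick: z = mean + sigma * eps.
        eps = tf.random.normal(tf.shape(z_mean))
        z = z_mean + tf.exp(0.5 * z_log_var) * eps
        # KL term regularizes the bottleneck toward the unit-Gaussian prior.
        kl = -0.5 * tf.reduce_mean(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
        self.add_loss(kl)
        return self.decoder(z)

    def sample(self, n=1):
        # Generation: feed samples from the prior to the decoder.
        return self.decoder(tf.random.normal((n, self.latent_dim)))

# Usage sketch (hypothetical data):
# vae = TinyVAE()
# vae.compile(optimizer="adam", loss="mse")
# vae.fit(x_train, x_train, epochs=10)
# new_samples = vae.sample(5)
```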

An additional advantage of the VAE formulation is that it is agnostic to the type of operations used to implement the encoder and decoder segments. As such, we use self-attention layers (typically seen in Transformer architectures) to automatically capture the influence that each layout element has over the rest.

Transformers use self-attention layers to model long-range relationships within sequences. They are often applied to natural language understanding tasks, such as translation and summarization, and have also been used beyond the language domain for object detection and document layout understanding. The self-attention operation relates every element in a sequence to every other element and determines how they influence each other. This property is ideal for modeling relationships across different elements in a layout without the need for explicit annotations.
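As a small illustration of the self-attention operation over a sequence of layout elements (shapes and hyperparameters chosen arbitrarily):

```python
import tensorflow as tf

# A batch of 2 layouts, each with 10 elements embedded into 64-dim vectors
# (in practice the embeddings would encode element class and bounding box).
elements = tf.random.normal((2, 10, 64))

# Self-attention: queries, keys, and values all come from the same sequence,
# so every layout element attends to every other element.
attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)
contextualized, scores = attn(query=elements, value=elements, key=elements,
                              return_attention_scores=True)

print(contextualized.shape)  # (2, 10, 64): one updated vector per element
print(scores.shape)          # (2, 4, 10, 10): per-head element-to-element weights
```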

In order to synthesize new samples from these relationships, some approaches for layout generation [e.g., 1] and even for other domains [e.g., 2, 3] rely on greedy search algorithms, such as beam search, nucleus sampling or top-k sampling. Since these strategies are often based on exploration rules that tend to favor the most likely outcome at every step, the diversity of the generated samples is not guaranteed. However, by combining self-attention with the VAE’s probabilistic techniques, the model is able to directly learn a distribution from which it can extract new elements.
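For context, here is a minimal sketch of one such truncated sampling strategy (top-k); the VTN avoids this by sampling new layouts directly from the learned latent distribution:

```python
import numpy as np

def top_k_sample(logits, k=5):
    """Truncated sampling: keep the k most likely candidates, renormalize, sample."""
    top = np.argsort(logits)[-k:]                      # indices of the k largest logits
    probs = np.exp(logits[top] - np.max(logits[top]))  # softmax over the kept logits
    probs /= probs.sum()
    return np.random.choice(top, p=probs)

# e.g., top_k_sample(np.array([2.0, 0.5, 0.1, -1.0, 3.0]), k=2) returns index 4 or 0
```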

Modeling the Variational Bottleneck
The bottleneck of a VAE is commonly modeled as a vector representing the input. Since self-attention layers are a sequence-to-sequence architecture, i.e., a sequence of n input elements is mapped onto n output elements, the standard VAE formulation is difficult to apply. Inspired by BERT, we append an auxiliary token to the beginning of the sequence and treat it as the autoencoder bottleneck vector z. During training, the vector associated with this token is the only piece of information passed to the decoder, so the encoder needs to learn how to compress the entire document information in this vector. The decoder then learns to infer the number of elements in the document as well as the locations of each element in the input sequence from this vector alone. This strategy allows us to use standard techniques to regularize the bottleneck, such as the KL divergence.
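A rough sketch of this auxiliary-token trick (illustrative only; the "encoder" here is a single self-attention layer rather than the full VTN encoder):

```python
import tensorflow as tf

def encode_to_bottleneck(elements, encoder, z_token):
    """
    elements: (batch, n, d) embedded layout elements.
    encoder:  any sequence-to-sequence module (e.g., stacked self-attention layers).
    z_token:  a learned (1, 1, d) embedding prepended to every sequence, like BERT's [CLS].
    Returns the bottleneck vector z of shape (batch, d): only the output at the
    auxiliary token's position is kept, so the encoder must compress the whole
    layout into it.
    """
    batch = tf.shape(elements)[0]
    tokens = tf.tile(z_token, [batch, 1, 1])           # one auxiliary token per sample
    sequence = tf.concat([tokens, elements], axis=1)   # (batch, n + 1, d)
    encoded = encoder(sequence)                        # (batch, n + 1, d)
    return encoded[:, 0, :]                            # keep only the bottleneck position

# Illustrative wiring: a single self-attention layer standing in for the encoder.
d = 64
z_token = tf.Variable(tf.random.normal((1, 1, d)))
attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)
encoder = lambda s: attn(query=s, value=s, key=s)
z = encode_to_bottleneck(tf.random.normal((2, 10, d)), encoder, z_token)
print(z.shape)  # (2, 64)
```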

Decoding
In order to synthesize documents with varying numbers of elements, the network needs to model sequences of arbitrary length, which is not trivial. While self-attention enables the encoder to adapt automatically to any number of elements, the decoder segment does not know the number of elements in advance. We overcome this issue by decoding sequences in an autoregressive way — at every step, the decoder produces an element, which is concatenated to the previously decoded elements (starting with the bottleneck vector z as input), until a special stop element is produced.
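A simplified sketch of this decoding loop, with a hypothetical `decoder_step` callable standing in for the actual Transformer decoder:

```python
import tensorflow as tf

def autoregressive_decode(z, decoder_step, stop_threshold=0.5, max_elements=128):
    """
    z:            (1, d) bottleneck vector sampled from the prior or the encoder.
    decoder_step: hypothetical callable mapping the sequence decoded so far to
                  (next_element, stop_probability) for the next position.
    Decodes elements one by one until a stop element is produced.
    """
    decoded = tf.expand_dims(z, axis=1)       # start the sequence with z itself
    elements = []
    for _ in range(max_elements):
        next_element, p_stop = decoder_step(decoded)
        if p_stop > stop_threshold:           # special stop element produced
            break
        elements.append(next_element)
        # Concatenate the new element so the next step can attend to it.
        decoded = tf.concat([decoded, next_element[:, None, :]], axis=1)
    return elements
```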

A visualization of our proposed architecture

Turning Layouts into Input Data
A document is often composed of several design elements, such as paragraphs, tables, images, titles, footnotes, etc. In terms of design, layout elements are often represented by the coordinates of their enclosing bounding boxes. To make this information easily digestible for a neural network, we define each element with four variables (x, y, width, height), representing the element’s location on the page (x, y) and size (width, height).
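A minimal sketch of this encoding step (the normalization by page size is an assumption made here for illustration):

```python
import tensorflow as tf

def boxes_to_layout_tensor(boxes, page_width, page_height):
    """
    boxes: list of (x_min, y_min, x_max, y_max) bounding boxes in page coordinates.
    Returns an (n, 4) tensor of (x, y, width, height) rows; normalizing by the
    page size keeps all coordinates in [0, 1].
    """
    rows = []
    for x_min, y_min, x_max, y_max in boxes:
        rows.append([
            x_min / page_width,             # x
            y_min / page_height,            # y
            (x_max - x_min) / page_width,   # width
            (y_max - y_min) / page_height,  # height
        ])
    return tf.constant(rows, dtype=tf.float32)

# Example: a title block and two text columns on an A4-sized page (in points).
layout = boxes_to_layout_tensor(
    [(50, 40, 545, 90), (50, 110, 290, 800), (305, 110, 545, 800)],
    page_width=595, page_height=842)
print(layout.shape)  # (3, 4)
```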

Results
We evaluate the performance of the VTN following two criteria: layout quality and layout diversity. We train the model on publicly available document datasets, such as PubLayNet, a collection of scientific papers with layout annotations, and evaluate the quality of generated layouts by quantifying the amount of overlap and alignment between elements. We measure how well the synthetic layouts resemble the training distribution using the Wasserstein distance over the distributions of element classes (e.g., paragraphs, images, etc.) and bounding boxes. In order to capture the layout diversity, we find the most similar real sample for each generated document using the DocSim metric, where a higher number of unique matches to the real data indicates a more diverse outcome.
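As a simplified illustration of the diversity criterion, one can count unique nearest-neighbor matches; the `similarity` callable below stands in for DocSim and is left abstract here:

```python
def diversity_by_unique_matches(generated_layouts, real_layouts, similarity):
    """
    For each generated layout, find the most similar real layout under a
    `similarity` function (the paper uses the DocSim metric). A higher count
    of *unique* real matches indicates a more diverse set of generated layouts.
    """
    matched_indices = set()
    for g in generated_layouts:
        best = max(range(len(real_layouts)), key=lambda i: similarity(g, real_layouts[i]))
        matched_indices.add(best)
    return len(matched_indices)
```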

We compare the VTN approach to previous works like LayoutVAE and Gupta et al. The former is a VAE-based formulation with an LSTM backbone, whereas Gupta et al. use a self-attention mechanism similar to ours, combined with standard search strategies (beam search). The results below show that LayoutVAE struggles to comply with design rules, like strict alignments, as in the case of PubLayNet. Thanks to the self-attention operation, Gupta et al. can model these constraints much more effectively, but the usage of beam search affects the diversity of the results.

| | IoU | Overlap | Alignment | Wasserstein Class ↓ | Wasserstein Box ↓ | # Unique Matches ↑ |
|---|---|---|---|---|---|---|
| LayoutVAE | 0.171 | 0.321 | 0.472 | - | 0.045 | 241 |
| Gupta et al. | 0.039 | 0.006 | 0.361 | 0.018 | 0.012 | 546 |
| VTN | 0.031 | 0.017 | 0.347 | 0.022 | 0.012 | 697 |
| Real Data | 0.048 | 0.007 | 0.353 | - | - | - |

Results on PubLayNet. Down arrows (↓) indicate that a lower score is better, whereas up arrows (↑) indicate higher is better.

We also explore the ability of our approach to learn design rules in other domains, such as Android UIs (RICO), natural scenes (COCO) and indoor scenes (SUN RGB-D). Our method effectively learns the design rules of these datasets and produces synthetic layouts of similar quality as the current state of the art and a higher degree of diversity.

| | IoU | Overlap | Alignment | Wasserstein Class ↓ | Wasserstein Box ↓ | # Unique Matches ↑ |
|---|---|---|---|---|---|---|
| LayoutVAE | 0.193 | 0.400 | 0.416 | - | 0.045 | 496 |
| Gupta et al. | 0.086 | 0.145 | 0.366 | 0.004 | 0.023 | 604 |
| VTN | 0.115 | 0.165 | 0.373 | 0.007 | 0.018 | 680 |
| Real Data | 0.084 | 0.175 | 0.410 | - | - | - |

Results on RICO. Down arrows (↓) indicate that a lower score is better, whereas up arrows (↑) indicate higher is better.

| | IoU | Overlap | Alignment | Wasserstein Class ↓ | Wasserstein Box ↓ | # Unique Matches ↑ |
|---|---|---|---|---|---|---|
| LayoutVAE | 0.325 | 2.819 | 0.246 | - | 0.062 | 700 |
| Gupta et al. | 0.194 | 1.709 | 0.334 | 0.001 | 0.016 | 601 |
| VTN | 0.197 | 2.384 | 0.330 | 0.0005 | 0.013 | 776 |
| Real Data | 0.192 | 1.724 | 0.347 | - | - | - |

Results for COCO. Down arrows (↓) indicate that a lower score is better, whereas up arrows (↑) indicate higher is better.

Below are some examples of layouts produced by our method compared to existing methods. The design rules learned by the network (location, margins, alignment) resemble those of the original data and show a high degree of variability.

Qualitative results of our method (VTN) on PubLayNet compared to existing state-of-the-art methods (LayoutVAE and Gupta et al.).

Conclusion
In this work we show the feasibility of using self-attention as part of the VAE formulation. We validate the effectiveness of this approach for layout generation, achieving state-of-the-art performance on various datasets and across different tasks. Our research paper also explores alternative architectures for the integration of self-attention and VAEs, exploring non-autoregressive decoding strategies and different types of priors, and analyzes advantages and disadvantages. The layouts produced by our method can help to create synthetic training data for downstream tasks, such as document parsing or automating graphic design tasks. We hope that this work provides a foundation for continued research in this area, as many subproblems are still not completely solved, such as how to suggest styles for the elements in the layout (text font, which image to choose, etc.) or how to reduce the amount of training data necessary for the model to generalize.

Acknowledgements
We thank our co-author Janis Postels, as well as Alessio Tonioni and Luca Prasso for helping with the design of several of our experiments. We also thank Tom Small for his help creating the animations for this post.

Categories
Misc

NVIDIA RTX GPUs and KeyShot Accelerate Rendering for Caustics

NVIDIA RTX ray tracing has transformed graphics and rendering. With powerful software applications like Luxion KeyShot, more users can take advantage of RTX technology to speed up graphic workflows — like rendering caustics.

Caustics are formed as light refracts through or reflects off specular surfaces. Examples of caustics include the light focusing through a glass, the shimmering light at the bottom of a swimming pool, and even the beams of light from windows into a dusty environment.

When it comes to rendering caustics, photon mapping is a key technique. Photon mapping works by tracing photons from the light sources into the scene and storing these photons as they interact with the surfaces or volumes in the scene.
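For illustration, here is a toy CPU-side sketch of that photon-tracing pass (not KeyShot’s GPU implementation; the `lights` and `trace_ray` scene helpers are hypothetical):

```python
import random

def build_photon_map(lights, trace_ray, num_photons=100_000):
    """
    Toy sketch of a photon-tracing pass. `lights` provide emit() -> (origin,
    direction, power) and `trace_ray` is a hypothetical scene routine that
    returns (hit_point, surface, new_direction) or None if the photon escapes.
    """
    photon_map = []
    for _ in range(num_photons):
        light = random.choice(lights)
        origin, direction, power = light.emit()
        for _bounce in range(8):                   # cap the photon path length
            hit = trace_ray(origin, direction)
            if hit is None:
                break                               # photon left the scene
            hit_point, surface, new_direction = hit
            if surface.is_diffuse:
                # Store the photon where it lands; caustics appear where many
                # photons arrive via specular (glass/mirror) paths.
                photon_map.append((hit_point, power, direction))
            # Russian roulette termination; surviving photons are re-weighted
            # so the expected energy stays correct.
            if random.random() > surface.reflectance:
                break
            power /= surface.reflectance
            origin, direction = hit_point, new_direction
    return photon_map
```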

KeyShot implements a full progressive photon-mapping algorithm on the GPU that’s capable of rendering caustics, including reflections of caustics.

By combining RTX technology with the CUDA programming framework, ray tracing, photon mapping, and shading can now all run on the GPU. The power of RTX GPUs accelerates the full rendering of caustics, resulting in detailed, interactive images with reflections and refractions.

Image courtesy of David Merz III at Vyzdom.

When developing KeyShot 10, the team analyzed the GPU implementation and decided to see if they could improve the photon map implementation and obtain faster caustics. The result is a caustics algorithm that is able to handle thousands of lights, quickly render highly detailed caustics up close, and run significantly faster on the new NVIDIA Ampere RTX GPUs.

“Using the new caustics algorithm in KeyShot 10, we started getting details and fine structures in the caustics that normally would not be seen due to the time it takes to get to this detail level”, said Dr. Henrik Wann Jensen, chief scientist at Luxion.

All images are rendered on the GPU, they can be manipulated interactively in KeyShot 10, and the caustics update with any changes made to the scene.

Image courtesy of David Merz III at Vyzdom.

“The additional memory in NVIDIA RTX A6000 is exactly what I need for working with geometry nodes in KeyShot, and I even tested how high I could push the VRAM with the ice cube scene,” said David Merz, founder and chief creative at Vyzdom. “I’m hooked on the GDDR6 memory, and I could definitely get used to this 48GB ceiling. Not needing to worry about VRAM limitations allows me to get into a carefree creative rhythm, turning a process from frustrating to fluid artistic execution, which benefits both the client and me”.

With NVIDIA RTX GPUs powering caustics algorithms in KeyShot 10, users working with transparent or reflective products can easily render high-quality images with photorealistic details.

Read the KeyShot blog to learn more about NVIDIA RTX and caustics rendering.

Categories
Misc

How I used the OpenCV AI Kit to Control a Drone

submitted by /u/AugmentedStartups
Categories
Misc

A Theoretical and Practical Guide to Probabilistic Graphical Models with Tensorflow

submitted by /u/OB_two
Categories
Misc

What Is Synthetic Data?

Data is the new oil in today’s age of AI, but only a lucky few are sitting on a gusher. So, many are making their own fuel, one that’s both inexpensive and effective. It’s called synthetic data. What Is Synthetic Data? Synthetic data is annotated information that computer simulations or algorithms generate as an alternative Read article >

The post What Is Synthetic Data? appeared first on The Official NVIDIA Blog.

Categories
Misc

Oh, Canada: NuPort Brings Autonomous Trucking to Toronto Roads with NVIDIA DRIVE

Autonomous truck technology is making its way across the Great White North. Self-driving trucking startup NuPort Robotics is leveraging NVIDIA DRIVE to develop autonomous driving systems for middle-mile short-haul routes. The Canada-based company is working with the Ontario government as well as Canadian Tire on a two-year pilot project to accelerate the commercial deployment of Read article >

The post Oh, Canada: NuPort Brings Autonomous Trucking to Toronto Roads with NVIDIA DRIVE appeared first on The Official NVIDIA Blog.

Categories
Misc

Data Science & Data Analytics eBooks bundle by Mercury

submitted by /u/reps_up

Categories
Misc

tf2/keras: Custom RoI pool layer is not being evaluated correctly on each step

Trying to write a custom Keras layer to perform RoI pooling quickly. I have an implementation that relies on repeated tf.map_fn() calls but it is painfully slow. I’ve seen some that use ordinary Python for-loops and I thought I’d try my own. Executed on its own using a model that consists only of this custom layer, it works just fine. However when used in a training loop (where I repeatedly call Model.predict_on_batch and Model.train_on_batch), it produces bizarre results.

It’s quite difficult to figure out what exactly is going on because reading the layer output is non-trivial and I suspect is giving me a result different than what Keras sees during training.

So I’ve inserted a print statement and notice that during training, it will produce numerical tensors on some steps, e.g.:

tf.Tensor( [8.87275487e-02 0.00000000e+00 0.00000000e+00 0.00000000e+00 6.44880116e-01 0.00000000e+00 2.37839603e+00 0.00000000e+00 0.00000000e+00 2.50582743e+00 0.00000000e+00 0.00000000e+00 4.21218348e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 4.73125458e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 9.98033524e-01 0.00000000e+00 0.00000000e+00 0.00000000e+00 4.39077109e-01 0.00000000e+00 0.00000000e+00 0.00000000e+00 1.72268832e+00 0.00000000e+00 1.20860779e+00 0.00000000e+00 0.00000000e+00 2.05427575e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 2.32518530e+00 0.00000000e+00 8.84961128e-01 0.00000000e+00 0.00000000e+00 1.05681539e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 3.33451724e+00 1.71899879e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00 9.97039509e+00 

But on most steps I see this:

Tensor("model_3/roi_pool/map/while/Max:0", shape=(512,), dtype=float32) Tensor("model_3/roi_pool/map/while/Max_1:0", shape=(512,), dtype=float32) Tensor("model_3/roi_pool/map/while/Max_2:0", shape=(512,), dtype=float32) Tensor("model_3/roi_pool/map/while/Max_3:0", shape=(512,), dtype=float32) Tensor("model_3/roi_pool/map/while/Max_4:0", shape=(512,), dtype=float32) Tensor("model_3/roi_pool/map/while/Max_5:0", shape=(512,), dtype=float32) 

I believe these are tensor objects being used to construct a compute graph for deferred execution? I’m not sure why it is choosing to do this on most steps but not all.

This is causing training to fail to progress because it behaves similarly to returning a tensor full of zeros.
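For reference, here is a minimal illustration (TF 2.x) of the tracing behavior I suspect: a plain Python print only fires while the graph is being traced, whereas tf.print becomes a graph op and runs on every step:

```python
import tensorflow as tf

@tf.function
def step(x):
    print("traced:", x)        # Python print: runs only while the graph is traced,
                               # so it shows a symbolic Tensor("x:0", ...) object.
    tf.print("executed:", x)   # tf.print: becomes a graph op and runs every call,
                               # printing the actual numerical values.
    return x * 2

step(tf.constant(1.0))  # both lines print (first call triggers tracing)
step(tf.constant(2.0))  # only the tf.print line prints
```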

My layer code:

```python
import tensorflow as tf
from tensorflow.keras.layers import Layer


class RoIPoolingLayer(Layer):
    """
    Input shape:
        Two tensors [x_maps, x_rois] with shapes:
        x_maps: (samples, height, width, channels), the feature maps for this batch, tf.float32
        x_rois: (samples, num_rois, 4), where RoIs have the ordering (y, x, height, width), all tf.int32
    Output shape:
        (samples, num_rois, pool_size, pool_size, channels)
    """
    def __init__(self, pool_size, **kwargs):
        self.pool_size = pool_size
        super().__init__(**kwargs)

    def get_config(self):
        config = {"pool_size": self.pool_size}
        base_config = super(RoIPoolingLayer, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

    def compute_output_shape(self, input_shape):
        map_shape, rois_shape = input_shape
        assert len(map_shape) == 4 and len(rois_shape) == 3 and rois_shape[2] == 4
        assert map_shape[0] == rois_shape[0]  # same number of samples
        num_samples = map_shape[0]
        num_channels = map_shape[3]
        num_rois = rois_shape[1]
        return (num_samples, num_rois, self.pool_size, self.pool_size, num_channels)

    def call(self, inputs):
        return tf.map_fn(
            fn = lambda input_pair: RoIPoolingLayer._compute_pooled_rois(
                feature_map = input_pair[0], rois = input_pair[1], pool_size = self.pool_size),
            elems = inputs,
            fn_output_signature = tf.float32  # absolutely required, else the fn type inference fails
        )

    @staticmethod
    def _compute_pooled_rois(feature_map, rois, pool_size):
        num_channels = feature_map.shape[2]
        num_rois = rois.shape[0]
        pools = []
        for roi_idx in range(num_rois):
            region_y = rois[roi_idx, 0]
            region_x = rois[roi_idx, 1]
            region_height = rois[roi_idx, 2]
            region_width = rois[roi_idx, 3]
            region_of_interest = tf.slice(feature_map, [region_y, region_x, 0],
                                          [region_height, region_width, num_channels])
            x_step = tf.cast(region_width, dtype = tf.float32) / tf.cast(pool_size, dtype = tf.float32)
            y_step = tf.cast(region_height, dtype = tf.float32) / tf.cast(pool_size, dtype = tf.float32)
            for y in range(pool_size):
                for x in range(pool_size):
                    pool_y_start = y
                    pool_x_start = x
                    pool_y_start_int = tf.cast(pool_y_start, dtype = tf.int32)
                    pool_x_start_int = tf.cast(pool_x_start, dtype = tf.int32)
                    y_start = tf.cast(pool_y_start * y_step, dtype = tf.int32)
                    x_start = tf.cast(pool_x_start * x_step, dtype = tf.int32)
                    y_end = tf.cond((pool_y_start_int + 1) < pool_size,
                                    lambda: tf.cast((pool_y_start + 1) * y_step, dtype = tf.int32),
                                    lambda: region_height)
                    x_end = tf.cond((pool_x_start_int + 1) < pool_size,
                                    lambda: tf.cast((pool_x_start + 1) * x_step, dtype = tf.int32),
                                    lambda: region_width)
                    # If the RoI is smaller than the pool area, y_end - y_start can be 0;
                    # we want to sample at least one cell.
                    y_size = tf.math.maximum(y_end - y_start, 1)
                    x_size = tf.math.maximum(x_end - x_start, 1)
                    pool_cell = tf.slice(region_of_interest, [y_start, x_start, 0],
                                         [y_size, x_size, num_channels])
                    pooled = tf.math.reduce_max(pool_cell, axis = (1, 0))  # keep channels independent
                    print(pooled)
                    pools.append(pooled)
        return tf.reshape(tf.stack(pools, axis = 0),
                          shape = (num_rois, pool_size, pool_size, num_channels))
```

Note the print statement in the loop.

Strangely, if I build a simple test model consisting of only this layer (i.e., in my unit test), I can verify that it does work:

```python
input_map = Input(shape = (9, 8, num_channels))              # input map size
input_rois = Input(shape = (num_rois, 4), dtype = tf.int32)  # N RoIs, each of length 4 (y, x, h, w)
output = RoIPoolingLayer(pool_size = pool_size)([input_map, input_rois])
model = Model([input_map, input_rois], output)
```

I can then call model.predict() on some sample input and I get valid output.

But in a training loop, where I perform a prediction followed by a training step on the trainable layers, it’s not clear what it is doing. My reference implementation works fine (it does not use for-loops).

How can I debug this further?

Thank you 🙂

submitted by /u/BartTrzy

Categories
Misc

AI Nails It: Viral Manicure Robot Powered by GPU-Accelerated Computer Vision

San Francisco startup Clockwork recently launched a pop-up location offering the first robot nail painting service in the form of 10-minute “minicures.”

The robot uses structured-light 3D scanners to detect the shape of a customer’s fingernails — with AI models directing a plastic-tipped cartridge that paints nails one at a time. Appointments don’t yet include other manicure services such as nail trimming, buffing and shaping. 

After taking photos of a client’s fingernail, Clockwork uses CUDA to accelerate the 3D point cloud reconstruction from 20 seconds to under a second. The images from both cameras are stitched together and handed over to the AI model, which identifies edges with 0.3 mm accuracy. The process repeats for each fingernail.

An NVIDIA GPU on the robot is used for real-time inference during the minicure, while the AI models are trained on NVIDIA Tensor Core GPUs on Google Cloud.   

The company, which is charging $8 per nail painting appointment, plans to make its devices available in offices, retail stores and apartment buildings. After going viral on TikTok, appointments at its San Francisco store are booked up for weeks. 

Read the full article in The New York Times >>