Categories
Misc

Variational Autoencoder – ValueError: No gradients provided for any variable (TensorFlow 2.6)

I am implementing a toy Variational Autoencoder in TensorFlow 2.6 and Python 3.9 for the MNIST dataset. The code is:

# Specify latent space-
latent_dim = 3


class Sampling(layers.Layer):
    '''
    Create a sampling layer.
    Uses (z_mean, z_log_var) to sample z - the vector encoding a digit.
    '''
    def call(self, inputs):
        z_mean, z_log_var = inputs
        batch = tf.shape(z_mean)[0]
        dim = tf.shape(z_mean)[1]
        epsilon = tf.keras.backend.random_normal(shape = (batch, dim))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon


class Encoder(Model):
    def __init__(self):
        super(Encoder, self).__init__()
        self.conv1 = Conv2D(
            filters = 32, kernel_size = (3, 3),
            activation = None, strides = 2,
            padding = "same")
        self.conv2 = Conv2D(
            filters = 64, kernel_size = (3, 3),
            activation = "relu", strides = 2,
            padding = "same")
        self.flatten = Flatten()
        self.dense = Dense(
            units = 16, activation = None
        )

    def call(self, x):
        x = tf.keras.activations.relu(self.conv1(x))
        x = tf.keras.activations.relu(self.conv2(x))
        x = self.flatten(x)
        x = tf.keras.activations.relu(self.dense(x))
        return x


class Decoder(Model):
    def __init__(self):
        super(Decoder, self).__init__()
        self.dense = Dense(
            units = 7 * 7 * 64, activation = None)
        self.conv_tran_1 = Conv2DTranspose(
            filters = 64, kernel_size = (3, 3),
            activation = None, strides = 2,
            padding = "same")
        self.conv_tran_2 = Conv2DTranspose(
            filters = 32, kernel_size = (3, 3),
            activation = None, strides = 2,
            padding = "same")
        self.decoder_outputs = Conv2DTranspose(
            filters = 1, kernel_size = (3, 3),
            activation = None, padding = "same")

    def call(self, x):
        x = tf.keras.activations.relu(self.dense(x))
        x = layers.Reshape((7, 7, 64))(x)
        x = tf.keras.activations.relu(self.conv_tran_1(x))
        x = tf.keras.activations.relu(self.conv_tran_2(x))
        x = self.decoder_outputs(x)
        return x


class VAE(Model):
    def __init__(self, latent_space = 3):
        super(VAE, self).__init__()
        self.latent_space = latent_space
        self.encoder = Encoder()
        self.z_mean = Dense(units = self.latent_space, activation = None)
        self.z_log_var = Dense(units = self.latent_space, activation = None)
        self.decoder = Decoder()

    def reparameterize(self, encoded_mean, encoded_log_var):
        # NOT USED!
        # encoded_mean = self.z_mean(x)
        # encoded_log_var = self.z_log_var(x)
        batch = tf.shape(encoded_mean)[0]
        encoded_dim = tf.shape(encoded_mean)[1]
        epsilon = tf.keras.backend.random_normal(shape = (batch, encoded_dim))
        return encoded_mean + tf.exp(0.5 * encoded_log_var) * epsilon

    def call(self, x):
        x = self.encoder(x)
        mu = self.z_mean(x)
        log_var = self.z_log_var(x)
        # z = self.reparameterize(mu, log_var)
        z = Sampling()([mu, log_var])
        """
        print(f"encoded_x.shape: {x.shape}, mu.shape: {mu.shape},"
        f" log_var.shape: {log_var.shape} & z.shape: {z.shape}")
        """
        # encoded_x.shape: (batch_size, 16), mu.shape: (6, 3), log_var.shape: (6, 3) & z.shape: (6, 3)
        x = tf.keras.activations.sigmoid(self.decoder(z))
        return x, mu, log_var


# Initialize a VAE architecture-
model = VAE(latent_space = 3)

X = X_train[:6, :]

# Sanity check-
recon_output, mu, log_var = model(X)

X.shape, recon_output.shape
# ((6, 28, 28, 1), TensorShape([6, 28, 28, 1]))

mu.shape, log_var.shape
# (TensorShape([6, 3]), TensorShape([6, 3]))

# Define optimizer-
optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001)

# Either of the two can be used-
# recon_loss = tf.reduce_mean(tf.reduce_sum(tf.keras.losses.binary_crossentropy(X, recon_output), axis = (1, 2)))
recon_loss = tf.reduce_mean(tf.reduce_sum(tf.keras.losses.mean_squared_error(X, recon_output), axis = (1, 2)))

recon_loss.numpy()
# 180.46837

# Implement training step using tf.GradientTape API-
with tf.GradientTape() as tape:
    # z_mean, z_log_var, z = self.encoder(data)
    # reconstruction = self.decoder(z)
    reconstruction_loss = tf.reduce_mean(
        tf.reduce_sum(
            tf.keras.losses.mean_squared_error(X, recon_output),
            axis=(1, 2)
        )
    )
    kl_loss = -0.5 * (1 + log_var - tf.square(mu) - tf.exp(log_var))
    kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis = 1))
    total_loss = reconstruction_loss + kl_loss

kl_loss.numpy(), reconstruction_loss.numpy(), total_loss.numpy()
# (0.005274256, 180.46837, 180.47365)

# Compute gradients wrt cost-
grads = tape.gradient(total_loss, model.trainable_weights)

type(grads), len(grads)
# (list, 18)

# Apply gradient descent using defined optimizer-
optimizer.apply_gradients(zip(grads, model.trainable_weights))

The last call (optimizer.apply_gradients()) gives me this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_6484/111942921.py in <module>
----> 1 optimizer.apply_gradients(zip(grads, model.trainable_weights))

~\anaconda3\envs\tf-cpu\lib\site-packages\tensorflow\python\keras\optimizer_v2\optimizer_v2.py in apply_gradients(self, grads_and_vars, name, experimental_aggregate_gradients)
    639       RuntimeError: If called in a cross-replica context.
    640     """
--> 641     grads_and_vars = optimizer_utils.filter_empty_gradients(grads_and_vars)
    642     var_list = [v for (_, v) in grads_and_vars]
    643

~\anaconda3\envs\tf-cpu\lib\site-packages\tensorflow\python\keras\optimizer_v2\utils.py in filter_empty_gradients(grads_and_vars)
     73
     74   if not filtered:
---> 75     raise ValueError("No gradients provided for any variable: %s." %
     76                      ([v.name for _, v in grads_and_vars],))
     77   if vars_with_empty_grads:

ValueError: No gradients provided for any variable:
['vae_2/encoder_2/conv2d_4/kernel:0',
 'vae_2/encoder_2/conv2d_4/bias:0',
 'vae_2/encoder_2/conv2d_5/kernel:0',
 'vae_2/encoder_2/conv2d_5/bias:0',
 'vae_2/encoder_2/dense_8/kernel:0',
 'vae_2/encoder_2/dense_8/bias:0',
 'vae_2/dense_9/kernel:0',
 'vae_2/dense_9/bias:0',
 'vae_2/dense_10/kernel:0',
 'vae_2/dense_10/bias:0',
 'vae_2/decoder_2/dense_11/kernel:0',
 'vae_2/decoder_2/dense_11/bias:0',
 'vae_2/decoder_2/conv2d_transpose_6/kernel:0',
 'vae_2/decoder_2/conv2d_transpose_6/bias:0',
 'vae_2/decoder_2/conv2d_transpose_7/kernel:0',
 'vae_2/decoder_2/conv2d_transpose_7/bias:0',
 'vae_2/decoder_2/conv2d_transpose_8/kernel:0',
 'vae_2/decoder_2/conv2d_transpose_8/bias:0'].

How can I fix this?
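A likely cause, sketched under assumptions rather than confirmed against the poster's environment: the forward pass `model(X)` runs before the `tf.GradientTape` block, so the tape never records the operations that connect the weights to `total_loss`, and `tape.gradient` returns `None` for every variable. The sketch below reproduces that behaviour and shows the usual remedy, which is to run the forward pass inside the tape. The two-layer `Sequential` model and the random batch are hypothetical stand-ins for the VAE and `X_train`.

```python
import tensorflow as tf

# Hypothetical stand-in for the VAE: any Keras model behaves the same way.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(4),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
X = tf.random.normal((6, 4))  # hypothetical batch


def train_step(model, optimizer, X):
    with tf.GradientTape() as tape:
        # Forward pass inside the tape: the ops linking the weights
        # to the loss are recorded.
        recon = model(X)
        loss = tf.reduce_mean(tf.reduce_sum(tf.square(X - recon), axis=1))
    grads = tape.gradient(loss, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    return loss, grads


# Failure mode: forward pass outside the tape, loss computed inside it.
recon_outside = model(X)
with tf.GradientTape() as tape:
    stale_loss = tf.reduce_mean(
        tf.reduce_sum(tf.square(X - recon_outside), axis=1))
stale_grads = tape.gradient(stale_loss, model.trainable_weights)  # all None

loss, grads = train_step(model, optimizer, X)
```

For the VAE in the question, the analogous change would be to call `recon_output, mu, log_var = model(X)` as the first statement inside the `with tf.GradientTape() as tape:` block, so that both the reconstruction and KL terms are computed from tensors the tape has seen.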

submitted by /u/grid_world

Categories
Misc

Finding memory required for neural network to load on embedded device?

I’d like to know how to determine how much memory my saved neural network model requires. I want to test on an embedded device, so I’d first like to measure how much memory my current model takes, then how much my downsampled model requires, and compare that against the reduction in performance. I also have an SVM performing the same classification task, so I’m trying to figure out which is best for embedded devices.
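As a rough starting point, the memory taken by the weights alone can be estimated as the number of parameters times the bytes per parameter. The sketch below is a back-of-the-envelope calculation, not a measurement: the layer shapes are a hypothetical MLP, and the estimate ignores activations, framework overhead and the runtime itself, all of which matter on an embedded target.

```python
def estimate_weight_bytes(layer_shapes, bytes_per_param=4):
    """Return (parameter count, bytes) for a list of weight tensor shapes.

    bytes_per_param=4 corresponds to float32; 1 corresponds to int8.
    """
    total_params = 0
    for shape in layer_shapes:
        count = 1
        for dim in shape:
            count *= dim
        total_params += count
    return total_params, total_params * bytes_per_param


# Hypothetical MLP: 784 -> 128 -> 10, with kernels and biases.
shapes = [(784, 128), (128,), (128, 10), (10,)]

params, fp32_bytes = estimate_weight_bytes(shapes)
_, int8_bytes = estimate_weight_bytes(shapes, bytes_per_param=1)

print(f"{params} parameters")
print(f"float32 weights: {fp32_bytes / 1024:.1f} KiB")
print(f"int8 weights:    {int8_bytes / 1024:.1f} KiB")
```

Running the same calculation on the original and the downsampled model gives a first comparison; the on-disk size of the saved model file is another quick proxy.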

submitted by /u/Mother-Beyond9493

Categories
Misc

Best way to map Text and Image while loading the data


I have a csv file which looks somewhat like the photo in the post.

I’m building a model that takes both an image and its corresponding text (df['Content']) as input.

I wanted to know the best way to load this data so that:

  • The images from df['Image_location'] are loaded into a tensor.
  • Each image stays paired, in order, with its corresponding text.

Any ideas on how this can be done?
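One possible approach, sketched with `tf.data`: pair each path with its text before any image is read, so the two travel through the pipeline as a single element and cannot drift apart. `IMAGE_SIZE`, `load_pair` and `build_dataset` are hypothetical names introduced here, and the sketch assumes the paths in the CSV point to readable image files.

```python
import tensorflow as tf

IMAGE_SIZE = (224, 224)  # hypothetical target size


def load_pair(image_path, text):
    """Read one image file and return it together with its text."""
    raw = tf.io.read_file(image_path)
    # expand_animations=False guarantees a 3-D tensor, which resize needs.
    image = tf.io.decode_image(raw, channels=3, expand_animations=False)
    image = tf.image.resize(image, IMAGE_SIZE) / 255.0
    return image, text


def build_dataset(image_paths, texts, batch_size=32):
    """Build a dataset of (image, text) batches from parallel sequences."""
    ds = tf.data.Dataset.from_tensor_slices((list(image_paths), list(texts)))
    # map() keeps element order by default, so image i stays with text i.
    ds = ds.map(load_pair, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size)
```

With the DataFrame from the post this would be called as `build_dataset(df['Image_location'], df['Content'])`. Shuffling, if needed, can be applied to the paired dataset without breaking the correspondence.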

https://preview.redd.it/q4km9ape2rg81.png?width=700&format=png&auto=webp&s=66be88e9b8d5ebbd7d9357cf64bbe1ea7866b098

submitted by /u/xanthan_011

Categories
Offsites

Robot See, Robot Do

People learn to do things by watching others, from mimicking new dance moves to watching YouTube cooking videos. We’d like robots to do the same, i.e., to learn new skills by watching people do things during training. Today, however, the predominant paradigm for teaching robots is to remote control them using specialized hardware for teleoperation and then train them to imitate pre-recorded demonstrations. This limits both who can provide the demonstrations (programmers and roboticists) and where they can be provided (lab settings). If robots could instead self-learn new tasks by watching humans, this capability could allow them to be deployed in more unstructured settings like the home, and make it dramatically easier for anyone to teach or communicate with them, expert or otherwise. Perhaps one day, they might even be able to use YouTube videos to grow their collection of skills over time.

Our motivation is to have robots watch people do tasks, naturally with their hands, and then use that data as demonstrations for learning. Video by Teh Aik Hui and Nathaniel Lim. License: CC-BY

However, an obvious but often overlooked problem is that a robot is physically different from a human, which means it often completes tasks differently than we do. For example, in the pen manipulation task below, the hand can grab all the pens together and quickly transfer them between containers, whereas the two-fingered gripper must transport one at a time. Prior research assumes that humans and robots can do the same task similarly, which makes manually specifying one-to-one correspondences between human and robot actions easy. But with stark differences in physique, defining such correspondences for seemingly easy tasks can be surprisingly difficult and sometimes impossible.

Physically different end-effectors (i.e., “grippers”, the part that interacts with the environment) induce different control strategies when solving the same task. Left: The hand grabs all pens and quickly transfers them between containers. Right: The two-fingered gripper transports one pen at a time.

In “XIRL: Cross-Embodiment Inverse RL”, presented as an oral paper at CoRL 2021, we explore these challenges further and introduce a self-supervised method for Cross-embodiment Inverse Reinforcement Learning (XIRL). Rather than focusing on how individual human actions should correspond to robot actions, XIRL learns the high-level task objective from videos, and summarizes that knowledge in the form of a reward function that is invariant to embodiment differences, such as shape, actions and end-effector dynamics. The learned rewards can then be used together with reinforcement learning to teach the task to agents with new physical embodiments through trial and error. Our approach is general and scales autonomously with data — the more embodiment diversity presented in the videos, the more invariant and robust the reward functions become. Experiments show that our learned reward functions lead to significantly more sample efficient (roughly 2 to 4 times) reinforcement learning on new embodiments compared to alternative methods. To extend and build on our work, we are releasing an accompanying open-source implementation of our method along with X-MAGICAL, our new simulated benchmark for cross-embodiment imitation.

Cross-Embodiment Inverse Reinforcement Learning (XIRL)
The underlying observation in this work is that in spite of the many differences induced by different embodiments, there still exist visual cues that reflect progression towards a common task objective. For example, in the pen manipulation task above, the presence of pens in the cup but not the mug, or the absence of pens on the table, are key frames that are common to different embodiments and indirectly provide cues for how close to being complete a task is. The key idea behind XIRL is to automatically discover these key moments in videos of different length and cluster them meaningfully to encode task progression. This motivation shares many similarities with unsupervised video alignment research, from which we can leverage a method called Temporal Cycle Consistency (TCC), which aligns videos accurately while learning useful visual representations for fine-grained video understanding without requiring any ground-truth correspondences.

We leverage TCC to train an encoder to temporally align video demonstrations of different experts performing the same task. The TCC loss tries to maximize the number of cycle-consistent frames (or mutual nearest-neighbors) between pairs of sequences using a differentiable formulation of soft nearest-neighbors. Once the encoder is trained, we define our reward function as simply the negative Euclidean distance between the current observation and the goal observation in the learned embedding space. We can subsequently insert the reward into a standard MDP and use an RL algorithm to learn the demonstrated behavior. Surprisingly, we find that this simple reward formulation is effective for cross-embodiment imitation.
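To make the two ingredients in this paragraph concrete, here is a minimal pure-Python sketch, not the released implementation. `soft_nearest_neighbor` illustrates a differentiable nearest-neighbor lookup as a softmax-weighted average over candidate embeddings (using negative squared distances as logits, which is an assumption about the exact form), and `embedding_distance_reward` computes the negative Euclidean distance to a goal embedding. All function names, the `temperature` parameter and the toy vectors are hypothetical.

```python
import math


def soft_nearest_neighbor(query, candidates, temperature=1.0):
    """Softmax-weighted average of candidates, weighted by closeness to query."""
    logits = [
        -sum((q - c) ** 2 for q, c in zip(query, cand)) / temperature
        for cand in candidates
    ]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [
        sum(w * cand[i] for w, cand in zip(weights, candidates))
        for i in range(len(query))
    ]


def embedding_distance_reward(current_embedding, goal_embedding):
    """Negative Euclidean distance between two embeddings."""
    return -math.dist(current_embedding, goal_embedding)


# Toy 2-D embeddings (hypothetical values).
goal = [0.0, 0.0]
far_reward = embedding_distance_reward([3.0, 4.0], goal)
near_reward = embedding_distance_reward([0.3, 0.4], goal)
```

In this toy setup the reward rises toward zero as the current embedding approaches the goal embedding, which is what lets a standard RL algorithm use it as a progress signal.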

XIRL self-supervises reward functions from expert demonstrations using temporal cycle consistency (TCC), then uses them for downstream reinforcement learning to learn new skills from third-person demonstrations.

X-MAGICAL Benchmark
To evaluate the performance of XIRL and baseline alternatives (e.g., TCN, LIFS, Goal Classifier) in a consistent environment, we created X-MAGICAL, which is a simulated benchmark for cross-embodiment imitation. X-MAGICAL features a diverse set of agent embodiments, with differences in their shapes and end-effectors, designed to solve tasks in different ways. This leads to differences in execution speeds and state-action trajectories, which poses challenges for current imitation learning techniques, e.g., ones that use time as a heuristic for weak correspondences between two trajectories. The ability to generalize across embodiments is precisely what X-MAGICAL evaluates.

The SweepToTop task we considered for our experiments is a simplified 2D equivalent of a common household robotic sweeping task, where an agent has to push three objects into a goal zone in the environment. We chose this task specifically because its long-horizon nature highlights how different agent embodiments can generate entirely different trajectories (shown below). X-MAGICAL features a Gym API and is designed to be easily extendable to new tasks and embodiments. You can try it out today with pip install x-magical.

Different agent shapes in the SweepToTop task in the X-MAGICAL benchmark need to use different strategies to reposition objects into the target area (pink), i.e., to “clear the debris”. For example, the long-stick can clear them all in one fell swoop, whereas the short-stick needs to do multiple consecutive back-and-forths.
Left: Heatmap of state visitation for each embodiment across all expert demonstrations. Right: Examples of expert trajectories for each embodiment.

Highlights
In our first set of experiments, we checked whether our learned embodiment-invariant reward function can enable successful reinforcement learning, when the expert demonstrations are provided through the agent itself. We find that XIRL significantly outperforms alternative methods especially on the tougher agents (e.g., short-stick and gripper).

Same-embodiment setting: Comparison of XIRL with baseline reward functions, using SAC for RL policy learning. XIRL is roughly 2 to 4 times more sample efficient than some of the baselines on the harder agents (short-stick and gripper).

We also find that our approach shows great potential for learning reward functions that generalize to novel embodiments. For instance, when reward learning is performed on embodiments that are different from the ones on which the policy is trained, we find that it results in significantly more sample efficient agents compared to the same baselines. Below, in the gripper subplot (bottom right) for example, the reward is first learned on demonstration videos from long-stick, medium-stick and short-stick, after which the reward function is used to train the gripper agent.

Cross-embodiment setting: XIRL performs favorably when compared with other baseline reward functions, trained on observation-only demonstrations from different embodiments. Each agent (long-stick, medium-stick, short-stick, gripper) had its reward trained using demonstrations from the other three embodiments.

We also find that we can train on real-world human demonstrations, and use the learned reward to train a Sawyer arm in simulation to push a puck to a designated target zone. In these experiments as well, our method outperforms baseline alternatives. For example, our XIRL variant trained only on the real-world demonstrations (purple in the plots below) reaches 80% of the total performance roughly 85% faster than the RLV baseline (orange).

What Do The Learned Reward Functions Look Like?
To further explore the qualitative nature of our learned rewards in more challenging real-world scenarios, we collect a dataset of the pen transfer task using various household tools.

Below, we show rewards extracted from a successful (top) and unsuccessful (bottom) demonstration. Both demonstrations follow a similar trajectory at the start of the task execution. The successful one nets a high reward for placing the pens consecutively into the mug then into the glass cup, while the unsuccessful one obtains a low reward because it drops the pens outside the glass cup towards the end of the execution (orange circle). These results are promising because they show that our learned encoder can represent fine-grained visual differences relevant to a task.

Conclusion
We highlighted XIRL, our approach to tackling the cross-embodiment imitation problem. XIRL learns an embodiment-invariant reward function that encodes task progress using a temporal cycle-consistency objective. Policies learned using our reward functions are significantly more sample-efficient than baseline alternatives. Furthermore, the reward functions do not require manually paired video frames between the demonstrator and the learner, giving them the ability to scale to an arbitrary number of embodiments or experts with varying skill levels. Overall, we are excited about this direction of work, and hope that our benchmark promotes further research in this area. For more details, please check out our paper and download the code from our GitHub repository.

Acknowledgments
Kevin and Andy summarized research performed together with Pete Florence, Jonathan Tompson, Jeannette Bohg (faculty at Stanford University) and Debidatta Dwibedi. All authors would additionally like to thank Alex Nichol, Nick Hynes, Sean Kirmani, Brent Yi, Jimmy Wu, Karl Schmeckpeper and Minttu Alakuijala for fruitful technical discussions, and Sam Toyer for invaluable help with setting up the simulated benchmark.

Categories
Misc

layers.Conv2D( ) -> Running this function kills the Python Kernel

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.datasets import cifar10

tf.__version__
# outputs -> '2.8.0'

# Normalizing the data
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# defining Model using Functional api
def my_model():
    inputs = keras.Input(shape=(32,32,3))
    x = layers.Conv2D(32,3)(inputs)
    x = layers.BatchNormalization()(x)
    x = keras.activations.relu(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64,5,padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = keras.activations.relu(x)
    x = layers.Conv2D(128,3)(x)
    x = layers.BatchNormalization()(x)
    x = keras.activations.relu(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation='relu')(x)
    outputs = layers.Dense(10)(x)
    model = keras.Model(inputs= inputs, outputs=outputs)
    return model

# Building the model
model = my_model()

# Compiling the model
model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits = True),
    optimizer = keras.optimizers.Adam(learning_rate=3e-4),
    metrics=['accuracy'],
)

# running the model
model.fit(x_train, y_train, batch_size= 64, epochs=10, verbose =2)

# testing the model
model.evaluate(x_test, y_test, batch_size = 1, verbose =2)

I tried the code above in both Jupyter Notebook and VS Code.

In both cases it kills the Python kernel. Below is a screenshot of the error message from VS Code.

Error message from VS Code

When I run a simple MLP or a deep MLP on the MNIST digit dataset, it works fine, even with more than 10 million parameters. So I’m guessing it’s not the VRAM, because model.summary() reports only ~400K parameters for the CNN above.

This problem occurs only when I use the Conv2D function.

Using TensorFlow 2.8.0

CPU: i7 – 7920hq

GPU: Quadro P4000 8 GB

RAM: 64GB.

OS: Win 11 pro

Using the latest NVIDIA driver, CUDA Toolkit, and cuDNN.

Installed TensorFlow using the tutorial below.

https://github.com/jeffheaton/t81_558_deep_learning/blob/master/install/manual_setup2.ipynb

(found from a youtube video) https://www.youtube.com/watch?v=OEFKlRSd8Ic

I would appreciate it if someone could help me fix this issue.

submitted by /u/Hello_World_2K22

Categories
Offsites

Unlocking the Full Potential of Datacenter ML Accelerators with Platform-Aware Neural Architecture Search

Continuing advances in the design and implementation of datacenter (DC) accelerators for machine learning (ML), such as TPUs and GPUs, have been critical for powering modern ML models and applications at scale. These improved accelerators exhibit peak performance (e.g., FLOPs) that is orders of magnitude better than traditional computing systems. However, there is a fast-widening gap between the available peak performance offered by state-of-the-art hardware and the actual achieved performance when ML models run on that hardware.

One approach to address this gap is to design hardware-specific ML models that optimize both performance (e.g., throughput and latency) and model quality. Recent applications of neural architecture search (NAS), an emerging paradigm to automate the design of ML model architectures, have employed a platform-aware multi-objective approach that includes a hardware performance objective. While this approach has yielded improved model performance in practice, the details of the underlying hardware architecture are opaque to the model. As a result, there is untapped potential to build full capability hardware-friendly ML model architectures, with hardware-specific optimizations, for powerful DC ML accelerators.

In “Searching for Fast Model Families on Datacenter Accelerators”, published at CVPR 2021, we advanced the state of the art of hardware-aware NAS by automatically adapting model architectures to the hardware on which they will be executed. The approach we propose finds optimized families of models for which additional hardware performance gains cannot be achieved without loss in model quality (called Pareto optimization). To accomplish this, we infuse a deep understanding of hardware architecture into the design of the NAS search space for discovery of both single models and model families. We provide quantitative analysis of the performance gap between hardware and traditional model architectures and demonstrate the advantages of using true hardware performance (i.e., throughput and latency), instead of the performance proxy (FLOPs), as the performance optimization objective. Leveraging this advanced hardware-aware NAS and building upon the EfficientNet architecture, we developed a family of models, called EfficientNet-X, that demonstrate the effectiveness of this approach for Pareto-optimized ML models on TPUs and GPUs.
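The notion of Pareto optimality used here can be illustrated with a small sketch: a model is on the Pareto front if no other model is at least as fast and at least as accurate while being strictly better on one of the two. The candidate names, latencies and accuracies below are made up for illustration and are not results from the paper.

```python
def pareto_front(models):
    """Return names of models not dominated on (latency, accuracy).

    models: list of (name, latency_ms, accuracy) tuples.
    Lower latency is better; higher accuracy is better.
    """
    front = []
    for name, latency, accuracy in models:
        dominated = any(
            other_lat <= latency
            and other_acc >= accuracy
            and (other_lat < latency or other_acc > accuracy)
            for _, other_lat, other_acc in models
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical candidates: (name, latency in ms, top-1 accuracy).
candidates = [
    ("A", 10.0, 0.76),
    ("B", 12.0, 0.75),  # slower and less accurate than A
    ("C", 15.0, 0.80),
    ("D", 15.0, 0.78),  # same latency as C, less accurate
]

front = pareto_front(candidates)
print(front)
```

Here models B and D are dominated, so only A and C remain: moving from A to C buys accuracy only at the cost of latency, which is the trade-off a Pareto-optimized model family exposes.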

Platform-Aware NAS for DC ML Accelerators
To achieve high performance, ML models need to adapt to modern ML accelerators. Platform-aware NAS integrates knowledge of the hardware accelerator properties into all three pillars of NAS: (i) the search objectives; (ii) the search space; and (iii) the search algorithm (shown below). We focus on the new search space because it contains the building blocks needed to compose the models and is the key link between the ML model architectures and accelerator hardware architectures.

We construct TPU/GPU specialized search spaces with TPU/GPU-friendly operations to infuse hardware awareness into NAS. For example, a key adaptation is maximizing parallelism to ensure different hardware components inside the accelerators work together as efficiently as possible. This includes the matrix multiplication units (MXUs) in TPUs and the TensorCore in GPUs for matrix/tensor computation, as well as the vector processing units (VPUs) in TPUs and CUDA cores in GPUs for vector processing. Maximizing model arithmetic intensity (i.e., optimizing the parallelism between computation and operations on the high bandwidth memory) is also critical to achieve top performance. To tap into the full potential of the hardware, it is crucial for ML models to achieve high parallelism inside and across these hardware components.
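Why arithmetic intensity matters can be sketched with the classic roofline model, which is a standard simplification brought in here for illustration and not something the post itself uses: attainable throughput is capped either by peak compute or by memory bandwidth times arithmetic intensity, whichever is lower. The accelerator figures below are hypothetical round numbers.

```python
def attainable_flops(peak_flops, mem_bandwidth_bytes_per_s, arithmetic_intensity):
    """Roofline estimate of attainable FLOP/s.

    arithmetic_intensity is FLOPs performed per byte moved from memory.
    """
    return min(peak_flops, mem_bandwidth_bytes_per_s * arithmetic_intensity)


# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s memory bandwidth.
PEAK = 100e12
BANDWIDTH = 1e12

low_intensity = attainable_flops(PEAK, BANDWIDTH, 10)    # memory-bound
high_intensity = attainable_flops(PEAK, BANDWIDTH, 200)  # compute-bound

print(f"10 FLOP/byte:  {low_intensity / 1e12:.0f} TFLOP/s")
print(f"200 FLOP/byte: {high_intensity / 1e12:.0f} TFLOP/s")
```

Under these assumed numbers, an operation at 10 FLOPs per byte can reach only a tenth of peak no matter how well it is parallelized, while one at 200 FLOPs per byte is limited by compute instead.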

Overview of platform-aware NAS on TPUs/GPUs, highlighting the search space and search objectives.

Advanced platform-aware NAS has an optimized search space containing a set of complementary techniques to holistically improve parallelism for ML model execution on TPUs and GPUs:

  1. It uses specialized tensor reshaping techniques to maximize the parallelism in the MXUs / TensorCores.
  2. It dynamically selects different activation functions depending on matrix operation types to ensure overlapping of vector and matrix/tensor processing.
  3. It employs hybrid convolutions and a novel fusion strategy to strike a balance between total compute and arithmetic intensity to ensure that computation and memory access happens in parallel and to reduce the contention on VPUs / CUDA cores.
  4. With latency-aware compound scaling (LACS), which uses hardware performance instead of FLOPs as the performance objective to search for model depth, width and resolutions, we ensure parallelism at all levels for the entire model family on the Pareto-front.

EfficientNet-X: Platform-Aware NAS-Optimized Computer Vision Models for TPUs and GPUs
Using this approach to platform-aware NAS, we have designed EfficientNet-X, an optimized computer vision model family for TPUs and GPUs. This family builds upon the EfficientNet architecture, which itself was originally designed by traditional multi-objective NAS without true hardware-awareness as the baseline. The resulting EfficientNet-X model family achieves an average speedup of ~1.5x–2x over EfficientNet on TPUv3 and GPUv100, respectively, with comparable accuracy.

In addition to the improved speeds, EfficientNet-X has shed light on the non-proportionality between FLOPs and true performance. Many think FLOPs are a good ML performance proxy (i.e., FLOPs and performance are proportional), but they are not. While FLOPs are a good performance proxy for simple hardware such as scalar machines, they can exhibit a margin of error of up to 400% on advanced matrix/tensor machines. For example, because of its hardware-friendly model architecture, EfficientNet-X requires ~2x more FLOPs than EfficientNet, but is ~2x faster on TPUs and GPUs.

EfficientNet-X family achieves 1.5x–2x speedup on average over the state-of-the-art EfficientNet family, with comparable accuracy on TPUv3 and GPUv100.

Self-Driving ML Model Performance on New Accelerator Hardware Platforms
Platform-aware NAS exposes the inner workings of the hardware and leverages these properties when designing hardware-optimized ML models. In a sense, the “platform-awareness” of the model is a “gene” that preserves knowledge of how to optimize performance for a hardware family, even on new generations, without the need to redesign the models. For example, TPUv4i delivers up to 3x higher peak performance (FLOPS) than its predecessor TPUv2, but EfficientNet performance only improves by 30% when migrating from TPUv2 to TPUv4i. In comparison, EfficientNet-X retains its platform-aware properties even on new hardware and achieves a 2.6x speedup when migrating from TPUv2 to TPUv4i, utilizing almost all of the 3x peak performance gain expected when upgrading between the two generations.
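The comparison above reduces to a small piece of arithmetic: divide each observed speedup by the peak hardware gain. The sketch below assumes that "improves by 30%" corresponds to a 1.3x speedup; the function name is hypothetical.

```python
def fraction_of_peak_gain(observed_speedup, peak_gain):
    """Share of the hardware's peak improvement that a model realizes."""
    return observed_speedup / peak_gain


PEAK_GAIN = 3.0  # TPUv2 -> TPUv4i peak performance ratio from the text

efficientnet = fraction_of_peak_gain(1.3, PEAK_GAIN)    # assumes 30% == 1.3x
efficientnet_x = fraction_of_peak_gain(2.6, PEAK_GAIN)

print(f"EfficientNet:   {efficientnet:.0%} of the peak gain")
print(f"EfficientNet-X: {efficientnet_x:.0%} of the peak gain")
```

On these figures EfficientNet-X captures roughly twice as large a share of the generational hardware improvement as EfficientNet.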

Hardware peak performance ratio of TPUv2 to TPUv4i and the geometric mean speedup of EfficientNet-X and EfficientNet families, respectively, when migrating from TPUv2 to TPUv4i.

Conclusion and Future Work
We demonstrate how to improve the capabilities of platform-aware NAS for datacenter ML accelerators, especially TPUs and GPUs. Both platform-aware NAS and the EfficientNet-X model family have been deployed in production and materialize up to ~40% efficiency gains and significant quality improvements for various internal computer vision projects across Google. Additionally, because of its deep understanding of accelerator hardware architecture, platform-aware NAS was able to identify critical performance bottlenecks on TPUv2-v4i architectures and has enabled design enhancements to future TPUs with significant potential performance uplift. As next steps, we are working on expanding platform-aware NAS’s capabilities to the ML hardware and model design beyond computer vision.

Acknowledgements
Special thanks to our co-authors: Mingxing Tan, Ruoming Pang, Andrew Li, Liqun Cheng, Quoc Le. We also thank many collaborators including Jeff Dean, David Patterson, Shengqi Zhu, Yun Ni, Gang Wu, Tao Chen, Xin Li, Yuan Qi, Amit Sabne, Shahab Kamali, and many others from the broad Google research and engineering teams who helped on the research and the subsequent broad production deployment of platform-aware NAS.

Categories
Misc

Burgers, Fries and a Side of AI: Startup Offers Taste of Drive-Thru Convenience

Eating into open hours and menus, a labor shortage has gobbled up fast-food services employees, but some restaurants are trying out a new staff member to bring back the drive-thru good times: AI. Toronto startup HuEx is in pilot tests with a conversational AI assistant for drive-thrus to help support service at several popular Canadian …

The post Burgers, Fries and a Side of AI: Startup Offers Taste of Drive-Thru Convenience appeared first on The Official NVIDIA Blog.

Categories
Misc

Meet the Omnivore: Developer Sleighs Complex Manufacturing Workflows With Digital Twin of Santa’s Workshop

Don’t be fooled by the candy canes, hot cocoa and CEO’s jolly demeanor. Santa’s workshop is the very model of a 21st-century enterprise: pioneering mass customization and perfecting a worldwide distribution system able to meet almost bottomless global demand.

The post Meet the Omnivore: Developer Sleighs Complex Manufacturing Workflows With Digital Twin of Santa’s Workshop appeared first on The Official NVIDIA Blog.

Categories
Misc

Wrong output from tensorflow pose classification while test data is 97% accurate

Hi all, I’m running into a problem with a student project.

I’m trying to use the tensorflow pose estimation library to create a script that recognizes different human gestures (specifically, pointing up, pointing left, and pointing right) using Movenet.

I followed the tutorial [ https://www.tensorflow.org/lite/tutorials/pose_classification ] to train my neural network using 3000+ pictures of gestures sourced from fellow students. The testing section of the tutorial shows that the model has a 97% accuracy on the test data subselection.

The output of this tutorial gives a .tflite file, and links to the [ https://github.com/tensorflow/examples/tree/master/lite/examples/pose_estimation/raspberry_pi ] github as a tutorial on how to use this .tflite to classify new input.

However, the classifications seem to be completely off: not one gesture is recognized. Suspicious of this result, I tried using some of the old training videos as input. These are also classified completely wrongly, which leads me to think there is something wrong with how I am running the code.

Has anyone run into a similar problem using the tensorflow pose classification before? Or does anyone have an idea on what I could be doing wrong? I followed all the steps in the tutorials multiple times and am getting a bit hopeles…

The code I use to run the pose classification from the github:

import pose_estimation
pose_estimation.run(
    'movenet_lightning',  # estimation_model: str,
    'keypoint',  # tracker_type: str, # Apparently not needed when using singlepose
    'gesture_classifier_using_lighting',  # classification_model: str,
    'gesture_labels.txt',  # label_file: str,
    'Kaj7.mp4',  # camera_id: int, # right now set to be an example video used in training
    600,  # width: int,
    600)  # height: int

submitted by /u/newbroo

Categories
Misc

NVIDIA and SoftBank Group Announce Termination of NVIDIA’s Acquisition of Arm Limited

NVIDIA and SoftBank Group Corp. (SBG) today announced the termination of the previously announced transaction whereby NVIDIA would acquire Arm Limited from SBG.