
Making Better Future Predictions by Watching Unlabeled Videos

Machine learning (ML) agents are increasingly deployed in the real world to make decisions and assist people in their daily lives. Making reasonable predictions about the future at varying timescales is one of the most important capabilities for such agents because it enables them to predict changes in the world around them, including other agents’ behaviors, and plan how to act next. Importantly, successful future prediction requires both capturing meaningful transitions in the environment (e.g., dough transforming into bread) and adapting to how transitions unfold over time in order to make decisions.

Previous work in future prediction from visual observations has largely been constrained by the format of its output (e.g., pixels that represent an image) or a manually-defined set of human activities (e.g., predicting if someone will keep walking, sit down, or jump). These are either too detailed and hard to predict or lack important information about the richness of the real world. For example, predicting “person jumping” does not capture why they’re jumping, what they’re jumping onto, etc. Also, with very few exceptions, previous models were designed to make predictions at a fixed offset into the future, which is a limiting assumption because we rarely know when meaningful future states will happen.

For example, in a video about making ice cream (depicted below), the meaningful transition from “cream” to “ice cream” occurs over 35 seconds, so models predicting such transitions would need to look 35 seconds ahead. But this time interval varies a large amount across different activities and videos — meaningful transitions occur at any distance into the future. Learning to make such predictions at flexible intervals is hard because the desired ground truth may be relatively ambiguous. For example, the correct prediction could be the just-churned ice cream in the machine, or scoops of the ice cream in a bowl. In addition, collecting such annotations at scale (i.e., frame-by-frame for millions of videos) is infeasible. However, many existing instructional videos come with speech transcripts, which often offer concise, general descriptions throughout entire videos. This source of data can guide a model’s attention toward important parts of the video, obviating the need for manual labeling and allowing a flexible, data-driven definition of the future.

In “Learning Temporal Dynamics from Cycles in Narrated Video”, published at ICCV 2021, we propose an approach that is self-supervised, using a recent large unlabeled dataset of diverse human action. The resulting model operates at a high level of abstraction, can make predictions arbitrarily far into the future, and chooses how far into the future to predict based on context. Called Multi-Modal Cycle Consistency (MMCC), it leverages narrated instructional video to learn a strong predictive model of the future. We demonstrate how MMCC can be applied, without fine-tuning, to a variety of challenging tasks, and qualitatively examine its predictions. In the example below, MMCC predicts the future (d) from present frame (a), rather than less relevant potential futures (b) or (c).

This work uses cues from vision and language to predict high-level changes (such as cream becoming ice cream) in video (video from HowTo100M).

Viewing Videos as Graphs
The foundation of our method is to represent narrated videos as graphs. We view videos as a collection of nodes, where nodes are either video frames (sampled at 1 frame per second) or segments of narrated text (extracted with automatic speech recognition systems), encoded by neural networks. During training, MMCC constructs a graph from the nodes, using cross-modal edges to connect video frames and text segments that refer to the same state, and temporal edges to connect the present (e.g., strawberry-flavored cream) and the future (e.g., soft-serve ice cream). The temporal edges operate on both modalities equally — they can start from either a video frame, some text, or both, and can connect to a future (or past) state in either modality. MMCC achieves this by learning a latent representation shared by frames and text and then making predictions in this representation space.

Multi-modal Cycle Consistency
To learn the cross-modal and temporal edge functions without supervision, we apply the idea of cycle consistency. Here, cycle consistency refers to the construction of cycle graphs, in which the model constructs a series of edges from an initial node to other nodes and back again: Given a start node (e.g., a sample video frame), the model is expected to find its cross-modal counterpart (i.e., text describing the frame) and combine them as the present state. To do this, at the start of training, the model assumes that frames and text with the same timestamps are counterparts, but then relaxes this assumption later. The model then predicts a future state, and the node most similar to this prediction is selected. Finally, the model attempts to invert the above steps by predicting the present state backward from the future node, and thus connecting the future node back with the start node.

The discrepancy between the model’s prediction of the present from the future and the actual present is the cycle-consistency loss. Intuitively, this training objective requires the predicted future to contain enough information about its past to be invertible, leading to predictions that correspond to meaningful changes to the same entities (e.g., tomato becoming marinara sauce, or flour and eggs in a bowl becoming dough). Moreover, the inclusion of cross-modal edges ensures future predictions are meaningful in either modality.

To learn the temporal and cross-modal edge functions end-to-end, we use the soft attention technique, which first outputs how likely each node is to be the target node of the edge, and then “picks” a node by taking the weighted average among all possible candidates. Importantly, this cyclic graph constraint makes few assumptions for the kind of temporal edges the model should learn, as long as they end up forming a consistent cycle. This enables the emergence of long-term temporal dynamics critical for future prediction without requiring manual labels of meaningful changes.
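As a rough sketch of these two ingredients (soft attention over candidate nodes and the cycle-consistency objective), the following simplified code illustrates the idea. It is not the paper's implementation, it omits the cross-modal step, and all names, shapes, and the temperature value are assumptions.

import tensorflow as tf

def soft_select(query, candidates, temperature=0.1):
    # Soft attention: score every candidate node, then take the weighted
    # average instead of a hard argmax so the whole cycle stays differentiable.
    scores = tf.matmul(candidates, query[:, None])[:, 0] / temperature   # (N,)
    weights = tf.nn.softmax(scores)                                      # (N,)
    return tf.reduce_sum(weights[:, None] * candidates, axis=0)          # (D,)

def cycle_consistency_loss(start_node, all_nodes, forward_fn, backward_fn):
    # Walk forward to a predicted future node, snap it to the most similar
    # candidate, walk backward, and penalize the distance to where we started.
    future_pred = forward_fn(start_node)
    future_node = soft_select(future_pred, all_nodes)
    present_pred = backward_fn(future_node)
    return tf.reduce_sum(tf.square(present_pred - start_node))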

An example of the training objective: A cycle graph is expected to be constructed between the chicken with soy sauce and the chicken in chili oil because they are two adjacent steps in the chicken’s preparation (video from HowTo100M).

Discovering Cycles in Real-World Video
MMCC is trained without any explicit ground truth, using only long video sequences and randomly sampled starting conditions (a frame or text excerpt) and asking the model to find temporal cycles. After training, MMCC can identify meaningful cycles that capture complex changes in video.

Given frames as input (left), MMCC selects relevant text from video narrations and uses both modalities to predict a future frame (middle). It then finds text relevant to this future and uses it to predict the past (right). Using its knowledge of how objects and scenes change over time, MMCC “closes the cycle” and ends up where it started (videos from HowTo100M).
The model can also start from narrated text rather than frames and still find relevant transitions (videos from HowTo100M).

Zero-Shot Applications
For MMCC to identify meaningful transitions over time in an entire video, we define a “likely transition score” for each pair (A, B) of frames in a video, according to the model’s predictions — the closer B is to our model’s prediction of the future of A, the higher the score assigned. We then rank all pairs according to this score and show the highest-scoring pairs of present and future frames detected in previously unseen videos (examples below).
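As an illustration of how such a score could be computed and ranked (a sketch, not the released code; frame_embeddings and predict_future stand in for the trained MMCC encoder outputs and temporal edge function):

import itertools
import numpy as np

def rank_transitions(frame_embeddings, predict_future, top_k=5):
    # frame_embeddings: (T, D) array of per-frame embeddings.
    # predict_future: maps a (D,) embedding to the predicted future embedding.
    scores = []
    for a, b in itertools.permutations(range(len(frame_embeddings)), 2):
        pred = predict_future(frame_embeddings[a])
        # Higher score when frame b is closer to the predicted future of frame a.
        scores.append((-np.linalg.norm(frame_embeddings[b] - pred), a, b))
    scores.sort(reverse=True)
    return scores[:top_k]   # top-k (score, present, future) frame pairs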

The highest-scoring pairs from eight random videos, which showcase the versatility of the model across a wide range of tasks (videos from HowTo100M).

We can use this same approach to temporally sort an unordered collection of video frames without any fine-tuning by finding an ordering that maximizes the overall confidence scores between all adjacent frames in the sorted sequence.
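A brute-force version of this ordering objective, feasible only for the small frame collections shown here, might look like the following sketch (score is assumed to be the pairwise transition score described above):

from itertools import permutations

def unshuffle(frame_indices, score):
    # score(a, b): confidence that frame b directly follows frame a.
    best_order, best_total = None, float("-inf")
    for order in permutations(frame_indices):
        total = sum(score(a, b) for a, b in zip(order, order[1:]))
        if total > best_total:
            best_order, best_total = order, total
    return best_order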

Left: Shuffled frames from three videos. Right: MMCC unshuffles the frames. The true order is shown under each frame. Even when MMCC does not predict the ground truth, its predictions often appear reasonable, and so, it can present an alternate ordering (videos from HowTo100M).

Evaluating Future Prediction
We evaluate the model’s ability to anticipate action, potentially minutes in advance, using the top-k recall metric, which here measures a model’s ability to retrieve the correct future (higher is better). On CrossTask, a dataset of instruction videos with labels describing key steps, MMCC outperforms the previous self-supervised state-of-the-art models in inferring possible future actions.

Top-k recall on CrossTask (higher is better):

Model          Top-1    Top-5    Top-10
Cross-modal     2.9     14.2     24.3
Repr. Ant.      3.0     13.3     26.0
MemDPC          2.9     15.8     27.4
TAP             4.5     17.1     27.9
MMCC            5.4     19.9     33.8

Conclusions
We have introduced a self-supervised method to learn temporal dynamics by cycling through narrated instructional videos. Despite the simplicity of the model’s architecture, it can discover meaningful long-term transitions in vision and language, and can be applied without further training to challenging downstream tasks, such as anticipating far-away action and ordering collections of images. An interesting future direction is transferring the model to agents so they can use it to conduct long-term planning.

Acknowledgements
The core team includes Dave Epstein, Jiajun Wu, Cordelia Schmid, and Chen Sun. We thank Alexei Efros, Mia Chiquier, and Shiry Ginosar for their feedback, and Allan Jabri for inspiration in figure design. Dave would like to thank Dídac Surís and Carl Vondrick for insightful early discussions on cycling through time in video.


An Open Source Vibrotactile Haptics Platform for On-Body Applications

Most wearable smart devices and mobile phones have the means to communicate with the user through tactile feedback, enabling applications from simple notifications to sensory substitution for accessibility. Typically, they accomplish this using vibrotactile actuators, which are small electric vibration motors. However, designing a haptic system that is well-targeted and effective for a given task requires experimentation with the number of actuators and their locations in the device, yet most practical applications require standalone on-body devices and integration into small form factors. This combination of factors can be difficult to address outside of a laboratory as integrating these systems can be quite time-consuming and often requires a high level of expertise.

A typical lab setup on the left and the VHP board on the right.

In “VHP: Vibrotactile Haptics Platform for On-body Applications”, presented at ACM UIST 2021, we develop a low-power miniature electronics board that can drive up to 12 independent channels of haptic signals with arbitrary waveforms. The VHP electronics board can be battery-powered and integrated into wearable devices and small gadgets. It allows all-day wear, has low latency and a battery life between 3 and 25 hours, and can run 12 actuators simultaneously. We show that VHP can be used in bracelet, sleeve, and phone-case form factors. The bracelet was programmed with an audio-to-tactile interface to aid lipreading and remained functional when worn for multiple months by developers. To facilitate greater progress in the field of wearable multi-channel haptics with the necessary tools for their design, implementation, and experimentation, we are releasing the hardware design and software for the VHP system via GitHub.

Front and back sides of the VHP circuit board.
Block diagram of the system.

Platform Specifications
VHP consists of a custom-designed circuit board, where the main components are the microcontroller and haptic amplifier, which converts the microcontroller’s digital output into signals that drive the actuators. The haptic actuators can be controlled by signals arriving via serial, USB, and Bluetooth Low Energy (BLE), as well as onboard microphones, using an nRF52840 microcontroller, which was chosen because it offers many input and output options and BLE, all in a small package. We added several sensors to the board to provide more experimental flexibility: an on-board digital microphone, an analog microphone amplifier, and an accelerometer. The firmware is a portable C/C++ library that works in the Arduino ecosystem.

To allow for rapid iteration during development, the interface between the board and actuators is critical. The wiring for the 12 tactile signals has to be quick to set up, while remaining flexible and robust enough to stand up to prolonged use. For the interface, we use a 24-pin FPC (flexible printed circuit) connector on the VHP. We support interfacing to the actuators in two ways: with a custom flexible circuit board and with a rigid breakout board.

VHP board (small board on the right) connected to three different types of tactile actuators via rigid breakout board (large board on the left).

Using Haptic Actuators as Sensors
In our previous blog post, we explored how back-EMF in a haptic actuator could be used for sensing and demonstrated a variety of useful applications. Instead of using back-EMF sensing in the VHP system, we measure the electrical current that drives each vibrotactile actuator and use the current load as the sensing mechanism. Unlike back-EMF sensing, this current-sensing approach allows simultaneous sensing and actuation, while minimizing the additional space needed on the board.

One challenge with the current-sensing approach is that there is a wide variety of vibrotactile actuators, each of which may behave differently and need different presets. In addition, because different actuators can be added and removed during prototyping with the adapter board, it would be useful if the VHP were able to identify the actuator automatically. This would improve the speed of prototyping and make the system more novice-friendly.

To explore this possibility, we collected current-load data from three off-the-shelf haptic actuators and trained a simple support vector machine classifier to recognize the difference in the signal pattern between actuators. The test accuracy was 100% for classifying the three actuators, indicating that each actuator has a very distinct response.
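That classification step can be sketched with scikit-learn as follows; the file names and feature layout are assumptions for illustration rather than the exact pipeline used in the paper.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# One flattened current-load trace (recorded during a frequency sweep) per row,
# with the actuator type as the label.
X = np.load("current_sweeps.npy")      # shape: (n_traces, n_samples_per_sweep)
y = np.load("actuator_labels.npy")     # shape: (n_traces,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))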

Different actuators have a different current signature during a frequency sweep, thus allowing for automatic identification.

Additionally, vibrotactile actuators require proper contact with the skin for consistent control over stimulation. Thus, the device should measure skin contact and either provide an alert or self-adjust if it is not loaded correctly. To test whether a skin contact measuring technique works in practice, we measured the current load on actuators in a bracelet as it was tightened and loosened around the wrist. As the bracelet strap is tightened, the contact pressure between the skin and the actuator increases and the current required to drive the actuator signal increases commensurately.
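In pseudocode terms, the contact check amounts to comparing the measured drive current against a calibrated baseline; the threshold value and the current-reading function below are assumptions, not part of the released firmware.

CONTACT_THRESHOLD_MA = 25.0   # hypothetical value, calibrated with the strap properly tightened

def check_contact(read_actuator_current_ma, channel):
    # read_actuator_current_ma stands in for whatever interface exposes the
    # per-channel current measurement while the actuator is being driven.
    if read_actuator_current_ma(channel) < CONTACT_THRESHOLD_MA:
        return "loose"   # alert the user or self-adjust the drive amplitude
    return "ok"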

Current-load sensing responds to touch while the actuator is driven at 250 Hz.

Measuring the quality of the bracelet’s fit.

Audio-to-Tactile Feedback
To demonstrate the utility of the VHP platform, we used it to develop an audio-to-tactile feedback device to help with lipreading. Lipreading can be difficult for many speech sounds that look similar (visemes), such as “pin” and “min”. In order to help the user differentiate visemes like these, we attach a microphone to the VHP system, which can then pick up the speech sounds and translate the audio to vibrations on the wrist. For audio-to-tactile translation, we used our previously developed algorithms for real-time audio-to-tactile conversion, available via GitHub. Briefly, audio filters are paired with neural networks to recognize certain visemes (e.g., picking up the hard consonant “p” in “pin”), which are then translated to vibrations in different parts of the bracelet. Our approach is inspired by the tactile phonemic sleeve (TAPS); however, the major difference is that in our approach the tactile signal is presented continuously and in real time.

One of the developers who employs lipreading in daily life wore the bracelet daily for several months and found that it provided better information to facilitate lipreading than previous devices, improving understanding of visemes compared to lipreading alone. In the future, we plan to conduct full-scale experiments with multiple users wearing the device for an extended time.

Left: Audio-to-tactile sleeve. Middle: Audio-to-tactile bracelet. Right: One of our developers tests out the bracelets, which are worn on both arms.

Potential Applications
The VHP platform enables rapid experimentation and prototyping that can be used to develop techniques for a variety of applications. For example:

  • Rich haptics on small devices: Expanding the number of actuators on mobile phones, which typically only have one or two, could be useful to provide additional tactile information. This is especially useful as fingers are sensitive to vibrations. We demonstrated a prototype mobile phone case with eight vibrotactile actuators. This could be used to provide rich notifications and enhance effects in a mobile game or when watching a video.
  • Lab psychophysical experiments: Because VHP can be easily set up to send and receive haptic signals in real time, e.g., from a Jupyter notebook, it could be used to perform real-time haptic experiments.
  • Notifications and alerts: The wearable VHP could be used to provide haptic notifications from other devices, e.g., alerting if someone is at the door, and could even communicate distinguishable alerts through use of multiple actuators.
  • Sensory substitution: Besides the lipreading assistance example above, there are many other potential applications for accessibility using sensory substitution, such as visual-to-tactile sensing or even sensing magnetic fields.
  • Load sensing: The ability to sense the haptic actuator’s current load is unique to our platform and enables a variety of features, such as pressure sensing or automatically adjusting actuator output.
Integrating eight voice coils into a phone case. We used load sensing to understand which voice coils are being touched.

What’s next?
We hope that others can utilize the platform to build a diverse set of applications. If you are interested and have ideas about using our platform or want to receive updates, please fill out this form. We hope that with this platform, we can help democratize the use of haptics and inspire a more widespread use of tactile devices.

Acknowledgments
This work was done by Artem Dementyev, Pascal Getreuer, Dimitri Kanevsky, Malcolm Slaney and Richard Lyon. We thank Alex Olwal, Thad Starner, Hong Tan, Charlotte Reed, and Sarah Sterman for valuable feedback and discussion on the paper, and Yuhui Zhao, Dmitrii Votintcev, Chet Gnegy, Whitney Bai, and Sagar Savla for feedback on the design and engineering.


How to get a complete model history per testing sample in each epoch instead of just one average value per epoch?

So let’s say I have 1 metric I want to monitor across the model fitting process. So far, I’ve adapted the default Callback function, which allows me to get one single value for that given metric in each epoch. With this, I can save a time-series plot (just as with CSV logger) to check the performance evolution, but it doesn’t provide robustness results. Therefore, I would like to get the metric value for every sample in each epoch and save it, allowing me to get a time-series plot with a 95% confidence interval.

Thank you

submitted by /u/throwaway469271
[visit reddit] [comments]


Multi-class classification

Hey, I am doing skin cancer classification with 9 classes. I am having a problem with overfitting, as my train accuracy can reach up to 90% while test accuracy is 50% at best.

https://www.kaggle.com/pattnaiksatyajit/skin-cancer

I have tried dropout, L2 and L1 regularization, data augmentation, lowering my network size, and class weights; all it did was make training epochs longer with the same results, as test accuracy never improved.

I am using a CNN. Is there another architecture I can use? Can I turn this problem into “one vs. all” using TensorFlow?

submitted by /u/Ali99695
[visit reddit] [comments]


Why does my model perform really badly even though val_accuracy is 0.9800?

Hello, I’m trying to make an image classifier that classifies a given tomato plant leaf as one of ['Tomato___Early_blight', 'Tomato___Septoria_leaf_spot', 'Tomato___healthy']. I took the dataset from here. It is already augmented; from it, I took only the tomato plant leaf images and further reduced them to the three classes mentioned before.

Here is my code

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.preprocessing.image import load_img, img_to_array
import numpy as np
import matplotlib.pyplot as plt

train_gen = ImageDataGenerator(rescale=1./255)
test_gen = ImageDataGenerator(rescale=1./255)

train_data = train_gen.flow_from_directory(directory='/Users/saibalaji/Documents/TensorFlowProjects/TomatoDataSet/train', target_size=(256, 256))
valiation_data = test_gen.flow_from_directory(directory='/Users/saibalaji/Documents/TensorFlowProjects/TomatoDataSet/valid', target_size=(256, 256))

model = tf.keras.models.Sequential([
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(128, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])

model.compile(optimizer=tf.optimizers.RMSprop(learning_rate=0.001),
              loss=tf.losses.categorical_crossentropy,
              metrics=['accuracy'])

model.fit(train_data, epochs=12, validation_data=valiation_data)

This is my prediction code

# load the image
my_image = load_img('/Users/saibalaji/Documents/TensorFlowProjects/TomatoDataSet/train/Tomato___Septoria_leaf_spot/ffd3c6f3-17d3-45f1-a599-2623e111ec71___Matt.S_CG 6493.JPG', target_size=(256, 256))
plt.imshow(my_image)

# preprocess the image
my_image = img_to_array(my_image)
expand_image = np.expand_dims(my_image, axis=0)
print(expand_image.shape)

# make the prediction
prediction = model.predict(expand_image)
plt.xlabel(class_labels[np.argmax(prediction)])

https://preview.redd.it/364hamfitnp71.png?width=770&format=png&auto=webp&s=f27ae6c0fb5a703141a526d5ce755f5f74f13afd

As you can see, my model’s validation accuracy is good, but its classifications are really bad, even for images from the training dataset. How can I solve this problem?

https://preview.redd.it/ebgy8k70tnp71.png?width=1732&format=png&auto=webp&s=7a45c1b3b34692f834c1658a78f992d76e78e123

submitted by /u/kudoshinichi-8211
[visit reddit] [comments]


I used tensorflow for implementing an AI that plays Subway Surfers (convolutional neural network for image classification)

submitted by /u/nikp06
[visit reddit] [comments]

Should tensorflow have a dedicated server in a business environment?

Apologies if this information is readily available and I couldn’t find it. Would it be appropriate to set up a dedicated server strictly for TensorFlow in a business environment? The goal would be classifying document images. The volume, I can only estimate, would be 1 to 4 million documents.

submitted by /u/travisadam0313
[visit reddit] [comments]


Accelerating Embedding with the HugeCTR TensorFlow Embedding Plugin

Recommender systems are the economic engine of the Internet. It is hard to imagine any other type of application with more direct impact on our daily digital lives: trillions of items to be recommended to billions of people. Recommender systems filter products and services among an overwhelming number of options, easing the paradox of choice that most users face.

As the amount of data increases, deep learning (DL) recommender systems are starting to show advantages over traditional machine learning-based approaches, such as gradient boosted trees. To give a concrete data point, the NVIDIA RAPIDS.AI team recently won three recommendation competitions with DL.

This was not consistently happening even just a year before, when NVIDIA data scientists asked, “Why Are Deep Learning Models Not Consistently Winning Recommender Systems Competitions Yet?”

Embeddings play a critical role in modern DL-based recommender architectures, encoding individual information for billions of entities (users, products, and their characteristics). As the amount of data increases, so does the size of the embedding tables, now spanning multiple GBs to TBs. There are unique challenges in training this type of DL system, with its huge embedding tables with sparse access patterns spanning potentially multiple GPUs, if not nodes.

This post focuses on how the NVIDIA Merlin recommendation system framework addresses these challenges and introduces an optimized embedding implementation that is up to 8x more performant than other frameworks’ embedding layers. This optimized implementation is also made available as a TensorFlow plugin that works seamlessly with TensorFlow and acts as a convenient drop-in replacement for the TensorFlow native embedding layers.

Embeddings

Embedding is a machine learning technique that represents each object of interest (users, products, categories, and so on) as a dense numerical vector. Embedding tables are hence nothing other than a specific type of key-value store, with keys being the ID used to uniquely identify objects and values being vectors of real numbers.
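In code, the key-value analogy is literal: a lookup is just indexed retrieval from a dense table. A toy example with made-up sizes:

import numpy as np

vocab_size, emb_dim = 1000, 16
table = np.random.rand(vocab_size, emb_dim).astype(np.float32)  # the "values"
ids = np.array([3, 42, 7])                                      # the "keys"
vectors = table[ids]    # embedding lookup: one 16-dim vector per ID, shape (3, 16)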

Embedding is a key building block in modern DL recommender systems, typically lying immediately after the input layer and before “feature interaction” and dense layers. Embedding layers are learned from data and end-to-end training, just like other layers of a deep neural network. It is the embedding layers that differentiate DL recommender models from other types of DL workloads: they contribute an enormous number of parameters to the model but require little to no computation, while the compute-intensive dense layers have a much smaller number of parameters.

Take a specific example: The original Wide and Deep model has several dense layers of size [1024, 512, 256], hence only a few million parameters, while its embedding layers can have billions of entries, and multiple billions of parameters. This contrasts with, for example, a BERT model architecture popular in the NLP domain, where the embedding layer has only tens of thousands of entries amounting to several millions of parameters, but the dense feed-forward and attention layers consist of several hundreds of millions of parameters. This differentiation also leads to another observation: the amount of compute per byte of input data for DL recommender networks is typically much smaller compared to other types of DL models.
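A quick back-of-the-envelope calculation makes the contrast concrete; the input width, vocabulary size, and embedding dimension below are illustrative assumptions, not figures from the paper:

# Dense parameters for an MLP of sizes [1024, 512, 256] on a 256-dim input
# (weights + biases): roughly 0.9 million.
dense_sizes = [256, 1024, 512, 256]
dense_params = sum(n_in * n_out + n_out
                   for n_in, n_out in zip(dense_sizes, dense_sizes[1:]))
print(f"dense parameters: {dense_params:,}")

# An embedding table with 1 billion IDs and 64-dim vectors: 64 billion parameters.
embedding_params = 1_000_000_000 * 64
print(f"embedding parameters: {embedding_params:,}")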

Why optimizing embeddings matters for recommender workflows

To understand why optimization of the embedding layer and related operations matters, here are the challenges of training embeddings: size and access speed.

Size

With online platforms and services acquiring hundreds of millions to even billions of users, and with the number of unique products on offer reaching billions, it is not surprising that embedding tables are increasing in size.

Instagram reportedly has been working on recommender models reaching 10 TB in size. Likewise, Baidu reported an ad-ranking model which also reached the 10 TB realm. Across the industry, models in the hundreds of GBs to TBs are becoming increasingly popular, such as Pinterest’s 4-TB model and Google’s 1.2-TB model.

Naturally, fitting a TB-scale model on a single compute node, let alone a single compute accelerator, presents a significant challenge. For reference, the largest NVIDIA A100 GPU is currently equipped with 80 GB of HBM.

Access speed

Training recommender systems is inherently a memory bandwidth-intensive task. This is because each training sample or batch usually involves a small number of entities in the embedding tables. These entries must be retrieved to calculate the forward pass, then updated in the backward pass.

The CPU main memory has high capacity but limited bandwidth, with high-end models typically in the high tens of GB/s range. The GPU, on the other hand, has limited memory capacity but high bandwidth. An NVIDIA A100 80-GB GPU offers 2 TB/s of memory bandwidth.
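A back-of-the-envelope calculation illustrates why this gap matters; the batch size, lookups per sample, and embedding width below are assumptions chosen only for illustration:

# Embedding bytes touched per training step (forward read only, fp32), and how
# long moving them takes at CPU-memory vs. GPU-memory bandwidth.
batch_size = 8192
lookups_per_sample = 1000
emb_dim = 128
bytes_per_step = batch_size * lookups_per_sample * emb_dim * 4   # ~4.2 GB

for name, bandwidth_gb_s in [("CPU DRAM (~50 GB/s)", 50), ("A100 HBM (2 TB/s)", 2000)]:
    ms = bytes_per_step / (bandwidth_gb_s * 1e9) * 1e3
    print(f"{name}: {ms:.1f} ms per step")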

Solutions

These challenges have been addressed in different ways. For example, keeping the entire embedding table in main memory solves the size issue. However, it most often results in extremely slow training throughput that is dwarfed by the amount and velocity of new data, preventing the system from being retrained in a timely manner.

Alternatively, the embedding can be carefully spread across multiple GPUs and multiple nodes, only to be bogged down by the communication bottleneck, resulting in sustained severe GPU-compute under-utilization and training performance just on par with pure CPU training.

The embedding layer is one of the major bottlenecks in recommender systems. Optimizing the embedding layer is key to unlocking the GPU’s high compute throughput.

In the next section, we discuss how the NVIDIA Merlin HugeCTR recommender framework solves the challenges of large-scale embeddings, by using NVIDIA technologies such as GPUDirect remote direct memory access (RDMA), NVIDIA Collective Communications Library (NCCL), NVLink, and NVSwitch. It unlocks both the high-compute and high-bandwidth capacity of the GPU, while addressing the memory capacity problem with out-of-the-box, multi-GPU, multinode support and model parallelism.

Overview of NVIDIA Merlin HugeCTR embeddings

NVIDIA Merlin addresses the challenges of training large-scale recommender systems. It’s an end-to-end recommender framework that accelerates all phases of recommendation system development, from data preprocessing to training and inference. NVIDIA Merlin HugeCTR is an open-source DL framework dedicated to recommender systems. In this post, we focus on one specific aspect of HugeCTR: embedding optimization.

There are two ways to leverage the embedding optimization work in HugeCTR:

  • Using the native NVIDIA Merlin HugeCTR framework for your training and inference workloads
  • Using the NVIDIA Merlin HugeCTR TensorFlow plugin, which is designed to work seamlessly with TensorFlow

Native HugeCTR embedding optimization

To overcome the embedding challenges and enable faster training, HugeCTR implemented its own embedding layer, which includes a GPU-accelerated hash table, efficient sparse optimizers implemented in a memory-saving manner, and various embedding distribution strategies. It harnesses NCCL as its inter-GPU communication primitives.

The hash-table implementation is based on RAPIDS cuDF, which is a GPU DataFrame library, forming part of the RAPIDS data science platform from NVIDIA. The cuDF GPU hash table can achieve up to a 35x speedup over a CPU implementation such as concurrent_hash_map from Threading Building Blocks (TBB). For more information, see Introducing NVIDIA Merlin HugeCTR: A Training Framework Dedicated to Recommender Systems.

Built with scalability in mind, HugeCTR supports model parallelism for the embedding layer by default. The embedding tables are distributed across the available GPUs and nodes. The dense layers, on the other hand, employ data parallelism (Figure 1).

Figure 1. Embedding layer parallelism in HugeCTR

The Tencent recommendation team is one of the first adopters of the native HugeCTR framework, making heavy use of its native embedding layers. In a recent interview, Xiangting Kong, lead of the Tencent Advertising and Deep Learning Platform said, “HugeCTR, as a recommendation training framework, is integrated into the [Tencent] advertising recommendation training system to make the update frequency of model training faster, and more samples can be trained to improve online effects.”

HugeCTR TensorFlow plugin

All components of the NVIDIA Merlin framework are open-source and designed to be interoperable with the larger deep learning and data science ecosystem. Our long-term vision is to accelerate recommendation workloads on the GPU, regardless of your preferred framework. The HugeCTR TensorFlow embedding plugin was created as a step towards realizing this goal.

At a high level, the TensorFlow embedding plugin is designed by leveraging many of the same embedding optimization techniques that were employed for the native HugeCTR embedding layer: in particular, the GPU hash table and NCCL under the hood for inter-GPU communication.

The HugeCTR embedding plugin is designed to work conveniently and seamlessly with TensorFlow as a drop-in replacement for the TensorFlow-native embedding layers, such as tf.nn.embedding_lookup and tf.nn.embedding_lookup_sparse. It also offers advanced features out of the box, such as model parallelism that distributes the embedding tables over multiple GPUs.

NVIDIA Merlin HugeCTR TensorFlow plugin walkthrough 

Here’s how to make use of the TensorFlow embedding plugin. The full example is available at the HugeCTR repository, where we also provide a full benchmarking notebook for reproducing the performance figures.

The most convenient way to access the HugeCTR embedding plugin is through the NGC NVIDIA Merlin TensorFlow training Docker image, in which it is precompiled and installed, along with other components of the NVIDIA Merlin framework, as well as TensorFlow. The most up-to-date version can be pulled directly from the HugeCTR repository, then compiled and installed on the fly. When TensorFlow is updated, the plugin must also be recompiled for the newly installed TensorFlow version.

For comparison, here’s how the native TensorFlow embedding layers are used. First, you initialize a 2D-array variable to hold the value of the embeddings. Then, use tf.nn.embedding_lookup to look up the embedding value corresponding to a list of IDs.

embedding_var = tf.Variable(initial_value=initial_value, dtype=tf.float32, name='embedding_variables')

@tf.function
def _train_step(inputs, labels):
    emb_vectors = tf.nn.embedding_lookup([embedding_var], inputs)
    ...

for i, (inputs, labels) in enumerate(dataset):
    _train_step(inputs, labels)

In the same fashion, the HugeCTR embedding plugin can be employed. First, you initialize an embedding layer. Next, this embedding layer is used to look up the corresponding embedding values for a list of IDs.

import sparse_operation_kit as sok

emb_layer = sok.All2AllDenseEmbedding(max_vocabulary_size_per_gpu,
                                      embedding_vec_size,
                                      slot_num, nnz_per_slot)

@tf.function
def _train_step(inputs, labels):
    emb_vectors = emb_layer(inputs)
    ...

for i, (inputs, labels) in enumerate(dataset):
    _train_step(inputs, labels)

The HugeCTR embedding plugin is designed to work seamlessly with TensorFlow, including other layers and optimizers such as Adam and SGD. Before TensorFlow v2.5, the Adam optimizer was a CPU-based implementation.

To fully realize the potential of the HugeCTR embedding plugin, we also provide a GPU-based plugin_adam version in sok.optimizers.Adam. Starting from TensorFlow v2.5, the standard Adam optimizer tf.keras.optimizers.Adam, which now comes with a GPU implementation, can be used with similar accuracy and performance.
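In code, the optimizer choice might look like the following sketch; sok.optimizers.Adam is named above, but the exact constructor arguments shown here are an assumption.

import tensorflow as tf
import sparse_operation_kit as sok

# On TensorFlow >= 2.5, the stock Keras Adam already has a GPU implementation:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# On earlier TensorFlow versions, the plugin-provided GPU Adam can be used instead
# (constructor arguments assumed):
# optimizer = sok.optimizers.Adam(learning_rate=0.001)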

Performance benchmark

In this section, we showcase the performance of the HugeCTR TensorFlow embedding plugin through synthetic and real use cases.

Synthetic data

In this example, we use a synthetic dataset with 100 feature fields, each with 10 lookups, and a vocabulary size of 8192. The recommender model is an MLP with six layers, each of size 1024. Using the exact model architecture, optimizer, and data loader in TensorFlow, we observed that on 1x A100 GPU, the HugeCTR embedding plugin improves the average iteration time by 7.9x compared to the native TensorFlow embedding layer (Figure 2).

When being strong-scaled from one to four A100 GPUs, we observed a total speedup of 23.6x. This benefit of multi-GPU scaling is provided by the HugeCTR embedding plugin by default. Under the hood, the embedding plugin automatically distributes the table corresponding to feature fields on to the available GPUs in a model parallel fashion. This contrasts with the native TensorFlow embedding layer, where a significant extra effort is required for distributed model-parallel multi-GPU training. The TensorFlow distribution strategies, MirroredStrategy and MultiWorkerMirroredStrategy, are both designed to do data-parallel synchronized training.

HugeCTR TensorFlow plugin provides a 7.9x speedup over native TensorFlow 2.5 embedding lookup layer.
Figure 2. HugeCTR TensorFlow embedding plugin performance on NVIDIA DGX A100 80GB on synthetic dataset

Real use case: Meituan recommender systems

The Meituan recommender systems team is one of the first teams to adopt the HugeCTR TensorFlow plugin with great success. At first, the team optimized its CPU-based training framework, but as its models became more and more complex, it became difficult to optimize the framework further. Now, Meituan is working on integrating NVIDIA HugeCTR into its training system based on A100 GPUs.

“A single server with 8x A100 GPUs can replace hundreds of workers in the CPU based training system. The cost is also greatly reduced. This is a preliminary optimization result, and there is still much room to optimize in the future,” shared Jun Huang, senior technical expert at Meituan.

Meituan used DIEN as the recommendation model. The total number of embedding parameters is tens of billions, and there are thousands of feature fields in each sample. As the range of input features is not fixed and is unknown in advance, the team uses hash tables to uniquely identify each input feature before feeding it into an embedding layer.

Using the exact model architecture, optimizer, and data loader in TensorFlow, we observed that on a single A100 GPU, the HugeCTR embedding plugin achieved an 11.5x speedup compared to the original TensorFlow embedding. With weak scaling, the iteration time on 8x A100 GPUs only increased slightly to 1.17x that of 1x A100 GPU (Figure 3).

On a real use case, HugeCTR TensorFlow plugin provides a 11.6x speedup over native TensorFlow 2.5 embedding lookup layer.
Figure 3. HugeCTR TensorFlow embedding plugin performance on NVIDIA DGX A100 80GB on Meituan data

Conclusion

The HugeCTR TensorFlow embedding plugin is available today from the HugeCTR GitHub repository, as well as from the NGC NVIDIA Merlin TensorFlow container. If you are a TensorFlow user looking to build and deploy large-scale recommender systems with large embedding tables, the HugeCTR TensorFlow plugin is a great effortless drop-in replacement for TensorFlow embedding lookup layers.

Try it out to see the full potential of your GPUs unlocked. If you feel that you need even more performance and optimization, then the full-fledged native HugeCTR framework might be the next thing that you want to try.


Explore and Test Experimental Models for DLSS Research

Developers are encouraged to download, explore, and evaluate experimental AI models for Deep Learning Super Sampling.

Today, NVIDIA is enabling developers to explore and evaluate experimental AI models for Deep Learning Super Sampling (DLSS). Developers can download experimental Dynamic-link libraries (DLLs), test how the latest DLSS research enhances their games, and provide feedback for future improvements.

NVIDIA DLSS is a deep learning neural network that boosts frame rates and generates beautiful, sharp images for your games. It gives you the performance headroom to maximize ray tracing settings and increase output resolution. 

Powered by dedicated AI processors on NVIDIA RTX GPUs called Tensor Cores, NVIDIA DLSS technology has already been adopted and implemented in over 100 games and applications. These include gaming franchises such as Cyberpunk, Call of Duty, DOOM, Fortnite, LEGO, Minecraft, Rainbow Six, and Red Dead Redemption, with support coming soon for Battlefield 2042.

One of the key advantages of a deep learning approach to super sampling is that the AI model can continuously improve through ongoing training on the NVIDIA supercomputer. In fact, each major production release of DLSS has delivered better image quality across wider ranges of games and applications.  

We are inviting the developer community to test the latest experimental DLSS models straight off the supercomputer, and provide us feedback. The experimental DLLs contain improvements that show promise of image quality, but have not yet been thoroughly validated. Your early input is important to helping us push the state of the art in AI graphics technology.

Get Started Now 

Go to the NVIDIA developer page to see the DLLs available for testing. Getting started is seamless—download the DLL package of your choice, replace the existing DLL in your game, and then simply run.

Learn more about NVIDIA game developer offerings >>


GPU Dashboards in Jupyter Lab

Learn how NVDashboard, an open-source package for Jupyter Lab, lets GPU and RAPIDS users monitor system resources to achieve optimal performance during day-to-day model and workflow development.

This post was originally published on the RAPIDS AI blog here.

Introduction

NVDashboard is an open-source package for the real-time visualization of NVIDIA GPU metrics in interactive Jupyter Lab environments. NVDashboard is a great way for all GPU users to monitor system resources. However, it is especially valuable for users of RAPIDS, NVIDIA’s open-source suite of GPU-accelerated data-science software libraries.

Given the computational intensity of modern data-science algorithms, there are many cases in which GPUs can offer game-changing workflow acceleration. To achieve optimal performance, it is absolutely critical for the underlying software to use system resources effectively. Although acceleration libraries (like cuDNN and RAPIDS) are specifically designed to do the heavy lifting in terms of performance optimization, it can be very useful for both developers and end-users to verify that their software is actually leveraging GPU resources as intended. While this can be accomplished with command-line tools like nvidia-smi, many professional data scientists prefer to use interactive Jupyter notebooks for day-to-day model and workflow development.

Figure 1: The NVDashboard Jupyter-Lab extension in action. The GPU dashboards are shown along the right-hand side of the screen, while two dask-labextension dashboards are shown on the bottom left [GIF].

As illustrated in Fig. 1, NVDashboard enables Jupyter notebook users to visualize system hardware metrics within the same interactive environment they use for development. Supported metrics include:

  • GPU-compute utilization
  • GPU-memory consumption
  • PCIe throughput
  • NVLink throughput

The package is built upon a Python-based dashboard server, which supports the Bokeh visualization library to display and update figures in real time. An additional Jupyter-Lab extension embeds these dashboards as movable windows within an interactive environment. Most GPU metrics are collected through PyNVML, an open-source Python package providing wrappers for the NVIDIA Management Library (NVML). For this reason, the available dashboards can be modified/extended to display any queryable GPU metrics accessible through NVML.

Using NVDashboard

The nvdashboard package is available on PyPI, and consists of two basic components:

  • Bokeh Server: The server component leverages the wonderful Bokeh visualization library to display and update GPU-diagnostic dashboards in real time. The desired hardware metrics are accessed with PyNVML, an open-source Python package providing wrappers for the NVIDIA Management Library (NVML). For this reason, NVDashboard can be modified/extended to display any queryable GPU metrics accessible through NVML, easily from Python.
  • Jupyter-Lab Extension: The Jupyter-Lab extension enables embedding the GPU-diagnostic dashboards as movable windows within an interactive Jupyter-Lab environment.

The Jupyter Lab Extension

The practice of directly querying hardware metrics is often the best way to validate efficient run-time behavior, and this is especially true for interactive Jupyter-notebook users. In this case, the development process is often iterative, and improper GPU utilization results in huge productivity losses. As shown in Fig. 1, NVDashboard makes visualizing resource utilization easy for Jupyter-Lab users right alongside their code.

To install both the server and client-side components, run the following in a terminal:

$ pip install jupyterlab-nvdashboard
$ jupyter labextension install jupyterlab-nvdashboard

After NVDashboard is installed, a “GPU Dashboards” menu should be visible along the left-hand side of your Jupyter-Lab environment (see Fig. 2). Clicking on one of these buttons automatically adds a movable window, with a real-time display of the desired dashboard.

Figure 2: Main menu for the Jupyter-Lab extension.

It is important to clarify that NVDashboard automatically monitors the GPU resources for the entire machine, not only those being used by the local Jupyter environment. The Jupyter-Lab extension can certainly be used for non-iPython/notebook development. For example, in Fig. 3, the “NVLink Timeline” and “GPU Utilization” dashboards are being used within a Jupyter-Lab environment to monitor a multi-GPU deep-learning workflow executed from the command line.

Figure 3: The “NVLink Timeline” dashboard being used with Jupyter Lab [GIF].

The Bokeh server

While the Jupyter-Lab extension is certainly ideal for fans of iPython/notebook-based development, other GPU users can also access the dashboards using a standalone Bokeh server. This is accomplished by running:

$ python -m jupyterlab_nvdashboard.server

After starting the Bokeh server, the GPU dashboards are accessed by opening the appropriate URL in a standard web browser (for example, http://<ip-address>:<port>). As shown in Fig. 4, the main menu lists all dashboards available in NVDashboard.

Figure 4: The main menu for the Bokeh-server component of NVDashboard.

For example, selecting the “GPU-Resources” link opens the dashboard shown in Fig. 5, which summarizes the utilization of various GPU resources using aligned timeline plots.

Figure 5: The “GPU Resources” dashboard being used outside of Jupyter Lab [GIF].

To use NVDashboard in this way, only the pip-installation step is needed (the lab extension installation step can be skipped):

$ pip install jupyterlab-nvdashboard

Alternatively, one can also clone the jupyterlab-nvdashboard repository, and simply execute the server.py script (for example, python jupyterlab_nvdashboard/server.py).

Implementation details

The existing nvdashboard package provides a number of useful GPU-resource dashboards. However, it is fairly straightforward to modify existing dashboards and/or create completely new ones. In order to do this, you simply need to leverage PyNVML and Bokeh.

PyNVML basics

PyNVML is a Python wrapper for the NVIDIA Management Library (NVML), which is a C-based API for monitoring and managing various states of NVIDIA GPU devices. NVML is directly used by the better-known NVIDIA System Management Interface (nvidia-smi). According to the NVIDIA developer site, NVML provides access to the following queryable states (in addition to modifiable states not discussed here):

  • ECC error counts: Both correctable single bit and detectable double bit errors are reported. Error counts are provided for both the current boot cycle and for the lifetime of the GPU.
  • GPU utilization: Current utilization rates are reported for both the compute resources of the GPU and the memory interface.
  • Active compute process: The list of active processes running on the GPU is reported, along with the corresponding process name/id and allocated GPU memory.
  • Clocks and PState: Max and current clock rates are reported for several important clock domains, as well as the current GPU performance state.
  • Temperature and fan speed: The current core GPU temperature is reported, along with fan speeds for non-passive products.
  • Power management: For supported products, the current board power draw and power limits are reported.
  • Identification: Various dynamic and static information is reported, including board serial numbers, PCI device ids, VBIOS/Inforom version numbers and product names.

Although several different python wrappers for NVML currently exist, we use the PyNVML package hosted by GoAi on GitHub. This version of PyNVML uses ctypes to wrap most of the NVML C API. NVDashboard utilizes only a small subset of the API needed to query real-time GPU-resource utilization, including:

  • nvmlInit(): Initialize NVML. Upon successful initialization, the GPU handles are cached to lower the latency of data queries during active monitoring in a dashboard.
  • nvmlShutdown(): Finalize NVML
  • nvmlDeviceGetCount(): Get the number of available GPU devices
  • nvmlDeviceGetHandleByIndex(): Get a handle for a device (given an integer index)
  • nvmlDeviceGetMemoryInfo(): Get a memory-info object (given a device handle)
  • nvmlDeviceGetUtilizationRates(): Get a utilization-rate object (given a device handle)
  • nvmlDeviceGetPcieThroughput(): Get a PCIe-throughput object (given a device handle)
  • nvmlDeviceGetNvLinkUtilizationCounter(): Get an NVLink utilization counter (given a device handle and link index)

In the current version of PyNVML, the python function names are usually chosen to exactly match the C API. For example, to query the current GPU-utilization rate on every available device, the code would look something like this:

In [1]: from pynvml import *
In [2]: nvmlInit()
In [3]: ngpus = nvmlDeviceGetCount()
In [4]: for i in range(ngpus):
   ...:     handle = nvmlDeviceGetHandleByIndex(i)
   ...:     gpu_util = nvmlDeviceGetUtilizationRates(handle).gpu
   ...:     print('GPU %d Utilization = %d%%' % (i, gpu_util))
   ...:
GPU 0 Utilization = 43%
GPU 1 Utilization = 0%
GPU 2 Utilization = 15%
GPU 3 Utilization = 0%
GPU 4 Utilization = 36%
GPU 5 Utilization = 0%
GPU 6 Utilization = 0%
GPU 7 Utilization = 11%

Note that, in addition to the GitHub repository, PyNVML is also hosted on PyPI and Conda Forge.

Dashboard Code

In order to modify/add a GPU dashboard, it is only necessary to work with two files (jupyterlab_bokeh_server/server.py and jupyterlab_nvdashboard/apps/gpu.py). Most of the PyNVML and Bokeh code needed to add/modify a dashboard will be in gpu.py. It is only necessary to modify server.py in the case that you are adding or changing a menu/display name. In this case, the new/modified name must be specified in the routes dictionary (with the key being the desired name, and the value being the corresponding dashboard definition):

routes = {
   "/GPU-Utilization": apps.gpu.gpu,
   "/GPU-Memory": apps.gpu.gpu_mem,
   "/GPU-Resources": apps.gpu.gpu_resource_timeline,
   "/PCIe-Throughput": apps.gpu.pci,
   "/NVLink-Throughput": apps.gpu.nvlink,
   "/NVLink-Timeline": apps.gpu.nvlink_timeline,
   "/Machine-Resources": apps.cpu.resource_timeline,
}

In order for the server to constantly refresh the PyNVML data used by the bokeh applications, we use bokeh’s ColumnDataSource class to define the source of data in each of our plots. The ColumnDataSource class allows an update function to be passed for each type of data, which can be called within a dedicated callback function (cb) for each application. For example, the existing gpu application is defined like this:
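A simplified sketch of such an app, rather than the verbatim upstream code, is shown below; the figure layout, field names, and the 200 ms polling interval are assumptions.

from bokeh.layouts import column
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
import pynvml

def gpu(doc):
    pynvml.nvmlInit()
    ngpus = pynvml.nvmlDeviceGetCount()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(ngpus)]

    # One horizontal bar per GPU, driven by a ColumnDataSource.
    source = ColumnDataSource({"gpu": list(range(ngpus)), "right": [0.0] * ngpus})
    fig = figure(title="GPU Utilization", x_range=(0, 100), y_range=(-0.5, ngpus - 0.5))
    fig.hbar(source=source, y="gpu", right="right", height=0.8)

    def cb():
        # Refresh the utilization numbers; Bokeh pushes the change to the client.
        source.data.update({"right": [
            pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles
        ]})

    doc.add_periodic_callback(cb, 200)   # poll every 200 ms
    doc.add_root(column(fig, sizing_mode="stretch_both"))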

Note that the real-time update of PyNVML GPU-utilization data is performed within the source.data.update() call. With the necessary ColumnDataSource logic in place, the standard GPU definition (above) can be modified in many ways. For example, swapping the x and y axes, specifying a different color palette, or even changing the figure from an hbar to something else entirely.

Users should feel free to open a pull request to contribute valuable improvements/additions — Community engagement is certainly encouraged!

Jupyter Lab extension code

In order to package this up as a Jupyter Lab extension we need our bokeh server to be run when Jupyter Lab starts. We can do this by adding jupyter-server-proxy as a dependency and registering an entrypoint in our setup.py:

entry_points={
   'jupyter_serverproxy_servers': [
       'nvdashboard = jupyterlab_nvdashboard:launch_server',
   ]
},

This results in Jupyter Lab launching our bokeh server when it is run and proxying traffic through from /nvdashboard to it.

Then we can write a JavaScript frontend extension for Jupyter Lab which adds a sidebar menu with a list of dashboards which are opened as IFrames in new tabs in the interface. This is installed by running jupyter labextension install jupyterlab-nvdashboard.

There is also a custom Bokeh server endpoint in the Python side of things which serves a list of all the available dashboards and their URLs as a json file at /nvdashboard/index.json. The front end can use this json file to populate the menu automatically. This means that if we want to add any new dashboards we can do everything on the Python side by adding new Bokeh apps and the Jupyter Lab extension will pick them up automatically.