Categories
Offsites

Enhanced Sleep Sensing in Nest Hub

Earlier this year, we launched Contactless Sleep Sensing in Nest Hub, an opt-in feature that can help users better understand their sleep patterns and nighttime wellness. While some of the most critical sleep insights can be derived from a person’s overall schedule and duration of sleep, that alone does not tell the complete story. The human brain has special neurocircuitry to coordinate sleep cycles — transitions between deep, light, and rapid eye movement (REM) stages of sleep — vital not only for physical and emotional wellbeing, but also for optimal physical and cognitive performance. Combining such sleep staging information with disturbance events can help you better understand what’s happening while you’re sleeping.

Today we announced enhancements to Sleep Sensing that provide deeper sleep insights. While not intended for medical purposes1, these enhancements allow better understanding of sleep through sleep stages and the separation of the user’s coughs and snores from other sounds in the room. Here we describe how we developed these novel technologies, through transfer learning techniques to estimate sleep stages and sensor fusion of radar and microphone signals to disambiguate the source of sleep disturbances.

To help people understand their sleep patterns, Nest Hub displays a hypnogram, plotting the user’s sleep stages over the course of a sleep session. Potential sound disturbances during sleep will now include “Other sounds” in the timeline to separate the user’s coughs and snores from other sound disturbances detected from sources in the room outside of the calibrated sleeping area.

Training and Evaluating the Sleep Staging Classification Model
Most people cycle through sleep stages 4-6 times a night, about every 80-120 minutes, sometimes with a brief awakening between cycles. Recognizing the value for users to understand their sleep stages, we have extended Nest Hub’s sleep-wake algorithms using Soli to distinguish between light, deep, and REM sleep. We employed a design that is generally similar to Nest Hub’s original sleep detection algorithm: sliding windows of raw radar samples are processed to produce spectrogram features, and these are continuously fed into a Tensorflow Lite model. The key difference is that this new model was trained to predict sleep stages rather than simple sleep-wake status, and thus required new data and a more sophisticated training process.

In order to assemble a rich and diverse dataset suitable for training high-performing ML models, we leveraged existing non-radar datasets and applied transfer learning techniques to train the model. The gold standard for identifying sleep stages is polysomnography (PSG), which employs an array of wearable sensors to monitor a number of body functions during sleep, such as brain activity, heartbeat, respiration, eye movement, and motion. These signals can then be interpreted by trained sleep technologists to determine sleep stages.

To develop our model, we used publicly available data from the Sleep Heart Health Study (SHHS) and Multi-ethnic Study of Atherosclerosis (MESA) studies with over 10,000 sessions of raw PSG sensor data with corresponding sleep staging ground-truth labels, from the National Sleep Research Resource. The thoracic respiratory inductance plethysmography (RIP) sensor data within these PSG datasets is collected through a strap worn around the patient’s chest to measure motion due to breathing. While this is a very different sensing modality from radar, both RIP and radar provide signals that can be used to characterize a participant’s breathing and movement. This similarity between the two domains makes it possible to leverage a plethysmography-based model and adapt it to work with radar.

To do so, we first computed spectrograms from the RIP time series signals and used these as features to train a convolutional neural network (CNN) to predict the groundtruth sleep stages. This model successfully learned to identify breathing and motion patterns in the RIP signal that could be used to distinguish between different sleep stages. This indicated to us that the same should also be possible when using radar-based signals.

To test the generality of this model, we substituted similar spectrogram features computed from Nest Hub’s Soli sensor and evaluated how well the model was able to generalize to a different sensing modality. As expected, the model trained to predict sleep stages from a plethysmograph sensor was much less accurate when given radar sensor data instead. However, the model still performed much better than chance, which demonstrated that it had learned features that were relevant across both domains.

To improve on this, we collected a smaller secondary dataset of radar sensor data with corresponding PSG-based groundtruth labels, and then used a portion of this dataset to fine-tune the weights of the initial model. This smaller amount of additional training data allowed the model to adapt the original features it had learned from plethysmography-based sleep staging and successfully generalize them to our domain. When evaluated on an unseen test set of new radar data, we found the fine-tuned model produced sleep staging results comparable to that of other consumer sleep trackers.

The custom ML model efficiently processes a continuous stream of 3D radar tensors (as shown in the spectrogram at the top of the figure) to automatically compute probabilities of each sleep stage — REM, light, and deep — or detect if the user is awake or restless.

More Intelligent Audio Sensing Through Audio Source Separation
Soli-based sleep tracking gives users a convenient and reliable way to see how much sleep they are getting and when sleep disruptions occur. However, to understand and improve their sleep, users also need to understand why their sleep may be disrupted. We’ve previously discussed how Nest Hub can help monitor coughing and snoring, frequent sources of sleep disturbances of which people are often unaware. To provide deeper insight into these disturbances, it is important to understand if the snores and coughs detected are your own.

The original algorithms on Nest Hub used an on-device, CNN-based detector to process Nest Hub’s microphone signal and detect coughing or snoring events, but this audio-only approach did not attempt to distinguish from where a sound originated. Combining audio sensing with Soli-based motion and breathing cues, we updated our algorithms to separate sleep disturbances from the user-specified sleeping area versus other sources in the room. For example, when the primary user is snoring, the snoring in the audio signal will correspond closely with the inhalations and exhalations detected by Nest Hub’s radar sensor. Conversely, when snoring is detected outside the calibrated sleeping area, the two signals will vary independently. When Nest Hub detects coughing or snoring but determines that there is insufficient correlation between the audio and motion features, it will exclude these events from the user’s coughing or snoring timeline and instead note them as “Other sounds” on Nest Hub’s display. The updated model continues to use entirely on-device audio processing with privacy-preserving analysis, with no raw audio data sent to Google’s servers. A user can then opt to save the outputs of the processing (sound occurrences, such as the number of coughs and snore minutes) in Google Fit, in order to view their night time wellness over time.

Snoring sounds that are synchronized with the user’s breathing pattern (left) will be displayed in the user’s Nest Hub’s Snoring timeline. Snoring sounds that do not align with the user’s breathing pattern (right) will be displayed in Nest Hub’s “Other sounds” timeline.

Since Nest Hub with Sleep Sensing launched, researchers have expressed interest in investigational studies using Nest Hub’s digital quantification of nighttime cough. For example, a small feasibility study supported by the Cystic Fibrosis Foundation2 is currently underway to evaluate the feasibility of measuring night time cough using Nest Hub in families of children with cystic fibrosis (CF), a rare inherited disease, which can result in a chronic cough due to mucus in the lungs. Researchers are exploring if quantifying cough at night could be a proxy for monitoring response to treatment.

Conclusion
Based on privacy-preserving radar and audio signals, these improved sleep staging and audio sensing features on Nest Hub provide deeper insights that we hope will help users translate their night time wellness into actionable improvements for their overall wellbeing.

Acknowledgements
This work involved collaborative efforts from a multidisciplinary team of software engineers, researchers, clinicians, and cross-functional contributors. Special thanks to Dr. Logan Schneider, a sleep neurologist whose clinical expertise and contributions were invaluable to continuously guide this research. In addition to the authors, key contributors to this research include Anupam Pathak, Jeffrey Yu, Arno Charton, Jian Cui, Sinan Hersek, Jonathan Hsu, Andi Janti, Linda Lei, Shao-Po Ma, ‎Jo Schaeffer, Neil Smith, Siddhant Swaroop, Bhavana Koka, Dr. Jim Taylor, and the extended team. Thanks to Mark Malhotra and Shwetak Patel for their ongoing leadership, as well as the Nest, Fit, and Assistant teams we collaborated with to build and validate these enhancements to Sleep Sensing on Nest Hub.


1Not intended to diagnose, cure, mitigate, prevent or treat any disease or condition. 
2Google did not have any role in study design, execution, or funding. 

Categories
Offsites

Model Ensembles Are Faster Than You Think

When building a deep model for a new machine learning application, researchers often begin with existing network architectures, such as ResNets or EfficientNets. If the initial model’s accuracy isn’t high enough, a larger model may be a tempting alternative, but may not actually be the best solution for the task at hand. Instead, better performance potentially could be achieved by designing a new model that is optimized for the task. However, such efforts can be challenging and usually require considerable resources.

In “Wisdom of Committees: An Overlooked Approach to Faster and More Accurate Models”, we discuss model ensembles and a subset called model cascades, both of which are simple approaches that construct new models by collecting existing models and combining their outputs. We demonstrate that ensembles of even a small number of models that are easily constructed can match or exceed the accuracy of state-of-the-art models while being considerably more efficient.

What Are Model Ensembles and Cascades?
Ensembles and cascades are related approaches that leverage the advantages of multiple models to achieve a better solution. Ensembles execute multiple models in parallel and then combine their outputs to make the final prediction. Cascades are a subset of ensembles, but execute the collected models sequentially, and merge the solutions once the prediction has a high enough confidence. For simple inputs, cascades use less computation, but for more complex inputs, may end up calling on a greater number of models, resulting in higher computation costs.

Overview of ensembles and cascades. While this example shows 2-model combinations for both ensembles and cascades, any number of models can potentially be used.

Compared to a single model, ensembles can provide improved accuracy if there is variety in the collected models’ predictions. For example, the majority of images in ImageNet are easy for contemporary image recognition models to classify, but there are many images for which predictions vary between models and that will benefit most from an ensemble.

While ensembles are well-known, they are often not considered a core building block of deep model architectures and are rarely explored when researchers are developing more efficient models (with a few notable exceptions [1, 2, 3]). Therefore, we conduct a comprehensive analysis of ensemble efficiency and show that a simple ensemble or cascade of off-the-shelf pre-trained models can enhance both the efficiency and accuracy of state-of-the-art models.

To encourage the adoption of model ensembles, we demonstrate the following beneficial properties:

  1. Simple to build: Ensembles do not require complicated techniques (e.g., early exit policy learning).
  2. Easy to maintain: Ensembles are trained independently, making them easy to maintain and deploy.
  3. Affordable to train: The total training cost of models in an ensemble is often lower than a similarly accurate single model.
  4. On-device speedup: The reduction in computation cost (FLOPS) successfully translates to a speedup on real hardware.

Efficiency and Training Speed
It’s not surprising that ensembles can increase accuracy, but using multiple models in an ensemble may introduce extra computational cost at runtime. So, we investigate whether an ensemble can be more accurate than a single model that has the same computational cost. We analyze a series of models, EfficientNet-B0 to EfficientNet-B7, that have different levels of accuracy and FLOPS when applied to ImageNet inputs. The ensemble predictions are computed by averaging the predictions of each individual model.

We find that ensembles are significantly more cost-effective in the large computation regime (>5B FLOPS). For example, an ensemble of two EfficientNet-B5 models matches the accuracy of a single EfficientNet-B7 model, but does so using ~50% fewer FLOPS. This demonstrates that instead of using a large model, in this situation, one should use an ensemble of multiple considerably smaller models, which will reduce computation requirements while maintaining accuracy. Moreover, we find that the training cost of an ensemble can be much lower (e.g., two B5 models: 96 TPU days total; one B7 model: 160 TPU days). In practice, model ensemble training can be parallelized using multiple accelerators leading to further reductions. This pattern holds for the ResNet and MobileNet families as well.

Ensembles outperform single models in the large computation regime (>5B FLOPS).

Power and Simplicity of Cascades
While we have demonstrated the utility of model ensembles, applying an ensemble is often wasteful for easy inputs where a subset of the ensemble will give the correct answer. In these situations, cascades save computation by allowing for an early exit, potentially stopping and outputting an answer before all models are used. The challenge is to determine when to exit from the cascade.

To highlight the practical benefit of cascades, we intentionally choose a simple heuristic to measure the confidence of the prediction — we take the confidence of the model to be the maximum of the probabilities assigned to each class. For example, if the predicted probabilities for an image being either a cat, dog, or horse were 20%, 80% and 20%, respectively, then the confidence of the model’s prediction (dog) would be 0.8. We use a threshold on the confidence score to determine when to exit from the cascade.

To test this approach, we build model cascades for the EfficientNet, ResNet, and MobileNetV2 families to match either computation costs or accuracy (limiting the cascade to a maximum of four models). By design in cascades, some inputs incur more FLOPS than others, because more challenging inputs go through more models in the cascade than easier inputs. So we report the average FLOPS computed over all test images. We show that cascades outperform single models in all computation regimes (when FLOPS range from 0.15B to 37B) and can enhance accuracy or reduce the FLOPS (sometimes both) for all models tested.

Cascades of EfficientNet (left), ResNet (middle) and MobileNetV2 (right) models on ImageNet. When using similar FLOPS, cascades obtain a higher accuracy than single models (shown by the red arrows pointing up). Cascades can also match the accuracy of single models with significantly fewer FLOPS e.g., 5.4x for B7 (green arrows pointing left).
Summary of accuracy vs. FLOPS for ensembles and cascades. Squares and stars represent ensembles and cascades, respectively,, and the “+” notation indicates the models that comprise the ensemble or cascade. For example, ”B3+B4+B5+B7” at a star refers to a cascade of EfficientNet-B3, B4, B5 and B7 models.

In some cases it is not the average computation cost but the worst-case cost that is the limiting factor. By adding a simple constraint to the cascade building procedure, one can guarantee an upper bound to the computation cost of the cascade. See the paper for more details.

Other than convolutional neural networks, we also consider a Transformer-based architecture, ViT. We build a cascade of ViT-Base and ViT-Large models to match the average computation or accuracy of a single state-of-the-art ViT-Large model, and show that the benefit of cascades also generalizes to Transformer-based architectures.

        Single Models Cascades – Similar Throughput    Cascades – Similar Accuracy
Top-1 (%) Throughput Top-1 (%) Throughput △Top-1 Top-1 (%) Throughput SpeedUp
ViT-L-224 82.0 192 83.1 221 1.1 82.3 409 2.1x
ViT-L-384 85.0 54 86.0 69 1.0 85.2 125 2.3x
Cascades of ViT models on ImageNet. “224” and “384” indicate the image resolution on which the model is trained. Throughput is measured as the number of images processed per second. Our cascades can achieve a 1.0% higher accuracy than ViT-L-384 with a similar throughput or achieve a 2.3x speedup over that model while matching its accuracy.

Earlier works on cascades have also shown efficiency improvements for state-of-the-art models, but here we demonstrate that a simple approach with a handful of models is sufficient.

Inference Latency
In the above analysis, we average FLOPS to measure the computational cost. It is also important to verify that the FLOPS reduction obtained by cascades actually translates into speedup on hardware. We examine this by comparing on-device latency and speed-up for similarly performing single models versus cascades. We find a reduction in the average online latency on TPUv3 of up to 5.5x for cascades of models from the EfficientNet family compared to single models with comparable accuracy. As models become larger the more speed-up we find with comparable cascades.

Average latency of cascades on TPUv3 for online processing. Each pair of same colored bars has comparable accuracy. Notice that cascades provide drastic latency reduction.

Building Cascades from Large Pools of Models
Above, we limit the model types and only consider ensembles/cascades of at most four models. While this highlights the simplicity of using ensembles, it also allows us to check all combinations of models in very little time so we can find optimal model collections with only a few CPU hours on a held out set of predictions.

When a large pool of models exists, we would expect cascades to be even more efficient and accurate, but brute force search is not feasible. However, efficient cascade search methods have been proposed. For example, the algorithm of Streeter (2018), when applied to a large pool of models, produced cascades that matched the accuracy of state-of-the-art neural architecture search–based ImageNet models with significantly fewer FLOPS, for a range of model sizes.

Conclusion
As we have seen, ensemble/cascade-based models obtain superior efficiency and accuracy over state-of-the-art models from several standard architecture families. In our paper we show more results for other models and tasks. For practitioners, this outlines a simple procedure to boost accuracy while retaining efficiency using off-the-shelf models. We encourage you to try it out!

Acknowledgement
This blog presents research done by Xiaofang Wang (while interning at Google Research), Dan Kondratyuk, Eric Christiansen, Kris M. Kitani, Yair Alon (prev. Movshovitz-Attias), and Elad Eban. We thank Sergey Ioffe, Shankar Krishnan, Max Moroz, Josh Dillon, Alex Alemi, Jascha Sohl-Dickstein‎, Rif A Saurous, and Andrew Helton for their valuable help and feedback.

Categories
Offsites

Making Better Future Predictions by Watching Unlabeled Videos

Machine learning (ML) agents are increasingly deployed in the real world to make decisions and assist people in their daily lives. Making reasonable predictions about the future at varying timescales is one of the most important capabilities for such agents because it enables them to predict changes in the world around them, including other agents’ behaviors, and plan how to act next. Importantly, successful future prediction requires both capturing meaningful transitions in the environment (e.g., dough transforming into bread) and adapting to how transitions unfold over time in order to make decisions.

Previous work in future prediction from visual observations has largely been constrained by the format of its output (e.g., pixels that represent an image) or a manually-defined set of human activities (e.g., predicting if someone will keep walking, sit down, or jump). These are either too detailed and hard to predict or lack important information about the richness of the real world. For example, predicting “person jumping” does not capture why they’re jumping, what they’re jumping onto, etc. Also, with very few exceptions, previous models were designed to make predictions at a fixed offset into the future, which is a limiting assumption because we rarely know when meaningful future states will happen.

For example, in a video about making ice cream (depicted below), the meaningful transition from “cream” to “ice cream” occurs over 35 seconds, so models predicting such transitions would need to look 35 seconds ahead. But this time interval varies a large amount across different activities and videos — meaningful transitions occur at any distance into the future. Learning to make such predictions at flexible intervals is hard because the desired ground truth may be relatively ambiguous. For example, the correct prediction could be the just-churned ice cream in the machine, or scoops of the ice cream in a bowl. In addition, collecting such annotations at scale (i.e., frame-by-frame for millions of videos) is infeasible. However, many existing instructional videos come with speech transcripts, which often offer concise, general descriptions throughout entire videos. This source of data can guide a model’s attention toward important parts of the video, obviating the need for manual labeling and allowing a flexible, data-driven definition of the future.

In “Learning Temporal Dynamics from Cycles in Narrated Video”, published at ICCV 2021, we propose an approach that is self-supervised, using a recent large unlabeled dataset of diverse human action. The resulting model operates at a high level of abstraction, can make predictions arbitrarily far into the future, and chooses how far into the future to predict based on context. Called Multi-Modal Cycle Consistency (MMCC), it leverages narrated instructional video to learn a strong predictive model of the future. We demonstrate how MMCC can be applied, without fine-tuning, to a variety of challenging tasks, and qualitatively examine its predictions. In the example below, MMCC predicts the future (d) from present frame (a), rather than less relevant potential futures (b) or (c).

This work uses cues from vision and language to predict high-level changes (such as cream becoming ice cream) in video (video from HowTo100M).

Viewing Videos as Graphs
The foundation of our method is to represent narrated videos as graphs. We view videos as a collection of nodes, where nodes are either video frames (sampled at 1 frame per second) or segments of narrated text (extracted with automatic speech recognition systems), encoded by neural networks. During training, MMCC constructs a graph from the nodes, using cross-modal edges to connect video frames and text segments that refer to the same state, and temporal edges to connect the present (e.g., strawberry-flavored cream) and the future (e.g., soft-serve ice cream). The temporal edges operate on both modalities equally — they can start from either a video frame, some text, or both, and can connect to a future (or past) state in either modality. MMCC achieves this by learning a latent representation shared by frames and text and then making predictions in this representation space.

Multi-modal Cycle Consistency
To learn the cross-modal and temporal edge functions without supervision, we apply the idea of cycle consistency. Here, cycle consistency refers to the construction of cycle graphs, in which the model constructs a series of edges from an initial node to other nodes and back again: Given a start node (e.g., a sample video frame), the model is expected to find its cross-modal counterpart (i.e., text describing the frame) and combine them as the present state. To do this, at the start of training, the model assumes that frames and text with the same timestamps are counterparts, but then relaxes this assumption later. The model then predicts a future state, and the node most similar to this prediction is selected. Finally, the model attempts to invert the above steps by predicting the present state backward from the future node, and thus connecting the future node back with the start node.

The discrepancy between the model’s prediction of the present from the future and the actual present is the cycle-consistency loss. Intuitively, this training objective requires the predicted future to contain enough information about its past to be invertible, leading to predictions that correspond to meaningful changes to the same entities (e.g., tomato becoming marinara sauce, or flour and eggs in a bowl becoming dough). Moreover, the inclusion of cross-modal edges ensures future predictions are meaningful in either modality.

To learn the temporal and cross-modal edge functions end-to-end, we use the soft attention technique, which first outputs how likely each node is to be the target node of the edge, and then “picks” a node by taking the weighted average among all possible candidates. Importantly, this cyclic graph constraint makes few assumptions for the kind of temporal edges the model should learn, as long as they end up forming a consistent cycle. This enables the emergence of long-term temporal dynamics critical for future prediction without requiring manual labels of meaningful changes.

An example of the training objective: A cycle graph is expected to be constructed between the chicken with soy sauce and the chicken in chili oil because they are two adjacent steps in the chicken’s preparation (video from HowTo100M).

Discovering Cycles in Real-World Video
MMCC is trained without any explicit ground truth, using only long video sequences and randomly sampled starting conditions (a frame or text excerpt) and asking the model to find temporal cycles. After training, MMCC can identify meaningful cycles that capture complex changes in video.

Given frames as input (left), MMCC selects relevant text from video narrations and uses both modalities to predict a future frame (middle). It then finds text relevant to this future and uses it to predict the past (right). Using its knowledge of how objects and scenes change over time, MMCC “closes the cycle” and ends up where it started (videos from HowTo100M).
The model can also start from narrated text rather than frames and still find relevant transitions (videos from HowTo100M).

Zero-Shot Applications
For MMCC to identify meaningful transitions over time in an entire video, we define a “likely transition score” for each pair (A, B) of frames in a video, according to the model’s predictions — the closer B is to our model’s prediction of the future of A, the higher the score assigned. We then rank all pairs according to this score and show the highest-scoring pairs of present and future frames detected in previously unseen videos (examples below).

The highest-scoring pairs from eight random videos, which showcase the versatility of the model across a wide range of tasks (videos from HowTo100M).

We can use this same approach to temporally sort an unordered collection of video frames without any fine-tuning by finding an ordering that maximizes the overall confidence scores between all adjacent frames in the sorted sequence.

Left: Shuffled frames from three videos. Right: MMCC unshuffles the frames. The true order is shown under each frame. Even when MMCC does not predict the ground truth, its predictions often appear reasonable, and so, it can present an alternate ordering (videos from HowTo100M).

Evaluating Future Prediction
We evaluate the model’s ability to anticipate action, potentially minutes in advance, using the top-k recall metric, which here measures a model’s ability to retrieve the correct future (higher is better). On CrossTask, a dataset of instruction videos with labels describing key steps, MMCC outperforms the previous self-supervised state-of-the-art models in inferring possible future actions.

Recall
Model    Top-1       Top-5       Top-10   
Cross-modal    2.9 14.2 24.3
Repr. Ant. 3.0 13.3 26.0
MemDPC 2.9 15.8 27.4
TAP 4.5 17.1 27.9
MMCC 5.4 19.9 33.8

Conclusions
We have introduced a self-supervised method to learn temporal dynamics by cycling through narrated instructional videos. Despite the simplicity of the model’s architecture, it can discover meaningful long-term transitions in vision and language, and can be applied without further training to challenging downstream tasks, such as anticipating far-away action and ordering collections of images. An interesting future direction is transferring the model to agents so they can use it to conduct long-term planning.

Acknowledgements
The core team includes Dave Epstein, Jiajun Wu, Cordelia Schmid, and Chen Sun. We thank Alexei Efros, Mia Chiquier, and Shiry Ginosar for their feedback, and Allan Jabri for inspiration in figure design. Dave would like to thank Dídac Surís and Carl Vondrick for insightful early discussions on cycling through time in video.

Categories
Offsites

An Open Source Vibrotactile Haptics Platform for On-Body Applications.

Most wearable smart devices and mobile phones have the means to communicate with the user through tactile feedback, enabling applications from simple notifications to sensory substitution for accessibility. Typically, they accomplish this using vibrotactile actuators, which are small electric vibration motors. However, designing a haptic system that is well-targeted and effective for a given task requires experimentation with the number of actuators and their locations in the device, yet most practical applications require standalone on-body devices and integration into small form factors. This combination of factors can be difficult to address outside of a laboratory as integrating these systems can be quite time-consuming and often requires a high level of expertise.

A typical lab setup on the left and the VHP board on the right.

In “VHP: Vibrotactile Haptics Platform for On-body Applications”, presented at ACM UIST 2021, we develop a low-power miniature electronics board that can drive up to 12 independent channels of haptic signals with arbitrary waveforms. The VHP electronics board can be battery-powered, and integrated into wearable devices and small gadgets. It allows all-day wear, has low latency, battery life between 3 and 25 hours, and can run 12 actuators simultaneously. We show that VHP can be used in bracelet, sleeve, and phone-case form factors. The bracelet was programmed with an audio-to-tactile interface to aid lipreading and remained functional when worn for multiple months by developers. To facilitate greater progress in the field of wearable multi-channel haptics with the necessary tools for their design, implementation, and experimentation, we are releasing the hardware design and software for the VHP system via GitHub.

Front and back sides of the VHP circuit board.
Block diagram of the system.

Platform Specifications.
VHP consists of a custom designed circuit board, where the main components are the microcontroller and haptic amplifier, which converts microcontroller’s digital output into signals that drive the actuators. The haptic actuators can be controlled by signals arriving via serial, USB, and Bluetooth Low Energy (BLE), as well as onboard microphones, using an nRF52840 microcontroller, which was chosen because it offers many input and output options and BLE, all in a small package. We added several sensors into the board to provide more experimental flexibility: an on-board digital microphone, an analog microphone amplifier, and an accelerometer. The firmware is a portable C/C++ library that works in the Arduino ecosystem.

To allow for rapid iteration during development, the interface between the board and actuators is critical. The 12 tactile signals’ wiring have to be quick to set up in order to allow for such development, while being flexible and robust to stand up to prolonged use. For the interface, we use a 24-pin FPC (flexible printed circuit) connector on the VHP. We support interfacing to the actuators in two ways: with a custom flexible circuit board and with a rigid breakout board.

VHP board (small board on the right) connected to three different types of tactile actuators via rigid breakout board (large board on the left).

Using Haptic Actuators as Sensors
In our previous blog post, we explored how back-EMF in a haptic actuator could be used for sensing and demonstrated a variety of useful applications. Instead of using back-EMF sensing in the VHP system, we measure the electrical current that drives each vibrotactile actuator and use the current load as the sensing mechanism. Unlike back-EMF sensing, this current-sensing approach allows simultaneous sensing and actuation, while minimizing the additional space needed on the board.

One challenge with the current-sensing approach is that there is a wide variety of vibrotactile actuators, each of which may behave differently and need different presets. In addition, because different actuators can be added and removed during prototyping with the adapter board, it would be useful if the VHP were able to identify the actuator automatically. This would improve the speed of prototyping and make the system more novice-friendly.

To explore this possibility, we collected current-load data from three off-the-shelf haptic actuators and trained a simple support vector machine classifier to recognize the difference in the signal pattern between actuators. The test accuracy was 100% for classifying the three actuators, indicating that each actuator has a very distinct response.

Different actuators have a different current signature during a frequency sweep, thus allowing for automatic identification.

Additionally, vibrotactile actuators require proper contact with the skin for consistent control over stimulation. Thus, the device should measure skin contact and either provide an alert or self-adjust if it is not loaded correctly. To test whether a skin contact measuring technique works in practice, we measured the current load on actuators in a bracelet as it was tightened and loosened around the wrist. As the bracelet strap is tightened, the contact pressure between the skin and the actuator increases and the current required to drive the actuator signal increases commensurately.

Current load sensing is responding to touch, while the actuator is driven at 250 Hz frequency.

Quality of the fit of the bracelet is measured.

Audio-to-Tactile Feedback
To demonstrate the utility of the VHP platform, we used it to develop an audio-to-tactile feedback device to help with lipreading. Lipreading can be difficult for many speech sounds that look similar (visemes), such as “pin” and “min”. In order to help the user differentiate visemes like these, we attach a microphone to the VHP system, which can then pick up the speech sounds and translate the audio to vibrations on the wrist. For audio-to-tactile translation, we used our previously developed algorithms for real-time audio-to-tactile conversion, available via GitHub. Briefly, audio filters are paired with neural networks to recognize certain viesemes (e.g., picking up the hard consonant “p” in “pin”), and are then translated to vibrations in different parts of the bracelet. Our approach is inspired by tactile phonemic sleeve (TAPS), however the major difference is that in our approach the tactile signal is presented continuously and in real-time.

One of the developers who employs lipreading in daily life wore the bracelet daily for several months and found it to give better information to facilitate lipreading than previous devices, allowing improved understanding of lipreading visemes with the bracelet versus lipreading alone. In the future, we plan to conduct full-scale experiments with multiple users wearing the device for an extended time.

Left: Audio-to-tactile sleeve. Middle: Audio-to-tactile bracelet. Right: One of our developers tests out the bracelets, which are worn on both arms.

Potential Applications
The VHP platform enables rapid experimentation and prototyping that can be used to develop techniques for a variety of applications. For example:

  • Rich haptics on small devices: Expanding the number of actuators on mobile phones, which typically only have one or two, could be useful to provide additional tactile information. This is especially useful as fingers are sensitive to vibrations. We demonstrated a prototype mobile phone case with eight vibrotactile actuators. This could be used to provide rich notifications and enhance effects in a mobile game or when watching a video.
  • Lab psychophysical experiments: Because VHP can be easily set up to send and receive haptic signals in real time, e.g., from a Jupyter notebook, it could be used to perform real-time haptic experiments.
  • Notifications and alerts: The wearable VHP could be used to provide haptic notifications from other devices, e.g., alerting if someone is at the door, and could even communicate distinguishable alerts through use of multiple actuators.
  • Sensory substitution: Besides the lipreading assistance example above, there are many other potential applications for accessibility using sensory substitution, such as visual-to-tactile sensing or even sensing magnetic fields.
  • Loading sensing: The ability to sense from the haptic actuator current load is unique to our platform, and enables a variety of features, such as pressure sensing or automatically adjusting actuator output.
Integrating eight voice coils into a phone case. We used loading sensing to understand which voice coils are being touched.

What’s next?
We hope that others can utilize the platform to build a diverse set of applications. If you are interested and have ideas about using our platform or want to receive updates, please fill out this form. We hope that with this platform, we can help democratize the use of haptics and inspire a more widespread use of tactile devices.

Acknowledgments
This work was done by Artem Dementyev, Pascal Getreuer, Dimitri Kanevsky, Malcolm Slaney and Richard Lyon. We thank Alex Olwal, Thad Starner, Hong Tan, Charlotte Reed, Sarah Sterman for valuable feedback and discussion on the paper. Yuhui Zhao, Dmitrii Votintcev, Chet Gnegy, Whitney Bai and Sagar Savla for feedback on the design and engineering.

Categories
Misc

How to get a complete model history per testing sample in each epoch instead of just one average value per epoch?

So let’s say I have 1 metric I want to monitor across the mode fitting process. So far, I’ve adapted the default CallBack function which allows me to get one single value for that given metric in each epoch. With this, I can save a time-series (just as with CSV logger) plot to check for the performance evolution but it doesn’t provide robustness results. Therefore, I would like to get the metric value for every sample in each epoch and save it, allowing me to get a time-series plot with a 95% confidence.

Thank you

submitted by /u/throwaway469271
[visit reddit] [comments]

Categories
Misc

Multi-class classification

Hey, I am doing skin cancer classification with 9 classes, i am having a problem with overfitting as my train Accuracy can reach up to 90 with test acc=50 at best.

https://www.kaggle.com/pattnaiksatyajit/skin-cancer

I have tried dropout, l2,l1 data augmentation lowering my network size, class weights, all it did was a longer training epoch with the same results as test acc never improved.

I am using cnn, is there another architect I can use? can I transfer this problem to “one vs all”?using TensorFlow ?

submitted by /u/Ali99695
[visit reddit] [comments]

Categories
Misc

Why my model perform really bad even though with val_accuracy: 0.9800

Why my model perform really bad even though with val_accuracy: 0.9800

Hello I’m trying to make an image classifier which classifies given tomato plant leaf as [‘Tomato___Early_blight’, ‘Tomato___Septoria_leaf_spot’, ‘Tomato___healthy’]. I took the dataset from here It is already augmented and from that I took only Tomato plant leaf images and further reduced it to only three classes as mentioned before.

Here is my code

import tensorflow as tf

from tensorflow import keras

from tensorflow.keras.preprocessing.image import ImageDataGenerator

from tensorflow.keras.preprocessing.image import load_img, img_to_array

import numpy as np

import matplotlib.pyplot as plt

train_gen = ImageDataGenerator(rescale=1./255)

test_gen = ImageDataGenerator(rescale=1./255)

train_data = train_gen.flow_from_directory(directory=’/Users/saibalaji/Documents/TensorFlowProjects/TomatoDataSet/train’,target_size=(256,256))

valiation_data = test_gen.flow_from_directory(directory=’/Users/saibalaji/Documents/TensorFlowProjects/TomatoDataSet/valid’,target_size=(256,256))

model = tf.keras.models.Sequential([

tf.keras.layers.BatchNormalization(),

tf.keras.layers.Conv2D(32, 3, activation=’relu’),

tf.keras.layers.MaxPooling2D(),

tf.keras.layers.Conv2D(64, 3, activation=’relu’),

tf.keras.layers.MaxPooling2D(),

tf.keras.layers.Conv2D(128, 3, activation=’relu’),

tf.keras.layers.MaxPooling2D(),

tf.keras.layers.Flatten(),

tf.keras.layers.Dense(256, activation=’relu’),

tf.keras.layers.Dense(3, activation= ‘softmax’)

])

model.compile(optimizer=tf.optimizers.RMSprop(learning_rate=0.001),loss=tf.losses.categorical_crossentropy,metrics=[‘accuracy’])

model.fit(train_data,epochs=12,validation_data=valiation_data)

This is my prediction code

#load the image

my_image = load_img(‘/Users/saibalaji/Documents/TensorFlowProjects/TomatoDataSet/train/Tomato___Septoria_leaf_spot/ffd3c6f3-17d3-45f1-a599-2623e111ec71___Matt.S_CG 6493.JPG’, target_size=(256, 256))

plt.imshow(my_image)

#preprocess the image

my_image = img_to_array(my_image)

expand_image = np.expand_dims(my_image, axis=0)

print(expand_image.shape)

#make the prediction

prediction = model.predict(expand_image)

plt.xlabel(class_labels[np.argmax(prediction)])

https://preview.redd.it/364hamfitnp71.png?width=770&format=png&auto=webp&s=f27ae6c0fb5a703141a526d5ce755f5f74f13afd

As you can see my model validation accuracy is good but its classifications are really bad even for the images from training dataset. How can I solve this problem .

https://preview.redd.it/ebgy8k70tnp71.png?width=1732&format=png&auto=webp&s=7a45c1b3b34692f834c1658a78f992d76e78e123

submitted by /u/kudoshinichi-8211
[visit reddit] [comments]

Categories
Misc

I used tensorflow for implementing an AI that plays Subway Surfers (convolutional neural network for image classification)

I used tensorflow for implementing an AI that plays Subway Surfers (convolutional neural network for image classification) submitted by /u/nikp06
[visit reddit] [comments]
Categories
Misc

Should tensorflow have a dedicated server in a business environment?

Apologies if this information is readily available and I couldn’t find it. Would it be appropriate to set up a dedicated server strictly for tensorflow in a business environment? The goal would be classifying document imagines. The volume, I can only estimate, would be 1 to 4 million.

submitted by /u/travisadam0313
[visit reddit] [comments]

Categories
Misc

Accelerating Embedding with the HugeCTR TensorFlow Embedding Plugin

Recommender systems are the economic engine of the Internet. It is hard to imagine any other type of applications with more direct impact in our daily digital lives: Trillions of items to be recommended to billions of people. Recommender systems filter products and services among an overwhelming number of options, easing the paradox of choice … Continued

Recommender systems are the economic engine of the Internet. It is hard to imagine any other type of applications with more direct impact in our daily digital lives: Trillions of items to be recommended to billions of people. Recommender systems filter products and services among an overwhelming number of options, easing the paradox of choice that most users face.

As the amount of data increases, deep learning (DL) recommender systems are starting to show advantages over traditional machine learning-based approaches, such as gradient boosted trees. To give a concrete data point, recently, the NVIDIA RAPIDS.AI team won three recommendation competitions with DL:

This was not consistently happening even just a year before, when NVIDIA data scientists asked, Why Are Deep Learning Models Not Consistently Winning Recommender Systems Competitions Yet?.

Embeddings play a critical role in modern DL-based recommender architectures, encoding individual information for billions of entities (users, products, and their characteristics). As the amount of data increases, so does the size of the embedding tables, now spanning multiple GBs to TBs. There are unique challenges in training this type of DL system, with its huge embedding tables with sparse access patterns spanning potentially multiple GPUs, if not nodes.

This post focuses on how the NVIDIA Merlin recommendation system framework addresses these challenges and introduces an optimized embedding implementation that is up to 8x more performant than other frameworks’ embedding layers. This optimized implementation is also made available as a TensorFlow plugin that works seamlessly with TensorFlow and acts as a convenient drop-in replacement for the TensorFlow native embedding layers.

Embeddings

Embedding is a machine learning technique that represents each object of interest (users, products, categories, and so on) as a dense numerical vector. Embedding tables are hence nothing other than a specific type of key-value store, with keys being the ID used to uniquely identify objects and values being vectors of real numbers.

Embedding is a key building block in modern DL recommender systems, typically lying immediately after the input layer and before “feature interaction” and dense layers. Embedding layers are learned from data and end-to-end training, just like other layers of a deep neural network. It is the embedding layers that differentiate DL recommender models from other types of DL workloads: they contribute an enormous number of parameters to the model but require little to no computation, while the compute-intensive dense layers have a much smaller number of parameters.

Take a specific example: The original Wide and Deep model has several dense layers of size [1024, 512, 256], hence only a few million parameters, while its embedding layers can have billions of entries, and multiple billions of parameters. This contrasts with, for example, a BERT model architecture popular in the NLP domain, where the embedding layer has only tens of thousands of entries amounting to several millions of parameters, but the dense feed-forward and attention layers consist of several hundreds of millions of parameters. This differentiation also leads to another observation: the amount of compute per byte of input data for DL recommender networks is typically much smaller compared to other types of DL models.

Why optimizing embeddings matters for recommender workflows

To understand why optimization of the embedding layer and related operations matters, here are the challenges of training embeddings: size and access speed.

Size

With online platforms and services acquiring hundreds of millions to even billions of users, and with the number of unique products on offer reaching billions, it is not surprising that embedding tables are increasing in size.

Instagram reportedly has been working on recommender models reaching 10 TB in size. Likewise, Baidu reported an ad-ranking model which also reached the 10 TB realm. Across the industry, models in the hundreds of GBs to TBs are becoming increasingly popular, such as Pinterest’s 4-TB model and Google’s 1.2-TB model.

Naturally, it presents a significant challenge fitting a TB-scale model on a single node of compute, let alone a single compute accelerator. For reference, the largest NVIDIA A100 GPU is currently equipped with 80 GB of HBM.

Access speed

Training recommender systems is inherently a memory bandwidth-intensive task. This is because each training sample or batch usually involves a small number of entities in the embedding tables. These entries must be retrieved to calculate the forward pass, then updated in the backward pass.

The CPU main memory has high capacity but limited bandwidth, with high-end models typically in the high tens of GB/s range. The GPU, on the other hand, has limited memory capacity but high bandwidth. An NVIDIA A100 80-GB GPU offers 2 TB/s of memory bandwidth.

Solutions

These challenges have been addressed in different ways. For example, keeping the entire embedding table on main memory solves the size issue. However, it most often results in extremely slow training throughput that is often dwarfed by the amount and velocity of new data, forbidding the system to be retrained in a timely manner.

Alternatively, the embedding can be carefully spread across multiple GPUs and multiple nodes, only to be bogged down by the communication bottleneck, resulting in sustained severe GPU-compute under-utilization and training performance just on par with pure CPU training.

The embedding layer is one of the major bottlenecks in recommender systems. Optimizing the embedding layer is key to unlocking the GPU’s high compute throughput.

In the next section, we discuss how the NVIDIA Merlin HugeCTR recommender framework solves the challenges of large-scale embeddings, by using NVIDIA technologies such as GPUDirect remote direct memory access (RDMA), NVIDIA Collective Communications Library (NCCL), NVLink, and NVSwitch. It unlocks both the high-compute and high-bandwidth capacity of the GPU, while addressing the memory capacity problem with out-of-the-box, multi-GPU, multinode support and model parallelism.

Overview of NVIDIA Merlin HugeCTR embeddings

NVIDIA Merlin addresses the challenges of training large-scale recommender systems. It’s an end-to-end recommender framework that accelerates all phases of recommendation system development, from data preprocessing to training and inference. NVIDIA Merlin HugeCTR is an open-source, recommender system, dedicated DL framework. In this post, we focus on one specific aspect of HugeCTR: embedding optimization.

There are two ways to leverage the embedding optimization work in HugeCTR:

  • Using the native NVIDIA Merlin HugeCTR framework for your training and inference workloads
  • Using the NVIDIA Merlin HugeCTR TensorFlow plugin, which is designed to work seamlessly with TensorFlow

Native HugeCTR embedding optimization

To overcome the embedding challenges and enable faster training, HugeCTR implemented its own embedding layer, which includes a GPU-accelerated hash table, efficient sparse optimizers implemented in a memory-saving manner, and various embedding distribution strategies. It harnesses NCCL as its inter-GPU communication primitives.

The hash-table implementation is based on RAPIDS cuDF, which is a GPU DataFrame library, forming part of the RAPIDS data science platform from NVIDIA. The cuDF GPU hash table can achieve up to 35x speedup over CPU implementation, such as the concurrent_hash_map from Threading Building Blocks (TBB). For more information, see Introducing NVIDIA Merlin HugeCTR: A Training Framework Dedicated to Recommender Systems.

Built with scalability in mind, HugeCTR supports model parallelism for the embedding layer by default. The embedding tables are distributed across the available GPUs and nodes. The dense layers, on the other hand, employ data parallelism (Figure 1).

HugeCTR employs model parallelism for the embedding layer by default. The embedding tables will be distributed across the available GPUs
Figure 1. Embedding layer parallelism in HugeCTR

The Tencent recommendation team is one of the first adopters of the native HugeCTR framework, making heavy use of its native embedding layers. In a recent interview, Xiangting Kong, lead of the Tencent Advertising and Deep Learning Platform said, “HugeCTR, as a recommendation training framework, is integrated into the [Tencent] advertising recommendation training system to make the update frequency of model training faster, and more samples can be trained to improve online effects.”

HugeCTR TensorFlow plugin

All components of the NVIDIA Merlin framework are open-source and designed to be interoperable with the larger deep learning and data science ecosystem. Our long-term vision is to accelerate recommendation workloads on the GPU, regardless of your preferred framework. The HugeCTR TensorFlow embedding plugin was created as a step towards realizing this goal.

At a high level, the TensorFlow embedding plugin is designed by leveraging many of the same embedding optimization techniques that were employed for the native HugeCTR embedding layer. In particular, this would be the GPU hash table and NCCL under the hood for inter-GPU communication.

The HugeCTR embedding plugin is designed to work conveniently and seamlessly with TensorFlow as a drop in replacement for the TensorFlow-native embedding layers, such as tf.nn.embedding_lookup and tf.nn.embedding_lookup_sparse. It also offers advanced features out of the box, such as model parallelism that distributes the embedding tables over multiple GPUs.

NVIDIA Merlin HugeCTR TensorFlow plugin walkthrough 

Here’s how to make use of the TensorFlow embedding plugin. The full example is available at the HugeCTR repository, where we also provide a full benchmarking notebook for reproducing the performance figures.

The most convenient way to access the HugeCTR embedding plugin is through using the NGC NVIDIA Merlin TensorFlow training Docker image, in which it is precompiled and installed, along with other components of the NVIDIA Merlin framework, as well as TensorFlow. The most up-to-date version can be pulled directly from the HugeCTR repository, compiled and installed on the fly. When TensorFlow is updated, the plugin must also be recompiled for the newly installed TensorFlow version.

For comparison, here’s how the native TensorFlow embedding layers are used. First, you initialize a 2D-array variable to hold the value of the embeddings. Then, use tf.nn.embedding_lookup to look up the embedding value corresponding to a list of IDs.

embedding_var = tf.Variable(initial_value=initial_value, dtype=tf.float32, name='embedding_variables')

@tf.function
def _train_step(inputs, labels):
    emb_vectors = tf.nn.embedding_lookup([self.embedding_var], inputs)
    ...

for i, (inputs, labels) in enumerate(dataset):
    _train_step(inputs)

In the same fashion, the HugeCTR embedding plugin can be employed. First, you initialize an embedding layer. Next, this embedding layer is used to look up the corresponding embedding values for a list of IDs.

import sparse_operation_kit as sok

emb_layer = sok.All2AllDenseEmbedding(max_vocabulary_size_per_gpu,
                                      embedding_vec_size,
                                      slot_num, nnz_per_slot)

@tf.function
def _train_step(inputs, labels):
    emb_vectors = emb_layer(inputs)
    ...

for i, (inputs, labels) in enumerate(dataset):
    _train_step(inputs)

The HugeCTR embedding plugin is designed to work seamlessly with TensorFlow, including other layers and optimizers such as Adam and sgd. Before TensorFlow v2.5, the Adam optimizer was a CPU-based implementation.

To fully realize the potential of the HugeCTR embedding plugin, we also provide a GPU-based plugin_adam version in sok.optimizers.Adam. Starting from TensorFlow v2.5, the standard Adam optimizer tf.keras.optimizers.Adam, which now comes with a GPU implementation, can be used with similar accuracy and performance.

Performance benchmark

In this section, we showcase the performance of the HugeCTR TensorFlow embedding plugin through synthetic and real use cases.

Synthetic data

In this example, we use a synthetic dataset with 100 feature fields, each with 10 lookups, and a vocabulary size of 8192. The recommender model is an MLP with six layers, each of size 1024. Using the exact model architecture, optimizer, and data loader in TensorFlow, we observed that on 1x A100 GPU, the HugeCTR embedding plugin improves the average iteration time by 7.9x compared to the native TensorFlow embedding layer (Figure 2).

When being strong-scaled from one to four A100 GPUs, we observed a total speedup of 23.6x. This benefit of multi-GPU scaling is provided by the HugeCTR embedding plugin by default. Under the hood, the embedding plugin automatically distributes the table corresponding to feature fields on to the available GPUs in a model parallel fashion. This contrasts with the native TensorFlow embedding layer, where a significant extra effort is required for distributed model-parallel multi-GPU training. The TensorFlow distribution strategies, MirroredStrategy and MultiWorkerMirroredStrategy, are both designed to do data-parallel synchronized training.

HugeCTR TensorFlow plugin provides a 7.9x speedup over native TensorFlow 2.5 embedding lookup layer.
Figure 2. HugeCTR TensorFlow embedding plugin performance on NVIDIA DGX A100 80GB on synthetic dataset

Real use case: Meituan recommender systems

The Meituan recommender systems team is one of the first teams to adopt the HugeCTR TensorFlow plugin with great success. At first, the team optimized their training framework based on CPU, but as their models became more and more complex, it was difficult to optimize the training framework more deeply. Now, Meituan is working on integrating NVIDIA HugeCTR into their training system based on A100 GPUs.

“A single server with 8x A100 GPUs can replace hundreds of workers in the CPU based training system. The cost is also greatly reduced. This is a preliminary optimization result, and there is still much room to optimize in the future,” shared Jun Huang, senior technical expert at Meituan.

Meituan used DIEN as the recommendation model. The total number of embedding parameters is tens of billions, and there are thousands of feature fields in each sample. As the range of input features is not fixed and unknown in advance, the team uses hash tables to uniquely identify each input feature before feeding into an embedding layer.

Using the exact model architecture, optimizer, and data loader in TensorFlow, we observed that on a single A100 GPU, the HugeCTR embedding plugin achieved a 11.5x speedup compared to the original TensorFlow embedding. With weak scaling, the iteration time on 8x A100 GPUs only increased slightly to 1.17x that of 1x A100 GPU (Figure 3).

On a real use case, HugeCTR TensorFlow plugin provides a 11.6x speedup over native TensorFlow 2.5 embedding lookup layer.
Figure 3. HugeCTR TensorFlow embedding plugin performance on NVIDIA DGX A100 80GB on Meituan data

Conclusion

The HugeCTR TensorFlow embedding plugin is available today from the HugeCTR GitHub repository, as well as from the NGC NVIDIA Merlin TensorFlow container. If you are a TensorFlow user looking to build and deploy large-scale recommender systems with large embedding tables, the HugeCTR TensorFlow plugin is a great effortless drop-in replacement for TensorFlow embedding lookup layers.

Try it out to see the full potential of your GPUs unlocked. If you feel that you need even more performance and optimization, then the full-fledged native HugeCTR framework might be the next thing that you want to try.