
Is TensorFlow right for my use case?

I don’t want to share a lot of details, but I’ve been working on a project that uses OpenCV for basic object detection (mostly through template matching) and Tesseract OCR to read text, all in a video game. As you’d expect, many of the objects are nearly identical 2D video game assets, so detection using template matching in OpenCV is very accurate.

I’ve been interested in exploring whether TensorFlow would be appropriate for my use case.

I have the following use case / pattern that I’m looking to implement, and I’d like your opinion on whether TensorFlow is an appropriate framework for it, or whether I should continue to use OpenCV/Tesseract:

  1. Image classification – I first need to determine which type of image I’m looking at so I know what kind of processing to do. There are 3-4 classes of images I’m interested in.
  2. There’s one class of image where I need to perform both object detection and OCR
    1. Object detection/tracking would be for half a dozen objects on the screen at any given time.
    2. OCR can be done using Tesseract if necessary, as my experiments with TensorFlow OCR implementations have yielded pretty poor accuracy.
    3. Text is always expected to be of a certain format and in the same region of the screen
  3. There’s another class of image where I need to perform only OCR; the OCR properties from the previous class still apply (Tesseract if necessary, text always in a known format and in the same region of the screen).
  4. Finally, the last image class tells me that no object detection or OCR needs to be performed. (A rough sketch of this overall flow is below.)
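Roughly, here’s the flow I have in mind, as a minimal sketch (the classifier file, class names, templates, and text region are all placeholders, not my actual setup):

```python
# Minimal sketch of the intended pipeline. Model file, class names, templates,
# and the text ROI are placeholders.
import cv2
import numpy as np
import pytesseract
import tensorflow as tf

CLASS_NAMES = ["needs_detection_and_ocr", "needs_ocr_only", "ignore"]   # placeholder classes
classifier = tf.keras.models.load_model("screen_classifier.h5")         # small CNN trained on labeled screenshots

def classify_frame(frame_bgr):
    """Step 1: decide which kind of screen we're looking at."""
    img = cv2.resize(frame_bgr, (224, 224)).astype("float32") / 255.0
    probs = classifier.predict(img[None, ...], verbose=0)[0]
    return CLASS_NAMES[int(np.argmax(probs))]

def detect_objects(frame_gray, templates, threshold=0.9):
    """Step 2a: existing template-matching detection, unchanged."""
    hits = []
    for name, tmpl in templates.items():
        res = cv2.matchTemplate(frame_gray, tmpl, cv2.TM_CCOEFF_NORMED)
        ys, xs = np.where(res >= threshold)
        hits.extend((name, int(x), int(y)) for x, y in zip(xs, ys))
    return hits

def read_text(frame_bgr, roi):
    """Steps 2b/3: OCR restricted to the known screen region."""
    x, y, w, h = roi
    crop = frame_bgr[y:y + h, x:x + w]
    return pytesseract.image_to_string(crop, config="--psm 7")

def process(frame_bgr, templates, text_roi):
    kind = classify_frame(frame_bgr)
    if kind == "ignore":            # step 4: nothing to do for this class
        return None
    result = {"text": read_text(frame_bgr, text_roi)}
    if kind == "needs_detection_and_ocr":
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        result["objects"] = detect_objects(gray, templates)
    return result
```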

My questions are:

  1. Is TensorFlow right for me? Why or why not? What are the tradeoffs (besides the time to tag custom datasets)?
  2. If TensorFlow is right for me, how would I model the above logic? Should I just detect everything and program business logic based on what’s detected, or is there a good way to model this pipeline in TensorFlow?

Feedback is appreciated. Thanks.

submitted by /u/UtesCartman


Google Research: Themes from 2021 and Beyond

Over the last several decades, I’ve witnessed a lot of change in the fields of machine learning (ML) and computer science. Early approaches, which often fell short, eventually gave rise to modern approaches that have been very successful. Following that long-arc pattern of progress, I think we’ll see a number of exciting advances over the next several years, advances that will ultimately benefit the lives of billions of people with greater impact than ever before. In this post, I’ll highlight five areas where ML is poised to have such impact. For each, I’ll discuss related research (mostly from 2021) and the directions and progress we’ll likely see in the next few years.


  · Trend 1: More Capable, General-Purpose ML Models
  · Trend 2: Continued Efficiency Improvements for ML
  · Trend 3: ML Is Becoming More Personally and Communally Beneficial
  · Trend 4: Growing Benefits of ML in Science, Health and Sustainability
  · Trend 5: Deeper and Broader Understanding of ML

Trend 1: More Capable, General-Purpose ML Models
Researchers are training larger, more capable machine learning models than ever before. For example, just in the last couple of years models in the language domain have grown from billions of parameters trained on tens of billions of tokens of data (e.g., the 11B parameter T5 model), to hundreds of billions or trillions of parameters trained on trillions of tokens of data (e.g., dense models such as OpenAI’s 175B parameter GPT-3 model and DeepMind’s 280B parameter Gopher model, and sparse models such as Google’s 600B parameter GShard model and 1.2T parameter GLaM model). These increases in dataset and model size have led to significant increases in accuracy for a wide variety of language tasks, as shown by across-the-board improvements on standard natural language processing (NLP) benchmark tasks (as predicted by work on neural scaling laws for language models and machine translation models).
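The scaling-law work referenced above fits held-out loss as a power law in model size and in dataset size; schematically, something like the following (the constants and exponents are fit empirically and are not quoted from this post):

```latex
% Schematic single-variable power laws from the neural scaling-law literature:
% N = model parameters, D = training tokens; N_c, D_c, \alpha_N, \alpha_D are empirically fit constants.
L(N) \approx \left(\tfrac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\tfrac{D_c}{D}\right)^{\alpha_D}
```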

Many of these advanced models are focused on the single but important modality of written language and have shown state-of-the-art results in language understanding benchmarks and open-ended conversational abilities, even across multiple tasks in a domain. They have also shown exciting capabilities to generalize to new language tasks with relatively little training data, in some cases, with few to no training examples for a new task. A couple of examples include improved long-form question answering, zero-label learning in NLP, and our LaMDA model, which demonstrates a sophisticated ability to carry on open-ended conversations that maintain significant context across multiple turns of dialog.

A dialog with LaMDA mimicking a Weddell seal with the preset grounding prompt, “Hi I’m a weddell seal. Do you have any questions for me?” The model largely holds down a dialog in character.
(Weddell Seal image cropped from Wikimedia CC licensed image.)

Transformer models are also having a major impact in image, video, and speech models, all of which also benefit significantly from scale, as predicted by work on scaling laws for visual transformer models. Transformers for image recognition and for video classification are achieving state-of-the-art results on many benchmarks, and we’ve also demonstrated that co-training models on both image data and video data can improve performance on video tasks compared with video data alone. We’ve developed sparse, axial attention mechanisms for image and video transformers that use computation more efficiently, found better ways of tokenizing images for visual transformer models, and improved our understanding of visual transformer methods by examining how they operate compared with convolutional neural networks. Combining transformer models with convolutional operations has shown significant benefits in visual as well as speech recognition tasks.

The outputs of generative models are also substantially improving. This is most apparent in generative models for images, which have made significant strides over the last few years. For example, recent models have demonstrated the ability to create realistic images given just a category (e.g., “irish setter” or “streetcar”, if you desire), can “fill in” a low-resolution image to create a natural-looking high-resolution counterpart (“computer, enhance!”), and can even create natural-looking aerial nature scenes of arbitrary length. As another example, images can be converted to a sequence of discrete tokens that can then be synthesized at high fidelity with an autoregressive generative model.

Example of cascade diffusion models that generate novel images from a given category and then use those as the seed to create high-resolution examples: the first model generates a low-resolution image, and the rest perform upsampling to the final high-resolution image.
The SR3 super-resolution diffusion model takes as input a low-resolution image, and builds a corresponding high resolution image from pure noise.

Because these are powerful capabilities that come with great responsibility, we carefully vet potential applications of these sorts of models against our AI Principles.

Beyond advanced single-modality models, we are also starting to see large-scale multi-modal models. These are some of the most advanced models to date because they can accept multiple different input modalities (e.g., language, images, speech, video) and, in some cases, produce different output modalities, for example, generating images from descriptive sentences or paragraphs, or describing the visual content of images in human languages. This is an exciting direction because like the real world, some things are easier to learn in data that is multimodal (e.g., reading about something and seeing a demonstration is more useful than just reading about it). As such, pairing images and text can help with multi-lingual retrieval tasks, and better understanding of how to pair text and image inputs can yield improved results for image captioning tasks. Similarly, jointly training on visual and textual data can also help improve accuracy and robustness on visual classification tasks, while co-training on image, video, and audio tasks improves generalization performance for all modalities. There are also tantalizing hints that natural language can be used as an input for image manipulation, telling robots how to interact with the world and controlling other software systems, portending potential changes to how user interfaces are developed. Modalities handled by these models will include speech, sounds, images, video, and languages, and may even extend to structured data, knowledge graphs, and time series data.

Example of a vision-based robotic manipulation system that is able to generalize to novel tasks. Left: The robot is performing a task described in natural language to the robot as “place grapes in ceramic bowl”, without the model being trained on that specific task. Right: As on the left, but with the novel task description of “place bottle in tray”.

Often these models are trained using self-supervised learning approaches, where the model learns from observations of “raw” data that has not been curated or labeled, e.g., language models used in GPT-3 and GLaM, the self-supervised speech model BigSSL, the visual contrastive learning model SimCLR, and the multimodal contrastive model VATT. Self-supervised learning allows a large speech recognition model to match the previous Voice Search automatic speech recognition (ASR) benchmark accuracy while using only 3% of the annotated training data. These trends are exciting because they can substantially reduce the effort required to enable ML for a particular task, and because they make it easier (though by no means trivial) to train models on more representative data that better reflects different subpopulations, regions, languages, or other important dimensions of representation.

All of these trends are pointing in the direction of training highly capable general-purpose models that can handle multiple modalities of data and solve thousands or millions of tasks. By building in sparsity, so that the only parts of a model that are activated for a given task are those that have been optimized for it, these multimodal models can be made highly efficient. Over the next few years, we are pursuing this vision in a next-generation architecture and umbrella effort called Pathways. We expect to see substantial progress in this area, as we combine together many ideas that to date have been pursued relatively independently.

Pathways: a depiction of a single model we are working towards that can generalize across millions of tasks.


Trend 2: Continued Efficiency Improvements for ML
Improvements in efficiency — arising from advances in computer hardware design as well as ML algorithms and meta-learning research — are driving greater capabilities in ML models. Many aspects of the ML pipeline, from the hardware on which a model is trained and executed to individual components of the ML architecture, can be optimized for efficiency while maintaining or improving on state-of-the-art performance overall. Each of these different threads can improve efficiency by a significant multiplicative factor, and taken together, can reduce computational costs, including CO2 equivalent emissions (CO2e), by orders of magnitude compared to just a few years ago. This greater efficiency has enabled a number of critical advances that will continue to dramatically improve the efficiency of machine learning, enabling larger, higher quality ML models to be developed cost effectively and further democratizing access. I’m very excited about these directions of research!

Continued Improvements in ML Accelerator Performance

Each generation of ML accelerator improves on previous generations, enabling faster performance per chip, and often increasing the scale of the overall systems. Last year, we announced our TPUv4 systems, the fourth generation of Google’s Tensor Processing Unit, which demonstrated a 2.7x improvement over comparable TPUv3 results in the MLPerf benchmarks. Each TPUv4 chip has ~2x the peak performance per chip versus the TPUv3 chip, and the scale of each TPUv4 pod is 4096 chips (4x that of TPUv3 pods), yielding a performance of approximately 1.1 exaflops per pod (versus ~100 petaflops per TPUv3 pod). Having pods with larger numbers of chips that are connected together with high speed networks improves efficiency for larger models.
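For what it’s worth, those pod-level figures are self-consistent with the “~2x the peak performance per chip” claim; a quick check using only the numbers quoted above:

```python
# Back-of-the-envelope check using the pod-level figures quoted above.
tpuv4_pod_flops = 1.1e18      # ~1.1 exaflops per TPUv4 pod
tpuv3_pod_flops = 100e15      # ~100 petaflops per TPUv3 pod
tpuv4_chips_per_pod = 4096
tpuv3_chips_per_pod = tpuv4_chips_per_pod // 4   # TPUv4 pods are 4x the size of TPUv3 pods

per_chip_v4 = tpuv4_pod_flops / tpuv4_chips_per_pod   # ~2.7e14 (≈270 TFLOPS per chip)
per_chip_v3 = tpuv3_pod_flops / tpuv3_chips_per_pod   # ~9.8e13 (≈98 TFLOPS per chip)
print(per_chip_v4 / per_chip_v3)                      # ~2.7x, consistent with "~2x the peak performance per chip"
```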

ML capabilities on mobile devices are also increasing significantly. The Pixel 6 phone features a brand new Google Tensor processor that integrates a powerful ML accelerator to better support important on-device features.

Left: TPUv4 board; Center: Part of a TPUv4 pod; Right: Google Tensor chip found in Pixel 6 phones.

Our use of ML to accelerate the design of computer chips of all kinds (more on this below) is also paying dividends, particularly to produce better ML accelerators.

Continued Improvements in ML Compilation and Optimization of ML Workloads

Even when the hardware is unchanged, improvements in compilers and other optimizations in system software for machine learning accelerators can lead to significant improvements in efficiency. For example, “A Flexible Approach to Autotuning Multi-pass Machine Learning Compilers” shows how to use machine learning to perform auto-tuning of compilation settings to get across-the-board performance improvements of 5-15% (and sometimes as much as 2.4x improvement) for a suite of ML programs on the same underlying hardware. GSPMD describes an automatic parallelization system based on the XLA compiler that is capable of scaling most deep learning network architectures beyond the memory capacity of an accelerator and has been applied to many large models, such as GShard-M4, LaMDA, BigSSL, ViT, MetNet-2, and GLaM, leading to state-of-the-art results across several domains.

End-to-end model speedups from using ML-based compiler autotuning on 150 ML models. Included are models that achieve improvements of 5% or more. Bar colors represent relative improvement from optimizing different model components.

Human-Creativity–Driven Discovery of More Efficient Model Architectures

Continued improvements in model architectures give substantial reductions in the amount of computation needed to achieve a given level of accuracy for many problems. For example, the Transformer architecture, which we developed in 2017, was able to improve the state of the art on several NLP and translation benchmarks while simultaneously using 10x to 100x less computation to achieve these results than a variety of other prevalent methods, such as LSTMs and other recurrent architectures. Similarly, the Vision Transformer was able to show improved state-of-the-art results on a number of different image classification tasks despite using 4x to 10x less computation than convolutional neural networks.

Machine-Driven Discovery of More Efficient Model Architectures

Neural architecture search (NAS) can automatically discover new ML architectures that are more efficient for a given problem domain. A primary advantage of NAS is that it can greatly reduce the effort needed for algorithm development, because NAS requires only a one-time effort per search space and problem domain combination. In addition, while the initial effort to perform NAS can be computationally expensive, the resulting models can greatly reduce computation in downstream research and production settings, resulting in greatly reduced resource requirements overall. For example, the one-time search to discover the Evolved Transformer generated only 3.2 tons of CO2e (much less than the 284t CO2e reported elsewhere; see Appendix C and D in this joint Google/UC Berkeley preprint), but yielded a model for use by anyone in the NLP community that is 15-20% more efficient than the plain Transformer model. A more recent use of NAS discovered an even more efficient architecture called Primer (that has also been open-sourced), which reduces training costs by 4x compared to a plain Transformer model. In this way, the discovery costs of NAS searches are often recouped from the use of the more-efficient model architectures that are discovered, even if they are applied to only a handful of downstream uses (and many NAS results are reused thousands of times).

The Primer architecture discovered by NAS is 4x as efficient as a plain Transformer model. This image shows (in red) the two main modifications that give Primer most of its gains: depthwise convolution added to attention multi-head projections and squared ReLU activations (blue indicates portions of the original Transformer).
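A rough, framework-free illustration of those two modifications (this is only a sketch of the idea, not the released Primer code; shapes are simplified and the convolution is written as an explicit loop):

```python
import numpy as np

def squared_relu(x):
    # Primer replaces ReLU with squared ReLU in the feed-forward block.
    return np.maximum(x, 0.0) ** 2

def depthwise_conv_over_sequence(x, kernel):
    # x: [seq_len, heads, head_dim]; kernel: [kernel_size, heads, head_dim].
    # Each (head, channel) pair gets its own small causal 1D filter over the sequence
    # axis, applied to the multi-head Q/K/V projections before attention.
    seq_len, heads, dim = x.shape
    k = kernel.shape[0]
    padded = np.concatenate([np.zeros((k - 1, heads, dim)), x], axis=0)  # causal padding
    out = np.zeros_like(x)
    for t in range(seq_len):
        window = padded[t:t + k]              # [k, heads, head_dim]
        out[t] = np.sum(window * kernel, axis=0)
    return out
```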

NAS has also been used to discover more efficient models in the vision domain. The EfficientNetV2 model architecture is the result of a neural architecture search that jointly optimizes for model accuracy, model size, and training speed. On the ImageNet benchmark, EfficientNetV2 improves training speed by 5–11x while substantially reducing model size over previous state-of-the-art models. The CoAtNet model architecture was created with an architecture search that uses ideas from the Vision Transformer and convolutional networks to create a hybrid model architecture that trains 4x faster than the Vision Transformer and achieves a new ImageNet state of the art.

EfficientNetV2 achieves much better training efficiency than prior models for ImageNet classification.

The broad use of search to help improve ML model architectures and algorithms, including the use of reinforcement learning and evolutionary techniques, has inspired other researchers to apply this approach to different domains. To aid others in creating their own model searches, we have open-sourced Model Search, a platform that enables others to explore model search for their domains of interest. In addition to model architectures, automated search can also be used to find new, more efficient reinforcement learning algorithms, building on the earlier AutoML-Zero work that demonstrated this approach for automating supervised learning algorithm discovery.

Use of Sparsity

Sparsity, where a model has a very large capacity, but only some parts of the model are activated for a given task, example or token, is another important algorithmic advance that can greatly improve efficiency. In 2017, we introduced the sparsely-gated mixture-of-experts layer, which demonstrated better results on a variety of translation benchmarks while using 10x less computation than previous state-of-the-art dense LSTM models. More recently, Switch Transformers, which pair a mixture-of-experts–style architecture with the Transformer model architecture, demonstrated a 7x speedup in training time and efficiency over the dense T5-Base Transformer model. The GLaM model showed that transformers and mixture-of-expert–style layers can be combined to produce a model that exceeds the accuracy of the GPT-3 model on average across 29 benchmarks using 3x less energy for training and 2x less computation for inference. The notion of sparsity can also be applied to reduce the cost of the attention mechanism in the core Transformer architecture.
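As a minimal sketch of the routing idea behind these mixture-of-experts layers (Switch-style top-1 gating; real implementations add capacity limits, load-balancing losses, and expert parallelism, none of which appear here):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, num_experts = 64, 256, 4

# Each expert is its own feed-forward block; only the selected expert runs per token.
experts = [(rng.normal(size=(d_model, d_ff)) * 0.02,
            rng.normal(size=(d_ff, d_model)) * 0.02) for _ in range(num_experts)]
router = rng.normal(size=(d_model, num_experts)) * 0.02

def switch_layer(tokens):                      # tokens: [num_tokens, d_model]
    logits = tokens @ router                   # [num_tokens, num_experts]
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    choice = probs.argmax(axis=-1)             # top-1 expert per token
    out = np.zeros_like(tokens)
    for e, (w_in, w_out) in enumerate(experts):
        mask = choice == e
        if mask.any():
            h = np.maximum(tokens[mask] @ w_in, 0.0)          # expert feed-forward
            out[mask] = probs[mask, e:e + 1] * (h @ w_out)    # scale by router probability
    return out

print(switch_layer(rng.normal(size=(8, d_model))).shape)      # (8, 64)
```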

The BigBird sparse attention model consists of global tokens that attend to all parts of an input sequence, local tokens, and a set of random tokens. Theoretically, this can be interpreted as adding a few global tokens on a Watts-Strogatz graph.
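A small sketch of how an attention mask with those three ingredients can be assembled (token-level rather than the block-level formulation used in the paper; the sizes here are arbitrary):

```python
import numpy as np

def bigbird_style_mask(seq_len, num_global=2, window=3, num_random=2, seed=0):
    """Boolean [seq_len, seq_len] mask: True where attention is allowed."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    # Global tokens attend everywhere and are attended to by every token.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    for i in range(seq_len):
        # Local sliding window around each token.
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
        # A few random long-range connections per token.
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True
    return mask

print(bigbird_style_mask(16).sum(), "allowed entries out of", 16 * 16)
```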

The use of sparsity in models is clearly an approach with very high potential payoff in terms of computational efficiency, and we are only scratching the surface in terms of research ideas to be tried in this direction.

Each of these approaches for improved efficiency can be combined together so that equivalent-accuracy language models trained today in efficient data centers are ~100 times more energy efficient and produce ~650 times less CO2e emissions, compared to a baseline Transformer model trained using P100 GPUs in an average U.S. datacenter using an average U.S. energy mix. And this doesn’t even account for Google’s carbon-neutral, 100% renewable energy offsets. We’ll have a more detailed blog post analyzing the carbon emissions trends of NLP models soon.


Trend 3: ML Is Becoming More Personally and Communally Beneficial
A host of new experiences are made possible as innovation in ML and silicon hardware (like the Google Tensor processor on the Pixel 6) enable mobile devices to be more capable of continuously and efficiently sensing their surrounding context and environment. These advances have improved accessibility and ease of use, while also boosting computational power, which is critical for popular features like mobile photography, live translation and more. Remarkably, recent technological advances also provide users with a more customized experience while strengthening privacy safeguards.

More people than ever rely on their phone cameras to record their daily lives and for artistic expression. The clever application of ML to computational photography has continued to advance the capabilities of phone cameras, making them easier to use, improving performance, and resulting in higher-quality images. Advances, such as improved HDR+, the ability to take pictures in very low light, better handling of portraits, and efforts to make cameras more inclusive so they work for all skin tones, yield better photos that are more true to the photographer’s vision and to their subjects. Such photos can be further improved using the powerful ML-based tools now available in Google Photos, like cinematic photos, noise and blur reduction, and the Magic Eraser.

HDR+ starts from a burst of full-resolution raw images, each underexposed by the same amount (left). The merged image has reduced noise and increased dynamic range, leading to a higher quality final result (right).

In addition to using their phones for creative expression, many people rely on them to help communicate with others across languages and modalities in real-time using Live Translate in messaging apps and Live Caption for phone calls. Speech recognition accuracy has continued to make substantial improvements thanks to techniques like self-supervised learning and noisy student training, with marked improvements for accented speech, noisy conditions or environments with overlapping speech, and across many languages. Building on advances in text-to-speech synthesis, people can listen to web pages and articles using our Read Aloud technology on a growing number of platforms, making information more available across barriers of modality and languages. Live speech translations in the Google Translate app have become significantly better by stabilizing the translations that are generated on-the-fly, and high quality, robust and responsible direct speech-to-speech translation provides a much better user experience in communicating with people speaking a different language. New work on combining ML with traditional codec approaches in the Lyra speech codec and the more general SoundStream audio codec enables higher fidelity speech, music, and other sounds to be communicated reliably at much lower bitrate.

Everyday interactions are becoming much more natural with features like automatic call screening and ML agents that will wait on hold for you, thanks to advances in Duplex. Even short tasks that users may perform frequently have been improved with tools such as Smart Text Selection, which automatically selects entities like phone numbers or addresses for easy copy and pasting, and grammar correction as you type on Pixel 6 phones. In addition, Screen Attention prevents the phone screen from dimming when you are looking at it and improvements in gaze recognition are opening up new use cases for accessibility and for improved wellness and health. ML is also enabling new methods for ensuring the safety of people and communities. For example, Suspicious Message Alerts warn against possible phishing attacks and Safer Routing detects hard-braking events to suggest alternate routes.

Recent work demonstrates that gaze recognition can serve as an important biomarker of mental fatigue.

Given the potentially sensitive nature of the data that underlies these new capabilities, it is essential that they are designed to be private by default. Many of them run inside of Android’s Private Compute Core — an open source, secure environment isolated from the rest of the operating system. Android ensures that data processed in the Private Compute Core is not shared to any apps without the user taking an action. Android also prevents any feature inside the Private Compute Core from having direct access to the network. Instead, features communicate over a small set of open-source APIs to Private Compute Services, which strips out identifying information and makes use of privacy technologies, including federated learning, federated analytics, and private information retrieval, enabling learning while simultaneously ensuring privacy.

Federated Reconstruction is a novel partially local federated learning technique in which models are partitioned into global and local parameters. For each round of Federated Reconstruction training: (1) The server sends the current global parameters g to each user i; (2) Each user i freezes g and reconstructs their local parameters l_i; (3) Each user i freezes l_i and updates g to produce g_i; (4) Users’ g_i are averaged to produce the global parameters for the next round.
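Those four steps map onto a very small simulation like the one below (a plain-NumPy toy with a quadratic objective; the global and local parameters are just vectors, and nothing here is the production implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, dim, lr = 5, 8, 0.1
g = np.zeros(dim)                                  # global parameters
user_data = [rng.normal(size=dim) for _ in range(num_users)]

def loss_grads(g, l, x):
    # Toy objective: the user's data x should be explained by g + l.
    resid = g + l - x
    return resid, resid                            # gradient wrt g and wrt l

for _ in range(10):                                # rounds of Federated Reconstruction
    updates = []
    for x in user_data:                            # (1) server sends g to each user
        l = np.zeros(dim)                          # (2) freeze g, reconstruct local params from scratch
        for _ in range(5):
            _, grad_l = loss_grads(g, l, x)
            l -= lr * grad_l
        g_i = g.copy()                             # (3) freeze l_i, update the global params locally
        for _ in range(5):
            grad_g, _ = loss_grads(g_i, l, x)
            g_i -= lr * grad_g
        updates.append(g_i)                        # local parameters never leave the device
    g = np.mean(updates, axis=0)                   # (4) average users' g_i into the next global model
```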

These technologies are critical to evolving next-generation computation and interaction paradigms, whereby personal or communal devices can both learn from and contribute to training a collective model of the world without compromising privacy. A federated unsupervised approach to privately learn the kinds of aforementioned general-purpose models with fine-tuning for a given task or context could unlock increasingly intelligent systems that are far more intuitive to interact with — more like a social entity than a machine. Broad and equitable access to these intelligent interfaces will only be possible with deep changes to our technology stacks, from the edge to the datacenter, so that they properly support neural computing.


Trend 4: Growing Benefits of ML in Science, Health and Sustainability
In recent years, we have seen an increasing impact of ML in the basic sciences, from physics to biology, with a number of exciting practical applications in related realms, such as renewable energy and medicine. Computer vision models have been deployed to address problems at both personal and global scales. They can assist physicians in their regular work, expand our understanding of neural physiology, and also provide better weather forecasts and streamline disaster relief efforts. Other types of ML models are proving critical in addressing climate change by discovering ways to reduce emissions and improving the output of alternative energy sources. Such models can even be leveraged as creative tools for artists! As ML becomes more robust, well-developed, and widely accessible, its potential for high-impact applications in a broad array of real-world domains continues to expand, helping to solve some of our most challenging problems.

Large-Scale Application of Computer Vision for New Insights

The advances in computer vision over the past decade have enabled computers to be used for a wide variety of tasks across different scientific domains. In neuroscience, automated reconstruction techniques can recover the neural connective structure of brain tissues from high resolution electron microscopy images of thin slices of brain tissue. In previous years, we have collaborated to create such resources for fruit fly, mouse, and songbird brains, but last year, we collaborated with the Lichtman Lab at Harvard University to analyze the largest sample of brain tissue imaged and reconstructed in this level of detail, in any species, and produced the first large-scale study of synaptic connectivity in the human cortex that spans multiple cell types across all layers of the cortex. The goal of this work is to produce a novel resource to assist neuroscientists in studying the stunning complexity of the human brain. The image below, for example, shows six neurons out of about 86 billion neurons in an adult human brain.

A single human chandelier neuron from our human cortex reconstruction, along with some of the pyramidal neurons that make a connection with that cell. Here’s an interactive version and a gallery of other interactive examples.

Computer vision technology also provides powerful tools to address challenges at much larger, even global, scales. A deep-learning–based approach to weather forecasting that uses satellite and radar imagery as inputs, combined with other atmospheric data, produces weather and precipitation forecasts that are more accurate than traditional physics-based models at forecasting times up to 12 hours. They can also produce updated forecasts much more quickly than traditional methods, which can be critical in times of extreme weather.

Comparison of 0.2 mm/hr precipitation on March 30, 2020 over Denver, Colorado. Left: Ground truth, source MRMS. Center: Probability map as predicted by MetNet-2. Right: Probability map as predicted by the physics-based HREF model. MetNet-2 is able to predict the onset of the storm earlier in the forecast than HREF as well as the storm’s starting location, whereas HREF misses the initiation location, but captures its growth phase well.

Having an accurate record of building footprints is essential for a range of applications, from population estimation and urban planning to humanitarian response and environmental science. In many parts of the world, including much of Africa, this information wasn’t previously available, but new work shows that using computer vision techniques applied to satellite imagery can help identify building boundaries at continental scales. The results of this approach have been released in the Open Buildings dataset, a new open-access data resource that contains the locations and footprints of 516 million buildings with coverage across most of the African continent. We’ve also been able to use this unique dataset in our collaboration with the World Food Programme to provide fast damage assessment after natural disasters through application of ML.

Example of segmenting buildings in satellite imagery. Left: Source image; Center: Semantic segmentation, with each pixel assigned a confidence score that it is a building vs. non-building; Right: Instance segmentation, obtained by thresholding and grouping together connected components.
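The “thresholding and grouping together connected components” step in that caption corresponds to standard post-processing along these lines (a sketch only; the confidence map here is random stand-in data rather than real model output):

```python
import cv2
import numpy as np

# Stand-in for a per-pixel building-confidence map produced by the semantic segmentation model.
confidence = np.random.default_rng(0).random((256, 256)).astype(np.float32)

binary = (confidence >= 0.5).astype(np.uint8)                 # threshold into building / non-building
num_instances, labels = cv2.connectedComponents(binary)       # group touching pixels into instances

for inst in range(1, num_instances):                          # label 0 is background
    ys, xs = np.where(labels == inst)
    x, y = int(xs.min()), int(ys.min())
    w, h = int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1)
    # (x, y, w, h) is a rough footprint bounding box for one detected building instance.
```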

A common theme across each of these cases is that ML models are able to perform specialized tasks efficiently and accurately based on analysis of available visual data, supporting high impact downstream tasks.

Automated Design Space Exploration

Another approach that has yielded excellent results across many fields is to allow an ML algorithm to explore and evaluate a problem’s design space for possible solutions in an automated way. In one application, a Transformer-based variational autoencoder learns to create aesthetically-pleasing and useful document layouts, and the same approach can be extended to explore possible furniture layouts. Another ML-driven approach automates the exploration of the huge design space of tweaks for computer game rules to improve playability and other attributes of a game, enabling human game designers to create enjoyable games more quickly.

A visualization of the Variational Transformer Network (VTN) model, which is able to extract meaningful relationships between the layout elements (paragraphs, tables, images, etc.) in order to generate realistic synthetic documents (e.g., with better alignment and margins).

Other ML algorithms have been used to evaluate the design space of computer architectural decisions for ML accelerator chips themselves. We’ve also shown that ML can be used to quickly create chip placements for ASIC designs that are better than layouts generated by human experts and can be generated in a matter of hours instead of weeks. This reduces the fixed engineering costs of chips and lowers the barrier to quickly creating specialized hardware for different applications. We’ve successfully used this automated placement approach in the design of our upcoming TPU-v5 chip.

Such exploratory ML approaches have also been applied to materials discovery. In a collaboration between Google Research and Caltech, several ML models, combined with a modified inkjet printer and a custom-built microscope, were able to rapidly search over hundreds of thousands of possible materials to home in on 51 previously uncharacterized three-metal oxide materials with promising properties for applications in areas like battery technology and electrolysis of water.

These automated design space exploration approaches can help accelerate many scientific fields, especially when the entire experimental loop of generating the experiment and evaluating the result can all be done in an automated or mostly-automated manner. I expect to see this approach applied to good effect in many more areas in the coming years.

Application to Health

In addition to advancing basic science, ML can also drive advances in medicine and human health more broadly. The idea of leveraging advances in computer science in health is nothing new — in fact some of my own early experiences were in developing software to help analyze epidemiological data. But ML opens new doors, raises new opportunities, and yes, poses new challenges.

Take for example the field of genomics. Computing has been important to genomics since its inception, but ML adds new capabilities and disrupts old paradigms. When Google researchers began working in this area, the idea of using deep learning to help infer genetic variants from sequencer output was considered far-fetched by many experts. Today, this ML approach is considered state-of-the-art. But the future holds an even more important role for ML — genomics companies are developing new sequencing instruments that are more accurate and faster, but also present new inference challenges. Our release of open-source software DeepConsensus and, in collaboration with UCSC, PEPPER-DeepVariant, supports these new instruments with cutting-edge informatics. We hope that more rapid sequencing can lead to near term applicability with impact for real patients.

A schematic of the Transformer architecture for DeepConsensus, which corrects sequencing errors to improve yield and correctness.

There are other opportunities to use ML to accelerate our use of genomic information for personalized health outside of processing the sequencer data. Large biobanks of extensively phenotyped and sequenced individuals can revolutionize how we understand and manage genetic predisposition to disease. Our ML-based phenotyping method improves the scalability of converting large imaging and text datasets into phenotypes usable for genetic association studies, and our DeepNull method better leverages large phenotypic data for genetic discovery. We are happy to release both as open-source methods for the scientific community.

The process for generating large-scale quantification of anatomical and disease traits for combination with genomic data in Biobanks.

Just as ML helps us see hidden characteristics of genomics data, it can help us discover new information and glean new insights from other health data types as well. Diagnosis of disease is often about identifying a pattern, quantifying a correlation, or recognizing a new instance of a larger class — all tasks at which ML excels. Google researchers have used ML to tackle a wide range of such problems, but perhaps none of these has progressed farther than the applications of ML to medical imaging.

In fact, Google’s 2016 paper describing the application of deep learning to screening for diabetic retinopathy was selected by the editors of the Journal of the American Medical Association (JAMA) as one of the top 10 most influential papers of the decade — not just among papers on ML and health, but among all JAMA papers of the decade. And the strength of our research doesn’t end at contributions to the literature; it extends to our ability to build systems operating in the real world. Through our global network of deployment partners, this same program has helped screen tens of thousands of patients in India, Thailand, Germany and France who might otherwise have gone untested for this vision-threatening disease.

We expect to see this same pattern of assistive ML systems deployed to improve breast cancer screening, detect lung cancer, accelerate radiotherapy treatments for cancer, flag abnormal X-rays, and stage prostate cancer biopsies. Each domain presents new opportunities to be helpful. ML-assisted colonoscopy procedures are a particularly interesting example of going beyond the basics. Colonoscopies are not just used to diagnose colon cancer — the removal of polyps during the procedure is the front line of halting disease progression and preventing serious illness. In this domain we’ve demonstrated that ML can help ensure doctors don’t miss polyps, can help detect elusive polyps, and can add new dimensions of quality assurance, like coverage mapping through the application of simultaneous localization and mapping techniques. In collaboration with Shaare Zedek Medical Center in Jerusalem, we’ve shown these systems can work in real time, detecting an average of one polyp per procedure that would have otherwise been missed, with fewer than four false alarms per procedure.

Sample chest X-rays (CXR) of true and false positives, and true and false negatives for (A) general abnormalities, (B) tuberculosis, and (C) COVID-19. On each CXR, red outlines indicate areas on which the model focused to identify abnormalities (i.e., the class activation map), and yellow outlines refer to regions of interest identified by a radiologist.

Another ambitious healthcare initiative, Care Studio, uses state-of-the-art ML and advanced NLP techniques to analyze structured data and medical notes, presenting clinicians with the most relevant information at the right time — ultimately helping them deliver more proactive and accurate care.

As important as ML may be to expanding access and improving accuracy in the clinical setting, we see a new, equally important trend emerging: ML applied to help people in their daily health and well-being. Our everyday devices have powerful sensors that can help democratize health metrics and information so people can make more informed decisions about their health. We’ve already seen launches that enable a smartphone camera to assess heart rate and respiratory rate to help users without additional hardware, and Nest Hub devices that support contactless sleep sensing and allow users to better understand their nighttime wellness. We’ve seen that we can, on the one hand, significantly improve speech recognition quality for disordered speech in our own ASR systems, and on the other, use ML to help recreate the voice of those with speech impairments, empowering them to communicate in their own voice. ML-enabled smartphones that help people better research emerging skin conditions, or that help those with limited vision go for a jog, seem to be just around the corner. These opportunities offer a future too bright to ignore.

The custom ML model for contactless sleep sensing efficiently processes a continuous stream of 3D radar tensors (summarizing activity over a range of distances, frequencies, and time) to automatically compute probabilities for the likelihood of user presence and wakefulness (awake or asleep).

ML Applications for the Climate Crisis

Another realm of paramount importance is climate change, which is an incredibly urgent threat for humanity. We all need to work together to bend the curve of harmful emissions and ensure a safe and prosperous future. Better information about the climate impact of different choices can help us tackle this challenge in a number of different ways.

To this end, we recently rolled out eco-friendly routing in Google Maps, which we estimate will save about 1 million tons of CO2 emissions per year (the equivalent of removing more than 200,000 cars from the road). A recent case study shows that using Google Maps directions in Salt Lake City results in both faster and more emissions-friendly routing, which saves 1.7% of CO2 emissions and 6.5% travel time. In addition, making our Maps products smarter about electric vehicles can help alleviate range anxiety, encouraging people to switch to emissions-free vehicles. We are also working with multiple municipalities around the world to use aggregated historical traffic data to help suggest improved traffic light timing settings, with an early pilot study in Israel and Brazil showing a 10-20% reduction in fuel consumption and delay time at the examined intersections.

With eco-friendly routing, Google Maps will show you the fastest route and the one that’s most fuel-efficient — so you can choose whichever one works best for you.

On a longer time scale, fusion holds promise as a game-changing renewable energy source. In a long-standing collaboration with TAE Technologies, we have used ML to help maintain stable plasmas in their fusion reactor by suggesting settings of the more than 1000 relevant control parameters. With our collaboration, TAE achieved their major goals for their Norman reactor, which brings us a step closer to the goal of breakeven fusion. The machine maintains a stable plasma at 30 million Kelvin (don’t touch!) for 30 milliseconds, which is the extent of available power to its systems. They have completed a design for an even more powerful machine, which they hope will demonstrate the conditions necessary for breakeven fusion before the end of the decade.

We’re also expanding our efforts to address wildfires and floods, which are becoming more common (like millions of Californians, I’m having to adapt to having a regular “fire season”). Last year, we launched a wildfire boundary map powered by satellite data to help people in the U.S. easily understand the approximate size and location of a fire — right from their device. Building on this, we’re now bringing all of Google’s wildfire information together and launching it globally with a new layer on Google Maps. We have been applying graph optimization algorithms to help optimize fire evacuation routes to help keep people safe in the presence of rapidly advancing fires. In 2021, our Flood Forecasting Initiative expanded its operational warning systems to cover 360 million people, and sent more than 115 million notifications directly to the mobile devices of people at risk from flooding, more than triple our outreach in the previous year. We also deployed our LSTM-based forecast models and the new Manifold inundation model in real-world systems for the first time, and shared a detailed description of all components of our systems.

The wildfire layer in Google Maps provides people with critical, up-to-date information in an emergency.

We’re also working hard on our own set of sustainability initiatives. Google was the first major company to become carbon neutral in 2007. We were also the first major company to match our energy use with 100 percent renewable energy in 2017. We operate the cleanest global cloud in the industry, and we’re the world’s largest corporate purchaser of renewable energy. Further, in 2020 we became the first major company to make a commitment to operate on 24/7 carbon-free energy in all our data centers and campuses worldwide. This is far more challenging than the traditional approach of matching energy usage with renewable energy, but we’re working to get this done by 2030. Carbon emission from ML model training is a concern for the ML community, and we have shown that making good choices about model architecture, datacenter, and ML accelerator type can reduce the carbon footprint of training by ~100-1000x.


Trend 5: Deeper and Broader Understanding of ML
As ML is used more broadly across technology products and society more generally, it is imperative that we continue to develop new techniques to ensure that it is applied fairly and equitably, and that it benefits all people and not just select subsets. This is a major focus for our Responsible AI and Human-Centered Technology research group and an area in which we conduct research on a variety of responsibility-related topics.

One area of focus is recommendation systems that are based on user activity in online products. Because these recommendation systems are often composed of multiple distinct components, understanding their fairness properties often requires insight into individual components as well as how the individual components behave when combined together. Recent work has helped to better understand these relationships, revealing ways to improve the fairness of both individual components and the overall recommendation system. In addition, when learning from implicit user activity, it is also important for recommendation systems to learn in an unbiased manner, since the straightforward approach of learning from items that were shown to previous users exhibits well-known forms of bias. Without correcting for such biases, for example, items that were shown in more prominent positions to users tend to get recommended to future users more often.
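One standard way to correct for that position effect, not necessarily the approach taken in the work cited above, is to weight logged clicks by the inverse of an estimated examination propensity; a toy version:

```python
import numpy as np

# Logged impressions for one item: the position it was shown at, and whether it was clicked.
positions = np.array([1, 1, 2, 3, 3, 5])
clicks = np.array([1, 0, 1, 0, 1, 1])

# Assumed probability that a user even examines each position (normally estimated from data).
examine_prob = {1: 0.9, 2: 0.6, 3: 0.4, 5: 0.2}

weights = np.array([1.0 / examine_prob[p] for p in positions])
# Self-normalized inverse-propensity estimate: approximates how often the item
# would be clicked if position played no role.
corrected = (clicks * weights).sum() / weights.sum()
naive = clicks.mean()
print(naive, corrected)   # the naive estimate over-credits items that happened to be shown prominently
```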

As in recommendation systems, surrounding context is important in machine translation. Because most machine translation systems translate individual sentences in isolation, without additional surrounding context, they can often reinforce biases related to gender, age or other areas. In an effort to address some of these issues, we have a long-standing line of research on reducing gender bias in our translation systems, and to help the entire translation community, last year we released a dataset to study gender bias in translation based on translations of Wikipedia biographies.

Another common problem in deploying machine learning models is distributional shift: if the statistical distribution of data on which the model was trained is not the same as that of the data the model is given as input, the model’s behavior can sometimes be unpredictable. In recent work, we employ the Deep Bootstrap framework to compare the real world, where there is finite training data, to an “ideal world”, where there is infinite data. Better understanding of how a model behaves in these two regimes (real vs. ideal) can help us develop models that generalize better to new settings and exhibit less bias towards fixed training datasets.

Although work on ML algorithms and model development gets significant attention, data collection and dataset curation often get less. But this is an important area, because the data on which an ML model is trained can be a potential source of bias and fairness issues in downstream applications. Analyzing such data cascades in ML can help identify the many places in the lifecycle of an ML project that can have substantial influence on the outcomes. This research on data cascades has led to evidence-backed guidelines for data collection and evaluation in the revised PAIR Guidebook, aimed at ML developers and designers.

Arrows of different colors indicate various types of data cascades, each of which typically originates upstream, compounds over the ML development process, and manifests downstream.

The general goal of better understanding data is an important part of ML research. One thing that can help is finding and investigating anomalous data. We have developed methods to better understand the influence that particular training examples can have on an ML model, since mislabeled data or other similar issues can have outsized impact on the overall model behavior. We have also built the Know Your Data tool to help ML researchers and practitioners better understand properties of their datasets, and last year we created a case study of how to use the Know Your Data tool to explore issues like gender bias and age bias in a dataset.

A screenshot from Know Your Data showing the relationship between words that describe attractiveness and gendered words. For example, “attractive” and “male/man/boy” co-occur 12 times, but we expect ~60 times by chance (the ratio is 0.2x). On the other hand, “attractive” and “female/woman/girl” co-occur 2.62 times more than chance.
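The ratio quoted in that screenshot is simply observed co-occurrence divided by the count expected under independence; with made-up marginal counts purely for illustration:

```python
# Observed/expected co-occurrence ratio as reported in the screenshot above.
observed = 12
expected_by_chance = 60                 # expected count if the two terms were independent
print(observed / expected_by_chance)    # 0.2x: "attractive" and "male/man/boy" co-occur less than chance

# How the expected count is typically derived, with hypothetical marginals (not real dataset counts):
n_total = 100_000                       # total captions
n_attractive, n_male_terms = 2_000, 3_000
expected = n_attractive * n_male_terms / n_total   # = 60 under these made-up numbers
```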

Understanding dynamics of benchmark dataset usage is also important, given the central role they play in the organization of ML as a field. Although studies of individual datasets have become increasingly common, the dynamics of dataset usage across the field have remained underexplored. In recent work, we published the first large scale empirical analysis of dynamics of dataset creation, adoption, and reuse. This work offers insights into pathways to enable more rigorous evaluations, as well as more equitable and socially informed research.

Creating public datasets that are more inclusive and less biased is an important way to help improve the field of ML for everyone. In 2016, we released the Open Images dataset, a collection of ~9 million images annotated with image labels spanning thousands of object categories and bounding box annotations for 600 classes. Last year, we introduced the More Inclusive Annotations for People (MIAP) dataset in the Open Images Extended collection. The collection contains more complete bounding box annotations for the person class hierarchy, and each annotation is labeled with fairness-related attributes, including perceived gender presentation and perceived age range. With the increasing focus on reducing unfair bias as part of responsible AI research, we hope these annotations will encourage researchers already leveraging the Open Images dataset to incorporate fairness analysis in their research.

Because we also know that our teams are not the only ones creating datasets that can improve machine learning, we have built Dataset Search to help users discover new and useful datasets, wherever they might be on the Web.

Tackling various forms of abusive behavior online, such as toxic language, hate speech, and misinformation, is a core priority for Google. Being able to detect such forms of abuse reliably, efficiently, and at scale is of critical importance both to ensure that our platforms are safe and also to avoid the risk of reproducing such negative traits through language technologies that learn from online discourse in an unsupervised fashion. Google has pioneered work in this space through the Perspective API tool, but the nuances involved in detecting toxicity at scale remain a complex problem. In recent work, in collaboration with various academic partners, we introduced a comprehensive taxonomy to reason about the changing landscape of online hate and harassment. We also investigated how to detect covert forms of toxicity, such as microaggressions, that are often ignored in online abuse interventions, studied how conventional approaches to deal with disagreements in data annotations of such subjective concepts might marginalize minority perspectives, and proposed a new disaggregated modeling approach that uses a multi-task framework to tackle this issue. Furthermore, through qualitative research and network-level content analysis, Google’s Jigsaw team, in collaboration with researchers at George Washington University, studied how hate clusters spread disinformation across social media platforms.

Another potential concern is that ML language understanding and generation models can sometimes also produce results that are not properly supported by evidence. To confront this problem in question answering, summarization, and dialog, we developed a new framework for measuring whether results can be attributed to specific sources. We released annotation guidelines and demonstrated that they can be reliably used in evaluating candidate models.

Interactive analysis and debugging of models remains key to responsible use of ML. We have updated our Language Interpretability Tool with new capabilities and techniques to advance this line of work, including support for image and tabular data, a variety of features carried over from our previous work on the What-If Tool, and built-in support for fairness analysis through the technique of Testing with Concept Activation Vectors. Interpretability and explainability of ML systems more generally is also a key part of our Responsible AI vision; in collaboration with DeepMind, we made headway in understanding the acquisition of human chess concepts in the self-trained AlphaZero chess system.

We are also working hard to broaden the perspective of Responsible AI beyond western contexts. Our recent research examines how various assumptions of conventional algorithmic fairness frameworks based on Western institutions and infrastructures may fail in non-Western contexts and offers a pathway for recontextualizing fairness research in India along several directions. We are actively conducting survey research across several continents to better understand perceptions of and preferences regarding AI. Western framing of algorithmic fairness research tends to focus on only a handful of attributes, thus biases concerning non-Western contexts are largely ignored and empirically under-studied. To address this gap, in collaboration with the University of Michigan, we developed a weakly supervised method to robustly detect lexical biases in broader geo-cultural contexts in NLP models that reflect human judgments of offensive and inoffensive language in those geographic contexts.

Furthermore, we have explored applications of ML to contexts valued in the Global South, including developing a proposal for farmer-centered ML research. Through this work, we hope to encourage the field to be thoughtful about how to bring ML-enabled solutions to smallholder farmers in ways that will improve their lives and their communities.

Involving community stakeholders at all stages of the ML pipeline is key to our efforts to develop and deploy ML responsibly and keep us focused on tackling the problems that matter most. In this vein, we held a Health Equity Research Summit among external faculty, non-profit organization leads, government and NGO representatives, and other subject matter experts to discuss how to bring more equity into the entire ML ecosystem, from the way we approach problem-solving to how we assess the impact of our efforts.

Community-based research methods have also informed our approach to designing for digital wellbeing and addressing racial equity issues in ML systems, including improving our understanding of the experience of Black Americans using ASR systems. We are also listening to the public more broadly to learn how sociotechnical ML systems could help during major life events, such as by supporting family caregiving.

As ML models become more capable and have impact in many domains, the protection of the private information used in ML continues to be an important focus for research. Along these lines, some of our recent work addresses privacy in large models, both highlighting that training data can sometimes be extracted from large models and pointing to how privacy can be achieved in large models, e.g., as in differentially private BERT. In addition to the work on federated learning and analytics, mentioned above, we have also been enhancing our toolbox with other principled and practical ML techniques for ensuring differential privacy, for example private clustering, private personalization, private matrix completion, private weighted sampling, private quantiles, private robust learning of halfspaces, and in general, sample-efficient private PAC learning. Moreover, we have been expanding the set of privacy notions that can be tailored to different applications and threat models, including label privacy and user versus item level privacy.

A visual illustration of the differentially private clustering algorithm.


Datasets
Recognizing the value of open datasets to the general advancement of ML and related fields of research, we continue to grow our collection of open source datasets and resources and expand our global index of open datasets in Google Dataset Search. This year, we have released a number of datasets and tools across a range of research areas:

  • AIST++: 3D keypoints with corresponding images for dance motions covering 10 dance genres
  • AutoFlow: 40k image pairs with ground truth optical flow
  • C4_200M: A 200 million sentence synthetic dataset for grammatical error correction
  • CIFAR-5M: Dataset of ~6 million synthetic CIFAR-10–like images (RGB 32 x 32 pix)
  • Crisscrossed Captions: Set of semantic similarity ratings for the MS-COCO dataset
  • Disfl-QA: Dataset of contextual disfluencies for information seeking
  • Distilled Datasets: Distilled datasets from CIFAR-10, CIFAR-100, MNIST, Fashion-MNIST, and SVHN
  • EvolvingRL: 1000 top performing RL algorithms discovered through algorithm evolution
  • GoEmotions: A human-annotated dataset of 58k Reddit comments labeled with 27 emotion categories
  • H01 Dataset: 1.4 petabyte browsable reconstruction of the human cortex
  • Know Your Data: Tool for understanding biases in a dataset
  • Lens Flare: 5000 high-quality RGB images of typical lens flare
  • More Inclusive Annotations for People (MIAP): Improved bounding box annotations for a subset of the person class in the Open Images dataset
  • Mostly Basic Python Problems: 1000 Python programming problems, incl. task description, code solution & test cases
  • NIH ChestX-ray14 dataset labels: Expert labels for a subset of the NIH ChestX-ray14 dataset
  • Open Buildings: Locations and footprints of 516 million buildings with coverage across most of Africa
  • Optical Polarization from Curie: 5GB of optical polarization data from the Curie submarine cable
  • Readability Scroll: Scroll interactions of ~600 participants reading texts from the OneStopEnglish corpus
  • RLDS: Tools to store, retrieve & manipulate episodic data for reinforcement learning
  • Room-Across-Room (RxR): Multilingual dataset for vision-and-language navigation in English, Hindi and Telugu
  • Soft Attributes: ~6k sets of movie titles annotated with single English soft attributes
  • TimeDial: Dataset of multiple choice span-filling tasks for temporal commonsense reasoning in dialog
  • ToTTo: English table-to-text generation dataset with a controlled text generation task
  • Translated Wikipedia Biographies: Dataset for analysis of common gender errors in NMT for English, Spanish and German
  • UI Understanding Data for UIBert: Datasets for two UI understanding tasks, AppSim & RefExp
  • WikiFact: Wikipedia & WikiData–based dataset to train relationship classifiers and fact extraction models
  • WIT: Wikipedia-based Image Text dataset for multimodal multilingual ML

Research Community Interaction
To realize our goal for a more robust and comprehensive understanding of ML and related technologies, we actively engage with the broader research community. In 2021, we published over 750 papers, nearly 600 of which were presented at leading research conferences. Google Research sponsored over 150 conferences, and Google researchers contributed directly by serving on program committees and organizing workshops, tutorials and numerous other activities aimed at collectively advancing the field. To learn more about our contributions to some of the larger research conferences this year, please see our recent conference blog posts. In addition, we hosted 19 virtual workshops (like the 2021 Quantum Summer Symposium), which allowed us to further engage with the academic community by generating new ideas and directions for the research field and advancing research initiatives.

In 2021, Google Research also directly supported external research with $59M in funding, including $23M through Research programs to faculty and students, and $20M in university partnerships and outreach. This past year, we introduced new funding and collaboration programs that support academics all over the world who are doing high impact research. We funded 86 early career faculty through our Research Scholar Program to support general advancements in science, and funded 34 faculty through our Award for Inclusion Research Program who are doing research in areas like accessibility, algorithmic fairness, higher education and collaboration, and participatory ML. In addition to the research we are funding, we welcomed 85 faculty and post-docs, globally, through our Visiting Researcher program, to come to Google and partner with us on exciting ideas and shared research challenges. We also selected a group of 74 incredibly talented PhD student researchers to receive Google PhD Fellowships and mentorship as they conduct their research.

As part of our ongoing racial equity commitments, making computer science (CS) research more inclusive continues to be a top priority for us. In 2021, we continued expanding our efforts to increase the diversity of Ph.D. graduates in computing. For example, the CS Research Mentorship Program (CSRMP), an initiative by Google Research to support students from historically marginalized groups (HMGs) in computing research pathways, graduated 590 mentees, 83% of whom self-identified as part of an HMG, who were supported by 194 Google mentors — our largest group to date! In October, we welcomed 35 institutions globally leading the way to engage 3,400+ students in computing research as part of the 2021 exploreCSR cohort. Since 2018, this program has provided faculty with funding, community, evaluation and connections to Google researchers in order to introduce students from HMGs to the world of CS research. We are excited to expand this program to more international locations in 2022.

We also continued our efforts to fund and partner with organizations to develop and support new pathways and approaches to broadening participation in computing research at scale. From working with alliances like the Computing Alliance of Hispanic-Serving Institutions (CAHSI) and CMD-IT Diversifying LEAdership in the Professoriate (LEAP) Alliance to partnering with university initiatives like UMBC’s Meyerhoff Scholars, Cornell University’s CSMore, Northeastern University’s Center for Inclusive Computing, and MIT’s MEnTorEd Opportunities in Research (METEOR), we are taking a community-based approach to materially increase the representation of marginalized groups in computing research.

Other Work
In writing these retrospectives, I try to focus on new research work that has happened (mostly) in the past year while also looking ahead. In past years’ retrospectives, I’ve tried to be more comprehensive, but this time I thought it could be more interesting to focus on just a few themes. We’ve also done great work in many other research areas that don’t fit neatly into these themes. If you’re interested, I encourage you to check out our research publications by area below or by year (and if you’re interested in quantum computing, our Quantum team recently wrote a retrospective of their work in 2021):

  • Algorithms and Theory
  • Data Management
  • Data Mining
  • Distributed Systems & Parallel Computing
  • Economics & Electronic Commerce
  • Education Innovation
  • General Science
  • Health and Bioscience
  • Hardware and Architecture
  • Human-Computer Interaction and Visualization
  • Information Retrieval and the Web
  • Machine Intelligence
  • Machine Perception
  • Machine Translation
  • Mobile Systems
  • Natural Language Processing
  • Networking
  • Quantum Computing
  • Responsible AI
  • Robotics
  • Security, Privacy and Abuse Prevention
  • Software Engineering
  • Software Systems
  • Speech Processing

Conclusion
Research is often a multi-year journey to real-world impact. Early stage research work that happened a few years ago is now having a dramatic impact on Google’s products and across the world. Investments in ML hardware accelerators like TPUs and in software frameworks like TensorFlow and JAX have borne fruit. ML models are increasingly prevalent in many different products and features at Google because their power and ease of expression streamline experimentation and productionization of ML models in performance-critical environments. Research into model architectures to create Seq2Seq, Inception, EfficientNet, and Transformer or algorithmic research like batch normalization and distillation is driving progress in the fields of language understanding, vision, speech, and others. Basic capabilities like better language and visual understanding and speech recognition can be transformational, and as a result, these sorts of models are widely deployed for a wide variety of problems in many of our products including Search, Assistant, Ads, Cloud, Gmail, Maps, YouTube, Workspace, Android, Pixel, Nest, and Translate.

These are truly exciting times in machine learning and computer science. Continued improvement in computers’ ability to understand and interact with the world around them through language, vision, and sound opens up entire new frontiers of how computers can help people accomplish things in the world. The many examples of progress along the five themes outlined in this post are waypoints in a long-term journey!

Acknowledgements
Thanks to Alison Carroll, Alison Lentz, Andrew Carroll, Andrew Tomkins, Avinatan Hassidim, Azalia Mirhoseini, Barak Turovsky, Been Kim, Blaise Aguera y Arcas, Brennan Saeta, Brian Rakowski, Charina Chou, Christian Howard, Claire Cui, Corinna Cortes, Courtney Heldreth, David Patterson, Dipanjan Das, Ed Chi, Eli Collins, Emily Denton, Fernando Pereira, Genevieve Park, Greg Corrado, Ian Tenney, Iz Conroy, James Wexler, Jason Freidenfelds, John Platt, Katherine Chou, Kathy Meier-Hellstern, Kyle Vandenberg, Lauren Wilcox, Lizzie Dorfman, Marian Croak, Martin Abadi, Matthew Flegal, Meredith Morris, Natasha Noy, Negar Saei, Neha Arora, Paul Muret, Paul Natsev, Quoc Le, Ravi Kumar, Rina Panigrahy, Sanjiv Kumar, Sella Nevo, Slav Petrov, Sreenivas Gollapudi, Tom Duerig, Tom Small, Vidhya Navalpakkam, Vincent Vanhoucke, Vinodkumar Prabhakaran, Viren Jain, Yonghui Wu, Yossi Matias, and Zoubin Ghahramani for helpful feedback and contributions to this post, and to the entire Research and Health communities at Google for everyone’s contributions towards this work.

Categories
Misc

Detecting Objects in Point Clouds with NVIDIA CUDA-Pointpillars

Use long-range and high-precision data sets to achieve 3D object detection for perception, mapping, and localization algorithms.

A point cloud is a data set of points in a coordinate system. Points contain a wealth of information, including three-dimensional coordinates X, Y, Z; color; classification value; intensity value; and time. Point clouds mostly come from lidars that are commonly used in various NVIDIA Jetson use cases, such as autonomous machines, perception modules, and 3D modeling.
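As a concrete illustration, lidar point clouds are often stored as flat binary files of per-point records. The following is a minimal sketch of loading one with NumPy, assuming the common KITTI-style layout of four float32 values (x, y, z, intensity) per point; the file name is a placeholder:

import numpy as np

# Assumption: KITTI-style .bin file with 4 float32 values per point (x, y, z, intensity).
points = np.fromfile("lidar_frame.bin", dtype=np.float32).reshape(-1, 4)

xyz = points[:, :3]        # 3D coordinates
intensity = points[:, 3]   # reflectance/intensity value
print("Loaded %d points" % points.shape[0])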

One of the key applications is to leverage long-range and high-precision data sets to achieve 3D object detection for perception, mapping, and localization algorithms.

PointPillars is one of the most common models used for point cloud inference. This post discusses an NVIDIA CUDA-accelerated PointPillars model for Jetson developers. Download the CUDA-PointPillars model today.

What is CUDA-Pointpillars

In this post, we introduce CUDA-Pointpillars, which can detect objects in point clouds. The process is as follows:

  • Base preprocessing: Generates pillars.
  • Preprocessing: Generates BEV feature maps (10 channels).
  • ONNX model for TensorRT: An ONNX model that can be consumed by TensorRT.
  • Post-processing: Generates bounding boxes by parsing the output of the TensorRT engine.
Figure 1. The CUDA-Pointpillars pipeline: four stages that take a point cloud as input and produce bounding boxes as output.

Base preprocessing

The base preprocessing step converts point clouds into base feature maps. It provides the following components:

  • Base feature maps
  • Pillar coordinates: Coordinates of each pillar.
  • Parameters: Number of pillars.
Figure 2. Converting point clouds into base feature maps

Preprocessing

The preprocessing step converts the basic feature maps (four channels) into BEV feature maps (10 channels).

Figure 3. Converting base feature maps into BEV feature maps

ONNX model for TensorRT

The native PointPillars model from OpenPCDet was modified for the following reasons:

  • Too many small operations with low memory-bandwidth utilization.
  • Some operations, like NonZero, are not supported by TensorRT.
  • Some operations, like ScatterND, have low performance.
  • The model uses “dict” as input and output, which prevents exporting an ONNX file.

To export ONNX from native OpenPCDet, we modified the model (Figure 4).

Figure 4. Overview of the ONNX model in CUDA-Pointpillars, exported from OpenPCDet and simplified with onnx-simplifier.

You can divide the whole ONNX file into the following parts:

  • Inputs: BEV feature maps, pillar coordinates, parameters. These are all generated in preprocessing.
  • Outputs: Class, Box, Dir_class. These are parsed by post-processing to generate a bounding box.
  • ScatterBEV: Converts point pillars (1D) into a 2D image, which can work as a plug-in for TensorRT.
  • Others: All remaining operations, which are supported natively by TensorRT.
Figure 5. Scattering point pillar data into a 2D image for the 2D backbone.

Post-processing

The post-processing step parses the output of the TensorRT engine (class, box, and dir_class) and outputs bounding boxes. Figure 6 shows example parameters.

Figure 6. Parameters of a bounding box.

Using CUDA-PointPillars

To use CUDA-PointPillars, provide the ONNX mode file and data buffer for the point clouds:

    std::vector<Bndbox> nms_pred;                              // Bndbox: bounding-box struct defined by CUDA-PointPillars
    PointPillar pointpillar(ONNXModel_File, cuda_stream);      // load the ONNX model into a TensorRT engine
    pointpillar.doinfer(points_data, points_count, nms_pred);  // run inference; results land in nms_pred

Converting a native model trained by OpenPCDet into an ONNX file for CUDA-Pointpillars

In our project, we provide a Python script that can convert a native model trained by OpenPCDet into an ONNX file for CUDA-Pointpillars. Find the exporter.py script in the /tool directory of CUDA-Pointpillars.

To get a pointpillar.onnx file in the current directory, run the following command:

$ python exporter.py --ckpt ./*.pth

Performance

The table shows the test environment and performance. Before the test, boost the CPU and GPU clocks.

Jetson: Xavier NVIDIA AGX 8GB
Release: NVIDIA JetPack 4.5
CUDA: 10.2
TensorRT: 7.1.3
Infer Time: 33 ms
Table 1. Test platform and performance

Get started with CUDA-PointPillars

In this post, we showed you what CUDA-PointPillars is and how to use it to detect objects in point clouds.

Because native OpenPCDet cannot export ONNX and contains too many small operations that perform poorly with TensorRT, we developed CUDA-PointPillars. This application can export native models trained by OpenPCDet to a special ONNX model and run inference on that ONNX model with TensorRT.

Download CUDA-PointPillars today.

Categories
Misc

Can’t open Tensorflow lite android studio file

Hi, recently I decided to try out TensorFlow Lite, so I went to their GitHub (https://github.com/tensorflow/examples) and downloaded the examples repo. However, when I tried to open an example such as object detection in Android Studio, I kept getting the error:

Could not resolve all dependencies for configuration ‘:app:taskApiDebugRuntimeClasspath’.

Using insecure protocols with repositories, without explicit opt-in, is unsupported. Switch Maven repository ‘ossrh-snapshot(http://oss.sonatype.org/content/repositories/snapshots)’ to redirect to a secure protocol (like HTTPS) or allow insecure protocols. See https://docs.gradle.org/7.3.2/dsl/org.gradle.api.artifacts.repositories.UrlArtifactRepository.html#org.gradle.api.artifacts.repositories.UrlArtifactRepository:allowInsecureProtocol for more details.

May I ask how I can fix it? Thank you.

submitted by /u/EpicNewsMoment
[visit reddit] [comments]

Categories
Misc

Leading HPC Software Company Bright Computing Joins NVIDIA

Bright Computing, a leader in software for managing high performance computing systems used by more than 700 organizations worldwide, is now part of NVIDIA. Companies in healthcare, financial services, manufacturing and other markets use its tool to set up and run HPC clusters, groups of servers linked by high-speed networks into a single unit. Its Read article >

The post Leading HPC Software Company Bright Computing Joins NVIDIA appeared first on The Official NVIDIA Blog.

Categories
Misc

Cloud Control: Production Studio Taylor James Elevates Remote Workflows With NVIDIA Technology

WFH was likely one the most-used acronyms of the past year, with more businesses looking to enhance their employees’ remote experiences than ever. Creative production studio Taylor James found a cloud-based solution to maintain efficiency and productivity — even while working remotely — with NVIDIA RTX Virtual Workstations on AWS. With locations in New York, Read article >

The post Cloud Control: Production Studio Taylor James Elevates Remote Workflows With NVIDIA Technology appeared first on The Official NVIDIA Blog.

Categories
Misc

AI Startup Speeds Up Derivative Models for Bank of Montreal

To make the best portfolio decisions, banks need to accurately calculate values of their trades, while factoring in uncertain external risks. This requires high-performance computing power to run complex derivatives models — which find fair prices for financial contracts — as close to real time as possible. “You don’t want to trade today on yesterday’s Read article >

The post AI Startup Speeds Up Derivative Models for Bank of Montreal appeared first on The Official NVIDIA Blog.

Categories
Misc

Managing Video Streams in Runtime with the NVIDIA DeepStream SDK

The applications of video analytics are changing right before your eyes. With AI applied to video analytics, it is now possible to keep a watch over hundreds of cameras in real time.

Transportation monitoring systems, healthcare, and retail have all benefited greatly from intelligent video analytics (IVA). DeepStream is an IVA SDK that enables you to attach and detach video streams at runtime without affecting the entire deployment.

This post discusses the details of how stream addition and deletion work with DeepStream. I also describe how to manage large deployments centrally across multiple isolated data centers, serving multiple use cases with streams coming from many cameras.

The NVIDIA DeepStream SDK is a streaming analytics toolkit for multisensor processing. Streaming analytics use cases are transforming quickly, and IVA is of immense help in making spaces smarter. DeepStream runs on discrete GPUs such as the NVIDIA T4 and NVIDIA Ampere architecture GPUs, and on system-on-chip platforms such as the NVIDIA Jetson family of devices.

DeepStream has flexibility that enables you to build complex applications with any of the following:

  • Multiple deep learning frameworks
  • Multiple streams
  • Multiple models combining in series or in parallel to form an ensemble
  • Multiple models working in tandem
  • Compute at different precisions
  • Custom preprocessing and post-processing
  • Orchestration with Kubernetes

A DeepStream application can have multiple plug-ins, as shown in Figure 1. Depending on its capability, each plug-in may use the GPU, DLA, or specialized hardware.

Figure 1. Integration of the nvinferserver and nvinfer plug-ins with DeepStream. Nvinferserver can work with backends like ONNX, TensorFlow, PyTorch, and TensorRT, and also enables creating ensemble models.

DeepStream is fundamentally built to allow deployment at scale while maintaining throughput and accuracy at any given time. The scale of any IVA pipeline depends on two major factors:

  • Stream management
  • Compute capability

Stream management is a vital aspect of any large deployment with many cameras. Such a deployment cannot be brought down just to add or remove streams, and it must be made failsafe to handle spurious streams at runtime. The deployment is also expected to handle runtime attachment and detachment of use cases to a pipeline running with specific models.

This post helps you in understanding the following aspects of stream management:

  • Stream consumption with DeepStream Python API
  • Adding and removing streams in runtime
  • Attaching specific stream to pipeline with specific models in runtime
  • Stream management on large-scale deployment involving multiple data centers

As the application grows in complexity, it becomes increasingly difficult to change. A well-thought-out development strategy from the beginning can go a long way. In the next section, I briefly discuss different ways to develop a DeepStream application. I also discuss how to manage stream/use-case allocation and deallocation and consider some best practices.

DeepStream application development

DeepStream enables you to create seamless streaming pipelines for AI-based video, audio, and image analytics. DeepStream gives you the choice of developing in C or Python, providing more flexibility, and comes with several hardware-accelerated plug-ins. DeepStream is derived from GStreamer and offers a unified API between Python and C.

The Python and C APIs for DeepStream are unified. This means any application developed in Python can be easily converted to C, and the reverse. Python and C give the developer full freedom: with the DeepStream Python and C APIs, it is possible to design dynamic applications that handle streams and use cases at runtime. Some example Python applications are available at NVIDIA-AI-IOT/deepstream_python_apps.

The DeepStream SDK is based on the GStreamer multimedia framework and includes a GPU-accelerated plug-in pipeline. Plug-ins for video inputs, video decoding, image preprocessing, NVIDIA TensorRT-based inference, object tracking, and display are included in the SDK to make the application development process easier. These features can be used to create multistream video analytics solutions that are adaptable.

Plug-ins are the core building block with which to make pipelines. Each data buffer in-between the input (that is, the input of the pipeline, for example, camera and video files) and output (for example, the screen display) is passed through plug-ins. Video decoding and encoding, neural network inference, and displaying text on top of video streams are examples of plug-ins. The connected plug-in constitutes a pipeline.

Pads are the interfaces between plug-ins. When data flows from one plug-in to another in a pipeline, it flows from the source pad of one plug-in to the sink pad of another. Each plug-in might have zero, one, or many source/sink pads.
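As a small illustration of pads, here is a sketch using the GStreamer Python bindings; it assumes a DeepStream installation so the NVIDIA elements are available, and the element names are illustrative:

import gi
gi.require_version('Gst', '1.0')
from gi.repository import Gst

Gst.init(None)
pipeline = Gst.Pipeline.new("pad-demo")

convert = Gst.ElementFactory.make("nvvideoconvert", "converter")
osd = Gst.ElementFactory.make("nvdsosd", "onscreendisplay")
pipeline.add(convert)
pipeline.add(osd)

# Static pads: the converter's source pad feeds the OSD's sink pad.
srcpad = convert.get_static_pad("src")
sinkpad = osd.get_static_pad("sink")
srcpad.link(sinkpad)

# Request pads: nvstreammux creates sink_0, sink_1, ... on demand, one per stream.
streammux = Gst.ElementFactory.make("nvstreammux", "muxer")
pipeline.add(streammux)
mux_sinkpad = streammux.get_request_pad("sink_0")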

Figure 2. The simplified DeepStream application with multiple stream support.

The earlier example application consists of the following plug-ins:

  • GstUriDecodebin: Decodes data from a URI into raw media. It selects a source plug-in that can handle the given scheme and connects it to decodebin.
  • Nvstreammux: The Gst-nvstreammux plug-in forms a batch of frames from multiple input sources.
  • Nvinfer: The Gst-nvinfer plug-in does inferencing on input data using TensorRT.
  • Nvmultistream-tiler: The Gst-nvmultistreamtiler plug-in composites a 2D tile from batched buffers.
  • Nvvideoconvert: Gst-nvvideoconvert performs scaling, cropping, and video color format conversion.
  • NvDsosd: Gst-nvdsosd draws bounding boxes, text, and region of interest (ROI) polygons.
  • GstEglGles: EglGlesSink renders video frames on an EGL surface (xOverlay interface and Native Display).

Each plug-in can have one or more source and sink pads. In this case, when streams are added, a Gst-Uridecodebin plug-in is added to the pipeline, one for each stream. The source pad of each Gst-Uridecodebin plug-in is connected to a sink pad on the single Gst-nvstreammux plug-in. Nvstreammux creates batches from the frames coming from all previous plug-ins and pushes them to the next plug-in in the pipeline. Figure 3 shows how multiple camera streams are added to the pipeline.

Figure 3. Showing Pads as the linking interfaces between plug-ins.

Buffers carry the data through the pipeline. Buffers are timestamped and contain metadata attached by various DeepStream plug-ins. A buffer carries information such as how many plug-ins are using it, flags, and pointers to objects in memory.

DeepStream applications can be thought of as pipelines consisting of individual plug-in components. Each plug-in represents a functional block, such as inference using TensorRT or multistream decoding. Where applicable, plug-ins are accelerated using the underlying hardware to deliver maximum performance. DeepStream’s key value is in making deep learning for video easily accessible, allowing you to concentrate on quickly building and customizing efficient and scalable video analytics applications.

Runtime stream addition/deletion application

DeepStream provides sample implementations of the runtime add/delete functionality in Python and C. The samples live in the deepstream_python_apps and deepstream_reference_apps repositories, at the paths shown in the steps below.

These applications are designed with simplicity in mind. They take one input stream, and the same stream is added multiple times to the running pipeline after a set interval of time; this is how a specified number of streams are added to the pipeline without restarting the application. Eventually, each stream is removed at every interval of time. After the last stream is removed, the application gracefully stops.

To start with the sample applications, follow these steps.

To create a Python-based application

  1. Pull the DeepStream Docker image from ngc.nvidia.com.
  2. Run git clone on the Python application repository within the Docker container.
  3. Go to the following location within the Docker container:  deepstream_python_apps/apps/runtime_source_add_delete
  4. Set up the Python prerequisites.
  5. Go to apps/runtime_source_add_delete and execute the application as follows:
python3 deepstream-test-rt-src-add-del.py 
python3 deepstream_rt_src_add_del.py file:///opt/nvidia/deepstream/deepstream-/samples/streams/sample_720p.mp4

To create a C-based application

  1. Pull the DeepStream Docker image from ngc.nvidia.com.
  2. Run git clone on the C application repository at /opt/nvidia/deepstream/deepstream/sources/apps/sample_apps/ within the Docker container.
  3. Go to deepstream_reference_apps/runtime_source_add_delete and compile and run the application as follows:
make
./deepstream-test-rt-src-add-del 
./deepstream-test-rt-src-add-del file:///opt/nvidia/deepstream/deepstream-/samples/streams/sample_720p.mp4

Application aspect: Runtime camera add-remove

DeepStream Python or C applications usually take input streams as a list of arguments while running the script. After code execution, a sequence of events takes place that eventually adds a stream to a running pipeline.

Here, you use the uridecodebin plug-in that decodes data from a URI into raw media. It selects a source plug-in that can handle the given scheme and connects it to a decode bin.

Here’s the list of sequence that takes place to register any stream:

  1. The source bin is created from the uridecodebin plug-in by the function create_uridecode_bin. The function create_uridecode_bin takes as its first argument source_id, an integer, and as its second argument rtsp_url. In this case, the integer is the order of the stream from 1…N and is used to create a uniquely identifiable source-bin name, such as source-bin-1, source-bin-2, … source-bin-N.
  2. The g_source_bin_list dictionary maps between the source bin and the id value.
  3. After the source bin is created, the RTSP stream URLs from arguments to the program are attached to this source bin.
  4. Later, the source-bin value of uridecodebin is linked to the sink-bin of the next plug-in, streammux.
  5. Such multiple uridecodebin plug-ins are created, each for one stream, and attached to streammux plug-in.

The following code example shows the minimal code in Python to attach multiple streams to a DeepStream pipeline.

g_source_bin_list = {}   # maps source id -> source bin
for i in range(num_sources):
    print("Creating source_bin ", i, " \n ")
    uri_name = argv[i]
    if uri_name.find("rtsp://") == 0:
        is_live = True
    # Create the source bin and add it to the pipeline
    source_bin = create_uridecode_bin(i, uri_name)
    g_source_bin_list[i] = source_bin
    pipeline.add(source_bin)

In a more organized application, these lines of code responsible for stream addition are shifted to a function that takes two arguments to attach a stream: stream_id and rtsp_url. You can call such a function anytime and append more streams to the running application.
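A simplified sketch of such helpers follows. It assumes the streammux, pipeline, and g_source_bin_list objects from the surrounding application, and reduces the pad handling to the essentials; the full version lives in the runtime_source_add_delete samples:

def cb_newpad(decodebin, decoder_src_pad, source_id):
    # Link the newly created decoded pad to the matching nvstreammux sink pad.
    sinkpad = streammux.get_request_pad("sink_%u" % source_id)
    decoder_src_pad.link(sinkpad)

def create_uridecode_bin(source_id, rtsp_url):
    # Uniquely named source bin: source-bin-1, source-bin-2, ..., source-bin-N.
    uri_decode_bin = Gst.ElementFactory.make("uridecodebin", "source-bin-%d" % source_id)
    uri_decode_bin.set_property("uri", rtsp_url)
    # The decoded source pad only appears at runtime, so link it in the callback.
    uri_decode_bin.connect("pad-added", cb_newpad, source_id)
    return uri_decode_bin

def add_source(source_id, rtsp_url):
    # Can be called at any time, even while the pipeline is PLAYING.
    source_bin = create_uridecode_bin(source_id, rtsp_url)
    g_source_bin_list[source_id] = source_bin
    pipeline.add(source_bin)
    # Bring the new bin up to the state of the rest of the pipeline.
    source_bin.set_state(Gst.State.PLAYING)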

Similarly, when the stream must be detached from the application, the following events take place:

  1. source_id of the already attached stream is given to the function stop_release_source.
  2. sink-pad of streammux attached to the source_id to be released is detached from the source bin of uridecodebin.
  3. The source bin of uridecodebin is then removed from the pipeline.
  4. The active source count is decreased by one.

The following code example shows minimal Python code to detach a stream from a DeepStream pipeline.

def stop_release_source(source_id):
    global g_num_sources
    pad_name = "sink_%u" % source_id
    print(pad_name)
    # Retrieve the streammux sink pad to be released
    sinkpad = streammux.get_static_pad(pad_name)
    # Send a flush-stop event to the sink pad, then release it from the streammux
    sinkpad.send_event(Gst.Event.new_flush_stop(False))
    streammux.release_request_pad(sinkpad)
    # Remove the source bin from the pipeline and from the tracking dictionary
    pipeline.remove(g_source_bin_list[source_id])
    del g_source_bin_list[source_id]
    g_num_sources -= 1

Deployment aspect: Runtime camera and use case management

Earlier, I discussed how to add and remove streams from the code. There are a few more factors, considering the deployment aspect.

Previously, you took all the input streams with command-line arguments. However, after the program is executed and while it is in deployment, you cannot provide any additional argument to it. How do you pass instructions to the running program on which stream to attach or detach?

Deployment requires additional code that periodically checks whether new streams are available and must be attached. A stream should be deleted in the following situations:

  • Stream no longer requires monitoring.
  • Camera issues lead to no streams.
  • Previously attached stream must be used for another use case.

In the case of multiple data centers for stream processing, give priority to the stream source nearest to the data center.

The DeepStream pipeline runs in the main thread. A separate, periodic task is required to check for streams to be added or deleted. Thankfully, GLib has a function named g_timeout_add_seconds. GLib is the low-level core library used by GStreamer and the GNOME ecosystem; it provides the main loop, data structures, and portability wrappers used throughout these applications.

g_timeout_add_seconds sets up a function to be called at regular intervals while the pipeline is running. The function is called repeatedly until it returns FALSE, at which point the timeout is automatically destroyed and the function is not called again.

guint g_timeout_add_seconds (guint interval, GSourceFunc function, gpointer data);

g_timeout_add_seconds takes three inputs:

  • Interval: The time between calls to the function, in seconds.
  • function: The function to call.
  • data: The data and arguments to pass to the function.

For example, suppose you register a function watchDog that takes GSourceBinList, a dictionary that maps streamURL to streamId. streamId is an internal integer ID that is generated after the stream is added to the pipeline. The final caller looks like the following code example:

guint interval = 10;
g_timeout_add_seconds (interval, watchDog, GSourceBinList);
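In the Python version of the application, the same periodic callback can be scheduled through PyGObject’s GLib module. A minimal sketch, where watch_dog is a watchdog callback like the one described below and g_source_bin_list is the stream dictionary from earlier:

from gi.repository import GLib

interval = 10  # seconds between checks
# watch_dog must return True to keep being called; returning False destroys the timeout.
GLib.timeout_add_seconds(interval, watch_dog, g_source_bin_list)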

As per the current interval setting, the watchDog function is called every 10 seconds. A database must be maintained to manage and track the many streams; an example database table is shown in Table 1. The watchDog function can query a database that maintains a list of all the available streams along with their current state and use case.

Source ID | RTSP URL         | Stream State | Use case                | Camera Location | Taken
1         | Rtsp://123/1.mp4 | ON           | License Plate Detection | Loc1            | True
2         | Rtsp://123/2.mp4 | BAD STREAM   | License Plate Detection | Loc2            | False
3         | Rtsp://123/3.mp4 | OFF          | Motion Detection        | Loc2            | False
n         | Rtsp://123/n.mp4 | OFF          | Social Distance         | Loc3            | False
Table 1. The minimal database table required to manage streams and corresponding use cases.

Here’s an example of the bare minimum database structure (SQL/no-SQL) needed to manage many streams at the same time:

  • Source ID: A unique ID, which is also the ID of the nvstreammux sink pad that the stream is connected to. source_id is useful for monitoring nv-gst events such as pad-added, pad-removed, and EOS for each stream. Remember that in the earlier simple app, source bins were named source-bin-1, source-bin-2, … source-bin-N in the order of the argument input. You use the same method with many cameras and track all active source bins in the application scope.
  • RTSP URL: The URL that the source plug-in should use.
  • Stream state: Helps in managing the state of the stream, such as ON or OFF. The database client must also be able to change the camera state, for example to BAD STREAM, NO STREAM, or CAMERA FAULT, according to what the client perceives. This can help with prompt maintenance.
  • Use case: Assigns a use case to the camera. This use case is checked, and only those cameras whose model is currently active are attached.
  • Camera Location: Helps with localizing the compute based on the camera’s location. This check avoids unnecessary capture from cameras at distant locations, which could be better assigned to other nearby compute clusters.
  • Taken: Assume that the deployment spans multiple GPUs and multiple nodes. When a DeepStream application running on any machine and any GPU adds a source, it sets this flag to True. This prevents another instance from adding the same source again.
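A minimal sketch of such a table using Python’s built-in sqlite3 module; the column names mirror Table 1, and this is only one possible schema, not a required one:

import sqlite3

conn = sqlite3.connect("stream_registry.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS streams (
        source_id       INTEGER PRIMARY KEY,
        rtsp_url        TEXT NOT NULL,
        stream_state    TEXT NOT NULL,      -- ON / OFF / BAD STREAM / ...
        use_case        TEXT NOT NULL,      -- e.g. License Plate Detection
        camera_location TEXT NOT NULL,
        taken           INTEGER DEFAULT 0   -- 1 when a DeepStream instance owns the stream
    )
""")
conn.commit()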

Maintaining a schema as described enables easy dashboard creation and monitoring from a central place.

Returning to the watchDog function, here’s the pseudo-code to check for the stream state and attach a new video stream according to the location and use cases:

FUNCTION watchDog (Dict: GSourceBinList)
    INITIALIZE streamURL ⟵ List              ▸ Dynamic list of stream URLs
    INITIALIZE streamState ⟵ List            ▸ Dynamic list of states corresponding to the stream URLs
    INITIALIZE streamId ⟵ Integer            ▸ Variable to store the id of a new stream

    streamURL, streamState := getCurrentCameraState()
    FOR X = 1 TO length(streamState)
        IF ((streamURL[X] IN GSourceBinList.keys()) AND (streamState[X] == "OFF"))
            stopReleaseSource(streamURL[X])   ▸ Detach stream

    streamURL, streamState := getAllStreamByLocationAndUsecase()
    FOR Y = 1 TO length(streamState)
        IF ((streamURL[Y] NOT IN GSourceBinList.keys()) AND (streamState[Y] == "ON"))
            streamId := addSource(streamURL[Y])       ▸ Add new stream
            GSourceBinList(streamURL[Y], streamId)    ▸ Update mappings

    RETURN GSourceBinList
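A minimal Python rendering of this pseudo-code is sketched below. The helpers get_current_camera_state, get_all_stream_by_location_and_usecase, add_source, and stop_release_source correspond to the pseudo-code above, are assumed to be implemented by the application (signatures follow the pseudo-code), and are not part of the DeepStream API:

def watch_dog(g_source_bin_list):
    # Detach streams that the database has switched OFF.
    stream_urls, stream_states = get_current_camera_state()
    for url, state in zip(stream_urls, stream_states):
        if url in g_source_bin_list and state == "OFF":
            stop_release_source(g_source_bin_list[url])
            del g_source_bin_list[url]

    # Attach newly enabled streams for this location and use case.
    stream_urls, stream_states = get_all_stream_by_location_and_usecase()
    for url, state in zip(stream_urls, stream_states):
        if url not in g_source_bin_list and state == "ON":
            g_source_bin_list[url] = add_source(url)

    # Returning True keeps the GLib timeout alive.
    return True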
  • The application enters the main function after module loading and global variable initialization.
  • In the main function, the local modules and variables are initialized.
  • When the application starts for the first time, it requests the list of streams from the database after the location and use case filters are applied.
  • After receiving the stream list, all the plug-ins of the DeepStream pipeline are initialized, linked, and set to the PLAY state. At this point, the application is running with all the provided streams.
  • At every set interval of time, a separate periodic task checks the state of the current streams in the database. If the state of an already-added stream changes to OFF in the database, the stream is released. The task also checks whether a new camera is listed in the database in the ON state; if so, that stream is added to the DeepStream pipeline after the location and use case filters are applied.
  • After a stream is added, the flag in the Taken column of the database must be set to True so that no other process adds the same stream again.

Figure 4 shows the overall flow of the function calls required to efficiently add and remove camera streams and attach them to a server running the appropriate model.

Figure 4. Overall control flow to manage streams and attach/detach use cases.

Just changing the number of sources is not enough: the components downstream of the source must be able to change their properties according to the number of streams. For this purpose, the components of a DeepStream application are already optimized to change properties at runtime.

However, many of the plug-ins use batch size as a parameter during initialization to allocate compute and memory resources. In this case, we recommend specifying the maximum expected batch size when executing the application. Table 2 shows a few such plug-in examples:

Plug-in | Functionality | Runtime changes
Gst-nvstreammux | Forms a batch of frames from multiple input sources. | The muxer supports the addition and deletion of sources at run time.
Gst-nvdsanalytics | Performs analytics on metadata attached by nvinfer (primary detector) and nvtracker. | If the runtime stream resolution differs from the configuration resolution, the plug-in handles the resolution change and scales the rules to the runtime resolution.
Gst-nvinfer | Performs inferencing on input data using TensorRT. | Enables reconfiguration of the batch size according to the number of streams at runtime.
Gst-nvinferserver | Performs inferencing on input data using NVIDIA Triton Inference Server. | Enables reconfiguration of the batch size according to the number of streams at runtime.
Gst-nvmultistreamtiler | Composites a 2D tile from batched buffers. | Reconfigures the 2D tile for new sources added at runtime.
Gst-nvtracker | Enables the DS pipeline to use a low-level tracker to track the detected objects with unique IDs. | Supports tracking on new sources added at runtime and cleans up resources when sources are removed.
Table 2. Plug-ins and their capability to adapt to runtime changes.

You can also explicitly change plug-in properties when a change in the number of streams is detected. To tweak the properties of a plug-in at runtime, use the set_property function in Python or C, or the g_object_set function in C.
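For example, a minimal sketch of reconfiguring the muxer’s batch size from Python when the number of active sources changes; batch-size is a documented Gst-nvstreammux property, and num_active_sources is a variable the application is assumed to maintain:

# Reconfigure nvstreammux to match the current number of active sources.
streammux.set_property("batch-size", num_active_sources)
# The C equivalent would be:
#   g_object_set(G_OBJECT(streammux), "batch-size", num_active_sources, NULL);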

Best practices

  • Always check your stream properties before adding a stream to the pipeline. Stream properties can be checked with the gst-discoverer-1.0 command-line utility. It accepts a URI from the command line and prints all information about the stream. It is useful for finding out what container and codecs were used to produce the media, and therefore which plug-ins you must put in the pipeline to play it. The discoverer can also be used from Python and C through the respective APIs, as shown in the sketch after this list.
  • Profile the DeepStream application as it is developed. This is the first step in optimizing and tuning your application. Profiling helps you understand an application’s performance characteristics and makes it easy to identify parts of the code that present opportunities for improvement. Find hotspots and bottlenecks in your application to help you decide where to focus your optimization efforts.
  • Determine the maximum number of streams that can be run on the GPU by profiling the application. At runtime, keep the number of streams below that maximum so that application performance remains stable.
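A minimal sketch of invoking the discoverer from Python through GstPbutils; the URI is a placeholder:

import gi
gi.require_version('Gst', '1.0')
gi.require_version('GstPbutils', '1.0')
from gi.repository import Gst, GstPbutils

Gst.init(None)
discoverer = GstPbutils.Discoverer.new(5 * Gst.SECOND)  # 5-second timeout
info = discoverer.discover_uri("file:///path/to/sample_720p.mp4")

for video in info.get_video_streams():
    print("codec caps:", video.get_caps().to_string())
    print("resolution: %dx%d" % (video.get_width(), video.get_height()))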

To increase performance, consult the DeepStream troubleshooting manuals.

For more information, see the NVIDIA DeepStream SDK documentation and sample applications.

Categories
Misc

Trying to apply the TensorFlow agents from the examples to a custom environment

Hello everyone,

I followed the TensorFlow tutorial for agents and the multi-armed bandit tutorial, and now I’m trying to make one of the already implemented agents from the examples work on my own environment. Basically, my environment consists of 5 actions and 5 observations. Applying action i results in state i. Each action also involves sending that action number to a different program via a socket, and the answer from that program is interpreted as the reward. My environment seems to be working; I used the little test script below to test the observe and action functions. I know this is not a full proof, but it showed it’s at least working.

Now I am missing the part that maps the observation to the action, hence the agent with its policy. I followed the structure of the examples, but every agent I tried on my environment had a different error. I seem to be applying them to my environment incorrectly but can’t figure out what I’m doing wrong.

Am I not able to apply one of these end-to-end agents from the examples as stated? I searched all the tutorials and documentation on TensorFlow but couldn’t get any answer. My environment should be simple enough. I seem to be missing some essential step.

The Errors for each agent:

  • Greedy: Input 0 of layer "dense_3" is incompatible with the layer: expected min_ndim=2, found ndim=1. Full shape received: (50,) Call arguments received: • observation=tf.Tensor(shape=(), dtype=int32) • step_type=tf.Tensor(shape=(), dtype=int32) • network_state=() • training=False
  • LinUCB: ValueError: Global observation shape is expected to be [None, 1]. Got [].
  • LinThompson: lib/python3.8/site-packages/tf_agents/bandits/policies/linear_bandit_policy.py", line 242, in _distribution raise ValueError( ValueError: Global observation shape is expected to be [None, 1]. Got [].
  • Exp3: lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 7107, in raise_from_not_ok_status raise core._status_to_exception(e) from None # pylint: disable=protected-access tensorflow.python.framework.errors_impl.InvalidArgumentError: cannot compute Mul as input #1(zero-based) was expected to be a int32 tensor but is a float tensor [Op:Mul]

The environment:

nest = tf.nest

# https://www.tensorflow.org/agents/tutorials/2_environments_tutorial
# Statemachine environment
#
# Actions:
#   n Actions: Every state of the statemachine represents one bandit with one action.
#   for now it is 5 states
#
# Observations:
#   one of the 5 states
class AFLEnvironment(bandit_py_environment.BanditPyEnvironment):

    def __init__(self):
        action_spec = tensor_spec.BoundedTensorSpec(
            shape=(), dtype=np.int32, minimum=0, maximum=4,
            name='action')  # actions: 0,1,2,3,4 for 5 states.
        observation_spec = tensor_spec.BoundedTensorSpec(
            shape=(), dtype=np.int32, minimum=0, maximum=4,
            name='observation')  # 5 possible states
        self._state = tf.constant(0)
        super(AFLEnvironment, self).__init__(observation_spec, action_spec)

    def _observe(self):
        self._observation = self._state
        return self._observation

    # implementation of taking the action
    def _apply_action(self, action):
        sock = self.__connectToSocket()
        # answer: NO_FAULT = 0, FSRV_RUN_TMOUT = 1, FSRV_RUN_CRASH = 2, FSRV_RUN_ERROR = 3
        answer = self.__fuzz(action, sock)
        if answer == "0":
            reward = 0.0
        elif answer == "1":
            reward = 1.0
        elif answer == "2":
            reward = 1.0
        elif answer == "3":
            reward = 1.0
        else:
            print("Error in return value from fuzzing: %s" % answer)
            sys.exit(1)
        self._state = tf.constant(action)
        print("Step ended, reward is: %s" % reward)
        return reward

The different agents:

nest = tf.nest

flags.DEFINE_string('root_dir', os.getenv('TEST_UNDECLARED_OUTPUTS_DIR'),
                    'Root directory for writing logs/summaries/checkpoints.')
flags.DEFINE_enum(
    'agent', 'EXP3', ['GREEDY', 'LINUCB', 'LINTHOMPSON', 'EXP3'],
    'Which agent to use. Possible values are `GREEDY`, `LINUCB`, `LINTHOMPSON` and `EXP3`. Default is GREEDY.')
FLAGS = flags.FLAGS

# From example, change here for training parameters
BATCH_SIZE = 8
TRAINING_LOOPS = 200
STEPS_PER_LOOP = 2
CONTEXT_DIM = 15

# LinUCB agent constants.
AGENT_ALPHA = 10.0

# epsilon Greedy constants.
EPSILON = 0.05
LAYERS = (50, 50, 50)
LR = 0.005


def main(unused_argv):
    tf.compat.v1.enable_v2_behavior()  # The trainer only runs with V2 enabled.

    with tf.device('/CPU:0'):  # due to b/128333994
        env = AFLEnvironment()

        # 'GREEDY', 'LINUCB', 'LINTHOMPSON', 'EXP3'
        if FLAGS.agent == 'GREEDY':
            network = q_network.QNetwork(
                input_tensor_spec=env.time_step_spec().observation,
                action_spec=env.action_spec(),
                fc_layer_params=LAYERS)
            agent = eps_greedy_agent.NeuralEpsilonGreedyAgent(
                time_step_spec=env.time_step_spec(),
                action_spec=env.action_spec(),
                reward_network=network,
                optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=LR),
                epsilon=EPSILON)
        elif FLAGS.agent == 'LINUCB':
            agent = lin_ucb_agent.LinearUCBAgent(
                time_step_spec=env.time_step_spec(),
                action_spec=env.action_spec(),
                alpha=AGENT_ALPHA,
                gamma=0.95,  # sometimes omitted in the examples
                emit_log_probability=False,
                dtype=tf.float32)
        elif FLAGS.agent == 'LINTHOMPSON':
            agent = lin_ts_agent.LinearThompsonSamplingAgent(
                time_step_spec=env.time_step_spec(),
                action_spec=env.action_spec())
        elif FLAGS.agent == 'EXP3':
            agent = exp3_agent.Exp3Agent(
                time_step_spec=env.time_step_spec(),
                action_spec=env.action_spec(),
                learning_rate=1)

        replay_buffer = []
        metric = py_metrics.AverageReturnMetric()
        observers = [replay_buffer.append, metric]

        driver = dynamic_step_driver.DynamicStepDriver(
            env=env,
            policy=agent.collect_policy,
            observers=observers,
            num_steps=200)

        initial_time_step = env.reset()
        print("initial_time_step")
        print(initial_time_step)
        final_time_step, _ = driver.run(initial_time_step)

        print('Replay Buffer:')
        for traj in replay_buffer:
            print(traj)


if __name__ == '__main__':
    app.run(main)

Test script:

env = AFLEnvironment()
observation = env.reset().observation
print("observation: %d" % observation)

action = 1  #@param
print("action: %d" % action)

reward = env.step(action).reward
print("reward: %f" % reward)
print("observation : %d", env._observe())

submitted by /u/sampletext1111
[visit reddit] [comments]

Categories
Misc

TFlite Maxpool op doesn’t seem to work as intended

I’m exploring the behavior of TFLite operations for custom hardware. I quantized a pretrained VGG16 (from the model zoo) into int8. The scale and zero point of the input and output tensors are equal for each maxpool op. Since quantization is a monotonically increasing function, I believe the output of the maxpool op (int8) should be the 2×2 maxpool of the input (int8), equivalent to the following numpy code:

max_out_custom = max_in.reshape(1,112,2,112,2,64).max(axis=2).max(axis=3) 

But it is not so, and I can’t find a pattern. Any help will be appreciated.

Colab with example code: https://colab.research.google.com/drive/1410SH8uEE5IX0Iuvv27SwTtpCl2XM5T5?usp=sharing

submitted by /u/uncle-iroh-11
[visit reddit] [comments]