Categories
Misc

Startup Transforms Meeting Notes With Time-Saving Features

Gil Makleff and Artem Koren are developing AI for meeting transcripts, creating time-savers like shareable highlights of the text that is often TL;DR (too long; didn’t read). The Sembly founders conceived the idea after years of working in enterprise operational consulting at UMT Consulting Group, which was acquired by Ernst & Young. “We had an Read article >

The post Startup Transforms Meeting Notes With Time-Saving Features appeared first on NVIDIA Blog.

Categories
Misc

I’m trying to train with TFOD. It’s my first time using the TFOD API and it’s showing some warnings. Should I be concerned?

submitted by /u/RAIDAIN
Categories
Misc

A Night to Behold: Researchers Use Deep Learning to Bring Color to Night Vision

Talk about a bright idea. A team of scientists has used GPU-accelerated deep learning to show how color can be brought to night-vision systems.  In a paper published this week in the journal PLOS One, a team of researchers at the University of California, Irvine led by Professor Pierre Baldi and Dr. Andrew Browne, describes how Read article >

The post A Night to Behold: Researchers Use Deep Learning to Bring Color to Night Vision appeared first on NVIDIA Blog.

Categories
Misc

Is this possible? With Arduino and TensorFlow

I’m thinking of making an Arduino-based program that captures audio of me or someone else saying something or playing a musical instrument, and uses that to check similarities against a database of audio files I’ve created using TensorFlow. Is this possible with TensorFlow and Arduino? Thanks!

submitted by /u/padam11

Categories
Offsites

Locked-image Tuning: Adding Language Understanding to Image Models

The ability to classify images into categories has been transformed by deep learning. It has also been significantly accelerated by transfer learning, whereby models are first pre-trained on large datasets, like ImageNet, to learn visual representations that are then transferred via fine-tuning to a new task with less data (e.g., classifying animals). Previous works such as BiT and ViT employed these methods to achieve state-of-the-art performance on a wide range of classification tasks, such as the VTAB benchmark.

However, fine-tuning has some downsides: though pre-training is done only once, fine-tuning is necessary on every new dataset for which task-specific data is needed. Multimodal contrastive learning is an alternative, recently popularized paradigm (e.g., CLIP, ALIGN) that overcomes these issues by instead learning how to match free-form text with images. These models can then solve new tasks by reformulating them as image-text matching problems, without extra data (referred to as “zero-shot” learning). Contrastive learning is flexible and easy to adapt to new tasks, but has its own limitations, namely the need for a lot of paired image-text data and weaker performance than transfer learning approaches.

With those limitations in mind, we propose “LiT: Zero-Shot Transfer with Locked-image Text Tuning”, to appear at CVPR 2022. LiT models learn to match text to an already pre-trained image encoder. This simple yet effective setup provides the best of both worlds: strong image representations from pre-training, plus flexible zero-shot transfer to new tasks via contrastive learning. LiT achieves state-of-the-art zero-shot classification accuracy, significantly closing the gap between the two styles of learning. We think the best way to understand is to try it yourself, so we’ve included a demo of LiT models at the end of this post.

Fine-tuning (left) requires task-specific data and training to adapt a pre-trained model to a new task. A LiT model (right) can be used with any task, without further data or adaptation.

Contrastive Learning on Image-Text Data
Contrastive learning models learn representations from “positive” and “negative” examples, such that representations for “positive” examples are similar to each other but different from “negative” examples.

Multimodal contrastive learning applies this to pairs of images and associated texts. An image encoder computes representations from images, and a text encoder does the same for texts. Each image representation is encouraged to be close to the representation of its associated text (“positive”), but distinct from the representation of other texts (“negatives”) in the data, and vice versa. This has typically been done with randomly initialized models (“from scratch”), meaning the encoders have to simultaneously learn representations and how to match them.
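
This pairwise matching is typically implemented as a symmetric contrastive (InfoNCE-style) loss over a batch of image-text pairs. The following minimal PyTorch sketch illustrates the idea; it is not the paper’s code, and image_emb and text_emb stand in for the batch outputs of the two encoders.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so that dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.T / temperature

    # The matched pair (i, i) is the "positive"; all other pairs are "negatives".
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)    # each image should match its text
    loss_t2i = F.cross_entropy(logits.T, targets)  # each text should match its image
    return (loss_i2t + loss_t2i) / 2
```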

Multimodal contrastive learning trains models to produce similar representations for closely matched images and texts.

This training can be done on noisy, loosely aligned pairs of image and text, which naturally occur on the web. This circumvents the need for manual labeling, and makes data scaling easy. Furthermore, the model learns much richer visual concepts — it’s not constrained to what’s defined in the classification label space. Instead of classifying an image as “coffee”, it can understand whether it’s “a small espresso in a white mug” or “a large latte in a red flask”.

Once trained, a model that aligns image and text can be used in many ways. For zero-shot classification, we compare image representations to text representations of the class names. For example, a “wombat vs jaguar” classifier can be built by computing the representations of the texts “jaguar” and “wombat”, and classifying an image as a jaguar if its representation better matches the former. This approach scales to thousands of classes and makes it very easy to solve classification tasks without the extra data necessary for fine-tuning. Another application of contrastive models is image search (a.k.a. image-text retrieval), by finding the image whose representation best matches that of a given text, or vice versa.
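
As a rough sketch of that zero-shot recipe (assuming hypothetical encode_image and encode_text helpers that wrap the trained encoders), the class whose text embedding best matches the image embedding wins:

```python
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text):
    text_emb = F.normalize(encode_text(class_names), dim=-1)   # (num_classes, dim)
    image_emb = F.normalize(encode_image(image), dim=-1)       # (1, dim)
    scores = image_emb @ text_emb.T                            # cosine similarities
    return class_names[scores.argmax(dim=-1).item()]

# e.g., zero_shot_classify(photo, ["wombat", "jaguar"], encode_image, encode_text)
```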

The Best of Both Worlds with Locked-image Tuning
As mentioned earlier, transfer learning achieves state-of-the-art accuracy, but requires per-task labels, datasets, and training. On the other hand, contrastive models are flexible, scalable, and easily adaptable to new tasks, but fall short in performance. To compare, at the time of writing, the state of the art on ImageNet classification using transfer learning is 90.94%, but the best contrastive zero-shot models achieve 76.4%.

LiT tuning bridges this gap: we contrastively train a text model to compute representations well aligned with the powerful ones available from a pre-trained image encoder. Importantly, for this to work well, the image encoder should be “locked”, that is: it should not be updated during training. This may be unintuitive since one usually expects the additional information from further training to increase performance, but we find that locking the image encoder consistently leads to better results.
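
Below is a hedged, PyTorch-style sketch of this setup (the actual LiT implementation is JAX-based; image_encoder, text_encoder, and loader are placeholders here). It reuses the contrastive_loss sketch from above; the only trainable parameters are the text encoder’s.

```python
import torch

for p in image_encoder.parameters():
    p.requires_grad = False          # "locked": the pre-trained image tower is frozen
image_encoder.eval()

optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-4)

for images, texts in loader:         # noisy, loosely aligned web image-text pairs
    with torch.no_grad():
        img_emb = image_encoder(images)   # could also be pre-computed once offline
    txt_emb = text_encoder(texts)
    loss = contrastive_loss(img_emb, txt_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```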

LiT-tuning contrastively trains a text encoder to match a pre-trained image encoder. The text encoder learns to compute representations that align to those from the image encoder.

This can be considered an alternative to the classic fine-tuning stage, where the image encoder is separately adapted to every new classification task; instead we have one stage of LiT-tuning, after which the model can classify any data. LiT-tuned models achieve 84.5% zero-shot accuracy on ImageNet classification, showing significant improvements over previous methods that train models from scratch, and halving the performance gap between fine-tuning and contrastive learning.

Left: LiT-tuning significantly closes the gap between the best contrastive models and the best models fine-tuned with labels. Right: Using a pre-trained image encoder is always helpful, but locking it is surprisingly a key part of the recipe for success; unlocked image models (dashed) yield significantly worse performance.

An impressive benefit of contrastive models is increased robustness — they retain high accuracy on datasets that typically fool fine-tuned models, such as ObjectNet and ImageNet-C. Similarly, LiT-tuned models have high performance across various challenging versions of ImageNet, for example achieving a state-of-the-art 81.1% accuracy on ObjectNet.

LiT-tuning has other advantages. While prior contrastive works require large amounts of data and train for a very long time, the LiT approach is much less data hungry. LiT models trained on 24M publicly available image-text pairs rival the zero-shot classification performance of prior models trained on 400M image-text pairs of private data. The locked image encoder also leads to faster training with a smaller memory footprint. On larger datasets, image representations can be pre-computed; not running the image model during training further improves efficiency and also unlocks much larger batch sizes, which increases the number of “negatives” the model sees and is key to high-performance contrastive learning. The method works well with varied forms of image pre-training (e.g., including self-supervised learning), and with many publicly available image models. We hope that these benefits make LiT a great testbed for researchers.

Conclusion
We present Locked-image Tuning (LiT), which contrastively trains a text encoder to match image representations from a powerful pre-trained image encoder. This simple method is data and compute efficient, and substantially improves zero-shot classification performance compared to existing contrastive learning approaches.

Want to try it yourself?

A preview of the demo: use it to match free-form text descriptions to images and build your own zero-shot classifier!

We have prepared a small interactive demo to try some LiT-tuned models. We also provide a Colab with more advanced use cases and larger models, which are a great way to get started.

Acknowledgments
We would like to thank Xiaohua Zhai, Xiao Wang, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer who have co-authored the LiT paper and been involved in all aspects of its development, as well as the Brain team in Zürich. We also would like to thank Tom Small for creating the animations used in this blogpost.

Categories
Misc

Metropolis Spotlight: Bluecity Combines Vision AI and Lidar for Real-Time Road Safety and Traffic Congestion

Cities now have access to real-time, multimodal traffic data that improves road safety and reduces traffic congestion.

Bluecity, an NVIDIA Metropolis partner, recently launched a new traffic management solution for safer roads and shorter commutes. The technology combines vision AI and lidar technology to better understand round-the-clock traffic data, providing information that could help city planning departments identify problem intersections, reduce congestion, plan smarter, and lower emissions. 

Road safety and congestion are a priority for city and transportation planners, but sparse data has limited their ability to address traffic issues, especially in areas with ever-expanding populations. While video cameras are used to capture information, like the number of cars at a particular intersection, poor lighting and bad weather conditions can interfere with capturing this data accurately. Studies have also shown that accidents are most likely to happen during those times when visibility is low.

New technologies that can overcome these obstacles and collect multimodal data about drivers, vehicle speeds, and trajectories can help make roads safer, especially when the data can be collected in real time.

Bluecity is solving this problem by combining vision AI and lidar technology to understand and evaluate traffic data. 

Figure 1. Bluecity’s IndiGO computer vision and traffic data platform provides real-time traffic data and analytics

Similar to radar, lidar sensors emit pulsed light waves into the environment and sense objects when the pulses bounce off them. Lidar uses lasers with a shorter wavelength than radar, and as a result can detect smaller objects, offering precise measurement data. Lidar can do this even in poor lighting or weather conditions, and it captures data anonymously. Each lidar sensor can provide 360-degree coverage within a radius of up to 120 meters (about 400 feet).
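
As a small illustrative calculation (not Bluecity’s code), ranging works by timing the pulse’s round trip: distance = c * t / 2.

```python
C = 299_792_458               # speed of light in m/s
round_trip_s = 800e-9         # example: an 800 ns round trip
distance_m = C * round_trip_s / 2
print(f"{distance_m:.1f} m")  # ~119.9 m, close to the sensor's ~120 m radius
```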

The Bluecity system employs the NVIDIA Jetson edge AI platform, which provides GPU-accelerated computing in a compact and energy-efficient module, along with NVIDIA TensorRT to accelerate the application’s inference throughput at the edge. The edge computing system runs proprietary 3D perception software and powers the traffic management solution, processing up to 50 lidar frames per second in real time to detect all road users.

The platform provides information on which turning directions and intersections are the riskiest, along with near misses, time-to-collision, and the speeds of the vehicles involved. It can also classify road users, giving important insight into the behavior of not only drivers but also cyclists and pedestrians.

The ability to collect data regardless of lighting or weather conditions helps city planners make data-driven decisions about things like road design and traffic-light timing. The startup’s AI component turns raw data into valuable information that guides practical and timely decision-making. AI powers better visualization, while conflict analyses help identify dangerous intersections before accidents occur. For instance, in Repentigny, Quebec, Bluecity installed its solution to provide multimodal traffic data so an engineering firm could better understand mobility in a region where a bridge is being updated.

Using Bluecity’s easy-to-understand dashboards, subscribers can view, select, filter, and download the data to improve their traffic planning.

Bluecity solutions are deployed in several Canadian, U.S., and European cities, including Irvine, California; Austin, Texas; Boca Raton, Florida; Trois-Rivières, Quebec; and Helsinki, Finland. The startup’s vision is to provide better multimodal data that makes city intersections safer, improves road safety, and reduces carbon emissions for smart cities.

Categories
Misc

GFN Thursday Gears Up With More Electronic Arts Games on GeForce NOW

This GFN Thursday delivers more gr-EA-t games as two new titles from Electronic Arts join the GeForce NOW library. Gamers can now enjoy Need for Speed HEAT and Plants vs. Zombies Garden Warfare 2 streaming from GeForce NOW to underpowered PCs, Macs, Chromebooks, SHIELD TV and mobile devices. It’s all part of the eight total Read article >

The post GFN Thursday Gears Up With More Electronic Arts Games on GeForce NOW appeared first on NVIDIA Blog.

Categories
Misc

Creating a Two-Hand Pose Classifier

I am trying to create a hand pose classifier on TensorFlow.js using JS. I am familiar with the libraries that make hand pose classification possible, but they all support only one hand. I want to train a neural network using landmarks from both hands (or classify each hand separately) and ideally save and use that dataset in a web-based project. Can anyone point me in the right direction or to tutorials for implementing this idea?

submitted by /u/SumusSolis

Categories
Misc

TensorFlow Object Detection API

I am running Faster R-CNN on a custom dataset of 250 images for an object detection task. I downloaded the TFRecords from Roboflow and started training a Faster R-CNN with Inception ResNet v2 (640×640). However, after 200 iterations, the loss on some tasks becomes 0:

https://preview.redd.it/5radkzb1fgt81.png?width=889&format=png&auto=webp&s=64a86d2903fab990d942517a31fbd51ff31ea4c1

What could be the cause of this problem?

submitted by /u/giakou4

Categories
Misc

Fast Fine-Tuning of AI Transformers Using RAPIDS Machine Learning

Find out how RAPIDS and the cuML support vector machine can achieve faster training time and maximum accuracy when fine-tuning transformers.

In recent years, transformers have emerged as a powerful deep neural network architecture that has been proven to beat the state of the art in many application domains, such as natural language processing (NLP) and computer vision.

This post uncovers how you can achieve maximum accuracy with the fastest training time possible when fine-tuning transformers. We demonstrate how the cuML support vector machine (SVM) algorithm, from the RAPIDS Machine Learning library, can dramatically accelerate this process. cuML SVM on GPU is up to 500x faster than the CPU-based implementation. This approach uses an SVM head instead of the conventional multi-layer perceptron (MLP) head, making it possible to fine-tune with precision and ease.

What is fine-tuning and why do you need it?

A transformer is a deep learning model consisting of many multi-head self-attention and feedforward fully connected layers. It is mainly used for sequence-to-sequence tasks, including NLP tasks such as machine translation and question answering, and computer vision tasks such as object detection.

Training a transformer from scratch is a compute-intensive process, often taking days or even weeks. In practice, fine-tuning is the most efficient way of applying pretrained transformers to new tasks, thereby reducing training time.

MLP head for fine-tuning transformers

As shown in Figure 1, transformers have two distinct components:

  • The backbone, which contains multiple blocks of self-attention and feedforward layers.
  • The head, where final predictions take place for either classification or regression tasks.

During fine-tuning, the backbone network of the transformer is frozen while only the lightweight head module is trained for the new task. The most common choice for the head module is a multi-layer perceptron (MLP) for both classification and regression tasks.
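
As a rough illustration of that conventional setup (the names and sizes below are assumptions, not NVIDIA’s code), the head is simply a small trainable network applied to the frozen backbone’s embedding:

```python
import torch.nn as nn

class MLPHead(nn.Module):
    """Small trainable head on top of a frozen transformer embedding."""
    def __init__(self, embed_dim=768, hidden=256, out_dim=1):
        # out_dim=1 for regression; set it to the number of classes for classification
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, embedding):
        return self.net(embedding)
```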

Figure 1. During fine-tuning, the pretrained backbone is frozen and only the head module is trained; using a cuML SVM head instead of an MLP head speeds up the process

As it turns out, implementing and tuning an MLP can be much harder than it looks. Why is that?

  • There are multiple hyperparameters to tune: number of layers, dropout, learning rate, regularization, types of optimizers, and more. Choosing which hyperparameter to tune depends on the problem you are trying to solve. For example, standard techniques such as dropout and batchnorm could lead to performance degradation for regression problems.
  • Additional efforts must be made to prevent overfitting. The transformer’s output is often a long embedding vector, with a length ranging from hundreds to thousands. Overfitting is common when the training data size is not large enough.
  • Performance in terms of execution time is typically not optimized. Users must write boilerplate code for data processing and training. Batch generation and data movement from CPU to GPU can also become a bottleneck for performance.

Advantages of SVM heads for fine-tuning transformers

Support vector machines (SVMs) are one of the most popular supervised learning methods, and they are most potent when meaningful, predictive features are available. This is especially true with high-dimensional data because of SVM’s robustness against overfitting.

Yet, data scientists are sometimes hesitant to try SVMs for several reasons: 

  • They require handcrafted feature engineering, which can be difficult to implement.
  • SVMs are traditionally slow.

RAPIDS cuML revives interest in this classic model by providing a speedup of up to 500x on the GPU. With RAPIDS cuML, SVM is gaining popularity again in the data science community.

For example, RAPIDS cuML SVM notebooks have been used frequently in several Kaggle competitions.

As transformers have already learned to extract meaningful representations in the form of long embedding vectors, cuML SVM is an ideal candidate for the head classifier or regressor.

When compared to the MLP head, cuML SVM has the following advantages:

  • Easy to tune. In practice, we have found in most instances that tuning just one parameter, C, is enough for SVM.
  • Speed. cuML moves all data to the GPU at once, before processing on the GPU.
  • Diversity. The predictions of SVM are statistically different from the MLP predictions, rendering it useful in ensembles.
  • Simple API. The cuML SVM API provides scikit-learn-style fit and predict functions; a minimal usage sketch follows this list.
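
The sketch below shows that scikit-learn-style API. The data here is random placeholder data standing in for transformer embeddings; in a fine-tuning setup the features would come from a frozen backbone, and cuml.svm.SVR would replace SVC for regression heads.

```python
import cupy as cp
from cuml.svm import SVC   # cuml.svm.SVR for regression

# Placeholder data: 1,000 samples of 768-dimensional "embeddings" with binary labels.
X_train = cp.random.rand(1000, 768, dtype=cp.float32)
y_train = cp.random.randint(0, 2, 1000)

clf = SVC(C=1.0, kernel="rbf")   # in practice, C is often the only parameter tuned
clf.fit(X_train, y_train)
preds = clf.predict(X_train)
```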

Case study: PetFinder.my Pawpularity Contest

This proposed fine-tuning methodology with SVM heads applies to both NLP and computer vision tasks. To demonstrate this, we looked at the PetFinder.my Pawpularity Contest, a Kaggle data science competition that challenged participants to predict the popularity of shelter pets from their photos.

The dataset used for this project consists of 10,000 hand-labeled images, each with a target pawpularity that we aimed to predict. With pawpularity values ranging from 0 to 100, we used regression to solve this problem.

As there are only 10,000 labeled images, it is impractical to train a deep neural network to achieve high accuracy from scratch. Instead, we approached this by using a pretrained Swin transformer backbone and then fine-tuning it with the labeled pet images.

Figure 2. The three steps of fine-tuning with a cuML SVM head: (1) train the backbone and MLP head with BCE loss, (2) freeze the backbone and train the RAPIDS SVM head, and (3) infer with both heads and average their predictions

As shown in Figure 2, our approach requires three steps:

  1. First, a regression head using an MLP is added to the backbone Swin transformer, and the backbone and head are fine-tuned. One interesting finding is that binary cross-entropy (BCE) loss outperforms the common mean square error (MSE) loss due to the distribution of the target.
  2. Next, the backbone is frozen, and the MLP head is replaced with the cuML SVM head. The SVM head is then trained with the regular MSE loss.
  3. To achieve the best prediction accuracy, we averaged the predictions of the MLP head and the SVM head. The evaluation metric, root mean square error (RMSE), improves from 18 to 17.8, which is significant for this dataset.

It is worth noting that steps 1 and 3 are optional and were implemented here to optimize the model’s score for this competition. Step 2 alone is the most common scenario for fine-tuning. For this reason, we measured the run time of step 2 and compared three options: cuML SVM (GPU), sklearn SVM (CPU), and PyTorch MLP (GPU). The results are shown in Figure 3.
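
A hedged sketch of step 2 under assumed tooling (a timm Swin backbone and a cuML SVR head; this is not the competition notebook, and the data variables are placeholders) might look like the following:

```python
import cupy as cp
import timm
import torch
from cuml.svm import SVR

backbone = timm.create_model(
    "swin_base_patch4_window7_224", pretrained=True, num_classes=0
).cuda().eval()                        # num_classes=0 returns pooled features only

@torch.no_grad()
def embed(images):                     # images: (N, 3, 224, 224) float tensor
    return backbone(images.cuda())     # (N, embed_dim) embeddings on the GPU

# train_images, train_targets, and test_images are placeholders for the labeled
# pet photos, their pawpularity scores, and the held-out photos.
svm_head = SVR(C=10.0, kernel="rbf")   # an assumed C value; tune per dataset
svm_head.fit(cp.asarray(embed(train_images)), cp.asarray(train_targets))
preds = svm_head.predict(cp.asarray(embed(test_images)))
```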

Figure 3. Runtime comparison between cuML SVM on GPU, sklearn SVM on CPU, and MLP on GPU; cuML SVM is up to 15x faster for fit and 28x faster for inference than its CPU counterpart

The runtimes are normalized to sklearn SVM; cuML SVM achieved a 15x speedup for training and a 28.18x speedup for inference. It is noteworthy that cuML SVM is also faster than the PyTorch MLP due to high GPU utilization. The notebook can be found on Kaggle.

Key takeaways on transformer fine-tuning

Transformers are revolutionary deep learning models, but training them is time-consuming. Fast fine-tuning of transformers on a GPU can benefit many applications by providing significant speedups. RAPIDS cuML SVM can also be used as a drop-in replacement for the classic MLP head, as it is both faster and more accurate.

GPU acceleration infuses new energy into classic ML models like SVM. With RAPIDS, it is possible to combine the best of the two worlds: classic machine learning (ML) models and cutting-edge deep learning (DL) models. In RAPIDS cuML, you will find more lightning-fast and easy-to-use models.

Postscript

By the time this post was written and edited, the PetFinder.my Pawpularity Contest had concluded. NVIDIA KGMON Gilberto Titericz won first place using RAPIDS SVM. His winning solution was to concatenate embeddings from transformers and other deep CNNs and use RAPIDS SVM as the regression head. For more information, see his winning solution write-up.