Categories
Offsites

End-to-end Generative Pre-training for Multimodal Video Captioning

Multimodal video captioning systems utilize both the video frames and speech to generate natural language descriptions (captions) of videos. Such systems are stepping stones towards the longstanding goal of building multimodal conversational systems that effortlessly communicate with users while perceiving environments through multimodal input streams.

Unlike video understanding tasks (e.g., video classification and retrieval), where the key challenge lies in processing and understanding multimodal input videos, the task of multimodal video captioning includes the additional challenge of generating grounded captions. The most widely adopted approach for this task is to train an encoder-decoder network jointly using manually annotated data. However, such data are scarce at scale, because annotating grounded captions for videos is labor intensive and, in many cases, impractical. Previous research such as VideoBERT and CoMVT pre-trains models on unlabelled videos by leveraging automatic speech recognition (ASR). However, such models often cannot generate natural language sentences because they lack a decoder, so only the video encoder is transferred to the downstream tasks.

In “End-to-End Generative Pre-training for Multimodal Video Captioning”, published at CVPR 2022, we introduce a novel pre-training framework for multimodal video captioning. This framework, which we call multimodal video generative pre-training or MV-GPT, jointly trains a multimodal video encoder and a sentence decoder from unlabelled videos by leveraging a future utterance as the target text and formulating a novel bi-directional generation task. We demonstrate that MV-GPT effectively transfers to multimodal video captioning, achieving state-of-the-art results on various benchmarks. Additionally, the multimodal video encoder is competitive for multiple video understanding tasks, such as VideoQA, text-video retrieval, and action recognition.

Future Utterance as an Additional Text Signal
Typically, each training video clip for multimodal video captioning is associated with two different texts: (1) a speech transcript that is aligned with the clip as a part of the multimodal input stream, and (2) a target caption, which is often manually annotated. The encoder learns to fuse information from the transcript with visual contents, and the target caption is used to train the decoder for generation. However, in the case of unlabelled videos, each video clip comes only with a transcript from ASR, without a manually annotated target caption. Moreover, we cannot use the same text (the ASR transcript) for the encoder input and decoder target, since the generation of the target would then be trivial.

MV-GPT circumvents this challenge by leveraging a future utterance as an additional text signal, enabling joint pre-training of the encoder and decoder. However, future utterances are often not grounded in the input content, so training a model to generate them directly is not ideal. We therefore apply a novel bi-directional generation loss to reinforce the connection to the input.

Bi-directional Generation Loss
The issue of non-grounded text generation is mitigated by formulating a bi-directional generation loss that includes forward and backward generation. Forward generation produces the future utterance given the visual frames and their corresponding transcript, which allows the model to learn to fuse the visual content with its corresponding transcript. Backward generation takes the visual frames and the future utterance and trains the model to generate a transcript that is more grounded in the video clip. The bi-directional generation loss thus trains both the encoder and the decoder in MV-GPT to handle visually grounded text.

Bi-directional generation in MV-GPT. A model is trained with two generation losses. In forward generation, the model generates a future utterance (blue boxes) given the frames and the present utterance (red boxes), whereas the present is generated from the future utterance in backward generation. Two special beginning-of-sentence tokens ([BOS-F] and [BOS-B]) initiate forward and backward generation for the decoder.
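To make the objective concrete, the following is a minimal, illustrative sketch of how a bi-directional generation loss of this kind can be computed. It is not the MV-GPT implementation: the toy dense encoder, GRU decoder, vocabulary size, and the [BOS-F]/[BOS-B] token ids are placeholder assumptions standing in for the paper's transformer-based components.

```python
# Illustrative sketch of a bi-directional generation loss (not the MV-GPT code).
# Toy stand-ins: a dense "encoder" over pooled features and a GRU "decoder";
# VOCAB_SIZE, BOS_F and BOS_B are placeholder assumptions.
import tensorflow as tf

VOCAB_SIZE = 30522           # assumed tokenizer vocabulary size
BOS_F, BOS_B = 1, 2          # assumed ids for the [BOS-F] / [BOS-B] tokens

encoder = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu")])
embed = tf.keras.layers.Embedding(VOCAB_SIZE, 256)
decoder = tf.keras.layers.GRU(256, return_sequences=True)
to_logits = tf.keras.layers.Dense(VOCAB_SIZE)


def generation_loss(context, bos_id, target_ids):
    """Teacher-forced loss for generating target_ids from an encoded context."""
    bos = tf.fill([tf.shape(target_ids)[0], 1], bos_id)
    decoder_input = tf.concat([bos, target_ids[:, :-1]], axis=1)   # shift right
    hidden = decoder(embed(decoder_input), initial_state=context)
    logits = to_logits(hidden)
    return tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            target_ids, logits, from_logits=True))


def bidirectional_loss(frame_features, present_utt, future_utt):
    # Forward generation: frames + present utterance -> future utterance.
    fwd_ctx = encoder(tf.concat(
        [frame_features, tf.reduce_mean(embed(present_utt), axis=1)], axis=-1))
    loss_fwd = generation_loss(fwd_ctx, BOS_F, future_utt)
    # Backward generation: frames + future utterance -> present utterance.
    bwd_ctx = encoder(tf.concat(
        [frame_features, tf.reduce_mean(embed(future_utt), axis=1)], axis=-1))
    loss_bwd = generation_loss(bwd_ctx, BOS_B, present_utt)
    return loss_fwd + loss_bwd


# Toy batch: 2 clips with pooled frame features and length-8 utterances.
frames = tf.random.normal([2, 512])
present = tf.random.uniform([2, 8], maxval=VOCAB_SIZE, dtype=tf.int32)
future = tf.random.uniform([2, 8], maxval=VOCAB_SIZE, dtype=tf.int32)
print(bidirectional_loss(frames, present, future))
```

The point the sketch captures is that the same encoder and decoder receive gradients from both directions, so the backward pass ties the generated text back to the visual input.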

Results on Multimodal Video Captioning
We compare MV-GPT to existing pre-training losses, using the same model architecture, on YouCook2 with standard evaluation metrics (Bleu-4, Cider, Meteor and Rouge-L). While all pre-training techniques improve captioning performance, pre-training the decoder jointly with the encoder is critical for the best results. We demonstrate that MV-GPT outperforms the previous state-of-the-art joint pre-training method with relative gains of over 3.5% across all four metrics.

Pre-training Loss Pre-trained Parts Bleu-4 Cider Meteor Rouge-L
No Pre-training N/A 13.25 1.03 17.56 35.48
CoMVT Encoder 14.46 1.24 18.46 37.17
UniVL Encoder + Decoder 19.95 1.98 25.27 46.81
MV-GPT (ours) Encoder + Decoder 21.26 2.14 26.36 48.58
Captioning performance on YouCook2 across four metrics (Bleu-4, Cider, Meteor and Rouge-L) for different pre-training losses. “Pre-trained parts” indicates which parts of the model are pre-trained: only the encoder, or both the encoder and the decoder. We reimplement the loss functions of existing methods but use our model and training strategies for a fair comparison.

We transfer a model pre-trained by MV-GPT to four different captioning benchmarks: YouCook2, MSR-VTT, ViTT and ActivityNet-Captions. Our model achieves state-of-the-art performance on all four benchmarks by significant margins. For instance, on the Meteor metric, MV-GPT shows relative improvements of over 12% on all four benchmarks.

YouCook2 MSR-VTT ViTT ActivityNet-Captions
Best Baseline 22.35 29.90 11.00 10.90
MV-GPT (ours) 27.09 38.66 26.75 12.31
Meteor metric scores of the best baseline methods and MV-GPT on four benchmarks.

Results on Non-generative Video Understanding Tasks
Although MV-GPT is designed to train a generative model for multimodal video captioning, we also find that our pre-training technique learns a powerful multimodal video encoder that can be applied to multiple video understanding tasks, including VideoQA, text-video retrieval and action classification. When compared to the best comparable baseline models, the model transferred from MV-GPT shows superior performance in five video understanding benchmarks on their primary metrics — i.e., top-1 accuracy for VideoQA and action classification benchmarks, and recall at 1 for the retrieval benchmark.

Task Benchmark Best Comparable Baseline MV-GPT
VideoQA MSRVTT-QA 41.5 41.7
VideoQA ActivityNet-QA 38.9 39.1
Text-Video Retrieval MSR-VTT 33.7 37.3
Action Recognition Kinetics-400 78.9 80.4
Action Recognition Kinetics-600 80.6 82.4
Comparisons of MV-GPT to best comparable baseline models on five video understanding benchmarks. For each dataset we report the widely used primary metric, i.e., MSRVTT-QA and ActivityNet-QA: Top-1 answer accuracy; MSR-VTT: Recall at 1; and Kinetics: Top-1 classification accuracy.

Summary
We introduce MV-GPT, a new generative pre-training framework for multimodal video captioning. Our bi-directional generative objective jointly pre-trains a multimodal encoder and a caption decoder by using utterances sampled at different times in unlabelled videos. Our pre-trained model achieves state-of-the-art results on multiple video captioning benchmarks and other video understanding tasks, namely VideoQA, video retrieval and action classification.

Acknowledgements
This research was conducted by Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab and Cordelia Schmid.

Categories
Misc

What Is Zero Trust?

For all its sophistication, the Internet age has brought on a digital plague of security breaches. The steady drumbeat of data and identity thefts spawned a new movement and a modern mantra that’s even been the subject of a U.S. presidential mandate — zero trust. So, What Is Zero Trust? Zero trust is a cybersecurity Read article >

The post What Is Zero Trust? appeared first on NVIDIA Blog.

Categories
Misc

Festo Develops With Isaac Sim to Drive Its Industrial Automation

Dionysios Satikidis was playing FIFA 19 when he realized the simulated soccer game’s realism offered a glimpse into the future for training robots. An expert in AI and autonomous systems at Festo, a German industrial control and automation company, he believed the worlds of gaming and robotics would intersect. “I’ve always been passionate about technology Read article >

The post Festo Develops With Isaac Sim to Drive Its Industrial Automation appeared first on NVIDIA Blog.

Categories
Misc

Feel the Need … for Speed as ‘Top Goose’ Debuts In the NVIDIA Studio

This week In the NVIDIA Studio takes off with the debut of Top Goose, a short animation created with Omniverse Machinima and inspired by one of the greatest fictional pilots to ever grace the big screen. The project was powered by PCs using the same breed of GPU that has produced every Best Visual Effects nominee at the Academy Awards for 14 years: multiple systems with NVIDIA RTX A6000 GPUs and an NVIDIA Studio laptop — the Razer Blade 15 with a GeForce RTX 3070 Laptop GPU.

The post Feel the Need … for Speed as ‘Top Goose’ Debuts In the NVIDIA Studio appeared first on NVIDIA Blog.

Categories
Misc

Looking for help for hire

I'm both a collector and a coin dealer. I look through tens of thousands of coins a week for rare dates, errors, etc. But as I get older, my eyes are not what they used to be, so it's getting somewhat difficult for me to see the key details on the coins. So I decided to make a setup that can look through coins for me. I've been greatly influenced by this machine, which does everything I want, but I need something a lot smaller.

https://youtu.be/k7okDtRRCcY

I do have a basic background in coding and how it works, but I have little experience with making an AI. I've watched many video tutorials and I now understand clearly how an AI learns. I think the best route is to use Python, TensorFlow, and OpenCV, but I keep getting errors that have been a major roadblock for me.
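For reference, here is a minimal sketch of the kind of TensorFlow transfer-learning classifier such a setup usually starts from. The folder layout, class names, image size, and hyperparameters are placeholder assumptions, not details from this post; it presumes coin photos already sorted into one folder per category.

```python
# Minimal transfer-learning sketch for classifying coin photos.
# Assumes a directory layout like coins/train/<class_name>/*.jpg and
# coins/val/<class_name>/*.jpg; paths and settings are placeholders.
import tensorflow as tf

IMG_SIZE = (224, 224)

train_ds = tf.keras.utils.image_dataset_from_directory(
    "coins/train", image_size=IMG_SIZE, batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "coins/val", image_size=IMG_SIZE, batch_size=32)
num_classes = len(train_ds.class_names)

# Pretrained MobileNetV2 backbone, frozen; only the small head is trained.
base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights="imagenet")
base.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=5)
```

OpenCV would typically sit in front of a pipeline like this, detecting, cropping, and centering each coin before the image is written into its class folder.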

In case it's relevant, my computer setup is a Ryzen 9 5900X, an RTX 3080 GPU, and 64 GB of RAM.

I'm looking for someone who can guide me through installing and training an AI model. I will compensate you for your time, either in money or in collectible coins. By collectible coins I mean good-quality coins, not the cheap coins you pick up from gift shops, but actual pieces of history. I've got silver coins and a ton of English coins from the 1600s-1800s. You can check out my eBay store to get an idea of what I have to offer. https://www.ebay.com/sch/uncommoncentscoins/m.html?_nkw&_armrs=1&_ipg&_from&LH_Complete=1&LH_Sold=1&rt=nc&_trksid=p2046732.m1684

submitted by /u/Ok_Wish4469
[visit reddit] [comments]

Categories
Misc

Where to start with AI image generation?

Hey guys,

I was chatting with an artist today and he was showing me some AI art he created. Basically, he'd create base artwork and then process it through an AI to add some random stylization. I asked him about the process and he was pretty secretive about it, but he mentioned that he uses TensorFlow. He couldn't give any more details.

I’m in love with the idea and I was curious if anyone knew of any sample projects that do something similar, or any resources to get me started?

My background: software dev, but not much in AI
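One well-documented starting point for this kind of stylization (not necessarily what the artist used) is the arbitrary image stylization model on TensorFlow Hub, which re-renders a content image in the style of a second image. The file names below are placeholders.

```python
# Neural style transfer via the TensorFlow Hub arbitrary-image-stylization model.
# "base_artwork.jpg" and "style_reference.jpg" are placeholder file names.
import tensorflow as tf
import tensorflow_hub as hub


def load_image(path, max_dim=512):
    img = tf.io.decode_image(tf.io.read_file(path), channels=3,
                             dtype=tf.float32, expand_animations=False)
    scale = max_dim / tf.reduce_max(tf.cast(tf.shape(img)[:2], tf.float32))
    new_size = tf.cast(tf.cast(tf.shape(img)[:2], tf.float32) * scale, tf.int32)
    img = tf.image.resize(img, new_size)
    return img[tf.newaxis, ...]          # add batch dimension


hub_model = hub.load(
    "https://tfhub.dev/google/magenta/arbitrary-image-stylization-v1-256/2")

content = load_image("base_artwork.jpg")
style = load_image("style_reference.jpg")
stylized = hub_model(tf.constant(content), tf.constant(style))[0]

tf.keras.utils.save_img("stylized.png", stylized[0])
```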

submitted by /u/johnprime
[visit reddit] [comments]

Categories
Misc

Colab gives runtime error: failed to initialize SDL.

Hi,

I was using Google Colab when I ran into this issue. I have all the necessary libraries installed. This is the error message:

RuntimeError                              Traceback (most recent call last)
<ipython-input-13-feca46536a5c> in <module>()
----> 1 env = gym.make('ALE/Breakout-v5', render_mode='human')
      2 env = Recorder(env, './video')

4 frames
/usr/local/lib/python3.7/dist-packages/gym/envs/atari/environment.py in seed(self, seed)
    194         "https://github.com/mgbellemare/Arcade-Learning-Environment#rom-management"
    195     )
--> 196     self.ale.loadROM(getattr(roms, self._game))
    197
    198     if self._game_mode is not None:

RuntimeError: Failed to initialize SDL

Couldn't find any solutions, please help.

Thx
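A commonly suggested workaround for this error, offered here as a hedged note rather than a verified fix: Colab has no display for SDL to attach a window to, so render_mode='human' fails. Rendering off-screen with render_mode='rgb_array' (or setting up a virtual display with pyvirtualdisplay and Xvfb) avoids the SDL initialization entirely.

```python
# Headless-friendly rendering sketch: use "rgb_array" instead of "human"
# so SDL never tries to open a window, then display frames with matplotlib.
import gym
import matplotlib.pyplot as plt

env = gym.make("ALE/Breakout-v5", render_mode="rgb_array")

obs = env.reset()
if isinstance(obs, tuple):          # newer gym versions return (obs, info)
    obs = obs[0]

plt.imshow(obs)                     # Atari observations are RGB frames
plt.axis("off")
plt.show()
```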

submitted by /u/StarLan7
[visit reddit] [comments]

Categories
Misc

Simplify AI Model Development with the Latest TAO Toolkit Release

Boost productivity and model training with new pretrained models and features such as ONNX model weights import, REST APIs, and TensorBoard visualization.

Today, NVIDIA announced the general availability of the latest version of the TAO Toolkit. As a low-code version of the NVIDIA Train, Adapt and Optimize (TAO) framework, the toolkit simplifies and accelerates the creation of AI models for speech and vision AI applications. 

With TAO, developers can use the power of transfer learning to create production-ready models customized and optimized for many use cases, such as detecting defects, translating languages, and managing traffic, without the need for massive amounts of data.

This version boosts developer productivity with new pretrained vision and speech models. It also includes key new features such as ONNX model weights import, REST APIs, and TensorBoard integration. 

Download TAO Toolkit 3.22.05 >>

Release highlights

Deploy TAO Toolkit as-a-Service with REST APIs: Build a new AI service or integrate TAO into an existing one with REST APIs. You can manage and orchestrate the TAO Toolkit service on Kubernetes. With TAO Toolkit as-a-service, IT managers can deliver scalable services using industry-standard APIs.

Bring your own model weights: Fine-tune and optimize your non-TAO models with TAO. Import pretrained weights from ONNX and take advantage of TAO features like pruning and quantization on your own model. This is supported for image classification and segmentation tasks.

Visualize with TensorBoard: Understand your model training performance by visualizing scalars such as training and validation loss, model weights, and predicted images in TensorBoard. Compare results between experiments by changing hyperparameters and choose the one that best fits your needs. A generic sketch of the kind of scalar logging TensorBoard visualizes appears after the pretrained-model list below.

Pretrained models: Pretrained models speed up the customization process for you to fine-tune through the power of transfer learning, with less data. 

Some of the new pretrained models in this latest version can: 

  • Apply data gathered from LIDAR sensors for robotics and automotive applications.
  • Classify human actions based on human poses that can be used in public safety, retail, and worker safety use cases.
  • Estimate keypoints on humans, animals, and objects to help portray actions or simply define the object shape.  
  • Create custom voices with just 30 minutes of recorded data to power smart devices, game characters, and quick service restaurants.
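Returning to the TensorBoard highlight above, the sketch below is a generic TensorFlow illustration of the scalar logs TensorBoard displays; it is not TAO-specific code, and the log directory and loss values are made up (TAO Toolkit writes its own event files during training).

```python
# Generic illustration of the scalar logs TensorBoard visualizes.
# The log directory and loss values below are made up.
import tensorflow as tf

writer = tf.summary.create_file_writer("logs/experiment_1")
fake_history = [(0.92, 1.05), (0.61, 0.84), (0.43, 0.71)]  # (train_loss, val_loss)

with writer.as_default():
    for step, (train_loss, val_loss) in enumerate(fake_history):
        tf.summary.scalar("loss/train", train_loss, step=step)
        tf.summary.scalar("loss/val", val_loss, step=step)

# Then inspect the curves with: tensorboard --logdir logs
```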

Enterprise support for TAO Toolkit is available with NVIDIA AI Enterprise, an end-to-end software suite for AI development and deployment. This new release of TAO Toolkit will be included in the next quarterly update to NVIDIA AI Enterprise.

Get started 

Solutions using TAO Toolkit 

Categories
Misc

Vision in the Making: Andrew Ng’s Startup Automates Factory Inspection

Computer vision specialist Landing AI has a unique calling card: Its co-founder and CEO is a tech rock star. At Google Brain, Andrew Ng became famous for showing how deep learning could recognize cats in a sea of images with uncanny speed and accuracy. Later, he founded Coursera, where his machine learning courses have attracted Read article >

The post Vision in the Making: Andrew Ng’s Startup Automates Factory Inspection appeared first on NVIDIA Blog.

Categories
Misc

Upcoming Event: Join NVIDIA at Automate 2022

Join NVIDIA at Automate 2022, June 6-9, to learn about AI platforms for manufacturing, robotics, and logistics that improve efficiency, scalability, and production across industries.