Categories
Offsites

UVQ: Measuring YouTube’s Perceptual Video Quality

Online video sharing platforms, like YouTube, need to understand perceptual video quality (i.e., a user’s subjective perception of video quality) in order to better optimize and improve user experience. Video quality assessment (VQA) attempts to build a bridge between video signals and perceptual quality by using objective mathematical models to approximate the subjective opinions of users. Traditional video quality metrics, like peak signal-to-noise ratio (PSNR) and Video Multi-Method Assessment Fusion (VMAF), are reference-based and focus on the relative difference between the target and reference videos. Such metrics, which work best on professionally generated content (e.g., movies), assume the reference video is of pristine quality and that one can infer the target video’s absolute quality from the relative difference.

However, the majority of videos uploaded to YouTube are user-generated content (UGC), which brings new challenges due to its remarkably high variability in video content and original quality. Most UGC uploads are non-pristine, and the same amount of relative difference can imply very different perceptual quality impacts. For example, people tend to be less sensitive to the distortions of poor quality uploads than to those of high quality uploads. Thus, reference-based quality scores become inaccurate and inconsistent when used for UGC. Additionally, despite the high volume of UGC, there are currently limited UGC video quality assessment (UGC-VQA) datasets with quality labels. Existing UGC-VQA datasets are either small in size (e.g., LIVE-Qualcomm has 208 samples captured from 54 unique scenes), compared with datasets with millions of samples for classification and recognition (e.g., ImageNet and YouTube-8M), or lack content variability (sampled without considering content information, as with LIVE-VQC and KoNViD-1k).

In “Rich Features for Perceptual Quality Assessment of UGC Videos“, published at CVPR 2021, we describe how we attempt to solve the UGC quality assessment problem by building a Universal Video Quality (UVQ) model that resembles a subjective quality assessment. The UVQ model uses subnetworks to analyze UGC quality from high-level semantic information to low-level pixel distortions, and provides a reliable quality score with rationale (leveraging comprehensive and interpretable quality labels). Moreover, to advance UGC-VQA and compression research, we enhance the open-sourced YouTube-UGC dataset, which contains 1.5K representative UGC samples from millions of UGC videos (distributed under the Creative Commons license) on YouTube. The updated dataset contains ground-truth labels for both original videos and corresponding transcoded versions, enabling us to better understand the relationship between video content and its perceptual quality.

Subjective Video Quality Assessment
To understand perceptual video quality, we leverage an internal crowd-sourcing platform to collect mean opinion scores (MOS) with a scale of 1–5, where 1 is the lowest quality and 5 is the highest quality, for no-reference use cases. We collect ground-truth labels from the YouTube-UGC dataset and categorize UGC factors that affect quality perception into three high-level categories: (1) content, (2) distortions, and (3) compression. For example, a video with no meaningful content won’t receive a high quality MOS. Also, distortions introduced during the video production phase and video compression artifacts introduced during transcoding or transmission by third-party platforms will degrade the overall quality.

Left: A video with no meaningful content receives a low MOS (2.052). Right: A video displaying intense sports action receives a higher MOS (4.457).
Left: A blurry gaming video receives a very low MOS (1.242). Right: A video with professional rendering (high contrast and sharp edges, usually introduced in the video production phase) receives a high MOS (4.522).
Left: A heavily compressed video receives a low MOS (2.372). Right: A video without compression artifacts receives a high MOS (4.646).

We demonstrate that the left gaming video in the second row of the figure above has the lowest MOS (1.2), even lower than the video with no meaningful content. A possible explanation is that viewers may have higher video quality expectations for videos that have a clear narrative structure, like gaming videos, and the blur artifacts significantly reduce the perceptual quality of the video.

UVQ Model Framework
A common method for evaluating video quality is to design sophisticated features, and then map these features to a MOS. However, designing useful handcrafted features is difficult and time-consuming, even for domain experts. Also, the most useful existing handcrafted features were summarized from limited samples, which may not perform well on broader UGC cases. In contrast, machine learning is becoming more prominent in UGC-VQA because it can automatically learn features from large-scale samples.

A straightforward approach is to train a model from scratch on existing UGC quality datasets. However, this may not be feasible because UGC quality datasets are limited in size and number. To overcome this limitation, we apply a self-supervised learning step to the UVQ model during training. This self-supervised step enables us to learn comprehensive quality-related features, without ground-truth MOS, from millions of raw videos.

Following the quality-related categories summarized from the subjective VQA, we develop the UVQ model with four novel subnetworks. The first three subnetworks, which we call ContentNet, DistortionNet and CompressionNet, extract quality features (i.e., content, distortion and compression), and the fourth subnetwork, called AggregationNet, maps the extracted features to a single quality score. ContentNet is trained in a supervised fashion with UGC-specific content labels generated by the YouTube-8M model. DistortionNet is trained to detect common distortions, e.g., Gaussian blur and white noise added to the original frame. CompressionNet focuses on video compression artifacts; its training data are videos compressed at different bitrates. It is trained by feeding two compressed variants of the same content into the model to predict their corresponding compression levels (with a higher score for more noticeable compression artifacts), under the implicit assumption that the higher-bitrate version has the lower compression level.
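The post does not spell out the exact loss used for this pairwise setup, so the following is only a hypothetical PyTorch-style sketch of how such a self-supervised objective could look. The class name, the margin value, and the compression_net placeholder are illustrative assumptions, not the published implementation.

```python
import torch
import torch.nn as nn

class PairwiseCompressionLoss(nn.Module):
    """Encourage the predicted compression level of the lower-bitrate encode
    to exceed that of the higher-bitrate encode of the same content."""
    def __init__(self, margin: float = 0.1):
        super().__init__()
        self.margin = margin

    def forward(self, level_low_bitrate: torch.Tensor, level_high_bitrate: torch.Tensor) -> torch.Tensor:
        # A higher predicted level should mean more noticeable compression artifacts,
        # so the low-bitrate encode should score at least `margin` higher.
        return torch.relu(self.margin - (level_low_bitrate - level_high_bitrate)).mean()

# Hypothetical usage: compression_net maps frame patches to a scalar compression level.
# loss = PairwiseCompressionLoss()(compression_net(patches_low_bitrate),
#                                  compression_net(patches_high_bitrate))
```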

The ContentNet, DistortionNet and CompressionNet subnetworks are trained on large-scale samples without ground-truth quality scores. Since video resolution is also an important quality factor, the resolution-sensitive subnetworks (CompressionNet and DistortionNet) are patch-based (i.e., each input frame is divided into multiple disjoint patches that are processed separately), which makes it possible to capture all detail at native resolution without downscaling. The three subnetworks extract quality features that are then concatenated by the fourth subnetwork, AggregationNet, which is trained to predict quality scores against ground-truth MOS from YouTube-UGC.

The UVQ training framework.
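As a rough illustration of the aggregation step, the sketch below concatenates the three per-subnetwork feature vectors and regresses a score in the 1 to 5 MOS range. The module name, layer sizes, and sigmoid squashing are illustrative assumptions; the post only states that AggregationNet maps the concatenated features to a quality score supervised by YouTube-UGC MOS.

```python
import torch
import torch.nn as nn

class AggregationHead(nn.Module):
    """Toy stand-in for AggregationNet: fuse content, distortion, and
    compression features into a single predicted MOS in [1, 5]."""
    def __init__(self, content_dim: int, distortion_dim: int, compression_dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(content_dim + distortion_dim + compression_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, f_content, f_distortion, f_compression):
        features = torch.cat([f_content, f_distortion, f_compression], dim=-1)
        # Squash to the 1-5 range used for the ground-truth MOS labels.
        return 1.0 + 4.0 * torch.sigmoid(self.mlp(features)).squeeze(-1)
```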

Analyzing Video Quality with UVQ
After building the UVQ model, we use it to analyze the video quality of samples pulled from YouTube-UGC and demonstrate that its subnetworks can provide a single quality score along with high-level quality indicators that can help us understand quality issues. For example, DistortionNet detects multiple visual artifacts, e.g., jitter and lens blur, for the middle video below, and CompressionNet detects that the bottom video has been heavily compressed.

ContentNet assigns content labels with corresponding probabilities in parentheses, i.e., car (0.58), vehicle (0.42), sports car (0.32), motorsports (0.18), racing (0.11).
DistortionNet detects and categorizes multiple visual distortions with corresponding probabilities in parentheses, i.e., jitter (0.112), color quantization (0.111), lens blur (0.108), denoise (0.107).
CompressionNet detects a high compression level of 0.892 for the video above.

Additionally, UVQ can provide patch-based feedback to locate quality issues. Below, UVQ reports that the quality of the first patch (patch at time t = 1) is good with a low compression level. However, the model identifies heavy compression artifacts in the next patch (patch at time t = 2).

Patch at time t = 1: compression level = 0.000. Patch at time t = 2: compression level = 0.904.
UVQ detects a sudden quality degradation (high compression level) for a local patch.

In practice, UVQ can generate a video diagnostic report that includes a content description (e.g., strategy video game), distortion analysis (e.g., the video is blurry or pixelated) and compression level (e.g., low or high compression). Below, UVQ reports that the content quality, looking at individual features, is good, but the compression and distortion quality is low. When combining all three features, the overall quality is medium-low. We see that these findings are close to the rationale summarized by internal user experts, demonstrating that UVQ can reason through quality assessments, while providing a single quality score.

UVQ diagnostic report. ContentNet (CT): Video game, strategy video game, World of Warcraft, etc. DistortionNet (DT): multiplicative noise, Gaussian blur, color saturation, pixelate, etc. CompressionNet (CP): 0.559 (medium-high compression). Predicted quality score in [1, 5]: (CT, DT, CP) = (3.901, 3.216, 3.151), (CT+DT+CP) = 3.149 (medium-low quality).

Conclusion
We present the UVQ model, which generates a report with quality scores and insights that can be used to interpret UGC video perceptual quality. UVQ learns comprehensive quality-related features from millions of UGC videos and provides a consistent view of quality interpretation for both no-reference and reference cases. To learn more, read our paper or visit our website to see YT-UGC videos and their subjective quality data. We also hope that the enhanced YouTube-UGC dataset enables more research in this space.

Acknowledgements
This work was possible through a collaboration spanning several Google teams. Key contributors include: Balu Adsumilli, Neil Birkbeck, Joong Gon Yim from YouTube and Junjie Ke, Hossein Talebi, Peyman Milanfar from Google Research. Thanks to Ross Wolf, Jayaprasanna Jayaraman, Carena Church, and Jessie Lin for their contributions.

Categories
Misc

Learn How Leading Companies Are Building AI Centers of Excellence, at NVIDIA GTC

AI Centers of Excellence are organizational units dedicated to implementing a company-wide AI vision. They help identify business use cases, create an implementation roadmap, accelerate adoption, assess impact and more. NVIDIA GTC, a global conference on AI and the metaverse, brings together the world’s top business and technology leaders who’ve embraced artificial intelligence to transform …

Categories
Misc

Explore the Future of Robotics at GTC 2022

Discover the latest innovations in AI and robotics, and hear world-renowned roboticist Dr. Henrik Christensen talk about the future of robotics.

Categories
Misc

Predict, Detect, Mitigate: AI for Climate Science Takes the Stage at NVIDIA GTC

Recent AI advances enable modeling of weather forecasting 4-5 magnitudes faster than traditional computing methods. The brightest leaders, researchers and developers in climate science, high performance computing and AI will discuss such technology breakthroughs — and how they can help foster a greener Earth — at NVIDIA GTC. The virtual conference, running Sept. 19-22, also …

Categories
Misc

Shelter From the Storm: AI Helps Gauge Catastrophe Risks

Floods in Kentucky and wildfires in California are the kinds of disasters companies of all sorts are trying to address with AI. Tom Rikert, co-founder and CEO of San Francisco-based startup Masterful AI, is one of many experts helping them manage catastrophe risk. In the U.S. alone, the National Association of Insurance Commissioners estimates that …

Categories
Misc

3D Artists Reimagine, Remaster Iconic European Architecture This Week ‘In the NVIDIA Studio’

A triple threat steps In the NVIDIA Studio this week: a tantalizing trio of talented 3D artists who each reimagined and remastered classic European buildings with individualistic flair.

Categories
Misc

Leveraging AI Music with NVIDIA DGX-2

Language models such as the NVIDIA Megatron-LM and OpenAI GPT-2 and GPT-3 have been used to enhance human productivity and creativity. Specifically, these models have been used as powerful tools for writing, programming, and painting. The same architecture can be used for music composition.

Large datasets are required to use language models in these domains. For language generation, starting with 50 GB of uncompressed text files is not unusual. This implies the need for a lot of GPU compute to train the models effectively and to support rapid development, prototyping, and iteration.

This post provides an account of a series of experiments performed in the field of AI music using the NVIDIA DGX-2 platform. DGX-2 boosted progress significantly in both data preprocessing and training language models.

Datasets for AI music

There are two major classes of datasets for computational music. One approach involves training on music represented as raw audio (WAV or MP3 files). The second approach does not work with audio at all. Instead, anything that resembles sheet music is mapped to a token representation.

Usually, this requires tokens for which note starts (C, D, E, F, or G, for example), how much time passes (a quarter note or an eighth note, for example), and which note ends. In research and application, MIDI files have proven to be fruitful sources of musical material. The MIDI standard was designed to store music information electronically.

These experiments used several sets of MIDI files, including the JS Fake Chorales Dataset, the Lakh MIDI Dataset (and its Clean subset), and the MetaMIDI Dataset.

Video 1. AI music composed using a GPT trained on the MetaMIDI Dataset

The MIDI format is a non-human-readable representation of music which, in order to train a causal language model, has to be mapped to a readable token representation. For this representation, we took inspiration from the mmmtrack encoding.

This encoding represents pieces of music as a hierarchy. A piece of music consists of different tracks for different instruments: drums, guitars, bass, and piano, for example. Each track consists of several bars (4, 8, or 16 bars, depending on the use case). And each bar holds a sequence of note-on, time-delta, and note-off events. Although this hierarchy can be considered a tree, it is possible to encode everything as a linear sequence, making it an ideal representation for decoder-only language models.

The example below is a four-part chorale in its piano roll representation. A chorale features four voices: soprano, alto, tenor, and bass. Soprano and alto are female voices, and tenor and bass are male voices. Usually, all four voices sing at the same time but with different, harmonic pitches. 

Figure 1 visualizes the voices with pitch color coding. The soprano is green, the alto is orange, the tenor is blue, and the bass is red. You can encode these musical events—which have both a time and a pitch dimension—to a sequence of tokens.

Figure 1. A sample of generated music tokens visualized with pitch color coding

Following the mmmtrack encoding, the bass part would be mapped to the following token representation:

PIECE_START TRACK_START INST=BASS BAR_START NOTE_ON=61 TIME_DELTA=4 NOTE_OFF=61 NOTE_ON=59 TIME_DELTA=2 NOTE_OFF=59 NOTE_ON=58 TIME_DELTA=2 NOTE_OFF=58 NOTE_ON=56 TIME_DELTA=4 NOTE_OFF=56 NOTE_ON=54 TIME_DELTA=4 NOTE_OFF=54 BAR_END BAR_START NOTE_ON=59 TIME_DELTA=2 NOTE_OFF=59 NOTE_ON=58 TIME_DELTA=2 NOTE_OFF=58 NOTE_ON=56 TIME_DELTA=4 NOTE_OFF=56 NOTE_ON=58 TIME_DELTA=4 NOTE_OFF=58 NOTE_ON=59 TIME_DELTA=4 NOTE_OFF=59 BAR_END BAR_START NOTE_ON=58 TIME_DELTA=4 NOTE_OFF=58 NOTE_ON=59 TIME_DELTA=2 NOTE_OFF=59 NOTE_ON=61 TIME_DELTA=2 NOTE_OFF=61 NOTE_ON=63 TIME_DELTA=2 NOTE_OFF=63 NOTE_ON=52 TIME_DELTA=2 NOTE_OFF=52 NOTE_ON=54 TIME_DELTA=4 NOTE_OFF=54 BAR_END BAR_START NOTE_ON=47 TIME_DELTA=4 NOTE_OFF=47 NOTE_ON=49 TIME_DELTA=2 NOTE_OFF=49 NOTE_ON=51 TIME_DELTA=2 NOTE_OFF=51 NOTE_ON=52 TIME_DELTA=2 NOTE_OFF=52 NOTE_ON=54 TIME_DELTA=2 NOTE_OFF=54 NOTE_ON=56 TIME_DELTA=4 NOTE_OFF=56 BAR_END TRACK_END TRACK_START INST=TENOR …

With a little practice, humans can read and understand this representation. The representation starts with PIECE_START indicating the start of a piece of music. TRACK_START indicates the beginning and TRACK_END the end of a track (or instrument or voice). The INST=BASS token denotes that this track contains the bass voice. Other voices are represented the same way. BAR_START and BAR_END represent the beginning and the end of a bar, respectively. NOTE_ON=61 is the start of a note with pitch 61. 

On the piano, this would be the note C#5. TIME_DELTA=4 means that a duration of four sixteenth notes would elapse. That would be a quarter note. After that, the note would end, represented by NOTE_OFF=61. And so on. Note that this notation also allows for polyphony: several tracks can sound notes at the same time, and each track can hold overlapping notes. This makes the encoding universal.
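To make the mechanics concrete, here is a small, self-contained Python sketch that decodes a token string in this format back into (pitch, start, duration) tuples, counting time in sixteenth-note steps. It is an illustrative reader for the encoding described above, not the preprocessing code used in these experiments.

```python
def decode_track(tokens: str):
    """Decode one track's tokens into (pitch, start, duration) note tuples.
    Time is counted in sixteenth-note steps, matching the TIME_DELTA tokens."""
    notes, open_notes, time = [], {}, 0
    for token in tokens.split():
        if token.startswith("NOTE_ON="):
            pitch = int(token.split("=")[1])
            open_notes[pitch] = time                # remember when the note began
        elif token.startswith("TIME_DELTA="):
            time += int(token.split("=")[1])        # advance the clock
        elif token.startswith("NOTE_OFF="):
            pitch = int(token.split("=")[1])
            start = open_notes.pop(pitch)
            notes.append((pitch, start, time - start))
        # PIECE_START, TRACK_START, INST=..., BAR_START, BAR_END carry no timing.
    return notes

example = "BAR_START NOTE_ON=61 TIME_DELTA=4 NOTE_OFF=61 NOTE_ON=59 TIME_DELTA=2 NOTE_OFF=59 BAR_END"
print(decode_track(example))  # [(61, 0, 4), (59, 4, 2)]
```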

Each piece of music differs in the number of bars. It is quite possible that encoding an entire song would require a long sequence length, making training of the corresponding Transformer computationally expensive. These experiments encode most of the datasets with four bars and a few with eight bars. Experiments with 16 bars are underway. In addition, only music in a 4/4 time meter was used. This covers the better part of Western music. Other meters, such as 3/4 (waltz), can be the subject of future work.

Across this sequence of experiments, many MIDI datasets were mapped to the described token format, using the same preprocessor throughout. Once the preprocessor worked on small datasets, it immediately worked on larger ones.

The processing time depends on the number of MIDI files to be encoded, ranging from a few minutes to many hours. The longest preprocessing took 30 hours on DGX-2 running on all 96 CPUs in parallel. It is estimated that this would take about 10-14 days of processing on a state-of-the-art MacBook Pro.
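The preprocessing itself is embarrassingly parallel: each MIDI file can be encoded independently. A minimal sketch of how such a job could be fanned out across the DGX-2’s CPU threads with Python’s multiprocessing module is shown below; encode_midi_file is a hypothetical placeholder for the actual encoder, and the paths are illustrative.

```python
from multiprocessing import Pool
from pathlib import Path

def encode_midi_file(path: Path) -> str:
    """Placeholder for the real encoder: read one MIDI file and return its
    token representation (PIECE_START ... TRACK_END) as a single line."""
    ...

def preprocess(midi_dir: str, out_file: str, workers: int = 96) -> None:
    midi_paths = sorted(Path(midi_dir).rglob("*.mid"))
    with Pool(processes=workers) as pool:
        token_lines = pool.map(encode_midi_file, midi_paths)
    with open(out_file, "w") as f:
        for line in token_lines:
            if line:                     # skip files the encoder rejected
                f.write(line + "\n")

# preprocess("metamidi/", "metamidi_tokens.txt")
```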

Figure 2. Music datasets used for training the GPT models

Encoding a dataset of MIDI files would yield a collection of token files. The size of those token files depends on the number of MIDI files and the number of bars. Consider some of the experiment datasets and their encoded dataset sizes:

  • JS Fake Chorales Dataset: 14 MB with four bars per sample
  • The Lakh MIDI Dataset: 72 GB, its Clean subset 19 GB with four bars per sample
  • The MetaMIDI Dataset: 130 GB with four bars and 230 GB with eight bars per sample

You can imagine that training on the 14 MB of JS Fake Chorales would take just a few hours. Training on the MetaMIDI Dataset with its 130 GB would take many days. Training for these experiments lasted between 10 and 15 days.

Model training

Many models were trained using the HuggingFace GPT-2 implementation. A few models were trained using the NVIDIA Megatron-LM in GPT-2 mode. 

Training with HuggingFace boiled down to uploading the dataset to the DGX-2 and then running a training script that contained all functionality, including the model and training parameters. The same script was used for all our datasets, with just a few changes here and there. It was just a matter of scale.

For Megatron-LM, the environment setup is as easy as pulling and running an NGC PyTorch Docker container, then getting to work immediately with a Jupyter notebook in the browser through ssh tunneling into the DGX-2 machine.

Most of the experiments used the same GPT-2 architecture: six decoder blocks and eight attention heads; the embedding size was 512, and the sequence length was 2048. Although this is definitely not a Large Language Model (LLM), which can have around 100 decoder blocks, subjective evaluation showed that for AI music this architecture works like a charm.
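For the HuggingFace runs, a minimal training script along these lines would cover the setup described above. This sketch is not the script used in the experiments: the tokenizer file, dataset path, pad token, batch size, and other training arguments are illustrative placeholders, while the architecture values match the numbers quoted above.

```python
from datasets import load_dataset
from transformers import (GPT2Config, GPT2LMHeadModel, PreTrainedTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Assumes a tokenizer has already been trained on the music-token vocabulary
# and that "<pad>" is part of that vocabulary.
tokenizer = PreTrainedTokenizerFast(tokenizer_file="music_tokenizer.json")
tokenizer.pad_token = "<pad>"

config = GPT2Config(
    vocab_size=len(tokenizer),
    n_positions=2048,   # sequence length
    n_embd=512,         # embedding size
    n_layer=6,          # decoder blocks
    n_head=8,           # attention heads
)
model = GPT2LMHeadModel(config)

dataset = load_dataset("text", data_files={"train": "lakh_tokens.txt"})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="music-gpt2", per_device_train_batch_size=8,
                           num_train_epochs=10, fp16=True),
    train_dataset=dataset["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```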

Using the NVIDIA DGX-2 really made a difference in rapid iteration. Datasets that would train for multiple days on a single GPU would train for just a few hours on DGX-2. Datasets that would train for months on a single GPU finished training after two weeks at most on DGX-2, which mattered most for experiments with the largest datasets.

Training times for some of the datasets were as follows:

  • The Lakh MIDI Clean Dataset took 15 hours for 10 epochs and roughly 15K songs
  • The Lakh MIDI Dataset took 130 hours for 10 epochs and roughly 175K songs
  • The MetaMIDI Dataset took 290 hours for 9 epochs and roughly 400K songs

Note that the JS Fake Chorales dataset was trained earlier and not on the DGX-2. Due to its very small size, it was not necessary to use a multi-GPU setup. It could even be trained overnight on a MacBook Pro.

NVIDIA DGX-2

This section provides a closer look at the NVIDIA DGX-2 specifications. As mentioned above, the platform is very effective both for accelerating dataset preprocessing and for training language models. This section will be a delightfully technical one.

Figure 3. DGX-2 station 

NVIDIA DGX-2 is a powerful system with 16 fully connected Tesla V100 32 GB GPUs using NVSwitch. It is capable of delivering 2.4 TB/sec of bisection bandwidth. DGX-2 has been designed for AI researchers and data scientists who need both performance and scalability. 

For transformer models, NVIDIA DGX-2 is able to deliver up to 517,227 tokens per second throughput with mixed precision, making it especially powerful.

Table 1. Multi-GPU performance for DGX-2: throughput with FP32 and mixed precision across varying numbers of GPUs and batch sizes

Software framework: NVIDIA Megatron-LM

To get the most out of powerful compute, you need stable and optimized software. With a performance-optimized framework such as NVIDIA Megatron-LM, performance is scaled almost linearly as the GPT model sizes are scaled. For related information, see Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.

A baseline is achieved by training a model of 1.2 billion parameters on a single NVIDIA V100 32 GB GPU, which sustains 39 teraflops. This is 30% of the theoretical peak flops for a single GPU as configured in a DGX-2H server, and is thus a strong baseline.

Scaling the model to 8.3 billion parameters on 512 GPUs with 8-way model parallelism achieved up to 15.1 petaflops sustained over the entire application. This is 76% scaling efficiency compared to the single GPU case.
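These two figures are consistent with each other; a quick back-of-the-envelope check using only the numbers quoted above recovers the stated scaling efficiency.

```python
baseline_tflops = 39.0   # single V100, 1.2-billion-parameter model
total_pflops    = 15.1   # sustained on 512 GPUs, 8.3-billion-parameter model
num_gpus        = 512

per_gpu_tflops = total_pflops * 1000 / num_gpus      # ~29.5 teraflops per GPU
efficiency     = per_gpu_tflops / baseline_tflops    # ~0.76
print(f"{per_gpu_tflops:.1f} TFLOPs per GPU, {efficiency:.0%} scaling efficiency")
```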

Figure 4. Near-linear scaling with NVIDIA Megatron-LM: achieved petaflops vs. number of GPUs, scaling up to thousands of GPUs and model sizes up to 1 trillion parameters

By fixing the sequence length (seq_len) to 4,096, modifying the training configurations, and launching training runs with only a few iterations, it is possible to calculate the percentage of peak teraflops achieved in real application job runs.
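The referenced equation estimates the number of floating-point operations per training iteration from the batch and model dimensions, so dividing by the measured iteration time and GPU count gives the achieved teraflops. The helper below sketches that calculation using the commonly cited form of the equation; the batch size, iteration time, and model dimensions in the example call are illustrative only and are not the configurations from Table 2.

```python
def flops_per_iteration(batch, seq_len, layers, hidden, vocab):
    """FLOPs per training iteration, following the commonly cited form of
    Eq. 3 in 'Efficient Large-Scale Language Model Training on GPU Clusters
    Using Megatron-LM': 96 * B * s * l * h^2 * (1 + s/(6h) + V/(16*l*h))."""
    return 96 * batch * seq_len * layers * hidden ** 2 * (
        1 + seq_len / (6 * hidden) + vocab / (16 * layers * hidden)
    )

def achieved_tflops_per_gpu(iter_seconds, num_gpus, **model_dims):
    return flops_per_iteration(**model_dims) / iter_seconds / num_gpus / 1e12

# Illustrative numbers only: an 8-GPU run with seq_len fixed to 4,096.
tflops = achieved_tflops_per_gpu(
    iter_seconds=4.0, num_gpus=8,
    batch=32, seq_len=4096, layers=24, hidden=2048, vocab=50257)
print(f"~{tflops:.1f} achieved TFLOPs per GPU")
```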

After an initial run, both the nvidia-smi output and the Nsight profile were analyzed. Different configurations were tested to obtain the highest possible teraflops, as the table below illustrates:

Table 2. Teraflops calculation using the third equation presented in Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

The training configuration presented in the last row of the table delivered the highest utilization, 45.45% of peak teraflops.

Note that eight V100 32 GB GPUs were used instead of 16 to shorten the time it takes to run each profiling job. The nvidia-smi command was used to verify the training configuration that achieved 45.45% teraflops utilization, as illustrated below.

Figure 5. Training performance was interactively monitored through the use of nvidia-smi commands

Summary

The AI music experiments presented here were performed using the NVIDIA DGX-2. We trained language models using datasets ranging from just a few megabytes in size to 230 GB. We used the HuggingFace GPT-2 implementation and showed that NVIDIA Megatron-LM is also a great alternative for experimentation.

NVIDIA DGX-2 made a significant difference in accelerating dataset preprocessing—mapping MIDI files to a token representation—and training models. This allowed for rapid experimentation. DGX-2 worked like a charm when it came to training the largest MIDI dataset available (MetaMIDI with 400K files).

Categories
Misc

DLI Courses: Enhance Your Skills with Hands-On Training at GTC

Select from 20 hands-on workshops, offered at GTC, available in multiple languages and time zones. Early bird pricing of just $99 ends Aug 29 (regular $500).

Categories
Misc

Get Your NVIDIA RTX Games Ready for Epic MegaJam

Create a project in Unreal Engine and submit it by September 1 for your chance to win an NVIDIA GeForce RTX 3080 GPU.

Categories
Misc

An AI-Enabled Drone Could Soon Become Every Rhino Poacher’s… Horn Enemy

Watching out for the nearly-extinct two-ton beasts may be the ultimate example of a job best done remotely.
