Categories
Offsites

MUSIQ: Assessing Image Aesthetic and Technical Quality with Multi-scale Transformers

Understanding the aesthetic and technical quality of images is important for providing a better user visual experience. Image quality assessment (IQA) uses models to build a bridge between an image and a user’s subjective perception of its quality. In the deep learning era, many IQA approaches, such as NIMA, have achieved success by leveraging the power of convolutional neural networks (CNNs). However, CNN-based IQA models are often constrained by the fixed-size input requirement in batch training, i.e., the input images need to be resized or cropped to a fixed shape. This preprocessing is problematic for IQA because images can have very different aspect ratios and resolutions. Resizing and cropping can impact image composition or introduce distortions, thus changing the quality of the image.

In CNN-based models, images need to be resized or cropped to a fixed shape for batch training. However, such preprocessing can alter the image aspect ratio and composition, thus impacting image quality. Original image used under CC BY 2.0 license.

In “MUSIQ: Multi-scale Image Quality Transformer”, published at ICCV 2021, we propose a patch-based multi-scale image quality transformer (MUSIQ) to bypass the CNN constraints on fixed input size and predict the image quality effectively on native-resolution images. The MUSIQ model supports the processing of full-size image inputs with varying aspect ratios and resolutions and allows multi-scale feature extraction to capture image quality at different granularities. To support positional encoding in the multi-scale representation, we propose a novel hash-based 2D spatial embedding combined with an embedding that captures the image scaling. We apply MUSIQ on four large-scale IQA datasets, demonstrating consistent state-of-the-art results across three technical quality datasets (PaQ-2-PiQ, KonIQ-10k, and SPAQ) and comparable performance to that of state-of-the-art models on the aesthetic quality dataset AVA.

The patch-based MUSIQ model can process the full-size image and extract multi-scale features, which better aligns with a person’s typical visual response.

In the following figure, we show a sample of images, their MUSIQ scores, and, in brackets, their mean opinion scores (MOS) from multiple human raters. The scores range from 0 to 100, with 100 being the highest perceived quality. As the figure shows, MUSIQ predicts high scores for images with both high aesthetic and high technical quality, and low scores for images that are not aesthetically pleasing (low aesthetic quality) or that contain visible distortions (low technical quality).

High quality: 76.10 [74.36], 69.29 [70.92]
Low aesthetic quality: 55.37 [53.18], 32.50 [35.47]
Low technical quality: 14.93 [14.38], 15.24 [11.86]
Predicted MUSIQ score (and ground truth) on images from the KonIQ-10k dataset. Top: MUSIQ predicts high scores for high quality images. Middle: MUSIQ predicts low scores for images with low aesthetic quality, such as images with poor composition or lighting. Bottom: MUSIQ predicts low scores for images with low technical quality, such as images with visible distortion artifacts (e.g., blurry, noisy).

The Multi-scale Image Quality Transformer
MUSIQ tackles the challenge of learning IQA on full-size images. Unlike CNN-based models, which are often constrained to a fixed resolution, MUSIQ can handle inputs with arbitrary aspect ratios and resolutions.

To accomplish this, we first make a multi-scale representation of the input image, containing the native resolution image and its resized variants. To preserve the image composition, we maintain its aspect ratio during resizing. After obtaining the pyramid of images, we then partition the images at different scales into fixed-size patches that are fed into the model.

Illustration of the multi-scale image representation in MUSIQ.
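
As a rough sketch of this step (not the authors’ implementation; the scale sizes and the 32-pixel patch size below are illustrative choices), the pyramid construction and patch extraction could look like this:

import numpy as np
from PIL import Image

def multiscale_patches(path, longer_sides=(None, 384, 224), patch=32):
    """Build an aspect-ratio-preserving pyramid and cut every level into
    fixed-size patches. `None` keeps the native resolution."""
    img = Image.open(path).convert("RGB")
    levels = []
    for side in longer_sides:
        if side is None:
            resized = img
        else:
            scale = side / max(img.size)                 # preserve aspect ratio
            resized = img.resize((max(1, round(img.width * scale)),
                                  max(1, round(img.height * scale))))
        arr = np.asarray(resized)
        # pad so the image tiles evenly into patch x patch squares
        pad_h, pad_w = -arr.shape[0] % patch, -arr.shape[1] % patch
        arr = np.pad(arr, ((0, pad_h), (0, pad_w), (0, 0)))
        gh, gw = arr.shape[0] // patch, arr.shape[1] // patch
        patches = (arr.reshape(gh, patch, gw, patch, 3)
                      .transpose(0, 2, 1, 3, 4)
                      .reshape(gh * gw, patch, patch, 3))
        levels.append(patches)
    return levels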

Since patches come from images of varying resolutions, we need to effectively encode the multi-aspect-ratio, multi-scale input into a sequence of tokens that captures pixel, spatial, and scale information. To achieve this, we design three encoding components in MUSIQ: 1) a patch encoding module to encode patches extracted from the multi-scale representation; 2) a novel hash-based spatial embedding module to encode the 2D spatial position of each patch; and 3) a learnable scale embedding to encode the different scales. In this way, we can effectively encode the multi-scale input as a sequence of tokens, which serves as the input to the Transformer encoder.
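
The hash-based 2D spatial embedding can be pictured as follows (our reading of the idea, not the released code; the 10×10 grid size is an illustrative choice): each patch position is hashed to a cell of a fixed grid of learnable embeddings, so patch grids of any shape and aspect ratio share one embedding table.

import numpy as np

def hse_indices(num_rows, num_cols, grid=10):
    """Map each (i, j) position in a num_rows x num_cols patch grid to a cell
    of a fixed grid x grid table of learnable position embeddings."""
    rows = np.floor(np.arange(num_rows) * grid / num_rows).astype(int)
    cols = np.floor(np.arange(num_cols) * grid / num_cols).astype(int)
    ii, jj = np.meshgrid(rows, cols, indexing="ij")
    return ii.ravel() * grid + jj.ravel()    # flat indices into a (grid*grid, dim) table

# patches from a 7 x 12 grid and a 21 x 9 grid index into the same 10 x 10 table
print(hse_indices(7, 12).shape, hse_indices(21, 9).shape)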

To predict the final image quality score, we use the standard approach of prepending an additional learnable “classification token” (CLS). The CLS token state at the output of the Transformer encoder serves as the final image representation. We then add a fully connected layer on top to predict the image quality score. The figure below provides an overview of the MUSIQ model.

Overview of MUSIQ. The multi-scale multi-resolution input will be encoded by three components: the scale embedding (SCE), the hash-based 2D spatial embedding (HSE), and the multi-scale patch embedding (MPE).

Since MUSIQ only changes the input encoding, it is compatible with any Transformer variants. To demonstrate the effectiveness of the proposed method, in our experiments we use the classic Transformer with a relatively lightweight setting so that the model size is comparable to ResNet-50.

Benchmark and Evaluation
To evaluate MUSIQ, we run experiments on multiple large-scale IQA datasets. On each dataset, we report the Spearman’s rank correlation coefficient (SRCC) and Pearson linear correlation coefficient (PLCC) between the model predictions and the human evaluators’ mean opinion scores. SRCC and PLCC are correlation metrics ranging from -1 to 1; higher values mean better alignment between model predictions and human evaluations. The graph below shows that MUSIQ outperforms other methods on PaQ-2-PiQ, KonIQ-10k, and SPAQ.

Performance comparison of MUSIQ and previous state-of-the-art (SOTA) methods on four large-scale IQA datasets. On each dataset we compare the Spearman’s rank correlation coefficient (SRCC) and Pearson linear correlation coefficient (PLCC) of model prediction and ground truth.
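
To make the two metrics concrete, here is how they can be computed with SciPy (the values below are illustrative, taken from the earlier score figure rather than a full evaluation):

from scipy.stats import spearmanr, pearsonr

mos  = [74.36, 53.18, 14.38, 35.47, 70.92]   # human mean opinion scores
pred = [76.10, 55.37, 14.93, 32.50, 69.29]   # model predictions

srcc, _ = spearmanr(mos, pred)   # rank correlation: robust to monotonic rescaling
plcc, _ = pearsonr(mos, pred)    # linear correlation
print(f"SRCC={srcc:.3f}  PLCC={plcc:.3f}")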

Notably, the PaQ-2-PiQ test set is composed entirely of large pictures with at least one dimension exceeding 640 pixels. This is very challenging for traditional deep learning approaches, which require resizing. MUSIQ outperforms previous methods by a large margin on this full-size test set, which verifies its robustness and effectiveness.

It is also worth mentioning that previous CNN-based methods often required sampling as many as 20 crops from each image during testing. This kind of multi-crop ensemble mitigates the fixed-shape constraint of CNN models, but since each crop is only a sub-view of the whole image, the ensemble remains an approximation. Moreover, the ensemble adds inference cost for every crop and, because the crops are sampled randomly, it can introduce randomness into the result. In contrast, because MUSIQ takes the full-size image as input, it can directly learn the best aggregation of information across the full image and only needs to run inference once.

To further verify that the MUSIQ model captures different information at different scales, we visualize the attention weights on each image at different scales.

Attention visualization from the output tokens to the multi-scale representation, including the original resolution image and two proportionally resized images. Brighter areas indicate higher attention, which means that those areas are more important for the model output. Images for illustration are taken from the AVA dataset.

We observe that MUSIQ tends to focus on more detailed areas in the full, high-resolution images and on more global areas in the resized ones. For example, for the flower photo above, the model’s attention on the original image focuses on the petal details, and the attention shifts to the buds at lower resolutions. This shows that the model learns to capture image quality at different granularities.

Conclusion
We propose a multi-scale image quality transformer (MUSIQ), which can handle full-size image input with varying resolutions and aspect ratios. By transforming the input image to a multi-scale representation with both global and local views, the model can capture the image quality at different granularities. Although MUSIQ is designed for IQA, it can be applied to other scenarios where task labels are sensitive to image resolution and aspect ratio. The MUSIQ model and checkpoints are available at our GitHub repository.

Acknowledgements
This work is made possible through a collaboration spanning several teams across Google. We’d like to acknowledge contributions from Qifei Wang, Yilin Wang and Peyman Milanfar.

Categories
Misc

Upcoming Workshop: Fundamentals of Deep Learning

Explore deep learning with hands-on exercises in computer vision and NLP in this online instructor-led workshop.

Categories
Misc

Building an Automatic Speech Recognition Model for the Kinyarwanda Language

Speech recognition technology is growing in popularity for voice assistants and robotics, for solving real-world problems through assisted healthcare or education, and more. This is helping democratize access to speech AI worldwide. As labeled datasets for unique, emerging languages become more widely available, developers can build AI applications readily, accurately, and affordably to enhance technology developments and experiences for their native regions.

Kinyarwanda is the native language of 9.8 million people in Rwanda, Uganda, DR Congo, and Tanzania with over 20 million total speakers across the globe. 

In April 2022, Mozilla Common Voice (MCV), a crowdsourced project aimed at making voice recognition open and accessible to everyone, made a significant contribution to building the Kinyarwanda dataset, as detailed in the article, Lessons from Building for Kinyarwanda on Common Voice. It is a 57 GB dataset with 2,000+ hours of audio, making it the largest dataset on the MCV platform.

To bring the value of this effort and dataset to developers, automatic speech recognition (ASR) models were trained on it, and the published checkpoints achieve state-of-the-art performance.

This post provides an overview of the training process using the NeMo ASR toolkit. It briefly covers challenges with the dataset, converting characters to longer units using byte-pair encoding, and the training process for improved model performance. Developers can refer to the step-by-step tutorial on GitHub for the reference code and details.

Obtaining the dataset

MCV has the largest publicly available multi-language dataset. You can download language-specific datasets from the Mozilla Common Voice Hub.

In the Kinyarwanda dataset used for the model, there are 1,404,853 sentences that are pre-split into train/dev/test data. Each entry in the dataset consists of a unique MP3 file and corresponding information, such as the file name, transcription, and other metadata, in TSV format.

NeMo ASR requires data that includes a set of utterances in individual audio files plus a manifest that describes the dataset, with information about one utterance per line.

Once the dataset is downloaded, the TSV file for the training split is converted to a JSON manifest and the MP3 files are converted to WAV files, which are the recommended formats for the NeMo toolkit. The same steps are then repeated for the test and dev splits.

The manifest format is provided below:

{"audio_filepath": "/path/to/audio.wav", "text": "the transcription of the utterance", "duration": 23.147}

Data preprocessing

Before training the model, the data requires preprocessing to reduce ambiguity and inconsistencies and to make it easy to interpret. The preprocessing steps for this model are listed below (a sketch of one possible implementation follows the list):

  • Replace all punctuation with a space (except for apostrophes)
  • Replace the different types of apostrophes ([’’‘`ʽ’]) with a single standard apostrophe
  • Make all text lowercase for consistency
  • Replace rare characters with diacritics by their base letters ([éèëēê] → e, for example)
  • Delete all remaining out-of-vocabulary characters, that is, everything other than Latin letters, spaces, and apostrophes
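
A sketch of these steps (illustrative only; the exact character sets, especially the diacritics map, are simplified compared to the tutorial):

import re

DIACRITICS = {"e": "éèëēê", "a": "áàâä", "i": "íìîï", "o": "óòôö", "u": "úùûü"}

def preprocess(text: str) -> str:
    text = re.sub(r"[’‘`ʽ]", "'", text)        # normalize apostrophe variants
    text = re.sub(r"[^\w\s']", " ", text)      # punctuation -> space (keep apostrophes)
    text = text.lower()
    for base, accented in DIACRITICS.items(): # strip diacritics, e.g. é -> e
        text = re.sub(f"[{accented}]", base, text)
    text = re.sub(r"[^a-z' ]", "", text)       # drop remaining out-of-vocabulary chars
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Muraho! ‘Mwaramutse’ — 05:30"))   # -> "muraho 'mwaramutse'"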

Because 99% of the dataset has an audio duration of 11 seconds or shorter, it is suggested to restrict the maximum audio duration to 11 seconds during preprocessing for faster training.

The final Kinyarwanda transcript consists of sentences with Latin letters, spaces, and apostrophes after preprocessing.

Subword tokenization 

It is possible to train character-based ASR models but they will regard each letter as a separate token, taking more time to generate the output. Using longer units improves both quality and speed. 

This process involves a tokenization algorithm called byte-pair encoding that splits words into subtokens and marks the beginning of the word with a special symbol so it’s easy to restore the original words.
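
NeMo ships its own tokenizer-training script, but the idea can be illustrated with the sentencepiece library (the file names, vocabulary size, and sample output below are illustrative assumptions):

import sentencepiece as spm

# train a BPE tokenizer on the preprocessed transcripts (one sentence per line)
spm.SentencePieceTrainer.train(input="transcripts.txt",
                               model_prefix="kinyarwanda_bpe",
                               vocab_size=1024, model_type="bpe")

sp = spm.SentencePieceProcessor(model_file="kinyarwanda_bpe.model")
print(sp.encode("mwaramutse neza", out_type=str))
# e.g. ['▁mwara', 'mutse', '▁neza'] -- words split into subtokens, with the
# word boundary marked by '▁' so the original text can be restored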

To make the process easier, the NeMo toolkit supports on-the-fly subword tokenization by passing the tokenizer through the model config, so there is no need to modify the transcripts. This does not affect model performance and potentially helps the model adapt to other domains without retraining the tokenizer.

Visit NVIDIA/NeMo on GitHub for a detailed description and tutorial on subword tokenization for NeMo ASR.

Training models

Two approaches were used to train the models. The first involves training models from scratch using two architectures: Conformer-CTC and Conformer-Transducer. The second involves fine-tuning the Kinyarwanda Conformer-Transducer model from different pretrained checkpoints.

To train a Conformer-CTC model, use speech_to_text_ctc_bpe.py with the default config conformer_ctc_bpe.yaml. To train a Conformer-Transducer model, use speech_to_text_rnnt_bpe.py with the default config conformer_transducer_bpe.yaml.

For fine-tuning, use the pretrained STT_EN_Conformer_Transducer model as the checkpoint that is not self-supervised, or the SSL_EN_Conformer_Large model as the self-supervised checkpoint, both available from NVIDIA GPU Cloud (NGC).

You can find more details about the training process in the step-by-step tutorial on GitHub. 

The reference code for Self-supervised Checkpoint Initialization (SSL_EN_Conformer_Large) is provided below.

import nemo.collections.asr as nemo_asr
ssl_model = nemo_asr.models.ssl_models.SpeechEncDecSelfSupervisedModel.from_pretrained(
    model_name='ssl_en_conformer_large')

# define fine-tune model
asr_model = nemo_asr.models.EncDecCTCModelBPE(cfg=cfg.model, trainer=trainer)

# load ssl checkpoint
asr_model.load_state_dict(ssl_model.state_dict(), strict=False)

del ssl_model

Figure 1 shows a comparison of training dynamics. The fine-tuning approach is quick and easy for training, and also leads to faster convergence and better quality.

A graph showing the Word Error Rate comparison for models used.
Figure 1. Word Error Rate output comparison for models used

Test results

While building a model, the goal is to minimize the Word Error Rate (WER) when transcribing the speech input. In simple terms, WER is the number of word errors divided by the total number of words. It is often used to test the performance of a model, but it should not be the only standard, as external variables like noise, echo, and accents can have a substantial impact on speech recognition.

Character Error Rate (CER) is also considered. CER gives the percentage of characters that were incorrectly predicted. Our models have the lowest WER and CER among Kinyarwanda ASR models (Table 1).
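
Both metrics can be computed with a few lines of Python, for example with the jiwer package (the sentences are made up for illustration):

import jiwer

reference  = "iki gitabo ni icyiza cyane"     # ground-truth transcript
hypothesis = "iki gitabo ni cyiza cyane"      # model output

wer = jiwer.wer(reference, hypothesis)   # (substitutions + deletions + insertions) / reference words
cer = jiwer.cer(reference, hypothesis)   # same idea, counted over characters
print(f"WER = {wer:.2%}, CER = {cer:.2%}")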

Model                        WER %    CER %
Conformer-CTC-Large          18.73    5.75
Conformer-Transducer-Large   16.19    5.7
Table 1. Word Error Rate and Character Error Rate for the Kinyarwanda models

Key takeaways

We have built two high-quality Kinyarwanda checkpoints from scratch with the NeMo toolkit. The Conformer-Transducer checkpoint has better quality, but the Conformer-CTC checkpoint is 4x faster at inference, so both are potentially useful depending on the need.

The high performance of the pretrained model is another step towards new developments in the speech AI community. The state-of-the-art model can be improved further by fine-tuning it with more data that has more dialects, accents, and rare words and is a true representation of how people speak their native languages. NVIDIA NeMo pretrained models are open source and meet the goal of democratization and inclusivity across the globe.

Additional resources

Explore the MCV initiative to access or provide voice data for your language. For more information on the models, see the following resources:

Join experts from Google, Meta, NVIDIA, and more at the first annual NVIDIA Speech AI Summit. Register now.

Categories
Misc

Get in Touch With New Mobile Gaming Controls on GeForce NOW

GeForce NOW expands touch control support to 13 more games this GFN Thursday. That means it’s easier than ever to take PC gaming on the go using mobile devices and tablets. The new “Mobile Touch Controls” row in the GeForce NOW app is the easiest way for members to find which games put the action…

Categories
Misc

Open-Source Fleet Management Tools for Autonomous Mobile Robots

At ROSCon 2022, NVIDIA announced the newest Isaac ROS software release, Developer Preview (DP) 2. This release includes new cloud- and edge-to-robot task management and monitoring software for autonomous mobile robot (AMR) fleets, as well as additional features for ROS 2 developers.

NVIDIA Isaac ROS consists of individual packages (GEMs) and complete pipelines (NITROS) for hardware-accelerated performance. In addition to performance improvements, the new release adds the following functionality:

  • Mission Dispatch and Client: An open-source CPU package to assign and monitor tasks from a fleet management system to the robot. Mission Dispatch is a cloud-native microservice that can be integrated as part of larger fleet management systems.
  • FreeSpace Segmentation: A hardware-accelerated package for producing a vision AI–based occupancy grid in the proximity of the robot to be used as an input to the navigation stack.
  • H.264 Video Encode and Decode: Hardware-accelerated packages for compressed video data recording and playback. Video data collection is an important part of training AI perception models. The performance of these new GEMs on the NVIDIA Jetson AGX Orin platform was measured with two 1080p stereo cameras at 30 fps (>120 fps total), while reducing the data footprint by ~10x.

Mission Dispatch and Client

Block diagrams for the software stacks.
Figure 1. Architecture of Mission Dispatch and Mission Client software

Mission Dispatch and Client provide a standard, open-source way to assign and track tasks between a fleet management system and ROS 2 robots.  Dispatch and Client communicate using VDA5050, an open standard for communications designed specifically for robot fleets. Messages are transmitted wirelessly over MQTT, a lightweight messaging protocol for Internet of Things (IoT) applications.
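
As a rough illustration of the transport only (this is not Mission Dispatch code, and the topic string and payload below are simplified, illustrative stand-ins for the full VDA5050 order schema), a fleet-side publisher could look like this with the paho-mqtt client:

import json, time
import paho.mqtt.client as mqtt

# illustrative topic and payload; a real VDA5050 order carries many more
# required fields (header IDs, versions, node/edge actions, and so on)
topic = "uagv/v2/example_vendor/robot_01/order"
order = {
    "orderId": "order-0001",
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "nodes": [{"nodeId": "pickup_station", "sequenceId": 0, "released": True}],
    "edges": [],
}

client = mqtt.Client()                       # paho-mqtt 1.x style constructor
client.connect("localhost", 1883)            # broker address is an assumption
client.publish(topic, json.dumps(order), qos=1)
client.disconnect()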

Mission Dispatch is a containerized micro-service available for download from NGC, or as source code on the NVIDIA Isaac GitHub repo, and can be integrated into fleet management systems. Mission Dispatch has been verified to interoperate with other open-source ROS 2 clients like the recently announced VDA5050 Connector developed by OTTO Motors and InOrbit.

Mission Client, which is compatible with ROS 2 Humble, is available as a package in the NVIDIA Isaac ROS GitHub repo and preintegrated with the Nav2 navigation stack to assign and track navigation and other tasks on the robot.

“As mobile robot deployment in the real world accelerates, interoperability is becoming increasingly critical,” said Ryan Gariepy, CTO at OTTO Motors. “Bridging VDA5050 with ROS2 as an open-source community will promote innovation in fleet management solutions while allowing robot makers to focus on differentiation.”

NVIDIA Isaac ROS performance

NVIDIA Isaac ROS continues to deliver hardware-accelerated performance for the ROS 2 developer community for AI perception, image processing, and navigation. Autonomous robots require advanced AI and computer vision capabilities. Isaac ROS represents our commitment to making it easier for the robotics community to adopt these cutting-edge technologies.

For more information about the latest performance numbers for key Isaac ROS packages, see Isaac ROS Performance Summary.

Image of a person pushing a cart of crates and associated DNN output images.
Figure 2. Improved stereo depth performance of the BI3D model on flat featureless surfaces. (left) Original photo, (middle) DP1.1 release, (right) DP2 release.

Free training for ROS 2 developers

To provide advanced technical training and access to NVIDIA Isaac ROS experts, NVIDIA is announcing a new series of webinars focused on ROS 2 developers. These sessions are free and feature Q&A periods with the technical experts developing accelerated modules for ROS 2.

Line drawing of TurboTurtle robot with the NVIDIA and ROS logos.
Figure 3. TurboTurtle

The first three webinar topics:

  • November 14, 2022: Pinpoint, 250 fps, ROS 2 localization with vSLAM on Jetson, led by Dr. Raffaello Bonghi
  • December 2022: Using Isaac ROS for Stereo-Based Depth Estimation, led by Hemal Shah
  • December 2022: Building an Isaac ROS accelerated module using YOLOv5, led by Asawaree Bandhi

Register for the November 14 webinar and check back soon, as more webinars will be added to the series.

ROSCon 2022

If you are attending ROSCon in Kyoto, Japan, be sure to attend the technical session gz-omni: Bridging Gazebo with Isaac Sim (livestream) on October 20, 2022 at 2:10PM JST. Visit NVIDIA at booth #22 to see a live demonstration of NVIDIA Isaac ROS in action running on the NVIDIA Jetson AGX Orin Developer Kit.

Getting started

To get started today with NVIDIA Isaac ROS, review the examples summarized in the /NVIDIA-ISAAC-ROS GitHub repo.

Categories
Misc

How Tarteel Uses AI to Help Arabic Learners Perfect Their Pronunciation

There are some 1.8 billion Muslims, but only 16% or so of them speak Arabic, the language of the Quran. This is in part due to the fact that many Muslims struggle to find qualified instructors to give them feedback on their Quran recitation. Enter today’s guest and his company Tarteel, a member of the…

Categories
Misc

Explainer: What Is Path Tracing?

Path tracing is going real-time, unleashing interactive, photorealistic 3D environments filled with dynamic light and shadow, reflections, and refractions.

Categories
Offsites

Do Modern ImageNet Classifiers Accurately Predict Perceptual Similarity?

The task of determining the similarity between images is an open problem in computer vision and is crucial for evaluating the realism of machine-generated images. Though there are a number of straightforward methods of estimating image similarity (e.g., low-level metrics that measure pixel differences, such as FSIM and SSIM), in many cases, the measured similarity differences do not match the differences perceived by a person. However, more recent work has demonstrated that intermediate representations of neural network classifiers, such as AlexNet, VGG and SqueezeNet trained on ImageNet, exhibit perceptual similarity as an emergent property. That is, Euclidean distances between encoded representations of images by ImageNet-trained models correlate much better with a person’s judgment of differences between images than estimating perceptual similarity directly from image pixels.

Two sets of sample images from the BAPPS dataset. Trained networks agree more with human judgements as compared to low-level metrics (PSNR, SSIM, FSIM). Image source: Zhang et al. (2018).

In “Do better ImageNet classifiers assess perceptual similarity better?”, published in Transactions on Machine Learning Research, we contribute an extensive experimental study on the relationship between the accuracy of ImageNet classifiers and their emergent ability to capture perceptual similarity. To evaluate this emergent ability, we follow previous work in measuring perceptual scores (PS), which roughly measure the agreement between a model’s image similarity judgments and human preferences on the BAPPS dataset. While prior work studied the first generation of ImageNet classifiers, such as AlexNet, SqueezeNet and VGG, we significantly increase the scope of the analysis by incorporating modern classifiers, such as ResNets and Vision Transformers (ViTs), across a wide range of hyperparameters.

Relationship Between Accuracy and Perceptual Similarity
It is well established that features learned via training on ImageNet transfer well to a number of downstream tasks, making ImageNet pre-training a standard recipe. Further, better accuracy on ImageNet usually implies better performance on a diverse set of downstream tasks, such as robustness to common corruptions, out-of-distribution generalization and transfer learning on smaller classification datasets. Contrary to prevailing evidence that suggests models with high validation accuracies on ImageNet are likely to transfer better to other tasks, surprisingly, we find that representations from underfit ImageNet models with modest validation accuracies achieve the best perceptual scores.

Plot of perceptual scores (PS) on the 64 × 64 BAPPS Dataset (y-axis) against the ImageNet 64 × 64 validation accuracies (x-axis). Each blue dot represents an ImageNet classifier. Better ImageNet classifiers achieve better PS up to a certain point (dark blue), beyond which improving the accuracy lowers the PS. The best PS are attained by classifiers with moderate accuracy (20.0–40.0).

We study the variation of perceptual scores as a function of neural network hyperparameters: width, depth, number of training steps, weight decay, label smoothing and dropout. For each hyperparameter, there exists an optimal accuracy up to which improving accuracy improves PS. This optimum is fairly low and is attained quite early in the hyperparameter sweep. Beyond this point, improved classifier accuracy corresponds to worse PS.

As illustration, we present the variation of PS with respect to two hyperparameters: training steps in ResNets and width in ViTs. The PS of ResNet-50 and ResNet-200 peak very early at the first few epochs of training. After the peak, PS of better classifiers decrease more drastically. ResNets are trained with a learning rate schedule that causes a stepwise increase in accuracy as a function of training steps. Interestingly, after the peak, they also exhibit a step-wise decrease in PS that matches this step-wise accuracy increase.

Early-stopped ResNets attain the best PS across different depths of 6, 50 and 200.

ViTs consist of a stack of transformer blocks applied to the input image. The width of a ViT model is the number of output neurons of a single transformer block. Increasing its width is an effective way to improve its accuracy. Here, we vary the width of two ViT variants, B/8 and L/4 (i.e., Base and Large ViT models with patch sizes 8 and 4, respectively), and evaluate both the accuracy and PS. Similar to our observations with early-stopped ResNets, narrower ViTs with lower accuracies perform better than the default widths. Surprisingly, the optimal widths of ViT-B/8 and ViT-L/4 are 6% and 12% of their default widths, respectively. For a more comprehensive list of experiments involving other hyperparameters such as width, depth, number of training steps, weight decay, label smoothing and dropout across both ResNets and ViTs, check out our paper.

Narrow ViTs attain the best PS.

Scaling Down Models Improves Perceptual Scores
Our results prescribe a simple strategy to improve an architecture’s PS: scale down the model to reduce its accuracy until it attains the optimal perceptual score. The table below summarizes the improvements in PS obtained by scaling down each model across every hyperparameter. Except for ViT-L/4, early stopping yields the highest improvement in PS, regardless of architecture. In addition, early stopping is the most efficient strategy as there is no need for an expensive grid search.

Model   Default   Width   Depth   Weight Decay   Central Crop   Train Steps   Best
ResNet-6 69.1 +0.4 +0.3 0.0 +0.5 69.6
ResNet-50 68.2 +0.4 +0.7 +0.7 +1.5 69.7
ResNet-200 67.6 +0.2 +1.3 +1.2 +1.9 69.5
ViT B/8 67.6 +1.1 +1.0 +1.3 +0.9 +1.1 68.9
ViT L/4 67.9 +0.4 +0.4 -0.1 -1.1 +0.5 68.4
Perceptual Score improves by scaling down ImageNet models. Each value denotes the improvement obtained by scaling down a model across a given hyperparameter over the model with default hyperparameters.

Global Perceptual Functions
In prior work, the perceptual similarity function was computed using Euclidean distances across the spatial dimensions of the image. This assumes a direct correspondence between pixels, which may not hold for warped, translated or rotated images. Instead, we adopt two perceptual functions that rely on global representations of images, namely the style-loss function from the Neural Style Transfer work that captures stylistic similarity between two images, and a normalized mean pool distance function. The style-loss function compares the inter-channel cross-correlation matrix between two images while the mean pool function compares the spatially averaged global representations.
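
A compact sketch of the two global functions described above (a simplified reading; normalization details may differ from the paper):

import numpy as np

def gram(feats):                       # feats: (H, W, C) feature map
    f = feats.reshape(-1, feats.shape[-1])
    return f.T @ f / f.shape[0]        # (C, C) inter-channel cross-correlation

def style_distance(f1, f2):            # style-loss-like comparison
    return np.linalg.norm(gram(f1) - gram(f2))

def mean_pool_distance(f1, f2):        # normalized mean pool comparison
    p1, p2 = f1.mean(axis=(0, 1)), f2.mean(axis=(0, 1))
    p1, p2 = p1 / np.linalg.norm(p1), p2 / np.linalg.norm(p2)
    return np.linalg.norm(p1 - p2)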

Global perceptual functions consistently improve PS across both networks trained with default hyperparameters (top) and ResNet-200 as a function of train epochs (bottom).

We probe a number of hypotheses to explain the relationship between accuracy and PS and come away with a few additional insights. For example, the accuracy of models without commonly used skip-connections also inversely correlates with PS, and layers close to the input on average have lower PS compared to layers close to the output. For further exploration involving distortion sensitivity, ImageNet class granularity, and spatial frequency sensitivity, check out our paper.

Conclusion
In this paper, we explore the question of whether improving classification accuracy yields better perceptual metrics. We study the relationship between accuracy and PS on ResNets and ViTs across many different hyperparameters and observe that PS exhibits an inverse-U relationship with accuracy, where accuracy correlates with PS up to a certain point, and then exhibits an inverse-correlation. Finally, in our paper, we discuss in detail a number of explanations for the observed relationship between accuracy and PS, involving skip connections, global similarity functions, distortion sensitivity, layerwise perceptual scores, spatial frequency sensitivity and ImageNet class granularity. While the exact explanation for the observed tradeoff between ImageNet accuracy and perceptual similarity is a mystery, we are excited that our paper opens the door for further research in this area.

Acknowledgements
This is joint work with Neil Houlsby and Nal Kalchbrenner. We would additionally like to thank Basil Mustafa, Kevin Swersky, Simon Kornblith, Johannes Balle, Mike Mozer, Mohammad Norouzi and Jascha Sohl-Dickstein for useful discussions.

Categories
Misc

Changing Cybersecurity with Natural Language Processing

If you’ve used a chatbot, predictive text to finish a thought in an email, or pressed “0” to speak to an operator, you’ve come across natural language processing (NLP). As more enterprises adopt NLP, the sub-field is developing beyond those popular use cases of machine-human communication to machines interpreting both human and non-human language. This creates an exciting opportunity for organizations to stay ahead of evolving cybersecurity threats.

This post was originally published on CIO.com

NLP combines linguistics, computer science, and AI to support machine learning of human language. Human language is astonishingly complex. Relying on structured rules leaves machines with an incomplete understanding of it.

NLP enables machines to contextualize and learn instead of relying on rigid encoding so that they can adapt to different dialects, new expressions, or questions that the programmers never anticipated.

NLP research has driven the evolution of AI tech, like neural networks that are instrumental to machine learning across various fields and use cases. NLP has been primarily leveraged across machine-to-human communication to simplify interactions for enterprises and consumers.

NLP for cybersecurity

NLP was designed to enable machines to learn to communicate like humans, with humans. Many services that we use today rely on machine communications, either machine-to-machine or translated so that they become intelligible to humans. Cybersecurity is a perfect example of such a field, where IT analysts can feel like they speak to more machines than people.

NLP can be leveraged in cybersecurity workflows to assist in breach protection, identification, and scale and scope analysis.

Phishing

In the short term, NLP can be easily leveraged to enhance and simplify breach protection from phishing attempts.

In the context of phishing, NLP can be leveraged to understand bot or spam behavior in email text sent by a machine posing as a human. It can also be used to understand the internal structure of the email itself to identify patterns of spammers and the types of messages they send.

This example is the first extension of NLP, originally designed to understand just human language and now being applied to understand the combination of human language mixed with machine-level headers.

Log parsing

In the medium term, NLP can be leveraged to parse logs, a cyBERT use case.

In the current rules-based system, the mechanisms and systems required to parse raw logs and make them ready for analysts are brittle and need significant development and maintenance resources.

Using NLP, parsing of raw logs becomes more flexible and less prone to breaking when changes occur to the log generators and sensors.

Going further, the neural networks used for parsing can generalize beyond the logs they were exposed to during training, creating methods to transform raw data into rich content ready for an analyst without the need to write explicit rules for these new or changed log types. 

As a result, NLP models are more accurate at parsing logs than traditional rules while being more flexible and fault-tolerant.

Synthetic languages

In the longer term, entirely synthetic languages can be created that represent machine-to-machine and human-to-machine communications.

If two machines can create an entirely new language, that language can then be analyzed using NLP techniques to identify errors in grammar, syntax, and composition. All these can be interpreted as anomalies and contextualized for analysts.

This new development can help identify known issues or attacks when they occur, and can also identify completely unknown misconfigurations and attacks, which helps analysts be more efficient and effective.

Summary

The phishing protection, log parsing, and synthetic language applications are just the beginning for NLP. To learn more about AI and cybersecurity, see Learn About the Latest Developments with AI-Powered Cybersecurity, one of many on-demand sessions from  NVIDIA GTC.

Categories
Misc

Achieving 100x Faster Single-Cell Modality Prediction with NVIDIA RAPIDS cuML

Single-cell measurement technologies have advanced rapidly, revolutionizing the life sciences. We have scaled from measuring dozens to millions of cells and from one modality to multiple high dimensional modalities. The vast amounts of information at the level of individual cells present a great opportunity to train machine learning models to help us better understand the intrinsic link of cell modalities, which could be transformative for synthetic biology and drug target discovery.

This post introduces modality prediction and explains how we accelerated the winning solution of the NeurIPS Single-Cell Multi-Modality Prediction Challenge by drop-in replacing the CPU-based TSVD and kernel ridge regression (KRR), implemented in scikit-learn, with NVIDIA GPU-based RAPIDS cuML implementations.

Using cuML and changing only six lines of code, we accelerated the scikit-learn–based winning solution, reducing the training time from 69 minutes to 40 seconds: a 103.5x speedup! Even when compared to sophisticated deep learning models developed in PyTorch, we observed that the cuML solution is both faster and more accurate for this prediction challenge.

Challenges of single-cell modality prediction

Diagram shows the process of DNA transcription to RNA and RNA translation to protein. The latter is what we focus on in this post.
Figure 1. Overview of the single-cell modality prediction problem

Thanks to single-cell technology, we can measure multiple modalities within the same single cell such as DNA accessibility (ATAC), mRNA gene expression (GEX), and protein abundance (ADT). Figure 1 shows that these modalities are intrinsically linked. Only accessible DNA can produce mRNA, which in turn is used as a template to produce protein.

The problem of modality prediction arises naturally where it is desirable to predict one modality from another. In the 2021 NeurIPS challenge, we were asked to predict the flow of information from ATAC to GEX and from GEX to ADT.

If a machine learning model can make good predictions, it must have learned intricate states of the cell and it could provide a deeper insight into cellular biology. Extending our understanding of these regulatory processes is also transformative for drug target discovery.

The modality prediction is a multi-output regression problem, and it presents unique challenges:

  • High cardinality. For example, GEX and ADT information are described in vectors of length 13953 and 134, respectively.
  • Strong bias. The data is collected from 10 diverse donors and four sites. Training and test data come from different sites. Both donor and site strongly influence the distribution of the data.
  • Sparsity, redundancy, and non-linearity. The modality data is sparse, and the columns are highly correlated.

In this post, we focus on the task of GEX to ADT predictions to demonstrate the efficiency of a single-GPU solution. Our methods can be extended to other single-cell modality prediction tasks with larger data size and higher cardinality using multi-node multi-GPU architectures.

Using TSVD and KRR algorithms for multi-target regression

As our baseline, we used the first-place solution of the NeurIPS Modality Prediction Challenge “GEX to ADT” from Kaiwen Deng of the University of Michigan. The workflow of the core model is shown in Figure 2. The training data includes both GEX and ADT information, while the test data only has GEX information.

The task is to predict the ADT of the test data given its GEX. To address the sparsity and redundancy of the data, we applied truncated singular value decomposition (TSVD) to reduce the dimension of both GEX and ADT.

In particular, two TSVD models fit GEX and ADT separately:

  • For GEX, TSVD fits the concatenated data of both training and testing.
  • For ADT, TSVD only fits the training data.

In Deng’s solution, dimensionality is reduced aggressively from 13953 to 300 for GEX and from 134 to 70 for ADT.

The numbers of components, 300 and 70, are hyperparameters of the model, obtained through cross-validation and tuning. The reduced versions of the GEX and ADT training data are then fed into KRR with the RBF kernel. Matching Deng’s approach, at inference time, we used the trained KRR model to perform the following tasks:

  • Predict the reduced version of ADT of the test data.
  • Apply the inverse transform of TSVD.
  • Recover the ADT prediction of the test data.
Blocks showing the input and outputs of each stage of the workflow.
Figure 2. Model overview. The blocks represent input and output data and the numbers beside the blocks represent the dimensions.

Generally, TSVD is the most popular choice to perform dimension reduction for sparse data, typically used during feature engineering. In this case, TSVD is used to reduce the dimension of both the features (GEX) and the targets (ADT). Dimension reduction of the targets makes it much easier for the downstream multi-output regression model because the TSVD outputs are more independent across the columns.

KRR is chosen as the multi-output regression model. Compared to SVM, KRR computes all the columns of the output concurrently, while SVM predicts one column at a time, so KRR can learn nonlinearity like SVM but is much faster.

Implementing a GPU-accelerated solution with cuML

cuML is one of the RAPIDS libraries. It contains a suite of GPU-accelerated machine learning algorithms that provide many highly optimized models, including both TSVD and KRR. You can quickly adapt the baseline model from a scikit-learn implementation to a cuML implementation.

In the following code example, we only needed to change six lines of code, three of which are imports. For simplicity, much of the preprocessing and utility code is omitted.

Baseline sklearn implementation:

from sklearn.decomposition import TruncatedSVD
from sklearn.gaussian_process.kernels import RBF
from sklearn.kernel_ridge import KernelRidge

tsvd_gex = TruncatedSVD(n_components=300)
tsvd_adt = TruncatedSVD(n_components=70)

gex_train_test = tsvd_gex.fit_transform(gex_train_test)
gex_train, gex_test = split(gex_train_test)
adt_train = tsvd_adt.fit_transform(adt_train)
adt_comp = tsvd_adt.components_

y_pred = 0
for seed in seeds:
    gex_tr,_,adt_tr,_=train_test_split(gex_train, 
                                       adt_train,
                                       train_size=0.5, 
                                       random_state=seed)
    kernel = RBF(length_scale = scale)
    krr = KernelRidge(alpha=alpha, kernel=kernel)
    krr.fit(gex_tr, adt_tr)
    y_pred += (krr.predict(gex_test) @ adt_comp)
y_pred /= len(seeds)

RAPIDS cuML implementation:

from cuml.decomposition import TruncatedSVD
from cuml.kernel_ridge import KernelRidge
import gc

tsvd_gex = TruncatedSVD(n_components=300)
tsvd_adt = TruncatedSVD(n_components=70)

gex_train_test = tsvd_gex.fit_transform(gex_train_test)
gex_train, gex_test = split(gex_train_test)
adt_train = tsvd_adt.fit_transform(adt_train)
adt_comp = tsvd_adt.components_.to_output('cupy')

y_pred = 0
for seed in seeds:
    gex_tr,_,adt_tr,_=train_test_split(gex_train, 
                                       adt_train,
                                       train_size=0.5, 
                                       random_state=seed)
    krr = KernelRidge(alpha=alpha,kernel='rbf')
    krr.fit(gex_tr, adt_tr)
    gc.collect()
    y_pred += (krr.predict(gex_test) @ adt_comp)
y_pred /= len(seeds)

The syntax of cuML kernels is slightly different from scikit-learn’s. Instead of creating a standalone kernel object, we specify the kernel type in the KernelRidge constructor. This is because Gaussian process kernels are not yet supported by cuML.

Another difference is that explicit garbage collection is needed for the current version of the cuML implementation. Some reference cycles are created in this particular loop and objects are not freed automatically without garbage collection. For more information, see the complete notebooks in the /daxiongshu/rapids_nips_blog GitHub repo.

Results

We compared the cuML implementation of TSVD+KRR against the CPU baseline and other top solutions in the challenge. The GPU solutions run on a single V100 GPU and the CPU solutions run on dual 20-core Intel Xeon CPUs. The metric for the competition is root mean square error (RMSE).
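
As a minimal sketch (our own helper for illustration, not the official evaluation script), the RMSE between the predicted and ground-truth ADT matrices is:

import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error over all cells and all ADT targets."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))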

We found that the cuML implementation of TSVD+KRR is 103x faster than the CPU baseline with a slight degradation of the score due to the randomness in the pipeline. However, the score is still better than any other models in the competition.

We also compared our solution with two deep learning models: a multilayer perceptron (MLP) and a graph neural network (GNN).

Both deep learning models are implemented in PyTorch and run on a single V100 GPU. Both have many layers with millions of parameters to train and hence are prone to overfitting on this dataset. In comparison, TSVD+KRR only has to train fewer than 30K parameters. Figure 4 shows that the cuML TSVD+KRR model is both faster and more accurate than the deep learning models, thanks to its simplicity.

Chart compares RMSE and training time between the proposed TSVD+KRR cuML GPU and three baseline solutions: TSVD+KRR CPU, MLP PyTorch GPU, and GNN PyTorch GPU. The proposed TSVD+KRR cuML GPU is at least 100x faster than the baselines and only slightly worse RMSE than the best baseline.
Figure 4. Performance and training time comparison. The horizontal axis is with a logarithmic scale.

Figure 5 shows a detailed speedup analysis, where we present timings for the two stages of the algorithm: TSVD and KRR. cuML TSVD and KRR are 15x and 103x faster than the CPU baseline, respectively.

Bar chart shows running time breakdown for cuML GPU over sklearn CPU. The TSVD running time is reduced from 120 seconds with sklearn to 8 seconds with cuML. The KRR running time is reduced from 4,140 seconds with sklearn to 40 seconds with cuML.
Figure 5. Run time comparison

Conclusion

Due to its lightning speed and user-friendly API, RAPIDS cuML is incredibly useful for accelerating the analysis of single-cell data. With a few minor code changes, you can boost your existing scikit-learn workflows.

In addition, when dealing with single-cell modality prediction, we recommend starting with cuML TSVD to reduce the dimension of data and KRR for the downstream tasks to achieve the best speedup.

Try out this RAPIDS cuML implementation with the code on the /daxiongshu/rapids_nips_blog GitHub repo.