Categories
Offsites

Learning to Route by Task for Efficient Inference

Scaling large language models has resulted in significant quality improvements in natural language understanding (T5), generation (GPT-3) and multilingual neural machine translation (M4). One common approach to building a larger model is to increase the depth (number of layers) and width (layer dimensionality), simply enlarging existing dimensions of the network. Such dense models take an input sequence (divided into smaller components, called tokens) and pass every token through the full network, activating every layer and parameter. While these large, dense models have achieved state-of-the-art results on multiple natural language processing (NLP) tasks, their training cost increases linearly with model size.

An alternative, and increasingly popular, approach is to build sparsely activated models based on a mixture of experts (MoE) (e.g., GShard-M4 or GLaM), where each token passed to the network follows a separate subnetwork by skipping some of the model parameters. The choice of how to distribute the input tokens to each subnetwork (the “experts”) is determined by small router networks that are trained together with the rest of the network. This allows researchers to increase model size (and hence, performance) without a proportional increase in training cost.
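To make the routing idea concrete, here is a minimal NumPy sketch of a token-level MoE layer with top-2 gating. It is illustrative only: the expert and router weights are random stand-ins, and real systems such as GShard add load-balancing losses, capacity limits, and cross-device parallelism that are omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

# Each "expert" is a small feedforward network; the router maps a token to one
# score per expert.
experts = [(rng.normal(size=(d_model, 4 * d_model)) * 0.02,
            rng.normal(size=(4 * d_model, d_model)) * 0.02) for _ in range(n_experts)]
W_router = rng.normal(size=(d_model, n_experts)) * 0.02

def moe_layer(tokens):                        # tokens: (n_tokens, d_model)
    gates = softmax(tokens @ W_router)        # routing probabilities per token
    out = np.zeros_like(tokens)
    for i, (tok, g) in enumerate(zip(tokens, gates)):
        for e in np.argsort(g)[-top_k:]:      # each token visits only its top-k experts
            w1, w2 = experts[e]
            out[i] += g[e] * (np.maximum(tok @ w1, 0.0) @ w2)
    return out

print(moe_layer(rng.normal(size=(5, d_model))).shape)   # (5, 64)
```

Because each token activates only its top-k experts, the parameter count can grow with the number of experts while the compute per token stays roughly constant.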

While this is an effective strategy at training time, sending tokens of a long sequence to multiple experts again makes inference computationally expensive, because the experts have to be distributed among a large number of accelerators. For example, serving the 1.2T-parameter GLaM model requires 256 TPU-v3 chips. Much like dense models, the number of processors needed to serve an MoE model still scales linearly with respect to the model size, increasing compute requirements while also resulting in significant communication overhead and added engineering complexity.

In “Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference”, we introduce a method called Task-level Mixture-of-Experts (TaskMoE) that takes advantage of the quality gains of model scaling while still being efficient to serve. Our solution is to train a large multi-task model from which we then extract smaller, stand-alone per-task subnetworks suitable for inference with no loss in model quality and with significantly reduced inference latency. We demonstrate the effectiveness of this method for multilingual neural machine translation (NMT) compared to other mixture-of-experts models and to models compressed using knowledge distillation.

Training Large Sparsely Activated Models with Task Information
We train a sparsely activated model, where router networks learn to send tokens of each task-specific input to different subnetworks of the model associated with the task of interest. For example, in the case of multilingual NMT, every token of a given language is routed to the same subnetwork. This differs from other recent approaches, such as the sparsely gated mixture of expert models (e.g., TokenMoE), where router networks learn to send different tokens in an input to different subnetworks independent of task.

Inference: Bypassing Distillation by Extracting Subnetworks
A consequence of this difference in training between TaskMoE and models like TokenMoE is in how we approach inference. Because TokenMoE follows the practice of distributing tokens of the same task to many experts at both training and inference time, it is still computationally expensive at inference.

For TaskMoE, we dedicate a smaller subnetwork to a single task identity during training and inference. At inference time, we extract subnetworks by discarding unused experts for each task. TaskMoE and its variants enable us to train a single large multi-task network and then use a separate subnetwork at inference time for each task without using any additional compression methods post-training. We illustrate the process of training a TaskMoE network and then extracting per-task subnetworks for inference below.

During training, tokens of the same language are routed to the same expert based on language information (either source, target or both) in task-based MoE. Later, during inference we extract subnetworks for each task and discard unused experts.
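The following sketch (plain NumPy, with illustrative names and shapes) shows the two ideas at a glance: a router conditioned on the task identity rather than on individual tokens, and the extraction step that keeps only the experts a given task uses. It is a conceptual sketch, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k, n_tasks = 64, 32, 2, 30     # e.g., one task per language pair

experts = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(n_experts)]
task_embeddings = rng.normal(size=(n_tasks, d_model)) * 0.02
W_router = rng.normal(size=(d_model, n_experts)) * 0.02

def experts_for_task(task_id):
    """Task-level routing: the score depends only on the task, not on each token."""
    scores = task_embeddings[task_id] @ W_router
    return np.argsort(scores)[-top_k:]

def extract_subnetwork(task_id):
    """At inference, keep only the experts this task routes to; discard the rest."""
    return [experts[e] for e in experts_for_task(task_id)]

subnetwork = extract_subnetwork(task_id=3)
print(f"{len(subnetwork)} of {n_experts} experts kept for this task")
```

Because the routing decision depends only on the task, the surviving experts form a small, dense model that can be served on its own.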

To demonstrate this approach, we train models based on the Transformer architecture. Similar to GShard-M4 and GLaM, we replace the feedforward network of every other transformer layer with a Mixture-of-Experts (MoE) layer that consists of multiple identical feedforward networks, the “experts”. For each task, the routing network, trained along with the rest of the model, keeps track of the task identity for all input tokens and chooses a certain number of experts per layer (two in this case) to form the task-specific subnetwork. The baseline dense Transformer model has 143M parameters and 6 layers on both the encoder and decoder. The TaskMoE and TokenMoE that we train are also both 6 layers deep but with 32 experts for every MoE layer and have a total of 533M parameters. We train our models using publicly available WMT datasets, with over 431M sentences across 30 language pairs from different language families and scripts. We point the reader to the full paper for further details.

Results
In order to demonstrate the advantage of using TaskMoE at inference time, we compare the throughput, or the number of tokens decoded per second, for TaskMoE, TokenMoE, and a baseline dense model. Once the subnetwork for each task is extracted, TaskMoE is 7x smaller than the 533M-parameter TokenMoE model, and it can be served on a single TPUv3 core, instead of the 64 cores required for TokenMoE. We see that TaskMoE has a peak throughput twice as high as that of TokenMoE models. In addition, on inspecting the TokenMoE model, we find that 25% of the inference time is spent in inter-device communication, while virtually no time is spent in communication by TaskMoE.

Comparing the throughput of TaskMoE with TokenMoE across different batch sizes. The maximum batch size for TokenMoE is 1024 as opposed to 4096 for TaskMoE and the dense baseline model. Here, TokenMoE has one instance distributed across 64 TPUv3 cores, while TaskMoE and the baseline model have one instance on each of the 64 cores.

A popular approach to building a smaller network that still performs well is through knowledge distillation, in which a large teacher model trains a smaller student model with the goal of matching the teacher’s performance. However, this method comes at the cost of additional computation needed to train the student from the teacher. So, we also compare TaskMoE to a baseline TokenMoE model that we compress using knowledge distillation. The compressed TokenMoE model has a size comparable to the per-task subnetwork extracted from TaskMoE.
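For reference, the sketch below shows a standard token-level distillation loss of the kind used to compress such models: the student matches the teacher's softened output distribution in addition to the usual cross-entropy on the reference. The temperature and mixing weight are generic placeholders, not the exact setup from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Mix a soft-label term (match the teacher) with the usual cross-entropy."""
    # student_logits, teacher_logits: (n_tokens, vocab); targets: (n_tokens,) int ids
    p_teacher = softmax(teacher_logits / T)
    log_p_student = np.log(softmax(student_logits / T) + 1e-9)
    kd = -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T
    ce = -np.log(softmax(student_logits)[np.arange(len(targets)), targets] + 1e-9).mean()
    return alpha * kd + (1 - alpha) * ce

rng = np.random.default_rng(0)
loss = distillation_loss(rng.normal(size=(8, 100)),     # student predictions
                         rng.normal(size=(8, 100)),     # teacher predictions
                         rng.integers(0, 100, size=8))  # reference token ids
print(float(loss))
```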

We find that in addition to being a simpler method that does not need any additional training, TaskMoE improves upon a distilled TokenMoE model by 2.1 BLEU on average across all languages in our multilingual translation model. We note that distillation retains 43% of the performance gains achieved from scaling a dense multilingual model to a TokenMoE, whereas extracting the smaller subnetwork from the TaskMoE model results in no loss of quality.

BLEU scores (higher is better) comparing a distilled TokenMoE model to the TaskMoE and TokenMoE models with 12 layers (6 on the encoder and 6 on the decoder) and 32 experts. While both approaches improve upon a multilingual dense baseline, TaskMoE improves upon the baseline by 3.1 BLEU on average while distilling from TokenMoE improves upon the baseline by 1.0 BLEU on average.

Next Steps
The quality improvements often seen when scaling machine learning models have incentivized the research community to work toward advancing scaling technology to enable efficient training of large models. The emerging need to train models capable of generalizing to multiple tasks and modalities only increases the need for scaling models even further. However, the practicality of serving these large models remains a major challenge. Efficiently deploying large models is an important direction of research, and we believe TaskMoE is a promising step toward more inference-friendly algorithms that retain the quality gains of scaling.

Acknowledgements
We would like to first thank our coauthors – Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin and Minh-Thang Luong. We would also like to thank Wolfgang Macherey, Yuanzhong Xu, Zhifeng Chen and Macduff Richard Hughes for their helpful feedback. Special thanks to the Translate and Brain teams for their useful input and discussions, and the entire GShard development team for their foundational contributions to this project. We would also like to thank Tom Small for creating the animations for the blog post.

Categories
Misc

Expanding the NVIDIA DOCA Community for Developers in China

The developer community for NVIDIA DOCA continues to gain momentum around the world.

On January 13, NVIDIA hosted an online workshop to engage with the NVIDIA DOCA developer community in China. The core team at NVIDIA and leading partner representatives joined the workshop to discuss the application scenarios of NVIDIA BlueField DPUs and the NVIDIA DOCA software framework for cloud, data center, and edge. The workshop focused on the requirements for DOCA developers in key industries, such as consumer Internet, cybersecurity, and higher education, and designed a plan to expand the DOCA developer community in China.

Since the June 2021 launch of the DOCA community in China, nearly 1,000 developers have registered for the DOCA Early Access Program, accounting for almost half of global registrations. BlueField DPUs and DOCA show great potential for adoption in China, with developer numbers continuing to grow.

In the second half of 2021, NVIDIA held three online bootcamps to introduce the application and future technology evolution of the BlueField DPU and DOCA in modern data centers. These bootcamps also explored the software stack, development environment, developer resources, developer guides, and reference applications of the DOCA Software Development Kit (SDK). More than 3,500 developers participated in the bootcamps. At the latest bootcamp, the newest services and applications released with DOCA 1.2 attracted much attention. Technical information and success stories posted on social media and in knowledge communities have many developers and industry professionals excited about future advancements.

NVIDIA is working with leading global platform providers and partners, such as Juniper Networks, Excelero, VMware, and Palo Alto Networks, to integrate and extend solutions based on the BlueField DPU and the DOCA software framework. Through the workshop, NVIDIA will help developers in China build applications in scenarios such as Zero Trust Security, Morpheus AI Security, edge network service platforms, and high-speed distributed storage. A rich developer program in China will be launched in 2022.

DOCA Developer Bootcamp and Virtual Hackathon

Following the hackathons in Europe and North America, NVIDIA intends to host the first spring DOCA Developer Hackathon in China in the second quarter of 2022. Before the hackathon, NVIDIA will host an online bootcamp to teach contestants about BlueField DPU and DOCA programming skills.

NVIDIA will invite teams of developers from partners, customers, and academia to learn, collaborate, and accelerate their software designs under the guidance of NVIDIA expert mentors. The aim is to foster innovative, breakthrough software projects based on BlueField DPU and DOCA 1.2 in high-performance networking, virtualization, cybersecurity, distributed storage, accelerated AI, edge computing, and video streaming processing. The program will empower the developer community in China to create revolutionary data center infrastructure applications and services. After evaluation, NVIDIA will reward outstanding innovation teams.

DPU and DOCA Excellence Center

Leadtek (Shanghai) Information Technology Co., Ltd. and Shanghai Zentek Intelligent Technology Co., Ltd. are highly familiar with NVIDIA products and solutions, including deep learning applications in cloud, data center, and edge scenarios. They have partnered with the NVIDIA Deep Learning Institute to train partners and customers. 

As the first group of members of the NVIDIA authorized DPU and DOCA Excellence Center, the two partners have set up their own Excellence Centers and begun their pilots. During the pilots, each partner will independently build and operate a virtual development platform based on the BlueField-2 DPU, establish a third-party DPU development environment, provide an online practice development environment for DOCA developers in China, and contribute to the DPU and DOCA ecosystem with NVIDIA.

The implementation of the DOCA developer program in China will help grow the global DOCA developer community, facilitate talent development, and enhance the capabilities of developers. It will also boost the performance advantages of solutions based on BlueField DPU and the DOCA SDK, and accelerate time-to-market, creating greater value for customers and partners.

Apply now to join the DOCA developer community and get early access to the DOCA software framework.

Categories
Misc

From Imagination to Animation, How an Omniverse Creator Makes Films Virtually

Growing up in the Philippines, award-winning filmmaker Jae Solina says he turned to movies for a reminder that the world was much larger than himself and his homeland.


Categories
Misc

Single or multiple output for model

I want to predict the genre(s) of the given text. The dataset I am planning on using is this kaggle dataset. While I know how to predict a single genre, I am not sure how to work with a possibility of more than 1 genre if needed.
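One common way to handle multiple genres per text is to treat this as multi-label classification: one sigmoid output per genre trained with binary cross-entropy, instead of a single softmax. The sketch below (Keras, with toy data standing in for the Kaggle CSV) illustrates the idea; all names and sizes are placeholders, not a prescribed pipeline.

```python
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["a space crew fights off an alien invasion",
         "two detectives chase a serial killer through the city"]
genres = [["sci-fi", "action"], ["crime", "thriller"]]          # several genres per text

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(genres).astype("float32")                 # multi-hot targets

vectorizer = tf.keras.layers.TextVectorization(max_tokens=20_000, output_mode="tf_idf")
vectorizer.adapt(texts)
x = vectorizer(np.array(texts))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    # One sigmoid per genre (not a softmax): each genre is an independent
    # yes/no decision, so several can be active for the same text.
    tf.keras.layers.Dense(len(mlb.classes_), activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["binary_accuracy"])
model.fit(x, y, epochs=5, verbose=0)

probs = model.predict(x)
predicted = [list(mlb.classes_[p > 0.5]) for p in probs]        # threshold to get genre sets
print(predicted)
```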

submitted by /u/Electric_Dragon1703

Categories
Misc

Global AI Weather Forecaster Makes Predictions in Seconds

A hurricane approaches the southwestern US, seen from a global view. Using convolutional neural networks, researchers created an algorithm that can quickly calculate global forecasts 4 to 6 weeks into the future.

New weather-forecasting research using AI is fast-tracking global weather predictions. The study, recently published in the Journal of Advances in Modeling Earth Systems, could help identify potential extreme weather 2–6 weeks into the future. Accurate predictions of extreme weather with a longer lead time give communities and critical sectors such as public health, water management, energy, and agriculture more time to prepare for and mitigate potential disasters.

Climate change is amplifying the intensity and frequency of extreme weather events, with 2021 shattering storm, heatwave, flood, and drought records across the globe. According to a recent NOAA report, last year the US experienced 20 separate climate-induced weather disasters, each totaling over $1 billion in damage. 

Short-term and seasonal weather forecasting can play a large role in decreasing the socioeconomic and human costs of extreme weather. In 2019, meteorologists warned local and national leaders in the Philippines of a torrential rainstorm looming about 3 weeks out. The forecast gave communities time to weatherize structures and evacuate before the Category 4 typhoon hit, saving lives and reducing overall damage to the region.

Current weather forecasting relies on supercomputers processing large amounts of global data such as temperature, pressure, humidity, and wind speed. These systems require massive computational resources and take time to process. 

Also, according to the authors, the accuracy of forecasts made further out, from several weeks to months, decreases significantly.

Looking to improve on current weather forecasting, the researchers aimed to create a computationally efficient model capable of accurately predicting upcoming weather, called Deep Learning Weather Prediction (DLWP). Originally introduced in a paper published in 2020, the DLWP relies on an AI algorithm that learns and recognizes patterns in historical weather data on global grids.

The current work refines the DLWP by training a deep convolutional neural network on two additional input variables: temperature at the atmospheric boundary layer and total column water vapor. The researchers also improved the grid resolution at the equator to approximately 1.4°.

Running the cuDNN-accelerated TensorFlow deep learning framework on a single NVIDIA V100 GPU, the model produces 320 ensemble 6-week forecasts in just 3 minutes. The algorithm can generate a 1-week forecast in 1/10th of a second.
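Conceptually, forecasts this long are produced by stepping a learned model forward repeatedly from the current global state. The toy sketch below illustrates only that rollout pattern; the grid, the step length, and the smoothing function standing in for the trained CNN are assumptions, not details from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
state = rng.normal(size=(181, 360, 4))     # toy latitude x longitude grid with 4 fields

def step_model(x):
    # Stand-in for the trained CNN: here just a simple smoothing in longitude.
    return (x + np.roll(x, 1, axis=1) + np.roll(x, -1, axis=1)) / 3.0

def forecast(state, n_steps):
    for _ in range(n_steps):               # e.g., 6-hour steps rolled out for weeks
        state = step_model(state)
    return state

six_week_forecast = forecast(state, n_steps=4 * 42)   # 4 steps/day x 42 days
print(six_week_forecast.shape)
```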

The DLWP is able to produce realistic forecasts of weather events such as Hurricane Irma, a Category 4 storm that hit Florida and the Caribbean in 2017. While the speedy DLWP model matches the performance of current state-of-the-art weather forecasters 4 to 6 weeks into the future, it has limitations predicting precipitation and is less accurate at shorter lead times of 2–3 weeks.

According to the study, the DLWP may also prove a valuable tool for supplementing spring and summer forecasts in the tropics, a region that challenges current weather models.

The open-source code is available on GitHub.


Read the study in the Journal of Advances in Modeling Earth Systems.


Categories
Offsites

Scaling Vision with Sparse Mixture of Experts

Advances in deep learning over the last few decades have been driven by a few key elements. With a small number of simple but flexible mechanisms (i.e., inductive biases such as convolutions or sequence attention), increasingly large datasets, and more specialized hardware, neural networks can now achieve impressive results on a wide range of tasks, such as image classification, machine translation, and protein folding prediction.

However, the use of large models and datasets comes at the cost of significant computational requirements. Yet, recent works suggest that large model sizes might be necessary for strong generalization and robustness, so training large models while limiting resource requirements is becoming increasingly important. One promising approach involves the use of conditional computation: rather than activating the whole network for every single input, different parts of the model are activated for different inputs. This paradigm has been featured in the Pathways vision and recent works on large language models, but it has not been well explored in the context of computer vision.

In “Scaling Vision with Sparse Mixture of Experts”, we present V-MoE, a new vision architecture based on a sparse mixture of experts, which we then use to train the largest vision model to date. We transfer V-MoE to ImageNet and demonstrate accuracy matching the state of the art while using about 50% fewer resources than models of comparable performance. We have also open-sourced the code to train sparse models and provided several pre-trained models.

Vision Mixture of Experts (V-MoEs)
Vision Transformers (ViT) have emerged as one of the best architectures for vision tasks. ViT first partitions an image into equally-sized square patches. These are called tokens, a term inherited from language models. Still, compared to the largest language models, ViT models are several orders of magnitude smaller in terms of number of parameters and compute.
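The patch tokenization step is simple enough to show directly. The sketch below uses illustrative sizes (a 224x224 RGB image split into 16x16 patches gives 196 tokens, each flattened and linearly projected to the model width); the projection weights are random stand-ins.

```python
import numpy as np

image = np.random.default_rng(0).random((224, 224, 3))   # toy RGB image
P = 16                                                    # patch size

h, w, c = image.shape
patches = image.reshape(h // P, P, w // P, P, c).transpose(0, 2, 1, 3, 4)
tokens = patches.reshape(-1, P * P * c)                   # (196, 768) flattened patch tokens

W_embed = np.random.default_rng(1).normal(size=(P * P * c, 768)) * 0.02
embedded = tokens @ W_embed                               # (196, 768) token embeddings
print(tokens.shape, embedded.shape)
```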

To massively scale vision models, we replace some dense feedforward layers (FFN) in the ViT architecture with a sparse mixture of independent FFNs (which we call experts). A learnable router layer selects which experts are chosen (and how they are weighted) for every individual token. That is, different tokens from the same image may be routed to different experts. Each token is only routed to at most K (typically 1 or 2) experts, among a total of E experts (in our experiments, E is typically 32). This allows scaling the model’s size while keeping its computation per token roughly constant. The figure below shows the structure of the encoder blocks in more detail.

V-MoE Transformer Encoder block.

Experimental Results
We first pre-train the model once on JFT-300M, a large dataset of images. The left plot below shows our pre-training results for models of all sizes: from the small S/32 to the huge H/14.

We then transfer the model to new downstream tasks (such as ImageNet), by using a new head (the last layer in a model). We explore two transfer setups: either fine-tuning the entire model on all available examples of the new task, or freezing the pre-trained network and tuning only the new head using a few examples (known as few-shot transfer). The right plot in the figure below summarizes our transfer results to ImageNet, training on only 5 images per class (called 5-shot transfer).

JFT-300M Precision@1 and ImageNet 5-shot accuracy. Colors represent different ViT variants and markers represent either standard ViT (●), or V-MoEs (▸) with expert layers on the last n even blocks. We set n=2 for all models, except V-MoE-H where n=5. Higher indicates better performance, with more efficient models being to the left.
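As a rough illustration of the few-shot transfer setup described above, the sketch below freezes a (pretend) backbone and fits only a new linear head on 5 examples per class. The random features and the logistic-regression head are stand-ins, not the paper's exact linear-probe recipe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_classes, shots, feat_dim = 10, 5, 768          # 5-shot transfer to a 10-class task

# Pretend these came from the frozen backbone applied to 5 images per class.
features = rng.normal(size=(n_classes * shots, feat_dim))
labels = np.repeat(np.arange(n_classes), shots)

head = LogisticRegression(max_iter=1000).fit(features, labels)   # only the head is trained
print(head.predict(features[:3]))
```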

In both cases, the sparse model strongly outperforms its dense counterpart at a given amount of training compute (shown by the V-MoE line being above the ViT line), or achieves similar performance much faster (shown by the V-MoE line being to the left of the ViT line).

To explore the limits of vision models, we trained a 15-billion parameter model with 24 MoE layers (out of 48 blocks) on an extended version of JFT-300M. This massive model — the largest to date in vision as far as we know — achieved 90.35% test accuracy on ImageNet after fine-tuning, near the current state-of-the-art.

Priority Routing
In practice, due to hardware constraints, it is not efficient to use buffers with a dynamic size, so models typically use a pre-defined buffer capacity for each expert. Assigned tokens beyond this capacity are dropped and not processed once the expert becomes “full”. As a consequence, higher capacities yield higher accuracy, but they are also more computationally expensive.

We leverage this implementation constraint to make V-MoEs faster at inference time. By decreasing the total combined buffer capacity below the number of tokens to be processed, the network is forced to skip processing some tokens in the expert layers. Instead of choosing the tokens to skip in some arbitrary fashion (as previous works did), the model learns to sort tokens according to an importance score. This maintains high quality predictions while saving a lot of compute. We refer to this approach as Batch Priority Routing (BPR), illustrated below.

Under high capacity, both vanilla and priority routing work well as all patches are processed. However, when the buffer size is reduced to save compute, vanilla routing selects arbitrary patches to process, often leading to poor predictions. BPR smartly prioritizes important patches resulting in better predictions at lower computational costs.
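A simplified sketch of the priority idea follows: score every token in the batch with the router, sort tokens by their top routing weight, and let the most important tokens claim expert buffer slots first. This is an illustrative NumPy sketch, not the paper's exact algorithm.

```python
import numpy as np

def batch_priority_route(router_scores, capacity):
    # router_scores: (n_tokens, n_experts) gating weights for one batch
    n_tokens, n_experts = router_scores.shape
    top_expert = router_scores.argmax(axis=1)
    priority = router_scores.max(axis=1)
    assignments = {}                                # token index -> expert index
    used = np.zeros(n_experts, dtype=int)
    for t in np.argsort(-priority):                 # most important tokens claim slots first
        e = top_expert[t]
        if used[e] < capacity:
            assignments[t] = e
            used[e] += 1
    return assignments                              # tokens missing here are dropped

scores = np.random.default_rng(0).random((16, 4))
print(len(batch_priority_route(scores, capacity=2)), "of 16 tokens processed")
```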

Dropping the right tokens turns out to be essential for delivering high-quality, efficient predictions at inference. When the expert capacity decreases, performance quickly degrades with the vanilla routing mechanism. Conversely, BPR is much more robust to low capacities.

Performance versus inference capacity buffer size (or ratio) C for a V-MoE-H/14 model with K=2. Even for large C’s, BPR improves performance; at low C the difference is quite significant. BPR is competitive with dense models (ViT-H/14) by processing only 15-30% of the tokens.

Overall, we observed that V-MoEs are highly flexible at inference time: for instance, one can decrease the number of selected experts per token to save time and compute, without any further training on the model weights.

Exploring V-MoEs
Because much is yet to be discovered about the internal workings of sparse networks, we also explored the routing patterns of the V-MoE.

One hypothesis is that routers would learn to discriminate and assign tokens to experts based on some semantic grounds (the “car” expert, the “animal” experts, and so on). To test this, below we show plots for two different MoE layers (a very early-on one, and another closer to the head). The x-axis corresponds to each of the 32 experts, and the y-axis shows the ID of the image classes (from 1 to 1000). Each entry in the plot shows how often an expert was selected for tokens corresponding to the specific image class, with darker colors indicating higher frequency. While in the early layers there is little correlation, later in the network, each expert receives and processes tokens from only a handful of classes. Therefore, we can conclude that some semantic clustering of the patches emerges in the deeper layers of the network.

Routing decisions in higher layers correlate with image classes. We show two MoE layers of a V-MoE-H/14. The x-axis corresponds to the 32 experts in a layer. The y-axis are the 1000 ImageNet classes; orderings for both axes are different across plots (to highlight correlations). For each pair (expert e, class c) we show the average routing weight for the tokens corresponding to all images with class c for that particular expert e.
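The analysis behind these heatmaps boils down to counting how often each expert is selected for tokens of each image class. A small sketch is below; the logged routing decisions are random stand-ins, and the names are illustrative.

```python
import numpy as np

n_experts, n_classes = 32, 1000
rng = np.random.default_rng(0)

# Stand-ins for logged routing decisions: for each token, the expert it was
# sent to and the class of the image it came from.
selected_expert = rng.integers(0, n_experts, size=100_000)
image_class = rng.integers(0, n_classes, size=100_000)

freq = np.zeros((n_classes, n_experts))
np.add.at(freq, (image_class, selected_expert), 1)       # class-by-expert selection counts
freq /= freq.sum(axis=1, keepdims=True)                  # normalize per class

print(freq.shape)   # (1000, 32): one row of expert-selection frequencies per class
```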

Final Thoughts
We train very large vision models using conditional computation, delivering significant improvements in representation and transfer learning for relatively little training cost. Alongside V-MoE, we introduced BPR, which requires the model to process only the most useful tokens in the expert layers.

We believe this is just the beginning of conditional computation at scale for computer vision; extensions include multi-modal and multi-task models, scaling up the expert count, and improving transfer of the representations produced by sparse models. Heterogeneous expert architectures and conditional variable-length routes are also promising directions. Sparse models can especially help in data rich domains such as large-scale video modeling. We hope our open-source code and models help attract and engage researchers new to this field.

Acknowledgments
We thank our co-authors: Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. We thank Alex Kolesnikov, Lucas Beyer, and Xiaohua Zhai for providing continuous help and details about scaling ViT models. We are also grateful to Josip Djolonga, Ilya Tolstikhin, Liam Fedus, and Barret Zoph for feedback on the paper; James Bradbury, Roy Frostig, Blake Hechtman, Dmitry Lepikhin, Anselm Levskaya, and Parker Schuh for invaluable support helping us run our JAX models efficiently on TPUs; and many others from the Brain team for their support. Finally, we would also like to thank and acknowledge Tom Small for the awesome animated figure used in this post.

Categories
Misc

How Retailers Meet Tough Challenges Using NVIDIA AI

At the National Retail Federation’s annual trade show, conversations tend to touch on recurring themes: “Will we be able to stock must-have products for next Christmas?,” “What incentives can I offer to loyal workers?” and “What happens to my margins if Susie Consumer purchases three of the same dresses online and returns two?” The $26 Read article >


Categories
Misc

AI Startup to Take a Bite Out of Fast-Food Labor Crunch

Addressing a growing labor crisis among quick-service restaurants, startup Vistry is harnessing AI to automate the process of taking orders. The company will share its story at the NRF Big Show, the annual industry gathering of the National Retail Federation in New York, starting Jan. 16. “They’re closing restaurants because there is not enough labor,” Read article >


Categories
Misc

GFN Thursday: ‘Fortnite’ Comes to iOS Safari and Android Through NVIDIA GeForce NOW via Closed Beta

Starting next week, Fortnite on GeForce NOW will launch in a limited-time closed beta for mobile, all streamed through the Safari web browser on iOS and the GeForce NOW Android app. The beta is open for registration for all GeForce NOW members, and will help test our server capacity, graphics delivery and new touch controls Read article >


Categories
Misc

World Record-Setting DNA Sequencing Technique Helps Clinicians Rapidly Diagnose Critical Care Patients

Cutting down the time needed to sequence and analyze a patient’s whole genome from days to hours isn’t just about clinical efficiency — it can save lives. By accelerating every step of this process — from collecting a blood sample to sequencing the whole genome to identifying variants linked to diseases — a research team Read article >
