Rafi Nizam is an award-winning independent animator, director, character designer and more. He’s developed feature films at Sony Pictures, children’s series and comedies at BBC and global transmedia content at NBCUniversal.
On Sept. 12, learn about the connection between MATLAB and NVIDIA Isaac Sim through ROS.
On Sept. 12, learn about the connection between MATLAB and NVIDIA Isaac Sim through ROS.
With coral reefs in rapid decline across the globe, researchers from the University of Hawaii at Mānoa have pioneered an AI-based surveying tool that monitors reef health from the sky. Using deep learning models and high-resolution satellite imagery powered by NVIDIA GPUs, the researchers have developed a new method for spotting and tracking coral reef Read article >
Creating 3D scans of physical products can be time consuming. Businesses often use traditional methods, like photogrammetry-based apps and scanners, but these can take hours or even days. They also don’t always provide the 3D quality and level of detail needed to make models look realistic in all its applications. Italy-based startup Covision Media is Read article >
Underscoring NVIDIA’s growing relationship with the global technology superpower, Indian Prime Minister Narendra Modi met with NVIDIA founder and CEO Jensen Huang Monday evening. The meeting at 7 Lok Kalyan Marg — as the Prime Minister’s official residence in New Delhi is known — comes as Modi prepares to host a gathering of leaders from Read article >
Every year, as part of their coursework, students from the University of Warsaw, Poland get to work under the supervision of engineers from the NVIDIA Warsaw…
Every year, as part of their coursework, students from the University of Warsaw, Poland get to work under the supervision of engineers from the NVIDIA Warsaw office on challenging problems in deep learning and accelerated computing. We present the work of three M.Sc. students—Alicja Ziarko, Paweł Pawlik, and Michał Siennicki—who managed to significantly reduce the latency in TorToiSe, a multi-stage, diffusion-based, text-to-speech (TTS) model.
Alicja, Paweł, and Michał first learned about the recent advancements in speech synthesis and diffusion models. They chose the combination of classifier-free guidance and progressive distillation, which performs well in computer vision, and adapted it to speech synthesis, achieving a 5x reduction in diffusion latency without a regression in speech quality. Small perceptual speech tests confirmed the results. Notably, this approach does not require costly training from scratch on the original model.
Why speed up diffusion-based TTS?
Since the publication of WaveNet in 2016, neural networks have become the primary models for speech synthesis. In simple applications, such as synthesis for AI-based voice assistants, synthetic voices are almost indistinguishable from human speech. Such voices can be synthesized orders of magnitudes faster than real time, for instance with the NVIDIA NeMo AI toolkit.
However, achieving high expressivity or imitating a voice based on a few seconds of recorded speech (few-shot) is still considered challenging.
Denoising Diffusion Probabilistic Models (DDPMs) emerged as a generative technique that enables the generation of images of great quality and expressivity based on input text. DDPMs can be readily applied to TTS because a frequency-based spectrogram, which graphically represents a speech signal, can be processed like an image.
For instance, in TorToiSe, which is a guided diffusion-based TTS model, a spectrogram is generated by combining the results of two diffusion models (Figure 1). The iterative diffusion process involves hundreds of steps to achieve a high-quality output, significantly increasing latency compared to state-of-the-art TTS methods, which severely limits its applications.
In Figure 1, the unconditional diffusion model iteratively refines the initial noise until a high-quality spectrogram is obtained. The second diffusion model is further conditioned on the text embeddings produced by the language model.
Methods for speeding up diffusion
Existing latency reduction techniques in diffusion-based TTS can be divided into training-free and training-based methods.
Training-free methods do not involve training the network used to generate images by reversing the diffusion process. Instead, they only focus on optimizing the multi-step diffusion process. The diffusion process can be seen as solving ODE/SDE equations, so one way to optimize it is to create a better solver like DDPM, DDIM, and DPM, which lowers the number of diffusion steps. Parallel sampling methods, such as those based on Picard iterations or Normalizing Flows, can parallelize the diffusion process to benefit from parallel computing on GPUs.
Training-based methods focus on optimizing the network used in the diffusion process. The network can be pruned, quantized, or sparsified, and then fine-tuned for higher accuracy. Alternatively, its neural architecture can be changed manually or automatically using NAS. Knowledge distillation techniques enable distilling the student network from the teacher network to reduce the number of steps in the diffusion process.
Distillation in diffusion-based TTS
Alicja, Paweł, and Michał decided to use the distillation approach based on promising results in computer vision and its potential for an estimated 5x reduction in latency of the diffusion model at inference. They have managed to adapt progressive distillation to the diffusion part of a pretrained TorToiSe model, overcoming problems like the lack of access to the original training data.
Their approach consists of two knowledge distillation phases:
- Mimicking the guided diffusion model output
- Training another student model
In the first knowledge distillation phase (Figure 2), the student model is trained to mimic the output of the guided diffusion model at each diffusion step. This phase reduces latency by half by combining the two diffusion models into one model.
To address the lack of access to the original training data, text embeddings from the language model are passed through the original teacher model to generate synthetic data used in distillation. The use of synthetic data also makes the distillation process more efficient because the entire TTS, guided diffusion pipeline does not have to be invoked at each distillation step.
In the second progressive distillation phase (Figure 3), the newly trained student model serves as a teacher to train another student model. In this technique, the student model is trained to mimic the teacher model while reducing the number of diffusion steps by a factor of two. This process is repeated many times to further reduce the number of steps, while each time, a new student serves as the teacher for the next round of distillation.
A progressive distillation with seven iterations reduces the number of inference steps 7^2 times, from 4,000 steps on which the model was trained to 31 steps. This reduction results in a 5x speedup compared to the guided diffusion model, excluding the text embedding calculation cost.
The perceptual pairwise speech test shows that the distilled model (after the second phase) matches the quality of speech produced by the TTS model based on guided distillation.
As an example, listen to audio samples in Table 1 generated by the progressive distillation-based TTS model. The samples match the quality of the audio samples from the guided diffusion-based TTS model. If we simply reduced the number of distillation steps to 31, instead of using progressive distillation, the quality of the generated speech deteriorates significantly.
Speaker |
Guided diffusion-based TTS model (2×80 diffusion steps) |
Diffusion-based TTS after progressive distillation (31 diffusion steps) |
Guided diffusion-based TTS model (naive reduction to 31 diffusion steps) |
---|---|---|---|
Female 1 |
Audio | Audio | Audio |
Female 2 | Audio | Audio | Audio |
Female 3 | Audio | Audio | Audio |
Male 1 | Audio | Audio | Audio |
Conclusion
Collaborating with academia and assisting young students in shaping their future in science and engineering is one of the core NVIDIA values. Alicja, Paweł, and Michał’s successful project exemplifies the NVIDIA Warsaw, Poland office partnership with local universities.
The students managed to solve the challenging problem of speeding up the pretrained, diffusion-based, text-to-speech (TTS) model. They designed and implemented a knowledge distillation-based solution in the complex field of diffusion-based TTS, achieving a 5x speedup of the diffusion process. Most notably, their unique solution based on synthetic data generation is applicable to pretrained TTS models without access to the original training data.
We encourage you to explore NVIDIA Academic Programs and try out the NVIDIA NeMo Framework to create complete conversational AI (TTS, ASR, or NLP/LLM) solutions for the new era of generative AI.
Advanced API Performance: Shaders
This post covers best practices when working with shaders on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced…
This post covers best practices when working with shaders on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips.
Shaders play a critical role in graphics programming by enabling you to control various aspects of the rendering process. They run on the GPU and are responsible for manipulating vertices, pixels, and other data.
- General shaders
- Compute shaders
- Pixel shaders
- Vertex shaders
- Geometry, domain, and hull shaders
General shaders
These tips apply to all types of shaders.
Recommended
- Avoid warp-divergent constant buffer view (CBV) and immediate constant buffer (ICB) reads.
- Constant buffer reads are most effective when threads in a warp access data uniformly. If you need divergent reads, use shader resource view (SRVs).
- Typical cases where SRVs should be preferred over CBVs include the following:
- Bones or skinning data
- Lookup tables, like precomputed random numbers
- To optimize buffers and group shared memory, use manual bit packing. When creating structures for packing data, consider the range of values a field can hold and choose the smallest datatype that can encompass this range.
- Optimize control flow by providing hints of the expected runtime behavior.
- Make sure to enable compile flag -all-resources-bound for DXC (or D3DCOMPILE_ALL_RESOURCES_BOUND in FXC) if possible. This enables a larger set of driver-side optimizations.
- Consider using the [FLATTEN] and [BRANCH] keywords where appropriate.
- A conditional branch may prevent the compiler from hoisting long-latency instructions, such as texture fetches.
- The [FLATTEN] keyword hints that the compiler is free to hoist and start the load operations before the statement has been evaluated.
- Use Root Signature 1.1 to specify static data and descriptors to enable the driver to make the most optimal shader optimizations.
- Keep the register use to a minimum. Register allocation could limit occupancy and may force the driver to spill registers to memory.
- Prefer the use of gather instructions when loading single channel texture quads.
- This will cut down the expected latency by almost 4x compared to the equivalent operation constructed from consecutive sample instructions.
- Prefer structured buffers over raw buffers.
- Structured buffers have stricter alignment requirements, which enables the driver to schedule more efficient load instructions.
- Consider using numerical approximations or precomputed lookup tables of transcendental functions (exp, log, sin, cos, sqrt) in math-intensive shaders, for instance, physics simulations and denoisers.
- To promote a fast path in the TEX unit, with up to 2x speedup, use point filtering in certain circumstances:
- Low-resolution textures where point filtering is already an accurate representation.
- Textures that are being accessed at their native resolution.
Not recommended
- Don’t assume that half-precision floats are always faster than full precision and the reverse.
- On NVIDIA Ampere GPUs, it’s just as efficient to execute FP32 as FP16 instructions. The overhead of converting between precision formats may just end up with a net loss.
- NVIDIA Turing GPUs may benefit from using FP16 math, as FP16 can be issued at twice the rate of FP32.
Compute shaders
Compute shaders are used for general-purpose computations, from data processing and simulations to machine learning.
Recommended
- Consider using wave intrinsics over group shared memory when possible for communication across threads.
- Wave intrinsics don’t require explicit thread synchronization.
- Starting from SM 6.0, HLSL supports warp-wide wave intrinsics natively without the need for vendor-specific HLSL extensions. Consider using vendor-specific APIs only when the expected functionality is missing. For more information, see Unlocking GPU Intrinsics in HLSL.
- To increase atomic throughput, use wave instructions to coalesce atomic operations across a warp.
- To maximize cache locality and to improve L1 and L2 hit rate, try thread group ID swizzling for full-screen compute passes.
- A good starting point is to target a thread group size corresponding to between two or eight warps. For instance, thread group size 8x8x1 or 16x16x1 for full-screen passes. Make sure to profile your shader and tune the dimensions based on profiling results.
Not recommended
- Do not make your thread group size difficult to scale per platform and GPU architecture.
- Specialization constants can be used in Vulkan to set the dimensions at pipeline creation time whereas HLSL requires the thread group size to be known at shader compile time.
- Be careless of thread group launch latency.
- If your CS has early-out conditions that are expected to early out in most cases, it might be better to choose larger thread group dimensions and cut down on the total number of thread groups launched.
Pixel shaders
Pixel shaders, also known as fragment shaders, are used to calculate effects on a per-pixel basis.
Recommended
- Prefer the use of depth bounds test or stencil and depth testing over manual depth tests in pixel shaders.
- Depth and stencil tests may discard entire 16×16 raster tiles down to individual pixels. Make sure that Early-Z is enabled.
- Be mindful of the use patterns that may force the driver to disable Early-Z testing:
- Conditional z-writes such as clip and discard
- As an alternative consider using null blend ops instead
- Pixel shader depth write
- Writing to UAV resources
- Conditional z-writes such as clip and discard
- Consider converting your full screen pass to a compute shader if there’s a large difference in latency between warps.
Not recommended
- Don’t use raster order view (ROV) techniques pervasively.
- Guaranteeing order doesn’t come for free.
- Always compare with alternative approaches like advanced blending ops and atomics.
Vertex shaders
Vertex shaders are used to calculate effects on a per-vertex basis.
Recommended
- Prefer the use of compressed vertex formats.
- Prefer the use of SRVs for skinning data over CBVs. This is a typical case of divergent CBV reads.
Geometry, domain, and hull shaders
Geometry, domain, and hull shaders are used to control, evaluate, and generate geometry, enabling tessellation to create a dynamic generation of surfaces and objects.
Recommended
- Replace the geometry, domain, and hull shaders with the mesh shading capabilities introduced in NVIDIA Turing.
- Enable the fast geometry path with the following configuration:
- Fixed topology: Neither an expansion or reduction in the number of vertices.
- Fixed primitive type: The input primitive type is equal to the output primitive type.
- Immutable per-vertex attributes: The application cannot change the vertex attributes and can only copy them from the input to the output.
- Mutable per-primitive attributes: The application can compute a single value for the whole primitive, which then is passed to the fragment shader stage. For example, it can compute the area of the triangle.
Acknowledgments
Thanks to Ryan Prescott, Ana Mihut, Katherine Sun, and Ivan Fedorov.
In 1950, weather forecasting started its digital revolution when researchers used the first programmable, general-purpose computer ENIAC to solve mathematical equations describing how weather evolves. In the more than 70 years since, continuous advancements in computing power and improvements to the model formulations have led to steady gains in weather forecast skill: a 7-day forecast today is about as accurate as a 5-day forecast in 2000 and a 3-day forecast in 1980. While improving forecast accuracy at the pace of approximately one day per decade may not seem like a big deal, every day improved is important in far reaching use cases, such as for logistics planning, disaster management, agriculture and energy production. This “quiet” revolution has been tremendously valuable to society, saving lives and providing economic value across many sectors.
Now we are seeing the start of yet another revolution in weather forecasting, this time fueled by advances in machine learning (ML). Rather than hard-coding approximations of the physical equations, the idea is to have algorithms learn how weather evolves from looking at large volumes of past weather data. Early attempts at doing so go back to 2018 but the pace picked up considerably in the last two years when several large ML models demonstrated weather forecasting skill comparable to the best physics-based models. Google’s MetNet [1, 2], for instance, demonstrated state-of-the-art capabilities for forecasting regional weather one day ahead. For global prediction, Google DeepMind created GraphCast, a graph neural network to make 10 day predictions at a horizontal resolution of 25 km, competitive with the best physics-based models in many skill metrics.
Apart from potentially providing more accurate forecasts, one key advantage of such ML methods is that, once trained, they can create forecasts in a matter of minutes on inexpensive hardware. In contrast, traditional weather forecasts require large super-computers that run for hours every day. Clearly, ML represents a tremendous opportunity for the weather forecasting community. This has also been recognized by leading weather forecasting centers, such as the European Centre for Medium-Range Weather Forecasts’ (ECMWF) machine learning roadmap or the National Oceanic and Atmospheric Administration’s (NOAA) artificial intelligence strategy.
To ensure that ML models are trusted and optimized for the right goal, forecast evaluation is crucial. Evaluating weather forecasts isn’t straightforward, however, because weather is an incredibly multi-faceted problem. Different end-users are interested in different properties of forecasts, for example, renewable energy producers care about wind speeds and solar radiation, while crisis response teams are concerned about the track of a potential cyclone or an impending heat wave. In other words, there is no single metric to determine what a “good” weather forecast is, and the evaluation has to reflect the multi-faceted nature of weather and its downstream applications. Furthermore, differences in the exact evaluation setup — e.g., which resolution and ground truth data is used — can make it difficult to compare models. Having a way to compare novel and established methods in a fair and reproducible manner is crucial to measure progress in the field.
To this end, we are announcing WeatherBench 2 (WB2), a benchmark for the next generation of data-driven, global weather models. WB2 is an update to the original benchmark published in 2020, which was based on initial, lower-resolution ML models. The goal of WB2 is to accelerate the progress of data-driven weather models by providing a trusted, reproducible framework for evaluating and comparing different methodologies. The official website contains scores from several state-of-the-art models (at the time of writing, these are Keisler (2022), an early graph neural network, Google DeepMind’s GraphCast and Huawei’s Pangu-Weather, a transformer-based ML model). In addition, forecasts from ECMWF’s high-resolution and ensemble forecasting systems are included, which represent some of the best traditional weather forecasting models.
![]() |
Making evaluation easier
The key component of WB2 is an open-source evaluation framework that allows users to evaluate their forecasts in the same manner as other baselines. Weather forecast data at high-resolutions can be quite large, making even evaluation a computational challenge. For this reason, we built our evaluation code on Apache Beam, which allows users to split computations into smaller chunks and evaluate them in a distributed fashion, for example using DataFlow on Google Cloud. The code comes with a quick-start guide to help people get up to speed.
Additionally, we provide most of the ground-truth and baseline data on Google Cloud Storage in cloud-optimized Zarr format at different resolutions, for example, a comprehensive copy of the ERA5 dataset used to train most ML models. This is part of a larger Google effort to provide analysis-ready, cloud-optimized weather and climate datasets to the research community and beyond. Since downloading these data from the respective archives and converting them can be time-consuming and compute-intensive, we hope that this should considerably lower the entry barrier for the community.
Assessing forecast skill
Together with our collaborators from ECMWF, we defined a set of headline scores that best capture the quality of global weather forecasts. As the figure below shows, several of the ML-based forecasts have lower errors than the state-of-the-art physical models on deterministic metrics. This holds for a range of variables and regions, and underlines the competitiveness and promise of ML-based approaches.
![]() |
This scorecard shows the skill of different models compared to ECMWF’s Integrated Forecasting System (IFS), one of the best physics-based weather forecasts, for several variables. IFS forecasts are evaluated against IFS analysis. All other models are evaluated against ERA5. The order of ML models reflects publication date. |
Toward reliable probabilistic forecasts
However, a single forecast often isn’t enough. Weather is inherently chaotic because of the butterfly effect. For this reason, operational weather centers now run ~50 slightly perturbed realizations of their model, called an ensemble, to estimate the forecast probability distribution across various scenarios. This is important, for example, if one wants to know the likelihood of extreme weather.
Creating reliable probabilistic forecasts will be one of the next key challenges for global ML models. Regional ML models, such as Google’s MetNet already estimate probabilities. To anticipate this next generation of global models, WB2 already provides probabilistic metrics and baselines, among them ECMWF’s IFS ensemble, to accelerate research in this direction.
As mentioned above, weather forecasting has many aspects, and while the headline metrics try to capture the most important aspects of forecast skill, they are by no means sufficient. One example is forecast realism. Currently, many ML forecast models tend to “hedge their bets” in the face of the intrinsic uncertainty of the atmosphere. In other words, they tend to predict smoothed out fields that give lower average error but do not represent a realistic, physically consistent state of the atmosphere. An example of this can be seen in the animation below. The two data-driven models, Pangu-Weather and GraphCast (bottom), predict the large-scale evolution of the atmosphere remarkably well. However, they also have less small-scale structure compared to the ground truth or the physical forecasting model IFS HRES (top). In WB2 we include a range of these case studies and also a spectral metric that quantifies such blurring.
![]() |
Forecasts of a front passing through the continental United States initialized on January 3, 2020. Maps show temperature at a pressure level of 850 hPa (roughly equivalent to an altitude of 1.5km) and geopotential at a pressure level of 500 hPa (roughly 5.5 km) in contours. ERA5 is the corresponding ground-truth analysis, IFS HRES is ECMWF’s physics-based forecasting model. |
Conclusion
WeatherBench 2 will continue to evolve alongside ML model development. The official website will be updated with the latest state-of-the-art models. (To submit a model, please follow these instructions). We also invite the community to provide feedback and suggestions for improvements through issues and pull requests on the WB2 GitHub page.
Designing evaluation well and targeting the right metrics is crucial in order to make sure ML weather models benefit society as quickly as possible. WeatherBench 2 as it is now is just the starting point. We plan to extend it in the future to address key issues for the future of ML-based weather forecasting. Specifically, we would like to add station observations and better precipitation datasets. Furthermore, we will explore the inclusion of nowcasting and subseasonal-to-seasonal predictions to the benchmark.
We hope that WeatherBench 2 can aid researchers and end-users as weather forecasting continues to evolve.
Acknowledgements
WeatherBench 2 is the result of collaboration across many different teams at Google and external collaborators at ECMWF. From ECMWF, we would like to thank Matthew Chantry, Zied Ben Bouallegue and Peter Dueben. From Google, we would like to thank the core contributors to the project: Stephan Rasp, Stephan Hoyer, Peter Battaglia, Alex Merose, Ian Langmore, Tyler Russell, Alvaro Sanchez, Antonio Lobato, Laurence Chiu, Rob Carver, Vivian Yang, Shreya Agrawal, Thomas Turnbull, Jason Hickey, Carla Bromberg, Jared Sisk, Luke Barrington, Aaron Bell, and Fei Sha. We also would like to thank Kunal Shah, Rahul Mahrsee, Aniket Rawat, and Satish Kumar. Thanks to John Anderson for sponsoring WeatherBench 2. Furthermore, we would like to thank Kaifeng Bi from the Pangu-Weather team and Ryan Keisler for their help in adding their models to WeatherBench 2.