
Leveraging AI Music with NVIDIA DGX-2

Language models such as the NVIDIA Megatron-LM and OpenAI GPT-2 and GPT-3 have been used to enhance human productivity and creativity. Specifically, these models have been used as powerful tools for writing, programming, and painting. The same architecture can be used for music composition.

Large datasets are required to use language models in these domains. Starting with 50 GB of uncompressed text files for language generation is no surprise. This implies the need for a lot of GPU compute to train the models effectively for rapid development, prototyping, and iteration.

This post provides an account of a series of experiments performed in the field of AI music using the NVIDIA DGX-2 platform. DGX-2 boosted progress significantly in both data preprocessing and training language models.

Datasets for AI music

There are two major classes when it comes to datasets for Computational Music. One approach involves training on the music represented as pure audio (WAV files or MP3s). The second approach does not work with the pure audio. Instead, you map anything that resembles sheet music to a token representation. 

Usually, this requires tokens for which note starts (C, D, E, F, or G, for example), how much time passes (quarter notes or eighth notes, for example), and which note ends. In research and application, MIDI files have proven to be fruitful sources of musical material. The MIDI standard was designed to store music information electronically.

These experiments used several sets of MIDI files, including:

  • JS Fake Chorales Dataset
  • The Lakh MIDI Dataset and its Clean subset
  • The MetaMIDI Dataset

Video 1. AI music composed using a GPT trained on the MetaMIDI Dataset

The MIDI format is a non-human-readable representation of music. To train a causal language model, it has to be mapped to a readable token representation. For this representation, we took inspiration from the mmmtrack encoding.

This encoding represents pieces of music as a hierarchy. A piece of music consists of different tracks for different instruments: drums, guitars, bass, and piano, for example. Each track consists of several bars (4, 8, or 16 bars, depending on the use case). And each bar holds a sequence of note-on, time-delta, and note-off events. Although this hierarchy can be considered a tree, it is possible to encode everything as a linear sequence, making it an ideal representation for decoder-only language models.

The example below is a four-part chorale in its piano roll representation. A chorale features four voices: soprano, alto, tenor, and bass. Soprano and alto are female voices, and tenor and bass are male voices. Usually, all four voices sing at the same time but with different, harmonic pitches. 

Figure 1 visualizes the voices with pitch color coding. The soprano is green, the alto is orange, the tenor is blue, and the bass is red. You can encode these musical events—which have both a time and a pitch dimension—to a sequence of tokens.

Graph visualization of music tokens generated with the Music GPT model. The music tokens are pitch color-coded.
Figure 1. A sample of generated music tokens visualized with pitch color coding

Following the mmmtrack encoding, the bass part would be mapped to the following token representation:


With a little practice, humans can read and understand this representation. The representation starts with PIECE_START indicating the start of a piece of music. TRACK_START indicates the beginning and TRACK_END the end of a track (or instrument or voice). The INST=BASS token denotes that this track contains the bass voice. Other voices are represented the same way. BAR_START and BAR_END represent the beginning and the end of a bar, respectively. NOTE_ON=61 is the start of a note with pitch 61. 

On the piano, this would be the note C#4 in scientific pitch notation, where middle C (MIDI pitch 60) is written as C4. TIME_DELTA=4 means that a duration of four sixteenth notes would elapse. That would be a quarter note. After that, the note would end, represented by NOTE_OFF=61. And so on and so forth. At this point, this notation would also allow for polyphony. Several tracks would sound notes at the same time, and each track could have parallel notes. This makes the encoding universal.
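The token vocabulary described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the mmmtrack implementation: the function names, the input format (a list of bars, each a list of pitch and duration pairs), and the exact token order are assumptions. It handles only monophonic tracks, where each note ends before the next one starts.

```python
from typing import List, Tuple

# A note is (MIDI pitch, duration in sixteenth notes). Hypothetical input format.
Note = Tuple[int, int]


def encode_track(instrument: str, bars: List[List[Note]]) -> List[str]:
    """Encode one monophonic track as a flat list of tokens."""
    tokens = ["TRACK_START", f"INST={instrument}"]
    for bar in bars:
        tokens.append("BAR_START")
        for pitch, duration in bar:
            # Each note becomes note-on, elapsed time, note-off.
            tokens.append(f"NOTE_ON={pitch}")
            tokens.append(f"TIME_DELTA={duration}")
            tokens.append(f"NOTE_OFF={pitch}")
        tokens.append("BAR_END")
    tokens.append("TRACK_END")
    return tokens


def encode_piece(tracks: List[Tuple[str, List[List[Note]]]]) -> List[str]:
    """Encode all tracks of a piece as one linear token sequence."""
    tokens = ["PIECE_START"]
    for instrument, bars in tracks:
        tokens.extend(encode_track(instrument, bars))
    return tokens


# One bass track with a single bar holding one quarter note at pitch 61.
piece = encode_piece([("BASS", [[(61, 4)]])])
print(" ".join(piece))
```

Because the hierarchy of piece, track, and bar is flattened with start and end markers, the result is a plain sequence that a decoder-only model can consume directly.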

Each piece of music differs in the number of bars. Encoding an entire song could require a long sequence length, making the training of a corresponding Transformer computationally expensive. These experiments encode most of the datasets with four bars and a few with eight bars. Experiments with 16 bars are underway. In addition, only music in 4/4 meter was used. This covers the better part of Western music. Other meters such as 3/4 (waltz) can be the subject of future work.

This sequence of different experiments mapped many MIDI datasets to the described token format. The same preprocessor was used throughout. Once the preprocessor worked with small datasets, it immediately worked with larger ones. 

The processing time depends on the number of MIDI files to be encoded, ranging from a few minutes to many hours. The longest preprocessing took 30 hours on DGX-2 running on all 96 CPUs in parallel. It is estimated that this would take about 10-14 days of processing on a state-of-the-art MacBook Pro.
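As a rough check of what those numbers imply, the following lines compare the 30 hours on DGX-2 with the estimated 10 to 14 days on a laptop. The variable names are illustrative, and the laptop figure is the estimate quoted above, not a measurement.

```python
# Figures quoted in the text above.
dgx2_hours = 30            # longest preprocessing run on DGX-2
macbook_days = (10, 14)    # estimated range on a MacBook Pro

# Convert the laptop estimate to hours and divide by the DGX-2 time.
speedups = [days * 24 / dgx2_hours for days in macbook_days]
print(f"Implied speedup: {speedups[0]:.1f}x to {speedups[1]:.1f}x")
```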

Graph of music datasets (MIDI files) used for training the Music GPT models in bar chart sorted from the largest datasets to the smallest
Figure 2. Music datasets used for training the GPT models

Encoding a dataset of MIDI files would yield a collection of token files. The size of those token files depends on the number of MIDI files and the number of bars. Consider some of the experiment datasets and their encoded dataset sizes:

  • JS Fake Chorales Dataset: 14 MB with four bars per sample
  • The Lakh MIDI Dataset: 72 GB, its Clean subset 19 GB with four bars per sample
  • The MetaMIDI Dataset: 130 GB with four bars and 230 GB with eight bars per sample

You can imagine that training on the 14 MB of JS Fake Chorales would take just a few hours. Training on the MetaMIDI Dataset with its 130 GB would take many days. Training for these experiments lasted between 10 and 15 days.

Model training

Many models were trained using the HuggingFace GPT-2 implementation. A few models were trained using the NVIDIA Megatron-LM in GPT-2 mode. 

Training with HuggingFace boiled down to uploading the dataset to the DGX-2 and then running a training script that contained all functionality, including the model and training parameters. The same script was used for all our datasets, with just a few changes here and there. It was just a matter of scale.

For Megatron-LM, the environment setup is as easy as pulling and running an NGC PyTorch Docker container, then getting to work immediately with a Jupyter notebook in the browser through SSH tunneling into the DGX-2 machine.

Most of the experiments used the same GPT-2 architecture: six decoder blocks and eight attention heads. The embedding size was 512, and the sequence length was 2048. Although this is definitely not a Large Language Model (LLM), which can have around 100 decoder blocks, subjective evaluation showed that for AI music this architecture works like a charm.
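To get a feeling for how small this architecture is, the parameter count can be estimated from the numbers above. The sketch below follows the standard GPT-2 layer layout (fused query/key/value projection, a 4x wide feed-forward block, two layer norms per block, and tied input and output embeddings). The vocabulary size is not stated in this post, so the value of 512 used here is purely a placeholder assumption.

```python
def gpt2_param_count(n_layer: int, n_embd: int, n_positions: int, vocab_size: int) -> int:
    """Estimate the parameter count of a GPT-2 style decoder with tied embeddings."""
    d = n_embd
    attention = (3 * d * d + 3 * d) + (d * d + d)      # QKV projection + output projection
    feed_forward = (4 * d * d + 4 * d) + (4 * d * d + d)
    layer_norms = 2 * (2 * d)                          # two layer norms, scale and bias each
    per_block = attention + feed_forward + layer_norms
    embeddings = vocab_size * d + n_positions * d      # token + position embeddings
    final_layer_norm = 2 * d
    return n_layer * per_block + embeddings + final_layer_norm


# Architecture from the text; vocab_size=512 is a hypothetical placeholder.
total = gpt2_param_count(n_layer=6, n_embd=512, n_positions=2048, vocab_size=512)
print(f"{total:,} parameters")
```

Note that the number of attention heads does not change the count, because the heads split the same projection matrices. Under these assumptions the model lands at roughly 20 million parameters, several orders of magnitude below a typical LLM.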

Using the NVIDIA DGX-2 really made a difference in rapid iteration. Datasets that would train for multiple days on a single GPU trained in just a few hours on DGX-2. Datasets that would train for months on a single GPU finished training after at most two weeks on DGX-2. Especially for experiments with datasets

Training times for some of the datasets were as follows:

  • The Lakh MIDI Clean Dataset took 15 hours for 10 epochs and roughly 15K songs
  • The Lakh MIDI Dataset took 130 hours for 10 epochs and roughly 175K songs
  • The MetaMIDI Dataset took 290 hours for 9 epochs and roughly 400K songs
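The figures in this list are easier to compare when converted to days and to hours per epoch. The following sketch does that conversion using only the numbers quoted above; the dictionary layout and names are illustrative.

```python
# (total training hours, epochs) as quoted in the list above.
runs = {
    "Lakh MIDI Clean": (15, 10),
    "Lakh MIDI": (130, 10),
    "MetaMIDI": (290, 9),
}

summary = {}
for name, (hours, epochs) in runs.items():
    summary[name] = {
        "days": hours / 24,
        "hours_per_epoch": hours / epochs,
    }
    print(f"{name}: {hours / 24:.1f} days, {hours / epochs:.1f} hours per epoch")
```

The MetaMIDI run works out to about 12 days in total, which is consistent with the statement that the longest training runs finished within two weeks.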

Note that the JS Fake Chorales dataset was trained earlier and not on the DGX-2. Due to its very small size, it was not necessary to use a multi-GPU setup. It could even be trained overnight on a MacBook Pro.


This section provides a closer look at the NVIDIA DGX-2 specifications. As mentioned above, the platform is very effective for both accelerated dataset preprocessing and training language models. This section will be a delightfully technical one.

Figure 3. DGX-2 station 

NVIDIA DGX-2 is a powerful system with 16 fully connected Tesla V100 32 GB GPUs using NVSwitch. It is capable of delivering 2.4 TB/sec of bisection bandwidth. DGX-2 has been designed for AI researchers and data scientists who need both performance and scalability. 

For transformer models, NVIDIA DGX-2 is able to deliver up to 517,227 tokens per second throughput with mixed precision, making it especially powerful.

A table showing multi-GPU performance information for the NVIDIA DGX-2 station, specifically on throughput for Floating Point 32 / mixed precision varying number of GPUs and batch sizes.
Table 1. Multi-GPU performance table for DGX-2

Software framework: NVIDIA Megatron-LM

To get the most out of powerful compute, you need stable and optimized software. With a performance-optimized framework such as NVIDIA Megatron-LM, performance is scaled almost linearly as the GPT model sizes are scaled. For related information, see Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.

A baseline is achieved by training a model of 1.2 billion parameters on a single NVIDIA V100 32 GB GPU that sustains 39 teraflops. This is 30% of the theoretical peak flops for a single GPU as configured in a DGX-2H server, and is thus a strong baseline.

Scaling the model to 8.3 billion parameters on 512 GPUs with 8-way model parallelism achieved up to 15.1 petaflops sustained over the entire application. This is 76% scaling efficiency compared to the single GPU case.
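The 76% figure can be reproduced from the numbers above: perfect scaling would multiply the single-GPU baseline by the number of GPUs, and the efficiency is the ratio of the sustained throughput to that ideal. The variable names below are illustrative.

```python
single_gpu_teraflops = 39.0     # sustained on one V100 for the 1.2B model
fraction_of_peak = 0.30         # stated share of theoretical peak
num_gpus = 512
sustained_petaflops = 15.1      # sustained on 512 GPUs for the 8.3B model

# Peak per GPU implied by "39 teraflops is 30% of peak".
implied_peak_teraflops = single_gpu_teraflops / fraction_of_peak

# Ideal throughput if every GPU matched the single-GPU baseline.
ideal_petaflops = num_gpus * single_gpu_teraflops / 1000

scaling_efficiency = sustained_petaflops / ideal_petaflops
print(f"Implied peak per GPU: {implied_peak_teraflops:.0f} teraflops")
print(f"Scaling efficiency: {scaling_efficiency:.1%}")
```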

A plot with achieved petaflops on the y-axis and number of GPUs used on the x-axis to demonstrate near linear scaling performance with Megatron-LM up to thousands of GPUs as we scale model sizes up to 1 trillion parameters.
Figure 4. Scaling to thousands of GPUs with NVIDIA Megatron-LM, without losing performance

By fixing the sequence length (seq_len, shorthand s) at 4,096, modifying the training configurations, and launching training runs with only a few iterations, it is possible to calculate the percentage of peak teraflops achieved in real application job runs.

After a native run, both the nvidia-smi output and the Nsight profile were analyzed. Different configurations were tested to obtain the highest possible teraflops, as the following table illustrates:

Table 2. Teraflops calculation using the third equation presented in Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

The training configuration presented in the last row of the table delivered the highest teraflops utilization, 45.45%.
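For readers who want to repeat this kind of calculation, the sketch below implements equation 3 from Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM, which estimates the floating-point operations per training iteration as F = 96·B·s·l·h²·(1 + s/(6h) + V/(16·l·h)). As far as we recall, that formula assumes activation recomputation. The helper names, the iteration time, and the peak teraflops per GPU are inputs you would have to supply yourself; none of the configurations or timings from Table 2 are reproduced here.

```python
def flops_per_iteration(batch_size: int, seq_len: int, num_layers: int,
                        hidden_size: int, vocab_size: int) -> float:
    """Equation 3 of the Megatron-LM scaling paper: FLOPs for one training iteration."""
    B, s, l, h, V = batch_size, seq_len, num_layers, hidden_size, vocab_size
    return 96 * B * s * l * h**2 * (1 + s / (6 * h) + V / (16 * l * h))


def percent_of_peak(flops: float, iteration_seconds: float, num_gpus: int,
                    peak_teraflops_per_gpu: float) -> float:
    """Achieved teraflops per GPU as a percentage of the assumed hardware peak."""
    achieved_teraflops = flops / (iteration_seconds * num_gpus) / 1e12
    return 100 * achieved_teraflops / peak_teraflops_per_gpu


# Hypothetical configuration, for illustration only (not a row of Table 2).
example_flops = flops_per_iteration(
    batch_size=8, seq_len=4096, num_layers=24, hidden_size=1024, vocab_size=50257
)
print(f"{example_flops:.3e} FLOPs per iteration")
```

To turn this into a utilization percentage, pass the measured time per iteration, the number of GPUs, and the peak teraflops of the GPU to percent_of_peak.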

Note that eight V100 32 GB GPUs were used instead of 16 to shorten the time it takes to run each profiling job. The nvidia-smi command was used to verify the training configuration that achieved 45.45% teraflops utilization, as illustrated below.

Figure 5. Training performance was interactively monitored through the use of nvidia-smi commands


The AI music experiments presented here were performed using the NVIDIA DGX-2. We trained language models using datasets ranging from just a few megabytes in size to 230 GB. We used the HuggingFace GPT-2 implementation and showed that NVIDIA Megatron-LM is also a great alternative for experimentation.

NVIDIA DGX-2 made a significant difference in accelerating dataset preprocessing—mapping MIDI files to a token representation—and training models. This allowed for rapid experimentation. DGX-2 worked like a charm when it came to training the largest MIDI dataset available (MetaMIDI with 400K files).


DLI Courses: Enhance Your Skills with Hands-On Training at GTC

Select from 20 hands-on workshops, offered at GTC, available in multiple languages and time zones. Early bird pricing of just $99 ends Aug 29 (regular $500).


Get Your NVIDIA RTX Games Ready for Epic MegaJam

Create a project in Unreal Engine and submit it by September 1 for your chance to win an NVIDIA GeForce RTX 3080 GPU.


An AI-Enabled Drone Could Soon Become Every Rhino Poacher’s… Horn Enemy

Watching out for the nearly-extinct two-ton beasts may be the ultimate example of a job best done remotely.

The post An AI-Enabled Drone Could Soon Become Every Rhino Poacher’s… Horn Enemy appeared first on NVIDIA Blog.


Featured Sessions for Startups at GTC 2022

Learn how startups use AI to build solutions faster and accelerate their growth with these recommended sessions at GTC.


Five Unique Real-Time Rendering Tips from NVIDIA Experts

We recently kicked off our NVIDIA Developer Program exclusive series of Connect with Experts Ask Me Anything (AMA) sessions featuring NVIDIA experts and Ray Tracing Gems editors Eric Haines, Adam Marrs, Peter Shirley, and Ingo Wald.

During the AMA, the editors offered some valuable guidance and tips on how to successfully integrate real-time rendering. Check out the top five questions and answers from the AMA:

1. Are there some rules of thumb one should follow when adding ray tracing (RT) applications like translucency, reflections, shadows, GI, or diffuse illumination to games? 

Adam: There are many things to take into consideration when adding ray-traced effects to a game’s renderer. The main consideration to keep top of mind is for the ray-traced effects to work hand-in-hand with the goals of your game’s art direction. This will change what performance costs are reasonable for any given effect. 

For example, if shadows are an important game mechanic (think of Splinter Cell), then a higher cost for extra-nice ray-traced shadows makes sense, but spending extra performance on RT translucency probably doesn’t make as much sense. For guidance on how to balance ray tracing and performance, we have a variety of webinars and other content that you can learn from. In fact, there’s an event coming up about RTX in Unreal Engine 5. (Note that you can access this content on demand.) 

2. When sampling direct lighting, both reservoir sampling and resampled importance sampling can be useful techniques. But it seems difficult to recompute PDFs for the sake of MIS when a light has been sampled through a BSDF sample. Could you provide any insights into this problem?

Ingo: Sample importance resampling is only generating samples relative to an existing PDF (that you choose to take these samples). So it should be possible to evaluate that existing PDF to compute PDF values for other samples (in an MIS context).

3. Do ray tracing and deep learning overlap?

Eric: Yes, in many ways. Deep learning can be used to complement ray tracing, “filling in” missing information with plausible interpolated data, such as with NVIDIA Deep Learning Super Sampling (DLSS). This works today.

Neural rendering and neural graphics primitives are hot areas of research currently. One place to start is with Advances in Neural Rendering from SIGGRAPH 2021. Another good resource is a recent overview of NeRF at CVPR 2022, where ray tracing is used to render radiance fields. 

4. What’s the latest scoop on using ML training to help with ray-traced GI? Are there any neat advances in ray tracing that benefit from deep learning? Have you connected lower sampling and filtering using an ML upscaling 2D filter?

Adam: There’s been quite a lot of work in the machine learning space to assist with real-time (and not real-time) graphics. For ray-traced global illumination, check out a paper recently published by Thomas Müller, Real-Time Neural Radiance Caching for Path Tracing. Their approach trains a neural network to learn the light transport characteristics of a scene and then builds a light cache that can be queried at a lower cost than tracing the full paths.

5. What are your top three favorite graphics papers of all time?





Join the discussion on the NVIDIA Developer Forums. And don’t forget to sign up for the NVIDIA Developer Program to be notified about the next AMA this October on Recommender Systems. 

Register for GTC 2022 to learn the latest about RTX real-time ray tracing. For a full list of content for game developers including tools and training, visit NVIDIA Game Development.


Meet the Omnivore: Startup in3D Turns Selfies Into Talking, Dancing Avatars With NVIDIA Omniverse

Imagine taking a selfie and using it to get a moving, talking, customizable 3D avatar of yourself in just seconds. 

The post Meet the Omnivore: Startup in3D Turns Selfies Into Talking, Dancing Avatars With NVIDIA Omniverse appeared first on NVIDIA Blog.


NVIDIA to Share New Details on Grace CPU, Hopper GPU, NVLink Switch, Jetson Orin Module at Hot Chips

In four talks over two days, senior NVIDIA engineers will describe innovations in accelerated computing for modern data centers and systems at the edge of the network. Speaking at a virtual Hot Chips event, an annual gathering of processor and system architects, they’ll disclose performance numbers and other technical details for NVIDIA’s first server CPU, …

The post NVIDIA to Share New Details on Grace CPU, Hopper GPU, NVLink Switch, Jetson Orin Module at Hot Chips appeared first on NVIDIA Blog.


Graphics pioneer Dr. Donald Greenberg shares the new chapter in digital design and how NVI…

Graphics pioneer Dr. Donald Greenberg shares the new chapter in digital design and how NVIDIA Omniverse supports the expansion. #DigitalTwins #SIGGRAPH2022


OptFormer: Towards Universal Hyperparameter Optimization with Transformers

One of the most important aspects in machine learning is hyperparameter optimization, as finding the right hyperparameters for a machine learning task can make or break a model’s performance. Internally, we regularly use Google Vizier as the default platform for hyperparameter optimization. Throughout its deployment over the last 5 years, Google Vizier has been used more than 10 million times, over a vast class of applications, including machine learning applications from vision, reinforcement learning, and language but also scientific applications such as protein discovery and hardware acceleration. As Google Vizier is able to keep track of use patterns in its database, such data, usually consisting of optimization trajectories termed studies, contain very valuable prior information on realistic hyperparameter tuning objectives, and are thus highly attractive for developing better algorithms.

While there have been many previous methods for meta-learning over such data, such methods share one major common drawback: their meta-learning procedures depend heavily on numerical constraints such as the number of hyperparameters and their value ranges, and thus require all tasks to use the exact same total hyperparameter search space (i.e., tuning specifications). Additional textual information in the study, such as its description and parameter names, is also rarely used, yet can hold meaningful information about the type of task being optimized. This drawback is exacerbated for larger datasets, which often contain significant amounts of such meaningful information.

Today in “Towards Learning Universal Hyperparameter Optimizers with Transformers”, we are excited to introduce the OptFormer, one of the first Transformer-based frameworks for hyperparameter tuning, learned from large-scale optimization data using flexible text-based representations. While numerous works have previously demonstrated the Transformer’s strong abilities across various domains, few have touched on its optimization-based capabilities, especially over text space. Our core findings demonstrate for the first time some intriguing algorithmic abilities of Transformers: 1) a single Transformer network is capable of imitating highly complex behaviors from multiple algorithms over long horizons; 2) the network is further capable of predicting objective values very accurately, in many cases surpassing Gaussian Processes, which are commonly used in algorithms such as Bayesian Optimization.

Approach: Representing Studies as Tokens
Rather than only using numerical data as common with previous methods, our novel approach instead utilizes concepts from natural language and represents all of the study data as a sequence of tokens, including textual information from initial metadata. In the animation below, this includes “CIFAR10”, “learning rate”, “optimizer type”, and “Accuracy”, which informs the OptFormer of an image classification task. The OptFormer then generates new hyperparameters to try on the task, predicts the task accuracy, and finally receives the true accuracy, which will be used to generate the next round’s hyperparameters. Using the T5X codebase, the OptFormer is trained in a typical encoder-decoder fashion using standard generative pretraining over a wide range of hyperparameter optimization objectives, including real world data collected by Google Vizier, as well as public hyperparameter (HPO-B) and blackbox optimization benchmarks (BBOB).

The OptFormer can perform hyperparameter optimization encoder-decoder style, using token-based representations. It initially observes text-based metadata (in the gray box) containing information such as the title, search space parameter names, and metrics to optimize, and repeatedly outputs parameter and objective value predictions.
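To make the idea of representing a study as text more concrete, here is a minimal sketch of such a serialization. The format below (field names, separators, and the function name serialize_study) is entirely hypothetical and is not the OptFormer's actual tokenization scheme. It only illustrates how metadata and trials can be flattened into one sequence that a text model could consume.

```python
from typing import Dict, List, Tuple

# A trial is (parameter values, observed objective). Hypothetical structure.
Trial = Tuple[Dict[str, str], float]


def serialize_study(title: str, metric: str, trials: List[Trial]) -> str:
    """Flatten study metadata and trials into a single text sequence."""
    parts = [f"title: {title}", f"metric: {metric}"]
    for index, (params, objective) in enumerate(trials, start=1):
        # Sort parameter names so the serialization is deterministic.
        assignments = ", ".join(f"{name}={params[name]}" for name in sorted(params))
        parts.append(f"trial {index}: {assignments} -> {objective}")
    return " | ".join(parts)


study_text = serialize_study(
    title="CIFAR10",
    metric="Accuracy",
    trials=[({"learning rate": "0.1", "optimizer type": "sgd"}, 0.82)],
)
print(study_text)
```

Because parameter names appear as text rather than as fixed vector positions, studies with different search spaces can share one representation, which is the property the approach described above relies on.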

Imitating Policies
As the OptFormer is trained over optimization trajectories by various algorithms, it may now accurately imitate such algorithms simultaneously. By providing a text-based prompt in the metadata for the designated algorithm (e.g. “Regularized Evolution”), the OptFormer will imitate the algorithm’s behavior.

Over an unseen test function, the OptFormer produces nearly identical optimization curves as the original algorithm. Mean and standard deviation error bars are shown.

Predicting Objective Values
In addition, the OptFormer may now predict the objective value being optimized (e.g. accuracy) and provide uncertainty estimates. We compared the OptFormer’s prediction with a standard Gaussian Process and found that the OptFormer was able to make significantly more accurate predictions. This can be seen below qualitatively, where the OptFormer’s calibration curve closely follows the ideal diagonal line in a goodness-of-fit test, and quantitatively through standard aggregate metrics such as log predictive density.

Left: Rosenblatt Goodness-of-Fit. Closer diagonal fit is better. Right: Log Predictive Density. Higher is better.

Combining Both: Model-based Optimization
We may now use the OptFormer’s function prediction capability to better guide our imitated policy, similar to techniques found in Bayesian Optimization. Using Thompson Sampling, we may rank our imitated policy’s suggestions and only select the best according to the function predictor. This produces an augmented policy capable of outperforming our industry-grade Bayesian Optimization algorithm in Google Vizier when optimizing classic synthetic benchmark objectives and tuning the learning rate hyperparameters of a standard CIFAR-10 training pipeline.

Left: Best-so-far optimization curve over a classic Rosenbrock function. Right: Best-so-far optimization curve over hyperparameters for training a ResNet-50 on CIFAR-10 via init2winit. Both cases use 10 seeds per curve, and error bars at 25th and 75th percentiles.
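The ranking step described above can be sketched as follows. This is a simplified stand-in, not the OptFormer's implementation: it assumes the function predictor returns a mean and standard deviation per candidate and draws one independent Gaussian sample for each, whereas the OptFormer predicts distributions over tokens. The names thompson_select and predict are hypothetical.

```python
import random
from typing import Callable, Sequence, Tuple, TypeVar

Candidate = TypeVar("Candidate")


def thompson_select(
    candidates: Sequence[Candidate],
    predict: Callable[[Candidate], Tuple[float, float]],
    rng: random.Random,
) -> Candidate:
    """Pick the candidate whose sampled objective value is highest.

    `predict` returns an assumed (mean, standard deviation) for a candidate.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    best_candidate = candidates[0]
    best_sample = float("-inf")
    for candidate in candidates:
        mean, std = predict(candidate)
        sample = rng.gauss(mean, std)   # one draw from the predictive distribution
        if sample > best_sample:
            best_candidate, best_sample = candidate, sample
    return best_candidate


# Hypothetical learning rates suggested by an imitated policy.
suggestions = [0.001, 0.01, 0.1]
# Toy predictor: prefers 0.01, with some uncertainty.
toy_predict = lambda lr: (1.0 - abs(lr - 0.01) * 10, 0.05)
choice = thompson_select(suggestions, toy_predict, random.Random(0))
```

Sampling instead of ranking by the mean alone keeps some exploration: a candidate with a lower predicted mean but high uncertainty still has a chance of being selected.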

Throughout this work, we discovered some useful and previously unknown optimization capabilities of the Transformer. In the future, we hope to pave the way for a universal hyperparameter and blackbox optimization interface to use both numerical and textual data to facilitate optimization over complex search spaces, and integrate the OptFormer with the rest of the Transformer ecosystem (e.g. language, vision, code) by leveraging Google’s vast collection of offline AutoML data.

The following members of DeepMind and the Google Research Brain Team conducted this research: Yutian Chen, Xingyou Song, Chansoo Lee, Zi Wang, Qiuyi Zhang, David Dohan, Kazuya Kawakami, Greg Kochanski, Arnaud Doucet, Marc’aurelio Ranzato, Sagi Perel, and Nando de Freitas.

We would like to also thank Chris Dyer, Luke Metz, Kevin Murphy, Yannis Assael, Frank Hutter, and Esteban Real for providing valuable feedback, and further thank Sebastian Pineda Arango, Christof Angermueller, and Zachary Nado for technical discussions on benchmarks. In addition, we thank Daniel Golovin, Daiyi Peng, Yingjie Miao, Jack Parker-Holder, Jie Tan, Lucio Dery, and Aleksandra Faust for multiple useful conversations.

Finally, we thank Tom Small for designing the animation for this post.