Language models such as the NVIDIA Megatron-LM and OpenAI GPT-2 and GPT-3 have been used to enhance human productivity and creativity. Specifically, these models have been used as powerful tools for writing, programming, and painting. The same architecture can be used for music composition.
Large datasets are required to use language models in these domains. Starting with 50 GB of uncompressed text files for language generation is not unusual. This implies the need for a lot of GPU compute to train the models effectively, and for rapid development, prototyping, and iteration.
This post provides an account of a series of experiments performed in the field of AI music using the NVIDIA DGX-2 platform. DGX-2 boosted progress significantly in both data preprocessing and training language models.
Datasets for AI music
Datasets for computational music fall into two major classes. One approach involves training on music represented as pure audio (WAV files or MP3s). The second approach does not work with audio at all; instead, anything that resembles sheet music is mapped to a token representation.
Usually, this requires tokens for which note starts (C, D, E, F, G), how much time passes (a quarter note or an eighth note, for example), and which note ends. In research and application, MIDI files have proven to be fruitful sources of musical material. The MIDI standard was designed to store music information electronically.
These experiments used several sets of MIDI files, including:
Lakh MIDI Dataset and its Clean subset (176K and 15K MIDI files respectively), with a mixed variety of genres and styles
MetaMIDI Dataset with 463K MIDI files, again of varying genres and styles
The MIDI format is a non-human-readable representation of music which, in order to train a Causal Language Model, has to be mapped to a readable token representation. For this representation, we took inspiration from the mmmtrack encoding.
This encoding represents pieces of music as a hierarchy. A piece of music consists of different tracks for different instruments: drums, guitars, bass, and piano, for example. Each track consists of several bars (4, 8, or 16 bars, depending on the use case). And each bar holds a sequence of note-on, time-delta, and note-off events. Although this hierarchy can be considered a tree, it is possible to encode everything as a linear sequence, making it an ideal representation for decoder-only language models.
The example below is a four-part chorale in its piano roll representation. A chorale features four voices: soprano, alto, tenor, and bass. Soprano and alto are female voices, and tenor and bass are male voices. Usually, all four voices sing at the same time but with different, harmonic pitches.
Figure 1 visualizes the voices with pitch color coding. The soprano is green, the alto is orange, the tenor is blue, and the bass is red. You can encode these musical events—which have both a time and a pitch dimension—to a sequence of tokens.
Following the mmmtrack encoding, the bass part would be mapped to the following token representation:
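As an illustration, a sketch of such a listing using only the token vocabulary explained below (this is not the exact listing from the original piece; concrete pitches beyond the first note are omitted rather than guessed):

```
PIECE_START
TRACK_START INST=BASS
BAR_START
NOTE_ON=61 TIME_DELTA=4 NOTE_OFF=61
...
BAR_END
TRACK_END
```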
With a little practice, humans can read and understand this representation. The representation starts with PIECE_START indicating the start of a piece of music. TRACK_START indicates the beginning and TRACK_END the end of a track (or instrument or voice). The INST=BASS token denotes that this track contains the bass voice. Other voices are represented the same way. BAR_START and BAR_END represent the beginning and the end of a bar, respectively. NOTE_ON=61 is the start of a note with pitch 61.
On the piano, this would be the note C#5. TIME_DELTA=4 means that a duration of four sixteenth notes would elapse. That would be a quarter note. After that, the note would end, represented by NOTE_OFF=61. And so on and so forth. At this point, this notation would also allow for polyphony. Several tracks would sound notes at the same time, and each track could have parallel notes. This makes the encoding universal.
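The pitch-number-to-note-name mapping can be sketched in a few lines. Note that octave numbering conventions differ between tools; the sketch below uses the convention under which pitch 61 is C#5, matching the text (many other tools would call the same pitch C#4):

```python
# Convert a MIDI pitch number (0-127) to a note name.
# Octave convention here: pitch // 12, so 61 -> C#5 (as in the text).
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_to_name(pitch: int) -> str:
    octave = pitch // 12          # 61 // 12 == 5
    return f"{NOTE_NAMES[pitch % 12]}{octave}"
```

For example, `pitch_to_name(61)` returns `"C#5"`.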
Each piece of music differs in the number of bars. It is quite possible that encoding an entire song would require a long sequence length, making the training of a respective Transformer computationally expensive. These experiments encode most of the datasets with four bars and a few with eight bars. Experiments with 16 bars are underway. In addition, only music in a 4/4 time meter was used. This covers the better part of western music. Other meters such as 3/4 (waltz) can be the subject of future work.
Over the course of these experiments, many MIDI datasets were mapped to the described token format, using the same preprocessor throughout. Once the preprocessor worked with small datasets, it immediately worked with larger ones.
The processing time depends on the number of MIDI files to be encoded, ranging from a few minutes to many hours. The longest preprocessing took 30 hours on DGX-2 running on all 96 CPUs in parallel. It is estimated that this would take about 10-14 days of processing on a state-of-the-art MacBook Pro.
Encoding a dataset of MIDI files would yield a collection of token files. The size of those token files depends on the number of MIDI files and the number of bars. Consider some of the experiment datasets and their encoded dataset sizes:
JS Fake Chorales Dataset: 14 MB with four bars per sample
The Lakh MIDI Dataset: 72 GB, its Clean subset 19 GB with four bars per sample
The MetaMIDI Dataset: 130 GB with four bars and 230 GB with eight bars per sample
You can imagine that training on the 14 MB of JS Fake Chorales would take just a few hours. Training on the MetaMIDI Dataset with its 130 GB would take many days. Training for these experiments lasted between 10 and 15 days.
Model training
Many models were trained using the HuggingFace GPT-2 implementation. A few models were trained using the NVIDIA Megatron-LM in GPT-2 mode.
Training with HuggingFace boiled down to uploading the dataset to the DGX-2 and then running a training script that contained all functionality, including the model and training parameters. The same script was used, with just a few changes here and there for all our datasets. It was just a matter of scale.
For Megatron-LM, the environment setup is as easy as pulling and running an NGC PyTorch Docker container, then getting to work immediately with a Jupyter notebook in the browser through ssh tunneling into the DGX-2 machine.
Most of the experiments used the same GPT-2 architecture: six decoder-blocks and eight attention heads; the embedding size was 512, and the sequence length was 2048. Although this is definitely not a Large Language Model (LLM), which can have around 100 decoder blocks, subjective evaluation showed that for AI music this architecture works like a charm.
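A back-of-the-envelope parameter count makes the "definitely not an LLM" point concrete. The vocabulary size below is an assumption for illustration (the post does not state it); the other numbers come from the architecture above:

```python
# Rough GPT-2 parameter count for the architecture described above.
# vocab_size is an assumed value; the token vocabulary is small.
d, n_layer, seq_len, vocab_size = 512, 6, 2048, 512

embedding = vocab_size * d + seq_len * d   # token + position embeddings
per_block = 12 * d * d                     # ~4d^2 attention + ~8d^2 MLP (biases ignored)
total = embedding + n_layer * per_block

print(f"{total / 1e6:.1f}M parameters")    # prints 20.2M parameters
```

Roughly 20 million parameters, orders of magnitude below typical LLM scale.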
Using the NVIDIA DGX-2 really made a difference in rapid iteration. Datasets that would train for multiple days on a single GPU trained in just a few hours on DGX-2. Datasets that would train for months on a single GPU finished training after two weeks at most on DGX-2. This speedup was especially valuable for experiments with the largest datasets.
Training times for some of the datasets were as follows:
The Lakh MIDI Clean Dataset took 15 hours for 10 epochs on roughly 15K songs
The Lakh MIDI Dataset took 130 hours for 10 epochs on roughly 175K songs
The MetaMIDI Dataset took 290 hours for 9 epochs on roughly 400K songs
Note that the JS Fake Chorales dataset was trained earlier and not on the DGX-2. Due to its very small size, it was not necessary to use a multi-GPU setup. It could even be trained overnight on a MacBook Pro.
NVIDIA DGX-2
This section provides a closer look at the NVIDIA DGX-2 specifications. As mentioned above, the platform is very effective, both when it comes to accelerated dataset preprocessing, and when it comes to training language models. This section will be a delightfully technical one.
NVIDIA DGX-2 is a powerful system with 16 fully connected Tesla V100 32 GB GPUs using NVSwitch. It is capable of delivering 2.4 TB/sec of bisection bandwidth. DGX-2 has been designed for AI researchers and data scientists who need both performance and scalability.
For transformer models, NVIDIA DGX-2 is able to deliver up to 517,227 tokens per second throughput with mixed precision, making it especially powerful.
A baseline is achieved by training a model of 1.2 billion parameters on a single NVIDIA V100 32 GB GPU, which sustains 39 teraflops. This is 30% of the theoretical peak FLOPs for a single GPU as configured in a DGX-2H server, and is thus a strong baseline.
Scaling the model to 8.3 billion parameters on 512 GPUs with 8-way model parallelism achieved up to 15.1 petaflops sustained over the entire application. This is 76% scaling efficiency compared to the single-GPU case.
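The scaling-efficiency figure follows directly from the numbers above:

```python
# Checking the quoted 76% scaling efficiency.
baseline_tflops = 39.0        # single V100, 1.2B-parameter model
sustained_tflops = 15.1e3     # 15.1 petaflops, expressed in teraflops
gpus = 512

per_gpu = sustained_tflops / gpus        # ~29.5 TFLOPs per GPU
efficiency = per_gpu / baseline_tflops   # per-GPU throughput vs. baseline
print(f"{efficiency:.0%}")               # prints 76%
```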
By fixing the sequence length (seq_len) to 4,096, modifying training configurations, and launching training runs with only a few iterations, it is possible to calculate the percentage of peak teraflops achieved in real application runs.
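Such calculations can be sketched using the transformer FLOPs-per-iteration estimate published in the Megatron-LM paper, F = 96·B·s·l·h²·(1 + s/(6h) + V/(16·l·h)). The concrete configuration values below are assumptions for illustration, not the exact configuration used in these runs:

```python
# Estimate achieved TFLOPs per GPU from a measured iteration time,
# using the Megatron-LM paper's FLOPs-per-iteration formula.
def transformer_tflops(batch, seq, layers, hidden, vocab,
                       iter_time_s, n_gpus):
    flops = 96 * batch * seq * layers * hidden**2 * (
        1 + seq / (6 * hidden) + vocab / (16 * layers * hidden)
    )
    return flops / iter_time_s / n_gpus / 1e12  # TFLOPs per GPU

# Assumed illustrative configuration and a measured iteration time.
achieved = transformer_tflops(batch=32, seq=4096, layers=24,
                              hidden=2048, vocab=51200,
                              iter_time_s=4.0, n_gpus=8)
peak_v100 = 125.0  # V100 mixed-precision peak TFLOPs
print(f"{achieved / peak_v100:.2%} of peak")
```

Dividing the achieved per-GPU TFLOPs by the GPU's peak yields the percentage figures reported below.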
After a native run, both the nvidia-smi output and the Nsight profile were analyzed. Different configurations were tested to obtain the highest possible teraflops utilization, as the table below illustrates:
The training configuration presented in the last row of the table delivered the highest utilization, at 45.45% of peak teraflops.
Note that eight V100 32 GB GPUs were used instead of 16 to shorten the time it takes to run each profiling job. The nvidia-smi command was used to verify the training configuration that achieved 45.45% teraflops utilization, as illustrated below.
Summary
The AI music experiments presented here were performed using the NVIDIA DGX-2. We trained language models using datasets ranging from just a few megabytes in size to 230 GB. We used the HuggingFace GPT-2 implementation and showed that NVIDIA Megatron-LM is also a great alternative for experimentation.
NVIDIA DGX-2 made a significant difference in accelerating dataset preprocessing—mapping MIDI files to a token representation—and training models. This allowed for rapid experimentation. DGX-2 worked like a charm when it came to training the largest MIDI dataset available (MetaMIDI with 400K files).
Select from 20 hands-on workshops, offered at GTC, available in multiple languages and time zones. Early bird pricing of just $99 ends Aug 29 (regular $500).
We recently kicked off our NVIDIA Developer Program exclusive series of Connect with Experts Ask Me Anything (AMA) sessions featuring NVIDIA experts and Ray…
During the AMA, the experts offered some valuable guidance and tips on how to successfully integrate real-time rendering. Check out the top five questions and answers from the AMA:
1. Are there some rules of thumb one should follow when adding ray tracing (RT) applications like translucency, reflections, shadows, GI, or diffuse illumination to games?
Adam: There are many things to take into consideration when adding ray-traced effects to a game’s renderer. The main consideration to keep top of mind is for the ray-traced effects to work hand-in-hand with the goals of your game’s art direction. This will change what performance costs are reasonable for any given effect.
For example, if shadows are an important game mechanic (think of Splinter Cell), then a higher cost for extra-nice ray-traced shadows makes sense, but spending extra performance on RT translucency probably doesn’t make as much sense. For guidance on how to balance ray tracing and performance, we have a variety of webinars and other content that you can learn from. In fact, there’s an event coming up about RTX in Unreal Engine 5. (Note that you can access this content on demand.)
2. When sampling direct lighting, both reservoir sampling and resampled importance sampling can be useful techniques. But it seems difficult to recompute PDFs for the sake of MIS when a light has been sampled through a BSDF sample. Could you provide any insights into this problem?
Ingo: Sample importance resampling only generates samples relative to an existing PDF (from which you choose to take these samples). So it should be possible to evaluate that existing PDF to compute PDF values for other samples (in an MIS context).
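The point that the source PDF stays evaluable can be seen in a minimal resampled importance sampling sketch. The source and target densities here are toy assumptions, not renderer code:

```python
# Minimal RIS sketch: draw M candidates from an evaluable source PDF,
# weight by target/source, resample one proportionally to its weight.
import random

def ris_pick(m: int, rng: random.Random):
    # Source PDF: uniform on [0, 1), density 1 everywhere; because it
    # is trivially evaluable, MIS-style PDF evaluations remain possible.
    candidates = [rng.random() for _ in range(m)]
    target = lambda x: x * x  # toy unnormalized target density
    weights = [target(x) / 1.0 for x in candidates]
    total = sum(weights)
    # Pick one candidate with probability proportional to its weight.
    r = rng.random() * total
    acc = 0.0
    for x, w in zip(candidates, weights):
        acc += w
        if acc >= r:
            # Unbiased contribution weight for the chosen sample.
            return x, (total / m) / target(x)
    return candidates[-1], (total / m) / target(candidates[-1])

sample, weight = ris_pick(16, random.Random(0))
```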
3. Do ray tracing and deep learning overlap?
Eric: Yes, in many ways. Deep learning can be used to complement ray tracing, “filling in” missing information with plausible interpolated data, such as with NVIDIA Deep Learning Super Sampling (DLSS). This works today.
Neural rendering and neural graphics primitives are hot areas of research currently. One place to start is with Advances in Neural Rendering from SIGGRAPH 2021. Another good resource is a recent overview of NeRF at CVPR 2022, where ray tracing is used to render radiance fields.
4. What’s the latest scoop on using ML training to help with ray-traced GI? Are there any neat advances in ray tracing that benefit from deep learning? Have you connected lower sampling and filtering using an ML upscaling 2D filter?
Adam: There’s been quite a lot of work in the machine learning space to assist with real-time (and not real-time) graphics. For ray-traced global illumination, check out a paper recently published by Thomas Müller, Real-Time Neural Radiance Caching for Path Tracing. Their approach trains a neural network to learn the light transport characteristics of a scene and then builds a light cache that can be queried at a lower cost than tracing the full paths.
5. What are your top three favorite graphics papers of all time?
Register for GTC 2022 to learn the latest about RTX real-time ray tracing. For a full list of content for game developers including tools and training, visit NVIDIA Game Development.