Categories
Misc

AI in the Sky: NVIDIA GPUs Help Researchers Remove Clouds from Satellite Images

Satellite images can be a fantastic civil engineering tool — at least when clouds don’t get in the way.  Now researchers at Osaka University have shown how to use GPU-accelerated deep learning to remove these clouds.  The scientists from the university’s Division of Sustainable Energy and Environmental Engineering used a “generative adversarial network” or GAN.  Read article >


Categories
Offsites

High-Quality, Robust and Responsible Direct Speech-to-Speech Translation

Speech-to-speech translation (S2ST) is key to breaking down language barriers between people all over the world. Automatic S2ST systems are typically composed of a cascade of speech recognition, machine translation, and speech synthesis subsystems. However, such cascade systems may suffer from longer latency, loss of information (especially paralinguistic and non-linguistic information), and compounding errors between subsystems.

In 2019, we introduced Translatotron, the first model able to translate speech directly between two languages. This direct S2ST model could be trained end-to-end efficiently and also had the unique capability of retaining the source speaker’s voice (which is non-linguistic information) in the translated speech. However, despite its ability to produce natural-sounding translated speech in high fidelity, it still underperformed a strong baseline cascade S2ST system (e.g., one composed of a direct speech-to-text translation model [1, 2] followed by a Tacotron 2 TTS model).

In “Translatotron 2: Robust direct speech-to-speech translation”, we describe an improved version of Translatotron that significantly improves performance while also applying a new method for transferring the source speakers’ voices to the translated speech. The revised approach to voice transference succeeds even when the input speech contains multiple speakers speaking in turns, while also reducing the potential for misuse and aligning better with our AI Principles. Experiments on three different corpora consistently showed that Translatotron 2 outperforms the original Translatotron by a large margin on translation quality, speech naturalness, and speech robustness.

Translatotron 2
Translatotron 2 is composed of four major components: a speech encoder, a target phoneme decoder, a target speech synthesizer, and an attention module that connects them together. The combination of the encoder, the attention module, and the decoder is similar to a typical direct speech-to-text translation (ST) model. The synthesizer is conditioned on the output from both the decoder and the attention.

Model architecture of Translatotron 2 (for translating Spanish speech into English speech).

There are three novel changes between Translatotron and Translatotron 2 that are key factors in improving the performance:

  1. While the output from the target phoneme decoder is used only as an auxiliary loss in the original Translatotron, it is one of the inputs to the spectrogram synthesizer in Translatotron 2. This strong conditioning makes Translatotron 2 easier to train and yields better performance.
  2. The spectrogram synthesizer in the original Translatotron is attention-based, similar to the Tacotron 2 TTS model, and as a consequence, it also suffers from the robustness issues exhibited by Tacotron 2. In contrast, the spectrogram synthesizer employed in Translatotron 2 is duration-based, similar to that used by Non-Attentive Tacotron, which drastically improves the robustness of the synthesized speech.
  3. Both Translatotron and Translatotron 2 use an attention-based connection to the encoded source speech. However, in Translatotron 2, this attention is driven by the phoneme decoder instead of the spectrogram synthesizer. This ensures the acoustic information that the spectrogram synthesizer sees is aligned with the translated content that it’s synthesizing, which helps retain each speaker’s voice across speaker turns.
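
To make the dataflow concrete, here is a toy, shape-only numpy sketch of how the components connect (an illustration with made-up dimensions and random values, not the actual model):

import numpy as np

rng = np.random.default_rng(0)

# Made-up dimensions for illustration only.
T_src, T_phon, d = 100, 20, 8      # source frames, target phonemes, hidden size

encoded = rng.normal(size=(T_src, d))           # speech encoder output

# The phoneme decoder drives attention over the encoded source speech (change 3).
queries = rng.normal(size=(T_phon, d))          # decoder hidden states
scores = queries @ encoded.T                    # (T_phon, T_src) attention logits
scores -= scores.max(axis=1, keepdims=True)     # numerically stable softmax
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
context = weights @ encoded                     # (T_phon, d) acoustic context

# The duration-based synthesizer is conditioned on BOTH the decoder output
# and the attention output (change 1), one input per target phoneme.
synth_input = np.concatenate([queries, context], axis=1)
print(synth_input.shape)                        # -> (20, 16)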

More Powerful and Responsible Voice Retention
The original Translatotron was able to retain the source speaker’s voice in the translated speech, by conditioning its decoder on a speaker embedding generated from a separately trained speaker encoder. However, this approach also enabled it to generate the translated speech in a different speaker’s voice if a clip of the target speaker’s recording were used as the reference audio to the speaker encoder, or if the embedding of the target speaker were directly available. While this capability was powerful, it had the potential to be misused to spoof audio with arbitrary content, which posed a concern for production deployment.

To address this, we designed Translatotron 2 to use only a single speech encoder, which is responsible for both linguistic understanding and voice capture. In this way, the trained models cannot be directed to reproduce non-source voices. This approach can also be applied to the original Translatotron.

To retain speakers’ voices across translation, researchers generally prefer to train S2ST models on parallel utterances with the same speaker’s voice on both sides. Such a dataset with human recordings on both sides is extremely difficult to collect, because it requires a large number of fluent bilingual speakers. To avoid this difficulty, we use a modified version of PnG NAT, a TTS model that is capable of cross-lingual voice transfer, to synthesize such training targets. Our modified PnG NAT model incorporates a separately trained speaker encoder in the same way as in our previous TTS work — the same strategy used for the original Translatotron — so that it is capable of zero-shot voice transfer.

Following are examples of direct speech-to-speech translation from Translatotron 2 in which the source speaker’s voice is retained:

Input (Spanish): 
TTS-synthesized reference (English): 
Translatotron 2 prediction (English): 
Translatotron prediction (English): 

To enable S2ST models to retain each speaker’s voice in the translated speech when the input speech contains multiple speakers speaking in turns, we propose a simple concatenation-based data augmentation technique, called ConcatAug. This method augments the training data on the fly by randomly sampling pairs of training examples and concatenating the source speech, the target speech, and the target phoneme sequences into new training examples. The resulting samples contain two speakers’ voices in both the source and the target speech, which enables the model to learn on examples with speaker turns. Following are audio samples from Translatotron 2 with speaker turns:

Input (Spanish): 
TTS-synthesized reference (English): 
Translatotron 2 (with ConcatAug) prediction (English): 
Translatotron 2 (without ConcatAug) prediction (English): 

More audio samples are available here.
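
For intuition, the following is a minimal numpy sketch of the ConcatAug idea described above (a toy illustration of the augmentation, not the actual training code; 1-D arrays stand in for waveforms and phoneme sequences):

import numpy as np

rng = np.random.default_rng(0)

def concat_aug(batch):
    """Randomly pair two training examples and concatenate their source
    speech, target speech, and target phoneme sequences."""
    i, j = rng.choice(len(batch), size=2, replace=False)
    a, b = batch[i], batch[j]
    return {
        "src_speech": np.concatenate([a["src_speech"], b["src_speech"]]),
        "tgt_speech": np.concatenate([a["tgt_speech"], b["tgt_speech"]]),
        "tgt_phonemes": a["tgt_phonemes"] + b["tgt_phonemes"],
    }

# Toy batch: each example is one speaker's utterance.
batch = [
    {"src_speech": rng.normal(size=50), "tgt_speech": rng.normal(size=60),
     "tgt_phonemes": ["h", "o", "l", "a"]},
    {"src_speech": rng.normal(size=40), "tgt_speech": rng.normal(size=45),
     "tgt_phonemes": ["k", "e"]},
]
print(concat_aug(batch)["tgt_phonemes"])  # contains both speakers' turns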

Performance
Translatotron 2 outperforms the original Translatotron by large margins in every aspect we measured: translation quality (measured by BLEU, where higher is better), speech naturalness (measured by MOS, where higher is better), and speech robustness (measured by UDR, where lower is better). It particularly excelled on the more difficult Fisher corpus. The performance of Translatotron 2 on translation quality and speech quality approaches that of a strong baseline cascade system, and it is better than the cascade baseline on speech robustness.

Translation quality (measured by BLEU, where higher is better) evaluated on two Spanish-English corpora.
Speech naturalness (measured by MOS, where higher is better) evaluated on two Spanish-English corpora.
Speech robustness (measured by UDR, where lower is better) evaluated on two Spanish-English corpora.

Multilingual Speech-to-Speech Translation
Besides Spanish-to-English S2ST, we also evaluated the performance of Translatotron 2 in a multilingual setup in which the model took speech input in four different languages and translated it into English. The language of the input speech was not provided, which forced the model to detect the language by itself.

Source Language         fr     de     es     ca
Translatotron 2         27.0   18.8   27.7   22.5
Translatotron           18.9   10.8   18.8   13.9
ST (Wang et al. 2020)   27.0   18.9   28.0   23.9
Training Target         82.1   86.0   85.1   89.3
Performance of multilingual X=>En S2ST on the CoVoST 2 corpus.

On this task, Translatotron 2 again outperformed the original Translatotron by a large margin. Although the results are not directly comparable between S2ST and ST, the close numbers suggest that the translation quality of Translatotron 2 is comparable to that of a baseline speech-to-text translation model. These results indicate that Translatotron 2 is also highly effective on multilingual S2ST.

Acknowledgments
The direct contributors to this work include Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, and Roi Pomerantz. We also thank Chung-Cheng Chiu, Quan Wang, Heiga Zen, Ron J. Weiss, Wolfgang Macherey, Yu Zhang, Yonghui Wu, Hadar Shemtov, Ruoming Pang, Nadav Bar, Hen Fitoussi, Benny Schlesinger, and Michael Hassid for helpful discussions and support.

Categories
Misc

Competition and Community Insights from NVIDIA’s Kaggle Grandmasters


In this post, we summarize questions and answers from GTC sessions with NVIDIA’s Kaggle Grandmaster team. We also answer audience questions we did not have a chance to address during these sessions.

Q: How do you decide which competitions to join?

Ahmet: I read the competition description and evaluation metric. Then I give myself several days to think about whether I have any novel ideas to try. If I do not have any interesting ideas, I do not join. But sometimes I join just for learning and improving my skills.

Q: Is mathematics mandatory for winning a competition?

Kazuki: Not mandatory, but you may want to understand the competition metric and how machine learning models work. For example, a linear model and a tree model are totally different, so they can generate good results when ensembled.

Q: How do you approach a competition?

Bojan: On the first day, I always submit a sample so that I am on the leaderboard. Traditionally, I have not been very big on data analysis or EDA, which is one of my weaknesses. But recently, I started doing more and changing my approach.

One thing I always do is see how easy it is to ensemble different models in a competition. This dictates my strategy in the long run. If ensembling slightly different models can give a nice boost, it means that building many diverse models is important. However, if ensembling does not give you a big boost, then feature engineering or coming up with creative features is more important in the long run.

One of the strategies is to try to improve a single model as much as you can, and only ensemble once you are satisfied with it.
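
As a concrete, hypothetical version of that quick ensembling check: blend two models’ out-of-fold predictions and see whether the metric beats the best single model.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 2000)

# Stand-ins for two models' out-of-fold predictions with decorrelated errors.
pred_a = y + rng.normal(0, 0.6, y.shape)
pred_b = y + rng.normal(0, 0.6, y.shape)

best_single = max(roc_auc_score(y, pred_a), roc_auc_score(y, pred_b))
blended = roc_auc_score(y, (pred_a + pred_b) / 2)
print(best_single, blended)   # a clear boost suggests model diversity pays off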

Jean-Francois: It is a good idea to read what people share in the forum in every competition. This means to read what the host writes, including comments. And to read top solutions in similar recent competitions. Surprisingly, some competitions are won by models that are publicly shared from previous competitions and adapted to the new one. People do not read enough. You can also try to find papers on the topic, especially for science competitions where there are often relevant papers.

Giba: Download the data and run some EDA. Get insights about the feature and target distributions in order to find the best validation strategy. Random KFold is usually good for most problems, but sometimes it is necessary to use GroupKFold or a time-based split. Once you find the best validation strategy, run a simple model using it and submit to check the leaderboard score. This is usually the most important thing in a competition: if validation is robust and done correctly, all metric improvements made locally should translate to the Kaggle leaderboard. After that, work on feature engineering and build a diverse set of models with different datasets and training algorithms. Usually, an ensemble of a neural network and a GBDT is good enough to rank high on the leaderboard. Searching for target leakage is, unfortunately, also part of Kaggle competitions.
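
For reference, here is a minimal scikit-learn sketch of the validation setups mentioned above (generic example data, not from any particular competition):

import numpy as np
from sklearn.model_selection import KFold, GroupKFold, TimeSeriesSplit

X = np.arange(100, dtype=float).reshape(-1, 1)
groups = np.repeat(np.arange(20), 5)        # e.g., 5 rows per user/patient

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # random KFold
gkf = GroupKFold(n_splits=5)                          # whole group stays in one fold
tss = TimeSeriesSplit(n_splits=5)                     # time-based split

# GroupKFold guarantees no group appears in both train and validation.
for trn, val in gkf.split(X, groups=groups):
    assert not set(groups[trn]) & set(groups[val])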

Q: Which deep learning framework would you recommend starting with?

Jean-Francois: I think the best one is Keras because it is very abstract. You can build rather complex models and train them in a few lines of code. Then you may want to move to PyTorch or TensorFlow for two reasons: to have better control of your models and customize your layers, as well as the ability to reuse pretrained models. For that, I have the impression that PyTorch is taking the lead. What we do on Kaggle is mostly model prototyping. Maybe today TensorFlow is better at model deployment, but that is not relevant on Kaggle.

Jiwei: I would add that PyTorch Lightning is a user-friendly package based on PyTorch, especially for new users. It abstracts the details of training and provides convenient APIs for advanced features such as Multi-GPU, TPU, and mixed precision.
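
A minimal sketch of what that looks like (a generic toy example, not code from the session; the Trainer flags for multi-GPU and mixed precision are noted in comments):

import pytorch_lightning as pl
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
        self.loss_fn = nn.CrossEntropyLoss()

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss_fn(self.net(x), y)
        self.log("train_loss", loss)   # logging is built in
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Toy data; Lightning handles device placement and the training loop.
x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(x, y), batch_size=32)

# Add e.g. devices=2 for multi-GPU or precision=16 for mixed precision.
trainer = pl.Trainer(max_epochs=1, accelerator="auto", devices=1)
trainer.fit(LitClassifier(), loader)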

An audience poll showed that 66% preferred PyTorch and 31% preferred TensorFlow.

Q: How do you prevent overfitting when using pseudo-labeling? Is it okay to use that strategy with an ensemble?

Bo: In the recent RANZCR competition, our team won using both pseudo-labeling and ensemble. It’s ok to use both, but you should be very careful doing so in order to prevent overfitting.

  • First, you want to split the original data into five folds and split external data into five folds. In both stages, there will be five models.
  • In Stage 1, train the model on the original data, and do inference on the external data to have external data prediction. Do this five times.
  • In Stage 2, combine the original data (with original labels) and external data (with Stage 1 predictions as pseudo labels) and train the models again.

The important thing is, when we make pseudo labels, we want to make five copies of the pseudo labels. For Stage 2’s fold0 model (trained on combined fold1,2,3,4 and validated on combined fold0), we want to make sure it never had fold0’s information, so the pseudo labels used for this model need to come from Stage 1’s fold0 model (train on original fold1,2,3,4). This way you will never have any leakage.

It is ok to use ensemble together with pseudo-labeling. In the RANZCR competition, we used ensembles in both stages.
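
To make the fold bookkeeping concrete, here is a minimal sketch of the leak-free loop Bo describes, with a simple scikit-learn model standing in for the neural networks (the external-data fold split is omitted for brevity):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def train_model(X, y):
    # Stand-in for any model; the RANZCR solution used neural networks.
    return LogisticRegression(max_iter=1000).fit(X, y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 10)), rng.integers(0, 2, 500)   # original data
X_ext = rng.normal(size=(300, 10))                           # external, unlabeled

for k, (trn, val) in enumerate(KFold(n_splits=5, shuffle=True,
                                     random_state=0).split(X)):
    # Stage 1: train on original folds != k, pseudo-label the external data.
    stage1 = train_model(X[trn], y[trn])
    pseudo = stage1.predict(X_ext)       # fold-k pseudo labels: no fold-k leakage

    # Stage 2: retrain on original folds != k plus pseudo-labeled external
    # data; validate only on the held-out original fold k.
    stage2 = train_model(np.vstack([X[trn], X_ext]),
                         np.concatenate([y[trn], pseudo]))
    print(f"fold {k} val accuracy: {stage2.score(X[val], y[val]):.3f}")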

Chris: Pseudo-labeling is one of the things that I specifically learned at Kaggle because none of the books I had read talked about it. Kaggle is a great place to learn practical tricks like pseudo-labeling.

Q: What are commonly used post-processing techniques?  How can I improve my score on multi-label classification problems?

Chris: I’ll take a first stab at this. Recently a Kaggler called me the Post-processing Grandmaster because I just earned my fifth gold medal specifically using post-processing. It was a solo gold medal. [The criteria for competition grandmaster are five gold medals, including at least one solo gold medal.] I will share a few secrets.

The first thing is to study the competition metric. Some metrics are called the ranking metrics (like AUC). For these metrics, the absolute predicted values do not matter. Only the relative orders matter. For a multi-label classification problem, the first thing to ask is whether the predictions are ranked per label, or altogether. In the recent Rainforest competition where we predict animal sounds in rainforests, all the predictions are ranked across labels. So it is important that the model knows which animals are common and which are rare.

Other types of metrics are based on mean values, like Mean Squared Error. If the test data have a different mean than the training data, shifting the predictions can improve the metric.

For metrics like recall and precision, you should know their meanings. Always know your metrics. Each metric requires you to do different things and apply different post-processing. Personally, I really enjoy doing this. I come from a mathematical background. Metrics are mathematical equations and I like to think about what is important to optimize.
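
As a toy illustration of these points (a sketch, not from the discussion): a ranking metric like AUC is unchanged by any monotonic transform of the predictions, while a mean-based metric like MSE improves from a constant shift when the test mean differs.

import numpy as np
from sklearn.metrics import roc_auc_score, mean_squared_error

rng = np.random.default_rng(0)

# Ranking metric: any monotonic transform leaves AUC unchanged.
y_cls = rng.integers(0, 2, 1000)
scores = rng.random(1000) * 0.5 + y_cls * 0.3
print(roc_auc_score(y_cls, scores) == roc_auc_score(y_cls, scores ** 3))  # True

# Mean-based metric: a constant shift toward the target mean reduces MSE.
# (In practice the test mean is estimated, e.g., via the public leaderboard.)
y_reg = rng.normal(1.2, 1.0, 1000)                   # targets with mean ~1.2
pred = y_reg + rng.normal(0, 0.5, 1000) - 0.3        # model biased low by ~0.3
shift = y_reg.mean() - pred.mean()
print(mean_squared_error(y_reg, pred), mean_squared_error(y_reg, pred + shift))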

Bo: I’d like to add one thing. If the metric is log loss, sometimes it helps to clip extreme values. Models can make confident predictions with values close to 0 or 1, but if there are label errors, the penalty by log loss can be huge. So, it may be a good idea to clip the predictions at 0.01/0.99 or 0.02/0.98. But always find the optimal clip thresholds in local validation.
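
A quick hypothetical illustration of the clipping trick (made-up numbers; in practice, tune the thresholds in local validation as Bo says):

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 1, 1, 0, 1])
preds = np.array([0.01, 0.99, 0.95, 0.999, 0.90])     # 4th is a confident miss

print(log_loss(y_true, preds))                        # large penalty from the miss
print(log_loss(y_true, np.clip(preds, 0.02, 0.98)))   # clipping caps the damage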

Q: How can explainability be used when working with deep learning ensembles?

Christof: I would say that strongly depends on the ensemble method. I often use a simple average of single models, so if the single models are explainable, the ensemble, as a simple combination of them, probably is too. But on the other hand, I agree that ensembles introduce another aspect to explain, such as why specific models contribute more to the ensemble than others despite mediocre individual performance.

Q: How do you do the hyperparameter optimization, feature engineering and feature selection cycle in practice?

Chris: Personally, I do not spend too much time optimizing hyperparameters. I will explore the important parameters when building XGB or NN models (for example, max_depth, subsample, and colsample_bytree with XGB, and the loss, learning rate, and scheduler with NN). But when trying to improve models, I will spend more time exploring feature engineering with XGB, and data augmentation, architecture design, and/or TTA with NN.

Q: How can you get the best performance out of a Neural Network?

Jean-Francois: Work on data pre-processing (including augmentations) and post-processing. Newcomers often focus too much on hyperparameter tuning or choice of optimizer.  I almost always stick to Adam with a cosine learning schedule.

Jiwei: Multi-head multi-loss function is another common trick to improve the performance of NN. It works as a way of regularization.
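
A minimal Keras sketch of the multi-head, multi-loss pattern (a generic assumption of the setup, not code from the session): an auxiliary head shares the backbone, and its loss regularizes the shared representation.

import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(128,))
backbone = layers.Dense(64, activation="relu")(inputs)

main_head = layers.Dense(1, activation="sigmoid", name="main")(backbone)
aux_head = layers.Dense(10, activation="softmax", name="aux")(backbone)

model = tf.keras.Model(inputs, [main_head, aux_head])
model.compile(
    optimizer="adam",
    loss={"main": "binary_crossentropy", "aux": "sparse_categorical_crossentropy"},
    loss_weights={"main": 1.0, "aux": 0.3},  # auxiliary loss acts as a regularizer
)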

Bo: I want to point to this great post by Andrej Karpathy where he shared many NN tricks: http://karpathy.github.io/2019/04/25/recipe/

Q: What is the best way to learn from Kaggle as a beginner?

Bojan: Check out the notebooks that people post, read topics that are being discussed, and try running models that are shared and improve them using the ideas that are discussed. These are the few steps that can get you pretty far in your machine learning skills, if not your Kaggle performance.

Bo: A good way for beginners to get started is to team up. Of course, this depends on personality. Some people prefer working alone, like bestfitting. But for many people I think teaming up is a good way to learn because different people have different skill sets. They can often complement each other. You can always pick up a thing or two from each teammate. Of course you need to do some work before requesting a team merge. Do not ask people on top of the leaderboard to team up with you without doing much. Try to ask people who are close to your leaderboard position.

Chris: I concur. I did many solo competitions, but recently I have been doing more teaming up. In every single team-up, I learned stuff. Even if it is as simple as watching how people organize their code, or what computer language they are using. It can just be learning how they approach the problem, or how they set up their experiments. There is just so much to learn when working with someone else that can help you become a better data scientist.

Jean-Francois: Do not be shy. Just jump into the water. You will learn how to swim. There is one last thing I recommend. When you join Kaggle, you are asked to create a user name. You can use an alias if you are afraid your friends or colleagues will see you struggle when you begin. That is what I did. I only disclosed my real name once I became comfortable. So just sign up, choose a pseudonym, learn, and try. After a competition, do not just move on to the next one. Read what people share. Try to think about what you could have done better and how you could have come up with the cool idea you just read about. The few days after a competition ends are when you will learn the most.

Q: Do you recommend building your own machine or buying a pre-built system for deep learning?

Jean-Francois: Building is often cheaper, but it requires more skills and time. If you have both, then build your own gear. There are shops that will build custom PCs to your configuration for you. I personally did not have the time and skills, so I bought a custom-made PC with a GTX 1080 Ti and was very happy with it. Nowadays, you can find PCs, including laptops, with good GPUs from major PC makers.

Jiwei: Another option is an external GPU box. I used to train deep learning models on a laptop connected to an external GPU box with a desktop-class GPU card.

Q: What do you enjoy the most about Kaggle?

Chris: The community. Kaggle is a unique place to meet great data scientists you could not meet anywhere else.

Jean-Francois: Kaggle is definitely the place to go if you want to know the state-of-the-art in modeling actual problems using machine learning.  And it is addictive.

Jiwei: To learn new algorithms, new modeling techniques in practice. I find myself more motivated and focused when I can apply new models from papers to solve real-world problems.

Bo: Reading the top solutions posted by Kagglers after each competition. Every time, I learn some new tricks.

Bojan: It is an amazing platform for learning. I do not think there is any other platform where you can learn as much and as quickly as on Kaggle.

Giba: The ability to work on the most diverse problems, and at the same time to learn and apply the state-of-the-art algorithms to solve them.

Kazuki: I enjoy gaining knowledge about the things I’m interested in.

Christof: To solve very complex problems and come up with innovative solutions.

Ahmet: I enjoy Kaggle’s problem diversity, and I enjoy climbing the leaderboard.

Categories
Misc

Upcoming Webinar: Introduction to Building Conversational AI Applications

Join the Emerging Chapters Educational Series webinar on conversational AI concepts and building efficient pipelines.

NVIDIA is hosting a webinar with live Q&A at 10 am PDT on Oct. 14 as a part of the Emerging Chapters Educational Series. 

This technical session is open to anyone looking to learn beginner and intermediate level concepts and applications that show how to build efficient conversational AI pipelines.  

Highlights Include:

  1. Building conversational AI apps and services using NVIDIA TAO and open-source toolkits such as NeMo 
  2. Deploying apps using NVIDIA Riva 
  3. Machine translation demos 

Register now >>

Categories
Misc

Doing the Math: Michigan Team Cracks the Code for Subatomic Insights

In record time, Vikram Gavini’s lab crossed a big milestone in viewing tiny things. The three-person team at the University of Michigan crafted a program that uses complex math to peer deep into the world of the atom. It could advance many fields of science, as well as the design for everything from lighter cars Read article >


Categories
Misc

Teamwork Makes the Dream Work: GFN Thursday Celebrates Team17 Titles Streaming From the Cloud

GFN Thursday is all about bringing games powered by GeForce NOW’s GPUs in the cloud to gamers. Today, that spotlight shines on Team17, the prolific publisher behind many games in the GeForce NOW library. The party gets started with their newest release, a day-and-date launch of Sheltered 2, streaming on the cloud alongside the 12 Read article >


Categories
Misc

Are there any unofficial Slacks or chat rooms where people post questions about using TensorFlow, especially TensorFlow Federated?

So, like the title says. Just wondering if there is anything like this. I know that StackOverflow has stuff like that, but I would rather ask on Slack or something similar.

submitted by /u/Throooaway10

Categories
Misc

Removing TensorFlow filters

With Python 3.8 and TensorFlow 2.5, my objective is to remove the filters/kernels having the lowest L2 norms. Sample code for this is:

import tensorflow as tf
from tensorflow.keras.layers import Conv2D

# Generate a random sample of 1 image/data point-
x = tf.random.normal(shape = (1, 5, 5, 3), mean = 1.0, stddev = 0.5)
x.shape
# TensorShape([1, 5, 5, 3])

# Create conv layer-
conv = Conv2D(
    filters = 3, kernel_size = (3, 3),
    activation = 'relu',
    kernel_initializer = tf.initializers.GlorotNormal(),
    bias_initializer = tf.ones_initializer,
    strides = (1, 1), padding = 'same',
)

# Pass input through conv layer-
out = conv(x)
out.shape
# TensorShape([1, 5, 5, 3])

out = tf.squeeze(out)
out.shape
# TensorShape([5, 5, 3])

According to my understanding, the output consists of three (5, 5) matrices stacked together. However, printing ‘out’ shows five (5, 3) matrices stacked together:

out.numpy()
'''
array([[[1.45877   , 0.        , 1.9293344 ],
        [0.9910869 , 0.01100129, 1.7364411 ],
        [1.8199034 , 0.        , 1.3457474 ],
        [1.219409  , 0.22021294, 0.62214017],
        [0.5572515 , 0.7246016 , 0.6772853 ]],

       [[1.161148  , 0.        , 2.0277915 ],
        [0.38071448, 0.        , 2.2438798 ],
        [2.2897398 , 0.1658966 , 2.3147004 ],
        [1.2516301 , 0.14660472, 1.6381929 ],
        [1.1554463 , 0.72516847, 1.6170584 ]],

       [[0.        , 0.        , 1.2525308 ],
        [0.4337383 , 0.        , 0.91200435],
        [0.71451795, 0.        , 2.093022  ],
        [2.265062  , 0.        , 2.7562256 ],
        [0.82517993, 0.        , 1.8439718 ]],

       [[0.7089497 , 0.        , 1.041831  ],
        [0.        , 0.        , 1.2754116 ],
        [0.41919613, 0.        , 0.88135654],
        [0.        , 0.        , 0.71492153],
        [0.18725157, 0.27108306, 0.11248505]],

       [[0.86042166, 0.45840383, 1.084069  ],
        [0.53202367, 0.42414713, 1.2529668 ],
        [1.2257886 , 0.31592917, 1.3377004 ],
        [0.36588144, 0.        , 0.6085663 ],
        [0.3065148 , 0.574654  , 1.0214479 ]]], dtype=float32)
'''

So, if I use the code out[:, :, 0], out[:, :, 1] & out[:, :, 2], do they refer to the first, second and third filters?

And if yes, is computing L2-norm using:

tf.norm(out, ord = 'euclidean', axis = (0, 1)).numpy()
# array([5.275869 , 1.4290226, 7.545658 ], dtype=float32)

the correct way?

submitted by /u/grid_world

Categories
Misc

Proving Superior Cloud, AI, and Storage Performance with NVIDIA Spectrum-3 Switches

Independent IT lab The Tolly Group compared the cloud, AI, and storage performance of an NVIDIA Ethernet switch to the performance of a comparable switch built with commodity silicon.

Does the switch matter?

The network fabric is key to the performance of modern data centers. There are many requirements for data center switches, but the most basic is to provide equal amounts of bandwidth to all clients so that resources are shared evenly. Without fair networking, all workloads experience unpredictable performance due to throughput deterioration, delay, slow distributed workloads, and so on.

To answer the question of whether the switch matters, the Tolly Group benchmarked the cloud, AI, and storage workload performance of the NVIDIA Spectrum-3 12.8 Tbps switch. It compared the results to the performance of a typical (commodity) 12.8 Tbps data center switch, in an apples-to-apples comparison.

The Tolly Group

The Tolly Group, a third-party, independent IT industry lab, has been conducting performance tests and hands-on evaluations of IT products for more than 30 years. The Tolly Group provides evidence that products meet or exceed marketing claims, and it will not produce reports that conflict with its Fair Testing Charter. This proof of performance lets customers know they can deploy with confidence.

Distributed workload performance (AI and Spark)

Every switch has a buffer to prevent packet loss. The buffer also protects application performance by absorbing packet bursts whenever more traffic is sent into the switch than can be sent out of the switch. This is sometimes referred to as incast traffic patterns. Distributed workloads like AI and Spark, by their nature, are plagued by incast traffic patterns.

Both switches claimed identical buffer sizes on their datasheets. However, The Tolly Group found that NVIDIA Spectrum-3 was able to absorb 4-8x more packets than the typical data center switch. Eight commodity switches would be needed to provide the packet absorption capabilities equal to a Spectrum-3 switch.

Figure 1. NVIDIA Spectrum-3 and commodity silicon

Maximum absorption capability is important but not enough. It is crucial that the switch evenly absorbs the microburst from all senders, because slowing down one node slows down the entire cluster.

The Tolly Group found that Spectrum-3 evenly absorbed microburst traffic from all senders in all cases while the commodity switch slowed down multiple nodes, resulting in under-utilized compute resources.

Public and private cloud performance

The noisy neighbor problem crops up in public and private cloud environments, where multiple tenants use a shared resource, like CPU cycles or network bandwidth, and a “noisy neighbor” tenant shows up and hogs those resources.

The result of the noisy neighbor problem could be that one tenant can degrade the experience of another tenant due to the inadequate capability of the switch to isolate between them. A data center switch must protect tenants from the activities of other tenants, both from nefarious attacks as well as noisy neighbors.

The Tolly Group found that the Spectrum-3 switch fully protected each tenant. The competitor switch failed to protect tenants, allowing some tenants to be fully starved of bandwidth by the noisy neighbor’s traffic pattern.

When scaling out multitenant environments, Spectrum-3 protected each tenant. With the commodity switch, however, the noisy neighbor problem was far larger in scale and could expand to half the total number of switch ports. In other words, up to 70 ports could be victimized and starved.

A switch that cannot protect tenants from a noisy neighbor does not meet a basic requirement of a cloud fabric switch.

With Spectrum-3, there is no effect from noisy neighbor traffic patterns. With the commodity switch, victim tenants are starved for bandwidth.
Figure 2. Noisy neighbor isolation


Storage performance

Today, most storage traffic in the data center runs on Ethernet. More specifically, storage typically uses 9-KB jumbo frames. As a result, this packet size has become more important than ever, and almost every switch now supports a default packet size of 9 KB.

However, just because a typical data center switch supports 9-KB packets doesn’t mean it is optimized for storage workloads. To measure and compare the storage performance of each switch, The Tolly Group used 9-KB packets with standard network test tools from IXIA.

The Tolly Group found that Spectrum-3 provided predictable and fair performance across all storage nodes in all cases. The commodity switch showed unfair traffic sharing with 9-KB packets, forcing one storage node to run 17x slower than the other storage nodes. These unpredictable results harshly affect storage performance.

This has real-world implications. Think about the time it takes to run a storage backup. What if your planned and expected 2-hour backup time starts taking 34 hours to complete?

Mixed application performance

Most data centers run many different applications, each with their own packet sizes. Even a single application uses a variety of packet sizes. Adding in control traffic patterns, you will probably end up encountering an even greater variety of packet sizes on your fabric.

The Tolly Group found that Spectrum-3 provided fairness regardless of the packet size, while the commodity switch tended to starve applications that used smaller packet sizes. Even worse, as the disparity in packet sizes increased, the smaller packets fared worse.

Spectrum-3 treats packets of different sizes equally by providing them equitable throughput. With the commodity switch, smaller packets suffer through uneven throughput distribution.
Figure 3. Performance by frame size disparity

For the commodity switch, mixed packet size starvation adversely affects the cloud, storage, and distributed workloads.

But why?

Architecture. Simple as that.

The Spectrum switches have a modern fully shared buffer architecture and flexible pipeline architecture that were designed to optimize data center application performance and security. For more information about the results, download the new Tolly Group Performance Evaluation report. It explains the architecture of Spectrum switches and commodity switches, along with their advantages and disadvantages.

Architecture is indeed a zero-sum game. However, unlike many other vendors, NVIDIA develops both the ASIC and the switch. As a result, we have managed to eliminate tradeoffs and provide the superior results that The Tolly Group has verified.

Learn more

The switch matters and it can make a huge difference, either leveraging your workloads or adversely affecting them. For more information, join the Tolly Report webinar, download the Tolly Group Performance Evaluation report, or see The Tolly Group website.

Categories
Misc

Introducing Profiling Enhancements with the Latest NVIDIA Nsight Systems


The latest update to NVIDIA Nsight Systems—a performance analysis tool—is now available for download. Designed to help developers tune and scale software across CPUs and GPUs, this release introduces several improvements aimed to enhance the profiling experience. 

Nsight Systems is part of the powerful NVIDIA Nsight suite of debugging and profiling tools. A developer can start with Nsight Systems for an overall system view and avoid picking less efficient optimizations based on assumptions and false-positive indicators.

2021.4 Highlights Include:

  • Windows ISRs and DPCs trace
  • Windows GPU hardware-based scheduling trace 
  • Windows Direct3D12 and Vulkan correlation to WDDM events
  • NVTX event categorization support – to enable viewing of categories in isolated rows
  • Multi-report loading for multinode, VM, container, rank, and process investigations

Nsight Systems 2021.4 adds features to help users understand how the timing of range-based events is affected by OS interruptions, so they can better account for jitter in statistics or consider binding their processes and threads to cores with less frequent ISR and DPC processing.

Figure 1. Windows interrupt trace lets users see when the kernel is operating within your CPU thread. 

This release also adds data capture that helps users understand when, where, and how deeply packets are queued.

Figure 2. Improved graphics correlation allows users to explicitly follow API work submissions through WDDM queue packets and to their GPU workloads, rather than inferring it via timing.

More Information