
How to Read Research Papers: A Pragmatic Approach for ML Practitioners

This article presents a systematic method for reading research papers, intended as a resource for machine learning practitioners.

Is it necessary for data scientists or machine-learning experts to read research papers?

The short answer is yes. And don’t worry if you lack a formal academic background or have only obtained an undergraduate degree in the field of machine learning.

Reading academic research papers may be intimidating for individuals without an extensive educational background. However, a lack of academic reading experience should not prevent data scientists from taking advantage of a valuable source of information and knowledge for machine learning and AI development.

This article provides a hands-on tutorial for data scientists of any skill level to read research papers published in academic venues such as NeurIPS, JMLR, ICML, and so on.

Before diving into how to read research papers, the first two steps cover selecting a relevant topic and finding research papers on it.

Step 1: Identify a topic

The domain of machine learning and data science is home to a plethora of subject areas that may be studied. But this does not necessarily imply that tackling each topic within machine learning is the best option.

Although generalization is advised for entry-level practitioners, I’m guessing that when it comes to long-term machine learning career prospects, practitioner and industry interest often shifts to specialization.

Identifying a niche topic to work on may be difficult, but it is worthwhile. A rule of thumb is to select an ML field in which you are either interested in obtaining a professional position or already have experience.

Deep Learning is one of my interests, and I’m a Computer Vision Engineer who uses deep learning models in apps to solve computer vision problems professionally. As a result, I’m interested in topics like pose estimation, action classification, and gesture identification.

Based on roles, the following are examples of ML/DS occupations and related themes to consider.

Figure 1: Machine Learning and Data Science roles and associated topics. Image created by Author.

For this article, I’ll select the topic Pose Estimation to explore and choose associated research papers to study.

Step 2: Finding research papers

One of the best tools to use while looking for machine learning-related research papers, datasets, code, and other related materials is PapersWithCode.

We use the search engine on the PapersWithCode website to get relevant research papers and content for our chosen topic, “Pose Estimation.” The following image shows you how it’s done.

Figure 2: GIF searching Pose Estimation. Image created by Author.

The search results page contains a short explanation of the searched topic, followed by a table of associated datasets, models, papers, and code. Without going into too much detail, the area of interest for this use case is the “Greatest papers with code” section, which lists the papers relevant to the task or topic. For the purpose of this article, I’ll select the paper “DensePose: Dense Human Pose Estimation In The Wild.”

Step 3: First pass (gaining context and understanding)

A notepad with a lightbulb drawn on it.
Figure 3: Gaining context and understanding. Photo by AbsolutVision on Unsplash

At this point, we’ve selected a research paper to study and are prepared to extract any valuable learnings and findings from its content.

It’s only natural that your first impulse is to start writing notes and reading the document from beginning to end, perhaps taking some rest in between. However, establishing context for the content of a research paper first is a more practical way to read it. The title, abstract, and conclusion are the three key parts of any research paper for gaining that context.

The goal of the first pass of your chosen paper is to achieve the following:

  • Ensure that the paper is relevant.
  • Obtain a sense of the paper’s context by learning about its contents, methods, and findings.
  • Recognize the author’s goals, methodology, and accomplishments.


Title

The title is the first point of information sharing between the authors and the reader. Therefore, research paper titles are direct and composed in a manner that leaves no ambiguity.

The research paper title is the most telling aspect since it indicates the study’s relevance to your work. The purpose of the title is to give a brief impression of the paper’s content.

In this situation, the title is “DensePose: Dense Human Pose Estimation in the Wild.” This gives a broad overview of the work and implies that it will look at how to properly provide pose estimations in realistic environments with high levels of activity.


Abstract

The abstract gives a summarized version of the paper. It’s a short section, typically no more than a few hundred words, that tells you in a nutshell what the paper is about: its content, the researchers’ objectives, and their methods and techniques.

When reading an abstract of a machine-learning research paper, you’ll typically come across mentions of datasets, methods, algorithms, and other terms. Keywords relevant to the article’s content provide context. It may be helpful to take notes and keep track of all keywords at this point.

For the paper “DensePose: Dense Human Pose Estimation In The Wild”, I identified the following keywords in the abstract: pose estimation, COCO dataset, CNN, region-based models, real-time.


Conclusion

It’s not uncommon to experience fatigue when reading the paper from top to bottom on your first pass, especially for data scientists and practitioners with no prior advanced academic experience. Although extracting information from the later sections of a paper might seem tedious after a long study session, conclusion sections are often short. Hence, reading the conclusion section in the first pass is recommended.

The conclusion section is a brief summary of the authors’ contributions and accomplishments, along with the work’s limitations and promises for future developments.

Before reading the main content of a research paper, read the conclusion section to see if the researcher’s contributions, problem domain, and outcomes match your needs.

Following this brief first pass gives you a sufficient overview of the research paper’s scope and objectives, as well as context for its content. You’ll be able to get more detailed information out of the paper by going through it again with focused attention.

Step 4: Second pass (content familiarization)

Content familiarization builds on the initial steps of the systematic approach presented in this article. This step involves the introduction section and the figures within the research paper.

As previously mentioned, there is no need to plunge straight into the core of the research paper, because familiarizing yourself with the content first makes for an easier and more comprehensive examination of the study in later passes.


Introduction

Introductory sections of research papers are written to provide an overview of the objective of the research efforts. This overview mentions and explains problem domains, research scope, prior research efforts, and methodologies.

It’s normal to find parallels to past research work in this area, using similar or distinct methods. Other papers’ citations provide the scope and breadth of the problem domain, which broadens the exploratory zone for the reader. For these cited papers, applying the first-pass procedure outlined in Step 3 is probably sufficient at this point.

Another benefit of the introduction section is that it presents the prerequisite knowledge needed to approach and understand the content of the research paper.

Graphs, diagrams, figures

Illustrative materials within the research paper help readers comprehend the factors that support the problem definition or the explanations of the methods presented. Commonly, tables are used within research papers to report the quantitative performance of novel techniques in comparison to similar approaches.

Generally, the visual representation of data and performance enables the development of an intuitive understanding of the paper’s context. In the DensePose paper mentioned earlier, illustrations are used to depict the performance of the authors’ approach to pose estimation and to create an overall understanding of the steps involved in generating and annotating data samples.

In the realm of deep learning, it’s common to find topological illustrations depicting the structure of artificial neural networks. Again, this adds to the creation of intuitive understanding for any reader. Through illustrations and figures, readers may interpret the information themselves and gain a fuller perspective of it without having any preconceived notions about what outcomes should be.

Step 5: Third pass (deep reading)

The third pass of the paper is similar to the second, though it covers a greater portion of the text. The most important thing about this pass is that you skip any complex mathematics or technical formulations that may be difficult for you. During this pass, you can also skip over any words and definitions that you don’t understand or aren’t familiar with. Note these unfamiliar terms, algorithms, or techniques so that you can return to them later.

Image of a magnifying glass depicting deep reading.
Figure 6: Deep reading. Photo by Markus Winkler on Unsplash.

During this pass, your primary objective is to gain a broad understanding of what’s covered in the paper. Read the paper again from the abstract to the conclusion, but be sure to take breaks between sections. Moreover, it’s recommended to keep a notepad in which you record all key insights and takeaways, alongside the unfamiliar terms and concepts.

The Pomodoro Technique is an effective method of managing time allocated to deep reading or study. Explained simply, the Pomodoro Technique involves the segmentation of the day into blocks of work, followed by short breaks.

What works for me is the 50/15 split, that is, 50 minutes studying and 15 minutes allocated to breaks. I tend to execute this split twice consecutively before taking a more extended break of 30 minutes. If you are unfamiliar with this time management technique, adopt a relatively easy division such as 25/5 and adjust the time split according to your focus and time capacity.
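As a minimal sketch of the split described above, the following Python snippet lays out one cycle of study blocks and breaks. The function names are hypothetical, and it assumes that the extended break replaces the short break after the final study block, which is only one way to read the routine.

```python
def pomodoro_schedule(study=50, short_break=15, long_break=30, rounds=2):
    """Return (activity, minutes) blocks for one cycle of study rounds.

    Assumption: the long break takes the place of the short break
    after the last study block in the cycle.
    """
    schedule = []
    for round_index in range(rounds):
        schedule.append(("study", study))
        is_last_round = round_index == rounds - 1
        if is_last_round:
            schedule.append(("long break", long_break))
        else:
            schedule.append(("short break", short_break))
    return schedule


def total_minutes(schedule):
    """Sum the minutes of every block in a schedule."""
    return sum(minutes for _, minutes in schedule)


if __name__ == "__main__":
    for activity, minutes in pomodoro_schedule():
        print(f"{activity}: {minutes} min")
```

Calling `pomodoro_schedule(study=25, short_break=5)` gives the gentler 25/5 division, and `total_minutes` helps check that a cycle fits into the time you have available.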

Step 6: Fourth pass (final pass)

The final pass is typically one that involves an exertion of your mental and learning abilities, as it involves going through the unfamiliar terms, terminologies, concepts, and algorithms noted in the previous pass. This pass focuses on using external material to understand the recorded unfamiliar aspects of the paper.

In-depth studies of unfamiliar subjects have no specified time length, and at times the effort spans days or weeks. The critical factor in a successful final pass is locating the appropriate sources for further exploration.

Unfortunately, there isn’t one source on the Internet that provides the wealth of information you require. Still, there are multiple sources that, when used together and appropriately, fill knowledge gaps. Below are a few of these resources.

The reference sections of research papers mention the techniques and algorithms that the current paper either draws inspiration from or builds upon, which is why the reference section is a useful source to use in your deep reading sessions.

Step 7: Summary (optional)

In almost a decade of academic and professional undertakings in technology-associated subjects and roles, the most effective method I have found for ensuring that new information is retained in my long-term memory is the recapitulation of explored topics. By rewriting new information in my own words, either written or typed, I’m able to reinforce the presented ideas in an understandable and memorable manner.

An image of someone blogging on a laptop
Figure 7: Blogging and summarizing. Photo by NeONBRAND on Unsplash

To take it one step further, it’s possible to publicize learning efforts and notes through blogging platforms and social media. Attempting to explain a freshly explored concept to a broad audience, assuming the reader isn’t accustomed to the topic or subject, requires understanding the topic in intricate detail.


Undoubtedly, reading research papers can be daunting and challenging for novice data scientists and ML practitioners; even seasoned practitioners find it difficult to digest the content of a research paper in a single pass.

The Data Science profession is practical and hands-on by nature, yet it also requires its practitioners to adopt an academic mindset, all the more so because the Data Science domain is closely associated with AI, which is still a developing field.

To summarize, here are all of the steps you should follow to read a research paper:

  • Identify a topic.
  • Find associated research papers.
  • Read the title, abstract, and conclusion to gain a high-level understanding of the research effort’s aims and achievements.
  • Familiarize yourself with the content by diving deeper into the introduction, including the exploration of figures and graphs presented in the paper.
  • Use a deep reading session to digest the main content of the paper as you go through it from top to bottom.
  • Explore unfamiliar terms, concepts, and methods using external resources.
  • Summarize essential takeaways, definitions, and algorithms in your own words.

Thanks for reading!


Training Machine Learning Models More Efficiently with Dataset Distillation

For a machine learning (ML) algorithm to be effective, useful features must be extracted from (often) large amounts of training data. However, this process can be made challenging due to the costs associated with training on such large datasets, both in terms of compute requirements and wall clock time. The idea of distillation plays an important role in these situations by reducing the resources required for the model to be effective. The most widely known form of distillation is model distillation (a.k.a. knowledge distillation), where the predictions of large, complex teacher models are distilled into smaller models.

An alternative option to this model-space approach is dataset distillation [1, 2], in which a large dataset is distilled into a synthetic, smaller dataset. Training a model with such a distilled dataset can reduce the required memory and compute. For example, instead of using all 50,000 images and labels of the CIFAR-10 dataset, one could use a distilled dataset consisting of only 10 synthesized data points (1 image per class) to train an ML model that can still achieve good performance on the unseen test set.

Top: Natural (i.e., unmodified) CIFAR-10 images. Bottom: Distilled dataset (1 image per class) on CIFAR-10 classification task. Using only these 10 synthetic images as training data, a model can achieve test set accuracy of ~51%.

In “Dataset Meta-Learning from Kernel Ridge Regression”, published in ICLR 2021, and “Dataset Distillation with Infinitely Wide Convolutional Networks”, presented at NeurIPS 2021, we introduce two novel dataset distillation algorithms, Kernel Inducing Points (KIP) and Label Solve (LS), which optimize datasets using the loss function arising from kernel regression (a classical machine learning algorithm that fits a linear model to features defined through a kernel). Applying the KIP and LS algorithms, we obtain very efficient distilled datasets for image classification, reducing the datasets to 1, 10, or 50 data points per class while still obtaining state-of-the-art results on a number of benchmark image classification datasets. We are also excited to release our distilled datasets to benefit the wider research community.

One of the key theoretical insights into deep neural networks (DNNs) in recent years has been that increasing the width of DNNs results in more regular behavior that makes them easier to understand. As the width is taken to infinity, DNNs trained by gradient descent converge to the familiar and simpler class of models arising from kernel regression with respect to the neural tangent kernel (NTK), a kernel that measures input similarity by computing dot products of gradients of the neural network. Thanks to the Neural Tangents library, neural kernels for various DNN architectures can be computed in a scalable manner.

We utilized the above infinite-width limit theory of neural networks to tackle dataset distillation. Dataset distillation can be formulated as a two-stage optimization process: an “inner loop” that trains a model on learned data, and an “outer loop” that optimizes the learned data for performance on natural (i.e., unmodified) data. The infinite-width limit replaces the inner loop of training a finite-width neural network with a simple kernel regression. With the addition of a regularizing term, the kernel regression becomes a kernel ridge-regression (KRR) problem. This is a highly valuable outcome because the kernel ridge regressor (i.e., the predictor from the algorithm) has an explicit formula in terms of its training data (unlike a neural network predictor), which means that one can easily optimize the KRR loss function during the outer loop.
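To make the “explicit formula” concrete, here is a sketch of the standard kernel ridge regression predictor and the resulting outer-loop loss. The notation is an assumption introduced for illustration and does not appear in the text above: $(X_s, y_s)$ denotes the learned (support) data, $(X_t, y_t)$ a batch of natural (target) data, $K$ the kernel, and $\lambda > 0$ the regularization strength.

```latex
% KRR predictor fit on the support data (X_s, y_s)
f(x) = K(x, X_s)\,\bigl(K(X_s, X_s) + \lambda I\bigr)^{-1} y_s

% Outer-loop loss: squared error of that predictor on natural data (X_t, y_t),
% written up to a constant normalization factor
L(X_s, y_s) = \frac{1}{2}\,\bigl\lVert\, y_t - K(X_t, X_s)\,\bigl(K(X_s, X_s) + \lambda I\bigr)^{-1} y_s \,\bigr\rVert_2^2
```

Because $L$ is a closed-form, differentiable function of $X_s$ and $y_s$, the outer loop can apply gradient-based optimization to it directly, with no inner training loop to unroll.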

The original data labels can be represented by one-hot vectors, i.e., the true label is given a value of 1 and all other labels are given values of 0. Thus, an image of a cat would have the label “cat” assigned a 1 value, while the labels for “dog” and “horse” would be 0. The labels we use involve a subsequent mean-centering step, where we subtract the reciprocal of the number of classes from each component (so 0.1 for 10-way classification) so that the expected value of each label component across the dataset is normalized to zero.
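As a minimal sketch of the mean-centering step described above, the following Python snippet builds a one-hot label and subtracts the reciprocal of the number of classes from each component. The function names and the class list are hypothetical and chosen only for illustration.

```python
def one_hot(class_index, num_classes):
    """Return a one-hot label: 1.0 for the true class, 0.0 elsewhere."""
    return [1.0 if i == class_index else 0.0 for i in range(num_classes)]


def mean_center(label):
    """Subtract 1 / num_classes from every component of a label vector."""
    shift = 1.0 / len(label)
    return [value - shift for value in label]


if __name__ == "__main__":
    # Hypothetical 10-way class list; only its length matters here.
    classes = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]
    label = mean_center(one_hot(classes.index("cat"), len(classes)))
    print(label)  # true class is about 0.9, every other class about -0.1
```

Each centered label sums to zero, and when the classes are balanced, each label component also averages to zero across the dataset.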

While the labels for natural images appear in this standard form, the labels for our learned distilled datasets are free to be optimized for performance. Having obtained the kernel ridge regressor from the inner loop, the KRR loss function in the outer loop computes the mean-square error between the original labels of natural images and the labels predicted by the kernel ridge regressor. KIP optimizes the support data (images and possibly labels) by minimizing the KRR loss function through gradient-based methods. The Label Solve algorithm directly solves for the set of support labels that minimizes the KRR loss function, generating a unique dense label vector for each (natural) support image.

Example of labels obtained by label solving. Left and Middle: Sample images with possible labels listed below. The raw, one-hot label is shown in blue and the final LS generated dense label is shown in orange. Right: The covariance matrix between original labels and learned labels. Here, 500 labels were distilled from the CIFAR-10 dataset. A test accuracy of 69.7% is achieved using these labels for kernel ridge-regression.

Distributed Computation
For simplicity, we focus on architectures that consist of convolutional neural networks with pooling layers. Specifically, we focus on the so-called “ConvNet” architecture and its variants because it has been featured in other dataset distillation studies. We used a slightly modified version of ConvNet that has a simple architecture given by three blocks of convolution, ReLU, and 2×2 average pooling and then a final linear readout layer, with an additional 3×3 convolution and ReLU layer prepended (see our GitHub for precise details).

ConvNet architecture used in DC/DSA. Ours has an additional 3×3 Conv and ReLU prepended.

To compute the neural kernels needed in our work, we used the Neural Tangents library.

The first stage of this work, in which we applied KRR, focused on fully-connected networks, whose kernel elements are cheap to compute. But a hurdle facing neural kernels for models with convolutional layers plus pooling is that the computation of each kernel element between two images scales as the square of the number of input pixels (due to the capturing of pixel-pixel correlations by the kernel). So, for the second stage of this work, we needed to distribute the computation of the kernel elements and their gradients across many devices.
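To give a rough sense of that quadratic scaling, the short Python sketch below counts pixel-pixel pairs for a single kernel element. It assumes the standard image resolutions of MNIST (28×28) and CIFAR-10 (32×32), ignores color channels and constant factors, and uses a hypothetical helper name, so treat the numbers as order-of-magnitude illustrations rather than measured costs.

```python
def pixel_pairs(height, width):
    """Number of pixel-pixel pairs between two images of the same size."""
    pixels = height * width
    return pixels * pixels


if __name__ == "__main__":
    for name, (height, width) in {"MNIST": (28, 28), "CIFAR-10": (32, 32)}.items():
        print(f"{name}: {pixel_pairs(height, width):,} pixel pairs per kernel element")
```

A kernel matrix between N and M images contains N × M such elements, which is why the work is spread across many devices.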

Distributed computation for large scale metalearning.

We invoke a client-server model of distributed computation in which a server distributes independent workloads to a large pool of client workers. A key part of this is to divide the backpropagation step in a way that is computationally efficient (explained in detail in the paper).

We accomplish this using the open-source tools Courier (part of DeepMind’s Launchpad), which allows us to distribute computations across GPUs working in parallel, and JAX, for which novel usage of the jax.vjp function enables computationally efficient gradients. This distributed framework allows us to utilize hundreds of GPUs per distillation of the dataset, for both the KIP and LS algorithms. Given the compute required for such experiments, we are releasing our distilled datasets to benefit the wider research community.
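The pattern of a server handing independent workloads to a pool of workers can be sketched in plain Python. This is an illustration only, under loud assumptions: it uses a thread pool from the standard library in place of Courier, a simple dot-product kernel in place of a neural kernel, and hypothetical function names throughout. It does not reproduce the gradient computation with jax.vjp.

```python
from concurrent.futures import ThreadPoolExecutor


def linear_kernel(x, y):
    """Toy stand-in for a neural kernel: a plain dot product."""
    return sum(a * b for a, b in zip(x, y))


def kernel_block(job):
    """Worker task: compute one independent block of the kernel matrix."""
    rows, cols = job
    return [[linear_kernel(x, y) for y in cols] for x in rows]


def chunks(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def distributed_kernel(xs, ys, block_size=2, workers=4):
    """Server side: split the kernel matrix into blocks, farm them out, stitch."""
    row_chunks = chunks(xs, block_size)
    col_chunks = chunks(ys, block_size)
    jobs = [(rows, cols) for rows in row_chunks for cols in col_chunks]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        blocks = list(pool.map(kernel_block, jobs))  # results keep job order
    kernel = []
    blocks_per_row = len(col_chunks)
    for i, rows in enumerate(row_chunks):
        row_blocks = blocks[i * blocks_per_row:(i + 1) * blocks_per_row]
        for k in range(len(rows)):
            # Concatenate row k of every block in this block-row.
            kernel.append([value for block in row_blocks for value in block[k]])
    return kernel


if __name__ == "__main__":
    points = [[1, 0], [0, 1], [1, 1]]
    print(distributed_kernel(points, points))
```

Because each block depends only on its own slice of the inputs, the blocks can be computed in any order and on any worker, which is the property that makes this kind of workload easy to distribute.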

Our first set of distilled images above used KIP to distill CIFAR-10 down to 1 image per class while keeping the labels fixed. Next, in the figure below, we compare the test accuracy of training on natural MNIST images, KIP distilled images with labels fixed, and KIP distilled images with labels optimized. We highlight that learning the labels provides an effective, albeit mysterious, benefit to distilling datasets. Indeed, the resulting set of images provides the best test performance (for infinite-width networks) despite being less interpretable.

MNIST dataset distillation with trainable and non-trainable labels. Top: Natural MNIST data. Middle: Kernel Inducing Point distilled data with fixed labels. Bottom: Kernel Inducing Point distilled data with learned labels.

Our distilled datasets achieve state-of-the-art performance on benchmark image classification datasets, improving performance beyond previous state-of-the-art models that used convolutional architectures, Dataset Condensation (DC) and Dataset Condensation with Differentiable Siamese Augmentation (DSA). In particular, for CIFAR-10 classification tasks, a model trained on a dataset consisting of only 10 distilled data entries (1 image / class, 0.02% of the whole dataset) achieves a 64% test set accuracy. Here, learning labels and an additional image preprocessing step lead to a significant increase in performance beyond the ~51% test accuracy shown in our first figure (see our paper for details). With 500 images (50 images / class, 1% of the whole dataset), the model reaches 80% test set accuracy. While these numbers are with respect to neural kernels (using the KRR infinite-width limit), these distilled datasets can be used to train finite-width neural networks as well. In particular, on CIFAR-10, a finite-width ConvNet neural network achieves 50% test accuracy with 10 images and 68% test accuracy using 500 images, which are still state-of-the-art results. We provide a simple Colab notebook demonstrating this transfer to a finite-width neural network.

Dataset distillation using Kernel Inducing Points (KIP) with a convolutional architecture outperforms prior state-of-the-art models (DC/DSA) on all benchmark settings on image classification tasks. Label Solve (LS, middle columns), while only distilling information in the labels, can often (e.g., CIFAR-10 with 10 or 50 data points per class) outperform prior state-of-the-art models as well.

In some cases, our learned datasets are more effective than a natural dataset one hundred times larger in size.

We believe that our work on dataset distillation opens up many interesting future directions. For instance, our algorithms KIP and LS have demonstrated the effectiveness of using learned labels, an area that remains relatively underexplored. Furthermore, we expect that utilizing efficient kernel approximation methods can help to reduce computational burden and scale up to larger datasets. We hope this work encourages researchers to explore other applications of dataset distillation, including neural architecture search and continual learning, and even potential applications to privacy.

Anyone interested in the KIP and LS learned datasets for further analysis is encouraged to check out our papers [ICLR 2021, NeurIPS 2021] and open-sourced code and datasets available on GitHub.

This project was done in collaboration with Zhourong Chen, Roman Novak and Lechao Xiao. We would like to give special thanks to Samuel S. Schoenholz, who proposed and helped develop the overall strategy for our distributed KIP learning methodology.

¹ Now at DeepMind.


Bringing Networking into View with the NVIDIA Air Marketplace

NVIDIA Air now includes the NVIDIA Air Marketplace—a collection of demos to get started building your network digital twin.

Networking simulations are essential since the classical model of deployment, based on CLI and adventurous copy/paste-based configuration, has become inefficient for medium- and large-scale environments. NVIDIA Air provides a platform to build, simulate, and experience a modern data center powered by a modern network operating system (NOS).

What is NVIDIA Air?

NVIDIA Air is a cloud-based environment that runs in your browser and is powered on the back end by NVIDIA Cumulus Linux, SONiC, and Linux (that is, a standard server Linux). This approach to networking simulations reflects the paradigm shift from traditional networking to the new era of cloud-native networking.

Air is designed to remove the need for the hypervisor, which is frequently a bottleneck in terms of resources and a time-consuming constraint, for fast feature testing. Air addresses many scenarios:

  • Demo infrastructure (SONiC in the Cloud, Cumulus in the Cloud, Cumulus and SONiC in the Cloud)
  • Continuous integration 
  • Custom topologies, with the builder 
  • Training and education 
  • Configuration management 

Air provides an always-accessible, always-on training or preproduction environment for networking teams. Enterprises can now shrink their hardware footprint and decrease expenses: CapEx drops due to reduced hardware needs, and OpEx drops with the Air public cloud operational model. With Air, modern cloud-scale networking has never been easier or more powerful.

NVIDIA Air Marketplace

Recently, we launched the NVIDIA Air Marketplace, a collection of on-demand training, test resources, and demos for Cumulus Linux and other NVIDIA Networking offerings. This collection makes Air simpler to use and lowers the barrier to entry. The marketplace consists of content created directly by NVIDIA and by one of the best communities ever: you!

A display of all the currently available demos in the NVIDIA Air Marketplace.
Figure 1. NVIDIA Air Demo Marketplace

How to get started

The marketplace is here to help those curious about or new to Cumulus Linux. The curated demo environments make it easy to test new functionalities and environments in a simple and straightforward way. You can access a complete demo lab that has the same characteristics as a physical environment. Each demo lab also includes a validated demo guide to help you with the lab.

First, you must access the portal using your username and password, or by creating a new account.  After you have entered the platform, on the left sidebar, choose Demo Marketplace. From here, you can view a catalog of prebuilt scenarios that allow you to create a lab about the specific feature or configuration you would like to test. 

Choose the scenario that piques your interest. From here, you can read the README, explore the git repository, or start the demo with a single click.

Figure 3.  Starting a demo

Air then allocates the resources required. Thanks to the low Cumulus footprint of 768 MB, it takes roughly 90 seconds to spin up 15+ nodes.

When the lab is loaded, you can log in to the mgmt-server from your browser or with your favorite SSH client. 

A screenshot of the loaded demo where you can navigate the command line through Air or with your favorite SSH client.
Figure 5. Guided tours

For example, from iTerm2:

> ssh -p 16732
Welcome to Ubuntu 18.04.5 LTS (GNU/Linux 4.15.0-151-generic x86_64)

* Documentation:
* Management:
* Support:

System information as of Tue Oct 19 13:03:58 UTC 2021

System load:  0.07             Processes:           114
Usage of /:   29.2% of 9.29GB  Users logged in:     0
Memory usage: 23%              IP address for eth0:
Swap usage:   0%               IP address for eth1:

25 updates can be applied immediately.
16 of these updates are standard security updates.
To see these additional updates run: apt list --upgradable

New release '20.04.3 LTS' available.
Run 'do-release-upgrade' to upgrade to it.

Last login: Tue Oct 19 13:03:46 2021 from fd01:1:1:32c5::1
cumulus@oob-mgmt-server :~$ 

Create your own demo

Do you have an idea for a demo for a specific use case? This is the perfect time to become an active part of the community. You can create your own demo environment and submit it for review by contacting your NVIDIA sales representative or an NVIDIA team member. The Air team will review and publish the demos to the marketplace.

With the vibrant NVIDIA Air community, the sky’s the limit for training and collaboration in the marketplace!  


NVIDIA DPU Hackathon Unveils AI, Cloud, and Accelerated Computing Breakthroughs

NVIDIA announces the winners from the second global DPU Hackathon.
Two hackathon participants typing on their computer

The second global NVIDIA DPU Hackathon brought together 11 teams with the goal of creating new and exciting data processing unit (DPU) innovations. Spanning 24 hours from December 8 to 9, the second in a series of global NVIDIA DPU Hackathons received over 50 team applications from various universities and enterprises. 

As a new class of programmable processors, a DPU ignites unprecedented innovation for modern data centers. By offloading, accelerating, and isolating a broad range of advanced networking, storage, and security services, NVIDIA BlueField DPUs provide a secure and accelerated infrastructure for any workload in any environment. The NVIDIA DOCA software framework brings together APIs, drivers, libraries, sample code, documentation, services, and prepackaged containers so developers can speed application development and deployment on BlueField DPUs. These applications span several use cases, including security, automation, AI, HPC, and telemetry.

“We love hackathons, they create the right environment to perform a step function in the development. We put the DOCA developers in the center, offering them training, mentorship, preconfigured setups, documentation, a working environment, and visibility. Moving forward hackathons will play a significant role in establishing a strong DOCA developer community,”  said Dror Goldenberg, the SVP of Software Architecture at NVIDIA.

Two of the hackathon participants collaborating on their team application.
Figure 1. Hackathon participants sitting at their laptops.

DPU Hackathon winners

First Place – Team Rutgers University

Team Rutgers University focused on developing a unique, high-performance, DPU-accelerated, and scalable L4 load balancer. Using DOCA Flow APIs to configure the embedded switch, Team Rutgers built an application that offloads the load-balancing algorithm to hardware and handles flow tracking in hardware. The final design is a testament to the unique value that can be achieved with DOCA.

Second Place – Team Equinix Metal

Team Equinix Metal focused on DPU service orchestration with gRPC APIs. They were excited to try the DPU in a bare-metal cloud use case and to improve on their existing synchronous method. By using gRPC to configure the bare-metal host network asynchronously, they ensured that networking commands were handled even if the network was disrupted. Sending gRPC commands to the DPU allowed asynchronous configuration of OVS running on the BlueField and delivered BlueField service orchestration.

Third Place – Team BlueJazz from Versa Networks

Team BlueJazz created a DPU-accelerated secure access service by running traffic inspection and an inference service on the DPU with 100G links. With their innovation, they offload any datapath with subfunctions and accelerate processing by virtualizing the DPU as an engine for security. Team BlueJazz used DOCA Deep Packet Inspection APIs to offload the pattern-matching logic and leveraged DOCA reference applications for URL filtering and application recognition.

Congratulations to our winners and thank you to all of the teams that participated, making this round of our NVIDIA DPU Hackathon a success!

Join the DOCA community

NVIDIA is building a broad community of DOCA developers to create innovative applications and services on top of BlueField DPUs to secure and accelerate modern, efficient data centers. To learn more about joining the community, visit the DOCA developer web page or register to download DOCA today.

Up next is the NVIDIA DPU Hackathon in China. Check the corporate calendar to stay informed about future events, and take part in our journey to reshape the data center of tomorrow.



Are there any differences between MediaPipe and MoveNet?

Hi everyone,

I’m a little confused after going through the documentation for these two as I don’t understand the difference between these two libraries. They both even use 17 points to find the person’s position. Is there any difference between these two libraries?

I’m going to be using this in a mobile app to take a picture when a user hits a specific pose, and as you can tell, I’m new to all of this.

Thank you for any help and guidance in advance.

submitted by /u/A_Tired_Founder


Advent of Code 2021 in pure TensorFlow – day 3. TensorArrays limitations, and tf.function relaxed shapes

submitted by /u/pgaleone

Scaling Zero Touch RoCE Technology with Round Trip Time Congestion Control

The new NVIDIA RTTCC congestion control algorithm for ZTR delivers RoCE performance at scale, without special switch infrastructure configuration.

NVIDIA Zero Touch RoCE (ZTR) enables data centers to seamlessly deploy RDMA over Converged Ethernet (RoCE) without requiring any special switch configuration. Until recently, ZTR was optimal for only small to medium-sized data centers. Meanwhile, large-scale deployments have traditionally relied on Explicit Congestion Notification (ECN) to enable RoCE network transport, which requires switch configuration.

The new NVIDIA congestion control algorithm—Round-Trip Time Congestion Control (RTTCC)—allows ZTR to scale to thousands of servers without compromising performance. Using ZTR and RTTCC allows data center operators to enjoy ease-of-deployment and operations together with the superb performance of Remote Direct Memory Access (RDMA) at a massive scale, without any switch configuration. 

This post describes the previously recommended RoCE congestion control in large and small-scale RoCE deployments. It then introduces a new congestion control algorithm that allows configuration-free, large-scale implementations of ZTR, which perform like ECN-enabled RoCE. 

RoCE deployments with Data Center Quantized Congestion Notification

In a typical TCP-based environment, distributed memory requests require many steps and CPU cycles, negatively impacting application performance. RDMA eliminates all CPU involvement in memory data transfers between servers, significantly accelerating both access to stored data and application performance.

RoCE provides RDMA in Ethernet environments, the primary network fabric in data centers. Ethernet requires an advanced congestion control mechanism to support RDMA network transports. Data Center Quantized Congestion Notification (DCQCN) is a congestion control algorithm that responds to congestion notifications by dynamically adjusting traffic transmit rates.

The implementation of DCQCN requires enabling Explicit Congestion Notification (ECN), which entails configuring network switches. ECN configures switches to set the Congestion Experienced (CE) bit to indicate the imminent onset of congestion. 
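As a rough illustration of how a sender can respond to congestion notifications, the following Python sketch models a simplified DCQCN-style rate adjustment. It follows the general shape of the publicly described DCQCN sender behavior: a multiplicative cut scaled by a congestion estimate, followed by recovery toward the previous rate. It is not NVIDIA's NIC implementation. The class name and parameter values are assumptions, and the real algorithm uses separate timers and byte counters that are merged into one step here.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    """Simplified DCQCN-style sender; all parameter values are illustrative."""

    line_rate_gbps: float = 100.0
    g: float = 1 / 256           # gain for the congestion estimate alpha
    rate_ai_gbps: float = 0.04   # additive-increase step
    fast_recovery_rounds: int = 5

    def __post_init__(self) -> None:
        self.current_rate = self.line_rate_gbps  # rate currently in use
        self.target_rate = self.line_rate_gbps   # rate to recover toward
        self.alpha = 1.0                         # congestion estimate
        self.rounds_since_cut = 0

    def on_congestion_notification(self) -> None:
        # Remember the pre-cut rate, cut the rate, and raise alpha.
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.rounds_since_cut = 0

    def on_quiet_period(self) -> None:
        # No notification arrived: decay alpha and move back toward the target.
        self.alpha *= 1 - self.g
        self.rounds_since_cut += 1
        if self.rounds_since_cut > self.fast_recovery_rounds:
            # After fast recovery, probe for more bandwidth additively.
            self.target_rate = min(
                self.target_rate + self.rate_ai_gbps, self.line_rate_gbps
            )
        self.current_rate = (self.target_rate + self.current_rate) / 2


sender = DcqcnSender()
sender.on_congestion_notification()  # rate is cut in response to congestion
sender.on_quiet_period()             # rate starts recovering
```

The notifications that drive on_congestion_notification only exist if switches mark packets, which is why this approach depends on ECN being configured in the fabric.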

Zero touch RoCE—with reactive congestion control 

The NVIDIA-developed ZTR technology allows RoCE deployments that don’t require configuring the switch infrastructure. Built according to the InfiniBand Trade Association (IBTA) RDMA standard and fully compliant with the RoCE specifications, ZTR enables seamless deployment of RoCE. ZTR also delivers performance equivalent to traditional switch-enabled RoCE and significantly better than traditional TCP-based memory access. Moreover, with ZTR, RoCE network transport services operate side-by-side with non-RoCE communications in ordinary TCP/IP environments.

As noted in the NVIDIA Zero-Touch RoCE Technology Enables Cloud Economics for Microsoft Azure Stack HCI post, Microsoft has validated ZTR for their Azure Stack HCI platform, which typically scales to a few dozen nodes. In such environments, ZTR relies on implicit packet loss notification, which is sufficient for small-scale deployments. With the addition of a new round-trip time (RTT)-based congestion control algorithm, ZTR becomes even more robust and scalable without relying on packet loss to notify the server of network congestion.

Introducing round-trip time congestion control

The new NVIDIA congestion control algorithm, RTTCC, actively monitors network RTT to detect and adapt to the onset of congestion before packets are dropped. RTTCC enables dynamic congestion control using a hardware-based feedback loop that provides dramatically superior performance compared to software-based congestion control algorithms. RTTCC also supports faster transmission rates and allows ZTR to be deployed at a larger scale. ZTR with RTTCC is now available as a beta feature, with GA planned for the second half of 2022.

How ZTR-RTTCC works

ZTR-RTTCC extends DCQCN in RoCE networks with a hardware RTT-based congestion control algorithm.

Server A (the initiator) sends both payload and timing packets to server B. Timing packets are immediately returned to the initiator, enabling it to measure the round-trip latency.
Figure 1. Round trip timing between servers

Timing packets (green network packets in the preceding figure) are periodically sent from the initiator to the target and immediately returned. RTTCC measures the interval between when the initiator sent a timing packet and when it received the returned packet. This difference (Time Received – Time Sent) is the round-trip latency, which indicates path congestion. Uncongested flows continue to transmit packets to make the best use of the available network path bandwidth. Increasing latency on a flow implies path congestion, so RTTCC throttles that flow’s traffic to avoid buffer overflow and packet drops.

Network traffic can be adjusted either up or down in real-time as congestion decreases or increases. The ability to actively monitor and react to congestion is critical to enabling ZTR to manage congestion proactively. This proactive rate control also results in reduced packet re-transmission and improved RoCE performance. With ZTR-RTTCC, data center nodes do not wait to be notified of packet loss; instead, they actively identify congestion prior to packet loss and react accordingly, notifying initiators to adjust transmission rates.
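The feedback loop described above can be sketched in a few lines of Python. This is a toy model only: the actual RTTCC algorithm runs in NIC hardware and its internals are not described in this post. The class name, the additive-increase and multiplicative-decrease policy, and every threshold and step size below are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RttRateController:
    """Toy RTT-based rate controller; thresholds and steps are hypothetical."""

    line_rate_gbps: float = 100.0
    min_rate_gbps: float = 1.0
    increase_step_gbps: float = 5.0  # additive increase when the path looks clear
    decrease_factor: float = 0.8     # multiplicative decrease when RTT inflates
    rtt_tolerance: float = 1.25      # RTT above 1.25x baseline counts as congestion

    def __post_init__(self) -> None:
        self.rate_gbps = self.line_rate_gbps
        self.base_rtt_us = None  # lowest RTT seen so far (uncongested estimate)

    def on_timing_packet(self, time_sent_us: float, time_received_us: float) -> float:
        # Round-trip latency = Time Received - Time Sent.
        rtt_us = time_received_us - time_sent_us
        if self.base_rtt_us is None or rtt_us < self.base_rtt_us:
            self.base_rtt_us = rtt_us
        if rtt_us > self.base_rtt_us * self.rtt_tolerance:
            # Latency is rising: throttle before buffers overflow.
            self.rate_gbps = max(
                self.rate_gbps * self.decrease_factor, self.min_rate_gbps
            )
        else:
            # Path looks uncongested: ramp back up toward line rate.
            self.rate_gbps = min(
                self.rate_gbps + self.increase_step_gbps, self.line_rate_gbps
            )
        return self.rate_gbps


ctrl = RttRateController()
ctrl.on_timing_packet(0.0, 10.0)     # baseline RTT of 10 us, rate stays at line rate
ctrl.on_timing_packet(100.0, 120.0)  # RTT doubled, so the rate is cut
ctrl.on_timing_packet(200.0, 210.0)  # RTT back at baseline, rate recovers one step
```

The point of the sketch is that the sender needs nothing from the switches: the only input is the timestamp pair of each returned timing packet, so the rate can move down before any packet is lost and back up as soon as latency returns to its baseline.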

As noted earlier, one of the key benefits of ZTR is the ability to provide RoCE functionality while operating simultaneously with non-RoCE communications in ordinary TCP/IP traffic. ZTR provides seamless deployment of RoCE network capabilities. With the addition of RTTCC actively monitoring congestion, ZTR provides data center-wide operation without switch configuration. Read on to see how it performs.

ZTR with RTTCC performance

As shown in Figure 2, ZTR with RTTCC provides application performance comparable to RoCE with ECN and PFC configured across the network fabric. These tests were performed under worst-case many-to-one (incast) scenarios to simulate throughput under congested conditions.

The results indicate that not only does ZTR with RTTCC scale to thousands of nodes, but it also performs comparably to the fastest RoCE solution currently available.

  • At small scale (256 connections and below), ZTR with RTTCC achieves 99% of the performance of RoCE with ECN congestion control enabled (conventional RoCE).
  • With over 16,000 connections, ZTR with RTTCC throughput is 98% of conventional RoCE throughput.

ZTR with RTTCC provides near-equivalent performance to conventional RoCE without requiring any switch configuration.

A diagram showing comparison of network throughput (Gb/s) for ZTR w/ RTTCC and RoCE w/ DC-QCN (Conventional RoCE)
Figure 2. Application bandwidth with increasing connections

Configuring ZTR

To configure ZTR with the new RTTCC algorithm, download and install the latest firmware and tools for your NVIDIA network interface card and perform the following steps.

Enable programmable congestion control using mlxconfig (persistent configuration):

mlxconfig -d /dev/mst/mt4125_pciconf0 -y s

Reset the device using mlxfwreset or reboot the host:

mlxfwreset -d /dev/mst/mt4125_pciconf0 -l 3 -y r

When you complete these steps, ZTR-RTTCC is used when RDMA-CM is used with Enhanced Connection Establishment (ECE, supported with MLNX_OFED version 5.1). 

If there’s an error, you can force ZTR-RTTCC usage regardless of RDMA-CM synchronization status:

mlxreg -d /dev/mst/mt4125_pciconf0 --reg_id 0x506e --reg_len 0x40 --set "0x0.0:8=2,0x4.0:4=15" -y


NVIDIA RTTCC, the new congestion control algorithm for ZTR, delivers superb RoCE performance at data center scale, without any special configuration of the switch infrastructure. This enhancement allows data centers to enable RoCE seamlessly in both existing and new data center infrastructure and benefit from immediate application performance improvements. 

We encourage you to test ZTR with RTTCC for your application use cases by downloading the latest NVIDIA software.


