Deep Learning Study Could Spark New Dinosaur Discoveries

Researchers combine CT imaging with deep learning to evaluate dinosaur fossils. The approach could change how paleontologists study ancient remains.

Applying new technology to studying ancient history, researchers are looking to expand their understanding of dinosaurs with a new AI algorithm. The study, published in Frontiers in Earth Science, uses high-resolution Computed Tomography (CT) imaging combined with deep learning models to scan and evaluate dinosaur fossils. The research is a step toward creating a new tool that would vastly change the way paleontologists study ancient remains. 

“Computed Tomography as well as other imaging techniques have revealed previously hidden structures in fossils, but the high-resolution images require paleontologists spending weeks to even months in post-processing, usually segmenting fossils from rock matrices. The introduction of AI can not only accelerate data processing in fossil studies, but also establish benchmarks for more objective and more reproducible studies,” said lead author Congyu Yu, a Ph.D. student at the Richard Gilder Graduate School at the American Museum of Natural History.

For a complete picture of ancient vertebrates, paleontologists focus on internal anatomy such as cranial capacity, inner ears, or vascular spaces. To do this, researchers use a technique called thin sectioning. Removing a small piece (as thin as several micrometers) from a fossil, examining it under a microscope, and annotating the structures they find helps them piece together the morphology of a dinosaur. However, this technique is destructive to the remains and can be extremely time-consuming.

Computed tomography (CT) scans have given scientists the ability to look inside a sample while leaving the fossil unscathed. The technology essentially examines a fossil section, capturing thousands of images of it. Software then reconstructs the images and generates a three-dimensional graphic, resulting in an internal snapshot of the sample. Scientists can then examine and label identifiable morphology in the graphic to learn more about a specimen.

Imaging has given scientists a tool for revealing hidden internal structures and advancing 3D models of dinosaurs. Studies have helped researchers estimate body mass, analyze skulls, and even understand dental morphology along with tooth replacement patterns.

However, with this approach, scientists still segment, examine, and label the images by hand, which is not only time-intensive but also subjective and prone to error. In addition, scans have trouble differentiating between the bones themselves and the rock that may be coating them, making it difficult to determine where the rock ends and the fossil begins.

AI has proven capable of quick image segmentation in the medical world, ranging from identifying brain lesions to skin cancer. The researchers saw an opportunity to apply similar deep learning models to CT fossil images.  

They tested this new approach using deep neural networks and over 10,000 annotated CT scans of three well-preserved embryonic skulls of Protoceratops dinosaurs. Recovered in the 1990s from the Mongolian Gobi Desert, these fossils come from an early horned dinosaur that is a smaller relative of the better-known Triceratops.

The team used a classic U-Net deep neural network for fossil segmentation, teaching the algorithm to distinguish fossil from rock. A modified DeepLab v3+ network was used for training feature identification, categorizing parts of the CT images, and 3D rendering.

The models were trained using 7,986 manually annotated bone structure CT slices on the cuDNN-accelerated TensorFlow deep learning framework with dual NVIDIA GeForce RTX 2080 Ti GPUs.

Testing the models against a dataset of 3,329 slices, they found that the segmentation model reached a high accuracy of around 97%, while the feature models' 3D renderings were not as meticulous or accurate as those produced by the scientists. Even so, the segmentation models worked smoothly and did so in record time, segmenting each slice in seconds, whereas manually segmenting the same piece took minutes or even hours in some cases. This could help paleontologists reduce the time they spend differentiating fossils from rock.
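
To make a figure like "around 97%" concrete, the sketch below scores a predicted fossil-versus-rock mask against a manual annotation. It is an illustration only: the exact metric used in the study is not stated here, so pixel accuracy and the Dice coefficient are assumptions, as are the function names and the toy 4×4 masks.

```python
def _pairs(pred, truth):
    """Flatten two equally sized 2D masks into (predicted, manual) label pairs."""
    return [(p, t) for prow, trow in zip(pred, truth) for p, t in zip(prow, trow)]

def pixel_accuracy(pred, truth):
    """Fraction of pixels where the predicted label matches the manual label."""
    pairs = _pairs(pred, truth)
    return sum(p == t for p, t in pairs) / len(pairs)

def dice(pred, truth):
    """Dice overlap for the fossil class (label 1); 1.0 means identical masks."""
    pairs = _pairs(pred, truth)
    overlap = sum(1 for p, t in pairs if p == 1 and t == 1)
    total = sum(p for p, _ in pairs) + sum(t for _, t in pairs)
    return 2 * overlap / total if total else 1.0

# Toy masks: 1 = fossil, 0 = rock matrix (hypothetical data, not from the study).
manual = [[0, 0, 1, 1],
          [0, 1, 1, 1],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
predicted = [[0, 0, 1, 1],
             [0, 1, 1, 1],
             [0, 0, 1, 0],
             [0, 0, 0, 0]]

print(pixel_accuracy(predicted, manual), dice(predicted, manual))
```

Pixel accuracy can look high when most of a slice is rock, which is why an overlap score such as Dice is often reported alongside it.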

Figure 1. Comparison of different 3D renderings from left: raw reconstruction, manual segmentation, and deep learning segmentation.

The researchers suggest that larger data sets incorporating other dinosaur species and different sediment types could help create a high-performing algorithm down the line. 

“We are confident that a segmentation model for fossils from the Gobi Desert is not far away, but a more generalized model needs not only more training dataset but innovations in algorithms,” Yu said in a press release. “I believe deep learning can eventually process imagery better than us, and there have already been various examples in deep learning performance exceeding humans, including Go playing and protein 3D-structure prediction.”

The dataset used in the study, CT Segmentation of Dinosaur Fossils by Deep Learning, is available for download.

Read the study in Frontiers in Earth Science.


Federated Learning with Formal Differential Privacy Guarantees

In 2017, Google introduced federated learning (FL), an approach that enables mobile devices to collaboratively train machine learning (ML) models while keeping the raw training data on each user’s device, decoupling the ability to do ML from the need to store the data in the cloud. Since its introduction, Google has continued to actively engage in FL research and deployed FL to power many features in Gboard, including next word prediction, emoji suggestion and out-of-vocabulary word discovery. Federated learning is improving the “Hey Google” detection models in Assistant, suggesting replies in Google Messages, predicting text selections, and more.

While FL allows ML without raw data collection, differential privacy (DP) provides a quantifiable measure of data anonymization, and when applied to ML can address concerns about models memorizing sensitive user data. This too has been a top research priority, and has yielded one of the first production uses of DP for analytics with RAPPOR in 2014, our open-source DP library, Pipeline DP, and TensorFlow Privacy.

Through a multi-year, multi-team effort spanning fundamental research and product integration, today we are excited to announce that we have deployed a production ML model using federated learning with a rigorous differential privacy guarantee. For this proof-of-concept deployment, we utilized the DP-FTRL algorithm to train a recurrent neural network to power next-word-prediction for Spanish-language Gboard users. To our knowledge, this is the first production neural network trained directly on user data announced with a formal DP guarantee (technically ρ=0.81 zero-Concentrated-Differential-Privacy, zCDP, discussed in detail below). Further, the federated approach offers complementary data minimization advantages, and the DP guarantee protects all of the data on each device, not just individual training examples.

Data Minimization and Anonymization in Federated Learning
Along with fundamentals like transparency and consent, the privacy principles of data minimization and anonymization are important in ML applications that involve sensitive data.

Federated learning systems structurally incorporate the principle of data minimization. FL only transmits minimal updates for a specific model training task (focused collection), limits access to data at all stages, processes individuals’ data as early as possible (early aggregation), and discards both collected and processed data as soon as possible (minimal retention).

Another principle that is important for models trained on user data is anonymization, meaning that the final model should not memorize information unique to a particular individual’s data, e.g., phone numbers, addresses, credit card numbers. However, FL on its own does not directly tackle this problem.

The mathematical concept of DP allows one to formally quantify this principle of anonymization. Differentially private training algorithms add random noise during training to produce a probability distribution over output models, and ensure that this distribution doesn’t change too much given a small change to the training data; ρ-zCDP quantifies how much the distribution could possibly change. We call this example-level DP when adding or removing a single training example changes the output distribution on models in a provably minimal way.

Showing that deep learning with example-level differential privacy was even possible in the simpler setting of centralized training was a major step forward in 2016. This was achieved by the DP-SGD algorithm, whose key idea was to amplify the privacy guarantee by leveraging the randomness in sampling training examples (“amplification-via-sampling”).

However, when users can contribute multiple examples to the training dataset, example-level DP is not necessarily strong enough to ensure the users’ data isn’t memorized. Instead, we have designed algorithms for user-level DP, which requires that the output distribution of models doesn’t change too much even if we add/remove all of the training examples from any one user (or all the examples from any one device in our application). Fortunately, because FL summarizes all of a user’s training data as a single model update, federated algorithms are well-suited to offering user-level DP guarantees.

Both limiting the contributions from one user and adding noise can come at the expense of model accuracy, however, so maintaining model quality while also providing strong DP guarantees is a key research focus.
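
As a rough illustration of those two ingredients, the sketch below clips each user's update to a maximum L2 norm and then adds Gaussian noise to the aggregate. It is a simplified, generic clip-and-noise step, not the production training code; the names `clip_norm` and `noise_multiplier` are illustrative, and the snippet does not by itself compute any privacy guarantee.

```python
import math
import random

def clip_update(update, clip_norm):
    """Scale a user's model update so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]

def private_average(user_updates, clip_norm, noise_multiplier, rng):
    """Sum clipped per-user updates, add Gaussian noise, then average."""
    n = len(user_updates)
    dim = len(user_updates[0])
    clipped = [clip_update(u, clip_norm) for u in user_updates]
    summed = [sum(u[k] for u in clipped) for k in range(dim)]
    # Noise is scaled to the largest influence any single user can have.
    sigma = noise_multiplier * clip_norm
    return [(s + rng.gauss(0.0, sigma)) / n for s in summed]

# Hypothetical updates from three users; the first is unusually large and gets clipped.
updates = [[3.0, 4.0], [0.3, 0.4], [-0.2, 0.1]]
noisy_avg = private_average(updates, clip_norm=1.0, noise_multiplier=1.0,
                            rng=random.Random(0))
```

Lowering `clip_norm` or raising `noise_multiplier` bounds each user's influence more tightly, but it also distorts the average more, which is the accuracy trade-off described above.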

The Challenging Path to Federated Learning with Differential Privacy
In 2018, we introduced the DP-FedAvg algorithm, which extended the DP-SGD approach to the federated setting with user-level DP guarantees, and in 2020 we deployed this algorithm to mobile devices for the first time. This approach ensures the training mechanism is not too sensitive to any one user’s data, and empirical privacy auditing techniques rule out some forms of memorization.

However, while the amplification-via-sampling argument is essential to providing a strong DP guarantee for DP-FedAvg, in a real-world cross-device FL system ensuring devices are subsampled precisely and uniformly at random from a large population would be complex and hard to verify. One challenge is that devices choose when to connect (or “check in”) based on many external factors (e.g., requiring the device is idle, on unmetered WiFi, and charging), and the number of available devices can vary substantially.

Achieving a formal privacy guarantee requires a protocol that does all of the following:

  • Makes progress on training even as the set of devices available varies significantly with time.
  • Maintains privacy guarantees even in the face of unexpected or arbitrary changes in device availability.
  • For efficiency, allows client devices to locally decide whether they will check in to the server in order to participate in training, independent of other devices.

Initial work on privacy amplification via random check-ins highlighted these challenges and introduced a feasible protocol, but it would have required complex changes to our production infrastructure to deploy. Further, as with the amplification-via-sampling analysis of DP-SGD, the privacy amplification possible with random check-ins depends on a large number of devices being available. For example, if only 1000 devices are available for training, and participation of at least 1000 devices is needed in each training step, that requires either 1) including all devices currently available and paying a large privacy cost since there is no randomness in the selection, or 2) pausing the protocol and not making progress until more devices are available.

Achieving Provable Differential Privacy for Federated Learning with DP-FTRL
To address this challenge, the DP-FTRL algorithm is built on two key observations: 1) the convergence of gradient-descent-style algorithms depends primarily not on the accuracy of individual gradients, but the accuracy of cumulative sums of gradients; and 2) we can provide accurate estimates of cumulative sums with a strong DP guarantee by utilizing negatively correlated noise, added by the aggregating server: essentially, adding noise to one gradient and subtracting that same noise from a later gradient. DP-FTRL accomplishes this efficiently using the Tree Aggregation algorithm [1, 2].
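
The cumulative-sum idea can be sketched with the basic binary-tree mechanism: each dyadic block of steps gets one noise draw, and the estimate for the first t steps adds up only the blocks that tile [0, t). This is a toy scalar version under stated assumptions, not the DP-FTRL implementation: real updates are clipped vectors, `sigma` is not calibrated to any privacy target here, and the function names are hypothetical.

```python
import random

def dyadic_nodes(t):
    """Return the (start, length) dyadic blocks whose union is steps [0, t)."""
    nodes, start = [], 0
    for bit in reversed(range(t.bit_length())):
        length = 1 << bit
        if t & length:
            nodes.append((start, length))
            start += length
    return nodes

def tree_prefix_sums(values, sigma, seed=0):
    """Noisy estimates of every prefix sum of values via tree aggregation."""
    rng = random.Random(seed)
    noise = {}  # one Gaussian draw per tree node, reused whenever the node recurs
    estimates = []
    for t in range(1, len(values) + 1):
        total = 0.0
        for start, length in dyadic_nodes(t):
            if (start, length) not in noise:
                noise[(start, length)] = rng.gauss(0.0, sigma)
            total += sum(values[start:start + length]) + noise[(start, length)]
        estimates.append(total)
    return estimates

true_steps = [1.0] * 8  # stand-in for per-round gradient sums
noisy_sums = tree_prefix_sums(true_steps, sigma=0.5)
```

With T steps, each prefix estimate carries at most ⌊log2 T⌋ + 1 noise terms, whereas summing T independently noised values accumulates T of them. Because consecutive estimates share tree nodes, the noise in their differences partly cancels, which is the negative correlation described above.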

The graphic below illustrates how estimating cumulative sums rather than individual gradients can help. We look at how the noise introduced by DP-FTRL and DP-SGD influences model training, compared to the true gradients (without added noise; in black) which step one unit to the right on each iteration. The individual DP-FTRL gradient estimates (blue), based on cumulative sums, have larger mean-squared-error than the individually-noised DP-SGD estimates (orange), but because the DP-FTRL noise is negatively correlated, some of it cancels out from step to step, and the overall learning trajectory stays closer to the true gradient descent steps.

To provide a strong privacy guarantee, we limit the number of times a user contributes an update. Fortunately, sampling-without-replacement is relatively easy to implement in production FL infrastructure: each device can remember locally which models it has contributed to in the past, and choose to not connect to the server for any later rounds for those models.

Production Training Details and Formal DP Statements
For the production DP-FTRL deployment introduced above, each eligible device maintains a local training cache consisting of user keyboard input, and when participating computes an update to the model which makes it more likely to suggest the next word the user actually typed, based on what has been typed so far. We ran DP-FTRL on this data to train a recurrent neural network with ~1.3M parameters. Training ran for 2000 rounds over six days, with 6500 devices participating per round. To allow for the DP guarantee, devices participated in training at most once every 24 hours. Model quality improved over the previous DP-FedAvg trained model, which offered empirically-tested privacy advantages over non-DP models, but lacked a meaningful formal DP guarantee.

The training mechanism we used is available in open-source in TensorFlow Federated and TensorFlow Privacy, and with the parameters used in our production deployment it provides a meaningfully strong privacy guarantee. Our analysis gives ρ=0.81 zCDP at the user level (treating all the data on each device as a different user), where smaller numbers correspond to better privacy in a mathematically precise way. As a comparison, this is stronger than the ρ=2.63 zCDP guarantee chosen by the 2020 US Census.
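
For readers more used to (ε, δ)-DP, the sketch below applies the standard conversion bound: ρ-zCDP implies (ρ + 2√(ρ ln(1/δ)), δ)-DP. The value of δ is an arbitrary illustrative choice rather than a parameter of the deployment, and tighter conversions exist, so the resulting ε values are upper bounds only.

```python
import math

def zcdp_to_epsilon(rho, delta):
    """Upper bound on epsilon such that rho-zCDP implies (epsilon, delta)-DP."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))

delta = 1e-10  # illustrative choice, not taken from the deployment
eps_gboard = zcdp_to_epsilon(0.81, delta)  # rho reported for the Gboard model
eps_census = zcdp_to_epsilon(2.63, delta)  # rho quoted for the 2020 US Census

print(f"rho=0.81 -> epsilon <= {eps_gboard:.2f}")
print(f"rho=2.63 -> epsilon <= {eps_census:.2f}")
```

Since the bound increases with ρ, a smaller ρ always maps to a smaller ε at the same δ, consistent with the comparison above.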

Next Steps
While we have reached the milestone of deploying a production FL model using a mechanism that provides a meaningfully small zCDP, our research journey continues. We are still far from being able to say this approach is possible (let alone practical) for most ML models or product applications, and other approaches to private ML exist. For example, membership inference tests and other empirical privacy auditing techniques can provide complementary safeguards against leakage of users’ data. Most importantly, we see training models with user-level DP with even a very large zCDP as a substantial step forward, because it requires training with a DP mechanism that bounds the sensitivity of the model to any one user’s data. Further, it smooths the road to later training models with improved privacy guarantees as better algorithms or more data become available. We are excited to continue the journey toward maximizing the value that ML can deliver while minimizing potential privacy costs to those who contribute training data.

The authors would like to thank Alex Ingerman and Om Thakkar for significant impact on the blog post itself, as well as the teams at Google that helped develop these ideas and bring them to practice:

  • Core research team: Galen Andrew, Borja Balle, Peter Kairouz, Daniel Ramage, Shuang Song, Thomas Steinke, Andreas Terzis, Om Thakkar, Zheng Xu
  • FL infrastructure team: Katharine Daly, Stefan Dierauf, Hubert Eichner, Igor Pisarev, Timon Van Overveldt, Chunxiang Zheng
  • Gboard team: Angana Ghosh, Xu Liu, Yuanbo Zhang
  • Speech team: Françoise Beaufays, Mingqing Chen, Rajiv Mathews, Vidush Mukund, Igor Pisarev, Swaroop Ramaswamy, Dan Zivkovic


Using a tensorflow model as a loss function

I am trying to use an empirical metric as a loss function to train a Tensorflow model. Calculating the metric function is slow, but I can train a regression neural network to accurately and quickly predict the metric score after it is trained. Is there a straightforward way (or tutorial?) to use a trained Tensorflow or scikit-learn model as a custom loss function for a Tensorflow model?

Edit: I have found this StackOverflow entry as a starting point. I will try it out and report back.

submitted by /u/baudie


Constrained Reweighting for Training Deep Neural Nets with Noisy Labels

Over the past several years, deep neural networks (DNNs) have been quite successful in driving impressive performance gains in several real-world applications, from image recognition to genomics. However, modern DNNs often have far more trainable model parameters than the number of training examples and the resulting overparameterized networks can easily overfit to noisy or corrupted labels (i.e., examples that are assigned a wrong class label). As a consequence, training with noisy labels often leads to degradation in accuracy of the trained model on clean test data. Unfortunately, noisy labels can appear in several real-world scenarios due to multiple factors, such as errors and inconsistencies in manual annotation and the use of inherently noisy label sources (e.g., the internet or automated labels from an existing system).

Earlier work has shown that representations learned by pre-training large models with noisy data can be useful for prediction when used in a linear classifier trained with clean data. In principle, it is possible to directly train machine learning (ML) models on noisy data without resorting to this two-stage approach. To be successful, such alternative methods should have the following properties: (i) they should fit easily into standard training pipelines with little computational or memory overhead; (ii) they should be applicable in “streaming” settings where new data is continuously added during training; and (iii) they should not require data with clean labels.

In “Constrained Instance and Class Reweighting for Robust Learning under Label Noise”, we propose a novel and principled method, named Constrained Instance reWeighting (CIW), with these properties that works by dynamically assigning importance weights both to individual instances and to class labels in a mini-batch, with the goal of reducing the effect of potentially noisy examples. We formulate a family of constrained optimization problems that yield simple solutions for these importance weights. These optimization problems are solved per mini-batch, which avoids the need to store and update the importance weights over the full dataset. This optimization framework also provides a theoretical perspective for existing label smoothing heuristics that address label noise, such as label bootstrapping. We evaluate the method with varying amounts of synthetic noise on the standard CIFAR-10 and CIFAR-100 benchmarks and observe considerable performance gains over several existing methods.

Training ML models involves minimizing a loss function that indicates how well the current parameters fit to the given training data. In each training step, this loss is approximately calculated as a (weighted) sum of the losses of individual instances in the mini-batch of data on which it is operating. In standard training, each instance is treated equally for the purpose of updating the model parameters, which corresponds to assigning uniform (i.e., equal) weights across the mini-batch.

However, empirical observations made in earlier works reveal that noisy or mislabeled instances tend to have higher loss values than those that are clean, particularly during early to mid-stages of training. Thus, assigning uniform importance weights to all instances means that due to their higher loss values, the noisy instances can potentially dominate the clean instances and degrade the accuracy on clean test data.

Motivated by these observations, we propose a family of constrained optimization problems that solve this problem by assigning importance weights to individual instances in the dataset to reduce the effect of those that are likely to be noisy. This approach provides control over how much the weights deviate from uniform, as quantified by a divergence measure. It turns out that for several types of divergence measures, one can obtain simple formulae for the instance weights. The final loss is computed as the weighted sum of individual instance losses, which is used for updating the model parameters. We call this the Constrained Instance reWeighting (CIW) method. This method allows for controlling the smoothness or peakiness of the weights through the choice of divergence and a corresponding hyperparameter.
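
As one concrete instance, take the KL divergence from the uniform distribution as the divergence measure (an assumption made for this sketch). Minimizing the weighted loss under a KL constraint then gives weights proportional to exp(-loss/λ), where λ plays the role of the hyperparameter. The snippet below is illustrative only, with hypothetical function names and made-up loss values; it is not the released code.

```python
import math

def ciw_weights_kl(losses, lam):
    """Instance weights proportional to exp(-loss / lam), normalized to sum to 1."""
    lowest = min(losses)  # subtract the minimum for numerical stability
    unnorm = [math.exp(-(loss - lowest) / lam) for loss in losses]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def reweighted_loss(losses, lam):
    """Weighted sum of per-instance losses, with the weights treated as constants."""
    weights = ciw_weights_kl(losses, lam)
    return sum(w * loss for w, loss in zip(weights, losses))

# Hypothetical mini-batch: the 2.5 stands in for a likely mislabeled instance.
batch_losses = [0.2, 0.3, 2.5, 0.25]
weights = ciw_weights_kl(batch_losses, lam=1.0)
```

A small λ concentrates the weight on low-loss instances (peaky weights), while a large λ pushes the weights back toward uniform, which recovers standard training.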

Schematic of the proposed Constrained Instance reWeighting (CIW) method.

Illustration with Decision Boundary on a 2D Dataset
As an example to illustrate the behavior of this method, we consider a noisy version of the Two Moons dataset, which consists of randomly sampled points from two classes in the shape of two half moons. We corrupt 30% of the labels and train a multilayer perceptron network on it for binary classification. We use the standard binary cross-entropy loss and an SGD with momentum optimizer to train the model. In the figure below (left panel), we show the data points and visualize an acceptable decision boundary separating the two classes with a dotted line. The points marked red in the upper half-moon and those marked green in the lower half-moon indicate noisy data points.

The baseline model trained with the binary cross-entropy loss assigns uniform weights to the instances in each mini-batch, thus eventually overfitting to the noisy instances and resulting in a poor decision boundary (middle panel in the figure below).

The CIW method reweights the instances in each mini-batch based on their corresponding loss values (right panel in the figure below). It assigns larger weights to the clean instances that are located on the correct side of the decision boundary and damps the effect of noisy instances that incur a higher loss value. Smaller weights for noisy instances help in preventing the model from overfitting to them, thus allowing the model trained with CIW to successfully converge to a good decision boundary by avoiding the impact of label noise.

Illustration of decision boundary as the training proceeds for the baseline and the proposed CIW method on the Two Moons dataset. Left: Noisy dataset with a desirable decision boundary. Middle: Decision boundary for standard training with cross-entropy loss. Right: Training with the CIW method. The size of the dots in (middle) and (right) is proportional to the importance weights assigned to these examples in the minibatch.



Constrained Class reWeighting
Instance reweighting assigns lower weights to instances with higher losses. We further extend this intuition to assign importance weights over all possible class labels. Standard training uses a one-hot label vector as the class weights, assigning a weight of 1 to the labeled class and 0 to all other classes. However, for the potentially mislabeled instances, it is reasonable to assign non-zero weights to classes that could be the true label. We obtain these class weights as solutions to a family of constrained optimization problems where the deviation of the class weights from the label one-hot distribution, as measured by a divergence of choice, is controlled by a hyperparameter.

Again, for several divergence measures, we can obtain simple formulae for the class weights. We refer to this as Constrained Instance and Class reWeighting (CICW). The solution to this optimization problem also recovers the earlier proposed methods based on static label bootstrapping (also referred to as label smoothing) when the divergence is taken to be total variation distance. This provides a theoretical perspective on the popular method of static label bootstrapping.

Using Instance Weights with Mixup
We also propose a way to use the obtained instance weights with mixup, which is a popular method for regularizing models and improving prediction performance. It works by sampling a pair of examples from the original dataset and generating a new artificial example using a random convex combination of these. The model is trained by minimizing the loss on these mixed-up data points. Vanilla mixup is oblivious to the individual instance losses, which might be problematic for noisy data because mixup will treat clean and noisy examples equally. Since a high instance weight obtained with our CIW method is more likely to indicate a clean example, we use our instance weights to do a biased sampling for mixup and also use the weights in convex combinations (instead of random convex combinations in vanilla mixup). This results in biasing the mixed-up examples towards clean data points, which we refer to as CICW-Mixup.
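
The difference between vanilla and weight-biased mixup can be sketched as follows. The exact sampling scheme and mixing coefficient are not spelled out above, so both are assumptions here: pairs are drawn in proportion to the instance weights, and the coefficient is the first example's share of the pair's total weight. Function names are hypothetical.

```python
import random

def mixup(x_a, x_b, coeff):
    """Convex combination of two examples (feature vectors or one-hot labels)."""
    return [coeff * a + (1.0 - coeff) * b for a, b in zip(x_a, x_b)]

def vanilla_mixup_pair(features, rng, alpha=1.0):
    """Vanilla mixup: uniform pair sampling and a random Beta(alpha, alpha) coefficient."""
    i, j = rng.randrange(len(features)), rng.randrange(len(features))
    return mixup(features[i], features[j], rng.betavariate(alpha, alpha))

def weighted_mixup_pair(features, weights, rng):
    """Weight-biased mixup: sample by instance weight, mix in proportion to the weights."""
    i, j = rng.choices(range(len(features)), weights=weights, k=2)
    coeff = weights[i] / (weights[i] + weights[j])
    return mixup(features[i], features[j], coeff)

rng = random.Random(0)
features = [[0.0, 1.0], [1.0, 0.0], [5.0, 5.0]]  # toy inputs; the last one is "noisy"
weights = [0.45, 0.45, 0.10]                     # hypothetical instance weights
mixed = weighted_mixup_pair(features, weights, rng)
```

Because the low-weight example is both sampled less often and given a smaller coefficient when it is sampled, the mixed-up points lean toward examples that are more likely to be clean.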

We apply these methods with varying amounts of synthetic noise (i.e., the label for each instance is randomly flipped to other labels) on the standard CIFAR-10 and CIFAR-100 benchmark datasets. We show the test accuracy on clean data with symmetric synthetic noise where the noise rate is varied between 0.2 and 0.8.

We observe that the proposed CICW outperforms several methods and matches the results of dynamic mixup, which maintains the importance weights over the full training set with mixup. Using our importance weights with mixup in CICW-Mixup resulted in significantly improved performance versus these methods, particularly for larger noise rates (as shown by lines above and to the right in the graphs below).

Test accuracy on clean data while varying the amount of symmetric synthetic noise in the training data for CIFAR-10 and CIFAR-100. Methods compared are: standard Cross-Entropy Loss (CE), Bi-tempered Loss, Active-Passive Normalized Loss, the proposed CICW, Mixup, Dynamic Mixup, and the proposed CICW-Mixup.

Summary and Future Directions
We formulate a novel family of constrained optimization problems for tackling label noise that yield simple mathematical formulae for reweighting the training instances and class labels. These formulations also provide a theoretical perspective on existing label smoothing–based methods for learning with noisy labels. We also propose ways of using the instance weights with mixup that result in further significant performance gains over instance and class reweighting. Our method operates solely at the level of mini-batches, which avoids the extra overhead of maintaining dataset-level weights as in some of the recent methods.

As a direction for future work, we would like to evaluate the method on realistic noisy labels that are encountered in large scale practical settings. We also believe that studying the interaction of our framework with label smoothing is an interesting direction that can result in a loss adaptive version of label smoothing. We are also excited to release the code for CICW, now available on GitHub.

We’d like to thank Kevin Murphy for providing constructive feedback during the course of the project.


Doubling all2all Performance with NVIDIA Collective Communication Library 2.12

The NCCL 2.12 release significantly improves all2all communication collective performance, with the PXN feature.

Collective communications are a performance-critical ingredient of modern distributed AI training workloads such as recommender systems and natural language processing.

NVIDIA Collective Communication Library (NCCL), a Magnum IO Library, implements GPU-accelerated collective operations:

  • all-gather
  • all-reduce
  • broadcast
  • reduce
  • reduce-scatter
  • point-to-point send and receive

NCCL is topology-aware and is optimized to achieve high bandwidth and low latency over PCIe, NVLink, Ethernet, and InfiniBand interconnect. NCCL GCP plugin and NCCL AWS plugin enable high-performance NCCL operations in popular cloud environments with custom network connectivity.

NCCL releases have been relentlessly focusing on improving collective communication performance. This post focuses on the improvements that come with the NCCL 2.12 release.

Combining NVLink and network communication

The new feature introduced in NCCL 2.12 is called PXN, for PCI × NVLink, because it enables a GPU to communicate with a NIC on the node through NVLink and then PCI. This is instead of going through the CPU using QPI or other inter-CPU protocols, which would not be able to deliver full bandwidth. That way, even though each GPU still tries to use its local NIC as much as possible, it can reach other NICs if required.

Instead of preparing a buffer on its local memory for the local NIC to send, the GPU prepares a buffer on an intermediate GPU, writing to it through NVLink. It then notifies the CPU proxy managing that NIC that the data is ready, instead of notifying its own CPU proxy. The GPU-CPU synchronization might be a little slower because it may have to cross CPU sockets, but the data itself only uses NVLink and PCI switches, guaranteeing maximum bandwidth.

Topology that shows NIC0s of all DGXs connected to the same switch, NIC1s to another leaf switch and so on.
Figure 1. Rail-optimized topology

In the topology in Figure 1, NIC-0 from each DGX system is connected to the same leaf switch (L0), NIC-1s are connected to the same leaf switch (L1), and so on. Such a design is often called rail-optimized. Rail-optimized network topology helps maximize all-reduce performance while minimizing network interference between flows. It can also reduce the cost of the network by having lighter connections between rails.

PXN leverages NVIDIA NVSwitch connectivity between GPUs within the node to first move data to a GPU on the same rail as the destination, then send it to the destination without crossing rails. That enables message aggregation and network traffic optimization.

Topology shows PXN avoiding second-tier spine switches.
Figure 2. Example message path from GPU0 in DGX-A to GPU3 in DGX-B

Before NCCL 2.12, the message in Figure 2 would have traversed through three hops of network switches (L0, S1, and L3), potentially causing contention and being slowed down by other traffic. The messages passed between the same pair of NICs are aggregated to maximize effective message rate and network bandwidth.
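
The routing difference can be sketched with a toy model of a rail-optimized network. This is not NCCL code: it assumes GPU k sits next to NIC k, NIC k connects to leaf switch Lk, and any cross-rail traffic passes through some spine switch, labeled generically here. The function name is hypothetical.

```python
def switch_hops(src_gpu, dst_gpu, use_pxn):
    """List the network switches a message crosses between two nodes.

    Assumes GPU k is local to NIC k, which connects to leaf switch Lk.
    """
    if src_gpu == dst_gpu:
        return [f"L{src_gpu}"]  # same rail: one leaf switch either way
    if use_pxn:
        # Data first moves over NVLink to the local GPU on the destination's rail,
        # then leaves through that GPU's NIC, so it never crosses rails.
        return [f"L{dst_gpu}"]
    # Without PXN the message leaves on the source rail and crosses a spine switch.
    return [f"L{src_gpu}", "spine", f"L{dst_gpu}"]

# GPU0 on one node sending to GPU3 on another, as in the example above.
print(switch_hops(0, 3, use_pxn=False))
print(switch_hops(0, 3, use_pxn=True))
```

In this model, PXN turns every cross-rail transfer into a single leaf-switch hop, at the cost of one extra NVLink copy inside the sending node.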

Message aggregation

With PXN, all GPUs on a given node move their data onto a single GPU for a given destination. This enables the network layer to aggregate messages, by implementing a new multireceive function. The function enables the remote CPU proxy to send all messages as one as soon as they are all ready.

For example, if a GPU on a node is performing an all2all operation and is to receive data from all eight GPUs from a remote node, NCCL calls a multireceive with eight buffers and sizes. On the sender side, the network layer can then wait until all eight sends are ready, then send all eight messages at one time, which can have a significant effect on the message rate.
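As a rough sketch of that sender-side behavior, the following Python class buffers messages until all expected sends are ready and then emits them as a single network send. The class and method names are hypothetical and do not correspond to the NCCL network plugin API; the sketch only models the counting.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AggregatingSender:
    """Hypothetical model of a sender that batches messages for one destination."""
    expected: int                      # number of sends the receiver posted buffers for
    ready: Dict[int, bytes] = field(default_factory=dict)
    network_sends: int = 0             # how many times the network was actually used

    def mark_ready(self, src_gpu: int, payload: bytes) -> Optional[List[bytes]]:
        """Record one GPU's data; send everything once all GPUs are ready."""
        self.ready[src_gpu] = payload
        if len(self.ready) < self.expected:
            return None                # keep waiting for the remaining GPUs
        batch = [self.ready[gpu] for gpu in sorted(self.ready)]
        self.ready.clear()
        self.network_sends += 1        # one aggregated send instead of `expected` sends
        return batch


if __name__ == "__main__":
    sender = AggregatingSender(expected=8)
    batch = None
    for gpu in range(8):
        batch = sender.mark_ready(gpu, f"data-from-gpu{gpu}".encode())
    print(f"{len(batch)} messages delivered in {sender.network_sends} network send(s)")
```

Without aggregation the same exchange would use eight separate sends, so the message rate seen by the NIC drops by a factor of eight in this toy model.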

Another aspect of message aggregation is that connections are now shared between all GPUs of a node for a given destination. This means fewer connections to establish. It can also affect the routing efficiency, if the routing algorithm was relying on having a lot of different connections to get good entropy.

PXN improves all2all performance

Diagram shows All2all is like a matrix transpose operation using a 4x4 matrix example.
Figure 3. all2all collective operation across four participating processes

Figure 3 shows that all2all entails communication from each process to every other process. In other words, the number of messages exchanged as part of an all2all operation in an N-GPU cluster is $O(N^{2})$.
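The transpose view of all2all can be sketched in plain Python. This is a single-process illustration, not an NCCL call: `all2all` and `message_count` are hypothetical helper names, and the count assumes a process does not send a network message to itself.

```python
from typing import List, TypeVar

T = TypeVar("T")


def all2all(send_buffers: List[List[T]]) -> List[List[T]]:
    """Simulate all2all: chunk j of process i ends up as chunk i of process j."""
    n = len(send_buffers)
    if any(len(row) != n for row in send_buffers):
        raise ValueError("each process must provide one chunk per process")
    # This is exactly a matrix transpose of the send buffers.
    return [[send_buffers[i][j] for i in range(n)] for j in range(n)]


def message_count(n: int) -> int:
    """Messages exchanged between distinct processes: n * (n - 1), i.e. O(n^2)."""
    return n * (n - 1)


if __name__ == "__main__":
    n = 4
    send = [[f"{i}->{j}" for j in range(n)] for i in range(n)]
    recv = all2all(send)
    for rank, row in enumerate(recv):
        print(f"process {rank} receives {row}")
    print(f"{message_count(n)} messages for {n} processes")
```

Doubling the number of processes roughly quadruples the message count, which is why all2all is sensitive to message rate.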

The messages exchanged between the GPUs are distinct and can’t be optimized using algorithms such as tree/ring (used for allreduce). When you run models with billions of parameters across hundreds of GPUs, the number of messages can trigger congestion, create network hotspots, and adversely affect performance.

As discussed earlier, PXN combines NVLink and PCI communications to reduce traffic flow through the second-tier spine switches and optimizes network traffic. It also improves message rates by aggregating up to eight messages into one. Both improvements significantly improve all2all performance. 

all-reduce on 1:1 GPU:NIC topologies

Another problem that PXN solves is the case of topologies where there is a single GPU close to each NIC. The ring algorithm requires two GPUs to be close to each NIC. Data must go from the network to a first GPU, go around all GPUs through NVLink, and then exit from the last GPU onto the network. The first and last GPUs must both be close to the NIC. The first GPU must be able to receive from the network efficiently, and the last GPU must be able to send through the network efficiently. If only one GPU is close to a given NIC, then you cannot close the ring and must send data through the CPU, which can heavily affect performance.

With PXN, as long as the last GPU can access the first GPU through NVLink, it can move its data to the first GPU. The data is sent from there to the NIC, keeping all transfers local to PCI switches.

This case is not only relevant for PCI topologies featuring one GPU and one NIC per PCI switch but can also happen on other topologies when an NCCL communicator only includes a subset of GPUs. Consider a node with eight GPUs interconnected with an NVLink hypercube mesh.

Diagram shows the DGX-1 hypercube mesh topology with GPUs, NVSWITCHes and PCIe switches.
Figure 4. Network topology in an NVIDIA DGX-1 system

Figure 5 shows a ring that can be formed by leveraging the high-bandwidth NVLink connections that are available in the topology when the communicator includes all eight GPUs in the system. This is possible because GPU0 and GPU1 share access to the same local NIC.

Diagram shows an example ring path used by NCCL touching each GPU in the system exactly once only using NVLINKs.
Figure 5. Example ring path used by NCCL

The communicator can also include only a subset of the GPUs, for example GPUs 0, 2, 4, and 6. In that case, creating rings is impossible without crossing rails: rings entering the node from GPU 0 would have to exit from GPU 2, 4, or 6, none of which has direct access to the local NICs of GPU 0 (NICs 0 and 1).

On the other hand, PXN enables rings to be formed as GPU 2 could move data back to GPU 0 before going through NIC 0/1.
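The condition can be sketched as a small check in Python. The GPU-to-PCI-switch mapping below is an assumption made for illustration (consecutive GPU pairs share a PCI switch and its NICs), the function name is hypothetical, and the sketch assumes that every GPU in the communicator can reach the entry GPU through NVLink.

```python
from typing import Dict, Iterable

# Assumed locality for illustration only: GPUs 0-1 share a PCI switch (and its
# NICs), GPUs 2-3 share the next one, and so on.
LOCAL_SWITCH: Dict[int, int] = {gpu: gpu // 2 for gpu in range(8)}


def can_close_ring(comm_gpus: Iterable[int], use_pxn: bool) -> bool:
    """Can a ring enter and exit the node through the same local NICs?"""
    gpus = sorted(set(comm_gpus))
    if not gpus:
        return False
    if use_pxn or len(gpus) == 1:
        # With PXN the last GPU hands its data back to the first GPU over
        # NVLink, so one GPU close to the NIC is enough.
        return True
    # Without PXN the entry GPU and the exit GPU must be two different GPUs
    # that are both close to the same NIC.
    switches = [LOCAL_SWITCH[gpu] for gpu in gpus]
    return len(set(switches)) < len(switches)


if __name__ == "__main__":
    for comm in (range(8), [0, 2, 4, 6], [0, 4]):
        gpus = list(comm)
        print(gpus,
              "without PXN:", can_close_ring(gpus, use_pxn=False),
              "with PXN:", can_close_ring(gpus, use_pxn=True))
```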

This case is common with model parallelism, depending on how the model is split. For example, if a model is split across GPUs 0-3, a second copy of the model runs on GPUs 4-7. GPUs 0 and 4 then handle the same part of the model, and an NCCL communicator is created with GPUs 0 and 4 from all nodes to perform all-reduce operations for the corresponding layers. Those communicators can’t perform all-reduce operations efficiently without PXN.

The only way to have efficient model parallelism so far was to split the model on GPUs 0, 2, 4, 6 and 1, 3, 5, 7 so that NCCL subcommunicators would include GPUs [0,1], [2,3], [4,5] and [6,7] instead of [0,4], [1,5], [2,6], and [3,7]. The new PXN feature gives you more flexibility and eases the use of model parallelism.

Histogram shows more than 2X improvement when using PXN.
Figure 6. NCCL 2.12 PXN performance improvements

Figure 6 contrasts the time to complete all2all collective operations with and without PXN. In addition, PXN enables a more flexible choice of GPUs for all-reduce operations.


The NCCL 2.12 release significantly improves all2all communication collective performance. Download the latest NCCL release and experience the improved performance firsthand.


TensorFlow 1.15 C++ API Documentation


I train TensorFlow models using the Python version, but I need them to run using the C++ API. I had previously compiled version 1.11 and used the models successfully; however, when trying to use the same program with version 1.15, it fails with a segmentation fault. I assume the way a model loads changed between 1.11 and 1.15 in the C++ API, but I cannot for the life of me find documentation for version 1.15 of the C++ API, since the official site always redirects to version 2.8.

Could anyone please point me to the v1.15 documentation of the C++ API?

submitted by /u/quinseptopol


TinyML Monitoring Air Quality on an 8-bit Microcontroller

I’d like to share my experiment on how to easily create your own tiny machine learning model and run inferences on a microcontroller to detect the concentration of various gases. I will illustrate the whole process with my example of detecting the concentration of benzene (C6H6(GT)) based on the concentration of other recorded compounds.

Things I used in this project: Arduino Mega 2560, Neuton Tiny ML software

To my mind, such simple solutions may contribute to improving the air pollution problem which now causes serious concerns. In fact, the World Health Organization estimates that over seven million people die prematurely each year from diseases caused by air pollution. Can you imagine that?

As such, more and more organizations responsible for monitoring emissions need effective tools at their disposal to monitor air quality in a timely way, and TinyML solutions seem to be the best technology for that. They are low-energy, cheap to produce, and don’t require a permanent Internet connection. I believe these factors will promote the mass implementation of TinyML as a great opportunity to create AI-based devices and successfully solve various challenges.

Therefore, in my experiment, I take the most primitive 8-bit MCU to show that even such a device today can have ML models in it.

Dataset description:

My dataset contained 5875 rows of hourly averaged responses from an array of oxide chemical sensors located in the field in a polluted area in Italy, at road level. Hourly averaged concentrations of CO, non-methane hydrocarbons (NMHC), benzene, total nitrogen oxides (NOx), and nitrogen dioxide (NO2) were provided.

It is a regression problem.

Target metric – MAE (Mean Absolute Error). Target – C6H6(GT).
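For reference, MAE is the average of the absolute differences between the true and predicted values. Here is a minimal Python sketch of the metric; the numbers in the example are made up for illustration and are not taken from my dataset or model.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """MAE = sum(|true - predicted|) / number of samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Illustrative values only (not real measurements).
    actual = [10.0, 12.5, 8.0, 15.0]
    predicted = [9.0, 13.5, 8.5, 14.0]
    # Absolute errors are 1.0, 1.0, 0.5, 1.0, so the MAE is 3.5 / 4 = 0.875.
    print(mean_absolute_error(actual, predicted))
```

The lower the MAE, the closer the predicted benzene concentration is to the reference value on average.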

Attribute Information:

RH – Relative humidity;

AH – Absolute humidity;

T – Temperature in °C;

PT08.S3(NOx) – Tungsten oxide. Hourly averaged sensor response (nominally NOx targeted);

PT08.S4(NO2) – Tungsten oxide. Hourly averaged sensor response (nominally NO2 targeted);

PT08.S5(O3) – Indium oxide. Hourly averaged sensor response (nominally O3 targeted);

PT08.S1(CO) – Tin oxide. Hourly averaged sensor response (nominally CO targeted);

CO(GT) – True hourly averaged concentration of CO in mg/m^3 (reference analyzer);

PT08.S2(NMHC) – Titania. Hourly averaged sensor response (nominally NMHC targeted).

You can see more details and download the dataset here: ​​

Step 1: Model Training

The model was created and trained with a free tool, Neuton TinyML, as I needed a super compact model that would fit into a tiny microcontroller with 8-bit precision. I had tried to build such a model with TensorFlow before, but it was too large to run on an 8-bit device.

To train the model, I converted the dataset into a CSV file, uploaded it to the platform, and selected the column that should be trained to make predictions.

The trained model had the following characteristics:
The model turned out to be super compact, having only 38 coefficients and 0.234 KB in size!

Additionally, I created models with TF and TF Lite and measured metrics on the same dataset. The comparison speaks louder than words. Also, as I said above, TF models still cannot run operations on 8 bits, but it was interesting for me to use just such a primitive device.

Step 2: Embedding into a Microcontroller

Upon completion of training, I downloaded the archive which contained all the necessary files, including meta-information about the model in two formats (binary, and HEX), calculator, Neuton library, and the implementation file.

Since I couldn’t run the experiment in field conditions with real gases, I developed a simple protocol to stream data from a computer.

Step 3: Running Inference on the Microcontroller

I connected a microcontroller on which the prediction was performed to a computer via a serial port, so signals were received in a binary format.

The microcontroller was programmed to turn on the red LED if the concentration of benzene was exceeded, and the green LED – if the concentration was within permitted limits. Check out the videos below to see how it worked.

In this case, the concentration of benzene is within reasonable bounds (<15 mg/m3).

In this case, the concentration of benzene exceeds the limits (>15 mg/m3).


My example vividly illustrates how everyone can easily use the TinyML approach to create compact but smart devices, even with 8-bit precision. I’m convinced that the low production costs and high efficiency of TinyML open up enormous opportunities for its worldwide implementation.

Because no technical specialists need to be involved, in this particular case even non-data scientists can rapidly build super compact models and place smart AI-driven devices throughout an area to monitor air quality in real time. To my mind, it’s really inspiring that such small solutions can help us improve the environmental situation on a global scale!

submitted by /u/literallair


[tf.js] Is there an equivalent to Keras’ Resizing layer?

I’m using tensorflow.js and I need a layer that can take in an image and output the image resized at a new resolution (bilinear filtering is fine.) I can’t find one in the tf.js API so I’m not sure what I can use. I need to make sure the model can still be serialized to disk, so I think writing a custom layer class might be off the table.

Any help would be appreciated.

submitted by /u/SaltyKoopa


Tensorflow not working in Jupyter Notebook (Anaconda)

I did this on Linux Mint, an Ubuntu variant, and I did run pip install tensorflow and all of that. Granted, when I try pip3 install tensorflow, I get a wall of red text.

Relevant code is:

import tensorflow as tf

from tensorflow import keras

from tensorflow.keras import layers

#Defining the model using default linear activation function





2022-02-27 11:34:48.204164: I tensorflow/compiler/jit/] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-02-27 11:34:48.204436: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ‘’; dlerror: cannot open shared object file: No such file or directory
2022-02-27 11:34:48.204446: W tensorflow/stream_executor/cuda/] failed call to cuInit: UNKNOWN ERROR (303)
2022-02-27 11:34:48.204464: I tensorflow/stream_executor/cuda/] kernel driver does not appear to be running on this host (term-IdeaPad-Flex): /proc/driver/nvidia/version does not exist
2022-02-27 11:34:48.204624: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-27 11:34:48.204886: I tensorflow/compiler/jit/] Not creating XLA devices, tf_xla_enable_xla_devices not set

I have an all AMD laptop, but I don’t see how you need NVIDIA when I see people using it in virtual machines. If you know of a way for me to fix this, or do something where I can upload to a different site, and have that work, let me know.

submitted by /u/Term_Grecos


Is tensorflow a workable solution for my side project?



I am in the process of learning TensorFlow and I am wondering if TF is a workable solution for what I am trying to achieve. My side project is a chess website where users can submit their chess ratings, and the website then uses their data to compare ratings between different chess websites and orgs. My data set currently has around 7500 rows and looks like this:

7500 rows that look about like this

My backend is a Python API hosted on Heroku. What I would like to achieve is this: once a player enters their ratings, I use machine learning/TensorFlow to predict every rating they left empty. Is that doable with TF in a backend hosted on Heroku?

Also, if anyone has any tips to lead me in the right direction, they are most welcome. I should also note that I suspect this might not be the most appropriate use of TF, or that TF might not be the best solution, but I am using this side project to grow and demonstrate skills I take an interest in.

submitted by /u/DavidDoesChess