I can load text data using tf.keras.utils.text_dataset_from_directory, but to use that, each of my “rows” needs to be in an individual file in its own directory, and the label is determined by the directory name.
For the life of me, I can’t figure out how to convert a pandas DataFrame into a Dataset.
The dataframe would have some structure like this:
index  label  unstructured_text
1      1      i like ice cream
2      0      i don't like ice cream
3      1      i'm a little teapot and i'm happy
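One approach that is commonly suggested (a sketch, assuming the column names from the example above) is to build the Dataset directly from the DataFrame's columns with tf.data.Dataset.from_tensor_slices:

import pandas as pd
import tensorflow as tf

# Example DataFrame matching the structure above
df = pd.DataFrame({
    "label": [1, 0, 1],
    "unstructured_text": [
        "i like ice cream",
        "i don't like ice cream",
        "i'm a little teapot and i'm happy",
    ],
})

# Slice the (text, label) columns into a tf.data.Dataset
dataset = tf.data.Dataset.from_tensor_slices(
    (df["unstructured_text"].tolist(), df["label"].tolist())
)
dataset = dataset.shuffle(len(df)).batch(32)

for texts, labels in dataset.take(1):
    print(texts, labels)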
Is there a practical limit on the size of graph TensorFlow can handle? We are looking at large computation problems that could expand to hundreds of millions or billions of nodes in the graph. I am wondering whether TensorFlow can automatically handle the construction and distributed execution of such large graphs over multiple computers using CPUs, or whether we have to manually break the graph into smaller ones before sending them to TensorFlow.
Any first-hand experience in handling large graphs would be greatly appreciated, such as the maximum graph size you were able to run in TensorFlow and any potential pitfalls.
GauGAN, an AI demo for photorealistic image generation, allows anyone to create stunning landscapes using generative adversarial networks. Named after post-Impressionist painter Paul Gauguin, it was created by NVIDIA Research and can be experienced free through NVIDIA AI Demos.
How to Create With GauGAN
The latest version of the demo, GauGAN2, turns any combination of words and drawings into a photorealistic image.
For months I have wanted to implement a VERY basic RNN layer for a project. It's kind of frustrating because 99% of the rest of the project is done. Only the RNN layer remains, and it's a major point of the project.
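For reference, here is a minimal sketch of a vanilla RNN layer as a custom Keras layer (h_t = tanh(x_t·W_x + h_{t-1}·W_h + b)); the layer name, sizes, and initializers are illustrative, not a drop-in for any particular project:

import tensorflow as tf

class BasicRNN(tf.keras.layers.Layer):
    """Vanilla RNN: h_t = tanh(x_t @ W_x + h_{t-1} @ W_h + b), returning all hidden states."""
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        dim = input_shape[-1]
        self.w_x = self.add_weight(name="w_x", shape=(dim, self.units), initializer="glorot_uniform")
        self.w_h = self.add_weight(name="w_h", shape=(self.units, self.units), initializer="orthogonal")
        self.b = self.add_weight(name="b", shape=(self.units,), initializer="zeros")

    def call(self, inputs):  # inputs: (batch, time, features); assumes a static time dimension
        h = tf.zeros((tf.shape(inputs)[0], self.units))
        outputs = []
        for t in range(inputs.shape[1]):
            h = tf.tanh(inputs[:, t, :] @ self.w_x + h @ self.w_h + self.b)
            outputs.append(h)
        return tf.stack(outputs, axis=1)

# Quick check: BasicRNN(16)(tf.random.normal((2, 10, 8))).shape -> (2, 10, 16)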
Using NVIDIA TAO Toolkit and Innotescus’ data curation and analysis platform to improve a popular object detection model’s performance on the person class by over 20%.
AI applications are powered by machine learning models that are trained to predict outcomes accurately based on input data such as images, text, or audio. Training a machine learning model from scratch requires vast amounts of data and a considerable amount of human expertise, often making the process too expensive and time-consuming for most organizations.
Transfer learning is the happy medium between building a custom model from scratch and choosing an off-the-shelf commercial model to integrate into an ML application. With transfer learning, you can select a pretrained model that’s related to your solution and retrain it on data reflecting your specific use case. Transfer learning strikes the right balance between the custom-everything approach (often too expensive) and an off-the-shelf approach (often too rigid) and enables you to build tailored solutions with fewer resources.
The NVIDIA TAO Toolkit enables you to apply transfer learning to pretrained models and create custom, production-ready models without the complexity of AI frameworks. To train these models, high-quality data is a must. While TAO focuses on the model-centric steps of the development process, Innotescus focuses on the data-centric steps.
Innotescus is a web-based platform for annotating, analyzing, and curating robust, unbiased datasets for computer vision–based machine learning. Innotescus helps teams scale operations without sacrificing quality. The platform includes automated and assisted annotation for both images and videos, consensus and review features for QA processes, and interactive analytics for proactive dataset analysis and balancing. Together, Innotescus and the TAO toolkit make it cost effective for organizations to apply transfer learning successfully in custom applications, arriving at high performing solutions in little time.
In this post, we address the challenges of building a robust object detection model by integrating the NVIDIA TAO Toolkit with Innotescus. This solution alleviates several common pain points that businesses encounter when building and deploying commercial solutions.
YOLO object detection model
Your goal in this project is to apply transfer learning to the YOLO object detection model in the TAO toolkit using data curated on Innotescus.
Object detection is the ability to localize and classify objects with a bounding box in an image or video. It is one of the most widely used applications of computer vision technology. Object detection solves many complex, real-world challenges, such as the following:
Context and scene understanding
Automating solutions for smart retail
Autonomous driving
Precision agriculture
Why should you use YOLO for this model? Traditionally, deep learning–based object detectors operate through a two-stage process. In the first stage, the model identifies regions of interest in an image. In the second stage, each of these regions is classified.
Typically, many regions are sent to the classification stage, and because classification is an expensive operation, two-stage object detectors are extremely slow. YOLO stands for “You only look once.” As the name suggests, YOLO can localize and classify simultaneously, leading to highly accurate real-time performance, which is essential for most deployable solutions. In April 2020, the fourth iteration of YOLO was published. It has been tested on a multitude of applications and industries and has proven to be robust.
Figure 1 shows the general pipeline for training object detection models. For each step of this more traditional development pipeline, we discuss the typical challenges that people encounter and how the combination of TAO and Innotescus solves these problems.
Figure 1. Typical AI development workflow
Before you begin, install the TAO toolkit and authenticate your instance of the Innotescus API.
Installing TAO Toolkit
Figure 2. TAO Toolkit stack
The TAO toolkit can be run as a CLI or in a Jupyter notebook. It’s only compatible with Python3 (3.6.9 and 3.7), so first install the prerequisites.
Create an NGC account and generate an API key for authentication.
Log in to the NGC Docker registry by running the command docker login nvcr.io and enter your credentials for authentication.
After the prerequisites are installed, install the TAO toolkit. NVIDIA recommends installing the package in a virtual environment using virtualenvwrapper. To install the TAO launcher Python package, run the following commands:
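The commands below follow the standard TAO launcher install from the NVIDIA Python package index; the package names are assumed from the public TAO documentation, so verify them against the version you are using.

pip3 install nvidia-pyindex
pip3 install nvidia-tao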
Check whether you’ve gone through the installation correctly by running tao --help.
Accessing the Innotescus API
Innotescus is accessible as a web-based application, but you’ll also use its API to demonstrate how to accomplish the same tasks programmatically. To begin, install the Innotescus library.
pip install innotescus
Next, authenticate the API instance using the client_id and client_secret values retrieved from the platform.
Figure 3. Generate and retrieve API keys
from innotescus import client_factory
client = client_factory(client_id='client_id', client_secret='client_secret')
Now you’re ready to interact with the platform through the API, which you’ll do as you walk through each step of the pipeline that follows.
Data collection
You need data to train the model. Though it’s often overlooked, data collection is arguably the most important step in the development process. While collecting data, you should ask yourself a few questions:
Is the training data adequately representative of each object of interest?
Are you accounting for all the scenarios in which you expect the model to be deployed?
Do you have enough data to train the model?
You can’t always answer these questions completely but having a well-rounded game plan for data collection helps you avoid issues during subsequent steps in the development process. Data collection is a time-consuming and expensive process. Because the models provided by TAO are pretrained, the data requirements for retraining are much smaller, saving organizations significant resources in this phase.
For this experiment, you use images and annotations from the MS COCO Validation 2017 dataset. This dataset has 5,000 images with 80 different classes, but you only use the 2,685 images containing at least one person.
%matplotlib inline
from pycocotools.coco import COCO
import matplotlib.pyplot as plt
import skimage.io as io  # needed for io.imread below

dataDir = 'Your Data Directory'
dataType = 'val2017'
annFile = '{}/annotations/instances_{}.json'.format(dataDir, dataType)
coco = COCO(annFile)

catIds = coco.getCatIds(catNms=['person'])  # only using the 'person' category
imgIds = coco.getImgIds(catIds=catIds)

for num_imgs in range(len(imgIds)):  # iterate over indices rather than the integer length
    img = coco.loadImgs(imgIds[num_imgs])[0]
    I = io.imread(img['coco_url'])  # fetch the image from its COCO URL
Figure 4. Examples of images from the dataset that include one or more ‘person’ objects
With the authenticated instance of the Innotescus client, begin setting up a project and uploading the human-focused dataset.
#create a new project
client.create_project(project_name)
#upload data to the new project
client.upload_data(project_name, dataset_name, file_paths, data_type, storage_type)
data_type: The type of data this dataset holds. Accepted values:
DataType.IMAGE
DataType.VIDEO
storage_type: The source of the data. Accepted values:
StorageType.FILE_SYSTEM
StorageType.URL
This dataset is now accessible through the Innotescus user interface.
Figure 5. Gallery view of the human-centric Coco Validation 2017 dataset from within the Innotescus platform
Data curation
Now that you have your initial dataset, begin curating it to ensure a well-balanced dataset. Studies have repeatedly shown that this phase of the process takes around 80% of the time spent on a machine learning project.
Using TAO and Innotescus, we highlight techniques like pre-annotation and review that save time during this step without sacrificing dataset size or quality.
Pre-annotation
Pre-annotation enables you to use model-generated annotations to remove a significant amount of the time and manual effort necessary to label the subset of 2,685 images accurately. You use YOLOv4—the same model that you’re retraining—to generate pre-annotations for the annotators to refine.
Because pre-annotation saves you so much time on the easier components of the annotation task, you can focus your attention on the harder examples that the model can’t yet handle.
YOLOv4 is included in the TAO toolkit and supports k-means clustering, training, evaluation, inference, pruning, and exporting. To use the model, you must first create a YOLOv4 spec file, which has the following major components:
yolov4_config
training_config
eval_config
nms_config
augmentation_config
dataset_config
The spec file is a protobuf text (prototxt) message, and each of its fields can be either a basic data type or a nested message.
Next, download the model with pretrained weights. The TAO Toolkit Docker container provides access to a repository of pretrained models that serve as a great starting point when training deep neural networks. Because these models are hosted on the NGC catalog, you must first download and install the NGC CLI. For more information, see the NGC documentation.
After you've installed the CLI, you can list the pretrained computer vision models available on the NGC repository and download the one you need.
ngc registry model list nvidia/tao/pretrained_*
ngc registry model download-version /path/to/model_on_NGC_repo/ -dest /path/to/model_download_dir/
With the model downloaded and spec file updated, you can now generate pre-annotations by running the inference subtask.
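A command along these lines runs the inference subtask; the flag names follow the usual TAO launcher pattern but are assumptions here, so check tao yolo_v4 inference --help for your version:

tao yolo_v4 inference -e /path/to/specFile.txt \
                      -m /path/to/pretrained_model \
                      -i /path/to/input_images \
                      -o /path/to/output_dir \
                      -k $KEY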
The output of the inference subtask is a series of annotations in the KITTI format, saved in the specified output directory. Figure 6 shows two examples of these annotations:
Figure 6. Example annotations generated by the TAO Toolkit using the pretrained YOLOv4 model
Upload the pre-annotations into the Innotescus platform manually through the web-based user interface or using the API. Because the KITTI format is one of the many formats accepted by Innotescus, no preprocessing is needed. The upload call takes the following parameters; a sketch of the full call follows the list.
project_name: The name of the project containing the affected dataset and task.
dataset_name: The name of the dataset to which these annotations are to be applied.
task_type: The type of annotation task being created with these annotations. Accepted values from the TaskType class:
CLASSIFICATION
OBJECT_DETECTION
SEGMENTATION
INSTANCE_SEGMENTATION
data_type: The type of data to which the annotations correspond. Accepted values:
DataType.IMAGE
DataType.VIDEO
annotation_format: The format in which these annotations are stored. Accepted values from the AnnotationFormat class:
COCO
KITTI
MASKS_PER_CLASS
PASCAL
CSV
MASKS_SEMANTIC
MASKS_INSTANCE
INNOTESCUS_JSON
YOLO_DARKNET
YOLO_KERAS
file_paths: A list of file paths containing the annotation files to upload.
task_name: The name of the task to which these annotations belong; if the task does not exist, it is created and populated with these annotations.
task_description: A description of the task being created, if the task does not exist yet.
overwrite_existing_annotations: If the task already exists, this flag allows you to overwrite existing annotations.
pre_annotate: Allows you to import annotations as pre-annotations.
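Putting those parameters together, the API call looks roughly like the following. The method name upload_annotations and the enum import locations are assumptions based on the parameter list above; check the innotescus package documentation for the exact names in your version.

# Assumes `client` from the authentication step and that TaskType, DataType, and
# AnnotationFormat are importable from the innotescus package (import paths may differ).
client.upload_annotations(
    project_name=project_name,
    dataset_name=dataset_name,
    task_type=TaskType.OBJECT_DETECTION,
    data_type=DataType.IMAGE,
    annotation_format=AnnotationFormat.KITTI,
    file_paths=kitti_annotation_paths,  # KITTI .txt files produced by the inference subtask
    task_name='person_preannotation',
    task_description='Pre-annotations from the pretrained YOLOv4 model',
    overwrite_existing_annotations=False,
    pre_annotate=True,
)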
With the pre-annotations imported to the platform and a significant amount of the initial annotation work saved, move into Innotescus to further correct, refine, and analyze the data.
Review and correction
With the pre-annotations successfully imported, head over to the platform to review and correct them. While the pretrained model saves a significant amount of annotation time, it's still not perfect and needs a bit of human-in-the-loop interaction to ensure high-quality training data. Figure 8 shows an example of a typical correction that you might make.
Figure 8. Error in the pre-annotations generated with the pretrained YOLOv4
Beyond a first pass at fixing and submitting pre-annotations, Innotescus enables a more focused sampling of images and annotations for multistage review. This enables large teams to ensure high quality throughout the dataset systematically and efficiently.
Figure 9. Process on Innotescus
Exploratory data analysis
Exploratory data analysis, or EDA, is the process of investigating and visualizing datasets from multiple statistical angles to get a holistic understanding of the underlying patterns, anomalies, and biases present in the data. It is an effective and necessary step to take before thoughtfully addressing the statistical imbalances your dataset contains.
Innotescus provides precalculated metrics for understanding class, color, spatial, and complexity distributions for both data and annotations, and enables you to add your own layer of information in image and annotation metadata to incorporate application-specific information into the analytics.
Here's how you can use Innotescus's dive visualization to understand some of the patterns and biases present in the dataset. The following scatter plot shows, along the x-axis, the distribution of image entropy (the average information or degree of randomness in an image) within the dataset. You can see a clear pattern, but you can also spot anomalies such as images with low entropy or information content.
Figure 10. Dataset graph on Innotescus
Figure 11. Examples of (left) low- and (right) high-entropy images
Outliers like these raise questions of how to handle anomalies within a dataset. Recognizing anomalies enables you to ask some crucial questions:
Do you expect the model, when deployed, to encounter low-entropy input?
If so, do you need more such examples in the training dataset?
If not, are these examples going to be detrimental for training, and should they be removed from the training dataset?
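For concreteness, the image entropy metric plotted above can be computed as the Shannon entropy of the grayscale intensity histogram. A minimal sketch (not Innotescus's internal implementation; assumes an RGB input image):

import numpy as np
from skimage import io, color

def image_entropy(path):
    """Shannon entropy (in bits) of the grayscale intensity histogram."""
    gray = color.rgb2gray(io.imread(path))          # float image in [0, 1]
    hist, _ = np.histogram(gray, bins=256, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]                                    # drop empty bins to avoid log(0)
    return -np.sum(p * np.log2(p))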
In another example, look at each annotation’s area, relative to the image that it’s in.
Figure 12. Using dive charts to investigate a number of metrics calculated by Innotescus
Figure 13. Variation in relative annotation size within the dataset
In Figure 13, the two images show the variation in annotation sizes within the dataset. While some annotations capture people that take up lots of the image, most show people far away from the camera.
Here, a large percentage of annotations are between 0 and 10% of their respective image sizes. This means that the dataset is biased towards small objects, or people that are far from the camera. Do you then need more examples in the training data that have larger annotations to represent people closer to the camera? Understanding the data distribution in this way helps you to begin thinking about the plan for data augmentation.
With Innotescus, EDA is made intuitive. It provides you with the information that you need to make powerful augmentations to your dataset and eliminate bias early in the development process.
Cluster rebalancing with dataset augmentation
The idea behind augmentation for cluster rebalancing is powerful. This technique showed a 21% boost in performance in the recent data-centric AI competition hosted by Andrew Ng and DeepLearning.AI.
You generate an N-dimensional feature vector for each data point (each bounding box annotation) and cluster all data points in that higher-dimensional space. Once you've clustered objects with similar features, you augment the dataset so that each cluster has equal representation.
We chose to use [red channel mean, green channel mean, blue channel mean, gray image std, gray image entropy, relative area] as the N-dimensional feature vector. These metrics were exported from Innotescus, which automatically calculated them. You could also use the embeddings generated by the pretrained model to populate the feature vector, which would arguably be more robust.
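A sketch of how such a feature vector could be assembled per bounding box; Innotescus exports these metrics for you, so this is only to make the components concrete (the box format and helper names are assumptions):

import numpy as np
from skimage import io, color
from skimage.measure import shannon_entropy

def bbox_features(image_path, bbox):
    """[R mean, G mean, B mean, gray std, gray entropy, relative area] for one box.

    bbox = (x, y, w, h) in pixels; assumes an RGB image.
    """
    img = io.imread(image_path)
    x, y, w, h = bbox
    crop = img[y:y + h, x:x + w]
    gray = color.rgb2gray(crop)
    rel_area = (w * h) / (img.shape[0] * img.shape[1])
    return np.array([
        crop[..., 0].mean(), crop[..., 1].mean(), crop[..., 2].mean(),
        gray.std(), shannon_entropy(gray), rel_area,
    ])

# featureVector = np.stack([bbox_features(path, box) for path, box in annotations])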
You use k-means clustering with k=4 as the clustering algorithm and UMAP to reduce the dimensions to two for visualization. The following code example generates the UMAP plot, color-coded by these four clusters.
import umap
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# k-means on the feature vectors (one row per bounding box annotation)
kmeans = KMeans(n_clusters=4, random_state=0).fit(featureVector)

# UMAP for dimensionality reduction and visualization
fit = umap.UMAP(n_neighbors=5,
                min_dist=0.2,
                n_components=2,
                metric='manhattan')
u = fit.fit_transform(featureVector)

# Plot the two UMAP components, colored by k-means cluster
plt.scatter(u[:, 0], u[:, 1], c=kmeans.labels_)
plt.title('UMAP embedding of kmeans colours')
Figure 14. Four clusters, plotted on two dimensions
When you look at the number of objects in each cluster, you can clearly see the imbalance, which informs how you should augment the data for retraining. The four clusters represent 854, 1,523, 1,481, and 830 images, respectively. Where an image has objects in more than one cluster, group that image into the cluster that contains most of its objects for augmentation.
# Group image filenames by cluster label
clusters = {}
for file, cluster in zip(filename, kmeans.labels_):
    if cluster not in clusters:
        clusters[cluster] = []
    clusters[cluster].append(file)

for numCls in range(len(clusters)):
    # Each entry is one object; unique filenames give the image count
    print('Cluster {}: {} objects, {} images'.format(
        numCls + 1, len(clusters[numCls]), len(set(clusters[numCls]))))
With the clusters well defined, you use the imgaug Python library to introduce augmentation techniques to enhance the training data: translation, image brightness adjustment, and scale augmentation. You augment such that each cluster contains 2,000 images for a total of 8,000. As you augment images, imgaug ensures that the annotation coordinates are altered appropriately as well.
import imgaug as ia
import imgaug.augmenters as iaa

# augment images
seq = iaa.Sequential([
    iaa.Multiply([1.1, 1.5]),  # change brightness, doesn't affect BBs
    iaa.Affine(
        translate_px={"x": 60, "y": 60},
        scale=(0.5, 0.8)
    )  # translate by 60px on x/y axes & scale to 50-80%, includes BBs
])

# augment BBs and images
image_aug, bbs_aug = seq(image=I, bounding_boxes=boundingBoxes)
Using the same UMAP visualization technique, with augmented data points now in red, you see that the dataset is now much more balanced, as it more closely resembles a Gaussian distribution.
Figure 15. Rebalanced clusters
Model training
With the well-balanced, high-quality training data, the final step is to train the model.
YOLOv4 retraining on TAO Toolkit
To start retraining the model, first ensure that the spec file contains the classes of interest, as well as the correct directory paths for the pretrained model and training data. Change the training parameters in the training_config section. Reserve 30% of the augmented dataset as a test dataset to compare the performance of the pretrained model and the performance of the retrained model.
tao yolo_v4 train -e /path/to/specFile.txt -r /path/to/result -k $KEY
Results
As Table 1 shows, you achieved a 14.93 percentage point improvement in mean average precision, a 21.37% relative boost over the mAP of the pretrained model:
Model                                                          mAP50
YOLOv4 pretrained model                                        69.86%
YOLOv4 retrained model with cluster-rebalanced augmentation    84.79%
Table 1. Model performance before and after applying transfer learning with the curated dataset
Summary
Using NVIDIA TAO Toolkit for pre-annotation and model training and Innotescus for data refinement, analysis, and curation, you improved YOLOv4's mean average precision on the person class by a substantial amount: over 20%. Not only did you improve the performance on a selected class, you also used less time and data than you would have without the benefits of transfer learning.
Transfer learning is a great way to produce high-performing, application-specific models in settings with constrained resources. Using tools like the TAO toolkit and Innotescus makes it feasible for teams of all sizes and backgrounds.
Researchers combine CT imaging with deep learning to evaluate dinosaur fossils. The approach could change how paleontologists study ancient remains.
Applying new technology to studying ancient history, researchers are looking to expand their understanding of dinosaurs with a new AI algorithm. The study, published in Frontiers in Earth Science, uses high-resolution Computed Tomography (CT) imaging combined with deep learning models to scan and evaluate dinosaur fossils. The research is a step toward creating a new tool that would vastly change the way paleontologists study ancient remains.
“Computed Tomography, as well as other imaging techniques, has revealed previously hidden structures in fossils, but the high-resolution images require paleontologists to spend weeks or even months in post-processing, usually segmenting fossils from rock matrices. The introduction of AI can not only accelerate data processing in fossil studies, but also establish benchmarks for more objective and more reproducible studies,” said lead author Congyu Yu, a Ph.D. student at the Richard Gilder Graduate School at the American Museum of Natural History.
For a complete picture of ancient vertebrates, paleontologists focus on internal anatomy such as cranial capacity, inner ears, or vascular spaces. To do this, researchers use a technique called thin sectioning. Removing a small piece (as thin as several micrometers) from a fossil, examining it under a microscope, and annotating the structures they find, helps them piece together the morphology of a dinosaur. However, this technique is destructive to the remains and can be extremely time consuming.
Computed tomography (CT) scans have given scientists the ability to look inside a sample while leaving the fossil unscathed. The technology essentially examines a fossil section, capturing thousands of images of it. Software then reconstructs the images and generates a three-dimensional graphic, resulting in an internal snapshot of the sample. Scientists can then examine and label identifiable morphology in the graphic to learn more about a specimen.
Imaging has given scientists a tool for revealing hidden internal structures and advancing 3D models of dinosaurs. Studies have helped researchers estimate body mass, analyze skulls, and even understand dental morphology along with tooth replacement patterns.
However, with this approach, scientists still manually choose segments, examine, and label images, which beyond being time intensive, is subjective, and can introduce errors. Plus, scans have limitations differentiating between the rock that may be coating a fossil and the bones themselves, making it difficult to determine where a rock ends and fossil begins.
AI has proven capable of quick image segmentation in the medical world, ranging from identifying brain lesions to skin cancer. The researchers saw an opportunity to apply similar deep learning models to CT fossil images.
They tested this new approach using deep neural networks and over 10,000 annotated CT scans of three well-preserved embryonic skulls of Protoceratops dinosaurs. Recovered in the 1990s from the Mongolian Gobi Desert, these fossils come from early horned dinosaurs that are a smaller relative of the better-known Triceratops.
The team used a classic U-net deep neural network for fossil segmentation, teaching the algorithm to distinguish rock from fossil. A modified DeepLab v3+ network was used for training feature identification, categorizing parts of the CT images, and 3D rendering.
The models were trained using 7,986 manually annotated bone structure CT slices on the cuDNN-accelerated TensorFlow deep learning framework with dual NVIDIA GeForce RTX 2080 Ti GPUs.
Testing the results against a dataset of 3,329 slices, they found that while the segmentation model reached a high accuracy of around 97%, the 3D feature renderings were not as meticulous or accurate as those produced by humans. While the feature identification models did not perform as accurately as the scientists, the segmentation models worked smoothly and did so in record time. The models segmented each slice in seconds, whereas manually segmenting the same piece took minutes or even hours in some cases. This could help paleontologists reduce the time they spend differentiating fossils from rock.
Figure 1. Comparison of different 3D renderings from left: raw reconstruction, manual segmentation, and deep learning segmentation.
The researchers suggest that larger data sets incorporating other dinosaur species and different sediment types could help create a high-performing algorithm down the line.
“We are confident that a segmentation model for fossils from the Gobi Desert is not far away, but a more generalized model needs not only more training dataset but innovations in algorithms,” Yu said in a press release. “I believe deep learning can eventually process imagery better than us, and there have already been various examples in deep learning performance exceeding humans, including Go playing and protein 3D-structure prediction.”
The dataset used in the study, CT Segmentation of Dinosaur Fossils by Deep Learning, is available for download.
Posted by Brendan McMahan and Abhradeep Thakurta, Research Scientists, Google Research
In 2017, Google introduced federated learning (FL), an approach that enables mobile devices to collaboratively train machine learning (ML) models while keeping the raw training data on each user’s device, decoupling the ability to do ML from the need to store the data in the cloud. Since its introduction, Google has continued to actively engage in FL research and deployed FL to power many features in Gboard, including next word prediction, emoji suggestion and out-of-vocabulary word discovery. Federated learning is improving the “Hey Google” detection models in Assistant, suggesting replies in Google Messages, predicting text selections, and more.
While FL allows ML without raw data collection, differential privacy (DP) provides a quantifiable measure of data anonymization, and when applied to ML can address concerns about models memorizing sensitive user data. This too has been a top research priority, and has yielded one of the first production uses of DP for analytics with RAPPOR in 2014, our open-source DP library, Pipeline DP, and TensorFlow Privacy.
Through a multi-year, multi-team effort spanning fundamental research and product integration, today we are excited to announce that we have deployed a production ML model using federated learning with a rigorous differential privacy guarantee. For this proof-of-concept deployment, we utilized the DP-FTRL algorithm to train a recurrent neural network to power next-word prediction for Spanish-language Gboard users. To our knowledge, this is the first production neural network trained directly on user data announced with a formal DP guarantee (technically ρ=0.81 zero-Concentrated-Differential-Privacy, zCDP, discussed in detail below). Further, the federated approach offers complementary data minimization advantages, and the DP guarantee protects all of the data on each device, not just individual training examples.
Data Minimization and Anonymization in Federated Learning Along with fundamentals like transparency and consent, the privacy principles of data minimization and anonymization are important in ML applications that involve sensitive data.
Federated learning systems structurally incorporate the principle of data minimization. FL only transmits minimal updates for a specific model training task (focused collection), limits access to data at all stages, processes individuals’ data as early as possible (early aggregation), and discards both collected and processed data as soon as possible (minimal retention).
Another principle that is important for models trained on user data is anonymization, meaning that the final model should not memorize information unique to a particular individual’s data, e.g., phone numbers, addresses, credit card numbers. However, FL on its own does not directly tackle this problem.
The mathematical concept of DP allows one to formally quantify this principle of anonymization. Differentially private training algorithms add random noise during training to produce a probability distribution over output models, and ensure that this distribution doesn’t change too much given a small change to the training data; ρ-zCDP quantifies how much the distribution could possibly change. We call this example-level DP when adding or removing a single training example changes the output distribution on models in a provably minimal way.
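For reference, the formal definition: a randomized training mechanism $M$ satisfies $\rho$-zCDP if, for every pair of neighboring datasets $D, D'$ and every order $\alpha \in (1, \infty)$, the Rényi divergence between the output distributions is bounded as

$$D_{\alpha}\big(M(D) \,\|\, M(D')\big) \le \rho\,\alpha,$$

so a smaller $\rho$ means the two distributions are harder to distinguish, i.e., stronger privacy.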
Showing that deep learning with example-level differential privacy was even possible in the simpler setting of centralized training was a major step forward in 2016. This was achieved by the DP-SGD algorithm, whose key idea was amplifying the privacy guarantee by leveraging the randomness in sampling training examples (“amplification-via-sampling”).
However, when users can contribute multiple examples to the training dataset, example-level DP is not necessarily strong enough to ensure the users’ data isn’t memorized. Instead, we have designed algorithms for user-level DP, which requires that the output distribution of models doesn’t change even if we add/remove all of the training examples from any one user (or all the examples from any one device in our application). Fortunately, because FL summarizes all of a user’s training data as a single model update, federated algorithms are well-suited to offering user-level DP guarantees.
Both limiting the contributions from one user and adding noise can come at the expense of model accuracy, however, so maintaining model quality while also providing strong DP guarantees is a key research focus.
The Challenging Path to Federated Learning with Differential Privacy In 2018, we introduced the DP-FedAvg algorithm, which extended the DP-SGD approach to the federated setting with user-level DP guarantees, and in 2020 we deployed this algorithm to mobile devices for the first time. This approach ensures the training mechanism is not too sensitive to any one user’s data, and empirical privacy auditing techniques rule out some forms of memorization.
However, the amplification-via-sampling argument is essential to providing a strong DP guarantee for DP-FedAvg, and in a real-world cross-device FL system, ensuring that devices are subsampled precisely and uniformly at random from a large population would be complex and hard to verify. One challenge is that devices choose when to connect (or “check in”) based on many external factors (e.g., requiring the device is idle, on unmetered WiFi, and charging), and the number of available devices can vary substantially.
Achieving a formal privacy guarantee requires a protocol that does all of the following:
Makes progress on training even as the set of devices available varies significantly with time.
Maintains privacy guarantees even in the face of unexpected or arbitrary changes in device availability.
For efficiency, allows client devices to locally decide whether they will check in to the server in order to participate in training, independent of other devices.
Initial work on privacy amplification via random check-ins highlighted these challenges and introduced a feasible protocol, but it would have required complex changes to our production infrastructure to deploy. Further, as with the amplification-via-sampling analysis of DP-SGD, the privacy amplification possible with random check-ins depends on a large number of devices being available. For example, if only 1000 devices are available for training, and participation of at least 1000 devices is needed in each training step, that requires either 1) including all devices currently available and paying a large privacy cost since there is no randomness in the selection, or 2) pausing the protocol and not making progress until more devices are available.
Achieving Provable Differential Privacy for Federated Learning with DP-FTRL To address this challenge, the DP-FTRL algorithm is built on two key observations: 1) the convergence of gradient-descent-style algorithms depends primarily not on the accuracy of individual gradients, but the accuracy of cumulative sums of gradients; and 2) we can provide accurate estimates of cumulative sums with a strong DP guarantee by utilizing negatively correlated noise, added by the aggregating server: essentially, adding noise to one gradient and subtracting that same noise from a later gradient. DP-FTRL accomplishes this efficiently using the Tree Aggregation algorithm [1, 2].
The graphic below illustrates how estimating cumulative sums rather than individual gradients can help. We look at how the noise introduced by DP-FTRL and DP-SGD influence model training, compared to the true gradients (without added noise; in black) which step one unit to the right on each iteration. The individual DP-FTRL gradient estimates (blue), based on cumulative sums, have larger mean-squared-error than the individually-noised DP-SGD estimates (orange), but because the DP-FTRL noise is negatively correlated, some of it cancels out from step to step, and the overall learning trajectory stays closer to the true gradient descent steps.
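To make the cumulative-sum idea concrete, here is a minimal sketch (not the production DP-FTRL implementation) of binary-tree aggregation: independent Gaussian noise is pre-sampled for each dyadic interval of rounds, and each cumulative sum is released with the noise of the O(log T) intervals that cover it, so successive estimates share noise terms and part of the noise cancels. The noise scale sigma below is a stand-in, not a calibrated privacy parameter.

import numpy as np

def tree_noise(T, sigma, dim, rng):
    """Independent noise for every dyadic interval [start, start + span), span a power of two."""
    noise = {}
    span = 1
    while span <= T:
        for start in range(0, T, span):
            noise[(span, start)] = rng.normal(0.0, sigma, dim)
        span *= 2
    return noise

def noisy_prefix_sum(grads, t, noise):
    """True sum of grads[0..t] plus the noise of the dyadic intervals covering [0, t]."""
    n = t + 1
    total = np.sum(grads[:n], axis=0)
    pos = 0
    while pos < n:
        # Largest aligned power-of-two block that starts at pos and fits within [0, n).
        span = 1
        while pos % (2 * span) == 0 and pos + 2 * span <= n:
            span *= 2
        total = total + noise[(span, pos)]
        pos += span
    return total

rng = np.random.default_rng(0)
T, dim = 8, 3
grads = rng.normal(size=(T, dim))                 # stand-ins for per-round model updates
noise = tree_noise(T, sigma=0.5, dim=dim, rng=rng)
estimates = [noisy_prefix_sum(grads, t, noise) for t in range(T)]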
To provide a strong privacy guarantee, we limit the number of times a user contributes an update. Fortunately, sampling-without-replacement is relatively easy to implement in production FL infrastructure: each device can remember locally which models it has contributed to in the past, and choose to not connect to the server for any later rounds for those models.
Production Training Details and Formal DP Statements For the production DP-FTRL deployment introduced above, each eligible device maintains a local training cache consisting of user keyboard input, and when participating computes an update to the model which makes it more likely to suggest the next word the user actually typed, based on what has been typed so far. We ran DP-FTRL on this data to train a recurrent neural network with ~1.3M parameters. Training ran for 2000 rounds over six days, with 6500 devices participating per round. To allow for the DP guarantee, devices participated in training at most once every 24 hours. Model quality improved over the previous DP-FedAvg trained model, which offered empirically-tested privacy advantages over non-DP models, but lacked a meaningful formal DP guarantee.
The training mechanism we used is available in open-source in TensorFlow Federated and TensorFlow Privacy, and with the parameters used in our production deployment it provides a meaningfully strong privacy guarantee. Our analysis gives ρ=0.81 zCDP at the user level (treating all the data on each device as a different user), where smaller numbers correspond to better privacy in a mathematically precise way. As a comparison, this is stronger than the ρ=2.63 zCDP guarantee chosen by the 2020 US Census.
Next Steps While we have reached the milestone of deploying a production FL model using a mechanism that provides a meaningfully small zCDP, our research journey continues. We are still far from being able to say this approach is possible (let alone practical) for most ML models or product applications, and other approaches to private ML exist. For example, membership inference tests and other empirical privacy auditing techniques can provide complementary safeguards against leakage of users’ data. Most importantly, we see training models with user-level DP with even a very large zCDP as a substantial step forward, because it requires training with a DP mechanism that bounds the sensitivity of the model to any one user’s data. Further, it smooths the road to later training models with improved privacy guarantees as better algorithms or more data become available. We are excited to continue the journey toward maximizing the value that ML can deliver while minimizing potential privacy costs to those who contribute training data.
Acknowledgements The authors would like to thank Alex Ingerman and Om Thakkar for significant impact on the blog post itself, as well as the teams at Google that helped develop these ideas and bring them to practice:
Core research team: Galen Andrew, Borja Balle, Peter Kairouz, Daniel Ramage, Shuang Song, Thomas Steinke, Andreas Terzis, Om Thakkar, Zheng Xu
FL infrastructure team: Katharine Daly, Stefan Dierauf, Hubert Eichner, Igor Pisarev, Timon Van Overveldt, Chunxiang Zheng
Gboard team: Angana Ghosh, Xu Liu, Yuanbo Zhang
Speech team: Françoise Beaufays, Mingqing Chen, Rajiv Mathews, Vidush Mukund, Igor Pisarev, Swaroop Ramaswamy, Dan Zivkovic
I am trying to use an empirical metric as a loss function to train a Tensorflow model. Calculating the metric function is slow, but I can train a regression neural network to accurately and quickly predict the metric score after it is trained. Is there a straightforward way (or tutorial?) to use a trained Tensorflow or scikit-learn model as a custom loss function for a Tensorflow model?
Edit: I have found this StackOverflow entry as a starting point. I will try it out and report back.
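One straightforward pattern (a sketch with placeholder names, not taken from the linked StackOverflow answer) is to load the trained metric predictor, freeze it, and call it inside a custom loss so gradients flow through its frozen weights back into the model being trained:

import tensorflow as tf

# Placeholder path: a Keras regression model trained to predict the slow empirical metric
metric_model = tf.keras.models.load_model("metric_model.h5")
metric_model.trainable = False  # freeze the surrogate; only the main model is trained

def surrogate_loss(y_true, y_pred):
    # The surrogate predicts the metric from the model output; minimizing its
    # prediction stands in for minimizing the metric (flip the sign if a higher
    # metric is better).
    predicted_metric = metric_model(y_pred, training=False)
    return tf.reduce_mean(predicted_metric)

# Illustrative main model; the real architecture depends on your task
main_model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
main_model.compile(optimizer="adam", loss=surrogate_loss)

Note that a scikit-learn model would not work directly in this role, since the loss must remain differentiable within TensorFlow.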
Posted by Abhishek Kumar and Ehsan Amid, Research Scientists, Google Research, Brain Team
Over the past several years, deep neural networks (DNNs) have been quite successful in driving impressive performance gains in several real-world applications, from image recognition to genomics. However, modern DNNs often have far more trainable model parameters than the number of training examples and the resulting overparameterized networks can easily overfit to noisy or corrupted labels (i.e., examples that are assigned a wrong class label). As a consequence, training with noisy labels often leads to degradation in accuracy of the trained model on clean test data. Unfortunately, noisy labels can appear in several real-world scenarios due to multiple factors, such as errors and inconsistencies in manual annotation and the use of inherently noisy label sources (e.g., the internet or automated labels from an existing system).
Earlier work has shown that representations learned by pre-training large models with noisy data can be useful for prediction when used in a linear classifier trained with clean data. In principle, it is possible to directly train machine learning (ML) models on noisy data without resorting to this two-stage approach. To be successful, such alternative methods should have the following properties: (i) they should fit easily into standard training pipelines with little computational or memory overhead; (ii) they should be applicable in “streaming” settings where new data is continuously added during training; and (iii) they should not require data with clean labels.
In “Constrained Instance and Class Reweighting for Robust Learning under Label Noise”, we propose a novel and principled method, named Constrained Instance reWeighting (CIW), with these properties that works by dynamically assigning importance weights both to individual instances and to class labels in a mini-batch, with the goal of reducing the effect of potentially noisy examples. We formulate a family of constrained optimization problems that yield simple solutions for these importance weights. These optimization problems are solved per mini-batch, which avoids the need to store and update the importance weights over the full dataset. This optimization framework also provides a theoretical perspective for existing label smoothing heuristics that address label noise, such as label bootstrapping. We evaluate the method with varying amounts of synthetic noise on the standard CIFAR-10 and CIFAR-100 benchmarks and observe considerable performance gains over several existing methods.
Method Training ML models involves minimizing a loss function that indicates how well the current parameters fit to the given training data. In each training step, this loss is approximately calculated as a (weighted) sum of the losses of individual instances in the mini-batch of data on which it is operating. In standard training, each instance is treated equally for the purpose of updating the model parameters, which corresponds to assigning uniform (i.e., equal) weights across the mini-batch.
However, empirical observations made in earlier works reveal that noisy or mislabeled instances tend to have higher loss values than those that are clean, particularly during early to mid-stages of training. Thus, assigning uniform importance weights to all instances means that due to their higher loss values, the noisy instances can potentially dominate the clean instances and degrade the accuracy on clean test data.
Motivated by these observations, we propose a family of constrained optimization problems that solve this problem by assigning importance weights to individual instances in the dataset to reduce the effect of those that are likely to be noisy. This approach provides control over how much the weights deviate from uniform, as quantified by a divergence measure. It turns out that for several types of divergence measures, one can obtain simple formulae for the instance weights. The final loss is computed as the weighted sum of individual instance losses, which is used for updating the model parameters. We call this the Constrained Instance reWeighting (CIW) method. This method allows for controlling the smoothness or peakiness of the weights through the choice of divergence and a corresponding hyperparameter.
Schematic of the proposed Constrained Instance reWeighting (CIW) method.
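One plausible instantiation, sketched below (not the authors' exact formulation): with a KL-divergence constraint, the per-instance weights reduce to a softmax of the negative per-example losses, so lower-loss (likely clean) examples receive larger weights. The temperature lam is a hyperparameter introduced here to control how far the weights may deviate from uniform.

import tensorflow as tf

def ciw_style_loss(per_example_loss, lam=1.0):
    # Weights from a softmax over negative losses; treated as constants for the
    # gradient step (a simplification in this sketch).
    weights = tf.stop_gradient(tf.nn.softmax(-per_example_loss / lam))
    return tf.reduce_sum(weights * per_example_loss)

# Usage inside a custom training step (sketch):
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
# per_example = loss_fn(labels, logits)
# batch_loss = ciw_style_loss(per_example, lam=0.5)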
Illustration with Decision Boundary on a 2D Dataset As an example to illustrate the behavior of this method, we consider a noisy version of the Two Moons dataset, which consists of randomly sampled points from two classes in the shape of two half moons. We corrupt 30% of the labels and train a multilayer perceptron network on it for binary classification. We use the standard binary cross-entropy loss and an SGD with momentum optimizer to train the model. In the figure below (left panel), we show the data points and visualize an acceptable decision boundary separating the two classes with a dotted line. The points marked red in the upper half-moon and those marked green in the lower half-moon indicate noisy data points.
The baseline model trained with the binary cross-entropy loss assigns uniform weights to the instances in each mini-batch, thus eventually overfitting to the noisy instances and resulting in a poor decision boundary (middle panel in the figure below).
The CIW method reweights the instances in each mini-batch based on their corresponding loss values (right panel in the figure below). It assigns larger weights to the clean instances that are located on the correct side of the decision boundary and damps the effect of noisy instances that incur a higher loss value. Smaller weights for noisy instances help in preventing the model from overfitting to them, thus allowing the model trained with CIW to successfully converge to a good decision boundary by avoiding the impact of label noise.
Illustration of decision boundary as the training proceeds for the baseline and the proposed CIW method on the Two Moons dataset. Left: Noisy dataset with a desirable decision boundary. Middle: Decision boundary for standard training with cross-entropy loss. Right: Training with the CIW method. The size of the dots in (middle) and (right) are proportional to the importance weights assigned to these examples in the minibatch.
Constrained Class reWeighting Instance reweighting assigns lower weights to instances with higher losses. We further extend this intuition to assign importance weights over all possible class labels. Standard training uses a one-hot label vector as the class weights, assigning a weight of 1 to the labeled class and 0 to all other classes. However, for the potentially mislabeled instances, it is reasonable to assign non-zero weights to classes that could be the true label. We obtain these class weights as solutions to a family of constrained optimization problems where the deviation of the class weights from the label one-hot distribution, as measured by a divergence of choice, is controlled by a hyperparameter.
Again, for several divergence measures, we can obtain simple formulae for the class weights. We refer to this as Constrained Instance and Class reWeighting (CICW). The solution to this optimization problem also recovers the earlier proposed methods based on static label bootstrapping (also referred to as label smoothing) when the divergence is taken to be total variation distance. This provides a theoretical perspective on the popular method of static label bootstrapping.
Using Instance Weights with Mixup We also propose a way to use the obtained instance weights with mixup, which is a popular method for regularizing models and improving prediction performance. It works by sampling a pair of examples from the original dataset and generating a new artificial example using a random convex combination of these. The model is trained by minimizing the loss on these mixed-up data points. Vanilla mixup is oblivious to the individual instance losses, which might be problematic for noisy data because mixup will treat clean and noisy examples equally. Since a high instance weight obtained with our CIW method is more likely to indicate a clean example, we use our instance weights to do a biased sampling for mixup and also use the weights in convex combinations (instead of random convex combinations in vanilla mixup). This results in biasing the mixed-up examples towards clean data points, which we refer to as CICW-Mixup.
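A rough sketch of weight-biased mixup in this spirit (a simplification, not the authors' exact procedure): sample each example's partner with probability proportional to the instance weights, and use the normalized pair weights as the mixing coefficients instead of a random Beta draw.

import numpy as np

def weighted_mixup(x, y_onehot, weights, rng):
    """Mix each example with a partner sampled proportionally to the instance weights."""
    n = len(x)
    partners = rng.choice(n, size=n, p=weights / weights.sum())
    lam = weights / (weights + weights[partners])     # per-pair mixing coefficient
    lam_x = lam.reshape(-1, *([1] * (x.ndim - 1)))    # broadcast over feature dimensions
    x_mix = lam_x * x + (1 - lam_x) * x[partners]
    y_mix = lam[:, None] * y_onehot + (1 - lam[:, None]) * y_onehot[partners]
    return x_mix, y_mix

# Example: rng = np.random.default_rng(0); weighted_mixup(x_batch, y_batch, w_batch, rng)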
We apply these methods with varying amounts of synthetic noise (i.e., the label for each instance is randomly flipped to other labels) on the standard CIFAR-10 and CIFAR-100 benchmark datasets. We show the test accuracy on clean data with symmetric synthetic noise where the noise rate is varied between 0.2 and 0.8.
We observe that the proposed CICW outperforms several methods and matches the results of dynamic mixup, which maintains the importance weights over the full training set with mixup. Using our importance weights with mixup in CICW-Mixup resulted in significantly improved performance versus these methods, particularly for larger noise rates (as shown by lines above and to the right in the graphs below).
Summary and Future Directions We formulate a novel family of constrained optimization problems for tackling label noise that yield simple mathematical formulae for reweighting the training instances and class labels. These formulations also provide a theoretical perspective on existing label smoothing–based methods for learning with noisy labels. We also propose ways of using the instance weights with mixup that result in further significant performance gains over instance and class reweighting alone. Our method operates solely at the level of mini-batches, which avoids the extra overhead of maintaining dataset-level weights as in some of the recent methods.
As a direction for future work, we would like to evaluate the method on realistic noisy labels that are encountered in large scale practical settings. We also believe that studying the interaction of our framework with label smoothing is an interesting direction that can result in a loss adaptive version of label smoothing. We are also excited to release the code for CICW, now available on Github.
Acknowledgements We’d like to thank Kevin Murphy for providing constructive feedback during the course of the project.
The NCCL 2.12 release significantly improves all2all collective communication performance with the PXN feature.
Collective communications are a performance-critical ingredient of modern distributed AI training workloads such as recommender systems and natural language processing.
NVIDIA Collective Communication Library (NCCL), a Magnum IO Library, implements GPU-accelerated collective operations:
all-gather
all-reduce
broadcast
reduce
reduce-scatter
point-to-point send and receive
NCCL is topology-aware and is optimized to achieve high bandwidth and low latency over PCIe, NVLink, Ethernet, and InfiniBand interconnect. NCCL GCP plugin and NCCL AWS plugin enable high-performance NCCL operations in popular cloud environments with custom network connectivity.
NCCL releases have been relentlessly focusing on improving collective communication performance. This post focuses on the improvements that come with the NCCL 2.12 release.
Combining NVLink and network communication
The new feature introduced in NCCL 2.12 is called PXN, for PCI × NVLink, as it enables a GPU to communicate with a NIC on the node through NVLink and then PCI. This avoids going through the CPU using QPI or other inter-CPU protocols, which would not be able to deliver full bandwidth. That way, even though each GPU still tries to use its local NIC as much as possible, it can reach other NICs if required.
Instead of preparing a buffer on its local memory for the local NIC to send, the GPU prepares a buffer on an intermediate GPU, writing to it through NVLink. It then notifies the CPU proxy managing that NIC that the data is ready, instead of notifying its own CPU proxy. The GPU-CPU synchronization might be a little slower because it may have to cross CPU sockets, but the data itself only uses NVLink and PCI switches, guaranteeing maximum bandwidth.
Figure 1. Rail-optimized topology
In the topology in Figure 1, NIC-0 from each DGX system is connected to the same leaf switch (L0), NIC-1s are connected to the same leaf switch (L1), and so on. Such a design is often called rail-optimized. Rail-optimized network topology helps maximize all-reduce performance while minimizing network interference between flows. It can also reduce the cost of the network by having lighter connections between rails.
PXN leverages NVIDIA NVSwitch connectivity between GPUs within the node to first move data on a GPU on the same rail as the destination, then send it to the destination without crossing rails. That enables message aggregation and network traffic optimization.
Figure 2. Example message path from GPU0 in DGX-A to GPU3 in DGX-B
Before NCCL 2.12, the message in Figure 2 would have traversed three hops of network switches (L0, S1, and L3), potentially causing contention and being slowed down by other traffic. With PXN, messages passed between the same pair of NICs are aggregated to maximize the effective message rate and network bandwidth.
Message aggregation
With PXN, all GPUs on a given node move their data onto a single GPU for a given destination. This enables the network layer to aggregate messages, by implementing a new multireceive function. The function enables the remote CPU proxy to send all messages as one as soon as they are all ready.
For example, if a GPU on a node is performing an all2all operation and is to receive data from all eight GPUs from a remote node, NCCL calls a multireceive with eight buffers and sizes. On the sender side, the network layer can then wait until all eight sends are ready, then send all eight messages at one time, which can have a significant effect on the message rate.
Another aspect of message aggregation is that connections are now shared between all GPUs of a node for a given destination. This means fewer connections to establish. It can also affect the routing efficiency, if the routing algorithm was relying on having a lot of different connections to get good entropy.
PXN improves all2all performance
Figure 3. all2all collective operation across four participating processes
Figure 3 shows that all2all entails communication from each process to every other process. In other words, the number of messages exchanged as part of an all2all operation in an N-GPU cluster is O(N²).
The messages exchanged between the GPUs are distinct and can't be optimized using algorithms such as tree or ring (used for all-reduce). When you run billion-plus parameter models across hundreds of GPUs, the number of messages can trigger congestion, create network hotspots, and adversely affect performance.
As discussed earlier, PXN combines NVLink and PCI communications to reduce traffic flow through the second-tier spine switches and optimizes network traffic. It also improves message rates by aggregating up to eight messages into one. Both improvements significantly improve all2all performance.
all-reduce on 1:1 GPU:NIC topologies
Another problem that PXN solves is the case of topologies where there is a single GPU close to each NIC. The ring algorithm requires two GPUs to be close to each NIC. Data must go from the network to a first GPU, go around all GPUs through NVLink, and then exit from the last GPU onto the network. The first and last GPUs must both be close to the NIC. The first GPU must be able to receive from the network efficiently, and the last GPU must be able to send through the network efficiently. If only one GPU is close to a given NIC, then you cannot close the ring and must send data through the CPU, which can heavily affect performance.
With PXN, as long as the last GPU can access the first GPU through NVLink, it can move its data to the first GPU. The data is sent from there to the NIC, keeping all transfers local to PCI switches.
This case is not only relevant for PCI topologies featuring one GPU and one NIC per PCI switch but can also happen on other topologies when an NCCL communicator only includes a subset of GPUs. Consider a node with 8xGPUs interconnected with an NVLink hypercube mesh.
Figure 4. Network topology in a NVIDIA DGX-1 system
Figure 5 shows a ring that can be formed by leveraging the high-bandwidth NVLink connections that are available in the topology when the communicator includes all the 8xGPUs in the system. This is possible as both GPU0 and GPU1 share access to the same local NIC.
Figure 5. Example ring path used by NCCL
The communicator can also include just a subset of the GPUs. For example, it can include only GPUs 0, 2, 4, and 6. In that case, creating rings is impossible without crossing rails: rings entering the node through GPU 0 would have to exit from GPU 2, 4, or 6, none of which has direct access to the local NICs of GPU 0 (NICs 0 and 1).
On the other hand, PXN enables rings to be formed as GPU 2 could move data back to GPU 0 before going through NIC 0/1.
This case is common with model parallelism, depending on how the model is split. If, for example, one model is split across GPUs 0-3, then another model instance runs on GPUs 4-7. That means GPUs 0 and 4 take care of the same part of the model, and an NCCL communicator is created with all GPUs 0 and 4 on all nodes to perform all-reduce operations for the corresponding layers. Those communicators can't perform all-reduce operations efficiently without PXN.
The only way to have efficient model parallelism so far was to split the model on GPUs 0, 2, 4, 6 and 1, 3, 5, 7 so that NCCL subcommunicators would include GPUs [0,1], [2,3], [4,5] and [6,7] instead of [0,4], [1,5], [2,6], and [3,7]. The new PXN feature gives you more flexibility and eases the use of model parallelism.
Figure 6. NCCL 2.12 PXN performance improvements
Figure 6 contrasts the time to complete all2all collective operations with and without PXN. In addition, PXN enables a more flexible choice of GPUs for all-reduce operations.
Summary
The NCCL 2.12 release significantly improves all2all collective communication performance. Download the latest NCCL release and experience the improved performance firsthand.