Categories
Misc

Curating Data for Transfer Learning with the NVIDIA TAO Toolkit and Innotescus

Using NVIDIA TAO Toolkit and Innotescus’ data curation and analysis platform to improve a popular object detection model’s performance on the person class by over 20%.

AI applications are powered by machine learning models that are trained to predict outcomes accurately based on input data such as images, text, or audio. Training a machine learning model from scratch requires vast amounts of data and a considerable amount of human expertise, often making the process too expensive and time-consuming for most organizations.

Transfer learning is the happy medium between building a custom model from scratch and choosing an off-the-shelf commercial model to integrate into an ML application. With transfer learning, you can select a pretrained model that’s related to your solution and retrain it on data reflecting your specific use case. Transfer learning strikes the right balance between the custom-everything approach (often too expensive)  and an off-the-shelf approach (often too rigid) and enables you to build tailored solutions with fewer resources.

The NVIDIA TAO Toolkit enables you to apply transfer learning to pretrained models and create custom, production-ready models without the complexity of AI frameworks. To train these models, high-quality data is a must. While TAO focuses on the model-centric steps of the development process, Innotescus focuses on the data-centric steps. 

Innotescus is a web-based platform for annotating, analyzing, and curating robust, unbiased datasets for computer vision–based machine learning. Innotescus helps teams scale operations without sacrificing quality. The platform includes automated and assisted annotation for both images and videos, consensus and review features for QA processes, and interactive analytics for proactive dataset analysis and balancing. Together, Innotescus and the TAO Toolkit make it cost-effective for organizations to apply transfer learning successfully in custom applications, arriving at high-performing solutions in little time.

In this post, we address the challenges of building a robust object detection model by integrating the NVIDIA TAO Toolkit with Innotescus. This solution alleviates several common pain points that businesses encounter when building and deploying commercial solutions.

YOLO object detection model

Your goal in this project is to apply transfer learning to the YOLO object detection model in the TAO toolkit using data curated on Innotescus.

Object detection is the ability to localize and classify objects with a bounding box in an image or video. It is one of the most widely used applications of computer vision technology. Object detection solves many complex, real-world challenges, such as the following:

  • Context and scene understanding
  • Automating solutions for smart retail
  • Autonomous driving
  • Precision agriculture

Why should you use YOLO for this model? Traditionally, deep learning–based object detectors operate through a two-stage process. In the first stage, the model identifies regions of interest in an image. In the second stage, each of these regions is classified.

Typically, many regions are sent to the classification stage, and because classification is an expensive operation, two-stage object detectors are extremely slow. YOLO stands for “You only look once.” As the name suggests, YOLO can localize and classify simultaneously, leading to highly accurate real-time performance, which is essential for most deployable solutions. In April 2020, the fourth iteration of YOLO was published. It has been tested on a multitude of applications and industries and has proven to be robust. 

Figure 1 shows the general pipeline for training object detection models. For each step of this more traditional development pipeline, we discuss the typical challenges that people encounter and how the combination of TAO and Innotescus solves these problems.

A high-level AI workflow includes data collection, followed by curation to ensure high-quality training data. The data then is used to train an AI model, which is then tested and deployed for inference.
Figure 1. Typical AI development workflow

Before you begin, install the TAO toolkit and authenticate your instance of the Innotescus API.

Installing TAO Toolkit

The TAO toolkit brings together a collection of NVIDIA’s technologies such as cuDNN, CUDA, and TensorRT. Start with an NVIDIA pretrained model and an NVIDIA-optimized model architecture and then train, adapt, and optimize it with custom data. The optimized model can then be deployed with DeepStream for inference.
Figure 2. TAO Toolkit stack

The TAO toolkit can be run as a CLI or in a Jupyter notebook. It’s only compatible with Python3 (3.6.9 and 3.7), so first install the prerequisites.

  • Install docker-ce. On Linux, check the post-installation steps to ensure that Docker can be run without sudo.
  • Install nvidia-container-toolkit.
  • Create an NGC account and generate an API key for authentication.
  • Log in to the NGC Docker registry by running the command docker login nvcr.io and entering your credentials.

After the prerequisites are installed, install the TAO toolkit. NVIDIA recommends installing the package in a virtual environment using virtualenvwrapper. To install the TAO launcher Python package, run the following commands:

pip3 install nvidia-pyindex
pip3 install nvidia-tao

Check whether you’ve gone through the installation correctly by running tao --help.
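If you follow NVIDIA's recommendation of a virtual environment, a minimal virtualenvwrapper setup (run before the pip commands above) might look like the following sketch. The virtualenvwrapper.sh path and Python interpreter path are assumptions that vary by system.

pip3 install virtualenvwrapper
export WORKON_HOME=~/.virtualenvs            # where the environments are stored
source /usr/local/bin/virtualenvwrapper.sh   # path varies by installation
mkvirtualenv -p /usr/bin/python3 launcher    # create and activate a 'launcher' environment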

Accessing the Innotescus API

Innotescus is accessible as a web-based application, but you’ll also use its API to demonstrate how to accomplish the same tasks programmatically. To begin, install the Innotescus library.

pip install innotescus

Next, authenticate the API instance using the client_id and client_secret values retrieved from the platform.

Screenshot of API interface.
Figure 3. Generate and retrieve API keys
from innotescus import client_factory
client = client_factory(client_id='client_id', client_secret='client_secret')

Now you’re ready to interact with the platform through the API, which you’ll do as you walk through each step of the pipeline that follows.

Data collection

You need data to train the model. Though it’s often overlooked, data collection is arguably the most important step in the development process. While collecting data, you should ask yourself a few questions:

  • Is the training data adequately representative of each object of interest?
  • Are you accounting for all the scenarios in which you expect the model to be deployed?
  • Do you have enough data to train the model?

You can't always answer these questions completely, but having a well-rounded game plan for data collection helps you avoid issues during subsequent steps in the development process. Data collection is a time-consuming and expensive process. Because the models provided by TAO are pretrained, the data requirements for retraining are much smaller, saving organizations significant resources in this phase.

For this experiment, you use images and annotations from the MS COCO Validation 2017 dataset. This dataset has 5,000 images with 80 different classes, but you only use the 2,685 images containing at least one person.

%matplotlib inline
from pycocotools.coco import COCO
import matplotlib.pyplot as plt
import skimage.io as io

dataDir = 'Your Data Directory'
dataType = 'val2017'
annFile = '{}/annotations/instances_{}.json'.format(dataDir, dataType)

coco = COCO(annFile)

catIds = coco.getCatIds(catNms=['person'])  # only using the 'person' category
imgIds = coco.getImgIds(catIds=catIds)

for num_imgs in range(len(imgIds)):
	img = coco.loadImgs(imgIds[num_imgs])[0]
	I = io.imread(img['coco_url'])
A collage of images showing examples of the ‘person’ class.
Figure 4. Examples of images from the dataset that include one or more ‘person’ objects

With the authenticated instance of the Innotescus client, begin setting up a project and uploading the human-focused dataset.

#create a new project
client.create_project(project_name)
#upload data to the new project (a complete example call follows the parameter list below)
client.upload_data(project_name, dataset_name, file_paths, data_type, storage_type)
  • data_type: The type of data this dataset holds. Accepted values:
    • DataType.IMAGE
    • DataType.VIDEO
  • storage_type: The source of the data. Accepted values:
    • StorageType.FILE_SYSTEM
    • StorageType.URL
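Putting these parameters together, an upload call might look like the following sketch. The project and dataset names and file path are hypothetical, and the assumption that the DataType and StorageType enums are importable directly from the innotescus package should be checked against the library's documentation.

from innotescus import client_factory, DataType, StorageType

client = client_factory(client_id='client_id', client_secret='client_secret')
client.create_project('person-detection')       # hypothetical project name
client.upload_data(
    project_name='person-detection',
    dataset_name='coco-val2017-person',         # hypothetical dataset name
    file_paths=['/path/to/person_images/'],     # placeholder local directory
    data_type=DataType.IMAGE,                   # still images, not video
    storage_type=StorageType.FILE_SYSTEM,       # upload from the local file system
)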

This dataset is now accessible through the Innotescus user interface.

Pictures show the interface for browsing, editing, and labeling the dataset.
Figure 5. Gallery view of the human-centric COCO Validation 2017 dataset from within the Innotescus platform

Data curation

Now that you have your initial dataset, begin curating it to ensure a well-balanced dataset. Studies have repeatedly shown that this phase of the process takes around 80% of the time spent on a machine learning project.

Using TAO and Innotescus, we highlight techniques like pre-annotation and review that save time during this step without sacrificing dataset size or quality.

Pre-annotation

Pre-annotation enables you to use model-generated annotations to remove a significant amount of the time and manual effort necessary to label the subset of 2,685 images accurately. You use YOLOv4—the same model that you’re retraining—to generate pre-annotations for the annotators to refine.

Because pre-annotation saves you so much time on the easier components of the annotation task, you can focus your attention on the harder examples that the model can’t yet handle.

YOLOv4 is included in the TAO toolkit and supports k-means clustering, training, evaluation, inference, pruning, and exporting. To use the model, you must first create a YOLOv4 spec file, which has the following major components:

  • yolov4_config
  • training_config
  • eval_config
  • nms_config
  • augmentation_config
  • dataset_config

The spec file is a protobuf text (prototxt) message, and each of its fields can be either a basic data type or a nested message.

Next, download the model with pretrained weights. The TAO Toolkit Docker container provides access to a repository of pretrained models that serve as a great starting point when training deep neural networks. Because these models are hosted on the NGC catalog, you must first download and install the NGC CLI. For more information, see the NGC documentation.

After you’ve installed the CLI, you can see the list of pretrained computer vision models on the NGC repo, and download pretrained models.

ngc registry model list nvidia/tao/pretrained_*
ngc registry model download-version /path/to/model_on_NGC_repo/ -dest /path/to/model_download_dir/

With the model downloaded and spec file updated, you can now generate pre-annotations by running the inference subtask.

tao yolo_v4 inference [-h] -i /path/to/imgFolder/ -l /path/to/annotatedOutput/ -e /path/to/specFile.txt -m /path/to/model/ -k $KEY

The output of the inference subtask is a series of annotations in the KITTI format, saved in the specified output directory. Figure 6 shows two examples of these annotations:

The pretrained YOLOv4 model generates pre-annotations for the dataset to save you time in the manual annotation process.
Figure 6. Example annotations generated by the TAO Toolkit using the pretrained YOLOv4 model

Upload the pre-annotations to the Innotescus platform, either manually through the web-based user interface or programmatically through the API. Because KITTI is one of the many annotation formats that Innotescus accepts, no preprocessing is needed.

Screenshot of the annotation import process in the Innotescus UI.
Figure 7. Pre-annotation upload process
#upload pre-annotations generated by YOLOv4 (a complete example call follows the parameter list below)
response = client.upload_annotations(project_name, dataset_name, task_type, data_type, annotation_format, file_paths, task_name, task_description, overwrite_existing_annotations, pre_annotate)
  • project_name: The name of the project containing the affected dataset and task.
  • dataset_name: The name of the dataset to which these annotations are to be applied.
  • task_type: The type of annotation task being created with these annotations. Accepted values from the TaskType class:
    • CLASSIFICATION
    • OBJECT_DETECTION
    • SEGMENTATION
    • INSTANCE_SEGMENTATION
  • data_type: The type of data to which the annotations correspond. Accepted values:
    • DataType.IMAGE
    • DataType.VIDEO
  • annotation_format: The format in which these annotations are stored. Accepted values from the AnnotationFormat class:
    • COCO
    • KITTI
    • MASKS_PER_CLASS
    • PASCAL
    • CSV
    • MASKS_SEMANTIC
    • MASKS_INSTANCE
    • INNOTESCUS_JSON
    • YOLO_DARKNET
    • YOLO_KERAS
  • file_paths: A list of file paths containing the annotation files to upload.
  • task_name: The name of the task to which these annotations belong; if the task does not exist, it is created and populated with these annotations.
  • task_description: A description of the task being created, if the task does not exist yet.
  • overwrite_existing_annotations: If the task already exists, this flag allows you to overwrite existing annotations.
  • pre_annotate: Allows you to import annotations as pre-annotations.
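The example call referenced above might look like the following sketch. All names and paths are placeholders, and the assumption that the TaskType, DataType, and AnnotationFormat enums are importable from the innotescus package should be checked against the library's documentation.

from innotescus import TaskType, DataType, AnnotationFormat

response = client.upload_annotations(
    project_name='person-detection',
    dataset_name='coco-val2017-person',
    task_type=TaskType.OBJECT_DETECTION,
    data_type=DataType.IMAGE,
    annotation_format=AnnotationFormat.KITTI,   # KITTI files produced by TAO inference
    file_paths=['/path/to/annotatedOutput/'],   # placeholder output directory
    task_name='person-detection-review',        # created if it does not exist yet
    task_description='YOLOv4 pre-annotations for human review',
    overwrite_existing_annotations=False,
    pre_annotate=True,                          # import as pre-annotations
)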

With the pre-annotations imported to the platform and a significant amount of the initial annotation work saved, move into Innotescus to further correct, refine, and analyze the data.

Review and correction

With the pre-annotations successfully imported, head over to the platform to review and correct them. While the pretrained model saves a significant amount of annotation time, it's still not perfect and needs some human-in-the-loop interaction to ensure high-quality training data. Figure 8 shows an example of a typical correction that you might make.

Image shows an extra bounding box placed over the crowd, rather than just one bounding box per person.
Figure 8. Error in the pre-annotations generated with the pretrained YOLOv4

Beyond a first pass at fixing and submitting pre-annotations, Innotescus enables a more focused sampling of images and annotations for multistage review. This enables large teams to ensure high quality throughout the dataset systematically and efficiently.

Review canvas where Innotescus users can edit, accept, and reject annotations with comments to support a robust, collaborative review process.
Figure 9. Review and correction process on Innotescus

Exploratory data analysis

Exploratory data analysis, or EDA, is the process of investigating and visualizing datasets from multiple statistical angles to get a holistic understanding of the underlying patterns, anomalies, and biases present in the data. It is an effective and necessary step to take before thoughtfully addressing the statistical imbalances your dataset contains.

Innotescus provides precalculated metrics for understanding class, color, spatial, and complexity distributions for both data and annotations, and enables you to add your own layer of information in image and annotation metadata to incorporate application-specific information into the analytics.

Here's how you can use Innotescus's dive visualization to understand some of the patterns and biases present in the dataset. The following scatter plot shows the distribution of image entropy, which is the average information or degree of randomness in an image, within the dataset along the x-axis. You can see a clear pattern, but you can also spot anomalies like images with low entropy or information content.

Figure shows a graph of the dataset in the Innotescus UI, with the image entropy of each image graphed along the x-axis.
Figure 10. Dataset graph on Innotescus

Outliers like these raise questions of how to handle anomalies within a dataset. Recognizing anomalies enables you to ask some crucial questions:

  • Do you expect the model, when deployed, to encounter low-entropy input?
  • If so, do you need more such examples in the training dataset?
  • If not, are these examples going to be detrimental for training, and should they be removed from the training dataset?

In another example, look at each annotation’s area, relative to the image that it’s in.

The images show a graph of each annotation’s size relative to the image that it’s in, to reveal a bias towards relatively small ‘person’ objects within the annotated dataset.
Figure 12. Using dive charts to investigate a number of metrics calculated by Innotescus

In Figure 13, the two images show the variation in annotation sizes within the dataset. While some annotations capture people that take up lots of the image, most show people far away from the camera.

Here, a large percentage of annotations are between 0 and 10% of their respective image sizes. This means that the dataset is biased towards small objects, or people that are far from the camera. Do you then need more examples in the training data that have larger annotations to represent people closer to the camera? Understanding the data distribution in this way helps you to begin thinking about the plan for data augmentation.

With Innotescus, EDA is made intuitive. It provides you with the information that you need to make powerful augmentations to your dataset and eliminate bias early in the development process.

Cluster rebalancing with dataset augmentation

The idea behind augmentation for cluster rebalancing is powerful. This technique produced a 21% boost in performance in the recent data-centric AI competition hosted by Andrew Ng and DeepLearning.AI.

You generate an N-dimensional feature vector for each data point (each bounding box annotation) and cluster all data points in this higher-dimensional space. After clustering objects with similar features, you augment the dataset so that each cluster has equal representation.

We chose to use [red channel mean, green channel mean, blue channel mean, gray image std, gray image entropy, relative area] as the N-dimensional feature vector. These metrics were exported from Innotescus, which automatically calculated them. You could also use the embeddings generated by the pretrained model to populate the feature vector, which would arguably be more robust.
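For reference, a sketch of how this six-dimensional feature vector could be computed for a single bounding-box annotation is shown below. Innotescus exports these metrics directly, so this is only illustrative; the image path and box coordinates are placeholders, and the image is assumed to be RGB.

from skimage import io, color
from skimage.measure import shannon_entropy

def annotation_features(image_path, bbox):
    # bbox is (x, y, width, height) in pixels
    img = io.imread(image_path)
    x, y, w, h = bbox
    crop = img[y:y + h, x:x + w]
    gray = color.rgb2gray(crop)
    return [
        crop[..., 0].mean(),                      # red channel mean
        crop[..., 1].mean(),                      # green channel mean
        crop[..., 2].mean(),                      # blue channel mean
        gray.std(),                               # gray image standard deviation
        shannon_entropy(gray),                    # gray image entropy
        (w * h) / (img.shape[0] * img.shape[1]),  # area relative to the full image
    ]

# Stacking one such vector per annotation yields the featureVector used below.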

You use k-means clustering with k=4 as the clustering algorithm and UMAP for reducing the dimensions to two for visualization. The following code example generates the graph that shows the UMAP plot, color-coded with these four clusters.

import umap
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# k-means on the feature vector
kmeans = KMeans(n_clusters=4, random_state=0).fit(featureVector)

# UMAP for dimensionality reduction and visualization
fit = umap.UMAP(n_neighbors=5,
		min_dist=0.2,
		n_components=2,
		metric='manhattan')

u = fit.fit_transform(featureVector)

# Plot the UMAP components, color-coded by k-means cluster
plt.scatter(u[:,0], u[:,1], c=kmeans.labels_)
plt.title('UMAP embedding of kmeans colours')
The plot shows the four unbalanced clusters of annotations.
Figure 14. Four clusters, plotted on two dimensions

When you look at the number of objects in each cluster, you can clearly see the imbalance, which informs how you should augment the data for retraining. The four clusters contain 854, 1,523, 1,481, and 830 images, respectively. Where an image has objects in more than one cluster, group that image with the cluster that contains most of its objects for augmentation.

clusters = {}

# Group each object's source image under its k-means cluster label
for file, cluster in zip(filename, kmeans.labels_):
	clusters.setdefault(cluster, []).append(file)

for numCls in range(len(clusters)):
	print('Cluster {}: {} objects, {} images'.format(numCls + 1, len(clusters[numCls]), len(set(clusters[numCls]))))

Output:

Cluster 1: 2234 objects, 854 images
Cluster 2: 3490 objects, 1523 images
Cluster 3: 3629 objects, 1481 images
Cluster 4: 1588 objects, 830 images

With the clusters well defined, you use the imgaug Python library to introduce augmentation techniques to enhance the training data: translation, image brightness adjustment, and scale augmentation. You augment such that each cluster contains 2,000 images for a total of 8,000. As you augment images, imgaug ensures that the annotation coordinates are altered appropriately as well.

import imgaug as ia
import imgaug.augmenters as iaa

# augment images
seq = iaa.Sequential([
	iaa.Multiply([1.1, 1.5]), # change brightness, doesn't affect BBs
	iaa.Affine(
		translate_px={"x": 60, "y": 60},
		scale=(0.5, 0.8)
	) # translate by 60px on x/y axes & scale to 50-80%, includes BBs
])

# augment BBs and images
image_aug, bbs_aug = seq(image=I, bounding_boxes=boundingBoxes)
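To hit the 2,000-image target for each cluster, you can first work out how many augmented copies each cluster needs. The following is a hedged sketch of that bookkeeping, assuming the clusters dictionary built earlier maps each cluster label to a list of image filenames (one entry per object); the chosen source images would then be passed through the seq pipeline above.

import random

TARGET_IMAGES_PER_CLUSTER = 2000
augmentation_plan = {}
for cluster_id, files in clusters.items():
    unique_images = sorted(set(files))
    n_missing = max(0, TARGET_IMAGES_PER_CLUSTER - len(unique_images))
    # Sample source images (with replacement) to augment for this cluster
    augmentation_plan[cluster_id] = random.choices(unique_images, k=n_missing)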

Using the same UMAP visualization technique, with augmented data points now in red, you see that the dataset is now much more balanced, as it more closely resembles a Gaussian distribution.

A plot showing the newly rebalanced dataset. With the augmented dataset, the distribution of annotations is much closer to a normal distribution than before.
Figure 15. Rebalanced clusters

Model training

With the well-balanced, high-quality training data, the final step is to train the model.

YOLOv4 retraining on TAO Toolkit

To start retraining the model, first ensure that the spec file contains the classes of interest, as well as the correct directory paths for the pretrained model and training data. Change the training parameters in the training_config section. Reserve 30% of the augmented dataset as a test dataset to compare the performance of the pretrained model and the performance of the retrained model.

training_config {
	batch_size_per_gpu: 8
	num_epochs: 80
	enable_qat: false
	checkpoint_interval: 10
	learning_rate {
		soft_start_cosine_annealing_schedule {
			min_learning_rate: 1e-7
			max_learning_rate: 1e-4
			soft_start: 0.3
		}
	}
	regularizer {
		type: L1
		weight: 3e-5
	}
	optimizer {
		adam {
			epsilon: 1e-7
			beta1: 0.9
			beta2: 0.999
			amsgrad: false
		}
	}
	pretrain_model_path: "path/to/model/model.hdf5"
}

Run the training command.

tao yolo_v4 train -e /path/to/specFile.txt -r /path/to/result -k $KEY

Results

As you can see, retraining improved the mean average precision by 14.93 percentage points, a 21.37% relative boost over the pretrained model's mAP:

Model | mAP50
YOLOv4 pretrained model | 69.86%
YOLOv4 retrained model with cluster-rebalanced augmentation | 84.79%
Table 1. Model performance before and after applying transfer learning with the curated dataset

Summary

Using the NVIDIA TAO Toolkit for pre-annotation and model training, and Innotescus for data refinement, analysis, and curation, you improved YOLOv4's mean average precision on the person class by a substantial amount: over 20%. Not only did you improve performance on a selected class, but you also used less time and data than you would have without the benefits of transfer learning.

Transfer learning is a great way to produce high-performing, application-specific models in settings with constrained resources. Using tools like the TAO toolkit and Innotescus makes it feasible for teams of all sizes and backgrounds.

Try it for yourself

Interested in using Innotescus to enhance and refine your own dataset? Sign up for a free trial. Get started with the TAO toolkit for your AI model training by downloading the sample resources.

Categories
Misc

Deep Learning Study Could Spark New Dinosaur Discoveries

Researchers combine CT imaging with deep learning to evaluate dinosaur fossils. The approach could change how paleontologists study ancient remains.

Applying new technology to studying ancient history, researchers are looking to expand their understanding of dinosaurs with a new AI algorithm. The study, published in Frontiers in Earth Science, uses high-resolution Computed Tomography (CT) imaging combined with deep learning models to scan and evaluate dinosaur fossils. The research is a step toward creating a new tool that would vastly change the way paleontologists study ancient remains. 

“Computed Tomography as well as other imaging techniques have revealed previously hidden structures in fossils, but the high-resolution images require paleontologists spending weeks to even months in post-processing, usually segmenting fossils from rock matrices. The introduction of AI can not only accelerate data processing in fossil studies, but also establish benchmarks for more objective and more reproducible studies,” said lead author Congyu Yu, a Ph.D. student at the Richard Gilder Graduate School at the American Museum of Natural History. 

For a complete picture of ancient vertebrates, paleontologists focus on internal anatomy such as cranial capacity, inner ears, or vascular spaces. To do this, researchers use a technique called thin sectioning. Removing a small piece (as thin as several micrometers) from a fossil, examining it under a microscope, and annotating the structures they find, helps them piece together the morphology of a dinosaur. However, this technique is destructive to the remains and can be extremely time consuming.

Computed tomography (CT) scans have given scientists the ability to look inside a sample while leaving the fossil unscathed. The technology essentially examines a fossil section, capturing thousands of images of it. Software then reconstructs the images and generates a three-dimensional graphic, resulting in an internal snapshot of the sample. Scientists can then examine and label identifiable morphology in the graphic to learn more about a specimen.

Imaging has given scientists a tool for revealing hidden internal structures and advancing 3D models of dinosaurs. Studies have helped researchers estimate body mass, analyze skulls, and even understand dental morphology along with tooth replacement patterns.

However, with this approach, scientists still manually choose segments, examine, and label images, which, beyond being time intensive, is subjective and can introduce errors. Plus, scans have limitations differentiating between the rock that may be coating a fossil and the bones themselves, making it difficult to determine where the rock ends and the fossil begins. 

AI has proven capable of quick image segmentation in the medical world, ranging from identifying brain lesions to skin cancer. The researchers saw an opportunity to apply similar deep learning models to CT fossil images.  

They tested this new approach using deep neural networks and over 10,000 annotated CT scans of three well-preserved embryonic skulls of Protoceratops dinosaurs. Recovered in the 1990s from the Mongolian Gobi Desert, these fossils come from an early horned dinosaur that is a smaller relative of the better-known Triceratops.

The team used a classic U-net deep neural network for processing fossil segmentation, teaching the algorithm to recognize rock from the fossils. A modified DeepLab v3+ network was used for training feature identification, categorizing parts of the CT images, and 3D rendering. 

The models were trained using 7,986 manually annotated bone structure CT slices on the cuDNN-accelerated TensorFlow deep learning framework with dual NVIDIA GeForce RTX 2080 Ti GPUs.

Testing the results against a dataset of 3,329 slices, they found that while the segmentation model reached a high accuracy of around 97%, the 3D feature renderings were not as meticulous or accurate as those produced by humans. While the feature-identification models did not perform as accurately as the scientists, the segmentation models worked smoothly and did so in record time. The models segmented each slice in seconds; manually segmenting the same piece took minutes or even hours in some cases. This could help paleontologists reduce the time spent differentiating fossils from rock. 

Comparison of 3d renderings of raw reconstruction, manual segmentation, and deep learning segmentation of skulls showing the raw reconstruction worked best.
Figure 1. Comparison of different 3D renderings from left: raw reconstruction, manual segmentation, and deep learning segmentation.

The researchers suggest that larger data sets incorporating other dinosaur species and different sediment types could help create a high-performing algorithm down the line. 

“We are confident that a segmentation model for fossils from the Gobi Desert is not far away, but a more generalized model needs not only more training dataset but innovations in algorithms,” Yu said in a press release. “I believe deep learning can eventually process imagery better than us, and there have already been various examples in deep learning performance exceeding humans, including Go playing and protein 3D-structure prediction.”

The dataset used in the study, CT Segmentation of Dinosaur Fossils by Deep Learning, is available for download.

Read the study in Frontiers in Earth Science. >>

Categories
Offsites

Federated Learning with Formal Differential Privacy Guarantees

In 2017, Google introduced federated learning (FL), an approach that enables mobile devices to collaboratively train machine learning (ML) models while keeping the raw training data on each user’s device, decoupling the ability to do ML from the need to store the data in the cloud. Since its introduction, Google has continued to actively engage in FL research and deployed FL to power many features in Gboard, including next word prediction, emoji suggestion and out-of-vocabulary word discovery. Federated learning is improving the “Hey Google” detection models in Assistant, suggesting replies in Google Messages, predicting text selections, and more.

While FL allows ML without raw data collection, differential privacy (DP) provides a quantifiable measure of data anonymization, and when applied to ML can address concerns about models memorizing sensitive user data. This too has been a top research priority, and has yielded one of the first production uses of DP for analytics with RAPPOR in 2014, our open-source DP library, Pipeline DP, and TensorFlow Privacy.

Through a multi-year, multi-team effort spanning fundamental research and product integration, today we are excited to announce that we have deployed a production ML model using federated learning with a rigorous differential privacy guarantee. For this proof-of-concept deployment, we utilized the DP-FTRL algorithm to train a recurrent neural network to power next-word-prediction for Spanish-language Gboard users. To our knowledge, this is the first production neural network trained directly on user data announced with a formal DP guarantee (technically ρ=0.81 zero-Concentrated-Differential-Privacy, zCDP, discussed in detail below). Further, the federated approach offers complementary data minimization advantages, and the DP guarantee protects all of the data on each device, not just individual training examples.

Data Minimization and Anonymization in Federated Learning
Along with fundamentals like transparency and consent, the privacy principles of data minimization and anonymization are important in ML applications that involve sensitive data.

Federated learning systems structurally incorporate the principle of data minimization. FL only transmits minimal updates for a specific model training task (focused collection), limits access to data at all stages, processes individuals’ data as early as possible (early aggregation), and discards both collected and processed data as soon as possible (minimal retention).

Another principle that is important for models trained on user data is anonymization, meaning that the final model should not memorize information unique to a particular individual’s data, e.g., phone numbers, addresses, credit card numbers. However, FL on its own does not directly tackle this problem.

The mathematical concept of DP allows one to formally quantify this principle of anonymization. Differentially private training algorithms add random noise during training to produce a probability distribution over output models, and ensure that this distribution doesn’t change too much given a small change to the training data; ρ-zCDP quantifies how much the distribution could possibly change. We call this example-level DP when adding or removing a single training example changes the output distribution on models in a provably minimal way.
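For reference, zero-concentrated differential privacy can be stated as follows: a randomized mechanism M satisfies ρ-zCDP if, for every pair of neighboring datasets D and D′ and every order α > 1, the Rényi divergence between the two output distributions is bounded by D_α(M(D) ‖ M(D′)) ≤ ρα. Smaller values of ρ therefore mean the two output distributions are harder to tell apart, which is why smaller ρ corresponds to stronger privacy.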

Showing that deep learning with example-level differential privacy was even possible in the simpler setting of centralized training was a major step forward in 2016. Achieved by the DP-SGD algorithm, the key was amplifying the privacy guarantee by leveraging the randomness in sampling training examples (“amplification-via-sampling”).

However, when users can contribute multiple examples to the training dataset, example-level DP is not necessarily strong enough to ensure the users’ data isn’t memorized. Instead, we have designed algorithms for user-level DP, which requires that the output distribution of models doesn’t change even if we add/remove all of the training examples from any one user (or all the examples from any one device in our application). Fortunately, because FL summarizes all of a user’s training data as a single model update, federated algorithms are well-suited to offering user-level DP guarantees.

Both limiting the contributions from one user and adding noise can come at the expense of model accuracy, however, so maintaining model quality while also providing strong DP guarantees is a key research focus.

The Challenging Path to Federated Learning with Differential Privacy
In 2018, we introduced the DP-FedAvg algorithm, which extended the DP-SGD approach to the federated setting with user-level DP guarantees, and in 2020 we deployed this algorithm to mobile devices for the first time. This approach ensures the training mechanism is not too sensitive to any one user’s data, and empirical privacy auditing techniques rule out some forms of memorization.

However, the amplification-via-sampling argument is essential to providing a strong DP guarantee for DP-FedAvg, and in a real-world cross-device FL system, ensuring that devices are subsampled precisely and uniformly at random from a large population would be complex and hard to verify. One challenge is that devices choose when to connect (or “check in”) based on many external factors (e.g., requiring the device to be idle, on unmetered WiFi, and charging), and the number of available devices can vary substantially.

Achieving a formal privacy guarantee requires a protocol that does all of the following:

  • Makes progress on training even as the set of devices available varies significantly with time.
  • Maintains privacy guarantees even in the face of unexpected or arbitrary changes in device availability.
  • For efficiency, allows client devices to locally decide whether they will check in to the server in order to participate in training, independent of other devices.

Initial work on privacy amplification via random check-ins highlighted these challenges and introduced a feasible protocol, but it would have required complex changes to our production infrastructure to deploy. Further, as with the amplification-via-sampling analysis of DP-SGD, the privacy amplification possible with random check-ins depends on a large number of devices being available. For example, if only 1000 devices are available for training, and participation of at least 1000 devices is needed in each training step, that requires either 1) including all devices currently available and paying a large privacy cost since there is no randomness in the selection, or 2) pausing the protocol and not making progress until more devices are available.

Achieving Provable Differential Privacy for Federated Learning with DP-FTRL
To address this challenge, the DP-FTRL algorithm is built on two key observations: 1) the convergence of gradient-descent-style algorithms depends primarily not on the accuracy of individual gradients, but the accuracy of cumulative sums of gradients; and 2) we can provide accurate estimates of cumulative sums with a strong DP guarantee by utilizing negatively correlated noise, added by the aggregating server: essentially, adding noise to one gradient and subtracting that same noise from a later gradient. DP-FTRL accomplishes this efficiently using the Tree Aggregation algorithm [1, 2].

The graphic below illustrates how estimating cumulative sums rather than individual gradients can help. We look at how the noise introduced by DP-FTRL and DP-SGD influence model training, compared to the true gradients (without added noise; in black) which step one unit to the right on each iteration. The individual DP-FTRL gradient estimates (blue), based on cumulative sums, have larger mean-squared-error than the individually-noised DP-SGD estimates (orange), but because the DP-FTRL noise is negatively correlated, some of it cancels out from step to step, and the overall learning trajectory stays closer to the true gradient descent steps.
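The intuition can be reproduced with a toy numerical sketch. This is not a DP mechanism and not the tree-aggregation scheme DP-FTRL actually uses; it only illustrates why noise applied to cumulative sums perturbs the learning trajectory less than independently noised per-step gradients.

import numpy as np

rng = np.random.default_rng(0)
T, sigma = 100, 1.0
true_grads = np.ones(T)                    # the true gradient is one unit per step

# DP-SGD-style: independent noise on every gradient; errors accumulate in the sum
sgd_trajectory = np.cumsum(true_grads + rng.normal(0, sigma, T))

# DP-FTRL-style (simplified): noise applied to each cumulative sum, so the error
# at step t is a single draw rather than a sum of t draws
ftrl_trajectory = np.cumsum(true_grads) + rng.normal(0, sigma, T)

print(abs(sgd_trajectory[-1] - T), abs(ftrl_trajectory[-1] - T))  # FTRL error is typically smaller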

To provide a strong privacy guarantee, we limit the number of times a user contributes an update. Fortunately, sampling-without-replacement is relatively easy to implement in production FL infrastructure: each device can remember locally which models it has contributed to in the past, and choose to not connect to the server for any later rounds for those models.

Production Training Details and Formal DP Statements
For the production DP-FTRL deployment introduced above, each eligible device maintains a local training cache consisting of user keyboard input, and when participating computes an update to the model which makes it more likely to suggest the next word the user actually typed, based on what has been typed so far. We ran DP-FTRL on this data to train a recurrent neural network with ~1.3M parameters. Training ran for 2000 rounds over six days, with 6500 devices participating per round. To allow for the DP guarantee, devices participated in training at most once every 24 hours. Model quality improved over the previous DP-FedAvg trained model, which offered empirically-tested privacy advantages over non-DP models, but lacked a meaningful formal DP guarantee.

The training mechanism we used is available in open-source in TensorFlow Federated and TensorFlow Privacy, and with the parameters used in our production deployment it provides a meaningfully strong privacy guarantee. Our analysis gives ρ=0.81 zCDP at the user level (treating all the data on each device as a different user), where smaller numbers correspond to better privacy in a mathematically precise way. As a comparison, this is stronger than the ρ=2.63 zCDP guarantee chosen by the 2020 US Census.

Next Steps
While we have reached the milestone of deploying a production FL model using a mechanism that provides a meaningfully small zCDP, our research journey continues. We are still far from being able to say this approach is possible (let alone practical) for most ML models or product applications, and other approaches to private ML exist. For example, membership inference tests and other empirical privacy auditing techniques can provide complementary safeguards against leakage of users’ data. Most importantly, we see training models with user-level DP with even a very large zCDP as a substantial step forward, because it requires training with a DP mechanism that bounds the sensitivity of the model to any one user’s data. Further, it smooths the road to later training models with improved privacy guarantees as better algorithms or more data become available. We are excited to continue the journey toward maximizing the value that ML can deliver while minimizing potential privacy costs to those who contribute training data.

Acknowledgements
The authors would like to thank Alex Ingerman and Om Thakkar for significant impact on the blog post itself, as well as the teams at Google that helped develop these ideas and bring them to practice:

  • Core research team: Galen Andrew, Borja Balle, Peter Kairouz, Daniel Ramage, Shuang Song, Thomas Steinke, Andreas Terzis, Om Thakkar, Zheng Xu
  • FL infrastructure team: Katharine Daly, Stefan Dierauf, Hubert Eichner, Igor Pisarev, Timon Van Overveldt, Chunxiang Zheng
  • Gboard team: Angana Ghosh, Xu Liu, Yuanbo Zhang
  • Speech team: Françoise Beaufays, Mingqing Chen, Rajiv Mathews, Vidush Mukund, Igor Pisarev, Swaroop Ramaswamy, Dan Zivkovic

Categories
Misc

Using a TensorFlow model as a loss function

I am trying to use an empirical metric as a loss function to train a Tensorflow model. Calculating the metric function is slow, but I can train a regression neural network to accurately and quickly predict the metric score after it is trained. Is there a straightforward way (or tutorial?) to use a trained Tensorflow or scikit-learn model as a custom loss function for a Tensorflow model?

Edit: I have found this StackOverflow entry as a starting point. I will try it out and report back.
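One minimal sketch of the pattern being described (not from the linked thread; all names and shapes are placeholders): freeze the trained metric regressor and call it inside a custom Keras loss so that gradients flow through its forward pass into the model being trained.

import tensorflow as tf

# Hypothetical stand-in for the trained metric regressor; in practice this would be
# loaded with tf.keras.models.load_model('path/to/metric_model').
metric_model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
metric_model.trainable = False               # keep the metric model frozen during training

def learned_metric_loss(y_true, y_pred):
    # Minimize the metric predicted by the frozen regressor (lower is assumed better)
    return tf.reduce_mean(metric_model(y_pred, training=False))

main_model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(4,))])
main_model.compile(optimizer='adam', loss=learned_metric_loss)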

submitted by /u/baudie

Categories
Offsites

Constrained Reweighting for Training Deep Neural Nets with Noisy Labels

Over the past several years, deep neural networks (DNNs) have been quite successful in driving impressive performance gains in several real-world applications, from image recognition to genomics. However, modern DNNs often have far more trainable model parameters than the number of training examples and the resulting overparameterized networks can easily overfit to noisy or corrupted labels (i.e., examples that are assigned a wrong class label). As a consequence, training with noisy labels often leads to degradation in accuracy of the trained model on clean test data. Unfortunately, noisy labels can appear in several real-world scenarios due to multiple factors, such as errors and inconsistencies in manual annotation and the use of inherently noisy label sources (e.g., the internet or automated labels from an existing system).

Earlier work has shown that representations learned by pre-training large models with noisy data can be useful for prediction when used in a linear classifier trained with clean data. In principle, it is possible to directly train machine learning (ML) models on noisy data without resorting to this two-stage approach. To be successful, such alternative methods should have the following properties: (i) they should fit easily into standard training pipelines with little computational or memory overhead; (ii) they should be applicable in “streaming” settings where new data is continuously added during training; and (iii) they should not require data with clean labels.

In “Constrained Instance and Class Reweighting for Robust Learning under Label Noise”, we propose a novel and principled method, named Constrained Instance reWeighting (CIW), with these properties that works by dynamically assigning importance weights both to individual instances and to class labels in a mini-batch, with the goal of reducing the effect of potentially noisy examples. We formulate a family of constrained optimization problems that yield simple solutions for these importance weights. These optimization problems are solved per mini-batch, which avoids the need to store and update the importance weights over the full dataset. This optimization framework also provides a theoretical perspective for existing label smoothing heuristics that address label noise, such as label bootstrapping. We evaluate the method with varying amounts of synthetic noise on the standard CIFAR-10 and CIFAR-100 benchmarks and observe considerable performance gains over several existing methods.

Method
Training ML models involves minimizing a loss function that indicates how well the current parameters fit to the given training data. In each training step, this loss is approximately calculated as a (weighted) sum of the losses of individual instances in the mini-batch of data on which it is operating. In standard training, each instance is treated equally for the purpose of updating the model parameters, which corresponds to assigning uniform (i.e., equal) weights across the mini-batch.

However, empirical observations made in earlier works reveal that noisy or mislabeled instances tend to have higher loss values than those that are clean, particularly during early to mid-stages of training. Thus, assigning uniform importance weights to all instances means that due to their higher loss values, the noisy instances can potentially dominate the clean instances and degrade the accuracy on clean test data.

Motivated by these observations, we propose a family of constrained optimization problems that solve this problem by assigning importance weights to individual instances in the dataset to reduce the effect of those that are likely to be noisy. This approach provides control over how much the weights deviate from uniform, as quantified by a divergence measure. It turns out that for several types of divergence measures, one can obtain simple formulae for the instance weights. The final loss is computed as the weighted sum of individual instance losses, which is used for updating the model parameters. We call this the Constrained Instance reWeighting (CIW) method. This method allows for controlling the smoothness or peakiness of the weights through the choice of divergence and a corresponding hyperparameter.

Schematic of the proposed Constrained Instance reWeighting (CIW) method.
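As a rough illustration (not the paper's exact formulation), one natural divergence choice leads to weights that look like a temperature-controlled softmax over negative losses, which downweights high-loss, likely noisy instances in each mini-batch:

import numpy as np

def instance_weights(losses, temperature=1.0):
    # Higher loss -> lower weight; temperature controls how peaked the weights are
    logits = -np.asarray(losses, dtype=float) / temperature
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w * len(losses)        # keep the mean weight at 1, as in uniform weighting

batch_losses = np.array([0.2, 0.3, 2.5, 0.25])   # the third example looks noisy
weighted_loss = float(np.mean(instance_weights(batch_losses) * batch_losses))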

Illustration with Decision Boundary on a 2D Dataset
As an example to illustrate the behavior of this method, we consider a noisy version of the Two Moons dataset, which consists of randomly sampled points from two classes in the shape of two half moons. We corrupt 30% of the labels and train a multilayer perceptron network on it for binary classification. We use the standard binary cross-entropy loss and an SGD with momentum optimizer to train the model. In the figure below (left panel), we show the data points and visualize an acceptable decision boundary separating the two classes with a dotted line. The points marked red in the upper half-moon and those marked green in the lower half-moon indicate noisy data points.

The baseline model trained with the binary cross-entropy loss assigns uniform weights to the instances in each mini-batch, thus eventually overfitting to the noisy instances and resulting in a poor decision boundary (middle panel in the figure below).

The CIW method reweights the instances in each mini-batch based on their corresponding loss values (right panel in the figure below). It assigns larger weights to the clean instances that are located on the correct side of the decision boundary and damps the effect of noisy instances that incur a higher loss value. Smaller weights for noisy instances help in preventing the model from overfitting to them, thus allowing the model trained with CIW to successfully converge to a good decision boundary by avoiding the impact of label noise.

Illustration of decision boundary as the training proceeds for the baseline and the proposed CIW method on the Two Moons dataset. Left: Noisy dataset with a desirable decision boundary. Middle: Decision boundary for standard training with cross-entropy loss. Right: Training with the CIW method. The size of the dots in (middle) and (right) are proportional to the importance weights assigned to these examples in the minibatch.


Constrained Class reWeighting
Instance reweighting assigns lower weights to instances with higher losses. We further extend this intuition to assign importance weights over all possible class labels. Standard training uses a one-hot label vector as the class weights, assigning a weight of 1 to the labeled class and 0 to all other classes. However, for the potentially mislabeled instances, it is reasonable to assign non-zero weights to classes that could be the true label. We obtain these class weights as solutions to a family of constrained optimization problems where the deviation of the class weights from the label one-hot distribution, as measured by a divergence of choice, is controlled by a hyperparameter.

Again, for several divergence measures, we can obtain simple formulae for the class weights. We refer to this as Constrained Instance and Class reWeighting (CICW). The solution to this optimization problem also recovers the earlier proposed methods based on static label bootstrapping (also referred as label smoothing) when the divergence is taken to be total variation distance. This provides a theoretical perspective on the popular method of static label bootstrapping.

Using Instance Weights with Mixup
We also propose a way to use the obtained instance weights with mixup, which is a popular method for regularizing models and improving prediction performance. It works by sampling a pair of examples from the original dataset and generating a new artificial example using a random convex combination of these. The model is trained by minimizing the loss on these mixed-up data points. Vanilla mixup is oblivious to the individual instance losses, which might be problematic for noisy data because mixup will treat clean and noisy examples equally. Since a high instance weight obtained with our CIW method is more likely to indicate a clean example, we use our instance weights to do a biased sampling for mixup and also use the weights in convex combinations (instead of random convex combinations in vanilla mixup). This results in biasing the mixed-up examples towards clean data points, which we refer to as CICW-Mixup.
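For reference, vanilla mixup, which CICW-Mixup modifies, can be sketched in a few lines; labels are assumed to be one-hot vectors, and alpha is the usual Beta-distribution hyperparameter.

import numpy as np

def mixup_pair(x1, y1, x2, y2, alpha=0.2):
    # Random convex combination of two examples and their one-hot labels
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2

# CICW-Mixup, as described above, would instead sample the pair with probability biased
# toward high-weight (likely clean) examples and use the instance weights in place of lam.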

We apply these methods with varying amounts of synthetic noise (i.e., the label for each instance is randomly flipped to other labels) on the standard CIFAR-10 and CIFAR-100 benchmark datasets. We show the test accuracy on clean data with symmetric synthetic noise where the noise rate is varied between 0.2 and 0.8.

We observe that the proposed CICW outperforms several methods and matches the results of dynamic mixup, which maintains the importance weights over the full training set with mixup. Using our importance weights with mixup in CICW-Mixup resulted in significantly improved performance over these methods, particularly for larger noise rates (as shown by lines above and to the right in the graphs below).

Test accuracy on clean data while varying the amount of symmetric synthetic noise in the training data for CIFAR-10 and CIFAR-100. Methods compared are: standard Cross-Entropy Loss (CE), Bi-tempered Loss, Active-Passive Normalized Loss, the proposed CICW, Mixup, Dynamic Mixup, and the proposed CICW-Mixup.

Summary and Future Directions
We formulate a novel family of constrained optimization problems for tackling label noise that yield simple mathematical formulae for reweighting the training instances and class labels. These formulations also provide a theoretical perspective on existing label smoothing–based methods for learning with noisy labels. We also propose ways of using the instance weights with mixup that result in further significant performance gains over instance and class reweighting. Our method operates solely at the level of mini-batches, which avoids the extra overhead of maintaining dataset-level weights as in some of the recent methods.

As a direction for future work, we would like to evaluate the method on realistic noisy labels that are encountered in large scale practical settings. We also believe that studying the interaction of our framework with label smoothing is an interesting direction that can result in a loss adaptive version of label smoothing. We are also excited to release the code for CICW, now available on Github.

Acknowledgements
We’d like to thank Kevin Murphy for providing constructive feedback during the course of the project.

Categories
Misc

Doubling all2all Performance with NVIDIA Collective Communication Library 2.12

The NCCL 2.12 release significantly improves all2all communication collective performance, with the PXN feature.

Collective communications are a performance-critical ingredient of modern distributed AI training workloads such as recommender systems and natural language processing.

NVIDIA Collective Communication Library (NCCL), a Magnum IO Library, implements GPU-accelerated collective operations:

  • all-gather
  • all-reduce
  • broadcast
  • reduce
  • reduce-scatter
  • point-to-point send and receive

NCCL is topology-aware and is optimized to achieve high bandwidth and low latency over PCIe, NVLink, Ethernet, and InfiniBand interconnects. The NCCL GCP and AWS plugins enable high-performance NCCL operations in popular cloud environments with custom network connectivity.

NCCL releases have been relentlessly focusing on improving collective communication performance. This post focuses on the improvements that come with the NCCL 2.12 release.

Combining NVLink and network communication

The new feature introduced in NCCL 2.12 is called PXN, for PCI × NVLink, because it enables a GPU to communicate with a NIC on the node through NVLink and then PCI. This is instead of going through the CPU using QPI or other inter-CPU protocols, which would not be able to deliver full bandwidth. That way, even though each GPU still tries to use its local NIC as much as possible, it can reach other NICs if required.

Instead of preparing a buffer on its local memory for the local NIC to send, the GPU prepares a buffer on an intermediate GPU, writing to it through NVLink. It then notifies the CPU proxy managing that NIC that the data is ready, instead of notifying its own CPU proxy. The GPU-CPU synchronization might be a little slower because it may have to cross CPU sockets, but the data itself only uses NVLink and PCI switches, guaranteeing maximum bandwidth.

Topology that shows NIC0s of all DGXs connected to the same switch, NIC1s to another leaf switch and so on.
Figure 1. Rail-optimized topology

In the topology in Figure 1, NIC-0 from each DGX system is connected to the same leaf switch (L0), NIC-1s are connected to the same leaf switch (L1), and so on. Such a design is often called rail-optimized. Rail-optimized network topology helps maximize all-reduce performance while minimizing network interference between flows. It can also reduce the cost of the network by having lighter connections between rails.

PXN leverages NVIDIA NVSwitch connectivity between GPUs within the node to first move data on a GPU on the same rail as the destination, then send it to the destination without crossing rails. That enables message aggregation and network traffic optimization.

Topology shows PXN avoiding second-tier spine switches.
Figure 2. Example message path from GPU0 in DGX-A to GPU3 in DGX-B

Before NCCL 2.12, the message in Figure 2 would have traversed three hops of network switches (L0, S1, and L3), potentially causing contention and being slowed down by other traffic. The messages passed between the same pair of NICs are aggregated to maximize the effective message rate and network bandwidth.

Message aggregation

With PXN, all GPUs on a given node move their data onto a single GPU for a given destination. This enables the network layer to aggregate messages, by implementing a new multireceive function. The function enables the remote CPU proxy to send all messages as one as soon as they are all ready.

For example, if a GPU on a node is performing an all2all operation and is to receive data from all eight GPUs from a remote node, NCCL calls a multireceive with eight buffers and sizes. On the sender side, the network layer can then wait until all eight sends are ready, then send all eight messages at one time, which can have a significant effect on the message rate.

Another aspect of message aggregation is that connections are now shared between all GPUs of a node for a given destination. This means fewer connections to establish. It can also affect the routing efficiency, if the routing algorithm was relying on having a lot of different connections to get good entropy.

PXN improves all2all performance

Diagram shows All2all is like a matrix transpose operation using a 4x4 matrix example.
Figure 3. all2all collective operation across four participating processes

Figure 3 shows that all2all entails communication from each process to every other process. In other words, the number of messages exchanged as part of an all2all operation in an N-GPU cluster is O(N^2).

The messages exchanged between the GPUs are distinct and can’t be optimized using algorithms such as tree/ring (used for allreduce). When you run billion+ parameter models across 100s of GPUs, the number of messages can trigger congestion, create network hotspots, and adversely affect performance.
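NCCL itself is a C library, but the all2all pattern is easy to see from a framework that uses NCCL as its backend. The following is a hedged PyTorch sketch (one process per GPU, launched with torchrun; sizes are illustrative), in which each of the N ranks sends a distinct chunk to every other rank.

import torch
import torch.distributed as dist

dist.init_process_group(backend='nccl')     # NCCL performs the GPU collectives
rank = dist.get_rank()
world = dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Each rank prepares world distinct chunks, one per destination: N*N chunks overall
send = torch.full((world, 4), float(rank), device='cuda')
recv = torch.empty_like(send)
dist.all_to_all_single(recv, send)          # after this, row i of recv came from rank i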

As discussed earlier, PXN combines NVLink and PCI communications to reduce traffic flow through the second-tier spine switches and optimize network traffic. It also improves message rates by aggregating up to eight messages into one. Both significantly improve all2all performance.

all-reduce on 1:1 GPU:NIC topologies

Another problem that PXN solves is the case of topologies where only a single GPU is close to each NIC. The ring algorithm requires two GPUs to be close to each NIC: data must come in from the network to a first GPU, go around all GPUs through NVLink, and then exit from a last GPU onto the network. The first GPU must be able to receive from the network efficiently, and the last GPU must be able to send through the network efficiently. If only one GPU is close to a given NIC, the ring cannot be closed and data must be sent through the CPU, which can heavily affect performance.

With PXN, as long as the last GPU can access the first GPU through NVLink, it can move its data to the first GPU. The data is sent from there to the NIC, keeping all transfers local to PCI switches.

This case is not only relevant for PCI topologies featuring one GPU and one NIC per PCI switch; it can also happen on other topologies when an NCCL communicator only includes a subset of GPUs. Consider a node with eight GPUs interconnected with an NVLink hypercube mesh.

Figure 4. Network topology in an NVIDIA DGX-1 system, showing GPUs, NVLink connections, and PCIe switches

Figure 5 shows a ring that can be formed by leveraging the high-bandwidth NVLink connections available in the topology when the communicator includes all eight GPUs in the system. This is possible because both GPU0 and GPU1 share access to the same local NIC.

Figure 5. Example ring path used by NCCL, touching each GPU in the system exactly once using only NVLink connections

However, the communicator can include just a subset of the GPUs, for example GPUs 0, 2, 4, and 6. In that case, creating rings is impossible without crossing rails: a ring entering the node through GPU 0 would have to exit from GPU 2, 4, or 6, none of which has direct access to GPU 0's local NICs (NICs 0 and 1).

PXN, on the other hand, enables rings to be formed, because GPU 2 can move its data back to GPU 0 before it goes out through NIC 0 or 1.

This case is common with model parallelism, depending on how the model is split. If, for example, one model replica is split across GPUs 0-3 and another runs on GPUs 4-7, then GPUs 0 and 4 take care of the same part of the model, and an NCCL communicator is created with GPUs 0 and 4 of all nodes to perform all-reduce operations for the corresponding layers. Such communicators cannot perform all-reduce operations efficiently without PXN.

Until now, the only way to have efficient model parallelism was to split the model across GPUs 0, 2, 4, 6 and 1, 3, 5, 7, so that the NCCL subcommunicators would include GPUs [0,1], [2,3], [4,5], and [6,7] instead of [0,4], [1,5], [2,6], and [3,7]. The new PXN feature gives you more flexibility and eases the use of model parallelism.
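To make the communicator layout concrete, here is a hedged sketch (not taken from the NCCL release) of how such groups, for example GPUs [0,4] of every node, might be bootstrapped with MPI and ncclCommInitRank. It assumes one MPI rank per GPU, eight GPUs per node, and that each rank has already bound its CUDA device; the grouping formula and helper name are illustrative only.

#include <mpi.h>
#include <nccl.h>

// Build one NCCL subcommunicator per group of ranks that hold the same
// model shard (locally GPUs {0,4}, {1,5}, {2,6}, {3,7} with 8 GPUs per node).
ncclComm_t buildShardComm(MPI_Comm world, int gpusPerNode) {
  int rank;
  MPI_Comm_rank(world, &rank);

  // Illustrative grouping: with 8 GPUs per node and a model split across
  // four GPUs, rank % 4 puts GPU 0 and GPU 4 of every node in one group.
  int color = rank % (gpusPerNode / 2);
  MPI_Comm group;
  MPI_Comm_split(world, color, rank, &group);

  int grank, gsize;
  MPI_Comm_rank(group, &grank);
  MPI_Comm_size(group, &gsize);

  // The group leader creates the NCCL unique ID and shares it in the group.
  ncclUniqueId id;
  if (grank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, group);

  ncclComm_t comm;                              // caller is assumed to have
  ncclCommInitRank(&comm, gsize, id, grank);    // called cudaSetDevice already
  return comm;
}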

Figure 6. NCCL 2.12 PXN performance improvements, showing a more than 2x improvement with PXN

Figure 6 contrasts the time to complete all2all collective operations with and without PXN. In addition, PXN enables a more flexible choice of GPUs for all-reduce operations.

Summary

The NCCL 2.12 release significantly improves all2all communication collective performance. Download the latest NCCL release and experience the improved performance firsthand.

For more information, see the following resources:

Categories
Misc

TensorFlow 1.15 C++ API Documentation

Hello,

I train TensorFlow models using the Python version, but I need to run them using the C++ API. I had previously compiled libtensorflow.so version 1.11 and used the models successfully; however, when trying to use the same program with version 1.15, it fails with a segmentation fault. I assume the way a model loads changed between 1.11 and 1.15 in the C++ API, but I cannot for the life of me find documentation for the C++ API version 1.15, since the official site always redirects to version 2.8.

Could anyone please point me to the v1.15 documentation of the C++ API?

submitted by /u/quinseptopol

Categories
Misc

TinyML: Monitoring Air Quality on an 8-bit Microcontroller

I'd like to share my experiment on how to easily create your own tiny machine learning model and run inference on a microcontroller to detect the concentration of various gases. I will illustrate the whole process with my example of detecting the concentration of benzene (C6H6(GT)) based on the concentrations of other recorded compounds.

Things I used in this project: Arduino Mega 2560, Neuton Tiny ML software

To my mind, such simple solutions may help address the air pollution problem, which is now a serious concern. In fact, the World Health Organization estimates that over seven million people die prematurely each year from diseases caused by air pollution. Can you imagine that?

As such, more and more organizations responsible for monitoring emissions need effective tools at their disposal to monitor air quality in a timely way, and TinyML solutions seem to be the best technology for that. They are low-energy and cheap to produce, and they don't require a permanent Internet connection. I believe these factors will promote the mass adoption of TinyML as a great opportunity to create AI-based devices and successfully solve various challenges.

Therefore, in my experiment, I used the most primitive 8-bit MCU to show that even such a device can run ML models today.

Dataset description:

My dataset contained 5,875 rows of hourly averaged responses from an array of metal oxide chemical sensors located in the field in a polluted area in Italy, at road level. Hourly averaged concentrations for CO, non-methane hydrocarbons, benzene, total nitrogen oxides (NOx), and nitrogen dioxide (NO2) were provided.

It is a regression problem.

Target metric – MAE (Mean Absolute Error). Target – C6H6(GT).

Attribute Information:

RH – Relative humidity
AH – Absolute humidity
T – Temperature in °C
PT08.S3(NOx) – Tungsten oxide; hourly averaged sensor response (nominally NOx targeted)
PT08.S4(NO2) – Tungsten oxide; hourly averaged sensor response (nominally NO2 targeted)
PT08.S5(O3) – Indium oxide; hourly averaged sensor response (nominally O3 targeted)
PT08.S1(CO) – Tin oxide; hourly averaged sensor response (nominally CO targeted)
CO(GT) – True hourly averaged CO concentration in mg/m^3 (reference analyzer)
PT08.S2(NMHC) – Titania; hourly averaged sensor response (nominally NMHC targeted)

You can see more details and download the dataset here: https://archive.ics.uci.edu/ml/datasets/air+quality

Procedure:

Step 1: Model Training

The model was created and trained with a free tool, Neuton TinyML, because I needed a super compact model that would fit into a tiny microcontroller with 8-bit precision. I had tried to build such a model with TensorFlow before, but it was too large to run on an 8-bit device.

To train the model, I converted the dataset into a CSV file, uploaded it to the platform, and selected the target column the model should learn to predict.

https://preview.redd.it/yhbwvcy2qjk81.png?width=1899&format=png&auto=webp&s=49d568a9aea5bb64885c5e7a3d1170acb4209bc2

https://preview.redd.it/lwhmfey3qjk81.png?width=1901&format=png&auto=webp&s=04f7e50945f153e7f4db600f93015c3c8ec9fb48

The trained model turned out to be super compact, with only 38 coefficients and just 0.234 KB in size!

https://preview.redd.it/gkizygs5qjk81.png?width=1900&format=png&auto=webp&s=1a689e0fb926995fe223e7591ac64cb9c428abe7

Additionally, I created models with TensorFlow and TensorFlow Lite and measured their metrics on the same dataset. The comparison speaks louder than words. Also, as I said above, TF models still cannot run operations in 8 bits, but it was interesting for me to try it on such a primitive device anyway.

https://preview.redd.it/y7l0ibr8qjk81.png?width=1497&format=png&auto=webp&s=8580bedc3436d719bd2246249b1af4e9ba482e44

Step 2: Embedding into a Microcontroller

Upon completion of training, I downloaded the archive, which contained all the necessary files, including meta-information about the model in two formats (binary and HEX), a calculator, the Neuton library, and the implementation file.

https://preview.redd.it/ftuagzt9qjk81.png?width=1900&format=png&auto=webp&s=679c238c137f9e1418fb5c58ef7a72a1d5c20118

Since I couldn’t run the experiment in field conditions with real gases, I developed a simple protocol to stream data from a computer.
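The post doesn't describe the protocol itself, so the following is only a guess at one possible approach: a small Linux-side C++ program that opens the serial port and pushes one sample of nine sensor features to the board as raw bytes. The device path, baud rate, raw-float framing, and feature count are all assumptions for illustration.

#include <cstdio>
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

int main() {
  // Device path and baud rate are assumptions for illustration.
  const char* device = "/dev/ttyACM0";
  int fd = open(device, O_RDWR | O_NOCTTY);
  if (fd < 0) { std::perror("open"); return 1; }

  termios tio{};
  tcgetattr(fd, &tio);
  cfmakeraw(&tio);                 // raw 8N1, no line processing
  cfsetispeed(&tio, B115200);
  cfsetospeed(&tio, B115200);
  tcsetattr(fd, TCSANOW, &tio);

  // One sample: nine sensor features sent as raw floats. Placeholder
  // values here; a real run would read successive rows from the dataset.
  float sample[9] = {0.0f};
  write(fd, sample, sizeof(sample));

  close(fd);
  return 0;
}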

Step 3: Running Inference on the Microcontroller

I connected the microcontroller that performs the prediction to a computer via a serial port, so the signals were received in binary format.

The microcontroller was programmed to turn on the red LED if the benzene concentration exceeded the limit, and the green LED if the concentration was within permitted limits. Check out the videos below to see how it worked.

https://reddit.com/link/t3c29p/video/l1j4qk1ypjk81/player

In this case, the concentration of benzene is within reasonable bounds (<15 mg/m3).

https://reddit.com/link/t3c29p/video/wm4agr7zpjk81/player

In this case, the concentration of benzene exceeds the limits (>15 mg/m3).
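For completeness, here is a simplified Arduino-style sketch of the inference-and-LED loop described above. It is only a sketch under assumptions: the pin numbers, baud rate, nine-feature binary frame, and the predictBenzene() stub are all illustrative, and the real project would call the inference routine from the downloaded Neuton model archive in place of the stub. The 15 mg/m3 threshold comes from the videos above.

const int RED_LED = 8;               // assumed wiring
const int GREEN_LED = 9;             // assumed wiring
const int NUM_FEATURES = 9;          // matches the attribute list above
const float BENZENE_LIMIT = 15.0f;   // mg/m3 threshold shown in the videos

float features[NUM_FEATURES];

// Placeholder for the generated model: the real sketch would call the
// inference routine shipped in the Neuton archive instead.
float predictBenzene(const float* inputs) {
  (void)inputs;
  return 0.0f;
}

void setup() {
  Serial.begin(115200);
  pinMode(RED_LED, OUTPUT);
  pinMode(GREEN_LED, OUTPUT);
}

void loop() {
  // Wait for one full binary frame of features streamed from the computer.
  if (Serial.available() < (int)sizeof(features)) return;
  Serial.readBytes(reinterpret_cast<char*>(features), sizeof(features));

  float benzene = predictBenzene(features);
  digitalWrite(RED_LED,   benzene > BENZENE_LIMIT ? HIGH : LOW);
  digitalWrite(GREEN_LED, benzene > BENZENE_LIMIT ? LOW : HIGH);
}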

Conclusion

My example illustrates how anyone can easily use the TinyML approach to create compact but smart devices, even with 8-bit precision. I'm convinced that the low production costs and high efficiency of TinyML open up enormous opportunities for its worldwide adoption.

Because there is no need to involve technical specialists in a case like this, even non-data scientists can rapidly build super compact models and place smart AI-driven devices throughout an area to monitor air quality in real time. To my mind, it's really inspiring that such small solutions can help us improve the environmental situation on a global scale!

submitted by /u/literallair

Categories
Misc

[tf.js] Is there an equivalent to Keras’ Resizing layer?

I'm using tensorflow.js and I need a layer that can take in an image and output the image resized to a new resolution (bilinear filtering is fine). I can't find one in the tf.js API, so I'm not sure what I can use. I need to make sure the model can still be serialized to disk, so I think writing a custom layer class might be off the table.

Any help would be appreciated.

submitted by /u/SaltyKoopa

Categories
Misc

Tensorflow not working in Jupyter Notebook (Anaconda)

I did this in Linux Mint, an Ubuntu variant, and I did run pip install tensorflow and all of that, though when I try pip3 install tensorflow, I get a wall of red text.

Relevant code is:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Defining the model using the default linear activation function
model = keras.Sequential()
model.add(layers.Dense(14))
model.add(layers.Dense(4))
model.add(layers.Dense(1))

2022-02-27 11:34:48.204164: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-02-27 11:34:48.204436: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-02-27 11:34:48.204446: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2022-02-27 11:34:48.204464: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (term-IdeaPad-Flex): /proc/driver/nvidia/version does not exist
2022-02-27 11:34:48.204624: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA. To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-02-27 11:34:48.204886: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set

I have an all-AMD laptop, but I don't see why you would need NVIDIA when I see people using it in virtual machines. If you know of a way for me to fix this, or somewhere else I can upload it to and have it work, let me know.

submitted by /u/Term_Grecos