Misc

Choosing the Right Storage for Enterprise AI Workloads

Post author By
Post date July 21, 2022
No Comments on Choosing the Right Storage for Enterprise AI Workloads

Artificial intelligence (AI) is becoming pervasive in the enterprise. Speech recognition, recommenders, and fraud detection are just a few applications among hundreds being driven…

Artificial intelligence (AI) is becoming pervasive in the enterprise. Speech recognition, recommenders, and fraud detection are just a few applications among hundreds being driven by AI and deep learning (DL)

To support these AI applications, businesses look toward optimizing AI servers and performance networks. Unfortunately, storage infrastructure requirements are often overlooked in the development of enterprise AI. Yet for the successful adoption of AI, it is vital to consider a comprehensive storage deployment strategy that considers AI growth, future proofing, and interoperability.

This post highlights important factors that enterprises should consider when planning data storage infrastructure for AI applications to maximize business results. I discuss cloud compared to on-premise storage solutions as well as the need for higher-performance storage within GPU-enabled virtual machines (VMs).

Why AI storage decisions are needed for enterprise deployment

The popular phrase, “You can pay me now—or pay me later” implies that it’s best to think about the future when making current decisions. Too often, storage solutions for supporting an AI or DL app only meet the immediate needs of the app without full consideration of the future cost and flexibility.

Spending money today to future-proof your AI environment from a storage standpoint can be more cost-effective in the long run. Decision-makers must ask themselves:

Can my AI storage infrastructure adapt to a cloud or hybrid model?
Will choosing object, block, or file storage limit flexibility in future enterprise deployments?
Is it possible to use lower-cost storage tiers or a hybrid model for archiving, or for datasets that do not require expensive, fast storage?

The impact of enterprise storage decisions on AI deployment is not always obvious without a direct A/B comparison. Wrong decisions today can result in lower performance and the inability to efficiently scale-out business operations in the future.

Main considerations when planning AI storage infrastructure

Following are a variety of factors to consider when deploying and planning storage. Figure 1 shows an overview of data center, budget, interoperability, and storage type considerations.

Data center	Budget	Interoperability	Storage type
DPU	Existing vs. new	Cloud and data center	Object/Block/File
Network	All Flash/HDD/Hybrid	VM environments	Flash/HDD/Hybrid

Table 1. Storage considerations for IT when deploying AI solutions on GPU-accelerated AI applications

AI performance and the GPU

Before evaluating storage performance, consider that a key element of AI performance is having high-performance enterprise GPUs to accelerate training for machine-learning, DL, and inferencing apps.

Many data center servers do not have GPUs to accelerate AI apps, so it’s best to first look at GPU resources when looking at performance.

Large datasets do not always fit within GPU memory. This is important because GPUs deliver less performance when the complete data set does not fit within GPU memory. In such cases, data is swapped to and from GPU memory, thus impacting performance. Model training takes longer, and inference performance can be impacted.

Certain apps, such as fraud detection, may have extreme real-time requirements that are affected when GPU memory is waiting for data.

Storage considerations

Storage is always an important consideration. Existing storage solutions may not work well when deploying a new AI app.

It may be that you now require the speed of NVMe flash storage or direct GPU memory access for desired performance. However, you may not know what tomorrow’s storage expectations will be, as demands for AI data from storage increase over time. For certain applications, there is almost no such thing as too much storage performance, especially in the case of real-time use cases such as pre-transaction fraud detection.

There is no “one-size-fits-all” storage solution for AI-driven apps.

Performance is only one storage consideration. Another is scale-out ability. Training data is growing. Inferencing data is growing. Storage must be able to scale in both capacity and performance—and across multiple storage nodes in many cases. Simply put, a storage device that meets your needs today may not always scale for tomorrow’s challenges.

The bottom-line: as training and inference workloads grow, capacity and performance must also grow. IT should only consider scalable storage solutions with the performance to keep GPUs busy for the best AI performance.

Data center considerations

The data processing unit (DPU) is a recent addition to infrastructure technology that takes data center and AI storage to a completely new level.

Although not a storage product, the DPU redefines data center storage. It is designed to integrate storage, processing, and networks such that whole data centers act as a computer for enterprises.

It’s important to understand DPU functionality when planning and deploying storage as the DPU offloads storage services from data center processors and storage devices. For many storage products, a DPU interconnected data center enables a more efficient scale-out.

As an example, the NVIDIA BlueField DPU supports the following functionality:

NVMe over Fabrics (NVMe-oF)
GPUDirect Storage
Encryption
Elastic block storage
Erasure coding (for data integrity)
Decompression
Deduplication

Storage performance for remote storage access is as if the storage is directly attached to the AI server. The DPU helps to enable scalable software-defined storage, in addition to networking and cybersecurity acceleration.

Budget considerations

Cost remains a critical factor. While deploying the highest throughput and lowest latency storage is desirable, it is not always necessary depending on the AI app.

To extend your storage budget further, IT must understand the storage performance requirements of each AI app (bandwidth, IOPs, and latency).

For example, if an AI app has a large dataset but minimal performance requirements, traditional hard disk drives (HDD) may be sufficient while lowering storage costs substantially. This is especially true when the “hot” data of the dataset fits wholly within GPU memory.

Another cost-saving option is to use hybrid storage that uses flash as a cache to accelerate performance while lowering storage costs for infrequently accessed data residing on HDDs. There are hybrid flash/HDD storage products that perform nearly as well as all-flash, so exploring hybrid storage options can make a lot of sense for apps that don’t have extreme performance requirements.

Older, archived, and infrequently used data and datasets may still have future value, but are not cost-effective residing on expensive primary storage.

HDDs can still make a lot of financial sense, especially if data can be seamlessly accessed when needed. A two-tiered cloud and on-premises storage solution can also make financial sense depending on the size and frequency of access. There are many of these solutions on the market.

Interoperability factors

Evaluating cloud and data center interoperability from a storage perspective is important. Even within VM-driven data centers, there are interoperability factors to evaluate.

Cloud and data center considerations

Will the AI app run on-premises, in the cloud, or both? Even if the app can be run in either place, there is no guarantee that the performance of the app won’t change with location. For example, there may be performance problems if the class of storage used in the cloud differs from the storage class used on-premises. Storage class must be considered.

Assume that a job retraining a large recommender model completes within a required eight-hour window using data center GPU-enabled servers that use high-performance flash storage. Moving the same application to the cloud with equivalent GPU horsepower may cause training to complete in 24 hours, well outside the required eight-hour window. Why?

Some AI apps require a certain class of storage (fast flash, large storage cache, DMA storage access, storage class memory (SCM) read performance, and so on) that is not always available through cloud services.

The point is that certain AI applications will yield similar results regardless of data center or cloud storage choices. Other applications can be storage-sensitive.

Just because an app is containerized and orchestrated by Kubernetes in the cloud, it does not guarantee similar data center results. When viewed in this way, containers do not always provide cross–data center and cloud interoperability when performance is considered. For effective data center and cloud interoperability, ensure that storage choices in both domains yield good results.

VM considerations

Today, most data center servers do not have GPUs to accelerate AI and creative workloads. Tomorrow, the data center landscape may look quite different. Businesses are being forced to use AI to be competitive, whether conversational AI, fraud detection, recommender systems, video analytics, or a host of other use cases.

GPUs are common on workstations, but the acceleration provided by GPU workstations cannot easily be shared within an organization.

The paradigm shift that enterprises must prepare for is the sharing of server-based, GPU-enabled resources within VM environments. The availability of solutions such as NVIDIA AI Enterprise enables sharing GPU-enabled VMs with anyone in the enterprise.

Put simply, it is now possible for anyone in an enterprise to easily run power-hungry AI apps within a VM in the vSphere environment.

So what does this mean for VM storage? Storage for GPU-enabled VMs must address the shared performance requirement of both the AI apps and users of the shared VM. This implies higher storage performance for a given VM than would be required in an unshared environment.

It also means that physical storage allocated for such VMs will likely be more scalable in capacity and performance. Within a heavily shared VM, it can make sense to use dedicated all-flash storage-class memory (SCM) arrays connected to the GPU-enabled servers through RDMA over Converged Ethernet for the highest performance and scale-out.

Storage type

An in-depth discussion on the choice of object, block, or file storage for AI apps goes beyond the scope of this post. That said, I mention it here because it’s an important consideration but not always a straightforward decision.

Object storage

If a desired app requires object storage, for example, the required storage type is obvious. Some AI apps take advantage of object metadata while also benefiting from the infinite scale of a flat address space object storage architecture. AI analytics can take advantage of rich object metadata to enable precision data categorization and organization, making data more useful and easier to manage and understand.

Block storage

Although block storage is supported in the cloud, truly massive cloud datasets tend to be object-based. Block storage can yield higher performance for structured data and transactional applications.

Block storage lacks metadata information, which prevents the use of block storage for any app that is designed to provide benefit from metadata. Many traditional enterprise apps were built on a block storage foundation, but the advent of object storage in the cloud has caused many modern applications to be designed specifically for native cloud deployment using object storage.

File storage

When an AI app accesses data across common file protocols, the obvious storage choice will be file-based. For example, AI-driven image recognition and categorization engines may require access to file-based images.

Deployment options can vary from dedicated file servers to NAS heads built on top of an object or block storage architecture. NAS heads can export NFS or SMB file protocols for file access to an underlying block or object storage architecture. This can provide a high level of flexibility and future-proofing with block or object storage used as a common foundation for file storage access by AI and data center network clients.

Storage type decisions for AI must be based on a good understanding of what is needed today as well as a longer-term AI deployment strategy. Fully evaluate the pros and cons of each storage type. There is frequently no one-size-fits-all answer, and there will also be cases where all three storage types (object, block, and file) make sense.

Key takeaways on enterprise storage decision making

There is no single approach to addressing storage requirements for AI solutions. However, here are a few core principles by which wise AI storage decisions can be made:

Any storage choice for AI solutions may be pointless if training and inference are not GPU-accelerated.
Prepare for the possibility of needing IT resources and related storage that is well beyond current estimates.
Don’t assume that existing storage is “good enough” for new or expanded AI solutions. Storage with higher cost, performance, and scalability may actually be more effective and efficient, over time, compared to existing storage.
Always consider interoperability with the cloud as on-premises storage options may not be available with your cloud provider.
Strategic IT planning should consider the infrastructure and storage benefits of DPUs.

As you plan for AI in your enterprise, don’t put storage at the bottom of the list. The impact of storage on your AI success may be greater than you think. For more information about setting up your enterprise for success with AI storage, see the following resources

Misc

Shifting Into High Gear: Lunit, Maker of FDA-Cleared AI for Cancer Analysis, Goes Public in Seoul

Post author By
Post date July 21, 2022
No Comments on Shifting Into High Gear: Lunit, Maker of FDA-Cleared AI for Cancer Analysis, Goes Public in Seoul

South Korean startup Lunit, developer of two FDA-cleared AI models for healthcare, went public this week on the country’s Kosdaq stock market. The move marks the maturity of the Seoul-based company — which was founded in 2013 and has for years been part of the NVIDIA Inception program that nurtures cutting-edge startups. Lunit’s AI software Read article >

The post Shifting Into High Gear: Lunit, Maker of FDA-Cleared AI for Cancer Analysis, Goes Public in Seoul appeared first on NVIDIA Blog.

Misc

Researchers Use GPUs to Give Earbud Users a ‘Mute Button’ for Background Noise

Post author By
Post date July 21, 2022
No Comments on Researchers Use GPUs to Give Earbud Users a ‘Mute Button’ for Background Noise

Thanks to earbuds you can have calls anywhere while doing anything. The problem: those on the other end of the call hear it all, too, from your roommate’s vacuum cleaner to background conversations at the cafe you’re working from. Now, work by a trio of graduate students at the University of Washington who spent the Read article >

The post Researchers Use GPUs to Give Earbud Users a ‘Mute Button’ for Background Noise appeared first on NVIDIA Blog.

Misc

Get Battle Ready With New GeForce NOW Fortnite Reward

Post author By
Post date July 21, 2022
No Comments on Get Battle Ready With New GeForce NOW Fortnite Reward

<Incoming Transmission> Epic Games is bringing a new Fortnite reward to GeForce NOW, available to all members. Drop from the Battle Bus in Fortnite on GeForce NOW between today and Thursday, Aug. 4, to earn “The Dish-stroyer Pickaxe” in game for free. <Transmission continues> Members can earn this item by streaming Fortnite on GeForce NOW Read article >

The post Get Battle Ready With New GeForce NOW Fortnite Reward appeared first on NVIDIA Blog.

Misc

Edge Computing Is the Next Big Cybersecurity Challenge

Post author By
Post date July 20, 2022
No Comments on Edge Computing Is the Next Big Cybersecurity Challenge

F5 joined the NVIDIA GTC to discuss the real-time pre-processing of telemetry data generated from BlueField DPUs The acceleration of digital transformation within data centers and the associated application proliferation is exposing new attack surfaces to potential security threats. These new…

The acceleration of digital transformation within data centers and the associated application proliferation is exposing new attack surfaces to potential security threats. These new attacks typically bypass the well-established perimeter security controls such as traditional and web application firewalls, making detection and remediation of cybersecurity threats more challenging.

Defending against these threats is becoming more challenging due to modern applications not being built entirely within a single data center—whether physical, virtual, or in the cloud. Today’s applications often span multiple servers in public clouds, CDN networks, edge platforms, and as-a-service components for which the location is not even known. 

On top of this, each service or microservice may have multiple instances for scale-out purposes, straining the ability of traditional network security functions to isolate them from the outside world to protect them. 

Finally, the number of data sources and locations is large and growing both because of the distributed nature of modern applications and the effects of scale-out architecture. There is no longer a single gate in the data center, such as an ingress gateway or firewall, that can observe and secure all data traffic.

Diagram lists multiple attack surfaces, such as social media, work from home devices, hacker toolkits, partner access, social engineering, and weak passwords. All increase the potential for cyber threats — *Figure 1. Facing a world of increased cyberthreats and higher cybercrime costs*

The consequence of these changes is the much larger sheer volume of data that must be collected to provide a holistic view of the application and to detect advanced threats. The number of data sources that must be monitored and the diversity in terms of data types is also growing, making effective cybersecurity data collection extremely challenging.

Detection requires a large amount of contextual information that can be correlated in near real time to determine the advanced threat activity in progress.

F5 is researching techniques to augment well-established security measures for web, application, firewall, and fraud mitigation. Detecting such advanced threats, which require contextual analysis of several of these data points through large-scale telemetry and with near real-time analysis, requires machine learning (ML) and AI algorithms.

ML and AI are used to detect anomalous activity in and around applications, as well as cloud environments, to tackle the risks upfront. This is where the NVIDIA BlueField-2 data processing unit (DPU) real-time telemetry and NVIDIA GPU-powered Morpheus cybersecurity framework come into play.

NVIDIA Morpheus provides an open application framework that enables cybersecurity developers to create optimized AI pipelines for filtering, processing, and classifying large volumes of real-time data. Morpheus offers pretrained AI models that provide powerful tools to simplify workflows and help detect and mitigate security threats.

Cybersecurity poses unique requirements for AI/ML processing

From a solution perspective, a robust telemetry collection strategy is a must and the telemetry data must have specific requirements:

A secure—encrypted and authenticated—means of transmitting data to a centralized data collector.
The ability to ingest telemetry with support for all the commonly used data paradigms: 
- Asynchronously occurring security-relevant events
- Application logs
- Statistics and status-related metrics
- Entity-specific trace records
A well-defined vocabulary that can map the data collected from diverse data sources into a canonical consumable representation

Finally, all this must be done in a highly scalable way, agnostic to the source location, which may be from a data center, the edge, a CDN, a client device, or even out-of-band metadata, such as threat intelligence feeds.

NVIDIA Morpheus-optimized AI pipelines

With a unique history and expertise in building networking software capable of harnessing the benefits of hardware, F5 is one of the first to join the NVIDIA Morpheus Early Access program.

Morpheus is an open application framework that enables cybersecurity developers to create optimized AI pipelines for filtering, processing, and classifying large volumes of real-time data.

F5 is leveraging Morpheus, which couples BlueField DPUs with NVIDIA certified EGX servers, to provide a powerful solution to detect and eliminate security threats.

NVIDIA's cybersecurity framework is AI driven to provide a powerful solution to detect and eliminate security threats. — *Figure 2. NVIDIA AI-Driven cybersecurity framework*

Morpheus allows F5 to accelerate access to embedded analytics and provide security across the cloud and emerging edge from their Shape Enterprise Defense application. The joint solution brings a new level of security to data centers and enables dynamic protection, real-time telemetry, and an adaptive defense for detecting and remediating cybersecurity threats.

Learn more

For more information about how F5 accelerates cybersecurity protection through real-time, DPU-enhanced telemetry and AI-powered analytics using NVIDIA GPU-powered Morpheus, see the Redefining Cybersecurity at the Distributed Cloud Edge with AI and Real-time Telemetry GTC session.

Misc

Explore the RTX Platform within Game Engines at New ‘Level Up with NVIDIA’ Webinars

Post author By
Post date July 20, 2022
No Comments on Explore the RTX Platform within Game Engines at New ‘Level Up with NVIDIA’ Webinars

The new ‘Level Up with NVIDIA’ webinar series offers creators and developers the opportunity to learn more about the NVIDIA RTX platform, interact with NVIDIA experts, and ask…

The new ‘Level Up with NVIDIA’ webinar series offers creators and developers the opportunity to learn more about the NVIDIA RTX platform, interact with NVIDIA experts, and ask questions about game integrations.

Kicking off in early August, the series features one 60-minute webinar each month, with the first half dedicated to NVIDIA experts discussing the session’s topic and the remaining time dedicated to Q&A.

We’ll focus on the NVIDIA RTX platform within popular game engines, explore what NVIDIA technologies and SDKs are in Unreal Engine 5 and Unity, and how you can successfully leverage the latest tools in your games.

Join us for the first webinar in the series on August 10 at 10 AM, Pacific time, with NVIDIA experts Richard Cowgill and Zach Lo discussing RTX in Unreal Engine 5.

Learn about NVIDIA technologies integrated into Unreal Engine, get insights into available ray tracing technologies, and see how you can get the most out of NVIDIA technologies across all game engines.

Misc

Optimizing and Serving Models with NVIDIA TensorRT and NVIDIA Triton

Post author By
Post date July 20, 2022
No Comments on Optimizing and Serving Models with NVIDIA TensorRT and NVIDIA Triton

Imagine that you have trained your model with PyTorch, TensorFlow, or the framework of your choice, are satisfied with its accuracy, and are considering deploying it as a service….

Imagine that you have trained your model with PyTorch, TensorFlow, or the framework of your choice, are satisfied with its accuracy, and are considering deploying it as a service. There are two important objectives to consider: maximizing model performance and building the infrastructure needed to deploy it as a service. This post discusses both objectives.

You can squeeze better performance out of a model by accelerating it across three stack levels:

Hardware acceleration
Software acceleration
Algorithmic or network acceleration.

NVIDIA GPUs are the leading choice for hardware acceleration among deep learning practitioners, and their merit is widely discussed in the industry.

The conversation about GPU software acceleration typically revolves around libraries like cuDNN, NCCL, TensorRT, and other CUDA-X libraries.

Algorithmic or network acceleration revolves around the use of techniques like quantization and knowledge distillation that essentially make modifications to the network itself, applications of which are highly dependent on your models.

This need for acceleration is driven primarily by business concerns like reducing costs or improving the end-user experience by reducing latency and tactical considerations like deploying on models on edge devices having fewer compute resources.

Serving deep learning models

After the models are accelerated, the next step is to build a serving service to deploy your model, which comes with its own unique set of challenges. This is a nonexhaustive list:

Will the service work on different hardware platforms?
Will it handle other models that I have to deploy simultaneously?
Will the service be robust?
How do I reduce latency?
Models are trained with different frameworks and tech stacks; how do I cater to this?
How do I scale?

These are all valid questions and addressing each of them presents a challenge.

A model trained with TensorFlow, PyTorch, or any other framework can be optimized, quantized, and pruned with TensorRT and its framework integrations. The optimized model is then served with NVIDIA Triton. — *Figure 1. Optimizing and deploying DL models with TensorRT and NVIDIA Triton*

Solution overview

This post discusses using NVIDIA TensorRT, its framework integrations for PyTorch and TensorFlow, NVIDIA Triton Inference Server, and NVIDIA GPUs to accelerate and deploy your models.

NVIDIA TensorRT

NVIDIA TensorRT is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications.

With its framework integrations with PyTorch and TensorFlow, you can speed up inference up to 6x faster with just one line of code.

NVIDIA Triton Inference Server

NVIDIA Triton Inference Server is an open-source inference-serving software that provides a single standardized inference platform. It can support running inference on models from multiple frameworks on any GPU or CPU-based infrastructure in the data center, cloud, embedded devices, or virtualized environments.

For more information, see the following videos:

Workflow overview

Before we dive into the details, here’s the overall workflow. To follow along, see the following resources:

Figure 1 shows the steps that you must go through.

First, optimize the model using TensorRT CLI or the API. Second, build a model repository for NVIDIA Triton. Third, spin up the NVIDIA Triton server. Lastly, for inference, query the server through gRPC or HTTP. — *Figure 2. Overall workflow for optimizing a model with TensorRT and serving with NVIDIA Triton*

Before you start following along, be ready with your trained model.

Step 1: Optimize the models. You can do this with either TensorRT or its framework integrations. If you choose TensorRT, you can use the trtexec command line interface. For the framework integrations with TensorFlow or PyTorch, you can use the one-line API.
Step 2: Build a model repository. Spinning up an NVIDIA Triton Inference Server requires a model repository. This repository contains the models to serve, a configuration file that specifies the details, and any required metadata.
Step 3: Spin up the server.
Step 4: Finally, we provide simple and robust HTTP and gRPC APIs that you can use to query the server!

Throughout this post, use the Docker containers from NGC. You may need to create an account and get the API key to access these containers. Now, here are the details!

Accelerating models with TensorRT

TensorRT accelerates models through graph optimization and quantization. You can access these benefits in any of the following ways:

trtexec CLI tool
TensorRT Python/C++ API
Torch-TensorRT (integration with PyTorch)
TensorFlow-TensorRT (integration with TensorFlow)

TensorRT and its framework integrations provide CLI and API support. If you are using TensorFlow or PyTorch, you can choose between TensorRT and the respective framework integration to optimize your model. — *Figure 3. Optimize your model with TensorRT or its framework integrations*

While TensorRT natively enables greater customization in graph optimizations, the framework integration provides ease of use for developers new to the ecosystem. As choosing the route a user might adopt is subject to the specific needs of their network, we would like to lay out all the options. For more information, see Speeding Up Deep Learning Inference Using NVIDIA TensorRT (Updated).

For TensorRT, there are several ways to build a TensorRT engine. For this post, use the trtexec CLI tool. If you want a script to export a pretrained model to follow along, use the export_resnet_to_onnx.py example. For more information, see the TensorRT documentation.

docker run -it --gpus all -v /path/to/this/folder:/trt_optimize nvcr.io/nvidia/tensorrt:-py3

trtexec --onnx=resnet50.onnx 
        --saveEngine=resnet50.engine 
        --explicitBatch 
        --useCudaGraph

To use FP16, add --fp16 in the command. Before proceeding to the next step, you must know the names of your network’s input and output layers, which is required while defining the config for the NVIDIA Triton model repository. One easy way is to use polygraphy, which comes packaged with the TensorRT container.

polygraphy inspect model resnet50.engine --mode=basic

ForTorch-TensorRT, pull the NVIDIA PyTorch container, which has both TensorRT and Torch TensorRT installed. To follow along, use the sample. For more examples, visit the Torch-TensorRT GitHub repo.

#  is the yy:mm for the publishing tag for NVIDIA's Pytorch 
# container; eg. 21.12

docker run -it --gpus all -v /path/to/this/folder:/resnet50_eg nvcr.io/nvidia/pytorch:-py3

python torch_trt_resnet50.py

To expand on the specifics, you are essentially using Torch-TensorRT to compile your PyTorch model with TensorRT. Behind the scenes, your model gets converted to a TorchScript module, and then TensorRT-supported ops undergo optimizations. For more information, see the Torch-TensorRT documentation.

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True).eval().to("cuda")

# Compile with Torch TensorRT;
trt_model = torch_tensorrt.compile(model, 
    inputs= [torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions= { torch_tensorrt.dtype.float32} # Runs with FP32; can use FP16
)

# Save the model
torch.jit.save(trt_model, "model.pt")

For TensorFlow-TensorRT, the process is pretty much the same. First, pull the NVIDIA TensorFlow container, which comes with TensorRT and TensorFlow-TensorRT. We made a short script tf_trt_resnet50.py as an example. For more examples, see the TensorFlow TensorRT GitHub repo.

#  is the yy:mm for the publishing tag for the NVIDIA Tensorflow
# container; eg. 21.12

docker run -it --gpus all -v /path/to/this/folder:/resnet50_eg nvcr.io/nvidia/tensorflow:-tf2-py3

python tf_trt_resnet50.py

Again, you are essentially using TensorFlow-TensorRT to compile your TensorFlow model with TensorRT. Behind the scenes, your model gets segmented into subgraphs containing operations supported by TensorRT, which then undergo optimizations. For more information, see the TensorFlow-TensorRT documentation.

# Load model
model = ResNet50(weights='imagenet')
model.save('resnet50_saved_model') 

# Optimize with tftrt

converter = trt.TrtGraphConverterV2(input_saved_model_dir='resnet50_saved_model')
converter.convert()

# Save the model
converter.save(output_saved_model_dir='resnet50_saved_model_TFTRT_FP32')

Now that you have optimized your model with TensorRT, you can proceed to the next step, setting up NVIDIA Triton.

Setting up NVIDIA Triton Inference Server

NVIDIA Triton Inference Server is built to simplify the deployment of a model or a collection of models at scale in a production environment. To achieve ease of use and provide flexibility, using NVIDIA Triton revolves around building a model repository that houses the models, configuration files for deploying those models, and other necessary metadata.

Look at the simplest case. Figure 4 has four key points. The config.pbtxt file (a) is the previously mentioned configuration file that contains, well, configuration information for the model.

Setting up NVIDIA Triton involved two important steps, building the model repository and spinning up NVIDIA Triton. The model repository must be built with the model and a config file that describes the model’s metadata and other important details required by NVIDIA Triton. You can then use our prebuilt NVIDIA Triton Docker container and spin up the server. — *Figure 4. Setting up NVIDIA Triton workflow*

There are several key points to note in this configuration file:

Name: This field defines the model’s name and must be unique within the model repository.
Platform: (c)This field is used to define the type of the model: is it a TensorRT engine, PyTorch model, or something else.
Input and Output: (d)These fields are required as NVIDIA Triton needs metadata about the model. Essentially, it requires the names of your network’s input and output layers and the shape of said inputs and outputs. In the case of TorchScript, as the name of input and output layers are absent, use input__0. Datatype is set to FP32, and the input format is specified as (Channel, Height, Width) of 3, 224, 224.

There are minor differences between TensorRT, Torch-TensorRT, and TensorFlow-TensorRT workflows in this set, which boils down to specifying the platform and changing the name for the input and output layers. We made sample config files for all three (TensorRT, Torch-TensorRT, or TensorFlow-TensorRT). Lastly, you add the trained model (b).

Now that the model repository has been built, you spin up the server. For this, all you must do is pull the container and specify the location of your model repository. For more Information about scaling this solution with Kubernetes, see Deploying NVIDIA Triton at Scale with MIG and Kubernetes.

docker run --gpus=1 --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:-py3 tritonserver --model-repository=/models

With your server up and running, you can finally build a client to fulfill inference requests!

Setting up NVIDIA Triton Client

The final step in the pipeline is to query the NVIDIA Triton Inference Server. You can send inference requests to the server through an HTTP or a gRPC request. Before diving into the specifics, install the required dependencies and download a sample image.

pip install torchvision
pip install attrdict
pip install nvidia-pyindex
pip install tritonclient[all]

wget  -O img1.jpg "https://bit.ly/3phN2jy"

In this post, use Torchvision to transform a raw image into a format that would suit the ResNet-50 model. It isn’t necessarily needed for a client. We have a much more comprehensive image client and a plethora of varied clients premade for standard use cases available in the triton-inference-server/client GitHub repo. However, for this explanation, we are going over a much simpler and skinny client to demonstrate the core of the API.

Okay, now you are ready to look at an HTTP client (Figure 5). Download the client script:

Building the client is quite simple, which can be done with the API as described. — *Figure 5. Client workflow*

Building the client has the following steps. First, establish a connection between the NVIDIA Triton Inference Server and the client.

triton_client = httpclient.InferenceServerClient(url="localhost:8000")

Second, pass the image and specify the names of the input and output layers of the model. These names should be consistent with the specifications defined in the config file that you built while making the model repository.

test_input = httpclient.InferInput("input__0", transformed_img.shape, datatype="FP32")
test_input.set_data_from_numpy(transformed_img, binary_data=True)

test_output = httpclient.InferRequestedOutput("output__0", binary_data=True, class_count=1000)

Finally, send an inference request to the NVIDIA Triton Inference Server.

results = triton_client.infer(model_name="resnet50", inputs=[test_input], outputs=[test_output])

These code examples discuss the specifics of the Torch-TensorRT models. The only differences among different models (when building a client) would be the input and output layer names. We have built NVIDIA Triton clients with Python, C++, Go, Java, and JavaScript. For more examples, see the triton-inference-server/client GitHub repo.

Conclusion

This post covered an end-to-end pipeline for inference where you first optimized trained models to maximize inference performance using TensorRT, Torch-TensorRT, and TensorFlow-TensorRT. You then proceeded to model serving by setting up and querying an NVIDIA Triton Inference Server. All the software, including TensorRT, Torch-TensorRT, TensorFlow-TensorRT, and Triton discussed in this tutorial, are available today to download as a Docker container from NGC.

Misc

Dealing with Outliers Using Three Robust Linear Regression Models

Post author By
Post date July 20, 2022
No Comments on Dealing with Outliers Using Three Robust Linear Regression Models

Photo by Ricardo Gomez Angel on Unsplash Linear regression is one of the simplest machine learning models out there. It is often the starting point not only for learning about data science but also for building quick and…

Linear regression is one of the simplest machine learning models out there. It is often the starting point not only for learning about data science but also for building quick and simple minimum viable products (MVPs), which then serve as benchmarks for more complex algorithms.

In general, linear regression fits a line (in two dimensions) or a hyperplane (in three and more dimensions) that best describes the linear relationship between the features and the target value. The algorithm also assumes that the probability distributions of the features are well-behaved; for example, they follow the Gaussian distribution.

Outliers are values that are located far outside of the expected distribution. They cause the distributions of the features to be less well-behaved. As a consequence, the model can be skewed towards the outlier values, which, as I’ve already established, are far away from the central mass of observations. Naturally, this leads to the linear regression finding a worse and more biased fit with inferior predictive performance.

It is important to remember that the outliers can be found both in the features and the target variable, and all the scenarios can worsen the performance of the model.

There are many possible approaches to dealing with outliers: removing them from the observations, treating them (capping the extreme observations at a reasonable value, for example), or using algorithms that are well-suited for dealing with such values on their own. This post focuses on these robust methods.

Setup

I use fairly standard libraries: numpy, pandas, scikit-learn. All the models I work with here are imported from the linear_model module of scikit-learn.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import datasets
from sklearn.linear_model import (LinearRegression, HuberRegressor,
                              	RANSACRegressor, TheilSenRegressor)

Data

Given that the goal is to show how different robust algorithms deal with outliers, the first step is to create a tailor-made dataset to show clearly the differences in the behavior. To do so, use the functionalities available in scikit-learn.

Start with creating a dataset of 500 observations, with one informative feature. With only one feature and the target, plot the data, together with the models’ fits. Also, specify the noise (standard deviation applied to the output) and create a list containing the coefficient of the underlying linear model; that is, what the coefficient would be if the linear regression model was fit to the generated data. In this example, the value of the coefficient is 64.6. Extract those coefficients for all the models and use them to compare how well they fit the data.

Next, replace the first 25 observations (5% of the observations) with outliers, far outside of the mass of generated observations. Bear in mind that the coefficient stored earlier comes from the data without outliers. Including them makes a difference.

N_SAMPLES = 500
N_OUTLIERS = 25

X, y, coef = datasets.make_regression(
	n_samples=N_SAMPLES,
	n_features=1,
	n_informative=1,
	noise=20,
	coef=True,
	random_state=42
)
coef_list = [["original_coef", float(coef)]]

# add outliers          	 
np.random.seed(42)
X[:N_OUTLIERS] = 10 + 0.75 * np.random.normal(size=(N_OUTLIERS, 1))
y[:N_OUTLIERS] = -15 + 20 * np.random.normal(size=N_OUTLIERS)

plt.scatter(X, y);

Graph showing the generated data, together with the outliers, which are far away from the main bulk of the observations. — *Figure 1. The generated data and the outliers that have been manually added*

Linear regression

Start with the good old linear regression model, which is likely highly influenced by the presence of the outliers. Fit the model to the data using the following example:

lr = LinearRegression().fit(X, y)
coef_list.append(["linear_regression", lr.coef_[0]])

Then prepare an object to use for plotting the fits of the models. The plotline_X object is a 2D array containing evenly spaced values within the interval dictated by the generated data set. Use this object for getting the fitted values for the models. It must be a 2D array, given it is the expected input of the models in scikit-learn. Then create a fit_df DataFrame in which to store the fitted values, created by fitting the models to the evenly spaced values.

plotline_X = np.arange(X.min(), X.max()).reshape(-1, 1)

fit_df = pd.DataFrame(
	index = plotline_X.flatten(),
	data={"linear_regression": lr.predict(plotline_X)}
)

Having prepared the DataFrame, plot the fit of the linear regression model to the data with outliers.

fix, ax = plt.subplots()
fit_df.plot(ax=ax)
plt.scatter(X, y, c="k")
plt.title("Linear regression on data with outliers");

Figure 2 shows the significant impact that outliers have on the linear regression model.

Graph showing the impact of the outliers on the linear regression model. — *Figure 2. The fit of the linear regression model to the data with outliers*

The benchmark model has been obtained using linear regression. Now it is time to move toward robust regression algorithms.

Huber regression

Huber regression is an example of a robust regression algorithm that assigns less weight to observations identified as outliers. To do so, it uses the Huber loss in the optimization routine. Here’s a better look at what is actually happening in this model.

Huber regression minimizes the following loss function:

$minlimits_{omega,sigma}sumlimits_{i=1}^{n}(sigma+H_{epsilon}(frac{X_iomega-y_i}{sigma})sigma)+alpha|omega|2^2$

Where $sigma$ denotes the standard deviation, $X_i$ represents the set of features, $y_i$ is the regression’s target variable, $omega$ is a vector of the estimated coefficients and $alpha$ is the regularization parameter. The formula also indicates that outliers are treated differently from the regular observations according to the Huber loss:

$H_{epsilon}(z)=begin{cases} z^2, & text{if}|z|<epsilon \ 2epsilon|z|-epsilon^2, & text{otherwise}end{cases}$

The Huber loss identifies outliers by considering the residuals, denoted by $z$ . If the observation is considered to be regular (because the absolute value of the residual is smaller than some threshold $epsilon$ , then apply the squared loss function. Otherwise, the observation is considered to be an outlier and you apply the absolute loss. Having said that, Huber loss is basically a combination of the squared and absolute loss functions.

An inquisitive reader might notice that the first equation is similar to Ridge regression, that is, including the L2 regularization. The difference between Huber regression and Ridge regression lies in the treatment of outliers.

You might recognize this approach to loss functions from analyzing the differences between two of the popular regression evaluation metrics: mean squared error (MSE) and mean absolute error (MAE). Similar to what the Huber loss implies, I recommend using MAE when you are dealing with outliers, as it does not penalize those observations as heavily as the squared loss does.

Connected to the previous point is the fact that optimizing the squared loss results in an unbiased estimator around the mean, while the absolute difference leads to an unbiased estimator around the median. The median is much more robust to outliers than the mean, so expect this to provide a less biased estimate.

Use the default value of 1.35 for $epsilon$ , which determines the regression’s sensitivity to outliers. Huber (2004) shows that when the errors follow a normal distribution with $sigma$ = 1 and $epsilon$ = 1.35, an efficiency of 95% is achieved relative to the OLS regression.

For your own use cases, I recommend tuning the hyperparameters alpha and epsilon, using a method such as grid search.

Fit the Huber regression to the data using the following example:

huber = HuberRegressor().fit(X, y)
fit_df["huber_regression"] = huber.predict(plotline_X)
coef_list.append(["huber_regression", huber.coef_[0]])

Figure 3 presents the fitted model’s best fit line.

Graph showing the fit of the Huber regression model to the data with outliers. — *Figure 3. The fit of the Huber regression model to the data with outliers*

RANSAC regression

Random sample consensus (RANSAC) regression is a non-deterministic algorithm that tries to separate the training data into inliers (which may be subject to noise) and outliers. Then, it estimates the final model only using the inliers.

RANSAC is an iterative algorithm in which iteration consists of the following steps:

Select a random subset from the initial data set.
Fit a model to the selected random subset. By default, that model is a linear regression model; however, you can change it to other regression models.
Use the estimated model to calculate the residuals for all the data points in the initial data set. All observations with absolute residuals smaller than or equal to the selected threshold are considered inliers and create the so-called consensus set. By default, the threshold is defined as the median absolute deviation (MAD) of the target values.
The fitted model is saved as the best one if sufficiently many points have been classified as part of the consensus set. If the current estimated model has the same number of inliers as the current best one, it is only considered to be better if it has a better score.

The steps are performed iteratively either a maximum number of times or until a special stop criterion is met. Those criteria can be set using three dedicated hyperparameters. As I mentioned earlier, the final model is estimated using all inlier samples.

Fit the RANSAC regression model to the data.

ransac = RANSACRegressor(random_state=42).fit(X, y)
fit_df["ransac_regression"] = ransac.predict(plotline_X)
ransac_coef = ransac.estimator_.coef_
coef_list.append(["ransac_regression", ransac.estimator_.coef_[0]])

As you can see, the procedure for recovering the coefficient is a bit more complex, as it’s first necessary to access the final estimator of the model (the one trained using all the identified inliers) using estimator_. As it is a LinearRegression object, proceed to recover the coefficient as you did earlier. Then, plot the fit of the RANSAC regression (Figure 4).

Graph showing the fit of the RANSAC regression model to the data with outliers. — *Figure 4. The fit of the RANSAC regression model to the data with outliers*

With RANSAC regression, you can also inspect the observations that the model considered to be inliers and outliers. First, check how many outliers the model identified in total and then how many of those that were manually introduced overlap with the models’ decision. The first 25 observations of the training data are all the outliers that have been introduced.

inlier_mask = ransac.inlier_mask_
outlier_mask = ~inlier_mask
print(f"Total outliers: {sum(outlier_mask)}")
print(f"Outliers you added yourself: {sum(outlier_mask[:N_OUTLIERS])} / {N_OUTLIERS}")

Running the example prints the following summary:

Total outliers: 51
Outliers you added yourself: 25 / 25

Roughly 10% of data was identified as outliers and all the observations introduced were correctly classified as outliers. It’s then possible to quickly visualize the inliers compared to outliers to see the remaining 26 observations flagged as outliers.

plt.scatter(X[inlier_mask], y[inlier_mask], color="blue", label="Inliers")
plt.scatter(X[outlier_mask], y[outlier_mask], color="red", label="Outliers")
plt.title("RANSAC - outliers vs inliers");

Figure 5 shows that the observations located farthest from the hypothetical best-fit line of the original data are considered outliers.

Graph showing inliers compared to outliers as identified by the RANSAC algorithm — *Figure 5. Inliers compared to outliers as identified by the RANSAC algorithm*

Theil-Sen regression

The last of the robust regression algorithms available in scikit-learn is the Theil-Sen regression. It is a non-parametric regression method, which means that it makes no assumption about the underlying data distribution. In short, it involves fitting multiple regression models on subsets of the training data and then aggregating the coefficients at the last step.

Here’s how the algorithm works. First, it calculates the least square solutions (slopes and intercepts) on subsets of size p (hyperparameter n_subsamples) created from all the observations in the training set X. If you calculate the intercept (it is optional), then the following condition must be satisfied p >= n_features + 1. The final slope of the line (and possibly the intercept) is defined as the (spatial) median of all the least square solutions.

A possible downside of the algorithm is its computational complexity, as it can consider a total number of least square solutions equal to n_samples choose n_subsamples, where n_samples is the number of observations in X. Given that this number can quickly explode in size, there are a few things that can be done:

Use the algorithm only for small problems in terms of the number of samples and features. However, for obvious reasons, this might not always be feasible.
Tune the n_subsamples hyperparameter. A lower value leads to higher robustness to outliers at the cost of lower efficiency, while a higher value leads to lower robustness and higher efficiency.
Use the max_subpopulation hyperparameter. If the total value of n_samples choose n_subsamples is larger than max_subpopulation, the algorithm only considers a stochastic subpopulation of a given maximal size. Naturally, using only a random subset of all the possible combinations leads to the algorithm losing some of its mathematical properties.

Also, be aware that the estimator’s robustness decreases quickly with the dimensionality of the problem. To see how that works out in practice, estimate the Theil-Sen regression using the following example:

theilsen = TheilSenRegressor(random_state=42).fit(X, y)
fit_df["theilsen_regression"] = theilsen.predict(plotline_X)
coef_list.append(["theilsen_regression", theilsen.coef_[0]])

Graph showing the Theil-Sen regression results in a similar fit to the RANSAC model. — *Figure 6. The fit of the Theil-Sen regression model to the data with outliers*

Comparison of the models

So far, three robust regression algorithms have been fitted to the data containing outliers and the individual best fit lines have been identified. Now it is time for a comparison.

Start with the visual inspection of Figure 7. To show too many lines, the fit line of the original data is not printed. However, it is quite easy to imagine what it looks like, given the direction of the majority of the data points. Clearly, the RANSAC and Theil-Sen regressions have resulted in the most accurate best fit lines.

Graph showing a comparison of all the considered regression models. — *Figure 7. Comparison of all the considered regression models*

To be more precise, look at the estimated coefficients. Table 1 shows that the RANSAC regression results in the fit closest to the one of the original data. It is also interesting to see how big of an impact the 5% of outliers had on the regular linear regression’s fit.

	model	coefficient
0	`original_coef`	64.59
1	`linear_regression`	8.77
2	`huber_regression`	37.52
3	`ransac_regression`	62.85
4	`theilsen_regression`	59.49

Table 1. The comparison of the coefficients of the different models fitted to the data with outliers

You might ask which robust regression algorithm is the best? As is often the case, the answer is, “It depends.” Here are some guidelines that might help you find the right model for your specific problem:

In general, robust fitting in a high-dimensional setting is difficult.
In contrast to Theil-Sen and RANSAC, Huber regression is not trying to completely filter out the outliers. Instead, it lessens their effect on the fit.
Huber regression should be faster than RANSAC and Theil-Sen, as the latter ones fit on smaller subsets of the data.
Theil-Sen and RANSAC are unlikely to be as robust as the Huber regression using the default hyperparameters.
RANSAC is faster than Theil-Sen and it scales better with the number of samples.
RANSAC should deal better with large outliers in the y-direction, which is the most common scenario.

Taking all the preceding information into consideration, you might also empirically experiment with all three robust regression algorithms and see which one fits your data best.

You can find the code used in this post in my /erykml GitHub repo. I look forward to hearing from you in the comments.

Misc

Lucid Motors’ Mike Bell on Software-Defined Innovation for the Luxury EV Brand

Post author By
Post date July 20, 2022
No Comments on Lucid Motors’ Mike Bell on Software-Defined Innovation for the Luxury EV Brand

AI and electric vehicle technology breakthroughs are transforming the automotive industry. These developments pave the way for new innovators, attracting technical prowess and design philosophies from Silicon Valley. Mike Bell, senior vice president of digital at Lucid Motors, sees continuous innovation coupled with over-the-air updates as key to designing sustainable, award-winning intelligent vehicles that provide Read article >

The post Lucid Motors’ Mike Bell on Software-Defined Innovation for the Luxury EV Brand appeared first on NVIDIA Blog.

Offsites

Simplified Transfer Learning for Chest Radiography Model Development

Post author By
Post date July 19, 2022
No Comments on Simplified Transfer Learning for Chest Radiography Model Development

Posted by Akib Uddin, Product Manager and Andrew Sellergren, Software Engineer, Google Health

Every year, nearly a billion chest X-ray (CXR) images are taken globally to aid in the detection and management of health conditions ranging from collapsed lungs to infectious diseases. Generally, CXRs are cheaper and more accessible than other forms of medical imaging. However, existing challenges continue to impede the optimal use of CXRs. For example, in some areas, trained radiologists that can accurately interpret CXR images are in short supply. In addition, interpretation variability between experts, workflow differences between institutions, and the presence of rare conditions familiar only to subspecialists all contribute to making high-quality CXR interpretation a challenge.

Recent research has leveraged machine learning (ML) to explore potential solutions for some of these challenges. There is significant interest and effort devoted to building deep learning models that detect abnormalities in CXRs and improve access, accuracy, and efficiency to identify diseases and conditions that affect the heart and lungs. However, building robust CXR models requires large labeled training datasets, which can be prohibitively expensive and time-consuming to create. In some cases, such as working with underrepresented populations or studying rare medical conditions, only limited data are available. Additionally, CXR images vary in quality across populations, geographies, and institutions, making it difficult to build robust models that perform well globally.

In “Simplified Transfer Learning for Chest Radiography Models Using Less Data”, published in the journal Radiology, we describe how Google Health utilizes advanced ML methods to generate pre-trained “CXR networks” that can convert CXR images to embeddings (i.e., information-rich numerical vectors) to enable the development of CXR models using less data and fewer computational resources. We demonstrate that even with less data and compute, this approach has enabled performance comparable to state-of-the-art deep learning models across various prediction tasks. We are also excited to announce the release of CXR Foundation, a tool that utilizes our CXR-specific network to enable developers to create custom embeddings for their CXR images. We believe this work will help accelerate the development of CXR models, aiding in disease detection and contributing to more equitable health access throughout the world.

Developing a Chest X-ray Network
A common approach to building medical ML models is to pre-train a model on a generic task using non-medical datasets and then refine the model on a target medical task. This process of transfer learning may improve the target task performance or at least speed up convergence by applying the understanding of natural images to medical images. However, transfer learning may still require large labeled medical datasets for the refinement step.

Expanding on this standard approach, our system supports modeling CXR-specific tasks through a three-step model training setup composed of (1) generic image pre-training similar to traditional transfer learning, (2) CXR-specific pre-training, and (3) task-specific training. The first and third steps are common in ML: first pre-training on a large dataset and labels that are not specific to the desired task, and then fine-tuning on the task of interest.

We built a CXR-specific image classifier that employs supervised contrastive learning (SupCon). SupCon pulls together representations of images that have the same label (e.g., abnormal) and pushes apart representations of images that have a different label (e.g., one normal image and one abnormal image). We pre-trained this model on de-identified CXR datasets of over 800,000 images generated in partnership with Northwestern Medicine and Apollo Hospitals in the US and India, respectively. We then leveraged noisy abnormality labels from natural language processing of radiology reports to build our “CXR-specific” network.

This network creates embeddings (i.e., information-rich numerical vectors that can be used to distinguish classes from each other) that can more easily train models for specific medical prediction tasks, such as image finding (e.g., airspace opacity), clinical condition (e.g., tuberculosis), or patient outcome (e.g., hospitalization). For example, the CXR network can generate embeddings for every image in a given CXR dataset. For these images, the generated embeddings and the labels for the desired target task (such as tuberculosis) are used as examples to train a small ML model.

Left: Training a CXR model for a given task generally requires a large number of labeled images and a significant amount of computational resources to create a foundation of neural network layers. Right: With the CXR network and tool providing this foundation, each new task requires only a fraction of the labeled images, computational resources, and neural network parameters compared to rebuilding the entire network from scratch.

Effects of CXR Pre-training
We visualized these embedding layers at each step of the process using airspace opacity as an example (see the figure below). Before SupCon-based pre-training, there was poor separation of normal and abnormal CXR embeddings. After SupCon-based pre-training, the positive examples were grouped more closely together, and the negative examples more closely together as well, indicating that the model had identified that images from each category resembled themselves.

Visualizations of the t-distributed stochastic neighbor embedding for generic vs. CXR-specific network embeddings. Embeddings are information-rich numerical vectors that alone can distinguish classes from each other, in this case, airspace opacity positive vs. negative.

Our research suggests that adding the second stage of pre-training enables high-quality models to be trained with up to 600-fold less data in comparison to traditional transfer learning approaches that leverage pre-trained models on generic, non-medical datasets. We found this to be true regardless of model architecture (e.g., ResNet or EfficientNet) or dataset used for natural image pre-training (e.g., ImageNet or JFT-300M). With this approach, researchers and developers can significantly reduce dataset size requirements.

Top: In a deep learning model, the neural network contains multiple layers of artificial neurons, with the first layer taking the CXR image as input, intermediate layers doing additional computation, and the final layer making the classification (e.g., airspace opacity: present vs. absent). The embedding layer is usually one of the last layers. Bottom left: The traditional transfer learning approach involves a two-step training setup where a generic pre-trained network is optimized directly on a prediction task of interest. Our proposed three-step training setup generates a CXR network using a SupCon ML technique (step 2) before optimization for prediction tasks of interest (step 3). Bottom right: Using the embeddings involves either training smaller models (the first two strategies) or fine-tuning the whole network if there are sufficient data (strategy 3).

Results
After training the initial model, we measured performance using the area under the curve (AUC) metric with both linear and non-linear models applied to CXR embeddings; and a non-linear model produced by fine-tuning the entire network. On public datasets, such as ChestX-ray14 and CheXpert, our work substantially and consistently improved the data-accuracy tradeoff for models developed across a range of training dataset sizes and several findings. For example, when evaluating the tool’s ability to develop tuberculosis models, data efficiency gains were more striking: models trained on the embeddings of just 45 images achieved non-inferiority to radiologists in detecting tuberculosis on an external validation dataset. For both tuberculosis and severe COVID-19 outcomes, we show that non-linear classifiers trained on frozen embeddings outperformed a model that was fine-tuned on the entire dataset.

Comparing CXR-specific networks for transfer learning (red), with a baseline transfer learning approach (blue) across a variety of CXR abnormalities (top left), tuberculosis (bottom left), and COVID-19 outcomes (bottom right). This approach improves performance at the same dataset size, or reduces the dataset size required to reach the same performance. Interestingly, using the CXR network with simpler ML models that are faster to train (red) performs better than training the full network (black) at dataset sizes up to 8⁵ images.

Conclusion and Future Work
To accelerate CXR modeling efforts with low data and computational requirements, we are releasing our CXR Foundation tool, along with scripts to train linear and nonlinear classifiers. Via these embeddings, this tool will allow researchers to jump-start CXR modeling efforts using simpler transfer learning methods. This approach can be particularly useful for predictive modeling using small datasets, and for adapting CXR models when there are distribution shifts in patient populations (whether over time or across different institutions). We are excited to continue working with partners, such as Northwestern Medicine and Apollo Hospitals, to explore the impact of this technology further. By enabling researchers with limited data and compute to develop CXR models, we’re hoping more developers can solve the most impactful problems for their populations.

Acknowledgements
Key contributors to this project at Google include Christina Chen, Yun Liu, Dilip Krishnan, Zaid Nabulsi, Atilla Kiraly, Arnav Agharwal, Eric Wu, Yuanzhen Li, Aaron Maschinot, Aaron Sarna, Jenny Huang, Marilyn Zhang, Charles Lau, Neeral Beladia, Daniel Tse, Krish Eswaran, and Shravya Shetty. Significant contributions and input were also made by collaborators Sreenivasa Raju Kalidindi, Mozziyar Etemadi, Florencia Garcia-Vicente, and David Melnick. For the ChestX-ray14 dataset, we thank the NIH Clinical Center for making it publicly available. The authors would also like to acknowledge many members of the Google Health Radiology and labeling software teams. Sincere appreciation also goes to the radiologists who enabled this work with their image interpretation and annotation efforts throughout the study; Jonny Wong for coordinating the imaging annotation work; Craig Mermel and Akinori Mitani for providing feedback on the manuscript; Nicole Linton and Lauren Winer for feedback on the blogpost; and Tom Small for the animation.