MURAL: Multimodal, Multi-task Retrieval Across Languages

For many concepts, there is no direct one-to-one translation from one language to another, and even when there is, such translations often carry different associations and connotations that are easily lost for a non-native speaker. In such cases, however, the meaning may be more obvious when grounded in visual examples. Take, for instance, the word “wedding”. In English, one often associates a bride in a white dress and a groom in a tuxedo, but when translated into Hindi (शादी), a more appropriate association may be a bride wearing vibrant colors and a groom wearing a sherwani. What each person associates with the word may vary considerably, but if they are shown an image of the intended concept, the meaning becomes more clear.

The word “wedding” in English and Hindi conveys different mental images. Images are taken from wikipedia, credited to Psoni2402 (left) and David McCandless (right) with CC BY-SA 4.0 license.

With current advances in neural machine translation and image recognition, it is possible to reduce this sort of ambiguity in translation by presenting a text paired with a supporting image. Prior research has made much progress in learning image–text joint representations for high-resource languages, such as English. These representation models strive to encode the image and text into vectors in a shared embedding space, such that the image and the text describing it are close to each other in that space. For example, ALIGN and CLIP have shown that training a dual-encoder model (i.e., one trained with two separate encoders) on image–text pairs using a contrastive learning loss works remarkably well when provided with ample training data.

Unfortunately, such image–text pair data does not exist at the same scale for the majority of languages. In fact, more than 90% of this type of web data belongs to the top-10 highly-resourced languages, such as English and Chinese, with much less data for under-resourced languages. To overcome this issue, one could either try to manually collect image–text pair data for under-resourced languages, which would be prohibitively difficult due to the scale of the undertaking, or one could seek to leverage pre-existing datasets (e.g., translation pairs) that could inform the necessary learned representations for multiple languages.

In “MURAL: Multimodal, Multitask Representations Across Languages”, presented at Findings of EMNLP 2021, we describe a representation model for image–text matching that uses multitask learning applied to image–text pairs in combination with translation pairs covering 100+ languages. This technology could allow users to express words that may not have a direct translation into a target language using images instead. For example, the word “valiha” refers to a type of tube zither played by the Malagasy people, which lacks a direct translation into most languages, but could be easily described using images. Empirically, MURAL shows consistent improvements over state-of-the-art models, other benchmarks, and competitive baselines across the board. Moreover, MURAL does remarkably well for the majority of the under-resourced languages on which it was tested. Additionally, we discover interesting linguistic correlations learned by MURAL representations.

MURAL Architecture
The MURAL architecture is based on the structure of ALIGN, but employed in a multitask fashion. Whereas ALIGN uses a dual-encoder architecture to draw together representations of images and associated text descriptions, MURAL employs the dual-encoder structure for the same purpose while also extending it across languages by incorporating translation pairs. The dataset of image–text pairs is the same as that used for ALIGN, and the translation pairs are those used for LaBSE.

MURAL solves two contrastive learning tasks: 1) image–text matching and 2) text–text (bitext) matching, with both tasks sharing the text encoder module. The model learns associations between images and text from the image–text data, and learns the representations of hundreds of diverse languages from the translation pairs. The idea is that a shared encoder will transfer the image–text association learned from high-resource languages to under-resourced languages. We find that the best model employs an EfficientNet-B7 image encoder and a BERT-large text encoder, both trained from scratch. The learned representation can be used for downstream visual and vision-language tasks.

The architecture of MURAL depicts dual encoders with a shared text-encoder between the two tasks trained using a contrastive learning loss.
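To make the two-task setup concrete, the following is a minimal sketch of a multitask contrastive objective in plain Python. It assumes an in-batch softmax contrastive loss applied in both directions; the temperature, the task weights `w_i2t` and `w_t2t`, and all function names are illustrative assumptions, not values or code from the paper. The embeddings would come from the image encoder and the shared text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(sim):
    """Mean softmax cross-entropy where the correct column for row i is i."""
    total = 0.0
    for i, row in enumerate(sim):
        top = max(row)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(s - top) for s in row))
        total += log_z - row[i]
    return total / len(sim)


def contrastive_loss(a, b, temperature=0.1):
    """Bidirectional in-batch contrastive loss between paired embeddings a[i], b[i]."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    sim_ab = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    sim_ba = [list(col) for col in zip(*sim_ab)]  # transpose for the reverse direction
    return diagonal_cross_entropy(sim_ab) + diagonal_cross_entropy(sim_ba)


def multitask_loss(image_emb, caption_emb, src_emb, tgt_emb, w_i2t=1.0, w_t2t=1.0):
    """Weighted sum of the image-text and text-text (bitext) contrastive losses.

    caption_emb, src_emb and tgt_emb are assumed to come from the same text encoder.
    """
    return (w_i2t * contrastive_loss(image_emb, caption_emb)
            + w_t2t * contrastive_loss(src_emb, tgt_emb))
```

Because both terms are computed on outputs of the same text encoder, gradients from the bitext task and from the image–text task update the same text parameters, which is the mechanism the shared-encoder design relies on.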

Multilingual Image-to-Text and Text-to-Image Retrieval
To demonstrate MURAL’s capabilities, we choose the task of cross-modal retrieval (i.e., retrieving relevant images given a text and vice versa) and report the scores on various academic image–text datasets covering well-resourced languages, such as MS-COCO (and its Japanese variant, STAIR), Flickr30K (in English) and Multi30K (extended to German, French, Czech), XTD (test-only set with seven well-resourced languages: Italian, Spanish, Russian, Chinese, Polish, Turkish, and Korean). In addition to well-resourced languages, we also evaluate MURAL on the recently published Wikipedia Image–Text (WIT) dataset, which covers 108 languages, with a broad range of both well-resourced (English, French, Chinese, etc.) and under-resourced (Swahili, Hindi, etc.) languages.

MURAL consistently outperforms prior state-of-the-art models, including M3P, UC2, and ALIGN, in both zero-shot and fine-tuned settings evaluated on well-resourced and under-resourced languages. We see remarkable performance gains for under-resourced languages when compared to the state-of-the-art model, ALIGN.

Mean recall on various multilingual image–text retrieval benchmarks. Mean recall is a common metric used to evaluate cross-modal retrieval performance on image–text datasets (higher is better). It measures the Recall@N (i.e., the chance that the ground truth image appears in the first N retrieved images) averaged over six measurements: Image→Text and Text→Image retrieval for N=[1, 5, 10]. Note that XTD scores report Recall@10 for Text→Image retrieval.
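As a rough illustration of the metric, the sketch below computes Recall@N and mean recall from ranked retrieval results. It assumes, for simplicity, exactly one ground-truth item per query (real datasets such as MS-COCO have several captions per image); the function and argument names are hypothetical.

```python
def recall_at_n(ranked_ids, ground_truth, n):
    """Fraction of queries whose ground-truth item appears in the top n results."""
    hits = sum(1 for ranked, truth in zip(ranked_ids, ground_truth) if truth in ranked[:n])
    return hits / len(ground_truth)


def mean_recall(i2t_ranked, i2t_truth, t2i_ranked, t2i_truth, ns=(1, 5, 10)):
    """Average Recall@N over both retrieval directions and all cutoffs in ns."""
    scores = [recall_at_n(i2t_ranked, i2t_truth, n) for n in ns]
    scores += [recall_at_n(t2i_ranked, t2i_truth, n) for n in ns]
    return sum(scores) / len(scores)


# Toy example: two queries per direction, three candidates each.
ranked = [["a", "b", "c"], ["c", "a", "b"]]
truth = ["a", "b"]
score = mean_recall(ranked, truth, ranked, truth)
```

In this toy case Recall@1 is 0.5 and Recall@5 and Recall@10 are both 1.0 in each direction, so `score` averages six values.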

Retrieval Analysis
We also analyzed zero-shot retrieved examples on the WIT dataset comparing ALIGN and MURAL for English (en) and Hindi (hi). For under-resourced languages like Hindi, MURAL shows improved retrieval performance compared to ALIGN that reflects a better grasp of the text semantics.

Comparison of the top-5 images retrieved by ALIGN and by MURAL for the Text→Image retrieval task on the WIT dataset for the Hindi text, “एक तश्तरी पर बिना मसाले या सब्ज़ी के रखी हुई सादी स्पगॅत्ती”, which translates to the English, “A bowl containing plain noodles without any spices or vegetables”.

Even for Image→Text retrieval in a well-resourced language, like French, MURAL shows better understanding for some words. For example, MURAL returns better results for the query “cadran solaire” (“sundial”, in French) than ALIGN, which doesn’t retrieve any text describing sundials (below).

Comparison of the top-5 text results from ALIGN and from MURAL on the Image→Text retrieval task for the same image of a sundial.

Embeddings Visualization
Previously, researchers have shown that visualizing model embeddings can reveal interesting connections among languages — for instance, representations learned by a neural machine translation (NMT) model have been shown to form clusters based on their membership to a language family. We perform a similar visualization for a subset of languages belonging to the Germanic, Romance, Slavic, Uralic, Finnic, Celtic, and Finno-Ugric language families (widely spoken in Europe and Western Asia). We compare MURAL’s text embeddings with LaBSE’s, which is a text-only encoder.

A plot of LaBSE’s embeddings shows distinct clusters of languages influenced by language families. For instance, Romance languages (in purple, below) fall into a different region than Slavic languages (in brown, below). This finding is consistent with prior work that investigates intermediate representations learned by an NMT system.

Visualization of text representations of LaBSE for 35 languages. Languages are color coded based on their genealogical association. Representative languages include: Germanic (red) — German, English, Dutch; Uralic (orange) — Finnish, Estonian; Slavic (brown) — Polish, Russian; Romance (purple) — Italian, Portuguese, Spanish; Gaelic (blue) — Welsh, Irish.

In contrast to LaBSE’s visualization, MURAL’s embeddings, which are learned with a multimodal objective, show some clusters that are in line with areal linguistics (where elements are shared by languages or dialects in a geographic area) and contact linguistics (where languages or dialects interact and influence each other). Notably, in the MURAL embedding space, Romanian (ro) is closer to the Slavic languages like Bulgarian (bg) and Macedonian (mk), which is in line with the Balkan sprachbund, than it is in LaBSE. Another possible language contact brings Finnic languages, Estonian (et) and Finnish (fi), closer to the Slavic languages cluster. The fact that MURAL pivots on images as well as translations appears to add an additional view on language relatedness as learned in deep representations, beyond the language family clustering observed in a text-only setting.

Visualization of text representations of MURAL for 35 languages. Color coding is the same as the figure above.

Final Remarks
Our findings show that training jointly using translation pairs helps overcome the scarcity of image–text pairs for many under-resourced languages and improves cross-modal performance. Additionally, it is interesting to observe hints of areal linguistics and contact linguistics in the text representations learned by using a multimodal model. This warrants more probing into different connections learned implicitly by multimodal models, such as MURAL. Finally, we hope this work promotes further research in the multimodal, multilingual space where models learn representations of and connections between languages (expressed via images and text), beyond well-resourced languages.

This research is in collaboration with Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta, Chao Jia, and Jason Baldridge. We thank Zarana Parekh, Orhan Firat, Yuqing Chen, Apu Shah, Anosh Raj, Daphne Luong, and others who provided feedback for the project. We are also grateful for general support from Google Research teams.


Creating Smarter Spaces with NVIDIA Metropolis and Edge AI

Learn how AI-enabled video analytics is helping companies and employees work smarter and safer.

What do a factory floor, retail store, and major roadway have in common? They are a few examples of valuable and constrained infrastructure that need to be optimized. Manufacturers aim for early detection of defects in the assembly process. Retailers seek to better understand their customer journey and deliver more frictionless checkout experiences. Traffic planners look to reduce traffic gridlock.  

Over one billion cameras are deployed worldwide in nearly all of our important spaces, generating tremendous amounts of data. Without a system for analyzing this data, however, valuable insights are lost. Enter AI-powered computer vision, which unlocks the insights hidden in video so that cities and companies can improve their safety and operational efficiency.

Optimizing AI-enabled video analytics solutions streamlines tasks across industries, from healthcare to manufacturing, helping companies and their employees to work smarter and safer.   

NVIDIA Metropolis is an application framework, set of developer tools, and partner ecosystem that unites visual data and AI to enable greater functionality and efficiency across a range of physical spaces and environments. 

Transit hubs, retail stores, and factories use vision AI applications for more efficient, accessible, and safe operations. The following examples illustrate vision AI applications transforming how we use and manage our most critical spaces. 

Airports: With terminals serving and moving millions of passengers a year, airports are small cities, industrial sites, and transportation hubs. AI-enabled video analytics solutions identify and manage incidents in real time to minimize disruptions to passengers and airport operations. These solutions help airlines accelerate airplane turnarounds, deliver safer airport operations, and provide parking management to passengers. 

Factories: Companies are increasingly automating their manufacturing processes with IoT sensors, the most common of which are video cameras. These cameras capture vast amounts of data that, when combined with the power of AI, produce valuable insights that manufacturers can use to improve operational efficiency. Real-time understanding and responses are critical, such as identifying product defects on assembly lines, scanning for workplace hazards and signaling when machines require maintenance.

Farms: Farmers around the world are turning to vision AI applications to automate and improve their operations and yield quality. These applications help in a wide range of use cases, from counting cows to detecting weeds to the robotic pollination of tomatoes. These computer vision applications help farmers revolutionize food production by improving yield and using fewer resources.

Stadiums: Millions of people around the world visit stadiums to enjoy live sporting and cultural events. AI-enabled video analytics solutions are used to automate perimeter protection, weapons detection, crowd analytics, parking management, and suspicious behavior monitoring to provide a safer and more cohesive experience for visitors.

Hospitals: AI-enabled video analytics solutions help keep track of operating room procedures, ultimately improving patient care and surgical outcomes. By using accurate action logging, hospital staff can monitor surgical procedures, enforce disinfecting protocols, and check medical supply inventory levels in real time. AI-enabled video analytics reduces the need for human input on certain routine tasks, giving doctors and nurses more time with their patients.

Universities: AI vision helps university administrators better understand how physical spaces, like offices, gyms, and halls, are used. AI applications can also analyze real-time video footage and generate insights that inform better campus management, from detecting crowd flow patterns to creating immediate alerts for abnormal activities like fires, accidents, or water leakage.

A new generation of AI applications at the edge is driving incredible operational efficiency and safety gains across a broad range of spaces. Download a free e-book to learn how Metropolis and edge AI are helping build smarter and safer spaces around the world.


How to get reproducible results in tensorflow?

I’m working on a project based on a conda environment, by using:

  • tensorflow-gpu=2.4.0,
  • cudatoolkit=10.2.89,
  • cudnn=7.6.5.

I’d like to have reproducible results, so I tried with:

import os
import random
import numpy as np
from numpy.random import default_rng
import tensorflow as tf

random.seed(0)
rng = default_rng(0)
tf.random.set_seed(0)

And launching the python script from the terminal as:

PYTHONHASHSEED=0 python /path/to/ 

But my results are not reproducible.

Without posting my code (because it is long and includes many files), what other aspects should I consider in order to get reproducibility?

PS: the Artificial Neural Network is a CNN and is created by adding layers such as tf.keras.layers.Convolution2D(…)
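One thing worth checking in a setup like this, sketched below under stated assumptions: `default_rng(0)` only seeds the generator object it returns, so any code that draws from NumPy's global RNG stays unseeded unless `np.random.seed` is also called. For TensorFlow 2.1 through 2.7, the `TF_DETERMINISTIC_OPS=1` environment variable requests deterministic GPU kernels where they are available. The helper name `seed_everything` is hypothetical, and none of this guarantees bit-identical runs: some GPU ops, multi-threaded input pipelines, and unseeded shuffling can still introduce variation.

```python
import os
import random


def seed_everything(seed=0):
    """Seed common sources of randomness; returns the names of what was seeded."""
    # Request deterministic GPU kernels (TF 2.1-2.7). It is safest to set this
    # before TensorFlow is imported anywhere in the process.
    os.environ["TF_DETERMINISTIC_OPS"] = "1"
    random.seed(seed)
    seeded = ["random"]
    try:
        import numpy as np
        # Seeds NumPy's global RNG; default_rng(seed) only seeds its own generator.
        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
        seeded.append("tensorflow")
    except ImportError:
        pass
    return seeded
```

Calling `seed_everything(0)` at the very top of the entry script, before any model or dataset code runs, keeps the seeding in one place; `PYTHONHASHSEED` still has to be set on the command line, as in the launch command above, because it is read when the interpreter starts.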

submitted by /u/RainbowRedditForum


whats poppin my dudes

any suggestions on how I could avoid the ‘loading’ aspect of a model in a server that serves client requests to a web api endpoint? such that the model is permanently ‘loaded’ and only has to make predictions?

# to save compute time that is (duh)

beep bop
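One common pattern, sketched here with hypothetical names, is to load the model once per server process and reuse the cached object for every request. `load_model` below is a stand-in for whatever expensive loading call the real server uses, and `handle_request` stands in for the endpoint handler; neither is tied to a specific web framework.

```python
import functools
import time


def load_model():
    """Hypothetical stand-in for an expensive load (reading weights from disk, etc.)."""
    time.sleep(0.01)
    return lambda features: sum(features)  # toy "model" that sums its inputs


@functools.lru_cache(maxsize=1)
def get_model():
    """Load the model on first use, then return the same object on every later call."""
    return load_model()


def handle_request(features):
    """What an endpoint handler would do: reuse the cached model and predict."""
    model = get_model()
    return model(features)
```

Calling `get_model()` once at server startup moves the loading cost out of the first request. With multiple worker processes, each process holds its own copy.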

submitted by /u/doctor_slimm


NVIDIA BlueField DPU Ecosystem Expands as Partners Introduce Joint Solutions

Learn how these industry leaders have started to integrate their solutions using the DPU/DOCA architecture, as showcased by key partners at the recent NVIDIA GTC.

NVIDIA recently introduced the NVIDIA DOCA 1.2 software framework for NVIDIA BlueField DPUs, the world’s most advanced Data Processing Unit (DPU). This latest release builds on the momentum of the DOCA early access program to enable partners and customers to accelerate the development of applications and holistic zero trust solutions on the DPU.

NVIDIA is working with leading platform vendors and partners to integrate and expand DOCA support for commercial distributions on NVIDIA BlueField DPUs.

Red Hat – “Sensitive Information Detection using the NVIDIA Morpheus AI framework”
Red Hat and NVIDIA have been working together to bring the security analytics capabilities of the NVIDIA Morpheus AI application framework to the Red Hat infrastructure platforms for cybersecurity developers. This post provides a set of configuration instructions to Red Hat developers working on applications that use the NVIDIA Morpheus AI application framework and NVIDIA BlueField DPUs to secure interservice communication.  

Figure 1. Red Hat and NVIDIA High-level architecture

Juniper Networks – “Extending the Edge of the Network with Juniper Edge Services Platform (JESP)”
Earlier this year, Juniper discussed the value of extending the network all the way to the server through DPU, such as the NVIDIA BlueField DPU powered SmartNICs, and how these devices can be used to provide L2-L7 networking and security services. At NVIDIA GTC, Juniper provides a sneak preview of an internal project – Juniper Edge Services Platform (JESP), which enables the extension of the network all the way to the SmartNIC.  

Figure 2. Juniper Edge Services Platform (JESP)

F5 – “Redefining Cybersecurity at the Distributed Cloud Edge with AI and Real-time Telemetry”
Augmenting well-established security measures for web, application, firewall, and fraud mitigation techniques, F5 is researching techniques to detect such advanced threats, which require contextual analysis of several of these data points via large-scale telemetry, and with near real-time analysis. This is where NVIDIA BlueField-2 DPU-based real-time telemetry and NVIDIA GPU-powered Morpheus cybersecurity framework come into play.

Figure 3. F5 Advanced Threats Classification

Excelero – “Storage Horsepower for Critical Application Performance”
NVMesh technology is a low-latency, distributed storage software that is deployed across machines with very high-speed local drives (NVMe SSDs, to be exact), enabling high-speed compute and high data throughput that far exceeds anything achievable with other storage alternatives – and at a significantly lower cost. Network performance is also critical and this is why Excelero is working with NVIDIA and their BlueField DPU, plus NVIDIA DOCA software platform technology.

DDN – “DDN Supercharges AI Security with NVIDIA”
Along with NVIDIA, DDN is helping customers choose a data strategy that supports enterprise-scale AI workloads with a “Storage-as-a-Service” approach. This solution delivers cost-effective centralized infrastructure that meets the performance and scalability needs of complex AI applications and datasets.  

Early access to the DOCA software framework is available now.

To experience accelerated software-defined management services today, click here to register and download the BlueField DPU software package that includes DOCA runtime accelerated libraries for networking, security, and storage.

Additional Resources:
Web: DOCA Home Page
Web: BlueField DPU Home Page
DLI Course: Take the Introduction to NVIDIA DOCA for BlueField DPUs DLI Course
Whitepaper: DPU-Based Hardware Acceleration: A Software Perspective
NVIDIA Corporate Blog: NVIDIA Creates Zero-Trust Cybersecurity Platform
NVIDIA Developer Blog: NVIDIA Introduces BlueField DPU as a Platform for Zero Trust Security with DOCA 1.2


Creating Robust and Generalizable AI Models with NVIDIA FLARE

NVIDIA FLARE v2.0 is an open-source federated learning SDK that is making it easier for data scientists to collaborate to develop more generalizable robust AI models by just sharing model weights rather than private data.

Federated learning (FL) has become a reality for many real-world applications. It enables multinational collaborations on a global scale to build more robust and generalizable machine learning and AI models. For more information, see Federated learning for predicting clinical outcomes in patients with COVID-19.

NVIDIA FLARE v2.0 is an open-source FL SDK that is making it easier for data scientists to collaborate to develop more generalizable robust AI models by just sharing model weights rather than private data.

For healthcare applications, this is particularly beneficial where data is patient protected, data may be sparse for certain patient types and diseases, or data lacks diversity across instrument types, genders, and geographies.


NVIDIA FLARE stands for Federated Learning Application Runtime Environment. It is the engine underlying the NVIDIA Clara Train FL software, which has been used for AI applications in medical imaging, genetic analysis, oncology, and COVID-19 research. The SDK enables researchers and data scientists to adapt their existing machine learning and deep learning workflows to a distributed paradigm and enables platform developers to build a secure, privacy-preserving offering for distributed multiparty collaboration.

NVIDIA FLARE is a lightweight, flexible, and scalable distributed learning framework implemented in Python that is agnostic to your underlying training library. You can bring your own data science workflows implemented in PyTorch, TensorFlow, or even just NumPy, and apply them in a federated setting.

Maybe you’d like to implement the popular federated averaging (FedAvg) algorithm. Starting from an initial global model, each FL client trains the model on their local data for a certain amount of time and sends model updates to the server for aggregation. The server then uses the aggregated updates to update the global model for the next round of training. This process is iterated many times until the model converges.
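The aggregation step described above can be sketched in a few lines of plain Python. This is an illustrative, framework-free version, not NVIDIA FLARE's aggregator: it assumes each client reports its weights as a dictionary of flat float lists, and it weights each client by its number of local training examples, as in the original FedAvg formulation. The function and argument names are hypothetical.

```python
def fedavg(client_weights, client_sizes):
    """Weighted average of client weights.

    client_weights: list of dicts mapping parameter name -> list of floats.
    client_sizes:   number of local training examples for each client.
    """
    total = sum(client_sizes)
    aggregated = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        aggregated[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return aggregated


# Two clients; the second has three times as much data, so it counts three times as much.
new_global = fedavg(
    [{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}],
    [1, 3],
)
```

In each round, the server would send `new_global` back to the clients as the starting point for the next round of local training.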

NVIDIA FLARE provides customizable controller workflows to help you implement FedAvg and other FL algorithms, for example, cyclic weight transfer. It schedules different tasks, such as deep learning training, to be executed on the participating FL clients. The workflows enable you to gather the results, such as model updates, from each client and aggregate them to update the global model and send back the updated global models for continued training. Figure 1 shows the principle.

Each FL client acts as a worker requesting the next task to be executed, such as model training. After the controller provides the task, the worker executes it and returns the results to the controller. At each communication, there can be optional filters that process the task data or results, for example, homomorphic encryption and decryption or differential privacy.

Figure 1. NVIDIA FLARE workflow

Your task for implementing FedAvg could be a simple PyTorch program that trains a classification model for CIFAR-10. Your local trainer could look something like the following code example. For this post, I skip the full training loop for simplicity.

import torch
import torch.nn as nn
import torch.nn.functional as F

from nvflare.apis.dxo import DXO, DataKind, MetaKey, from_shareable
from nvflare.apis.executor import Executor
from nvflare.apis.fl_constant import ReturnCode
from nvflare.apis.fl_context import FLContext
from nvflare.apis.shareable import Shareable, make_reply
from nvflare.apis.signal import Signal
from nvflare.app_common.app_constant import AppConstants

class SimpleNetwork(nn.Module):
    def __init__(self):
        super(SimpleNetwork, self).__init__()

        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

class SimpleTrainer(Executor):
    def __init__(self, train_task_name: str = AppConstants.TASK_TRAIN):
        super().__init__()
        self._train_task_name = train_task_name
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.model = SimpleNetwork()
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=0.001, momentum=0.9)
        self.criterion = nn.CrossEntropyLoss()

    def execute(self, task_name: str, shareable: Shareable, fl_ctx: FLContext, abort_signal: Signal) -> Shareable:
        """This function is an extended function from the superclass.
        As a supervised learning-based trainer, the train function will run
        training based on model weights from `shareable`.
        After finishing training, a new `Shareable` object will be submitted
        to server for aggregation."""

        if task_name == self._train_task_name:
            epoch_len = 1

            # Get current global model weights
            dxo = from_shareable(shareable)

            # Ensure data kind is weights.
            if not dxo.data_kind == DataKind.WEIGHTS:
                self.log_exception(fl_ctx, f"data_kind expected WEIGHTS but got {dxo.data_kind} instead.")
                return make_reply(ReturnCode.EXECUTION_EXCEPTION)  # creates an empty Shareable with the return code

            # Convert weights to tensor and run training
            torch_weights = {k: torch.as_tensor(v) for k, v in dxo.data.items()}
            self.local_train(fl_ctx, torch_weights, epoch_len, abort_signal)

            # compute the differences between torch_weights and the now locally trained model
            model_diff = ...

            # build the shareable using a Data Exchange Object (DXO)
            dxo = DXO(data_kind=DataKind.WEIGHT_DIFF, data=model_diff)
            dxo.set_meta_prop(MetaKey.NUM_STEPS_CURRENT_ROUND, epoch_len)

            self.log_info(fl_ctx, "Local training finished. Returning shareable")
            return dxo.to_shareable()
        else:
            return make_reply(ReturnCode.TASK_UNKNOWN)

    def local_train(self, fl_ctx, weights, epoch_len, abort_signal):
        # Your training routine should respect the abort_signal.
        # Your local training loop ...
        for e in range(epoch_len):
            if abort_signal.triggered:
                # Stop training early when an abort is requested.
                return self._abort_execution()

    def _abort_execution(self, return_code=ReturnCode.ERROR) -> Shareable:
        return make_reply(return_code)

You can see that your task implementations could be doing many different tasks. You could compute summary statistics on each client and share with the server (keeping privacy constraints in mind), perform preprocessing of the local data, or evaluate already trained models.

During FL training, you can plot the performance of the global model at the beginning of each training round. For this example, we ran with eight clients on a heterogeneous data split of CIFAR-10. In the following plot (Figure 2), I show the different configurations that are available in NVIDIA FLARE 2.0 by default:

  • FedAvg
  • FedProx
  • FedOpt
  • FedAvg with secure aggregation using homomorphic encryption (FedAvg HE)
Figure 2. Validation accuracy of the global models for different FL algorithms during training

While FedAvg, FedAvg HE, and FedProx perform comparably for this task, you can observe an improved convergence using the FedOpt setting that uses SGD with momentum to update the global model on the server.
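The server-side update behind a FedOpt-style setting can be illustrated as follows. This is a simplified sketch, not NVIDIA FLARE's implementation: it assumes the server treats the averaged client weight difference as an update direction and applies SGD with momentum to it. The class name, the default hyperparameters, and the flat-list weight format are illustrative assumptions.

```python
class ServerMomentumSGD:
    """Apply averaged client weight differences to the global model with momentum."""

    def __init__(self, lr=1.0, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = None  # created lazily to match the weight shape

    def step(self, global_weights, avg_diff):
        """Return updated global weights given the averaged (local - global) difference."""
        if self.velocity is None:
            self.velocity = [0.0] * len(global_weights)
        # Accumulate an exponentially decaying sum of past update directions.
        self.velocity = [self.momentum * v + d for v, d in zip(self.velocity, avg_diff)]
        return [w + self.lr * v for w, v in zip(global_weights, self.velocity)]


# Two rounds with the same averaged difference: the second step is larger.
opt = ServerMomentumSGD(lr=1.0, momentum=0.5)
weights = opt.step([0.0], [1.0])      # velocity 1.0 -> weights [1.0]
weights = opt.step(weights, [1.0])    # velocity 1.5 -> weights [2.5]
```

With `lr=1.0` and `momentum=0.0`, the update reduces to adding the averaged difference directly, which is the plain FedAvg update.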

The whole FL system can be controlled using the admin API to automatically start and operate differently configured tasks and workflows. NVIDIA also provides a comprehensive provisioning system that enables the easy and secure deployment of FL applications in the real world, as well as proof-of-concept studies that run local FL simulations.

Figure 3. NVIDIA FLARE Provision, start, operate (PSO) components, and their APIs

Get started

NVIDIA FLARE makes FL accessible to a wider range of applications. Potential use cases include helping energy companies analyze seismic and wellbore data, manufacturers optimize factory operations, and financial firms improve fraud detection models.

For more information and step-by-step examples, see NVIDIA/NVFlare on GitHub.


NVIDIA Announces Upcoming Events for Financial Community

SANTA CLARA, Calif., Nov. 29, 2021 (GLOBE NEWSWIRE) -- NVIDIA will present at the following events for the financial community: Deutsche Bank’s Virtual AutoTech Conference, Thursday, Dec. 9, at …


AWS Launches First NVIDIA GPU-Accelerated Graviton-Based Instance with Amazon EC2 G5g

The new Amazon EC2 G5g instances feature the AWS Graviton2 processors and NVIDIA T4G Tensor Core GPUs, to power rich Android game streaming for mobile devices.

Today at AWS re:Invent 2021, AWS announced the general availability of Amazon EC2 G5g instances—bringing the first NVIDIA GPU-accelerated Arm-based instance to the AWS cloud. The new EC2 G5g instance features AWS Graviton2 processors, based on the 64-bit Arm Neoverse cores, and NVIDIA T4G Tensor Core GPUs, enhanced for graphics-intensive applications. 

This powerful combination creates an optimal development environment for Android game content. It also brings a richer Android gaming experience to be streamed to a diverse set of mobile devices anywhere. 

Unlocking enhanced Android game streaming for mobile devices

EC2 G5g instances enable game developers to support and optimize games for high-quality streaming on a wide range of mobile devices. You can develop Android games natively on Arm-based Graviton2 processors, accelerate graphics rendering and encoding with NVIDIA T4G GPUs, and stream games to mobile devices, eliminating the need for emulation software and cross-compilation.

This brings together breakthrough graphics performance powered by NVIDIA RTX technology, the price performance of AWS Graviton2 processors, and the elastic scaling of the AWS cloud for Android-in-the-Cloud gaming services.

A number of customers are already building cloud game development and gaming platforms on AWS, and are up and running on the new G5g instance.


Initially a simple, fast, developer’s favorite Android emulator, Genymotion has evolved into a full-fledged Android platform, available across multiple channels both in the cloud and on your desktop. NVIDIA has worked closely with Genymobile to accelerate its platform on the G5g instances, improving the performance and density of its solution in the cloud.

offers a mobile cloud gaming platform that enables game developers to publish games directly to the cloud. By leveraging the power of the new G5g instances, enables gamers to access and stream high-performance games on mobile devices anywhere without lag or compromising on gaming experience.


The company has launched its Anbox Cloud Appliance, a small-scale version of Canonical’s Anbox Cloud, built for rapid prototyping of Android-in-the-Cloud solutions on the new G5g instance. Additionally, AWS Marketplace makes Anbox Cloud readily available with access to a more extensive set of instance types, including support for Arm CPUs and NVIDIA GPUs. Developers can upload their Android apps, configure and virtualize Android devices, and stream graphical output in real time to any web or mobile client. This development environment allows you to unleash your creativity to invent new user experiences.

Accelerating Arm-based HPC and AI 

In addition to being a great gaming and game development platform, AWS’ new G5g instance also brings the NVIDIA Arm HPC SDK to cloud computing. With support for the NVIDIA T4G GPU and the Arm-based Graviton CPU, the NVIDIA Arm HPC SDK provides the tools you need to build NVIDIA GPU-accelerated HPC applications in the cloud.

EC2 G5g instances can also be used to build and deploy high-performance, cost-effective AI-powered applications at scale. Developers can use the NVIDIA Deep Learning Amazon Machine Image on AWS Marketplace. This comes preconfigured with all the necessary NVIDIA drivers, libraries, and dependencies to run Arm-enabled software from the NVIDIA NGC catalog.

Learn more about the G5g instances and get started


Validating AI Models Collaboratively with NVIDIA Clara Imaging and

NVIDIA Clara medical AI models can now run natively on in the cloud, enabling collaborative model validation and rapid annotation projects using modern web browsers.

Medical imaging AI models built with NVIDIA Clara can now run natively on in the cloud, which enables collaborative model validation and rapid annotation projects using modern web browsers. These NVIDIA Clara models are free to use in any project for collaborative research, such as for organ or tumor segmentation. 

AI solutions have been shown to help streamline radiology and enterprise imaging workflows. However, the process to create, share, test, and scale computer vision models is not as streamlined for all modalities, conditions, and findings. Several critical components are needed to create robust models and support the most diverse acquisition devices and patient populations. These critical components can include the ability to create ground truth for unannotated imaging studies and the ability to collaborate worldwide to assess the use of models with validation data. A real-time collaborative annotation platform and the NVIDIA Clara deep learning training framework are helping to create more robust model building and collaboration.

In this post, we walk through the basics of the Clara Train MMAR and the steps necessary to prepare it for use with the platform. In just a few steps, you can deploy any of these pretrained models on the platform for seamless web-based evaluation and collaboration. After they're deployed, these models can be used in any existing or new projects.

A diagram of a typical AI model training and validation pipeline from unlabeled data to annotation, training, and validation.
Figure 1. Workflow needed to train and validate an AI model.

NVIDIA Clara Train

The Clara Train training framework is an application package built on the Python-based NVIDIA Clara Train SDK. This framework is designed to enable rapid implementation of deep learning solutions in medical imaging based on optimized, ready-to-use, pretrained medical imaging models built in-house by NVIDIA researchers.

The Clara Train framework uses a standard structure for models, the Medical Model Archive (MMAR), which contains the pretrained model as well as scripts that define end-to-end development workflows for training, fine-tuning, validation, and inference.

Components of Clara Train include pretrained models, AI-assisted annotation, training pipelines and deployment pipelines. MONAI components include data loaders and transforms, network architectures, and training and evaluation engines.
Figure 2. Clara Train SDK and MONAI Deep Learning Framework high-level architecture.

The Clara Train v4.0+ SDK uses a component-based architecture built on the open source, PyTorch-based framework MONAI (Medical Open Network for AI). MONAI provides domain-optimized foundational capabilities in healthcare imaging that can be used to build training workflows in a native PyTorch paradigm. The Clara Train SDK uses these foundational components, such as optimized data loaders, transforms, loss functions, optimizers, and metrics, to implement end-to-end training workflows packaged as MMARs.

The annotation platform is web-based and cloud-native, and enables real-time collaboration among teams of clinicians and researchers through shared workspaces. You can also load multiple deep learning models for real-time evaluation.

The platform provides an easy and seamless interface for dataset construction and AI project creation. It gives users a wide suite of tools for annotating data and building machine-learning algorithms to accelerate the application of AI in medicine, with a particular focus on medical imaging. 

User interface of displaying outputs from a brain segmentation model deployed on the platform.
Figure 3. user interface showing a brain segmentation.

Coupling this capability with the ability to quickly deploy Clara Train model MMARs on the platform gives you an end-to-end workflow that spans rapid model development, model training, fine-tuning, inference, and rapid evaluation and visualization. This end-to-end capability streamlines the process of taking a model from research and development to production.

Solution overview

The starting point for Clara Train is the NGC Clara Train Collection. Here, you find the Clara Train SDK container, a collection of freely available, pretrained models, and a collection of Jupyter notebooks that walk through the main concepts of the SDK. All the Clara Train models share the MMAR format mentioned earlier.

The Clara Train MMAR defines a standard structure for storing the files required for defining the model development workflow, as well as the files produced when executing the model for validation and inference. This structure is defined as follows:

    commands
        train with single GPU
        train with 2 GPUs
        transfer learning with CKPT
        inference with TS model
        validate with TS model
        validate with CKPT
        validate with TS model on 2 GPUs
        validate with CKPT on 2 GPUs
        export CKPT to TS model
        all evaluation outputs: segmentation / classification results
        metrics reports, etc.

All pretrained models provided for use with Clara Train, as well as custom models developed with the Clara Train framework, use this structure. To prepare an MMAR for use with the platform, we assume a pretrained model and focus on a couple of key components for deployment.

The first component is the environment.json file that defines the common parameters for the model, including dataset paths and model checkpoints. For example, the environment.json file from the Clara Train spleen segmentation task defines the following parameters:

    {
        "DATA_ROOT": "/workspace/data/Task09_Spleen_nii",
        "DATASET_JSON": "/workspace/data/Task09_Spleen_nii/dataset_0.json",
        "PROCESSING_TASK": "segmentation",
        "MMAR_CKPT_DIR": "models",
        "MMAR_CKPT": "models/",
        "MMAR_TORCHSCRIPT": "models/model.ts"
    }

When preparing the model for integration with the platform, make sure that the MMAR contains the trained MMAR_CKPT and MMAR_TORCHSCRIPT in the MMAR's models/ directory. These are generated by executing the bundled training and export scripts, respectively.

  • The training script executes model training, which requires DATA_ROOT and DATASET_JSON for the input dataset and generates the MMAR_CKPT.
  • The export script serializes this checkpoint into the MMAR_TORCHSCRIPT used for inference.
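
Before deployment, it can help to confirm that both artifacts are actually present. The following is a minimal sketch, assuming that environment.json lives under the MMAR's config/ directory (as the inference script in this post suggests) and that MMAR_CKPT and MMAR_TORCHSCRIPT are paths relative to the MMAR root. The function name missing_mmar_artifacts is hypothetical and is not part of the Clara Train SDK.

```python
import json
from pathlib import Path


def missing_mmar_artifacts(mmar_root):
    """Return the environment.json keys whose files are absent from the MMAR.

    Assumption: MMAR_CKPT and MMAR_TORCHSCRIPT are relative to the MMAR root.
    """
    root = Path(mmar_root)
    env = json.loads((root / "config" / "environment.json").read_text())
    missing = []
    for key in ("MMAR_CKPT", "MMAR_TORCHSCRIPT"):
        value = env.get(key)
        # A missing key, or a path that is not an existing file, counts as missing.
        if not value or not (root / value).is_file():
            missing.append(key)
    return missing
```

An empty list means both the checkpoint and the TorchScript file were found; any returned key points to an artifact that still has to be generated.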

With a pretrained model, both the checkpoint and TorchScript are provided, and you can focus on the inference pipeline. Inference is executed using the MMAR's inference script:

    #!/usr/bin/env bash
    my_dir="$(dirname "$0")"
    # Source the environment setup from the script's own directory
    . $my_dir/
    echo "MMAR_ROOT set to $MMAR_ROOT"

    CONFIG_FILE=config/config_validation.json
    ENVIRONMENT_FILE=config/environment.json
    # Values listed after --set override the configuration files
    python3 -u -m medl.apps.evaluate \
        -m $MMAR_ROOT \
        -c $CONFIG_FILE \
        -e $ENVIRONMENT_FILE \
        --set \
        DATASET_JSON=$MMAR_ROOT/config/dataset_0.json \
        output_infer_result=true \
        do_validation=false

This script runs inference on the validation subset, defined in config_validation.json, of the full dataset defined in environment.json. If reference test data is provided along with the MMAR, the paths to this data must be defined. When you integrate the MMAR, the platform handles the dataset directly, and these values are overridden as part of the integration.
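
The way the trailing --set arguments override configured values can be made explicit by assembling the same command programmatically. This is only a sketch: the helper name build_evaluate_command and the /workspace/mmar location are hypothetical, the flags are copied from the script above, and nothing here is executed against the Clara Train SDK.

```python
def build_evaluate_command(mmar_root, overrides):
    """Assemble the argument list used by the inference script.

    overrides maps names to values and is rendered as KEY=VALUE pairs after --set.
    """
    command = [
        "python3", "-u", "-m", "medl.apps.evaluate",
        "-m", mmar_root,
        "-c", "config/config_validation.json",
        "-e", "config/environment.json",
        "--set",
    ]
    command += [f"{key}={value}" for key, value in overrides.items()]
    return command


cmd = build_evaluate_command(
    "/workspace/mmar",  # hypothetical MMAR location
    {
        "DATASET_JSON": "/workspace/mmar/config/dataset_0.json",
        "output_infer_result": "true",
        "do_validation": "false",
    },
)
print(" ".join(cmd))
```

Keeping the overrides in a dictionary makes it easy to see which values replace the defaults from environment.json, such as the dataset path that is overridden during integration.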

To deploy your own pretrained AI models on the platform for inference, you must already have an existing project or create a new one. The project also must contain the dataset on which to test your model. For more information, see Set Up Project.

Next, to deploy your AI model, the inference code must be transformed into a specific format that is compatible with the platform. The following files are the bare minimum for a successful deployment:

  • config.yaml
  • requirements.txt 
  • model-weights

For more information about these files, see Interface Code.

For NVIDIA Clara models, we have further streamlined this for you, and there is no need to write these files from scratch. We provide skeleton code for each category of deep learning model supported by the NGC catalog: classification, segmentation, and so on. You can download the model-specific skeleton code, make a few adjustments that are outlined later in this post, and then upload the models to the platform for inference.

Inference steps

After you have an MMAR prepared, here's how to use it directly for running the model on the platform. This post walks you through an example segmentation model that's already deployed on the platform: the skeleton code for running segmentation models, which is actually the code for a CT spleen segmentation model from NVIDIA.

Now, to deploy the liver and tumor segmentation model using the same MMAR format, follow these steps:

  1. Download the skeleton code for segmentation models.
  2. Download the MMAR for the liver and tumor segmentation model from the NGC catalog.
  3. In the downloaded skeleton code, replace the /workspace/clara_pt_spleen_ct_segmentation_1 folder with your downloaded MMAR folder.
  4. In the /workspace/config_mdai.json file, make the following changes:

    {
        "type": "segmentation",
        "root_folder": "clara_pt_spleen_ct_segmentation_1",
        "out_classes": 2,
        "data_list_key": "test"
    }

     • root_folder: Replace this key value with the name of your downloaded MMAR folder, such as clara_pt_liver_and_tumor_ct_segmentation_1 for the liver and tumor example.
     • out_classes: Replace this value with the number of output classes for your model, such as 3 in this case (background: 0, liver: 1, and tumor: 2).
     • data_list_key: Replace this value with the key name given in the data_list_key attribute of your MMAR's config/config_inference.json file, such as testing.
  5. In the /mdai folder, make the following changes:
     • In the config.yaml file, change the clara_version key to the appropriate version used by your model (for example, 3.1.01 or 4.0).

    base_image: nvidia
    clara_version: 4.0
    device_type: gpu

     • In requirements.txt, add any dependencies your model requires beyond those provided by the NVIDIA Clara base image and those already present in the file.
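
The edits to config_mdai.json in step 4 can also be scripted, which is convenient when preparing several MMARs. The following is a minimal sketch, assuming the file is valid JSON with the keys shown above; the helper name update_segmentation_config is hypothetical and not part of the skeleton code.

```python
import json
from pathlib import Path


def update_segmentation_config(config_path, root_folder, out_classes, data_list_key):
    """Rewrite the three fields that change between segmentation MMARs."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["root_folder"] = root_folder      # name of the downloaded MMAR folder
    config["out_classes"] = out_classes      # number of output classes, background included
    config["data_list_key"] = data_list_key  # key from the MMAR's config_inference.json
    path.write_text(json.dumps(config, indent=4) + "\n")
    return config


# Example for the liver and tumor model described above:
# update_segmentation_config("workspace/config_mdai.json",
#                            "clara_pt_liver_and_tumor_ct_segmentation_1", 3, "testing")
```

The type field is left untouched, because it stays segmentation for every model in this category.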

This prepares your model for deployment on the platform. Both the spleen and liver tumor segmentation models have been deployed on the platform and are available for evaluation.

Similar steps can be done for classification models, though we are working towards further streamlining this integration. For the skeleton code for an example NVIDIA model that classifies chest X-rays into 15 abnormalities, see Example: NVIDIA MMAR for disease classification in chest x-rays on GitHub. This model is also deployed on the public site.

When the code is ready, it must be wrapped in a zip file so that it can be uploaded to the platform for inference. For more information, see Deploying models.
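
Creating the zip file can be done with the Python standard library. The sketch below assumes that the skeleton folder contains config.yaml and requirements.txt somewhere inside it, and it checks for them before archiving. The function name package_model is hypothetical, and the layout that the platform expects inside the archive is not specified here, so check Deploying models before uploading.

```python
import shutil
from pathlib import Path

REQUIRED_FILES = ("config.yaml", "requirements.txt")


def package_model(source_dir, archive_base):
    """Zip source_dir and return the path of the archive that was written."""
    source = Path(source_dir)
    # Look for each required file anywhere under the source folder.
    missing = [name for name in REQUIRED_FILES if not any(source.rglob(name))]
    if missing:
        raise FileNotFoundError(f"missing from {source}: {', '.join(missing)}")
    # make_archive appends ".zip" to archive_base and returns the full path.
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(source))
```

Write the archive outside the folder being zipped, so that the archive does not end up including itself.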

Your model is now ready to be tried on the platform with any dataset of your choice!

Visualization of model outputs from NVIDIA Clara train's spleen segmentation and liver and tumor segmentation models deployed on
Figure 4. Segmentation example on

The best part is that each model needs to be deployed on the platform only one time. As soon as an NVIDIA Clara model is deployed, it can be used in any project by using the model cloning feature. Here's an example of cloning the NVIDIA Liver Segmentation model into a new project by just copying the value:

Example GIF of cloning deployed models on from one project to another.
Figure 5. Model cloning workflow on

Future features

We are working towards streamlining the integration to minimize the steps required to deploy the MMAR, with plans to eliminate all code modification, so that deployment is as easy as clicking a button. We also plan to predeploy all the models available on NGC so that you can use them directly by cloning from our public projects, saving you from the process of deploying MMARs on your own. We are also going to create NVIDIA Clara Starter Packs so that you can easily get started with selected models preattached to your project.

Another important plan is to add support for training AI models on the platform. With that in place, you can use the NVIDIA AI-assisted annotation product on the platform to annotate much faster and more easily, rather than starting from scratch.


In this post, we highlighted key components of each platform and the steps necessary to quickly deploy a medical imaging model built with NVIDIA Clara.

Try out a live demo of the NVIDIA Clara Liver and Spleen Segmentation models in If you have any questions, contact or


Removing Aliasing Artifacts in Ultrasound Color Doppler Imaging with NVIDIA Clara Holoscan and the NVIDIA Clara Developer Kit

The NVIDIA Clara developer kit, NVIDIA Clara Holoscan, and us4us front end help build AI models on streaming data for ultrasounds, to remove artifacts like aliasing.

At RSNA 2021, there are dedicated tracks on ultrasound imaging, which is a cost-effective way to see what is going on inside a patient’s body without exposure to radiation or the need for injections and surgeries. 

Ultrasound imaging is typically done by trained sonographers and needs special expertise to interpret. The probe is a small transducer that both transmits sound waves into the body and records the waves that echo back. It is placed on the skin and, as it moves, waves bounce off your blood cells, organs, and other body parts, and then back to the device. A computer then takes all the sound waves and turns them into moving images that you visualize on a screen.

The LITMUS group (Laboratory on Innovative Technology in Medical Ultrasound) at the University of Waterloo, Canada is working on making ultrasound color Doppler imaging (CDI) easier to visualize. They used the NVIDIA Clara Holoscan platform, including the NVIDIA Clara AGX Developer Kit and the NVIDIA Clara Holoscan SDK, along with the us4us frontend, to remove aliasing artifacts and increase the frame rate roughly 12-fold, from 2-2.5 fps to 30 fps.

  • Clara Holoscan is an AI platform that includes strong deep learning compute ability that can run a model at high frame rates. The Clara Holoscan SDK is designed to facilitate the creation of AI pipelines for the processing of real-time streaming medical data for ultrasound, video, and other imaging applications.
  • The Clara AGX Developer Kit combines the power of an NVIDIA RTX 6000 GPU controlled by an NVIDIA AGX Xavier SoC, with external connectivity provided by two PCIe Gen4 x8 slots and an NVIDIA ConnectX-6 SmartNIC with a 100 GbE port.
Raw sensor data is copied to the GPU where a CUDA and TensorRT based framework is applied for AI based aliasing removal, which results in improved flow visualization in CDI.
Figure 1. Overview of the aliasing-resistant CDI pipeline on the Clara AGX Developer Kit and us4us frontend.

Color Doppler imaging

Color Doppler imaging (CDI) is a non-invasive way to see blood flow in arteries and veins. It is used to identify a blockage, blood clot, or narrowing of the arteries that can lead to deadly clinical outcomes such as a stroke or heart attack. These blockages can occur in a variety of arteries in the body and significantly alter the properties of blood flow. The flow alterations can be captured by CDI and used in the identification and monitoring of diseased conditions. CDI can also be used for detecting aneurysms, where swollen artery walls can also impact blood flow.

Figure 2 shows a typical CDI sequence obtained from a carotid artery model where flow comes in from the left side of the image, then branches out into the upper and lower branches. Flow speed in the artery is shown in shades of blue or red, depending on its direction relative to the probe. The surrounding grayscale image shows the tissue structure.  The CDI sequence also shows how blood flow dynamics can change throughout the cardiac cycle, which is typically less than a second long.

Blood flow speed in the artery is shown using shades of blue and red, with red depicting flow going up towards the probe, and blue depicting flow going down, away from the probe. The brighter colors indicate faster flow according to the color scale on the bottom left. Tissue structures can be visualized using the surrounding gray-scale image, with bright regions indicating strong reflectors such as vessel walls.
Figure 2. Typical CDI sequence on an artery bifurcation model. 

Aliasing problems in CDI 

One recurring issue in CDI is the presence of so-called aliasing artifacts that hinder the visualization of blood flow. Aliasing artifacts occur when blood flow exceeds the maximum flow speed measurable by the CDI system. 

For example, Figure 3 shows that flow in the upper branch is fast and exceeds the maximum measurable flow speed on the color scale (25 cm/s). The color chosen for this region is therefore picked from the opposite end of the color scale and incorrectly indicates that flow is going in the opposite direction. The maximum measurable speed stems from underlying system limitations and imaging considerations.

Aliasing is most problematic in tortuous vasculature such as bifurcations and in conditions where a wide range of multidirectional velocities are encountered. CDI in such conditions can become difficult to interpret.
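
The wrap-around can be illustrated with a few lines of arithmetic. This is a simplified sketch, assuming that the measured velocity behaves like a phase that wraps at the maximum measurable speed; the function name aliased_velocity and the 30 cm/s input are illustrative and do not come from the LITMUS data.

```python
def aliased_velocity(true_velocity, v_max):
    """Wrap a velocity into [-v_max, v_max), as a phase-based estimator would."""
    return (true_velocity + v_max) % (2 * v_max) - v_max


# With a 25 cm/s limit, flow at 20 cm/s is displayed correctly...
print(aliased_velocity(20.0, 25.0))  # 20.0
# ...but flow at 30 cm/s toward the probe shows up as 20 cm/s away from it.
print(aliased_velocity(30.0, 25.0))  # -20.0
```

The sign flip in the second case is what makes the color jump to the opposite end of the color scale.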

Blood flow speed in the upper branch exceeds the maximum measurable speed (25 cm/s) and wraps around from red to blue, incorrectly indicating that flow is going in the opposite direction.
Figure 3. CDI sequence on an artery bifurcation model with aliasing artifacts

Novel deep learning–based solution

The LITMUS group devised a new deep learning–based solution to address these aliasing artifacts in CDI for the femoral artery bifurcation.  The femoral artery bifurcation in the thigh was chosen due to its diverse flow properties, including a wide range of flow speeds and multidirectional flow. The artery can be a site of blockage in conditions of peripheral artery disease and would be susceptible to aliasing in the bifurcation, even in healthy conditions.

A pretrained U-Net convolutional neural network segments aliasing artifacts in Color Doppler images. The segmented artifacts are then removed by an adaptive phase unwrapping algorithm.
Figure 4. Overview of the CDI aliasing removal pipeline

To address aliasing artifacts in CDI, the LITMUS group devised a two-step process: 

  • Aliasing artifacts in CDI are segmented using a convolutional neural network (CNN) model. 
  • The segmented aliasing artifacts are subsequently removed by an adaptive technique. 

For the aliasing segmentation, a U-Net CNN was trained to detect aliasing artifacts using several ultrasound features that are commonly computed in typical CDI pipelines and that carry information relevant to aliasing detection. The network was trained on 1,136 frames obtained from three real femoral artery bifurcation acquisitions using a us4us ultrasound frontend. The aliasing artifacts in CDI were manually labeled for training and validation. The model definition and training were done in TensorFlow.

The segmentation maps were then leveraged by an adaptive phase unwrapping algorithm that reverses the aliasing artifact according to flow continuity criteria so that a smooth aliasing-free flow profile is achieved. The framework was then evaluated on a new acquisition from an unseen femoral artery bifurcation, where it was shown to deal with multidirectional and excessive aliasing.
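
The idea of combining a segmentation mask with a continuity criterion can be sketched in one dimension. This is not the LITMUS group's adaptive algorithm: it is a deliberately simple stand-in that assumes a single wrap per sample and uses the previous sample along a line as the continuity reference. The function name unwrap_profile and the sample values are hypothetical.

```python
def unwrap_profile(measured, aliased_mask, v_max):
    """Undo single wraps in a 1D velocity profile.

    measured: velocities along a line across the vessel
    aliased_mask: per-sample booleans, True where segmentation flagged aliasing
    v_max: maximum measurable speed of the system
    """
    corrected = list(measured)
    reference = None  # last trusted (or already corrected) neighbor
    for i, (v, is_aliased) in enumerate(zip(measured, aliased_mask)):
        if is_aliased and reference is not None:
            # Candidates differ by one full wrap; keep the one closest
            # to the neighbor so that the profile stays continuous.
            candidates = (v - 2 * v_max, v, v + 2 * v_max)
            corrected[i] = min(candidates, key=lambda c: abs(c - reference))
        reference = corrected[i]
    return corrected


measured = [5.0, 15.0, 24.0, -20.0, -15.0, -20.0, 24.0, 15.0, 5.0]
mask = [False, False, False, True, True, True, False, False, False]
print(unwrap_profile(measured, mask, v_max=25.0))
```

In this example, the three flagged samples are restored to 30, 35, and 30, giving a smooth profile that peaks above the 25 cm/s limit. Samples outside the mask are never modified, which is why the quality of the segmentation matters.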

The framework was computationally demanding, requiring more than 500 ms per frame for simple de-aliasing, and even slower for excessive aliasing cases.

Clara Holoscan for real-time de-aliasing in bedside applications

CDI is widely expected to be a point-of-care modality that can be used to gain quick and immediate insights into blood flow conditions in patients. Offline processing would disrupt this utility of CDI, so it is important that the aliasing removal framework be run in real time. 

NVIDIA and the LITMUS group collaborated to accelerate the de-aliasing framework to achieve real-time performance that would be suitable in a bedside application, using the NVIDIA Clara Holoscan SDK and the NVIDIA Clara AGX Developer kit. 

A GPU-accelerated CDI platform was implemented on the NVIDIA Clara AGX Developer Kit using a CUDA-based framework previously reported by LITMUS. For more information, see Live Ultrasound Color-Encoded Speckle Imaging Platform for Real-Time Complex Flow Visualization In Vivo.

Raw sensor data is continuously copied to the NVIDIA RTX 6000 GPU in the Clara AGX Developer Kit, where custom CUDA kernels perform the necessary processing for image formation. The pretrained U-Net TensorFlow model was implemented using the TensorRT API, and the adaptive phase unwrapping algorithm was accelerated using the CUDA NPP library. Further CUDA and OpenGL functions were used for display. The result was a complete raw-sensor-data-to-de-aliased-CDI package that was run on the Clara AGX Developer Kit with demonstrated real-time performance.

Figure 5 shows the aliasing resistant CDI framework in action on the Clara AGX developer kit, processing raw sensor data from a femoral bifurcation model to aliasing resistant CDI in live mode. The raw data was acquired using the us4us frontend, which gives researchers access to all the fundamental signals as they arrive from the probe: 

 (Left) screen recording of a conventional CDI processing pipeline showing aliasing in the bottom branch at systole. (Middle) The pre-trained U-Net can correctly segment the aliasing artifacts in a live pipeline. (Right) CDI with aliasing removed on the Clara AGX Developer kit using the deep-learning powered framework.
Figure 5. Screen capture of the aliasing-resistant CDI platform on the Clara AGX Developer kit
  • Left: The aliased CDI sequence is obtained using a conventional processing pipeline. At systole (peak of the cardiac cycle, frozen frame), flow is moving away from the probe and should all be blue. In the bottom branch, however, the flow speed in the direction of the probe exceeds the maximum measurable and therefore appears as a red/orange shaded region that incorrectly suggests flow is going up. 
  • Middle: Aliasing segmentation is obtained by the integrated U-Net model during the live imaging session. You can see how the aliasing artifact in the systolic frame is correctly captured on-site. 
  • Right: The CDI sequence has the aliasing removed. The maximum measurable speed is increased and the visualization of blood flow is made more intuitive. 

The throughput of the de-aliasing module improved to 30 fps, roughly a 12x improvement over the previous 2-2.5 fps. In building up to this, CuPy was used to prototype and get quick GPU acceleration, giving an intermediate ~15 fps.
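
The relationship between per-frame processing time and frame rate is simple arithmetic, shown here as a short sketch using only the figures quoted in this post. The helper names are illustrative.

```python
def frames_per_second(ms_per_frame):
    """Frame rate achievable when each frame takes ms_per_frame milliseconds."""
    return 1000.0 / ms_per_frame


def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


print(frames_per_second(500.0))  # 2.0 fps at 500 ms per frame
print(frame_budget_ms(30.0))     # per-frame budget at 30 fps, about 33 ms
print(30.0 / 2.5)                # 12.0, the speedup relative to 2.5 fps
```

The 12x figure corresponds to the upper end of the earlier 2-2.5 fps range; measured against 2 fps, the same 30 fps result is a 15x speedup.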


The LITMUS group’s workflow showed how the NVIDIA Clara AGX Developer Kit and the NVIDIA Clara Holoscan SDK can resolve aliasing artifacts in CDI, in real time. Removing aliasing makes image visualization and interpretation easier by removing the ambiguity about the blood flow direction. This makes the most impact in tortuous vasculature where flow direction can be difficult to guess by the sonographer. 

For more information, see the following resources: