Category: Offsites

A Dataset for Studying Gender Bias in Translation

Advances in neural machine translation (NMT) have enabled more natural and fluid translations, but NMT systems can still reflect the societal biases and stereotypes of the data they’re trained on. As such, it is an ongoing goal at Google to develop innovative techniques to reduce gender bias in machine translation, in alignment with our AI Principles.

One research area has been using context from surrounding sentences or passages to improve gender accuracy – this is a challenge because traditional NMT methods translate sentences individually, but gendered information is not always explicitly stated in each individual sentence. For example, in the following passage in Spanish (a language where subjects aren’t always explicitly mentioned), the first sentence refers explicitly to Marie Curie as the subject, but the second one doesn’t explicitly mention the subject. In isolation, this second sentence could refer to a person of any gender. When translating to English, however, a pronoun needs to be picked, and the information needed for an accurate translation is in the first sentence.

Spanish text: Marie Curie nació en Varsovia. Fue la primera persona en recibir dos premios Nobel en distintas especialidades.
English translation: Marie Curie was born in Warsaw. She was the first person to receive two Nobel Prizes in different specialties.

Advancing translation techniques beyond single sentences requires new metrics for measuring progress and new datasets that capture the most common context-related errors. Adding to this challenge is the fact that translation errors related to gender (such as picking the correct pronoun or having gender agreement) are particularly sensitive, because they may directly refer to people and how they self-identify.

To help facilitate progress on the common challenges of contextual translation (e.g., pronoun drop, gender agreement, and accurate possessives), we are releasing the Translated Wikipedia Biographies dataset, which can be used to evaluate the gender bias of translation models. Our intent with this release is to support long-term improvements to ML systems focused on pronouns and gender in translation by providing a benchmark in which translation accuracy can be measured before and after model changes.

A Source of Common Translation Errors
Because they are well-written, geographically diverse, contain multiple sentences, and refer to subjects in the third person (so contain plenty of pronouns), Wikipedia biographies offer a high potential for common translation errors associated with gender. These often occur when articles refer to a person explicitly in early sentences of a paragraph, but there is no explicit mention of the person in later sentences. Some examples:

Translation error: Pro-drop in Spanish → English
Source text: Marie Curie nació en Varsovia. Recibió el Premio Nobel en 1903 y en 1911.
Translation: Marie Curie was born in Warsaw. He received the Nobel Prize in 1903 and in 1911.

Translation error: Neutral possessives in Spanish → English
Source text: Marie Curie nació en Varsovia. Su carrera profesional fue desarrollada en Francia.
Translation: Marie Curie was born in Warsaw. His professional career was developed in France.

Translation error: Gender agreement in English → German
Source text: Marie Curie was born in Warsaw. The distinguished scientist received the Nobel Prize in 1903 and in 1911.
Translation: Marie Curie wurde in Varsovia geboren. Der angesehene Wissenschaftler erhielt 1903 und 1911 den Nobelpreis.

Translation error: Gender agreement in English → Spanish
Source text: Marie Curie was born in Warsaw. The distinguished scientist received the Nobel Prize in 1903 and in 1911.
Translation: Marie Curie nació en Varsovia. El distinguido científico recibió el Premio Nobel en 1903 y en 1911.

Building the Dataset
The Translated Wikipedia Biographies dataset has been designed to analyze common gender errors in machine translation, such as those illustrated above. Each instance of the dataset represents a person (identified in the biographies as feminine or masculine), a rock band, or a sports team (the latter two considered genderless). Each instance is represented by a long-text translation of 8 to 15 connected sentences referring to that central subject (the person, rock band, or sports team). Articles are written in native English and have been professionally translated to Spanish and German. For Spanish, translations were optimized for pronoun drop, so the same set can be used to analyze both pro-drop (Spanish → English) and gender agreement (English → Spanish).

The dataset was built by selecting a group of instances with equal representation across geographies and genders. To do this, we extracted biographies from Wikipedia according to occupation, profession, job, and/or activity. To ensure an unbiased selection of occupations, we chose 9 occupations that represented a range of stereotypical gender associations (feminine, masculine, or neither) based on Wikipedia statistics. Then, to mitigate any geography-based bias, we divided all these instances based on geographical diversity. For each occupation category, we aimed to have one candidate per region (using regions from census.gov as a proxy for geographical diversity). When an instance was associated with a region, we checked that the selected person had a relevant relationship with a country belonging to that region (nationality, place of birth, lived there for a large portion of their life, etc.). By using these criteria, the dataset contains entries about individuals from more than 90 countries and all regions of the world.

Although gender is non-binary, we focused on having equal representation of “feminine” and “masculine” entities. It’s worth mentioning that because the entities are represented as such on Wikipedia, the set doesn’t include individuals who identify as non-binary, as unfortunately there are not enough instances currently represented in Wikipedia to accurately reflect the non-binary community. To label each instance as “feminine” or “masculine”, we relied on the biographical information from Wikipedia, which contained gender-specific references to the person (she, he, woman, son, father, etc.).

After applying all these filters, we randomly selected an instance for each occupation-region-gender triplet. For each occupation, there are two biographies (one masculine and one feminine) for each of the 7 geographic regions, giving 9 occupations × 7 regions × 2 genders = 126 biographies.

Finally, we added 12 instances with no gender. We picked rock bands and sports teams because they are usually referred to by non-gendered third-person pronouns (such as “it” or singular “they”). The purpose of including these instances is to study over-triggering, i.e., when models learn that they are rewarded for producing gender-specific pronouns and so produce these pronouns in cases where they shouldn’t.

Results and Applications
This dataset enables a new method of evaluation for gender bias reduction in machine translations (introduced in a previous post). Because each instance refers to a subject with a known gender, we can compute the accuracy of the gender-specific translations that refer to this subject. This computation is easier when translating into English (from languages with pro-drop or neutral pronouns), since it is mainly based on gender-specific pronouns in English. In these cases, the gender datasets have allowed us to observe a 67% reduction in errors on context-aware models vs. previous models. As mentioned before, the neutral entities have allowed us to discover cases of over-triggering, like the usage of feminine or masculine pronouns to refer to genderless entities. This new dataset also enables new research directions into the performance of different models across types of occupations or geographic regions.
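
To make the idea concrete, the sketch below illustrates a pronoun-based accuracy check of this kind. It is a simplified illustration, not the dataset’s official evaluation: the pronoun lists, tokenization, and scoring rule are assumptions made for the example.

import re

# Simplified illustration of a pronoun-based gender-accuracy check (not the
# official evaluation for this dataset); pronoun lists and scoring are assumptions.
FEMININE = {"she", "her", "hers", "herself"}
MASCULINE = {"he", "him", "his", "himself"}

def pronoun_accuracy(translated_text, subject_gender):
    """Fraction of gender-specific pronouns that match the subject's known gender."""
    tokens = re.findall(r"[a-z']+", translated_text.lower())
    gendered = [t for t in tokens if t in FEMININE | MASCULINE]
    if not gendered:
        return 1.0  # no gendered pronouns produced, so nothing to penalize here
    expected = FEMININE if subject_gender == "feminine" else MASCULINE
    return sum(t in expected for t in gendered) / len(gendered)

# The mistranslation from the pro-drop example above scores 0.0:
print(pronoun_accuracy("Marie Curie was born in Warsaw. He received the Nobel Prize.", "feminine"))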

As an example, the dataset allowed us to discover the following improvements in an excerpt of the translated biography of Marie Curie from Spanish.

Translation result with the previous NMT model.
Translation result with the new contextual model.

Conclusion
This Translated Wikipedia Biographies dataset is the result of our own studies and work on identifying biases associated with gender and machine translation. The set focuses on a specific problem related to gender bias and doesn’t aim to cover the whole problem. It’s worth mentioning that by releasing this dataset, we don’t aim to be prescriptive in determining the optimal approach to addressing gender bias. This contribution aims to foster progress on this challenge across the global research community.

Acknowledgements
The datasets were built with help from Anja Austermann, Melvin Johnson, Michelle Linch, Mengmeng Niu, Mahima Pushkarna, Apu Shah, Romina Stella, and Kellie Webster.

Category: Misc

Accelerating JPEG 2000 Decoding for Digital Pathology and Satellite Images Using the nvJPEG2000 Library

JPEG 2000 (.jp2, .jpg2, .j2k) is an image compression standard defined by the Joint Photographic Experts Group (JPEG) as the more flexible successor to the still popular JPEG standard. Part 1 of the JPEG 2000 standard, which forms the core coding system, was first approved in August 2002. To date, the standard has expanded to 17 parts, covering areas like Motion JPEG 2000 (Part 3), which extends the standard to video, extensions for three-dimensional data (Part 10), and so on.

Features like mathematically lossless compression and higher precision and dynamic range per component helped JPEG 2000 find adoption in digital cinema applications. JPEG 2000 is also widely used in digital pathology and geospatial imaging, where image dimensions exceed 4K but regions of interest (ROI) stay small.

GPU acceleration using the nvJPEG2000 library

The JPEG 2000 feature set provides ample opportunities for GPU acceleration when compared to its predecessor, JPEG. Through GPU acceleration, images can be decoded in parallel and larger images can be processed more quickly. nvJPEG2000 is a new library that accelerates the decoding of JPEG 2000 images on NVIDIA GPUs. It supports codec features commonly used in geospatial imaging, remote sensing, and digital pathology. Figure 1 gives an overview of the decoding stages that nvJPEG2000 accelerates.

Figure 1. GPU-accelerated JPEG 2000 decode process. The first two stages (JPEG 2000 stream handling and Tier 2 decode) run on the CPU; the remaining stages (Tier 1 decode, dequantization, inverse DWT, and inverse component transform) are offloaded to the GPU.

The Tier 1 decode (entropy decode) stage is the most compute-intensive stage of the entire decode process. The entropy decode algorithm used in the legacy JPEG codec is serial in nature and hard to parallelize.

In JPEG 2000, the entropy decode stage is applied at a block-based granularity (typical block sizes are 64×64 and 32×32) that makes it possible to offload the entropy decode stage entirely to the GPU. For more information about the entropy decode process, see Section C of the JPEG 2000 Core coding system specification.

The JPEG 2000 core coding system allows for two types of wavelet transforms (5-3 Reversible and 9-7 Irreversible), both of which benefit from GPU acceleration. For more information about the wavelet transforms, see Section F of the JPEG 2000 Core coding system specification.

Decoding geospatial images

In this section, we concentrate on the new nvJPEG2000 API tailored for the geospatial domain, which enables decoding specific tiles within an image instead of decoding the full image. 

Figure 2. Sentinel-2 image (S2B_17RQK_20190908_0_L2A), one band of a batch of 12, encoded as JPEG 2000: image size 10980×10980, tile size 1024×1024, 11×11 tiles, 1 component.

Imaging data captured by the European Space Agency’s Sentinel-2 satellites are stored as JPEG 2000 bitstreams. Sentinel-2 Level-2A data downloaded from the Copernicus hub can be used with the nvJPEG2000 decoding examples. The imaging data has 12 bands, or channels, each of which is stored as an independent JPEG 2000 bitstream. The image in Figure 2 is subdivided into 121 tiles. To speed up the decode of multitile images, a new API called nvjpeg2kDecodeTile has been added in nvJPEG2000 v0.2, which enables you to decode each tile independently.

For multitile images, decoding each tile sequentially would be suboptimal. The GitHub multitile decode sample demonstrates how to decode each tile on a separate cudaStream_t. By taking this approach, you can decode multiple tiles on the GPU simultaneously. The Nsight Systems trace in Figure 3 shows the decoding of the Sentinel-2 dataset consisting of 12 bands. With 10 CUDA streams, up to 10 tiles are decoded in parallel at any point during the decode process.

Figure 3. Nsight Systems trace demonstrating the decoding of multiple tiles on separate CUDA streams

Table 1 shows performance data comparing a single stream and multiple streams on a GV100 GPU.

# of CUDA streams    Average decode time (ms)    Speedup (%) over single-stream decode
1                    0.888854                    —
10                   0.227408                    75%
Table 1. Single-stream vs. multi-stream decode performance on a Quadro GV100 for the Sentinel-2 dataset.

Using 10 CUDA streams cuts the average decode time from 0.889 ms to 0.227 ms, reducing the total decode time of the entire dataset by about 75% on a Quadro GV100 GPU. For more information, see the Accelerating Geospatial Remote Sensing Workflows Using NVIDIA SDKs [S32150] GTC’21 talk, which discusses geospatial image-processing workflows in more detail and the role nvJPEG2000 plays in them.

Decoding digital pathology images

JPEG 2000 is used in digital pathology to store whole slide images (WSI). Figure 4 gives an overview of various deep learning techniques that can be applied to WSI. Deep learning models can be used to distinguish between cancerous and healthy cells. Image segmentation methods can be used to identify a tumor location in the WSI. For more information, see Deep neural network models for computational histopathology: A survey.

Figure 4. Digital pathology workflows

Table 2 lists the key parameters, and their commonly used values, for a whole slide image (WSI) compressed using JPEG 2000.

Image size: 92000×201712
Tile size: 92000×201712
# of tiles: 1
# of components: 3
Precision: 8 bits
Table 2. Key JPEG 2000 parameters and their values used in digital pathology.

The image in question is large, and it is not possible to decode the entire image at once due to the amount of memory required: the decoded output alone is around 53 GB (92000 × 201712 pixels × 3 components), excluding the decoder’s own memory requirements.

There are several approaches to handling such large images. In this post, we describe two of them:

  • Decoding an area of interest
  • Decoding the image at lower resolution

Both approaches can be easily performed using specific nvJPEG2000 APIs.

Decoding an area of interest in an image

The nvJPEG2000 library enables decoding a specific area of interest in an image; this is supported as part of the nvjpeg2kDecodeTile API. The following code example shows how to set the area of interest in terms of image coordinates. The nvjpeg2kDecodeParams_t type enables you to control the decode output settings, such as the area of interest to decode.

 nvjpeg2kDecodeParams_t decode_params;
 // All coordinate values are relative to the top-left corner of the image.
 uint32_t top_coordinate, bottom_coordinate, left_coordinate, right_coordinate;
 uint32_t tile_id;
  
 // Restrict the decode output to the requested area of interest.
 nvjpeg2kDecodeParamsSetDecodeArea(decode_params, left_coordinate, right_coordinate,
                                   top_coordinate, bottom_coordinate);
  
 // Decode only that area of the given tile.
 nvjpeg2kDecodeTile(nvjpeg2k_handle, nvjpeg2k_decode_state,
                    jpeg2k_stream, decode_params, tile_id, 0,
                    &nvjpeg2k_out, cuda_stream);

For more information about how to partially decode an image with multiple tiles, see the tile decode GitHub sample.

Decoding lower resolutions of an image

The second approach to decoding a large image is to decode it at a lower resolution. The ability to decode only the lower resolutions is a benefit of JPEG 2000’s use of wavelet transforms. In Figure 5, the wavelet transform is applied up to two levels, which gives you access to the image at three resolutions. By controlling how the inverse wavelet transform is applied, you can decode only the lower resolutions of an image.

Figure 5. Output of a 2D wavelet transform with two-level decomposition

The digital pathology image described in Table 2 has 12 resolutions. This information can be retrieved on a per-tile basis:

 uint32_t num_res;
 uint32_t tile_id = 0;
 nvjpeg2kStreamGetResolutionsInTile(jpeg2k_stream, tile_id, &num_res);

The image has a size of 92000×201712 with 12 resolutions. If you choose to discard the four highest resolutions and decode only the eight lower resolutions, you can extract an image of size 5750×12574. By dropping the four highest resolutions, you scale the result down by a factor of 16.

 uint32_t num_res_to_decode = 8;
 // If num_res_to_decode > num_res, nvjpeg2kDecodeTile returns an invalid-parameter error.
  
 nvjpeg2kDecodeTile(nvjpeg2k_handle, nvjpeg2k_decode_state, jpeg2k_stream,
                    decode_params, tile_id, num_res_to_decode, &nvjpeg2k_out, cuda_stream);

Performance benchmarks

To show the performance improvement that decoding JPEG 2000 on the GPU brings, we compare the GPU-based nvJPEG2000 library with the CPU-based OpenJPEG library.

Figures 6 and 7 show the average speedup when decoding one image at a time. The following images are used in the measurements:

  • 1920×1080 8-bit image with 444 chroma subsampling
  • 3840×2160 8-bit image with 444 chroma subsampling
  • 3328×4096 12-bit grayscale
Figure 6. Speedup for lossless decode (5-3 DWT) over the CPU implementation using 16 threads, measured on RTX A6000, A100, V100, RTX 8000, RTX 4000, and T4 GPUs.
Figure 7. Speedup for lossy decode (9-7 DWT) over the CPU implementation using 16 threads, on the same GPUs.

The measurements use OpenJPEG as the CPU baseline: Intel Xeon Gold 6240 @ 2 GHz (3.9 GHz Turbo, Cascade Lake, HT on), with 16 CPU threads per image.

On NVIDIA Ampere Architecture GPUs such as NVIDIA RTX A6000, the speedup factor is more than 8x for decoding. This speedup is measured for single-image latency.

Even higher speedups can be achieved by batching the decode of multiple images. Figures 8 and 9 compare the speed of decoding a 1920×1080 8-bit image with 444 chroma subsampling (Full HD) in both lossless and lossy modes respectively across multiple GPUs.

Figure 8. Decode throughput comparison for a 1920×1080 8-bit 444 image using the 5-3 wavelet transform (lossless decode), in batch mode, on A100, RTX A6000, V100, RTX 8000, RTX 4000, and T4 GPUs.
Figure 9. Decode throughput comparison for the same image using the 9-7 wavelet transform (lossy decode), on the same GPUs.

Figures 8 and 9 demonstrate the benefits of batched decode using the nvJPEG2000 library. The performance increase is more significant on GPUs with a large number of streaming multiprocessors (SMs), such as the A100 and NVIDIA RTX A6000, than on GPUs with fewer SMs, such as the NVIDIA RTX 4000 and T4. By batching, you make sure the available compute resources are used efficiently.

As observed in Figure 8, the decode speed on an NVIDIA RTX A6000 is 232 images per second for a batch size of 20. This equates to an additional 3x speedup over a batch size of 1, based on a benchmark image with a low compression ratio: the compressed bitstream is only about 3x smaller than the uncompressed image. At higher compression ratios, the speedup is even greater.

The following GitHub samples show how to achieve this speedup at both image and tile granularity:

Conclusion

The nvJPEG2000 library uses NVIDIA GPUs to accelerate the decoding of JPEG 2000 images, both for large images and for large volumes of images, and it lets you target the specific image regions and resolutions of interest. Decoding JPEG 2000 images using the nvJPEG2000 library can be as much as 8x faster on the GPU (NVIDIA RTX A6000) than on the CPU. A further speedup of 3x (24x faster than the CPU) is achieved by batching the decode of multiple images.

The simple nvJPEG2000 APIs make the library easy to include in your applications and workflows. It is also integrated into the NVIDIA Data Loading Library (DALI), a data loading and preprocessing library for accelerating deep learning applications. Using nvJPEG2000 and DALI together makes it easy to use JPEG 2000 images as part of deep learning training workflows.
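
As a rough illustration of that integration, the sketch below builds a DALI pipeline that reads JPEG 2000 files from a directory and decodes them on the GPU. The directory layout, batch size, and other parameters are placeholders, and it assumes a DALI release in which the fn.decoders.image operator accepts JPEG 2000 input.

from nvidia.dali import pipeline_def, fn, types

@pipeline_def
def jp2_pipeline(data_dir):
    # fn.readers.file expects one subdirectory per class under data_dir.
    encoded, labels = fn.readers.file(file_root=data_dir)
    # device="mixed" requests GPU-accelerated decoding of the encoded images.
    images = fn.decoders.image(encoded, device="mixed", output_type=types.RGB)
    return images, labels

pipe = jp2_pipeline("/path/to/jp2_dataset", batch_size=8, num_threads=4, device_id=0)
pipe.build()
images, labels = pipe.run()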

For more information, see the following resources:

Category: Misc

Test data generator – model.evaluate()

Hello, I’m trying to measure the performance (accuracy and loss) of my model and I discovered the evaluate() function for this.

My test data (34 pictures) is saved in a ‘test’ folder, so I tried to create an ImageDataGenerator and then to generate my data using flow_from_directory.

I receive a “Found 34 images belonging to 1 classes.” message. However, the result I get in the terminal for the line `result = seqModel.evaluate(data, batch_size=1, verbose=1)` is a very weird one:

2/2 [==============================] - 0s 5ms/step - loss: 282.6923 - accuracy: 0.7353

Why do I receive a “2/2” every time when running the script now, no matter what batch_size I choose? And why is my loss 282.6923, while accuracy is 0.7353? Doesn’t it look super weird? I know I’m doing something wrong, but I just can’t figure it out – maybe when creating the data generator or maybe when using flow_from_directory? (When I add the validationDataGenerator as first argument – in order to test it – it seems all fine, but here I just can’t figure it out.)
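
For reference, a minimal sketch of the setup described above; the folder name, image size, and rescaling are assumptions, and seqModel stands in for the compiled model from the post:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

test_datagen = ImageDataGenerator(rescale=1.0 / 255)   # preprocessing is an assumption
data = test_datagen.flow_from_directory(
    "test",                  # prints "Found 34 images belonging to 1 classes."
    target_size=(224, 224),  # assumed input size
    class_mode="binary",
    batch_size=32,           # flow_from_directory default: ceil(34 / 32) = 2 batches, hence "2/2"
)

# With a generator input, the generator's own batch size determines the number of
# steps shown in the progress bar; the batch_size argument of evaluate() does not.
result = seqModel.evaluate(data, verbose=1)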

A little bit of help would be appreciated. 🙂

submitted by /u/burgundicorn

Category: Misc

What is the shape of the C object corresponding to this TFLite output?

I have a YOLOv5 trained model converted to .tflite format having used this guide.

I use this code to print the input and output shapes in Python:

 interpreter = tf.lite.Interpreter(
     # model_path="models/exported_resnet640.tflite")  # centernet_512x512 works correctly
     model_path="models/yolov5_working.tflite")        # centernet_512x512 works correctly

 interpreter.allocate_tensors()

 # Get input and output tensors.
 input_details = interpreter.get_input_details()
 output_details = interpreter.get_output_details()
 print("======================================================")
 print(input_details)
 print("======================================================")
 print(output_details)

 for detail in output_details:
     print(detail)
     print(" ")

and the output looks like this:

 [{'name': 'input_1', 'index': 0, 'shape': array([ 1, 480, 480, 3], dtype=int32), 'shape_signature': array([ 1, 480, 480, 3], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}]

 {'name': 'Identity', 'index': 422, 'shape': array([ 1, 14175, 9], dtype=int32), 'shape_signature': array([ 1, 14175, 9], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}

After invoking the interpreter on some input, I get an output looking like this:

 Output: [[[0.01191081 0.01366316 0.02800988 ... 0.1661754 0.31489396 0.4217688 ]
   [0.02396268 0.01650745 0.0442626 ... 0.24655405 0.35853994 0.2839473 ]
   [0.04218047 0.01613732 0.0548977 ... 0.13136038 0.25760946 0.5338376 ]
   ...
   [0.82626414 0.9669814 0.4534862 ... 0.18754318 0.11680853 0.18492043]
   [0.8983849 0.9680944 0.64181983 ... 0.19781056 0.16431764 0.16926363]
   [0.9657682 0.9869368 0.5452545 ... 0.13321301 0.12015155 0.15937251]]]

Using the TensorFlow Lite c_api.h, I am trying to get the same output in C, but I cannot understand how to create the object that gets the data.

I have tried using a float*** with size 1 * 14175 * 9 * sizeof(float) and getting the output like so:

 int number_of_detections = 14175;
 struct filedata o_boxes;
 float ***box_coords = (float ***)malloc(sizeof(float **) * 1);

 box_coords[0] = (float **)malloc(sizeof(float *) * (int)number_of_detections);
 for (int i = 0; i < (int)number_of_detections; i++)
 {
     box_coords[0][i] = (float *)calloc(sizeof(float), 9); // box has 9 coordinates
 }

 o_boxes.data = box_coords;
 o_boxes.size = 1 * (int)number_of_detections * 9 + 1;

 const TfLiteTensor *output_tensor_boxes = TfLiteInterpreterGetOutputTensor(interpreter, 0);
 TfLiteTensorCopyToBuffer(output_tensor_boxes, o_boxes.data, o_boxes.size * sizeof(float));

 box_coords = (float ***)&o_boxes.data;

 for (int i = 0; i < o_boxes.size; i++)
 {
     for (int j = 0; j < 9; j++)
     {
         printf("%f ", box_coords[0][i][j]);
         fflush(stdout);
     }
     printf("\n");
 }

where `struct filedata` is a simple struct:

 struct filedata
 {
     void *data;
     size_t size;
 };

The result is some garbage big floats: 39688651931648.000000 0.000000 39805756899328.000000 0.000000 39807166185472.000000 0.000000 39807367512064.000000 0.000000 39807568838656.000000 and after the first iteration I get a Segmentation Fault.

How should I create/allocate my float array to get my data?

submitted by /u/morphinnas

Category: Misc

NVIDIA Launches Morpheus Early Access Program to Enable Advanced Cybersecurity Solution Development

NVIDIA is opening early access to its Morpheus AI development framework for cybersecurity applications. Selected developers have access to Morpheus starting today with more developers joining the program over the next few months.

Just announced at NVIDIA GTC in April 2021, NVIDIA Morpheus gives security teams complete visibility into security threats, with unmatched AI processing and real-time monitoring to protect every server and screen every packet in the data center. Built on NVIDIA deep learning and data science tools including RAPIDS, CLX, Streamz, Triton Inference Server, and TensorRT, security applications developed on Morpheus help teams respond to anomalies and update policies immediately as threats are identified. Data analysis runs on NVIDIA-Certified servers built on the NVIDIA EGX platform or in qualified cloud instances that support NVIDIA GPUs, while traffic collection and telemetry can run on a variety of servers or switches plus the NVIDIA BlueField-2 data processing unit (DPU).

Figure 1. NVIDIA Morpheus leverages NVIDIA data science frameworks and the NVIDIA EGX platform for data analysis, and the NVIDIA DPU for telemetry and pervasive traffic scanning.

Developers in the Morpheus early access program have immediate access to components through the NGC catalog and can load them into an Amazon Web Services Elastic Compute Cloud (AWS EC2) G4 instance, featuring an NVIDIA T4 or A100 GPU, to begin immediate development of cybersecurity applications and solutions. Early access will soon support the use of Red Hat Enterprise Linux (RHEL) and Red Hat OpenShift on NVIDIA-Certified servers built on NVIDIA EGX for on-premises development and deployment, and RHEL on NVIDIA BlueField DPUs for enhanced data collection and traffic screening that can protect every server. Support for running Morpheus on Ubuntu is expected soon afterwards, followed by additional OS options.

Developers accepted to early access are being notified this week and NVIDIA plans to expand the early access program quickly to include more security ISV partners, end users, academics, and other security professionals who wish to develop scalable, adaptive, AI-powered cybersecurity solutions.

If you are a customer, partner or researcher interested in joining the Morpheus early access program, please apply here.

Additional Resources:

Category: Misc

GFN Thursday Heats Up with ‘LEGO Builder’s Journey’ and ‘Phantom Abyss’ Game Launches, Plus First Look at Kena: Bridge of Spirits

It’s getting hot in here, so get your game on this GFN Thursday with 13 new games joining the GeForce NOW library, including LEGO Builder’s Journey, Phantom Abyss and the Dual Universe beta. Plus, get a sneak peek at Kena: Bridge of Spirits, coming to the cloud later this year.


Category: Misc

More Than Meets the AI: How GANs Research Is Reshaping Video Conferencing

Roll out of bed, fire up the laptop, turn on the webcam — and look picture-perfect in every video call, with the help of AI developed by NVIDIA researchers. Vid2Vid Cameo, one of the deep learning models behind the NVIDIA Maxine SDK for video conferencing, uses generative adversarial networks (known as GANs) to synthesize realistic …


Category: Misc

Fast-Track Production AI with Pretrained Models and Transfer Learning Toolkit 3.0

Today, NVIDIA announced new pretrained models and general availability of Transfer Learning Toolkit (TLT) 3.0, a core component of NVIDIA’s Train, Adapt, and Optimize (TAO) platform guided workflow for creating AI. The new release includes a variety of highly accurate and performant pretrained models in computer vision and conversational AI, as well as a set of powerful productivity features that boost AI development by up to 10x. 

As enterprises race to bring AI-enabled solutions to market, your competitiveness relies on access to the best development tools. The development journey to deploy custom, high-accuracy, performant AI models in production can be treacherous for many engineering and research teams attempting to train with open-source models for AI product creation. NVIDIA offers high-quality pretrained models and TLT to help reduce the costs of large-scale data collection and labeling. This also eliminates the burden of training AI/ML models from scratch. New entrants to the computer vision and speech-enabled service market can now deploy production-class AI without a massive AI development team.

Highlights of the new release include:

  • A pose-estimation model that supports real-time inference on edge with 9x faster inference performance than the OpenPose model. 
  • PeopleSemSegNet, a semantic segmentation network for people detection.
  • A variety of computer vision pretrained models in various industry use cases, such as license plate detection and recognition, heart rate monitoring, emotion recognition, facial landmarks, and more.
  • CitriNet, a new speech-recognition model that is trained on various proprietary domain-specific and open-source datasets.
  • A new Megatron Uncased model for Question Answering, plus many other pretrained models that support speech-to-text, named-entity recognition, punctuation, and text classification.
  • Training support on AWS, GCP, and Azure.
  • Out-of-the-box deployment on NVIDIA Triton and DeepStream SDK for vision AI, and NVIDIA Jarvis for conversational AI.

Get Started Fast

  • Download Transfer Learning Toolkit and access to developer resources: Get started
  • Download models from NGC: Computer vision | Conversational AI 
  • Check out the latest developer tutorial: Training and Optimizing a 2D Pose-Estimation Model with the NVIDIA Transfer Learning Toolkit. Part 1 | Part 2 

Integration with Data-Generation and Labeling Tools for Faster and More Accurate AI

TLT 3.0 is also now integrated with platforms from several leading partners who provide large, diverse, and high-quality labeled data—enabling faster end-to-end AI/ML workflows. You can now use these partners’ services to generate and annotate data, seamlessly integrate with TLT for model training and optimization, and deploy the model using DeepStream SDK or Jarvis to create reliable applications in computer vision and conversational AI. 

Check out more partner blog posts and tutorials about synthetic data and data annotation with TLT:

Learn more about NVIDIA pretrained models and the Transfer Learning Toolkit.

Category: Misc

New on NGC: PyTorch Lightning Container Speeds Up Deep Learning Research

Deep learning research requires working at scale. Training on massive data sets or multilayered deep networks is computationally intensive and can take an impractically long time as deep learning models are bound by memory. The key here is to compose the deep learning models in a structured way so that they are decoupled from the engineering and data, enabling researchers to conduct fast research.

PyTorch Lightning, developed by Grid.AI, is now available as a container on the NGC catalog, NVIDIA’s hub of GPU-optimized AI and HPC software. PyTorch Lightning was designed to remove the roadblocks in deep learning research and let researchers focus on science. Lightning is more of a style guide than a framework, enabling you to structure and organize your code while providing utilities for common functions. With PyTorch Lightning, you can scale your models to multiple GPUs and leverage state-of-the-art training features such as 16-bit precision, early stopping, logging, pruning, and quantization, while enabling faster iteration and reproducibility.

Figure 1. PyTorch Lightning Philosophy

A Lightning model is composed of the following (a minimal toy example follows the list):

  • A LightningModule that encapsulates the model code
  • A Lightning DataModule that encapsulates transforms, dataset, and DataLoaders
  • A Lightning trainer that automates the training routine with 70+ flags to make advanced features trivial
  • Callbacks for users to customize Lightning using hooks
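
The sketch below shows how these pieces might fit together for a toy classifier. The model architecture, the random data, and the trainer flags are illustrative assumptions rather than a prescribed setup:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    # LightningModule: the model code lives here.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self.net(x), y)
        self.log("train_loss", loss)  # built-in logging utility
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

class RandomDataModule(pl.LightningDataModule):
    # LightningDataModule: datasets and DataLoaders live here.
    def train_dataloader(self):
        ds = TensorDataset(torch.randn(512, 32), torch.randint(0, 2, (512,)))
        return DataLoader(ds, batch_size=64)

# The Trainer automates the training routine; flags such as gpus and precision
# switch on multi-GPU and 16-bit training without touching the model code
# (this example assumes a CUDA GPU is available).
trainer = pl.Trainer(max_epochs=2, gpus=1, precision=16)
trainer.fit(LitClassifier(), datamodule=RandomDataModule())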

The Lightning objects are implemented as hooks that can be overridden, making every single aspect of deep learning training highly configurable. With Lightning, you have full control over every detail (a small callback example follows the list):

  • Change how the backward step is done.
  • Change how 16-bit is initialized.
  • Add your own way of doing distributed training.
  • Add learning rate schedulers.
  • Use multiple optimizers.
  • Change the frequency of optimizer updates.

Get started today with the PyTorch Lightning container from the NGC catalog.

Category: Misc

Achieve up to 75% Performance Improvement for Communication Intensive HPC Applications with NVTAGS

Many GPU-accelerated HPC applications spend a substantial portion of their time in non-uniform, GPU-to-GPU communications. Additionally, in many HPC systems, different GPU pairs share communication links with varying bandwidth and latency. As a result, GPU assignment can substantially impact time to solution. Furthermore, on multi-node / multi-socket systems, communication performance can degrade when GPUs communicate with CPUs and NICs outside their system affinity. Because resource selection is system dependent, it is challenging to select resources such that communication costs are minimized.

NVIDIA Topology-Aware GPU Selection (NVTAGS) abstracts away the complexity of efficient resource selection. NVTAGS automates intelligent GPU assignment by profiling HPC applications and launching them with a custom GPU assignment tailored to an application and system to minimize communication costs. NVTAGS ensures that, regardless of a system’s communication topology, MPI processes communicate with the CPUs and NICs or HCAs within their own affinity. 

NVTAGS improves the performance of Chroma, MILC, and LAMMPS by 2% to 75% on one to 16 nodes.

Key NVTAGS Features:

  • Automated topology detection along with CPU and NIC/HCA binding, independent of the system and HPC application
  • Support for single- and multi-node, PCIe, and NVIDIA NVLink with NVIDIA Pascal, Volta, and Ampere architecture GPUs
  • Automatic caching of efficient GPU selection for future simulations
  • Straightforward integration with Slurm and Singularity

Download NVTAGS 1.0.0 today. 

Additional Resources:

NVTAGS Product Page
Blog: Overcoming Communication Congestion for HPC Applications with NVIDIA NVTAGS