
Detect hips api. Python.

I'm a photographer and I have to crop a lot of pictures, so I thought I'd ask you guys. I need to identify the hips in order to crop the images (tops, in this situation). I have a bit of knowledge of Python. Is there an API for this, or do I have to install everything myself? Thank you.
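
One possible starting point, sketched below with the MediaPipe Pose API (an assumption on my part; any pose-estimation library that exposes hip keypoints would do): detect the hip landmarks, then crop everything above them.

 import cv2
 import mediapipe as mp

 mp_pose = mp.solutions.pose

 image = cv2.imread("photo.jpg")  # hypothetical input file
 h, w = image.shape[:2]

 # Run single-image pose estimation.
 with mp_pose.Pose(static_image_mode=True) as pose:
     results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

 if results.pose_landmarks:
     lm = results.pose_landmarks.landmark
     left_hip = lm[mp_pose.PoseLandmark.LEFT_HIP]
     right_hip = lm[mp_pose.PoseLandmark.RIGHT_HIP]
     hip_y = int(min(left_hip.y, right_hip.y) * h)  # higher of the two hips, in pixels
     cv2.imwrite("cropped.jpg", image[:hip_y, :])   # keep everything above the hips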

submitted by /u/one_of_us31


TensorFlow Introduces ‘PluggableDevice’ Architecture To Integrate Accelerators (GPUs, TPUs) With TensorFlow Without Making Any Changes In The TensorFlow Code

TensorFlow introduces the PluggableDevice architecture, which integrates accelerators (GPUs, TPUs) with TensorFlow without requiring any changes to the TensorFlow code. PluggableDevice, as the name suggests, provides a plug-in mechanism for registering devices with TensorFlow. It is built on the StreamExecutor C API and on the work done for Modular TensorFlow. The PluggableDevice feature is available in TF 2.5.
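
Once a vendor's plug-in is installed, the new device is visible through the usual TensorFlow device APIs. A minimal sketch (the device name "APU" is a hypothetical example, not a real plug-in):

 import tensorflow as tf

 # After installing a PluggableDevice plug-in (typically via pip), the device
 # appears alongside the built-in ones.
 print(tf.config.list_physical_devices())

 # Ops can be placed on the pluggable device explicitly, like any other device.
 with tf.device("/APU:0"):  # hypothetical device name registered by the plug-in
     a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
     b = tf.matmul(a, a)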

Full Story: https://www.marktechpost.com/2021/06/24/tensorflow-introduces-pluggabledevice-architecture-to-integrate-accelerators-gpus-tpus-with-tensorflow-without-making-any-changes-in-the-tensorflow-code/

submitted by /u/ai-lover


Achieving Precision in Quantum Material Simulations

In fall of 2019, we demonstrated that the Sycamore quantum processor could outperform the most powerful classical computers when applied to a tailor-made problem. The next challenge is to extend this result to solve practical problems in materials science, chemistry and physics. But going beyond the capabilities of classical computers for these problems is challenging and will require new insights to achieve state-of-the-art accuracy. Generally, the difficulty in performing quantum simulations of such physical problems is rooted in the wave nature of quantum particles, where deviations in the initial setup, interference from the environment, or small errors in the calculations can lead to large deviations in the computational result.

In two upcoming publications, we outline a blueprint for achieving record levels of precision for the task of simulating quantum materials. In the first work, we consider one-dimensional systems, like thin wires, and demonstrate how to accurately compute electronic properties, such as current and conductance. In the second work, we show how to map the Fermi-Hubbard model, which describes interacting electrons, to a quantum processor in order to simulate important physical properties. These works take a significant step towards realizing our long-term goal of simulating more complex systems with practical applications, like batteries and pharmaceuticals.

A bottom view of one of the quantum dilution refrigerators during maintenance. During operation, the microwave wires hanging loose in this image are connected to the quantum processor, e.g., the Sycamore chip, and the lowest stage is cooled to a few tens of millikelvin above absolute zero.

Computing Electronic Properties of Quantum Materials
In “Accurately computing electronic properties of a quantum ring”, to be published in Nature, we show how to reconstruct key electronic properties of quantum materials. The focus of this work is on one-dimensional conductors, which we simulate by forming a loop out of 18 qubits on the Sycamore processor in order to mimic a very narrow wire. We illustrate the underlying physics through a series of simple textbook experiments, starting with a computation of the “band structure” of this wire, which describes the relationship between the energy and momentum of electrons in the metal. Understanding such structure is a key step in computing electronic properties such as current and conductance. Despite being an 18-qubit algorithm consisting of over 1,400 logical operations, a significant computational task for near-term devices, we are able to achieve a total error as low as 1%.

The key insight enabling this level of accuracy stems from robust properties of the Fourier transform. The quantum signal that we measure oscillates in time with a small number of frequencies. Taking a Fourier transform of this signal reveals peaks at the oscillation frequencies (in this case, the energy of electrons in the wire). While experimental imperfections affect the height of the observed peaks (corresponding to the strength of the oscillation), the center frequencies are robust to these errors. On the other hand, the center frequencies are especially sensitive to the physical properties of the wire that we hope to study (e.g., revealing small disorders in the local electric field felt by the electrons). The essence of our work is that studying quantum signals in the Fourier domain enables robust protection against experimental errors while providing a sensitive probe of the underlying quantum system.
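
A toy illustration of this robustness (synthetic data, not the experiment's measurements): the peak positions of a Fourier spectrum survive amplitude damping and added noise, even though the peak heights do not.

 import numpy as np

 t = np.linspace(0, 1, 1024, endpoint=False)
 clean = np.cos(2 * np.pi * 60 * t) + 0.5 * np.cos(2 * np.pi * 130 * t)
 noisy = 0.7 * clean + 0.3 * np.random.randn(t.size)  # damped amplitude plus noise

 freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
 for signal in (clean, noisy):
     spectrum = np.abs(np.fft.rfft(signal))
     strongest = np.sort(freqs[np.argsort(spectrum)[-2:]])  # two dominant bins
     print(strongest)  # both runs recover peaks at ~60 Hz and ~130 Hz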

(Left) Schematic of the 54-qubit quantum processor, Sycamore. Qubits are shown as gray crosses and tunable couplers as blue squares. Eighteen of the qubits are isolated to form a ring. (Middle) Fourier transform of the measured quantum signal. Peaks in the Fourier spectrum correspond to the energy of electrons in the ring. Each peak can be associated with a traveling wave that has fixed momentum. (Right) The center frequency of each peak (corresponding to the energy of electrons in the wire) is plotted versus the peak index (corresponding to the momentum). The measured relationship between energy and momentum is referred to as the ‘band structure’ of the quantum wire and provides valuable information about electronic properties of the material, such as current and conductance.

Quantum Simulation of the Fermi-Hubbard Model
In “Observation of separated dynamics of charge and spin in the Fermi-Hubbard model”, we focus on the dynamics of interacting electrons. Interactions between particles give rise to novel phenomena such as high-temperature superconductivity and spin-charge separation. The simplest model that captures this behavior is known as the Fermi-Hubbard model. In materials such as metals, the atomic nuclei form a crystalline lattice and electrons hop from lattice site to lattice site carrying electrical current. In order to accurately model these systems, it is necessary to include the repulsion that electrons feel when getting close to one another. The Fermi-Hubbard model captures this physics with two simple parameters that describe the hopping rate (J) and the repulsion strength (U).
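
For concreteness, here is a rough sketch of the model itself, built with the OpenFermion package (my choice of tool for illustration; this work does not use it):

 from openfermion import fermi_hubbard

 J, U = 1.0, 4.0  # illustrative hopping rate and repulsion strength
 # 1D Fermi-Hubbard chain with 8 sites and open boundary conditions.
 hamiltonian = fermi_hubbard(x_dimension=8, y_dimension=1,
                             tunneling=J, coulomb=U, periodic=False)
 print(hamiltonian)  # fermionic hopping and on-site interaction terms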

We realize the dynamics of this model by mapping the two physical parameters to logical operations on the qubits of the processor. Using these operations, we simulate a state of the electrons where both the electron charge and spin densities are peaked near the center of the qubit array. As the system evolves, the charge and spin densities spread at different rates due to the strong correlations between electrons. Our results provide an intuitive picture of interacting electrons and serve as a benchmark for simulating quantum materials with superconducting qubits.

(Left top) Illustration of the one-dimensional Fermi-Hubbard model in a periodic potential. Electrons are shown in blue, with their spin indicated by the connected arrow. J, the distance between troughs in the electric potential field, reflects the “hopping” rate, i.e., the rate at which electrons transition from one trough in the potential to another, and U, the amplitude, represents the strength of repulsion between electrons. (Left bottom) The simulation of the model on a qubit ladder, where each qubit (square) represents a fermionic state with spin-up or spin-down (arrows). (Right) Time evolution of the model reveals separated spreading rates of charge and spin. Points and solid lines represent experimental and numerical exact results, respectively. At t = 0, the charge and spin densities are peaked at the middle sites. At later times, the charge density spreads and reaches the boundaries faster than the spin density.

Conclusion
Quantum processors hold the promise to solve computationally hard tasks beyond the capability of classical approaches. However, in order for these engineered platforms to be considered serious contenders, they must offer computational accuracy beyond that of current state-of-the-art classical methods. In our first experiment, we demonstrate an unprecedented level of accuracy in simulating simple materials, and in our second experiment, we show how to embed realistic models of interacting electrons into a quantum processor. It is our hope that these experimental results help advance the goal of moving beyond the classical computing horizon.


Buckle Up for the Industrial HPC Revolution

“A confluence of advances has put us at the beginnings of the industrial HPC revolution,” said Jensen Huang. In a short talk viewable below, NVIDIA’s CEO described to a gathering of high performance computing specialists in Europe the genesis and outlook for the most powerful technology trend of our lifetimes. High performance computing is experiencing …



Innovation, Inclusion, Impact: Highlights from Our Annual Corporate Social Responsibility Report

NVIDIA’s 12th annual corporate social responsibility report, published today, shares our progress — and plans — to take care of employees, reach new sustainability goals and channel our tech to support the global community. Following a year of global hardship — the ongoing battle against the COVID-19 pandemic, social unrest and renewed calls for racial …



Improving Breast Cancer Detection in Ultrasound Imaging Using AI

Although ultrasound imaging is often used to detect breast cancer, especially mammographically occult cancers, its disadvantage is that it leads to high false-positive rates. We develop an AI system that achieves radiologist-level accuracy in identifying cancer. It is interpretable, achieves high accuracy on an external test set, and is trained in a weakly supervised manner.

Breast cancer is the most frequently diagnosed cancer among women worldwide. It’s also the leading cause of cancer-related deaths. Identifying breast cancer at an early stage before metastasis enables more effective treatments and therefore significantly improves survival rates.

Although mammography is the most widely used imaging technique for early detection of breast cancer, it is not always available in low-resource settings. Its sensitivity also drops for women with dense breast tissue.

Breast ultrasound is often used as a supplementary imaging modality to mammography in screening settings, and as the primary imaging modality in diagnostic settings. Despite its advantages, including lower costs relative to mammography, breast ultrasound images are difficult to interpret, as evidenced by the considerable intra-reader variability. This leads to increased false-positive findings, unnecessary biopsies, and significant discomfort for patients.

Previous work using deep learning for breast ultrasound has been based predominantly on small datasets on the scale of thousands of images. Many of these efforts also rely on expensive and time-consuming manual annotation of images to obtain image-level (presence of cancer in each image) or pixel-level (exact location of each lesion) labels.

Using AI to improve breast cancer detection

In our recent paper, Artificial Intelligence System Reduces False-Positive Findings in the Interpretation of Breast Ultrasound Exams, we leverage the full potential of deep learning and eliminate the need for manual annotations by designing a weakly supervised deep neural network whose operation resembles the diagnostic procedure of radiologists (Figure 1).

Figure 1. Architecture of the deep neural network: an image-level feature extractor and an information aggregator compute the final prediction from a set of ultrasound images.

Radiologist diagnostic procedure compared to AI

The following table compares how radiologists make predictions with how our AI system does.

 Radiologist: Looks for abnormal findings in each image within a breast ultrasound exam.
 AI network: Processes each image within an exam independently using a ResNet-18 model and generates a saliency map for it, indicating the most important parts.

 Radiologist: Concentrates on images that contain suspicious lesions.
 AI network: Assigns attention scores to each image based on its relative importance.

 Radiologist: Considers signals in all images to make a final diagnosis.
 AI network: Aggregates information from all images using an attention mechanism to compute the final predictions for benign and malignant findings.

Table 1. Comparing the radiologist's diagnostic procedure to the AI network's.
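
A minimal sketch of this attention-based aggregation (an illustration in PyTorch, with torchvision's ResNet-18 as the per-image encoder; the layer sizes and classification head are my assumptions, not the authors' exact architecture):

 import torch
 import torch.nn as nn
 import torchvision.models as models

 class ExamClassifier(nn.Module):
     def __init__(self, feature_dim=512):
         super().__init__()
         resnet = models.resnet18(weights=None)
         # Per-image feature extractor: ResNet-18 without its final FC layer.
         self.encoder = nn.Sequential(*list(resnet.children())[:-1])
         # Attention module scores each image's relative importance.
         self.attention = nn.Sequential(
             nn.Linear(feature_dim, 128), nn.Tanh(), nn.Linear(128, 1))
         # Two outputs: benign and malignant findings.
         self.classifier = nn.Linear(feature_dim, 2)

     def forward(self, images):  # images: (n_images, 3, H, W), one exam
         feats = self.encoder(images).flatten(1)               # (n_images, feature_dim)
         scores = torch.softmax(self.attention(feats), dim=0)  # attention per image
         exam_feature = (scores * feats).sum(dim=0)            # aggregate over the exam
         return self.classifier(exam_feature)

The softmax over the image axis assigns each image of an exam a relative weight, mirroring how a radiologist concentrates on the most suspicious images.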

We compared the performance of the trained network to 10 board-certified breast radiologists in a reader study, and to hybrid AI-radiologist models, which average the predictions of the AI and each radiologist.

The neural network was trained with a dataset consisting of approximately four million ultrasound images on an HPC cluster powered by NVIDIA technologies. The cluster consists of 34 compute nodes, each equipped with 80 CPUs and four NVIDIA V100 GPUs (16/32 GB). With this cluster, we performed a hyperparameter search by launching experiments (each taking around 300 GPU hours) over a broad range of hyperparameters.

A large-scale dataset

Figure 2. Performance (AUROC, AUPRC, specificity, biopsy rate, and PPV) of the AI, the readers, and the hybrid AI-reader models. The hybrid approach improves the performance of all readers across all metrics.

To complete this ambitious project, we preprocessed more than eight million breast ultrasound images collected at NYU Langone between 2012 and 2019 and extracted breast-level cancer labels by mining pathology reports.

  • Training set: 3,930,347 images within 209,162 exams collected from 101,493 patients.
  • Validation set: 653,924 images within 34,850 exams collected from 16,707 patients.
  • Internal test set: 858,636 images within 44,755 exams collected from 25,003 patients.

Results: the most exciting part!

Our results show that a hybrid AI-radiologist model decreased false-positive rates by 37.4% (that is, false suspicions of malignancy). This would reduce the number of requested biopsies by 27.8%, while maintaining the same level of sensitivity as radiologists (Figure 3).

Figure 3. Performance of the AI compared to readers: AUROC and AUPRC curves on the internal test set.

When acting independently, the AI system achieved higher area under the receiver operating characteristic curve (AUROC) and area under the precision recall curve (AUPRC) than individual readers. Figure 3 shows how each reader compares to the network’s performance.

Within the internal test set, the AI system maintained high diagnostic accuracy (0.940-0.990 AUROC) across all age groups, mammographic breast densities, and device manufacturers, including GE, Philips, and Siemens. In the biopsied population, it also achieved a 0.940 AUROC.

Figure 4. Performance across the internal and external test sets: the AI system achieves 0.976 and 0.911 AUROC, respectively.

In an external test set collected in Egypt, the system achieved 0.911 AUROC, highlighting its ability to generalize to patient demographics not seen during training (Figure 4).

Based on qualitative assessment, the network produced appropriate localization information for benign and malignant lesions through its saliency maps. In the exam shown in Figure 5, all 10 breast radiologists thought the lesion appeared suspicious for malignancy and recommended that it undergo biopsy, while the AI system correctly classified it as benign. Most impressively, the locations of lesions were never given during training, as the network was trained in a weakly supervised manner!

Figure 5. Saliency maps produced by the network, identifying benign (green) and malignant (red) findings.

Future work

For our next steps, we'd like to evaluate our system through prospective validation before it is widely deployed in clinical practice. That would let us measure its potential impact on the experience of the women around the world who undergo breast ultrasound examinations each year.

In conclusion, our work highlights the complementary role of an AI system in improving diagnostic accuracy by significantly decreasing unnecessary biopsies. Beyond improving radiologists’ performance, we have made technical contributions to the methodology of deep learning for medical imaging analysis.

This work would not have been possible without state-of-the-art computational resources. For more information, see the preprint, Artificial Intelligence System Reduces False-Positive Findings in the Interpretation of Breast Ultrasound Exams.


A Dataset for Studying Gender Bias in Translation

Advances in neural machine translation (NMT) have enabled more natural and fluid translations, but these can still reflect the societal biases and stereotypes of the data they’re trained on. As such, it is an ongoing goal at Google to develop innovative techniques to reduce gender bias in machine translation, in alignment with our AI Principles.

One research area has been using context from surrounding sentences or passages to improve gender accuracy – this is a challenge because traditional NMT methods translate sentences individually, but gendered information is not always explicitly stated in each individual sentence. For example, in the following passage in Spanish (a language where subjects aren’t always explicitly mentioned), the first sentence refers explicitly to Marie Curie as the subject, but the second one doesn’t explicitly mention the subject. In isolation, this second sentence could refer to a person of any gender. When translating to English, however, a pronoun needs to be picked, and the information needed for an accurate translation is in the first sentence.

 Spanish text: Marie Curie nació en Varsovia. Fue la primera persona en recibir dos premios Nobel en distintas especialidades.
 English translation: Marie Curie was born in Warsaw. She was the first person to receive two Nobel Prizes in different specialties.

Advancing translation techniques beyond single sentences requires new metrics for measuring progress and new datasets with the most common context-related errors. Adding to this challenge is the fact that translation errors related to gender (such as picking the correct pronoun or having gender agreement) are particularly sensitive because they may directly refer to people and how they self identify.

To help facilitate progress on the common challenges of contextual translation (e.g., pronoun drop, gender agreement, and accurate possessives), we are releasing the Translated Wikipedia Biographies dataset, which can be used to evaluate the gender bias of translation models. Our intent with this release is to support long-term improvements on ML systems focused on pronouns and gender in translation by providing a benchmark against which translation accuracy can be measured pre- and post-model changes.

A Source of Common Translation Errors
Because they are well-written, geographically diverse, contain multiple sentences, and refer to subjects in the third person (so contain plenty of pronouns), Wikipedia biographies offer a high potential for common translation errors associated with gender. These often occur when articles refer to a person explicitly in early sentences of a paragraph, but there is no explicit mention of the person in later sentences. Some examples:

 Pro-drop in Spanish → English
   Text: Marie Curie nació en Varsovia. Recibió el Premio Nobel en 1903 y en 1911.
   Translation: Marie Curie was born in Warsaw. He received the Nobel Prize in 1903 and in 1911.

 Neutral possessives in Spanish → English
   Text: Marie Curie nació en Varsovia. Su carrera profesional fue desarrollada en Francia.
   Translation: Marie Curie was born in Warsaw. His professional career was developed in France.

 Gender agreement in English → German
   Text: Marie Curie was born in Warsaw. The distinguished scientist received the Nobel Prize in 1903 and in 1911.
   Translation: Marie Curie wurde in Varsovia geboren. Der angesehene Wissenschaftler erhielt 1903 und 1911 den Nobelpreis.

 Gender agreement in English → Spanish
   Text: Marie Curie was born in Warsaw. The distinguished scientist received the Nobel Prize in 1903 and in 1911.
   Translation: Marie Curie nació en Varsovia. El distinguido científico recibió el Premio Nobel en 1903 y en 1911.

Building the Dataset
The Translated Wikipedia Biographies dataset has been designed to analyze common gender errors in machine translation such as those illustrated above. Each instance of the dataset represents a person (identified in the biographies as feminine or masculine), a rock band or a sports team (considered genderless). Each instance is represented by a long text translation of 8 to 15 connected sentences referring to that central subject (the person, rock band, or sports team). Articles are written in native English and have been professionally translated to Spanish and German. For Spanish, translations were optimized for pronoun-drop, so the same set could be used to analyze pro-drop (Spanish → English) and gender agreement (English → Spanish).

The dataset was built by selecting a group of instances with equal representation across geographies and genders. To do this, we extracted biographies from Wikipedia according to occupation, profession, job, and/or activity. To ensure an unbiased selection of occupations, we chose 9 occupations that represented a range of stereotypical gender associations (feminine, masculine, or neither) based on Wikipedia statistics. Then, to mitigate any geography-based bias, we divided all these instances by geographical diversity, aiming to have one candidate per region for each occupation category (using regions from census.gov as a proxy for geographical diversity). When an instance was associated with a region, we checked that the selected person had a relevant relationship with a country in that region (nationality, place of birth, having lived there for a large portion of their life, etc.). By these criteria, the dataset contains entries about individuals from more than 90 countries and all regions of the world.

Although gender is non-binary, we focused on having equal representation of “feminine” and “masculine” entities. It’s worth mentioning that because the entities are represented as such on Wikipedia, the set doesn’t include individuals that identify as non-binary, as unfortunately there are not enough instances currently represented in Wikipedia to accurately reflect the non-binary community. To label each instance as “feminine” or “masculine” we relied on the biographical information from Wikipedia, which contained gender-specific references to the person (she, he, woman, son, father, etc.).

After applying all these filters, we randomly selected an instance for each occupation-region-gender triplet. For each occupation and each of the 7 geographic regions, there are 2 biographies (one feminine and one masculine).

Finally, we added 12 instances with no gender. We picked rock bands and sports teams because they are usually referred to by non-gendered third-person pronouns (such as “it” or the singular “they”). The purpose of including these instances is to study over-triggering, i.e., when models learn that they are rewarded for producing gender-specific pronouns and so produce them in cases where they shouldn’t.

Results and Applications
This dataset enables a new method of evaluation for gender bias reduction in machine translation (introduced in a previous post). Because each instance refers to a subject with a known gender, we can compute the accuracy of the gender-specific translations that refer to that subject. This computation is easier when translating into English (from languages with pro-drop or neutral pronouns), since it is mainly based on gender-specific pronouns in English. In these cases, the gender datasets have allowed us to observe a 67% reduction in errors of context-aware models vs. previous models. As mentioned before, the neutral entities have allowed us to discover cases of over-triggering, like the usage of feminine or masculine pronouns to refer to genderless entities. This new dataset also enables new research directions into the performance of different models across types of occupations or geographic regions.
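
As a sketch of how such an accuracy computation might look for translations into English (a simplified illustration; the function and pronoun lists are my assumptions, not the released evaluation code):

 import re

 FEMININE = {"she", "her", "hers", "herself"}
 MASCULINE = {"he", "him", "his", "himself"}

 def pronoun_accuracy(translations, genders):
     """translations: English texts; genders: 'feminine' or 'masculine' labels."""
     correct = total = 0
     for text, gender in zip(translations, genders):
         tokens = re.findall(r"[a-z']+", text.lower())
         expected = FEMININE if gender == "feminine" else MASCULINE
         wrong = MASCULINE if gender == "feminine" else FEMININE
         hits = sum(t in expected for t in tokens)
         misses = sum(t in wrong for t in tokens)
         correct += hits
         total += hits + misses
     return correct / total if total else 1.0

 # Example: the pro-drop error from the table above scores 0.0.
 print(pronoun_accuracy(
     ["Marie Curie was born in Warsaw. He received the Nobel Prize in 1903 and in 1911."],
     ["feminine"]))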

As an example, the dataset allowed us to discover the following improvements in an excerpt of the translated biography of Marie Curie from Spanish.

Translation result with the previous NMT model.
Translation result with the new contextual model.

Conclusion
This Translated Wikipedia Biographies dataset is the result of our own studies and work on identifying biases associated with gender and machine translation. The set focuses on a specific problem related to gender bias and doesn’t aim to cover the whole problem. It’s worth mentioning that by releasing this dataset, we don’t aim to be prescriptive about the optimal approach to addressing gender bias. This contribution aims to foster progress on this challenge across the global research community.

Acknowledgements
The datasets were built with help from Anja Austermann, Melvin Johnson, Michelle Linch, Mengmeng Niu, Mahima Pushkarna, Apu Shah, Romina Stella, and Kellie Webster.


Accelerating JPEG 2000 Decoding for Digital Pathology and Satellite Images Using the nvJPEG2000 Library

JPEG 2000 (.jp2, .jpg2, .j2k) is an image compression standard defined by the Joint Photographers Expert Group (JPEG) as the more flexible successor to the still popular JPEG standard. Part 1 of the JPEG 2000 standard, which forms the core coding system, was first approved in August 2002. To date, the standard has expanded to 17 parts, covering areas like Motion JPEG 2000 (Part 3), which extends the standard to video, extensions for three-dimensional data (Part 10), and so on.

Features like mathematically lossless compression and higher precision and dynamic range per component helped JPEG 2000 find adoption in digital cinema applications. JPEG 2000 is also widely used in digital pathology and geospatial imaging, where image dimensions exceed 4K but regions of interest (ROI) stay small.

GPU acceleration using the nvJPEG2000 library

The JPEG 2000 feature set provides ample opportunities for GPU acceleration when compared to its predecessor, JPEG. With GPU acceleration, images can be decoded in parallel, and larger images can be processed faster. nvJPEG2000 is a new library that accelerates the decoding of JPEG 2000 images on NVIDIA GPUs. It supports codec features commonly used in geospatial imaging, remote sensing, and digital pathology. Figure 1 gives an overview of the decoding stages that nvJPEG2000 accelerates.

Figure 1. GPU-accelerated JPEG 2000 decode process. The first two stages run on the CPU; all remaining stages (Tier 1 decode, dequantization, IDWT, and the inverse component transform) are offloaded to the GPU.

The Tier 1 decode (entropy decode) stage is the most compute-intensive stage of the entire decode process. The entropy decode algorithm used in the legacy JPEG codec is serial in nature and hard to parallelize.

In JPEG 2000, the entropy decode stage is applied at a block-based granularity (typical block sizes are 64×64 and 32×32) that makes it possible to offload the entropy decode stage entirely to the GPU. For more information about the entropy decode process, see Section C of the JPEG 2000 Core coding system specification.

The JPEG 2000 core coding system allows for two types of wavelet transforms (5-3 Reversible and 9-7 Irreversible), both of which benefit from GPU acceleration. For more information about the wavelet transforms, see Section F of the JPEG 2000 Core coding system specification.

Decoding geospatial images

In this section, we concentrate on the new nvJPEG2000 API tailored for the geospatial domain, which enables decoding specific tiles within an image instead of decoding the full image.

Figure 2. A Sentinel-2 image (S2B_17RQK_20190908_0_L2A), one of a batch of 12 used to verify geospatial acceleration. Image size 10980×10980, tile size 1024×1024, 11×11 tiles, 1 component.

Imaging data captured by the European Space Agency’s Sentinel-2 satellites are stored as JPEG 2000 bitstreams. Sentinel-2 Level-2A data downloaded from the Copernicus hub can be used with the nvJPEG2000 decoding examples. The imaging data has 12 bands, or channels, and each of them is stored as an independent JPEG 2000 bitstream. The image in Figure 2 is subdivided into 121 tiles. To speed up the decode of multitile images, a new API called nvjpeg2kDecodeTile has been added in nvJPEG2000 v0.2, which enables you to decode each tile independently.

For multitile images, decoding each tile sequentially would be suboptimal. The GitHub multitile decode sample demonstrates how to decode each tile on a separate cudaStream_t. By taking this approach, you can decode multiple tiles on the GPU simultaneously. The Nsight Systems trace in Figure 3 shows the decoding of the Sentinel-2 dataset consisting of 12 bands. With 10 CUDA streams, up to 10 tiles are decoded in parallel at any point during the decode process.

Figure 3. Nsight Systems trace demonstrating the decoding of multiple tiles on separate CUDA streams.

Table 1 shows performance data comparing a single stream and multiple streams on a GV100 GPU.

 # of CUDA streams    Average decode time (ms)    Speedup over single-stream decode
 1                    0.888854                    (baseline)
 10                   0.227408                    75%

Table 1. Single-stream vs. multi-stream decode performance on a Quadro GV100 for the Sentinel-2 dataset.

Using 10 CUDA streams reduces the total decode time of the entire dataset by about 75% on a Quadro GV100 GPU. For more information, see the Accelerating Geospatial Remote Sensing Workflows Using NVIDIA SDKs [S32150] GTC’21 talk. It discusses geospatial image-processing workflows in more detail and the role nvJPEG2000 plays there.

Decoding digital pathology images

JPEG 2000 is used in digital pathology to store whole slide images (WSIs). Figure 4 gives an overview of various deep learning techniques that can be applied to WSIs. Deep learning models can be used to distinguish between cancerous and healthy cells, and image segmentation methods can be used to identify a tumor's location in the WSI. For more information, see Deep neural network models for computational histopathology: A survey.

Figure 4. Application workflows in digital pathology.

Table 2 lists key JPEG 2000 parameters and their commonly used values for a whole slide image (WSI).

 Image size         92000×201712
 Tile size          92000×201712
 # of tiles         1
 # of components    3
 Precision          8 bits

Table 2. Key JPEG 2000 parameters and their values as used in digital pathology.

The image in question is large, and it is not possible to decode the entire image at one time due to the amount of memory required: the decoded output alone is around 53 GB (92000 × 201712 pixels × 3 components), excluding the decoder's internal memory requirements.

There are several approaches to handling such large images. In this post, we describe two of them:

  • Decoding an area of interest
  • Decoding the image at lower resolution

Both approaches can be easily performed using specific nvJPEG2000 APIs.

Decoding an area of interest in an image

The nvJPEG2000 library enables the decoding of a specific area of interest in an image as part of the nvjpeg2kDecodeTile API. The following code example shows how to set the area of interest in terms of image coordinates. The nvjpeg2kDecodeParams_t type enables you to control decode output settings, such as the area of interest to decode.

 nvjpeg2kDecodeParams_t decode_params;
 nvjpeg2kDecodeParamsCreate(&decode_params); // create the decode parameter handle

 // all coordinate values are relative to the top-left corner of the image
 uint32_t top_coordinate, bottom_coordinate, left_coordinate, right_coordinate;
 uint32_t tile_id;

 nvjpeg2kDecodeParamsSetDecodeArea(decode_params, left_coordinate, right_coordinate, top_coordinate, bottom_coordinate);

 nvjpeg2kDecodeTile(nvjpeg2k_handle, nvjpeg2k_decode_state,
                    jpeg2k_stream, decode_params, tile_id, 0,
                    &nvjpeg2k_out, cuda_stream);

For more information about how to partially decode an image with multiple tiles, see the tile decode GitHub sample.

Decoding lower resolutions of an image

The second approach to decoding a large image is to decode it at lower resolutions. The ability to decode only the lower resolutions is a benefit of JPEG 2000's use of wavelet transforms. In Figure 5, the wavelet transform is applied up to two levels, which gives access to the image at three resolutions. By controlling how the inverse wavelet transform is applied, you can decode only the lower resolutions of an image.

Figure 5. Output of a 2D wavelet transform with two-level decomposition.

The digital pathology image described in Table 2 has 12 resolutions. This information can be retrieved on a per-tile basis:

 uint32_t num_res;
 uint32_t tile_id = 0;
 nvjpeg2kStreamGetResolutionsInTile(jpeg2k_stream, tile_id, &num_res);

The image has a size of 92000×201712 with 12 resolutions. If you choose to discard the four highest resolutions and decode the image up to eight resolutions, you can extract an image of size 5750×12574. By dropping four resolution levels, you scale each dimension down by a factor of 2^4 = 16.

 uint32_t num_res_to_decode = 8;
 // if num_res_to_decode > num_res, nvjpeg2kDecodeTile returns an
 // invalid-parameter error

 nvjpeg2kDecodeTile(nvjpeg2k_handle, nvjpeg2k_decode_state, jpeg2k_stream,
     decode_params, tile_id, num_res_to_decode, &nvjpeg2k_out, cuda_stream);

Performance benchmarks

To show the performance improvement that GPU decoding of JPEG 2000 brings, we compare the GPU-based nvJPEG2000 library with the CPU-based OpenJPEG library.

Figures 6 and 7 show the average speedup when decoding one image at a time. The following images are used in the measurements:

  • 1920×1080 8-bit image with 444 chroma subsampling
  • 3840×2160 8-bit image with 444 chroma subsampling
  • 3328×4096 12-bit grayscale
Figure 6. Speedup of lossless JPEG 2000 decode (5-3 DWT) over a 16-thread CPU implementation, on various GPUs: RTX A6000, A100, V100, RTX 8000, RTX 4000, and T4.
Figure 7. Speedup of lossy JPEG 2000 decode (9-7 DWT) over a 16-thread CPU implementation, on various GPUs: RTX A6000, A100, V100, RTX 8000, RTX 4000, and T4.

The CPU baseline in both figures is OpenJPEG running on an Intel Xeon Gold 6240 @ 2 GHz (3.9 GHz turbo, Cascade Lake) with hyper-threading on and 16 CPU threads per image.

On NVIDIA Ampere Architecture GPUs such as NVIDIA RTX A6000, the speedup factor is more than 8x for decoding. This speedup is measured for single-image latency.

Even higher speedups can be achieved by batching the decode of multiple images. Figures 8 and 9 compare the speed of decoding a 1920×1080 8-bit image with 444 chroma subsampling (Full HD) in both lossless and lossy modes respectively across multiple GPUs.

Figure 8. Batch-mode decode throughput for a 1920×1080 8-bit 444 image using the 5-3 wavelet transform (lossless decode), on various GPUs: A100, RTX A6000, V100, RTX 8000, RTX 4000, and T4.
Figure 9. Batch-mode decode throughput for a 1920×1080 8-bit 444 image using the 9-7 wavelet transform (lossy decode), on various GPUs: A100, RTX A6000, V100, RTX 8000, RTX 4000, and T4.

Figures 8 and 9 demonstrate the benefits of batched decode using the nvJPEG2000 library. The performance increase is significantly larger on GPUs with many streaming multiprocessors (SMs), such as the A100 and NVIDIA RTX A6000, than on GPUs with fewer SMs, such as the NVIDIA RTX 4000 and T4. Batching ensures that the available compute resources are used efficiently.

As observed in Figure 8, the decode speed on an NVIDIA RTX A6000 is 232 images per second at a batch size of 20. This is an additional 3x speedup over batch size 1, measured on a benchmark image with a low compression ratio (the compressed bitstream is only about 3x smaller than the uncompressed image). At higher compression ratios, the speedup is greater.

The following GitHub samples show how to achieve this speedup at both image and tile granularity:

Conclusion

The nvJPEG2000 library uses NVIDIA GPUs to accelerate JPEG 2000 decoding of both large images and large volumes of images, targeting specific image-processing tasks of interest. Decoding JPEG 2000 images with the nvJPEG2000 library can be as much as 8x faster on a GPU (NVIDIA RTX A6000) than on a CPU. Batching the decode of multiple images yields a further 3x speedup (24x faster than the CPU).

The simple nvJPEG2000 APIs make the library easy to include in your applications and workflows. It is also integrated into the NVIDIA Data Loading Library (DALI), a data loading and preprocessing library for accelerating deep learning applications. Using nvJPEG2000 together with DALI makes it easy to use JPEG 2000 images as part of deep learning training workflows.
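
As a rough sketch of that integration (file paths and pipeline parameters are illustrative assumptions, not from this post), a DALI pipeline can decode images on the GPU via the "mixed" decoder backend:

 from nvidia.dali import pipeline_def, fn

 @pipeline_def(batch_size=8, num_threads=4, device_id=0)
 def jp2_pipeline():
     # Read encoded files from disk; "jp2_images/" is a hypothetical directory.
     encoded, labels = fn.readers.file(file_root="jp2_images/")
     # device="mixed" offloads the decode work to the GPU.
     images = fn.decoders.image(encoded, device="mixed")
     return images, labels

 pipe = jp2_pipeline()
 pipe.build()
 images, labels = pipe.run()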

For more information, see the following resources:


Test data generator – model.evaluate()

Hello, I’m trying to measure the performance (accuracy and loss) of my model and I discovered the evaluate() function for this.

My test data (34 pictures) is saved in a ‘test’ folder, so I tried to create an ImageDataGenerator and then to generate my data using flow_from_directory.

I receive a “Found 34 images belonging to 1 classes.” message. However, the result I get in the terminal for the line result = seqModel.evaluate(data, batch_size=1, verbose=1) is a very weird one: 2/2 [==============================] - 0s 5ms/step - loss: 282.6923 - accuracy: 0.7353

Why do I receive “2/2” every time I run the script, no matter what batch_size I choose? And why is my loss 282.6923 while the accuracy is 0.7353? Doesn’t that look super weird? I know I’m doing something wrong, but I just can’t figure it out – maybe when creating the data generator, or maybe when using flow_from_directory? (When I pass the validationDataGenerator as the first argument, in order to test it, everything seems fine, but here I just can’t figure it out.)
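
A minimal sketch of one likely explanation (an assumption based on Keras behavior, not something confirmed in the post): evaluate() ignores its batch_size argument when the input is a generator, and flow_from_directory defaults to batch_size=32, so 34 images yield ceil(34/32) = 2 steps.

 from tensorflow.keras.preprocessing.image import ImageDataGenerator

 # Set the batch size on the generator itself; target_size and class_mode are
 # illustrative guesses, and seqModel is the model from the post.
 test_gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
     "test", target_size=(224, 224), batch_size=1,
     class_mode="binary", shuffle=False)

 result = seqModel.evaluate(test_gen, verbose=1)  # now reports 34/34 steps

One note on the large loss: if the training pipeline rescaled its inputs (for example by 1/255), the test generator must rescale identically; a mismatch there is a common cause of unusually large losses.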

A little bit of help would be appreciated. 🙂

submitted by /u/burgundicorn


What is the shape of the C object corresponding to this TFLite output?

I have a trained YOLOv5 model converted to .tflite format, having followed this guide.

I use this code to print the input and output shapes in Python:

 interpreter = tf.lite.Interpreter(
     # model_path="models/exported_resnet640.tflite")  # centernet_512x512 works correctly
     model_path="models/yolov5_working.tflite")

 interpreter.allocate_tensors()

 # Get input and output tensors.
 input_details = interpreter.get_input_details()
 output_details = interpreter.get_output_details()

 print("======================================================")
 print(input_details)
 print("======================================================")
 print(output_details)

 for detail in output_details:
     print(detail)
     print(" ")

and the output looks like this:

 [{'name': 'input_1', 'index': 0, 'shape': array([  1, 480, 480,   3], dtype=int32),
   'shape_signature': array([  1, 480, 480,   3], dtype=int32), 'dtype': <class 'numpy.float32'>,
   'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32),
   'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}]

 {'name': 'Identity', 'index': 422, 'shape': array([    1, 14175,     9], dtype=int32),
  'shape_signature': array([    1, 14175,     9], dtype=int32), 'dtype': <class 'numpy.float32'>,
  'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32),
  'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}

After invoking the interpreter with some input, I get an output looking like this:

 [[[0.01191081 0.01366316 0.02800988 ... 0.1661754  0.31489396 0.4217688 ]
   [0.02396268 0.01650745 0.0442626  ... 0.24655405 0.35853994 0.2839473 ]
   [0.04218047 0.01613732 0.0548977  ... 0.13136038 0.25760946 0.5338376 ]
   ...
   [0.82626414 0.9669814  0.4534862  ... 0.18754318 0.11680853 0.18492043]
   [0.8983849  0.9680944  0.64181983 ... 0.19781056 0.16431764 0.16926363]
   [0.9657682  0.9869368  0.5452545  ... 0.13321301 0.12015155 0.15937251]]]

Using the TensorFlow Lite c_api.h, I am trying to get the same output in C, but I cannot understand how to create the object that receives the data.

I have tried using a float*** of size 1 * 14175 * 9 * sizeof(float) and getting the output like so:

 int number_of_detections = 14175;
 struct filedata o_boxes;
 float ***box_coords = (float ***)malloc(sizeof(float **) * 1);

 box_coords[0] = (float **)malloc(sizeof(float *) * (int)number_of_detections);
 for (int i = 0; i < (int)number_of_detections; i++)
 {
     box_coords[0][i] = (float *)calloc(sizeof(float), 9); // box has 9 coordinates
 }

 o_boxes.data = box_coords;
 o_boxes.size = 1 * (int)number_of_detections * 9 + 1;

 const TfLiteTensor *output_tensor_boxes = TfLiteInterpreterGetOutputTensor(interpreter, 0);
 TfLiteTensorCopyToBuffer(output_tensor_boxes, o_boxes.data, o_boxes.size * sizeof(float));

 box_coords = (float ***)&o_boxes.data;

 for (int i = 0; i < o_boxes.size; i++)
 {
     for (int j = 0; j < 9; j++)
     {
         printf("%f ", box_coords[0][i][j]);
         fflush(stdout);
     }
     printf("\n");
 }

where struct filedata is a simple struct:

 struct filedata
 {
     void *data;
     size_t size;
 };

The result is some garbage big floats:

 39688651931648.000000 0.000000 39805756899328.000000 0.000000 39807166185472.000000 0.000000 39807367512064.000000 0.000000 39807568838656.000000

and after the first iteration I get a segmentation fault.

How should I create/allocate my float array to get my data?

submitted by /u/morphinnas