I'm currently trying to convert a TF Mask R-CNN model to TFLite so I can use it on a TPU. When I try to run the quantization code, I get the following error:
error: 'tf.TensorListReserve' op requires element_shape to be 1D tensor during TF Lite transformation pass
I’m not sure what is causing this error or how to fix it. Here’s the code:
import tensorflow as tf
import model as modellib
import coco
import os
import sys

# Enable eager execution
tf.compat.v1.enable_eager_execution()

class InferenceConfig(coco.CocoConfig):
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

config = InferenceConfig()

model = modellib.MaskRCNN(mode="inference", model_dir='logs', config=config)
model.load_weights('mask_rcnn_coco.h5', by_name=True)
model = model.keras_model
tf.saved_model.save(model, "tflite")

# Preparing before conversion - making the representative dataset
ROOT_DIR = os.path.abspath("../")
CARS = os.path.join(ROOT_DIR, 'Mask_RCNN\mrcnn\smallCar')
IMAGE_SIZE = 224
datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255)

def representative_data_gen():
    dataset_list = tf.data.Dataset.list_files(CARS)
    for i in range(100):
        image = next(iter(dataset_list))
        image = tf.io.read_file(image)
        image = tf.io.decode_jpeg(image, channels=3)
        image = tf.image.resize(image, [IMAGE_SIZE, IMAGE_SIZE])
        image = tf.cast(image / 255., tf.float32)
        image = tf.expand_dims(image, 0)
        yield [image]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# This enables quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# This sets the representative dataset for quantization
converter.representative_dataset = representative_data_gen
# This ensures that if any ops can't be quantized, the converter throws an error
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# For full integer quantization, though supported types defaults to int8 only, we explicitly declare it for clarity.
converter.target_spec.supported_types = [tf.int8]
# These set the input and output tensors to uint8 (added in r2.3)
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model = converter.convert()

with open('modelQuantized.tflite', 'wb') as f:
    f.write(tflite_model)
When I try to run the training process of my neural network on my GPU, I get these errors:
W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudnn64_8.dll'; dlerror: cudnn64_8.dll not found
W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
I have followed all the steps in the GPU installation guide. I have downloaded the latest NVIDIA GPU driver that is compatible with my graphics card. I have also downloaded the CUDA toolkit as well as the cuDNN packages, and I have made sure to add the cuDNN directories C:\cuda\bin, C:\cuda\include, and C:\cuda\lib\x64 to the PATH environment variable. I also checked that the file cudnn64_8.dll exists, and it does, in C:\cuda\lib\x64.
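For reference, a quick check from Python can confirm whether TensorFlow actually sees the GPU once the libraries are on PATH (these are standard TensorFlow calls; interpreting an empty list as "the CUDA/cuDNN DLLs still aren't being loaded" is the assumption here):

import tensorflow as tf

# Print TensorFlow's build info and the GPUs it can see.
print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
# If the last line prints an empty list, TensorFlow still cannot load the CUDA/cuDNN DLLs.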
I just started learning TensorFlow/Keras, and the paper I’m following says: “We use SGD with momentum of 0.9 to optimize for sum-squared error in the output of our model and use a learning rate of 0.0001 and a weight decay of 0.0001 to train for 5 epochs.” I’m trying to implement that in Keras.
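A minimal sketch of how that quoted setup could be expressed in Keras is below. It is illustrative only: build_model and its input shape are placeholders, sum-squared error is written as a custom loss, and the paper's weight decay is approximated with an L2 kernel regularizer.

import tensorflow as tf

# Hypothetical model; the real architecture comes from whatever the paper describes.
def build_model():
    reg = tf.keras.regularizers.l2(1e-4)  # rough stand-in for the paper's weight decay of 0.0001
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=reg, input_shape=(10,)),
        tf.keras.layers.Dense(1, kernel_regularizer=reg),
    ])

# Sum-squared error over the model output, as quoted from the paper.
def sum_squared_error(y_true, y_pred):
    return tf.reduce_sum(tf.square(y_true - y_pred), axis=-1)

model = build_model()
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9),
    loss=sum_squared_error,
)
# model.fit(x_train, y_train, epochs=5)  # 5 epochs, as in the paper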
I’m doing some testing using Google Cloud AI Platform and have seen some strange variation in training times. As an example, I did a test run that had an average training time of around 3.2 seconds per batch. I repeated it with the exact same hyperparameters and machine type and it took around 2.4 seconds the next time. Is there some explanation for this other than one GPU I’m assigned to being better in some way than another? That doesn’t really make sense either, but I don’t know how else to explain it.
I have a dataset of type:
<BatchDataset shapes: ((None, 256, 256, 3), (None,)), types: (tf.float32, tf.int32)>
How do I convert it into a dataset of type:
<PrefetchDataset shapes: {image: (256, 256, 3), label: ()}, types: {image: tf.uint8, label: tf.int64}>
in TensorFlow?
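One way this conversion is often done is to unbatch the dataset, map each element into a feature dict with the target dtypes, and then prefetch. The sketch below assumes the float images are scaled to [0, 1], so they are rescaled by 255 before the uint8 cast:

import tensorflow as tf

def to_dict_dataset(batched_ds):
    """batched_ds: dataset of (image, label) batches, shapes ((None, 256, 256, 3), (None,))."""
    # Undo the batching so each element is a single (image, label) pair.
    ds = batched_ds.unbatch()
    # Re-map each element into a dict with the target dtypes.
    ds = ds.map(
        lambda image, label: {
            "image": tf.cast(image * 255.0, tf.uint8),  # assumes images were rescaled to [0, 1]
            "label": tf.cast(label, tf.int64),
        },
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    # prefetch() yields a PrefetchDataset with the desired element spec.
    return ds.prefetch(tf.data.AUTOTUNE)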
Posted by Samuel J. Yang, Research Scientist and Dick Lyon, Principal Scientist, Google Research
For the ~466 million people in the world who are deaf or hard of hearing, the lack of easy access to accessibility services can be a barrier to participating in spoken conversations encountered daily. While hearing aids can help alleviate this, simply amplifying sound is insufficient for many. One additional option that may be available is the cochlear implant (CI), which is an electronic device that is surgically inserted into a part of the inner ear, called the cochlea, and stimulates the auditory nerve electrically via external sound processors. While many individuals with these cochlear implants can learn to interpret these electrical stimulations as audible speech, the listening experience can be quite varied and particularly challenging in noisy environments.
Modern cochlear implants drive electrodes with pulsatile signals (i.e., discrete stimulation pulses) that are computed by external sound processors. The main challenge still facing the CI field is how to best process sounds — to convert sounds to pulses on electrodes — in a way that makes them more intelligible to users. Recently, to stimulate progress on this problem, scientists in industry and academia organized a CI Hackathon to open the problem up to a wider range of ideas.
In this post, we share exploratory research demonstrating that a speech enhancement preprocessor — specifically, a noise suppressor — can be used at the input of a CI’s processor to enhance users’ understanding of speech in noisy environments. We also discuss how we built on this work in our entry for the CI Hackathon and how we will continue developing this work.
Improving CIs with Noise Suppression
In 2019, a small internal project demonstrated the benefits of noise suppression at the input of a CI’s processor. In this project, participants listened to 60 pre-recorded and pre-processed audio samples and ranked them by their listening comfort. CI users listened to the audio using their devices’ existing strategy for generating electrical pulses.
As shown below, both listening comfort and intelligibility usually increased, sometimes dramatically, when speech with noise (the lightest bar) was processed with noise suppression.
In an early research study, CI users showed improved listening comfort — qualitatively scored from “very poor” (0.0) to “OK” (0.5) to “very good” (1.0) — and improved speech intelligibility (i.e., the fraction of words in a sentence correctly transcribed) when noise suppression was applied to noisy speech samples.
For the CI Hackathon, we built on the project above, continuing to use a noise suppressor while also exploring an approach to computing the stimulation pulses themselves.
Overview of the Processing Approach
The hackathon considered a CI with 16 electrodes. Our approach decomposes the audio into 16 overlapping frequency bands, corresponding to the positions of the electrodes in the cochlea. Next, because the dynamic range of sound easily spans multiple orders of magnitude more than what we expect the electrodes to represent, we aggressively compress the dynamic range of the signal by applying “per-channel energy normalization” (PCEN). Finally, the range-compressed signals are used to create the electrodogram (i.e., what the CI displays on the electrodes).
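As an illustration of the range-compression step, the sketch below applies librosa's PCEN implementation to a 16-band mel spectrogram as a stand-in for the 16 overlapping electrode bands. The band decomposition, the example clip, and the PCEN parameters are assumptions, not the exact filters or settings used in the hackathon entry.

import librosa

# Load audio (librosa's bundled example clip; any 16 kHz mono waveform works).
waveform, sr = librosa.load(librosa.ex("trumpet"), sr=16000)

# Stand-in for the 16 overlapping frequency bands: a 16-band mel magnitude spectrogram.
bands = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=16, power=1.0)

# Per-channel energy normalization aggressively compresses the dynamic range
# of each band before it is mapped onto an electrode.
compressed = librosa.pcen(bands * (2 ** 31), sr=sr, gain=0.98, bias=2.0, power=0.3)

print(compressed.shape)  # (16, num_frames): one range-compressed envelope per electrode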
In addition, the hackathon required a submission be evaluated in multiple audio categories, including music, which is an important but notoriously difficult category of sounds for CI users to enjoy. However, the speech enhancement network was trained to suppress non-speech sounds, including both noise and music, so we needed to take extra measures to avoid suppressing instrumental music (note that in general, music suppression might be preferred by some users in certain contexts). To do this, we created a “mix” of the original audio with the noise-suppressed audio so that enough of the music would pass through to remain audible. We varied in real-time the fraction of original audio mixed from 0% to 40% (0% if all of the input is estimated as speech, up to 40% as more of the input is estimated as non-speech) based on the estimate from the open-source YAMNet classifier on every ~1 second window of audio of whether the input is speech or non-speech.
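A rough sketch of how such a speech-dependent mix might be computed with the open-source YAMNet classifier is shown below. The 0% to 40% range follows the description above; the windowing details and the use of YAMNet's "Speech" score are assumptions about how the estimate could be obtained.

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Load the open-source YAMNet classifier from TF Hub.
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

def mix_fraction(window):
    """Fraction of original audio (0.0 to 0.4) to mix back in for a ~1 s, 16 kHz mono window."""
    waveform = tf.convert_to_tensor(window, dtype=tf.float32)
    scores, _, _ = yamnet(waveform)              # scores: (num_frames, 521 classes)
    speech_prob = float(np.mean(scores[:, 0]))   # assumption: class 0 is "Speech" in YAMNet's class map
    return 0.4 * (1.0 - speech_prob)             # all speech -> 0%, all non-speech -> up to 40%

def mix(original, suppressed):
    """Blend the original and noise-suppressed audio for one window."""
    frac = mix_fraction(original)
    return frac * np.asarray(original) + (1.0 - frac) * np.asarray(suppressed)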
The Conv-TasNet Speech Enhancement Model
To implement a speech enhancement module that suppresses non-speech sounds, such as noise and music, we use the Conv-TasNet model, which can separate different kinds of sounds. To start, the raw audio waveforms are transformed and processed into a form that can be used by a neural network. The model transforms short, 2.5 millisecond frames of input audio with a learnable analysis transform to generate features optimized for sound separation. The network then produces two “masks” from those features: one mask for speech and one mask for noise. These masks indicate the degree to which each feature corresponds to either speech or noise. Separated speech and noise are reconstructed back to the audio domain by multiplying the masks with the analysis features, applying a synthesis transform back to audio-domain frames, and stitching the resulting short frames together. As a final step, the speech and noise estimates are processed by a mixture consistency layer, which improves the quality of the estimated waveforms by ensuring that they sum up to the original input mixture waveform.
Block diagram of the speech enhancement system, which is based on Conv-TasNet.
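As a small illustration of that final step, the sketch below applies the simplest form of mixture consistency, redistributing the residual equally between the speech and noise estimates. The actual layer in the system may use a weighted projection; the uniform split and the toy signals are assumptions.

import tensorflow as tf

def mixture_consistency(estimates, mixture):
    """Project source estimates so they sum exactly to the input mixture.

    estimates: (num_sources, num_samples) tensor, e.g. stacked [speech, noise] waveforms.
    mixture:   (num_samples,) tensor containing the original input waveform.
    """
    residual = mixture - tf.reduce_sum(estimates, axis=0)
    num_sources = tf.cast(tf.shape(estimates)[0], estimates.dtype)
    # Spread the reconstruction error evenly across the sources so the estimates
    # add back up to the original mixture.
    return estimates + residual / num_sources

# Toy usage: two imperfect estimates of a mixture of two sine tones.
t = tf.linspace(0.0, 1.0, 16000)
speech_est = 0.9 * tf.sin(2.0 * 3.14159 * 440.0 * t)
noise_est = 0.8 * tf.sin(2.0 * 3.14159 * 100.0 * t)
mixture = tf.sin(2.0 * 3.14159 * 440.0 * t) + tf.sin(2.0 * 3.14159 * 100.0 * t)
consistent = mixture_consistency(tf.stack([speech_est, noise_est]), mixture)
print(tf.reduce_max(tf.abs(tf.reduce_sum(consistent, axis=0) - mixture)).numpy())  # ~0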
The model is both causal and low latency: for each 2.5 milliseconds of input audio, it produces estimates of separated speech and noise, and thus could be used in real time. For the hackathon, we chose to use a model variant with 2.9 million parameters. This is too large to be practically implemented in a CI today, but it demonstrates what kind of performance would be possible with more capable hardware in the future.
Listening to the Results
As we optimized our models and overall solution, we used the hackathon-provided vocoder (which required a fixed temporal spacing of electrical pulses) to produce audio simulating what CI users might perceive. We then conducted blind A-B listening tests as typical hearing users.
Listening to the vocoder simulations below, the speech in the reconstructed sounds — from the vocoder processing the electrodograms — is reasonably intelligible when the input sound doesn’t contain too much background noise; however, there is still room to improve the clarity of the speech. Our submission performed well in the speech-in-noise category and achieved second place overall.
Simulated audio with fixed temporal spacing
Vocoder simulation of what CI users might perceive from audio from an electrodogram with fixed temporal spacing, with background noise and noise suppression applied.
A bottleneck on quality is that the fixed temporal spacing of stimulation pulses sacrifices fine-time structure in the audio. Changing the processing to produce pulses timed to peaks in the band-filtered sound waveforms captures more information about the pitch and structure of the sound than is conventionally represented in implant stimulation patterns.
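As a rough illustration of what "pulses timed to peaks" could mean for a single electrode channel, the sketch below band-filters a waveform and places pulses at its local maxima. The filter order, band edges, and minimum pulse gap are arbitrary assumptions, not the parameters of the modified processing described here.

import numpy as np
from scipy.signal import butter, sosfilt, find_peaks

def peak_timed_pulses(waveform, sample_rate, band_hz, min_pulse_gap_s=0.001):
    """Illustrative only: place stimulation pulses at peaks of one band-filtered channel.

    band_hz: (low, high) edges of the electrode's frequency band in Hz.
    min_pulse_gap_s: hypothetical refractory gap between pulses.
    """
    sos = butter(4, band_hz, btype="bandpass", fs=sample_rate, output="sos")
    band = sosfilt(sos, waveform)
    # Pulses at local maxima of the filtered waveform, at least min_pulse_gap_s apart.
    peaks, _ = find_peaks(band, distance=int(min_pulse_gap_s * sample_rate))
    pulse_times = peaks / sample_rate
    pulse_amplitudes = band[peaks]
    return pulse_times, pulse_amplitudes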
Simulated audio with adaptive spacing and fine time structure
Vocoder simulation, using the same vocoder as above, but on an electrodogram from the modified processing that synchronizes stimulation pulses to peaks of the sound waveform.
It’s important to note that this second vocoder output is overly optimistic about how well it might sound to a real CI user. For instance, the simple vocoder used here does not model how current spread in the cochlea blurs the stimulus, making it harder to resolve different frequencies. But this at least suggests that preserving fine-time structure is valuable and that the electrodogram itself is not the bottleneck.
Ideally, all processing approaches would be evaluated by a broad range of CI users, with the electrodograms implemented directly on their CIs rather than relying upon vocoder simulations.
Conclusion and a Call to Collaborate
We are planning to follow up on this experience in two main directions. First, we plan to explore the application of noise suppression to other hearing-accessibility modalities, including hearing aids, transcription, and vibrotactile sensory substitution. Second, we’ll take a deeper dive into the creation of electrodogram patterns for cochlear implants, exploiting fine temporal structure that is not accommodated in the usual CIS (continuous interleaved sampling) patterns that are standard in the industry. According to Loizou: “It remains a puzzle how some single-channel patients can perform so well given the limited spectral information they receive”. Therefore, using fine temporal structure might be a critical step towards achieving an improved CI experience.
Acknowledgements
We would like to thank the Cochlear Impact hackathon organizers for giving us this opportunity and partnering with us. The participating team within Google is Samuel J. Yang, Scott Wisdom, Pascal Getreuer, Chet Gnegy, Mihajlo Velimirović, Sagar Savla, and Richard F. Lyon with guidance from Dan Ellis and Manoj Plakal.
An impressive array of NVIDIA GDC announcements elevates game development to the next level. Real-time ray tracing comes to Arm and Linux, DLSS gets an expansive update, the newly announced RTX Memory Utility enables efficient memory allocation, and Omniverse supercharges the development workflow.
Increasingly, game developers are making full use of real-time ray tracing and AI in their games. As a result, more gamers than ever are enjoying the beautifully realized lighting and AI-boosted images that you can only achieve with NVIDIA technology. At GDC 2021, NVIDIA’s updates, enhancements, and platform compatibility expansions enable RTX to be turned ON for a larger base than ever before.
NVIDIA RTX enables game developers to integrate stunning real-time ray traced lighting into games. Now, NVIDIA ray tracing and AI are making their premiere on Arm and Linux systems. Arm processors are the engines inside of billions of power-efficient devices. Linux is an extensively adopted open-source operating system with an avid user base. Together, these two platforms offer a massive new audience for ray tracing technology. To show our commitment to nurturing the Arm and Linux gaming ecosystems, NVIDIA has prepared a demo of Wolfenstein: Youngblood and Amazon’s Bistro scene running RTX on an Arm-based MediaTek processor.
For more information, contact NVIDIA’s developer relations team or visit developer.nvidia.com.
New Ray Tracing SDK: RTX Memory Utility Improves Memory Allocation for Games
Real-time ray tracing elevates game visuals to new heights with dynamic, physically-accurate lighting running at interactive frame rates. Though the results of ray tracing are stunning, the process is computationally expensive and can put a strain on memory availability in hardware. To alleviate this heavy cost, NVIDIA is releasing a new open source ray tracing SDK, RTX Memory Utility (RTXMU), built to optimize and reduce memory consumption of acceleration structures.
RTXMU uses sophisticated compaction and suballocation techniques that eliminate wasted memory, resulting in a roughly 50% reduction in memory footprint. By freeing this space, developers can build larger and more robust ray tracing worlds than ever before.
RTXMU is easy to integrate, provides immediate benefits, and is available today.
DLSS Update Brings Linux Support, Streamlined Access, and New Customizable Options
A new DLSS update brings support for games running natively on Linux, alongside support for Vulkan API games on Proton introduced in June 2021. Arm support for DLSS has been announced as well. This update also brings a host of customizable options for both users and developers. New features include a Sharpening Slider that enables user-specific sharpening preferences, an Auto Mode that calibrates DLSS to the optimal quality given a particular resolution, and an Auto-Exposure Option that can improve image quality in low-contrast scenes.
Furthermore, access to DLSS has been streamlined: you no longer need to apply for access in order to download DLSS SDK 2.2.1. DLSS is game-changing software, and integrating it has never been easier. Learn more about using DLSS in NVIDIA’s DLSS Overview GDC session.
Read more about Linux and Arm support, new customizable options, and how to access DLSS in the DLSS 2.2.1 developer blog.
NvRTX: NVIDIA’s Branch of Unreal Engine Includes Best-in-Class Ray Tracing Technology
NVIDIA makes it easy for Unreal Engine developers to use RTX and AI in their games with the NvRTX branch, which adds RTX Direct Illumination (RTXDI), RTX Global Illumination (RTXGI), and DLSS to Unreal Engine 4.26 and Unreal Engine 5. Enhancing the game development ecosystem with support like NvRTX is incredibly important to NVIDIA; read more about how NVIDIA strives to empower game creation here.
Delve into how NVIDIA integrates ray tracing support into Unreal Engine with the NvRTX branch, solving ray tracing challenges on our end so that implementation is seamless and intuitive in the hands of developers. Watch the NvRTX Technical Overview GDC session.
Learn about how RTXDI and RTXGI in the NvRTX branch collaborate with a variety of Unreal Engine tools to enable artists to create ray traced scenes and effects. Watch the NvRTX Artists Guide GDC session.
Omniverse Accelerates Game Development to the Speed of Light
For game development, NVIDIA Omniverse offers the ultimate platform for collaboration across the many applications and development teams that must work in unison to push a game from concept to credits. Omniverse is a powerful collaboration tool that enables seamless communication and much more, providing engines for simulation, ray traced rendering, and AI development, to name a few. Learn about NVIDIA Omniverse’s extensive feature list and everything the platform can offer to game creation in the Omniverse at GDC blog.
Watch NVIDIA’s Omniverse Game Development GDC session, covering how to connect your favorite industry tools to the Omniverse Nucleus as well as how to use Extensions to build custom tools for the workflow.
NVIDIA Nsight Updates: Optimize and Debug GPU Performance on a Super-Granular Scale
NVIDIA Nsight enables developers to build and profile state-of-the-art games and applications that harness the full power of NVIDIA GPUs. Announced at GDC is a new serving of Nsight Developer Tools to improve the debugging and optimization process. Learn more about these new additions in the Nsight Tools update blog.
Included in the newly available developer tools is Nsight Systems 2021.3, the latest Nsight release, which adds new features for register dependency visualization. Learn more about Nsight Systems 2021.3 in the release blog.
You can also read more about Nsight Graphics 2021.3, NVIDIA’s tool for deep analysis of GPU performance, in the Nsight Graphics 2021.3 release blog.
Ensuring that the GPU is being fully utilized when performing ray tracing tasks can be challenging. Explore how Nsight and other NVIDIA Developer Tools allow for optimizing and debugging GPU performance in the Nsight: Developer Tools GDC session.
Watch a demo of Nsight Graphics in action below.
That’s a wrap on NVIDIA at GDC 2021!
Take a closer look at our game development SDKs and developer resources here.
For Python 3.8 and TensorFlow 2.5, I have a 3-D tensor of shape (3, 3, 3) where the goal is to compute the L2-norm for each of the three (3, 3) square matrices. The code that I came up with is:
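For reference, a per-matrix L2 norm can be computed directly with tf.norm. The sketch below assumes the Frobenius norm is what's intended, i.e., the square root of the sum of squared entries of each 3×3 matrix:

import tensorflow as tf

# Example (3, 3, 3) tensor: three 3x3 matrices stacked along axis 0.
t = tf.reshape(tf.range(27, dtype=tf.float32), (3, 3, 3))

# Frobenius (elementwise L2) norm of each (3, 3) matrix -> shape (3,).
frob = tf.norm(t, ord="euclidean", axis=[1, 2])

# Equivalent manual computation, for comparison.
manual = tf.sqrt(tf.reduce_sum(tf.square(t), axis=[1, 2]))

print(frob.numpy(), manual.numpy())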
Today NVIDIA announced the availability of the NVIDIA Arm HPC Developer Kit with the NVIDIA HPC SDK version 21.7. The DevKit is an integrated hardware-software platform for creating, evaluating, and benchmarking HPC, AI, and scientific computing applications for Arm server-based accelerated platforms. The HPC SDK v21.7 is the latest update of the software development kit and fully supports the new Arm HPC DevKit.
This DevKit targets heterogeneous GPU/CPU system development, and includes an Arm CPU, two NVIDIA A100 Tensor Core GPUs, two NVIDIA BlueField-2 data processing units (DPUs), and the NVIDIA HPC SDK suite of tools.
The integrated HW/SW DevKit delivers:
A validated system for quick and easy bring-up in a stable environment for accelerated computing code execution and evaluation, performance analysis, system experimentation, and system characterization.
A stable hardware and software platform for development and performance analysis of accelerated HPC, AI, and scientific computing applications
Experimentation and characterization of high-performance, NVIDIA-accelerated, Arm server-based system architectures
The NVIDIA Arm HPC Developer Kit is based on the GIGABYTE G242-P32 2U server, and leverages the NVIDIA HPC SDK, a comprehensive suite of compilers, libraries, and tools for HPC delivering performance, portability, and productivity. The platform will support Ubuntu, SLES, and RHEL operating systems.
HPC SDK 21.7 includes:
Full support for the NVIDIA Arm HPC Developer Kit
CUDA 11.4 support
HPC Compilers with Arm-specific performance enhancements including improved vectorization and optimized math functions
Maintenance support and bug fixes
Previously, HPC SDK 21.5 introduced the following:
A subset of Arm Neon intrinsics has been implemented in the HPC Compilers and can be enabled with -Mneon_intrinsics.
The NVIDIA HPC SDK C++ and Fortran compilers are the first compilers to support automatic GPU acceleration of standard language constructs including C++17 parallel algorithms and Fortran intrinsics.