Categories
Misc

New Releases of NVIDIA Nsight Systems and Nsight Graphics Debut at SIGGRAPH 2022

Graphics professionals and researchers have come together at SIGGRAPH 2022 to share their expertise and learn about recent innovations in the computer graphics industry.

NVIDIA Developer Tools is excited to be a part of this year’s event, hosting the hands-on lab Using Nsight to Optimize Ray-Tracing Applications, and announcing new releases for NVIDIA Nsight Systems and NVIDIA Nsight Graphics that are available for download now.

NVIDIA Nsight Systems 2022.3

The new 2022.3 release of Nsight Systems brings expanded Vulkan API support alongside improvements to the user experience.

Nsight Systems now supports Vulkan Video, the Vulkan solution for processing hardware-accelerated video files. In previous versions of Nsight Systems, a Vulkan Video workload would not be identified as a subset of the larger queue command it occupied. 

With full integration in Nsight Systems 2022.3, that ambiguity is removed and Vulkan Video coding work can be profiled directly in the timeline.

Figure 1. The Vulkan Video workload can be identified in the Nsight Systems timeline below the Vulkan tab

With the new VK_KHR_graphics_pipeline_library extension, Vulkan applications can now precompile shaders and link them at runtime at a substantially reduced cost. This is a critical feature for shader-heavy applications such as games, making its full support an exciting addition to Nsight Systems 2022.3.

To round out the new version, visual improvements to multi-report viewing have been made for better clarity. For Linux machines, improved counters for the CPU, PMU, and OS make system-wide performance tracing more precise. A host of bug fixes accompany these updates.

Learn more about Nsight Systems 2022.3.

NVIDIA Nsight Graphics 2022.4

Nsight Graphics 2022.4 introduces a robust set of upgrades to its most powerful profiling tools.

In the 2022.4 release, the API Inspector has been redesigned. The new design includes an improved display, search functions within API Inspector pages, significantly enhanced constant buffer views, and data export for data persistence and offline comparison.

Watch the updated demonstration video (below) from the Nsight Graphics team to learn about all the new features and improved interface:

Video 1. A demonstration of the new Nsight Graphics features and improved interface

Nsight Graphics GPU Trace is a detailed performance timeline that tracks GPU throughput, enabling meticulous profiling of hardware utilization. To support graphics development across APIs, GPU Trace now supports generating traces and analysis for OpenGL applications on Windows and Linux.

Figure 2. Full GPU utilization timeline for an OpenGL application captured by NVIDIA Nsight Graphics

Also new to GPU Trace, you can now identify subchannel switches with an event overlay. A subchannel switch occurs when the GPU swaps between compute and graphics calls in the same hardware queue, causing the GPU to briefly idle. In the interest of performance, it is best to minimize subchannel switches, and the overlay makes them easy to spot on the timeline.

The shader profiler summary has also been expanded, with new columns for per-shader register counts as well as theoretical warp occupancy.

Figure 3. Expanded shader profiler summary with new columns identifying per-shader register counts and theoretical warp occupancy

Nsight Graphics 2022.4 is wrapped up with support for enhanced barriers, which are available in recent DirectX 12 Agility SDKs. Applications that use either enhanced barriers or traditional barriers are now equally supported. Learn more about all of the new additions to Nsight Graphics 2022.4.

Nsight Deep Learning Designer 2022.2

A new version of Nsight Deep Learning Designer is available now. The 2022.2 update features expanded support for importing PyTorch models, as well as the ability to launch the PyTorch exporter from a virtual environment. Performance improvements have also been made to the Channel Inspector and to path-finding to reduce overhead.

Paired with this release, NVIDIA Feature Map Explorer 2022.1 is available now, offering measurable performance boosts to its feature map loading process alongside additional metrics for tracking tensor values. Learn more about Nsight Deep Learning Designer 2022.2 and NVIDIA Feature Map Explorer 2022.1.

Get the latest Nsight releases

Additional resources

Watch a guided walkthrough about using Nsight tools to work through real-life development scenarios.


Want to help us build better tools for you? Share your thoughts through the NVIDIA Nsight Graphics Survey, which takes less than one minute to complete.

Categories
Misc

Reimagining Drug Discovery with Computational Biology at GTC 2022

Take a deep dive into the latest advances in drug research with AI and accelerated computing at these GTC 2022 featured sessions.

Categories
Misc

Design in the Age of Digital Twins: A Conversation With Graphics-Pioneer Donald Greenberg

Asked about the future of design, Donald Greenberg holds up a model of a human aorta. “After my son became an intravascular heart surgeon at the Cleveland Clinic, he hired one of my students to use CAT scans and create digital 3D models of an aortic aneurysm,” said the computer graphics pioneer in a video Read article >

Categories
Misc

Unlocking a Simple, Extensible, and Performant Video Pipeline at Fyma with NVIDIA DeepStream

Providing computer vision in the cloud and at scale is a complex task. Fyma, a computer vision company, is tackling this complexity with the help of NVIDIA DeepStream.

A relatively new company, Fyma turns video into data: more specifically, movement data in physical space. The Fyma platform consumes customers’ live video streams all day, every day, and produces movement events (someone walking through a doorway or down a store aisle, for example).

One of the early lessons they learned is that their video-processing pipeline has to be simple, extensible, and performant, all at the same time. With limited development resources, they could initially achieve only one of those three. NVIDIA DeepStream has recently unlocked the ability to have all three simultaneously by shortening development times, increasing performance, and building on excellent software components such as GStreamer.

Challenges with live video streaming

Fyma is focused on consuming live video streams to ease implementation for their customers. Customers can be hesitant to implement sensors or any additional hardware on their premises, as they have already invested in security cameras. Since these cameras can be anywhere, Fyma can provide different object detection models to maximize accuracy in different environments.

Consuming live video streams is challenging in multiple aspects:

  • Cameras sometimes produce broken video (presentation/decoding timestamps jump, reported framerate is wrong)
  • Network issues cause video streams to freeze, stutter, jump, or go offline
  • CPU/memory load distribution and planning isn’t straightforward
  • Live video streams are infinite

The infinite nature of live video streams means that Fyma’s platform must perform computer vision at least as quickly as frames arrive. Basically, the whole pipeline must work in real time. Otherwise, frames would accumulate endlessly.
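
As a back-of-the-envelope illustration of that constraint (the numbers below are hypothetical, not Fyma's):

```python
# Back-of-the-envelope check of the real-time constraint (hypothetical numbers).
# If the pipeline cannot keep up with the aggregate incoming frame rate,
# unprocessed frames accumulate without bound.

CAMERA_FPS = 25       # frames per second produced by one live stream (assumed)
NUM_CAMERAS = 30      # streams handled by one GPU instance (assumed)
PIPELINE_FPS = 1000   # frames per second the detection pipeline can sustain (assumed)

incoming_fps = CAMERA_FPS * NUM_CAMERAS
headroom = PIPELINE_FPS - incoming_fps

print(f"Incoming: {incoming_fps} fps, capacity: {PIPELINE_FPS} fps")
if headroom < 0:
    # Every second the backlog grows by |headroom| frames and never shrinks.
    print(f"Backlog grows by {-headroom} frames per second -- not real time")
else:
    print(f"{headroom} fps of headroom -- the pipeline keeps up")
```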

Luckily, object detection has steadily improved in the last few years in terms of speed and accuracy. This means being able to detect objects in more than 1,000 images per second with mAP over 90%. Such advancements have enabled Fyma to provide computer vision at scale at a reasonable price to their customers.

Providing physical space analytics using computer vision (especially in real time) involves a lot more than just object detection. According to Kaarel Kivistik, Head of Software Development at Fyma, “To actually make something out of these objects we need to track them between frames and use some kind of component to analyze the behavior as well. Considering that each customer can choose their own model, set up their own analytics, and generate reports from gathered data, a simple video processing pipeline becomes a behemoth of a platform.”

Version 1: Hello world

Fyma began by coupling OpenCV and ffmpeg to a very simple Python application. Nothing was hardware-accelerated except the neural network; they were using YOLOv3 with Darknet at the time. Performance was poor, around 50-60 frames per second, despite their use of an AWS g4dn.xlarge instance with an NVIDIA Tesla T4 GPU (which they continue to use). The application functioned like this (a rough sketch follows the list):

  • OpenCV for capturing the video
  • Darknet with Python bindings to detect objects
  • A homemade IoU-based multi-object tracker
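
A minimal sketch of what such a first version might look like, assuming a placeholder detect_objects function in place of the real Darknet Python bindings; this is illustrative, not Fyma's code:

```python
import cv2

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter + 1e-9)

def detect_objects(frame):
    """Placeholder for the Darknet/YOLOv3 call; returns a list of (x1, y1, x2, y2) boxes."""
    return []  # hypothetical: the real version ran GPU inference here

tracks = {}           # track_id -> last known bounding box
next_track_id = 0

cap = cv2.VideoCapture("rtsp://camera.example/stream")  # software decode on the CPU
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    for box in detect_objects(frame):        # frame copied between CPU and GPU memory
        # Greedy IoU matching against existing tracks.
        best_id, best_score = None, 0.3      # 0.3 = minimum overlap to keep a track alive
        for track_id, prev_box in tracks.items():
            score = iou(box, prev_box)
            if score > best_score:
                best_id, best_score = track_id, score
        if best_id is None:
            best_id, next_track_id = next_track_id, next_track_id + 1
        tracks[best_id] = box
cap.release()
```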

While the implementation was fairly simple, it was not enough to scale. The poor performance was caused by three factors: 

  • Software video decoding
  • Copying decoded video frames between processes and between CPU/GPU memory
  • Software encoding the output while drawing detections on it

They worked to improve the first version with hardware video decoding and encoding. At the time, that didn’t increase overall speed by much since they still copied decoded frames from GPU to CPU memory and then back to GPU memory.

Version 2: Custom ffmpeg encoder

A real breakthrough in terms of speed came with a custom ffmpeg encoder, which was basically a wrapper around Darknet turning video frames into detected objects. Frame rates increased tenfold since they were now decoding on hardware without copying video frames between host and device memory. 

The trade-off was that part of their application was now written in C and carried the added complexity of ffmpeg and its highly complex build system. Still, the new component didn't need much changing and proved to be quite reliable.

One downside to this system was that they were now constrained to using Darknet.

Version 2.1: DeepSORT

To improve object tracking accuracy, Fyma replaced the homemade IoU-based tracker with DeepSORT. The results were good, but they needed to change their custom encoder to output the visual appearance features that DeepSORT requires for tracking, in addition to bounding boxes.

Bringing in DeepSORT improved accuracy but created another problem: depending on the video content, it sometimes used a lot of CPU memory. To mitigate this, the team resorted to “asynchronous tracking.” Essentially a worker-based approach, it involved each worker consuming metadata consisting of bounding boxes and producing events about object movement. While this resolved the problem of uneven CPU usage, it once again made the overall architecture more complex.
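
A rough sketch of how such a worker-based approach could be structured; the queue layout, message format, and doorway rule below are hypothetical illustrations (workers are shown as threads for brevity), not Fyma's implementation:

```python
import queue
import threading

# Metadata produced by the detection stage: one message per frame per camera.
# Each message is assumed to look like:
#   {"camera": "entrance-1", "ts": 1660000000.0,
#    "boxes": [{"track_id": 7, "xyxy": (120, 40, 210, 300)}]}
metadata_queue: "queue.Queue[dict]" = queue.Queue(maxsize=10_000)

def crossed_doorway(prev_box, curr_box, line_y=200):
    """Hypothetical rule: the box centre moved across a horizontal line."""
    prev_cy = (prev_box[1] + prev_box[3]) / 2
    curr_cy = (curr_box[1] + curr_box[3]) / 2
    return (prev_cy - line_y) * (curr_cy - line_y) < 0

def tracking_worker(events_out):
    """Consume per-frame bounding-box metadata and emit movement events."""
    last_seen = {}  # (camera, track_id) -> last box
    while True:
        msg = metadata_queue.get()
        for det in msg["boxes"]:
            key = (msg["camera"], det["track_id"])
            prev = last_seen.get(key)
            if prev is not None and crossed_doorway(prev, det["xyxy"]):
                events_out.append({"camera": msg["camera"],
                                   "track_id": det["track_id"],
                                   "event": "doorway_crossed",
                                   "ts": msg["ts"]})
            last_seen[key] = det["xyxy"]
        metadata_queue.task_done()

events: list[dict] = []
threading.Thread(target=tracking_worker, args=(events,), daemon=True).start()
# The detection stage would push its per-frame metadata onto metadata_queue.
```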

Version 3: Triton Inference Server

While previous versions performed well, Fyma found that they still couldn’t run enough cameras on each GPU. Each video stream on their platform had an individual copy of whatever model it was using. If they could reduce the memory footprint of a single camera, it would be possible to squeeze a lot more out of their GPU instances.

Fyma decided to rewrite the ffmpeg-related parts of their application. More specifically, the application now interfaces with ffmpeg libraries (libav) directly through custom Python bindings. 

This allowed Fyma to connect their application to NVIDIA Triton Inference Server, which enabled sharing neural networks across camera streams. To keep the core of their object detection code the same, they moved their custom ffmpeg encoder code into a custom Triton backend.
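
For illustration, a minimal Triton client call might look like the following sketch; the model name and tensor names are assumptions, and Fyma's actual setup routes frames through its custom backend:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Connect to a Triton Inference Server instance; the model is shared by all camera streams.
client = grpcclient.InferenceServerClient(url="localhost:8001")

def detect(frame_chw: np.ndarray) -> np.ndarray:
    """Send one preprocessed frame (CHW float32) to Triton and return the raw output tensor."""
    batch = frame_chw[np.newaxis].astype(np.float32)        # shape (1, 3, H, W)

    infer_input = grpcclient.InferInput("input", list(batch.shape), "FP32")  # tensor name assumed
    infer_input.set_data_from_numpy(batch)
    requested = grpcclient.InferRequestedOutput("detections")                # output name assumed

    result = client.infer(model_name="object_detector",                      # model name assumed
                          inputs=[infer_input],
                          outputs=[requested])
    return result.as_numpy("detections")
```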

While this solved the memory issues, it increased the complexity of Fyma’s application by at least three times.

Version 4: DeepStream

The latest version of Fyma’s application is a complete rewrite based on GStreamer and NVIDIA DeepStream. 

“A pipeline-based approach with accelerated DeepStream components is what really kicked us into gear,” Kivistik said. “Also, the joy of throwing all the previous C-based stuff into the recycle bin while not compromising on performance, it’s really incredible. We took everything that DeepStream offers: decoding, encoding, inference, tracking and analytics. We were back to synchronous tracking with a steady CPU/GPU usage thanks to nvtracker.” 

This meant events were now arriving in their database in almost real time. Previously, this data would be delayed up to a few hours, depending on how many workers were present and the general “visual” load (how many objects the whole platform was seeing).

Fyma’s current implementation runs a master process for each GPU instance. This master process in turn runs a GStreamer pipeline for each video stream added to the platform. Memory overhead for each camera is low since everything runs in a single process.
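
A heavily simplified sketch of one such per-stream DeepStream pipeline in Python follows; the RTSP URI, config paths, and element properties are placeholders, and a production pipeline would also attach pad probes to read the analytics metadata:

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# One pipeline per camera stream: decode -> batch -> infer -> track -> analytics -> sink.
# The gst-launch style description keeps the sketch short; paths and the URI are placeholders.
pipeline_description = (
    "uridecodebin uri=rtsp://example.com/stream ! "
    "nvvideoconvert ! video/x-raw(memory:NVMM) ! mux.sink_0 "
    "nvstreammux name=mux batch-size=1 width=1280 height=720 ! "
    "nvinfer config-file-path=detector_config.txt ! "
    "nvtracker ll-lib-file=libnvds_nvmultiobjecttracker.so ! "
    "nvdsanalytics config-file=analytics_config.txt ! "
    "fakesink sync=false"
)

pipeline = Gst.parse_launch(pipeline_description)
pipeline.set_state(Gst.State.PLAYING)

# Movement events would normally be read from the analytics metadata via a pad probe;
# here we simply run the main loop until interrupted.
try:
    GLib.MainLoop().run()
finally:
    pipeline.set_state(Gst.State.NULL)
```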

Regarding end-to-end performance (decoding, inference, tracking, and analytics), Fyma is achieving frame rates up to 10x faster (around 500 fps for a single video stream), with accuracy improved 2-3x compared to their very first implementation. And Fyma was able to implement DeepStream in less than two months.

“I think we can finally say that we now have simplicity, with a codebase that is not that large; extensibility, since we can easily switch out models and change the video pipeline; and performance,” Kivistik said.

“Using DeepStream really is a no-brainer for every software developer or data scientist who wants to create production-grade computer vision applications.” 

Summary

Using NVIDIA DeepStream, Fyma was able to unlock the power of its AI models and increase the performance of its vision AI applications while speeding up development time. If you would like to do the same and supercharge your development, visit the DeepStream SDK product page and DeepStream Getting Started.

Categories
Misc

AI Flying Off the Shelves: Restocking Robot Rolls Out to Hundreds of Japanese Convenience Stores

Tokyo-based startup Telexistence this week announced it will deploy NVIDIA AI-powered robots to restock shelves at hundreds of FamilyMart convenience stores in Japan. There are 56,000 convenience stores in Japan — the third-highest density worldwide. Around 16,000 of them are run by FamilyMart. Telexistence aims to save time for these stores by offloading repetitive tasks…

Categories
Offsites

Efficient Video-Text Learning with Iterative Co-tokenization

Video is a ubiquitous source of media content that touches on many aspects of people’s day-to-day lives. Increasingly, real-world video applications, such as video captioning, video content analysis, and video question-answering (VideoQA), rely on models that can connect video content with text or natural language. VideoQA is particularly challenging, however, as it requires grasping both semantic information, such as objects in a scene, and temporal information, e.g., how things move and interact, both of which must be interpreted in the context of a natural-language question that carries specific intent. In addition, because videos have many frames, processing all of them to learn spatio-temporal information can be computationally expensive. Nonetheless, understanding all this information enables models to answer complex questions — for example, in the video below, a question about the second ingredient poured in the bowl requires identifying objects (the ingredients), actions (pouring), and temporal ordering (second).

An example input question for the VideoQA task “What is the second ingredient poured into the bowl?” which requires deeper understanding of both the visual and text inputs. The video is an example from the 50 Salads dataset, used under the Creative Commons license.

To address this, in “Video Question Answering with Iterative Video-Text Co-Tokenization”, we introduce a new approach to video-text learning called iterative co-tokenization, which is able to efficiently fuse spatial, temporal and language information for VideoQA. This approach is multi-stream, processing videos at different scales with an independent backbone model for each to produce video representations that capture different features, e.g., those of high spatial resolution or long temporal duration. The model then applies the co-tokenization module to learn efficient representations from fusing the video streams with the text. This model is highly efficient, using only 67 giga-FLOPs (GFLOPs), which is at least 50% fewer than previous approaches, while giving better performance than alternative state-of-the-art models.

Video-Text Iterative Co-tokenization
The main goal of the model is to produce features from both videos and text (i.e., the user question), jointly allowing their corresponding inputs to interact. A second goal is to do so in an efficient manner, which is highly important for videos since they contain tens to hundreds of frames as input.

The model learns to tokenize the joint video-language inputs into a smaller set of tokens that jointly and efficiently represent both modalities. When tokenizing, we use both modalities to produce a joint compact representation, which is fed to a transformer layer to produce the next-level representation. A challenge here, which is also typical in cross-modal learning, is that the video frame often does not correspond directly to the associated text. We address this by adding two learnable linear layers that unify the visual and text feature dimensions before tokenization. This way, we enable both video and text to condition how video tokens are learned.

Moreover, a single tokenization step does not allow for further interaction between the two modalities. For that, we use this new feature representation to interact with the video input features and produce another set of tokenized features, which are then fed into the next transformer layer. This iterative process allows the creation of new features, or tokens, which represent a continual refinement of the joint representation from both modalities. At the last step the features are input to a decoder that generates the text output.
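
As a toy illustration of the idea (not the paper's exact architecture), the iterative co-tokenization loop could be sketched in PyTorch as follows; the dimensions, attention-based tokenizer, and layer choices are assumptions:

```python
import torch
import torch.nn as nn

class IterativeCoTokenizer(nn.Module):
    """Toy version: unify video/text dims, fuse, tokenize, and refine over a few iterations."""

    def __init__(self, video_dim=512, text_dim=768, joint_dim=256, num_tokens=16, iters=3):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, joint_dim)   # learnable layer unifying video dim
        self.text_proj = nn.Linear(text_dim, joint_dim)     # learnable layer unifying text dim
        # Learned queries select a small set of joint tokens via cross-attention.
        self.token_queries = nn.Parameter(torch.randn(num_tokens, joint_dim))
        self.tokenizer_attn = nn.MultiheadAttention(joint_dim, num_heads=4, batch_first=True)
        self.transformer = nn.TransformerEncoderLayer(joint_dim, nhead=4, batch_first=True)
        self.iters = iters

    def forward(self, video_feats, text_feats):
        # video_feats: (B, Nv, video_dim) features from the multi-stream video backbones
        # text_feats:  (B, Nt, text_dim)  features from the question encoder
        v = self.video_proj(video_feats)
        t = self.text_proj(text_feats)
        fused = torch.cat([v, t], dim=1)                     # joint video-language features
        tokens = self.token_queries.expand(v.size(0), -1, -1)
        for _ in range(self.iters):
            # Current tokens attend to the fused features, producing a refined compact set.
            tokens, _ = self.tokenizer_attn(tokens, fused, fused)
            tokens = self.transformer(tokens)                # next-level joint representation
        return tokens                                        # fed to the answer decoder

# Example usage with random features.
model = IterativeCoTokenizer()
out = model(torch.randn(2, 100, 512), torch.randn(2, 20, 768))
print(out.shape)  # torch.Size([2, 16, 256])
```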

As customarily done for VideoQA, we pre-train the model before fine-tuning it on the individual VideoQA datasets. In this work, instead of pre-training on a large VideoQA dataset, we use videos from the HowTo100M dataset that are automatically annotated with text via speech recognition. This weaker pre-training data still enables our model to learn video-text features.

Visualization of the video-text iterative co-tokenization approach. Multi-stream video inputs, which are versions of the same video input (e.g., a high resolution, low frame-rate video and a low resolution, high frame-rate video), are efficiently fused together with the text input to produce a text-based answer by the decoder. Instead of processing the inputs directly, the video-text iterative co-tokenization model learns a reduced number of useful tokens from the fused video-language inputs. This process is done iteratively, allowing the current feature tokenization to affect the selection of tokens at the next iteration, thus refining the selection.

Efficient Video Question-Answering
We apply the video-language iterative co-tokenization algorithm to three main VideoQA benchmarks, MSRVTT-QA, MSVD-QA, and IVQA, and demonstrate that this approach achieves better results than other state-of-the-art models, while having a modest size. Furthermore, iterative co-tokenization learning yields significant compute savings for video-text learning tasks. The method uses only 67 giga-FLOPs (GFLOPs), which is one sixth of the 360 GFLOPs needed when using the popular 3D-ResNet video model jointly with text, and is more than twice as efficient as the X3D model. All the while, it produces highly accurate results, outperforming state-of-the-art methods.

Comparison of our iterative co-tokenization approach to previous methods such as MERLOT and VQA-T, as well as baselines using a single ResNet-3D or X3D-XL.

Multi-stream Video Inputs
For VideoQA, or any of a number of other tasks that involve video inputs, we find that multi-stream input is important to more accurately answer questions about both spatial and temporal relationships. Our approach utilizes three video streams at different resolutions and frame rates: a low-resolution, high frame-rate input video stream (32 frames per second at spatial resolution 64×64, which we denote as 32x64x64); a high-resolution, low frame-rate video (8x224x224); and one in between (16x112x112). Despite the apparently larger volume of information to process with three streams, we obtain very efficient models thanks to the iterative co-tokenization approach. At the same time, these additional streams allow extraction of the most pertinent information. For example, as shown in the figure below, questions related to a specific activity in time produce higher activations in the lower resolution but high frame-rate video input, whereas questions related to the general activity can be answered from the high-resolution input with very few frames. Another benefit of this algorithm is that the tokenization changes depending on the question asked.

Visualization of the attention maps learned per layer during the video-text co-tokenization. The attention maps differ depending on the question asked for the same video. For example, if the question is related to the general activity (e.g., surfing in the figure above), then the attention maps of the higher resolution, low frame-rate inputs are more active and seem to consider more global information. If the question is more specific, e.g., asking about what happens after an event, the feature maps are more localized and tend to be active in the high frame-rate video input. Furthermore, we see that the low-resolution, high frame-rate video inputs provide more information related to activities in the video.
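
To make the multi-stream setup concrete, the three inputs described above could be derived from a single clip as in the following sketch; the interpolation-based resampling is only an illustration, not the paper's preprocessing:

```python
import torch
import torch.nn.functional as F

def make_streams(video):
    """video: (B, C, T, H, W) clip; returns the three multi-stream inputs described above."""
    specs = {
        "low_res_high_fps": (32, 64, 64),     # 32 frames at 64x64
        "mid": (16, 112, 112),                # 16 frames at 112x112
        "high_res_low_fps": (8, 224, 224),    # 8 frames at 224x224
    }
    streams = {}
    for name, (t, h, w) in specs.items():
        # Trilinear interpolation resamples time and space jointly (illustrative choice).
        streams[name] = F.interpolate(video, size=(t, h, w), mode="trilinear",
                                      align_corners=False)
    return streams

clip = torch.randn(1, 3, 64, 256, 256)        # dummy clip: 64 frames at 256x256
for name, s in make_streams(clip).items():
    print(name, tuple(s.shape))
```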

Conclusion
We present a new approach to video-language learning that focuses on joint learning across video-text modalities. We address the important and challenging task of video question-answering. Our approach is both highly efficient and accurate, outperforming current state-of-the-art models despite its smaller compute budget. Our approach results in modest model sizes and can gain further improvements with larger models and data. We hope this work spurs more research in vision-language learning to enable more seamless interaction with vision-based media.

Acknowledgements
This work was conducted by AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael Ryoo, and Anelia Angelova. We thank our collaborators in this research, Soravit Changpinyo for valuable comments and suggestions, and Claire Cui for suggestions and support. We also thank Tom Small for visualizations.

Categories
Misc

As Far as the AI Can See: ILM Uses Omniverse DeepSearch to Create the Perfect Sky

For cutting-edge visual effects and virtual production, creative teams and studios benefit from digital sets and environments that can be updated in real time. A crucial element in any virtual production environment is a sky dome, often used to provide realistic lighting for virtual environments and in-camera visual effects. Legendary studio Industrial Light & Magic…

Categories
Misc

NVIDIA AI Makes Performance Capture Possible With Any Camera

NVIDIA AI tools are enabling deep learning-powered performance capture for creators at every level: visual effects and animation studios, creative professionals — even any enthusiast with a camera. With NVIDIA Vid2Vid Cameo, creators can harness AI to capture their facial movements and expressions from any standard 2D video taken with a professional camera or smartphone.

Categories
Misc

At SIGGRAPH, NVIDIA CEO Jensen Huang Illuminates Three Forces Sparking Graphics Revolution

In a swift, eye-popping special address at SIGGRAPH, NVIDIA execs described the forces driving the next era in graphics, and the company’s expanding range of tools to accelerate them. “The combination of AI and computer graphics will power the metaverse, the next evolution of the internet,” said Jensen Huang, founder and CEO of NVIDIA, kicking…

Categories
Misc

Future of Creativity on Display ‘In the NVIDIA Studio’ During SIGGRAPH Special Address

A glimpse into the future of AI-infused virtual worlds was on display at SIGGRAPH — the world’s largest gathering of computer graphics experts — as NVIDIA founder and CEO Jensen Huang put the finishing touches on the company’s special address.
