Whether animating fish fins or fashioning chic outfits for digital characters, creators can tap Marvelous Designer software to compose and tailor assets, clothes and other materials for their 3D workflows.
Amir Anbarestani, an accomplished 3D artist who goes by the moniker Kingsletter, had a “shell of a good time” creating his Space Turtle scene this week In the NVIDIA Studio.
How to Successfully Integrate NVIDIA DLSS 3
NVIDIA DLSS Frame Generation is the new performance multiplier in DLSS 3 that uses AI to create entirely new frames. This breakthrough has made real-time path tracing—the next frontier in video game graphics—possible.
NVIDIA has made it easier for you to take full advantage of this technology with the release of the Unreal Engine 5.2 Plugin and Streamline 2.1 SDK.
Unreal Engine developers can get started now. Coupled with the NVIDIA Reflex low-latency technology available through Unreal Engine 5, they have all the tools to boost game performance while providing a highly responsive experience for players.
If you’re integrating into your own custom engine, Streamline 2.1 greatly simplifies the manual API hooking for all the components that DLSS 3 needs. Streamline is an open-source, cross-IHV framework that simplifies the integration of features like DLSS 3.
Instead of manually integrating the DLSS Frame Generation libraries, you identify which resources (motion vectors, depth, and so on) are required for the desired plug-in and then trigger when to execute the plug-ins in the rendering pipeline. Here are the necessary steps to ensure that your integrations take full advantage of DLSS 3:
- Integrate the Streamline 2.1 SDK: To add Streamline to your application, follow the Streamline Manual Hooking guide. Integrate without any features and focus on tasks such as manual hooking and resource state tracking.
- Perform a security check: Verify the NVIDIA and Streamline dual signatures on sl.interposer.dll before loading the DLL. Follow the verification process within the Security section of the programming guide.
- Check for system support: The DLSS 3 components (Super Resolution, Frame Generation, and NVIDIA Reflex) all have varied system requirements. Check for hardware and software system support and show appropriate error messages based on reported support.
- Integrate DLSS Super Resolution through Streamline: Pass in the necessary input resources and set up the upscaling pipeline. Follow these integration steps before all other post-processing.
- Evaluate integration: Validate and confirm image quality and performance benefits from DLSS Super Resolution.
- Integrate NVIDIA Reflex through Streamline: Add Reflex and its sub-features to the rendering pipeline. Make sure to place Reflex markers in the appropriate location or where your application should sleep.
- Confirm system latency reduction: There are three primary ways to check that input latency was reduced:
  - NVIDIA FrameView SDK
  - GeForce Experience in-game overlay
  - The Reflex latency analyzer
- Integrate DLSS Frame Generation through Streamline: Follow these integration steps and pass in the appropriate constants, camera matrices, and input resources in your post-processing pipeline. Pass in all the input resources marked for DLSS Super Resolution (for example, Hudless and UI Color with Alpha). Disable DLSS Frame Generation when appropriate, such as when in-menu or for scene transitions.
- Validate DLSS Frame Generation inputs: Use the sl.imgui plugin to validate inputs (camera matrices, depth, MVEC, color, and so on). We recommend using ICAT to validate image quality and FrameView to validate latency. Lastly, use buffer visualization with the development DLLs.
- Swap to production DLLs: After image quality and performance benefits from DLSS Frame Generation are validated, replace the watermarked DLLs with non-watermarked, production-ready DLLs from NVIDIA.
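The rule in the Frame Generation step, to run it only where supported and to switch it off in menus and scene transitions, can be sketched as a small per-frame check. This is a minimal illustration and not Streamline code: FrameState, its fields, and should_run_frame_generation are hypothetical names, and a real integration would feed the result into whatever option the Streamline programming guide describes for enabling or disabling the feature.

```python
from dataclasses import dataclass


@dataclass
class FrameState:
    """Hypothetical per-frame application state (all names are illustrative)."""
    frame_generation_supported: bool  # result of the system support check
    user_enabled: bool                # player's choice in the options menu
    in_menu: bool                     # a full-screen menu is showing
    in_scene_transition: bool         # loading screen, cut, or fade


def should_run_frame_generation(state: FrameState) -> bool:
    """Return True only when generating frames is both possible and appropriate."""
    if not state.frame_generation_supported or not state.user_enabled:
        return False
    # The checklist says to disable Frame Generation in menus and transitions.
    return not (state.in_menu or state.in_scene_transition)


gameplay = FrameState(True, True, in_menu=False, in_scene_transition=False)
paused = FrameState(True, True, in_menu=True, in_scene_transition=False)
unsupported = FrameState(False, True, in_menu=False, in_scene_transition=False)

print(should_run_frame_generation(gameplay))     # True
print(should_run_frame_generation(paused))       # False
print(should_run_frame_generation(unsupported))  # False
```

Keeping the decision in one function makes it easier to confirm during validation that generated frames never appear over menus or cuts.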
For an integration checklist and the most asked questions for DLSS Super Resolution, Frame Generation, and NVIDIA Reflex, see Streamline Getting Started (registration required). To learn more about the new DLSS plugin in Unreal Engine 5, see the Unreal Engine page.
Game developers can find additional free resources to re-create fully path-traced and AI-driven virtual worlds on the NVIDIA Game Development page.
NVIDIA DLSS 3 is a neural graphics technology that multiplies performance using AI image reconstruction and frame generation. It’s a combination of three core innovations:
- Super Resolution uses deep learning algorithms to upscale a lower-resolution input into a higher-resolution output, creating a sharp image with a boosted frame rate.
- Frame Generation uses AI rendering to generate entirely new frames with best-in-class quality and responsiveness.
- NVIDIA Reflex is a low-latency technology that minimizes input lag by synchronizing the CPU and the GPU for optimal responsiveness.
Powered by these three technologies, DLSS 3 enables upwards of 4x performance boosts, providing headroom for next-generation, path-traced rendering.
DLSS Super Resolution has been available in Unreal Engine since 2021, making it easy to integrate NVIDIA AI scaling technology into Unreal Engine projects. NVIDIA has now released DLSS 3 for Unreal Engine 5.2, which includes Frame Generation and the latest NVIDIA Reflex version. For more information about Unreal Engine 5.1 and earlier, see step 2 in the installation guide later in this post.
To make integrating NVIDIA technology into your project as simple as possible, the new DLSS 3 Unreal Engine 5.2 package contains the Frame Generation, Super Resolution, and NVIDIA Reflex plugins all in a single download.
DLSS 3 technologies
The DLSS Frame Generation plugin uses Frame Generation to create entirely new frames by analyzing sequential frames and motion data from the Optical Flow Accelerator in GeForce RTX 40 Series GPUs.
Bundled inside the DLSS Frame Generation plugin is NVIDIA Reflex. Paired with DLSS 3, NVIDIA Reflex reduces onscreen latency by up to 2x compared to native rendering.
The DLSS Super Resolution plugin supports a variety of image quality modes—from Ultra Performance to Quality—determined by the native resolution relative to the DLSS output resolution. DLSS Super Resolution is customizable based on the needs of your game, with additional NVIDIA technologies included in the plugin:
- Deep Learning Anti-Aliasing Mode (DLAA) offers an AI-based anti-aliasing mode for users who have spare GPU headroom and want higher levels of image quality.
- NVIDIA Image Scaling is an open-source spatial upscaler and sharpening algorithm that is available for all platforms.
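To make the relationship between a Super Resolution quality mode and the resolution the game actually renders at more concrete, here is a short sketch. The per-axis scale factors are the commonly cited defaults for each mode and are an assumption here, since this post does not list them; DLSS_SCALE and render_resolution are illustrative names, not part of the plugin.

```python
# Per-axis render scale for each mode. These are commonly cited defaults and
# are an assumption here; the post itself does not list them.
DLSS_SCALE = {
    "quality": 2 / 3,
    "balanced": 0.58,
    "performance": 0.5,
    "ultra_performance": 1 / 3,
}


def render_resolution(output_width: int, output_height: int, mode: str) -> tuple:
    """Return the internal (width, height) rendered before upscaling."""
    scale = DLSS_SCALE[mode]
    return round(output_width * scale), round(output_height * scale)


# Internal render resolution for a 4K (3840x2160) output in each mode.
for mode in DLSS_SCALE:
    print(mode, render_resolution(3840, 2160, mode))
```

With these factors, Performance mode renders a 4K output internally at 1920x1080, a quarter of the output pixels, which is where much of the frame-rate gain comes from.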
The DLSS 3 Unreal Engine 5.2 plugin is delivered with the latest optimizations to NVIDIA AI algorithms, always learning and evolving with over-the-air updates.
How to install DLSS 3 for Unreal Engine
Follow these steps to download and install DLSS 3 for your Unreal Engine project.
- Agree to the Terms of the License Agreement and download DLSS 3 for your version of Unreal Engine.
- Unzip the DLSS folder. Only the 5.2 version of DLSS contains the Streamline/Frame Generation plugin.
- Copy the plugin folders to install to the /Engine/Plugins/MarketPlace folder of your Unreal Engine directory. If you don’t currently have a /MarketPlace folder, create one.
- Launch Unreal Editor, go to Plugins, and search for the plugins to activate. Search for “NVIDIA” to quickly list all of the included DLSS 3 plugins.
- Activate and restart Unreal Editor.
- Load the DLSS 3 Test project from the /Samples folder of the downloaded DLSS plugin file.
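If you install the plugins on several machines, the copy step above is easy to script. The following is a minimal sketch that assumes you point it at a folder containing only the plugin folders you want to install. The function name install_plugins and the example paths in the comment are hypothetical, so adjust them to match your download and engine location.

```python
import shutil
from pathlib import Path


def install_plugins(plugins_dir: Path, engine_root: Path) -> list:
    """Copy each plugin folder into <engine_root>/Engine/Plugins/MarketPlace."""
    marketplace = engine_root / "Engine" / "Plugins" / "MarketPlace"
    # Create the MarketPlace folder if it does not exist yet.
    marketplace.mkdir(parents=True, exist_ok=True)

    installed = []
    for plugin_dir in sorted(p for p in plugins_dir.iterdir() if p.is_dir()):
        target = marketplace / plugin_dir.name
        # dirs_exist_ok lets a re-run overwrite an earlier copy of the plugin.
        shutil.copytree(plugin_dir, target, dirs_exist_ok=True)
        installed.append(target)
    return installed


# Example call with hypothetical paths:
# install_plugins(Path("C:/Downloads/DLSS3/Plugins"), Path("C:/Epic/UE_5.2"))
```

After the copy, the plugins still have to be activated in Unreal Editor as described in the steps above.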
For prior versions of Unreal, you must build from the source and modify your source code with a small patch. For more information, see the included DLSS Frame Generation Quick Start Guide PDF in the download .zip file.
Tips for using DLSS 3 in Unreal Engine
After DLSS 3 is installed, follow these steps to verify that the Frame Generation, Super Resolution, and Reflex plugins are integrated into your project correctly.
- To confirm that DLSS Frame Generation is working and to view real-time statistics, go to Project Settings and open the preferences for the NVIDIA Streamline plugin. Then toggle the Load Debug Overlay option.
- The Load Debug Overlay option for Frame Generation works in the editor and can appear in development or debug builds, but won’t appear in production builds.
- To update Streamline automatically as well as DLSS AI algorithms with the latest improvements, use the same settings window to ensure that the Allow OTA Update option is enabled.
- In the Unreal Editor, Frame Generation only works from a new editor window (PIE) or in Standalone mode. It doesn’t work from the selected viewport or while editing.
- If any of the included DLSS 3 technologies aren’t working, check the output log or look for onscreen warning messages. A common cause is an outdated NVIDIA driver that needs to be updated.
- The DLSS 3 Unreal Engine plugin contains the latest NVIDIA Reflex technology, a newer version than the version currently built into Unreal Engine. While it’s possible to keep the earlier plugin enabled, and even use the earlier NVIDIA Reflex Blueprint scripts, we recommend that you disable the earlier NVIDIA Reflex plugin and use the new version bundled in DLSS 3 Streamline instead.
- We recommend that you set up all NVIDIA plugins through Blueprint scripts, as this enables you to conveniently activate plugins from menus and set preferences for users. However, if you need access to the console commands, they can be found under r.ngx. For more information about using console commands, see the DLSS Quick Start Guide PDF included in the DLSS 3 plugin download.
- When Frame Generation is on, we recommend that you disable VSYNC in your application, because VSYNC can behave incorrectly while the DLSS 3 plugin is active. VSYNC can be disabled with the r.vsync 0 console command.
Download DLSS 3 for Unreal Engine
DLSS 3 for Unreal Engine makes the latest NVIDIA advancements in neural rendering and performance multiplication easy to integrate into your UE project. Get started with the Frame Generation, Super Resolution, and Reflex plugins now.
DLSS 3 for Unreal Engine 5.2 is now available.
For more information, see NVIDIA technologies supported by Unreal Engine 5.
Learn how financial firms can build automated, real-time fraud and threat detection solutions with NVIDIA Morpheus.
Generative AI will “supercharge” creators across industries and content types, NVIDIA founder and CEO Jensen Huang said today at the Cannes Lions Festival, on the French Riviera. “For the very first time, the creative process can be amplified in content generation, and the content generation could be in any modality — it could be…
The analysis of 3D medical images is crucial for advancing clinical responses, disease tracking, and overall patient survival. Deep learning models form the backbone of modern 3D medical representation learning, enabling precise spatial context measurements that are essential for clinical decision-making. These 3D representations are highly sensitive to the physiological properties of medical imaging data, such as CT or MRI scans.
Medical image segmentation, a key visual task for medical applications, serves as a quantitative tool for measuring various aspects of medical images. To improve the analysis of these images, the development and application of foundation models are becoming increasingly important in the field of medical image analysis.
What are foundation models?
Foundation models, the latest generation of AI neural networks, are trained on extensive, diverse datasets and can be employed for a wide range of tasks or targets.
As large language models demonstrate their capability to tackle generic tasks, visual foundation models are emerging to address various problems, including classification, detection, and segmentation.
Foundation models can be used as powerful AI neural networks for segmenting different targets in medical images. This opens up a world of possibilities for medical imaging applications, enhancing the effectiveness of segmentation tasks and enabling more accurate measurements.
Challenges in medical image analysis
The application of medical foundation models in medical image analysis poses significant challenges. Unlike general computer vision models, medical image applications typically demand high-level domain knowledge.
Institutes have traditionally created fully annotated datasets for specific targets like spleens or tumors, relying solely on the association between input data features and target labels. Addressing multiple targets is more difficult, as manual annotations are laborious and time-consuming. Training larger or multi-task models is also increasingly challenging.
Despite recent advancements, there is still a long-standing issue in comprehending large medical imaging data due to its heterogeneity:
- Medical volumetric data is often extremely high-resolution, necessitating substantial computational resources.
- Current deep learning models have yet to effectively capture anatomical variability.
- The large-scale nature of medical imaging data makes learning robust and efficient 3D representations difficult, particularly when dealing with heterogeneous data.
However, the modern analysis of high-resolution, high-dimensional, and large-scale medical volumetric data presents an opportunity to accelerate discoveries and obtain innovative insights into human body functions, behavior, and disease.
Foundation models offer the capability to address the heterogeneous variations that complicate the rectification of inter- and intra-subject differences. AI has the potential to revolutionize medical imaging by enabling more accurate and efficient analysis of large-scale, complex data.
A platform for medical visual segmentation foundation models
MONAI Model Zoo serves as a platform for hosting medical visual foundation models. It contains a collection of pretrained models for medical imaging tasks developed using the Medical Open Network for AI (MONAI) framework.
The MONAI Model Zoo is a publicly available resource that provides access to a variety of pretrained models for different medical imaging tasks, such as segmentation, classification, registration, and synthesis. These pretrained models can be used as starting points or foundation models for training on new datasets or fine-tuning for specific applications.
The MONAI Model Zoo is designed to accelerate the development of new medical imaging applications and enable researchers and clinicians to leverage pre-existing models and build on top of them.
Whole-body CT segmentation
Segmenting the entirety of a whole-body CT scan from a single model is a daunting task. However, the MONAI team has risen to the challenge. They’ve developed models that segment all 104 anatomical structures from a single model:
- 27 organs
- 59 bones
- 10 muscles
- 8 vessels
Using the dataset released by the TotalSegmentator team, MONAI conducted research and benchmarking to achieve fast inference times. For a high-resolution 1.5 mm model, the inference time using a single NVIDIA V100 GPU for all 104 structures is just 4.12 seconds, while the inference time using a CPU is 30.30 seconds. This is a significant improvement from the original paper’s reported inference time for a single CT scan, which took more than 1 minute.
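As a quick sanity check on the figures above, the arithmetic can be worked through in a few lines. Every number comes from this post, and the variable names are only illustrative. The comparison with the original paper uses 60 seconds as a floor, because the post says only that it took more than 1 minute.

```python
# Structure counts as listed in this post.
structures = {"organs": 27, "bones": 59, "muscles": 10, "vessels": 8}
total_structures = sum(structures.values())

gpu_seconds = 4.12          # single NVIDIA V100 GPU, 1.5 mm model
cpu_seconds = 30.30         # CPU inference, same model
paper_floor_seconds = 60.0  # the post says "more than 1 minute"

gpu_vs_cpu = cpu_seconds / gpu_seconds
gpu_vs_paper_floor = paper_floor_seconds / gpu_seconds

print(total_structures)              # 104
print(round(gpu_vs_cpu, 2))          # 7.35
print(round(gpu_vs_paper_floor, 1))  # 14.6
```

In other words, the GPU run is roughly 7x faster than the CPU run and more than 14x faster than the time reported in the original paper.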
To access the MONAI Whole Body CT Segmentation foundation model, see the MONAI Model Zoo.
For an overview of all anatomical structures in whole-body CT scans, see the TotalSegmentator: robust segmentation of 104 anatomical structures in CT images whitepaper.
(Source: TotalSegmentator: robust segmentation of 104 anatomical structures in CT images)
Whole-brain MRI segmentation
Whole-brain segmentation is a critical technique in medical image analysis, providing a non-invasive means of measuring brain regions from clinical structural magnetic resonance imaging (MRI). However, with over 130 substructures in human brains, segmenting anything in the brain is a difficult challenge for MRI 3D segmentation. Unfortunately, detailed annotations of the brain are scarce, making this task even more challenging for the medical imaging community.
To address this issue, the MONAI team collaborated with Vanderbilt University to develop a deep learning model that can simultaneously segment all 133 brain structures. Using 3D Slicer, the MONAI model can infer the entire brain in just 2.0 seconds. The MONAI whole brain MRI segmentation model represents a promising development in medical imaging research, offering a valuable resource for improving the accuracy of brain measurements in clinical settings.
Visit the MONAI Model Zoo to access the MONAI Whole Brain MRI Segmentation Foundation Model.
How to access medical imaging foundation models
The use of foundation models in medical image analysis has great potential to improve diagnostic accuracy and enhance patient care. However, it’s important to recognize that medical application requires strong domain knowledge.
With the ability to process large amounts of data and identify subtle patterns and anomalies, foundation models have proven to be valuable tools in the medical image analysis field. The development and refinement of these models is ongoing, with researchers and practitioners working to improve their accuracy and expand their capabilities.
Although challenges such as patient privacy and potential biases must be addressed, the use of foundation models has already demonstrated significant benefits. It is expected to play a more prominent role in healthcare in the future.
As researchers, clinicians, and users continue to focus on foundation models, the MONAI Model Zoo, a platform hosting pretrained medical image models, is amplifying its impact. Fine-tuning pretrained models is crucial to the future of medical image analysis.
The MONAI Model Zoo provides access to a diverse collection of pretrained models for various medical imaging tasks, including segmentation, classification, registration, and synthesis. By using these pre-existing models as starting points, researchers and clinicians can accelerate the development of new medical imaging applications, saving time and resources.
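For readers who prefer to script the download, here is a minimal sketch that uses the download function in the monai.bundle module. The two bundle names are assumptions based on the Model Zoo listing and should be checked there before use. The helper fetch_bundle, the ./bundles folder, and the returned path layout are illustrative rather than prescribed by MONAI.

```python
from pathlib import Path

# Bundle names are assumptions based on the Model Zoo listing; verify them
# there before use.
WHOLE_BODY_CT = "wholeBody_ct_segmentation"
WHOLE_BRAIN_MRI = "wholeBrainSeg_Large_UNEST_segmentation"


def fetch_bundle(name: str, bundle_dir: str = "./bundles") -> Path:
    """Download a Model Zoo bundle and return the folder it is expected in."""
    # Imported lazily so this file can be loaded without MONAI installed.
    from monai.bundle import download

    Path(bundle_dir).mkdir(parents=True, exist_ok=True)
    download(name=name, bundle_dir=bundle_dir)
    # Assumption: the bundle is unpacked into a subfolder named after it.
    return Path(bundle_dir) / name


# Example call (requires MONAI and network access):
# print(fetch_bundle(WHOLE_BODY_CT))
```

Each downloaded bundle ships with its own configuration files and documentation, which are the place to look for the exact inference or fine-tuning commands for that model.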
Join us in driving innovation and collaboration in medical imaging research by exploring the MONAI Model Zoo today.
Google at CVPR 2023
This week marks the beginning of the premier annual Computer Vision and Pattern Recognition conference (CVPR 2023), held in-person in Vancouver, BC (with additional virtual content). As a leader in computer vision research and a Platinum Sponsor, Google Research will have a strong presence across CVPR 2023 with 90 papers being presented at the main conference and active involvement in over 40 conference workshops and tutorials.
If you are attending CVPR this year, please stop by our booth to chat with our researchers who are actively exploring the latest techniques for application to various areas of machine perception. Our researchers will also be available to talk about and demo several recent efforts, including on-device ML applications with MediaPipe, strategies for differential privacy, neural radiance field technologies and much more.
You can also learn more about our research being presented at CVPR 2023 in the list below (Google affiliations in bold).
Board and organizing committee
Senior area chairs include: Cordelia Schmid, Ming-Hsuan Yang
Area chairs include: Andre Araujo, Anurag Arnab, Rodrigo Benenson, Ayan Chakrabarti, Huiwen Chang, Alireza Fathi, Vittorio Ferrari, Golnaz Ghiasi, Boqing Gong, Yedid Hoshen, Varun Jampani, Lu Jiang, Da-Cheng Juan, Dahun Kim, Stephen Lombardi, Peyman Milanfar, Ben Mildenhall, Arsha Nagrani, Jordi Pont-Tuset, Paul Hongsuck Seo, Fei Sha, Saurabh Singh, Noah Snavely, Kihyuk Sohn, Chen Sun, Pratul P. Srinivasan, Deqing Sun, Andrea Tagliasacchi, Federico Tombari, Jasper Uijlings
Publicity Chair: Boqing Gong
Demonstration Chair: Jonathan T. Barron
Program Advisory Board includes: Cordelia Schmid, Richard Szeliski
Panels
Scientific Discovery and the Environment
Best Paper Award candidates
MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures
Zhiqin Chen, Thomas Funkhouser, Peter Hedman, Andrea Tagliasacchi
DynIBaR: Neural Dynamic Image-Based Rendering
Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker, Noah Snavely
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
Nataniel Ruiz*, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, Kfir Aberman
On Distillation of Guided Diffusion Models
Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, Tim Salimans
Highlight papers
Connecting Vision and Language with Video Localized Narratives
Paul Voigtlaender, Soravit Changpinyo, Jordi Pont-Tuset, Radu Soricut, Vittorio Ferrari
MaskSketch: Unpaired Structure-Guided Masked Image Generation
Dina Bashkirova*, Jose Lezama, Kihyuk Sohn, Kate Saenko, Irfan Essa
SPARF: Neural Radiance Fields from Sparse and Noisy Poses
Prune Truong*, Marie-Julie Rakotosaona, Fabian Manhardt, Federico Tombari
MAGVIT: Masked Generative Video Transformer
Lijun Yu*, Yong Cheng, Kihyuk Sohn, Jose Lezama, Han Zhang, Huiwen Chang, Alexander Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers
Dahun Kim, Anelia Angelova, Weicheng Kuo
I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification
Muhammad Ferjad Naeem, Gul Zain Khan, Yongqin Xian, Muhammad Zeshan Afzal, Didier Stricker, Luc Van Gool, Federico Tombari
Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization
Zifan Wang*, Nan Ding, Tomer Levinboim, Xi Chen, Radu Soricut
Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting (see blog post)
Su Wang, Chitwan Saharia, Ceslee Montgomery, Jordi Pont-Tuset, Shai Noy, Stefano Pellegrini, Yasumasa Onoe, Sarah Laszlo, David J. Fleet, Radu Soricut, Jason Baldridge, Mohammad Norouzi, Peter Anderson, William Chan
RUST: Latent Neural Scene Representations from Unposed Imagery
Mehdi S. M. Sajjadi, Aravindh Mahendran, Thomas Kipf, Etienne Pot, Daniel Duckworth, Mario Lučić, Klaus Greff
REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory (see blog post)
Ziniu Hu*, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David Ross, Alireza Fathi
RobustNeRF: Ignoring Distractors with Robust Losses
Sara Sabour, Suhani Vora, Daniel Duckworth, Ivan Krasin, David J. Fleet, Andrea Tagliasacchi
Papers
AligNeRF: High-Fidelity Neural Radiance Fields via Alignment-Aware Training
Yifan Jiang*, Peter Hedman, Ben Mildenhall, Dejia Xu, Jonathan T. Barron, Zhangyang Wang, Tianfan Xue*
BlendFields: Few-Shot Example-Driven Facial Modeling
Kacper Kania, Stephan Garbin, Andrea Tagliasacchi, Virginia Estellers, Kwang Moo Yi, Tomasz Trzcinski, Julien Valentin, Marek Kowalski
Enhancing Deformable Local Features by Jointly Learning to Detect and Describe Keypoints
Guilherme Potje, Felipe Cadar, Andre Araujo, Renato Martins, Erickson Nascimento
How Can Objects Help Action Recognition?
Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid
Hybrid Neural Rendering for Large-Scale Scenes with Motion Blur
Peng Dai, Yinda Zhang, Xin Yu, Xiaoyang Lyu, Xiaojuan Qi
IFSeg: Image-Free Semantic Segmentation via Vision-Language Model
Sukmin Yun, Seong Park, Paul Hongsuck Seo, Jinwoo Shin
Learning from Unique Perspectives: User-Aware Saliency Modeling (see blog post)
Shi Chen*, Nachiappan Valliappan, Shaolei Shen, Xinyu Ye, Kai Kohlhoff, Junfeng He
MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis
Tianhong Li*, Huiwen Chang, Shlok Kumar Mishra, Han Zhang, Dina Katabi, Dilip Krishnan
NeRF-Supervised Deep Stereo
Fabio Tosi, Alessio Tonioni, Daniele Gregorio, Matteo Poggi
Omnimatte3D: Associating Objects and their Effects in Unconstrained Monocular Video
Mohammed Suhail, Erika Lu, Zhengqi Li, Noah Snavely, Leon Sigal, Forrester Cole
OpenScene: 3D Scene Understanding with Open Vocabularies
Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser
PersonNeRF: Personalized Reconstruction from Photo Collections
Chung-Yi Weng, Pratul Srinivasan, Brian Curless, Ira Kemelmacher-Shlizerman
Prefix Conditioning Unifies Language and Label Supervision
Kuniaki Saito*, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, Tomas Pfister
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning (see blog post)
AJ Piergiovanni, Weicheng Kuo, Anelia Angelova
Burstormer: Burst Image Restoration and Enhancement Transformer
Akshay Dudhane, Syed Waqas Zamir, Salman Khan, Fahad Shahbaz Khan, Ming-Hsuan Yang
Decentralized Learning with Multi-Headed Distillation
Andrey Zhmoginov, Mark Sandler, Nolan Miller, Gus Kristiansen, Max Vladymyrov
GINA-3D: Learning to Generate Implicit Neural Assets in the Wild
Bokui Shen, Xinchen Yan, Charles R. Qi, Mahyar Najibi, Boyang Deng, Leonidas Guibas, Yin Zhou, Dragomir Anguelov
Grad-PU: Arbitrary-Scale Point Cloud Upsampling via Gradient Descent with Learned Distance Functions
Yun He, Danhang Tang, Yinda Zhang, Xiangyang Xue, Yanwei Fu
Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery from Sparse Image Ensemble
Chun-Han Yao*, Wei-Chih Hung, Yuanzhen Li, Michael Rubinstein, Ming-Hsuan Yang, Varun Jampani
Hyperbolic Contrastive Learning for Visual Representations beyond Objects
Songwei Ge, Shlok Mishra, Simon Kornblith, Chun-Liang Li, David Jacobs
Imagic: Text-Based Real Image Editing with Diffusion Models
Bahjat Kawar*, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, Michal Irani
Incremental 3D Semantic Scene Graph Prediction from RGB Sequences
Shun-Cheng Wu, Keisuke Tateno, Nassir Navab, Federico Tombari
IPCC-TP: Utilizing Incremental Pearson Correlation Coefficient for Joint Multi-Agent Trajectory Prediction
Dekai Zhu, Guangyao Zhai, Yan Di, Fabian Manhardt, Hendrik Berkemeyer, Tuan Tran, Nassir Navab, Federico Tombari, Benjamin Busam
Learning to Generate Image Embeddings with User-Level Differential Privacy
Zheng Xu, Maxwell Collins, Yuxiao Wang, Liviu Panait, Sewoong Oh, Sean Augenstein, Ting Liu, Florian Schroff, H. Brendan McMahan
NoisyTwins: Class-Consistent and Diverse Image Generation Through StyleGANs
Harsh Rangwani, Lavish Bansal, Kartik Sharma, Tejan Karmali, Varun Jampani, Venkatesh Babu Radhakrishnan
NULL-Text Inversion for Editing Real Images Using Guided Diffusion Models
Ron Mokady*, Amir Hertz*, Kfir Aberman, Yael Pritch, Daniel Cohen-Or*
SCOOP: Self-Supervised Correspondence and Optimization-Based Scene Flow
Itai Lang*, Dror Aiger, Forrester Cole, Shai Avidan, Michael Rubinstein
Shape, Pose, and Appearance from a Single Image via Bootstrapped Radiance Field Inversion
Dario Pavllo*, David Joseph Tan, Marie-Julie Rakotosaona, Federico Tombari
TexPose: Neural Texture Learning for Self-Supervised 6D Object Pose Estimation
Hanzhi Chen, Fabian Manhardt, Nassir Navab, Benjamin Busam
TryOnDiffusion: A Tale of Two UNets
Luyang Zhu*, Dawei Yang, Tyler Zhu, Fitsum Reda, William Chan, Chitwan Saharia, Mohammad Norouzi, Ira Kemelmacher-Shlizerman
A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning
Aishwarya Kamath*, Peter Anderson, Su Wang, Jing Yu Koh*, Alexander Ku, Austin Waters, Yinfei Yang*, Jason Baldridge, Zarana Parekh
CLIPPO: Image-and-Language Understanding from Pixels Only
Michael Tschannen, Basil Mustafa, Neil Houlsby
Controllable Light Diffusion for Portraits
David Futschik, Kelvin Ritland, James Vecore, Sean Fanello, Sergio Orts-Escolano, Brian Curless, Daniel Sýkora, Rohit Pandey
CUF: Continuous Upsampling Filters
Cristina Vasconcelos, Cengiz Oztireli, Mark Matthews, Milad Hashemi, Kevin Swersky, Andrea Tagliasacchi
Improving Zero-Shot Generalization and Robustness of Multi-modal Models
Yunhao Ge*, Jie Ren, Andrew Gallagher, Yuxiao Wang, Ming-Hsuan Yang, Hartwig Adam, Laurent Itti, Balaji Lakshminarayanan, Jiaping Zhao
LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding
Gen Li, Varun Jampani, Deqing Sun, Laura Sevilla-Lara
Nerflets: Local Radiance Fields for Efficient Structure-Aware 3D Scene Representation from 2D Supervision
Xiaoshuai Zhang, Abhijit Kundu, Thomas Funkhouser, Leonidas Guibas, Hao Su, Kyle Genova
Self-Supervised AutoFlow
Hsin-Ping Huang, Charles Herrmann, Junhwa Hur, Erika Lu, Kyle Sargent, Austin Stone, Ming-Hsuan Yang, Deqing Sun
Train-Once-for-All Personalization
Hong-You Chen*, Yandong Li, Yin Cui, Mingda Zhang, Wei-Lun Chao, Li Zhang
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning (see blog post)
Antoine Yang*, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid
VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining
Junjie Ke, Keren Ye, Jiahui Yu, Yonghui Wu, Peyman Milanfar, Feng Yang
You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model
Shengkun Tang, Yaqing Wang, Zhenglun Kong, Tianchi Zhang, Yao Li, Caiwen Ding, Yanzhi Wang, Yi Liang, Dongkuan Xu
Accidental Light Probes
Hong-Xing Yu, Samir Agarwala, Charles Herrmann, Richard Szeliski, Noah Snavely, Jiajun Wu, Deqing Sun
FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning
Yuanhao Xiong, Ruochen Wang, Minhao Cheng, Felix Yu, Cho-Jui Hsieh
FlexiViT: One Model for All Patch Sizes
Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, Filip Pavetic
Iterative Vision-and-Language Navigation
Jacob Krantz, Shurjo Banerjee, Wang Zhu, Jason Corso, Peter Anderson, Stefan Lee, Jesse Thomason
MoDi: Unconditional Motion Synthesis from Diverse Data
Sigal Raab, Inbal Leibovitch, Peizhuo Li, Kfir Aberman, Olga Sorkine-Hornung, Daniel Cohen-Or
Multimodal Prompting with Missing Modalities for Visual Recognition
Yi-Lun Lee, Yi-Hsuan Tsai, Wei-Chen Chiu, Chen-Yu Lee
Scene-Aware Egocentric 3D Human Pose Estimation
Jian Wang, Diogo Luvizon, Weipeng Xu, Lingjie Liu, Kripasindhu Sarkar, Christian Theobalt
ShapeClipper: Scalable 3D Shape Learning from Single-View Images via Geometric and CLIP-Based Consistency
Zixuan Huang, Varun Jampani, Ngoc Anh Thai, Yuanzhen Li, Stefan Stojanov, James M. Rehg
Improving Image Recognition by Retrieving from Web-Scale Image-Text Data
Ahmet Iscen, Alireza Fathi, Cordelia Schmid
JacobiNeRF: NeRF Shaping with Mutual Information Gradients
Xiaomeng Xu, Yanchao Yang, Kaichun Mo, Boxiao Pan, Li Yi, Leonidas Guibas
Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos
Ziqian Bai*, Feitong Tan, Zeng Huang, Kripasindhu Sarkar, Danhang Tang, Di Qiu, Abhimitra Meka, Ruofei Du, Mingsong Dou, Sergio Orts-Escolano, Rohit Pandey, Ping Tan, Thabo Beeler, Sean Fanello, Yinda Zhang
NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis
Allan Zhou, Mo Jin Kim, Lirui Wang, Pete Florence, Chelsea Finn
Pic2Word: Mapping Pictures to Words for Zero-Shot Composed Image Retrieval
Kuniaki Saito*, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, Tomas Pfister
SCADE: NeRFs from Space Carving with Ambiguity-Aware Depth Estimates
Mikaela Uy, Ricardo Martin Brualla, Leonidas Guibas, Ke Li
Structured 3D Features for Reconstructing Controllable Avatars
Enric Corona, Mihai Zanfir, Thiemo Alldieck, Eduard Gabriel Bazavan, Andrei Zanfir, Cristian Sminchisescu
Token Turing Machines
Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab
TruFor: Leveraging All-Round Clues for Trustworthy Image Forgery Detection and Localization
Fabrizio Guillaro, Davide Cozzolino, Avneesh Sud, Nicholas Dufour, Luisa Verdoliva
Video Probabilistic Diffusion Models in Projected Latent Space
Sihyun Yu, Kihyuk Sohn, Subin Kim, Jinwoo Shin
Visual Prompt Tuning for Generative Transfer Learning
Kihyuk Sohn, Yuan Hao, Jose Lezama, Luisa Polania, Huiwen Chang, Han Zhang, Irfan Essa, Lu Jiang
Zero-Shot Referring Image Segmentation with Global-Local Context Features
Seonghoon Yu, Paul Hongsuck Seo, Jeany Son
AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR (see blog post)
Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid
DC2: Dual-Camera Defocus Control by Learning to Refocus
Hadi Alzayer, Abdullah Abuolaim, Leung Chun Chan, Yang Yang, Ying Chen Lou, Jia-Bin Huang, Abhishek Kar
Edges to Shapes to Concepts: Adversarial Augmentation for Robust Vision
Aditay Tripathi*, Rishubh Singh, Anirban Chakraborty, Pradeep Shenoy
MetaCLUE: Towards Comprehensive Visual Metaphors Research
Arjun R. Akula, Brendan Driscoll, Pradyumna Narayana, Soravit Changpinyo, Zhiwei Jia, Suyash Damle, Garima Pruthi, Sugato Basu, Leonidas Guibas, William T. Freeman, Yuanzhen Li, Varun Jampani
Multi-Realism Image Compression with a Conditional Generator
Eirikur Agustsson, David Minnen, George Toderici, Fabian Mentzer
NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors
Congyue Deng, Chiyu Jiang, Charles R. Qi, Xinchen Yan, Yin Zhou, Leonidas Guibas, Dragomir Anguelov
On Calibrating Semantic Segmentation Models: Analyses and an Algorithm
Dongdong Wang, Boqing Gong, Liqiang Wang
Persistent Nature: A Generative Model of Unbounded 3D Worlds
Lucy Chai, Richard Tucker, Zhengqi Li, Phillip Isola, Noah Snavely
Rethinking Domain Generalization for Face Anti-spoofing: Separability and Alignment
Yiyou Sun*, Yaojie Liu, Xiaoming Liu, Yixuan Li, Wen-Sheng Chu
SINE: Semantic-Driven Image-Based NeRF Editing with Prior-Guided Editing Field
Chong Bao, Yinda Zhang, Bangbang Yang, Tianxing Fan, Zesong Yang, Hujun Bao, Guofeng Zhang, Zhaopeng Cui
Sequential Training of GANs Against GAN-Classifiers Reveals Correlated “Knowledge Gaps” Present Among Independently Trained GAN Instances
Arkanath Pathak, Nicholas Dufour
SparsePose: Sparse-View Camera Pose Regression and Refinement
Samarth Sinha, Jason Zhang, Andrea Tagliasacchi, Igor Gilitschenski, David Lindell
Teacher-Generated Spatial-Attention Labels Boost Robustness and Accuracy of Contrastive Models
Yushi Yao, Chang Ye, Gamaleldin F. Elsayed, Junfeng He
Workshops
Computer Vision for Mixed Reality
Speakers include: Ira Kemelmacher-Shlizerman
Workshop on Autonomous Driving (WAD)
Speakers include: Chelsea Finn
Multimodal Content Moderation (MMCM)
Organizers include: Chris Bregler
Speakers include: Mevan Babakar
Medical Computer Vision (MCV)
Speakers include: Shekoofeh Azizi
VAND: Visual Anomaly and Novelty Detection
Speakers include: Yedid Hoshen, Jie Ren
Structural and Compositional Learning on 3D Data
Organizers include: Leonidas Guibas
Speakers include: Andrea Tagliasacchi, Fei Xia, Amir Hertz
Fine-Grained Visual Categorization (FGVC10)
Organizers include: Kimberly Wilber, Sara Beery
Panelists include: Hartwig Adam
XRNeRF: Advances in NeRF for the Metaverse
Organizers include: Jonathan T. Barron
Speakers include: Ben Poole
OmniLabel: Infinite Label Spaces for Semantic Understanding via Natural Language
Organizers include: Golnaz Ghiasi, Long Zhao
Speakers include: Vittorio Ferrari
Large Scale Holistic Video Understanding
Organizers include: David Ross
Speakers include: Cordelia Schmid
New Frontiers for Zero-Shot Image Captioning Evaluation (NICE)
Speakers include: Cordelia Schmid
Computational Cameras and Displays (CCD)
Organizers include: Ulugbek Kamilov
Speakers include: Mauricio Delbracio
Gaze Estimation and Prediction in the Wild (GAZE)
Organizers include: Thabo Beeler
Speakers include: Erroll Wood
Face and Gesture Analysis for Health Informatics (FGAHI)
Speakers include: Daniel McDuff
Computer Vision for Animal Behavior Tracking and Modeling (CV4Animals)
Organizers include: Sara Beery
Speakers include: Arsha Nagrani
3D Vision and Robotics
Speakers include: Pete Florence
End-to-End Autonomous Driving: Perception, Prediction, Planning and Simulation (E2EAD)
Organizers include: Anurag Arnab
End-to-End Autonomous Driving: Emerging Tasks and Challenges
Speakers include: Sergey Levine
Multi-Modal Learning and Applications (MULA)
Speakers include: Aleksander Hołyński
Synthetic Data for Autonomous Systems (SDAS)
Speakers include: Lukas Hoyer
Vision Datasets Understanding
Organizers include: José Lezama
Speakers include: Vijay Janapa Reddi
Precognition: Seeing Through the Future
Organizers include: Utsav Prabhu
New Trends in Image Restoration and Enhancement (NTIRE)
Organizers include: Ming-Hsuan Yang
Generative Models for Computer Vision
Speakers include: Ben Mildenhall, Andrea Tagliasacchi
Adversarial Machine Learning on Computer Vision: Art of Robustness
Organizers include: Xinyun Chen
Speakers include: Deqing Sun
Media Forensics
Speakers include: Nicholas Carlini
Tracking and Its Many Guises: Tracking Any Object in Open-World
Organizers include: Paul Voigtlaender
3D Scene Understanding for Vision, Graphics, and Robotics
Speakers include: Andy Zeng
Computer Vision for Physiological Measurement (CVPM)
Organizers include: Daniel McDuff
Affective Behaviour Analysis In-the-Wild
Organizers include: Stefanos Zafeiriou
Ethical Considerations in Creative Applications of Computer Vision (EC3V)
Organizers include: Rida Qadri, Mohammad Havaei, Fernando Diaz, Emily Denton, Sarah Laszlo, Negar Rostamzadeh, Pamela Peter-Agbia, Eva Kozanecka
VizWiz Grand Challenge: Describing Images and Videos Taken by Blind People
Speakers include: Haoran Qi
Efficient Deep Learning for Computer Vision (see blog post)
Organizers include: Andrew Howard, Chas Leichner
Speakers include: Andrew Howard
Visual Copy Detection
Organizers include: Priya Goyal
Learning 3D with Multi-View Supervision (3DMV)
Speakers include: Ben Poole
Image Matching: Local Features and Beyond
Organizers include: Eduard Trulls
Vision for All Seasons: Adverse Weather and Lighting Conditions (V4AS)
Organizers include: Lukas Hoyer
Transformers for Vision (T4V)
Speakers include: Cordelia Schmid, Huiwen Chang
Scholars vs Big Models — How Can Academics Adapt?
Organizers include: Sara Beery
Speakers include: Jonathan T. Barron, Cordelia Schmid
ScanNet Indoor Scene Understanding Challenge
Speakers include: Tom Funkhouser
Computer Vision for Microscopy Image Analysis
Speakers include: Po-Hsuan Cameron Chen
Embedded Vision
Speakers include: Rahul Sukthankar
Sight and Sound
Organizers include: Arsha Nagrani, William Freeman
AI for Content Creation
Organizers include: Deqing Sun, Huiwen Chang, Lu Jiang
Speakers include: Ben Mildenhall, Tim Salimans, Yuanzhen Li
Computer Vision in the Wild
Organizers include: Xiuye Gu, Neil Houlsby
Speakers include: Boqing Gong, Anelia Angelova
Visual Pre-Training for Robotics
Organizers include: Mathilde Caron
Omnidirectional Computer Vision
Organizers include: Yi-Hsuan Tsai
Tutorials
All Things ViTs: Understanding and Interpreting Attention in Vision
Hila Chefer, Sayak Paul
Recent Advances in Anomaly Detection
Guansong Pang, Joey Tianyi Zhou, Radu Tudor Ionescu, Yu Tian, Kihyuk Sohn
Contactless Healthcare Using Cameras and Wireless Sensors
Wenjin Wang, Xuyu Wang, Jun Luo, Daniel McDuff
Object Localization for Free: Going Beyond Self-Supervised Learning
Oriane Simeoni, Weidi Xie, Thomas Kipf, Patrick Pérez
Prompting in Vision
Kaiyang Zhou, Ziwei Liu, Phillip Isola, Hyojin Bahng, Ludwig Schmidt, Sarah Pratt, Denny Zhou
* Work done while at Google
The proliferation of large diffusion models for image generation has led to a significant increase in model size and inference workloads. On-device ML inference in mobile environments requires meticulous performance optimization and consideration of trade-offs due to resource constraints. Running inference of large diffusion models (LDMs) on-device, driven by the need for cost efficiency and user privacy, presents even greater challenges due to the substantial memory requirements and computational demands of these models.
We address this challenge in our work titled “Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations” (to be presented at the CVPR 2023 workshop for Efficient Deep Learning for Computer Vision), which focuses on the optimized execution of a foundational LDM model on a mobile GPU. In this blog post, we summarize the core techniques we employed to run large diffusion models like Stable Diffusion at full resolution (512×512 pixels) with 20 iterations on modern smartphones, reaching an inference latency of under 12 seconds for the original model without distillation. As discussed in our previous blog post, GPU-accelerated ML inference is often limited by memory performance, and execution of LDMs is no exception. Therefore, the central theme of our optimization is efficient memory input/output (I/O), even if it means choosing memory-efficient algorithms over those that prioritize arithmetic logic unit efficiency. Ultimately, our primary objective is to reduce the overall latency of the ML inference.
A sample output of an LDM on Mobile GPU with the prompt text: “a photo realistic and high resolution image of a cute puppy with surrounding flowers”.
Enhanced attention module for memory efficiency
An ML inference engine typically provides a variety of optimized ML operations. Despite this, achieving optimal performance can still be challenging as there is a certain amount of overhead for executing individual neural net operators on a GPU. To mitigate this overhead, ML inference engines incorporate extensive operator fusion rules that consolidate multiple operators into a single operator, thereby reducing the number of iterations across tensor elements while maximizing compute per iteration. For instance, TensorFlow Lite utilizes operator fusion to combine computationally expensive operations, like convolutions, with subsequent activation functions, like rectified linear units, into one.
A clear opportunity for optimization is the heavily used attention block adopted in the denoiser model in the LDM. The attention blocks allow the model to focus on specific parts of the input by assigning higher weights to important regions. There are multiple ways one can optimize the attention modules, and we selectively employ one of the two optimizations explained below depending on which optimization performs better.
The first optimization, which we call partially fused softmax, removes the need for extensive memory writes and reads between the softmax and the matrix multiplication in the attention module. Let the attention block be just a simple matrix multiplication of the form Y = softmax(X) * W where X and W are 2D matrices of shape a×b and b×c, respectively (shown below in the top half).
For numerical stability, T = softmax(X) is typically calculated in three passes:
- Determine the maximum value in the list, i.e., for each row in matrix X
- Sum up the exponentials of the differences between each list item and the maximum value (from pass 1)
- Divide the exponential of each item minus the maximum value by the sum from pass 2
Carrying out these passes naïvely would result in a huge memory write for the temporary intermediate tensor T holding the output of the entire softmax function. We bypass this large memory write if we only store the results of passes 1 and 2, labeled m and s, respectively, which are small vectors, with a elements each, compared to T which has a·b elements. With this technique, we are able to reduce tens or even hundreds of megabytes of memory consumption by multiple orders of magnitude (shown below in the bottom half).
Attention modules. Top: A naïve attention block, composed of a SOFTMAX (with all three passes) and a MATMUL, requires a large memory write for the big intermediate tensor T. Bottom: Our memory-efficient attention block with partially fused softmax in MATMUL only needs to store two small intermediate tensors for m and s.
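The data flow of the partially fused softmax can be sketched in NumPy. This is only an illustration of the idea under the simplified form Y = softmax(X) * W used above, not the actual GPU shader; the function names are hypothetical, and the explicit Python loops stand in for what a single MATMUL kernel would do per output element.

```python
import numpy as np

def naive_attention(X, W):
    # Pass 1: row-wise maximum, for numerical stability.
    m = X.max(axis=1, keepdims=True)
    # Pass 2: row-wise sum of exp(x - max).
    s = np.exp(X - m).sum(axis=1, keepdims=True)
    # Pass 3: materialize the full a x b tensor T = softmax(X).
    T = np.exp(X - m) / s
    return T @ W

def partially_fused_attention(X, W):
    # Run passes 1 and 2 only and keep two vectors with a elements each.
    m = X.max(axis=1)
    s = np.exp(X - m[:, None]).sum(axis=1)
    a, b = X.shape
    Y = np.zeros((a, W.shape[1]))
    # Pass 3 is folded into the matrix multiplication: each softmax value is
    # recomputed from X, m and s at the point where it is consumed, so the
    # a x b tensor T is never written out.
    for i in range(a):
        for k in range(b):
            Y[i] += (np.exp(X[i, k] - m[i]) / s[i]) * W[k]
    return Y
```

Both functions return the same Y; the second trades a few extra exponentials for not storing T, which is the memory-over-ALU trade-off the post describes.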
The other optimization involves employing FlashAttention, which is an I/O-aware, exact attention algorithm. This algorithm reduces the number of GPU high-bandwidth memory accesses, making it a good fit for our memory bandwidth–limited use case. However, we found this technique to only work for SRAM with certain sizes and to require a large number of registers. Therefore, we only leverage this technique for attention matrices with a certain size on a select set of GPUs.
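As a rough illustration of why such an algorithm can stay exact while touching less memory, the sketch below shows the running-maximum, running-sum recurrence that block-wise attention schemes of this kind build on, again for the simplified form Y = softmax(X) * W. It is not the FlashAttention kernel: the real algorithm also tiles over rows and keeps its tiles in on-chip SRAM. The function name and the `block` parameter are assumptions made for the example.

```python
import numpy as np

def blockwise_attention(X, W, block=2):
    a, b = X.shape
    m = np.full(a, -np.inf)          # running row maxima
    s = np.zeros(a)                  # running row sums of exp(x - m)
    acc = np.zeros((a, W.shape[1]))  # running unnormalized output
    for start in range(0, b, block):
        Xb = X[:, start:start + block]   # one block of columns of X
        Wb = W[start:start + block]      # the matching rows of W
        m_new = np.maximum(m, Xb.max(axis=1))
        # Rescale earlier partial results to the new maximum
        # (on the first block this factor is exp(-inf) = 0).
        scale = np.exp(m - m_new)
        P = np.exp(Xb - m_new[:, None])
        s = s * scale + P.sum(axis=1)
        acc = acc * scale[:, None] + P @ Wb
        m = m_new
    # Normalize once at the end; the result equals softmax(X) @ W.
    return acc / s[:, None]
```

Only one block of X and W is needed at a time, plus per-row state for m, s and the accumulator, which is what makes the approach sensitive to on-chip memory size and register count.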
Winograd fast convolution for 3×3 convolution layers
The backbone of common LDMs heavily relies on 3×3 convolution layers (convolutions with filter size 3×3), comprising over 90% of the layers in the decoder. Despite increased memory consumption and numerical errors, we found Winograd fast convolution to be effective at speeding up the convolutions. Distinct from the filter size 3×3 used in convolutions, tile size refers to the size of a sub-region of the input tensor that is processed at a time. Increasing the tile size enhances the efficiency of the convolution in terms of arithmetic logic unit (ALU) usage. However, this improvement comes at the expense of increased memory consumption. Our tests indicate that a tile size of 4×4 achieves the optimal trade-off between computational efficiency and memory utilization.
Tile size | FLOPS savings | Memory usage: intermediate tensors | Memory usage: weights
2×2 | 2.25× | 4.00× | 1.77×
4×4 | 4.00× | 2.25× | 4.00×
6×6 | 5.06× | 1.80× | 7.12×
8×8 | 5.76× | 1.56× | 11.1×
Impact of Winograd with varying tile sizes for 3×3 convolutions.
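The trend in the table can be approximated with a simple cost model. The sketch below assumes the standard counting for Winograd with an m×m output tile and an r×r filter, where each tile is transformed to size (m + r - 1)×(m + r - 1); this model is an assumption for illustration, not taken from the post. Its FLOPS column matches the table, while a few of the memory entries differ slightly in the last digits.

```python
def winograd_ratios(m, r=3):
    """Return (FLOPS savings, intermediate-tensor growth, weight growth)
    for an m x m output tile and an r x r filter under a simple cost model."""
    t = m + r - 1                              # edge of the transformed tile
    flops_savings = (m * m * r * r) / (t * t)  # direct multiplies / Winograd multiplies
    intermediate = (t * t) / (m * m)           # transformed tile vs. output tile
    weights = (t * t) / (r * r)                # transformed filter vs. original filter
    return flops_savings, intermediate, weights

for m in (2, 4, 6, 8):
    f, i, w = winograd_ratios(m)
    print(f"{m}x{m}: FLOPS savings {f:.2f}x, intermediates {i:.2f}x, weights {w:.2f}x")
```

Under this model the FLOPS savings grow more and more slowly as the tile gets larger while the weight memory keeps growing quadratically, which is consistent with a mid-sized tile such as 4×4 being the preferred trade-off.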
Specialized operator fusion for memory efficiency
We discovered that running LDM inference efficiently on a mobile GPU requires significantly larger fusion windows for commonly employed layers and units in LDMs than current off-the-shelf on-device GPU-accelerated ML inference engines provide. Consequently, we developed specialized implementations that could execute a larger range of neural operators than typical fusion rules would permit. Specifically, we focused on two specializations: the Gaussian Error Linear Unit (GELU) and the group normalization layer.
An approximation of GELU with the hyperbolic tangent function requires writing to and reading from seven auxiliary intermediate tensors (shown as light orange rounded rectangles in the figure below), reading from the input tensor x three times, and writing to the output tensor y once, across eight GPU programs that each implement one of the labeled operations (light blue rectangles). A custom GELU implementation that performs the eight operations in a single shader (shown at the bottom of the figure) can bypass all the memory I/O for the intermediate tensors.
GELU implementations. Top: A naïve implementation with built-in operations would require 8 memory writes and 10 reads. Bottom: Our custom GELU only requires 1 memory read (for x) and 1 write (for y).
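To make the read and write counts concrete, here is a NumPy sketch of the tanh approximation of GELU, y = 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³))), written once as eight separate operations and once as a single expression. The particular split into eight steps is an assumption chosen to match the counts quoted above (seven intermediates, three reads of x); the figure's exact decomposition may differ.

```python
import numpy as np

SQRT_2_OVER_PI = np.sqrt(2.0 / np.pi)

def gelu_unfused(x):
    # Eight separate operations; on a GPU each would be its own program
    # that writes its result to memory for the next one to read back.
    t1 = x ** 3                 # reads x
    t2 = 0.044715 * t1
    t3 = x + t2                 # reads x
    t4 = SQRT_2_OVER_PI * t3
    t5 = np.tanh(t4)
    t6 = 1.0 + t5
    t7 = 0.5 * t6
    return x * t7               # reads x, writes y

def gelu_fused(x):
    # The same arithmetic as one expression, mirroring a single shader that
    # reads x once and writes y once. NumPy still allocates temporaries
    # internally; only the structure of the computation is illustrated.
    return 0.5 * x * (1.0 + np.tanh(SQRT_2_OVER_PI * (x + 0.044715 * x ** 3)))
```

In the unfused version, t1 through t7 account for seven writes and seven reads, and x and y add three reads and one write, giving the 8 writes and 10 reads in the caption.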
Results
After applying all of these optimizations, we conducted tests of Stable Diffusion 1.5 (image resolution 512×512, 20 iterations) on high-end mobile devices. Running Stable Diffusion with our GPU-accelerated ML inference model uses 2,093 MB for the weights and 84 MB for the intermediate tensors. On the latest high-end smartphones, Stable Diffusion can be run in under 12 seconds.
Conclusion
Performing on-device ML inference of large models has proven to be a substantial challenge, encompassing limitations in model file size, extensive runtime memory requirements, and protracted inference latency. By recognizing memory bandwidth usage as the primary bottleneck, we directed our efforts towards optimizing memory bandwidth utilization and striking a delicate balance between ALU efficiency and memory efficiency. As a result, we achieved state-of-the-art inference latency for large diffusion models. You can learn more about this work in the paper.
Acknowledgments
We’d like to thank Yu-Hui Chen, Jiuqiang Tang, Frank Barchard, Yang Zhao, Joe Zou, Khanh LeViet, Chuo-Ling Chang, Andrei Kulik, Lu Wang, and Matthias Grundmann.
NVIDIA will be showcased next week as the winner of the fiercely contested 3D Occupancy Prediction Challenge for autonomous driving development at the Computer Vision and Pattern Recognition Conference (CVPR), in Vancouver, Canada. The competition had more than 400 submissions from nearly 150 teams across 10 regions. 3D occupancy prediction is the process of forecasting…