Categories
Misc

Webinar: AI-Enabled Cybersecurity for Financial Services

Learn how financial firms can build automated, real-time fraud and threat detection solutions with NVIDIA Morpheus.

Categories
Misc

NVIDIA CEO: Creators Will Be “Supercharged” by Generative AI

Generative AI will “supercharge” creators across industries and content types, NVIDIA founder and CEO Jensen Huang said today at the Cannes Lions Festival, on the French Riviera. “For the very first time, the creative process can be amplified in content generation, and the content generation could be in any modality — it could be…

Categories
Misc

Visual Foundation Models for Medical Image Analysis

The analysis of 3D medical images is crucial for advancing clinical responses, disease tracking, and overall patient survival. Deep learning models form the backbone of modern 3D medical representation learning, enabling precise spatial context measurements that are essential for clinical decision-making. These 3D representations are highly sensitive to the physiological properties of medical imaging data, such as CT or MRI scans.

Medical image segmentation, a key visual task for medical applications, serves as a quantitative tool for measuring various aspects of medical images. To improve the analysis of these images, the development and application of foundation models are becoming increasingly important in the field of medical image analysis.

What are foundation models?

Foundation models, the latest generation of AI neural networks, are trained on extensive, diverse datasets and can be employed for a wide range of tasks or targets.

As large language models demonstrate their capability to tackle generic tasks, visual foundation models are emerging to address various problems, including classification, detection, and segmentation.

Foundation models can serve as powerful AI neural networks for segmenting different targets in medical images. This opens up a world of possibilities for medical imaging applications, enhancing the effectiveness of segmentation tasks and enabling more accurate measurements.

Challenges in medical image analysis

The application of medical foundation models in medical image analysis poses significant challenges. Unlike general computer vision models, medical image applications typically demand high-level domain knowledge.

Institutes have traditionally created fully annotated datasets for specific targets like spleens or tumors, relying solely on the association between input data features and target labels. Addressing multiple targets is more difficult, as manual annotations are laborious and time-consuming. Training larger or multi-task models is also increasingly challenging.

Despite recent advancements, there is still a long-standing issue in comprehending large medical imaging data due to its heterogeneity:

  • Medical volumetric data is often extremely high-resolution, necessitating substantial computational resources.
  • Current deep learning models have yet to effectively capture anatomical variability.
  • The large-scale nature of medical imaging data makes learning robust and efficient 3D representations difficult, particularly when dealing with heterogeneous data.

However, the modern analysis of high-resolution, high-dimensional, and large-scale medical volumetric data presents an opportunity to accelerate discoveries and obtain innovative insights into human body functions, behavior, and disease.

Foundation models offer the capability to address the heterogeneous variations that complicate the rectification of inter- and intra-subject differences. AI has the potential to revolutionize medical imaging by enabling more accurate and efficient analysis of large-scale, complex data.

A platform for medical visual segmentation foundation models

The MONAI Model Zoo serves as a platform for hosting medical visual foundation models. It contains a collection of pretrained models for medical imaging tasks developed using the Medical Open Network for AI (MONAI) framework.

The MONAI Model Zoo is a publicly available resource that provides access to a variety of pretrained models for different medical imaging tasks, such as segmentation, classification, registration, and synthesis. These pretrained models can be used as starting points or foundation models for training on new datasets or fine-tuning for specific applications.

The MONAI Model Zoo is designed to accelerate the development of new medical imaging applications and enable researchers and clinicians to leverage pre-existing models and build on top of them.
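As a rough sketch of that workflow (not an official recipe), the example below shows how a pretrained bundle could be pulled from the MONAI Model Zoo with the monai.bundle API and used as a starting point for fine-tuning. The bundle name is an assumed example; check the Model Zoo listing for the exact identifier.

```python
# Minimal sketch; the bundle name below is an assumption -- verify it in the Model Zoo.
from monai.bundle import download, load

BUNDLE_NAME = "wholeBody_ct_segmentation"  # assumed bundle identifier
BUNDLE_DIR = "./models"

# Fetch the bundle (configs, metadata, and pretrained weights) to a local directory.
download(name=BUNDLE_NAME, bundle_dir=BUNDLE_DIR)

# Load the pretrained weights; they can then be fine-tuned on a new dataset
# like any other PyTorch model instead of training from scratch.
pretrained_weights = load(name=BUNDLE_NAME, bundle_dir=BUNDLE_DIR)
```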

Whole-body CT segmentation

Segmenting the entirety of a whole-body CT scan with a single model is a daunting task. However, the MONAI team has risen to the challenge, developing a single model that segments all 104 anatomical structures:

  • 27 organs
  • 59 bones
  • 10 muscles
  • 8 vessels

Using the dataset released by the TotalSegmentator team, MONAI conducted research and benchmarking to achieve fast inference times. For a high-resolution 1.5 mm model, the inference time on a single NVIDIA V100 GPU for all 104 structures is just 4.12 seconds, while the inference time on a CPU is 30.30 seconds. This is a significant improvement over the inference time reported in the original paper, which was more than 1 minute for a single CT scan.

To access the MONAI Whole Body CT Segmentation foundation model, see the MONAI Model Zoo.

For an overview of all anatomical structures in whole-body CT scans, see the whitepaper TotalSegmentator: robust segmentation of 104 anatomical structures in CT images.

3D Slicer user interface showing segmentations of a torso from multiple angles.
Figure 1. Segmenting 104 anatomical structures in whole-body CT scan

(Source: TotalSegmentator: robust segmentation of 104 anatomical structures in CT images)

Whole-brain MRI segmentation

Whole-brain segmentation is a critical technique in medical image analysis, providing a non-invasive means of measuring brain regions from clinical structural magnetic resonance imaging (MRI). However, with over 130 substructures in the human brain, whole-brain segmentation is a difficult challenge for 3D MRI segmentation. Unfortunately, detailed annotations of the brain are scarce, making this task even more challenging for the medical imaging community.

To address this issue, the MONAI team collaborated with Vanderbilt University to develop a deep learning model that can simultaneously segment all 133 brain structures. Using 3D Slicer, the MONAI model can infer the entire brain in just 2.0 seconds. The MONAI whole brain MRI segmentation model represents a promising development in medical imaging research, offering a valuable resource for improving the accuracy of brain measurements in clinical settings.

Visit the MONAI Model Zoo to access the MONAI Whole Brain MRI Segmentation Foundation Model.

3D Slicer user interface showing segmentation of anatomical structures in a brain from multiple angles.
Figure 2. Segmenting 133 anatomical structures in T1 brain MRI scan

How to access medical imaging foundation models

The use of foundation models in medical image analysis has great potential to improve diagnostic accuracy and enhance patient care. However, it’s important to recognize that medical applications require strong domain knowledge.

With the ability to process large amounts of data and identify subtle patterns and anomalies, foundation models have proven to be valuable tools in the medical image analysis field. The development and refinement of these models is ongoing, with researchers and practitioners working to improve their accuracy and expand their capabilities.

Although challenges such as patient privacy and potential biases must be addressed, the use of foundation models has already demonstrated significant benefits. It is expected to play a more prominent role in healthcare in the future.

As researchers, clinicians, and users continue to focus on foundation models, the MONAI Model Zoo, a platform hosting pretrained medical image models, is amplifying its impact. Fine-tuning pretrained models is crucial to the future of medical image analysis.

The MONAI Model Zoo provides access to a diverse collection of pretrained models for various medical imaging tasks, including segmentation, classification, registration, and synthesis. By using these pre-existing models as starting points, researchers and clinicians can accelerate the development of new medical imaging applications, saving time and resources.

Join us in driving innovation and collaboration in medical imaging research by exploring the MONAI Model Zoo today.

Categories
Offsites

Google at CVPR 2023

This week marks the beginning of the premier annual Computer Vision and Pattern Recognition conference (CVPR 2023), held in-person in Vancouver, BC (with additional virtual content). As a leader in computer vision research and a Platinum Sponsor, Google Research will have a strong presence across CVPR 2023 with 90 papers being presented at the main conference and active involvement in over 40 conference workshops and tutorials.

If you are attending CVPR this year, please stop by our booth to chat with our researchers who are actively exploring the latest techniques for application to various areas of machine perception. Our researchers will also be available to talk about and demo several recent efforts, including on-device ML applications with MediaPipe, strategies for differential privacy, neural radiance field technologies and much more.

You can also learn more about our research being presented at CVPR 2023 in the list below (Google affiliations in bold).

Board and organizing committee

Senior area chairs include: Cordelia Schmid, Ming-Hsuan Yang

Area chairs include: Andre Araujo, Anurag Arnab, Rodrigo Benenson, Ayan Chakrabarti, Huiwen Chang, Alireza Fathi, Vittorio Ferrari, Golnaz Ghiasi, Boqing Gong, Yedid Hoshen, Varun Jampani, Lu Jiang, Da-Cheng Juan, Dahun Kim, Stephen Lombardi, Peyman Milanfar, Ben Mildenhall, Arsha Nagrani, Jordi Pont-Tuset, Paul Hongsuck Seo, Fei Sha, Saurabh Singh, Noah Snavely, Kihyuk Sohn, Chen Sun, Pratul P. Srinivasan, Deqing Sun, Andrea Tagliasacchi, Federico Tombari, Jasper Uijlings

Publicity Chair: Boqing Gong

Demonstration Chair: Jonathan T. Barron

Program Advisory Board includes: Cordelia Schmid, Richard Szeliski

Panels

History and Future of Artificial Intelligence and Computer Vision

Panelists include: Chelsea Finn

Scientific Discovery and the Environment

Panelists include: Sara Beery

Best Paper Award candidates

MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures

Zhiqin Chen, Thomas Funkhouser, Peter Hedman, Andrea Tagliasacchi

DynIBaR: Neural Dynamic Image-Based Rendering

Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker, Noah Snavely

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

Nataniel Ruiz*, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, Kfir Aberman

On Distillation of Guided Diffusion Models

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik Kingma, Stefano Ermon, Jonathan Ho, Tim Salimans

Highlight papers

Connecting Vision and Language with Video Localized Narratives

Paul Voigtlaender, Soravit Changpinyo, Jordi Pont-Tuset, Radu Soricut, Vittorio Ferrari

MaskSketch: Unpaired Structure-Guided Masked Image Generation

Dina Bashkirova*, Jose Lezama, Kihyuk Sohn, Kate Saenko, Irfan Essa

SPARF: Neural Radiance Fields from Sparse and Noisy Poses

Prune Truong*, Marie-Julie Rakotosaona, Fabian Manhardt, Federico Tombari

MAGVIT: Masked Generative Video Transformer

Lijun Yu*, Yong Cheng, Kihyuk Sohn, Jose Lezama, Han Zhang, Huiwen Chang, Alexander Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang

Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

Dahun Kim, Anelia Angelova, Weicheng Kuo

I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification

Muhammad Ferjad Naeem, Gul Zain Khan, Yongqin Xian, Muhammad Zeshan Afzal, Didier Stricker, Luc Van Gool, Federico Tombari

Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization

Zifan Wang*, Nan Ding, Tomer Levinboim, Xi Chen, Radu Soricut

Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting (see blog post)

Su Wang, Chitwan Saharia, Ceslee Montgomery, Jordi Pont-Tuset, Shai Noy, Stefano Pellegrini, Yasumasa Onoe, Sarah Laszlo, David J. Fleet, Radu Soricut, Jason Baldridge, Mohammad Norouzi, Peter Anderson, William Chan

RUST: Latent Neural Scene Representations from Unposed Imagery

Mehdi S. M. Sajjadi, Aravindh Mahendran, Thomas Kipf, Etienne Pot, Daniel Duckworth, Mario Lučić, Klaus Greff

REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory (see blog post)

Ziniu Hu*, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David Ross, Alireza Fathi

RobustNeRF: Ignoring Distractors with Robust Losses

Sara Sabour, Suhani Vora, Daniel Duckworth, Ivan Krasin, David J. Fleet, Andrea Tagliasacchi

Papers

AligNeRF: High-Fidelity Neural Radiance Fields via Alignment-Aware Training

Yifan Jiang*, Peter Hedman, Ben Mildenhall, Dejia Xu, Jonathan T. Barron, Zhangyang Wang, Tianfan Xue*

BlendFields: Few-Shot Example-Driven Facial Modeling

Kacper Kania, Stephan Garbin, Andrea Tagliasacchi, Virginia Estellers, Kwang Moo Yi, Tomasz Trzcinski, Julien Valentin, Marek Kowalski

Enhancing Deformable Local Features by Jointly Learning to Detect and Describe Keypoints

Guilherme Potje, Felipe Cadar, Andre Araujo, Renato Martins, Erickson Nascimento

How Can Objects Help Action Recognition?

Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

Hybrid Neural Rendering for Large-Scale Scenes with Motion Blur

Peng Dai, Yinda Zhang, Xin Yu, Xiaoyang Lyu, Xiaojuan Qi

IFSeg: Image-Free Semantic Segmentation via Vision-Language Model

Sukmin Yun, Seong Park, Paul Hongsuck Seo, Jinwoo Shin

Learning from Unique Perspectives: User-Aware Saliency Modeling (see blog post)

Shi Chen*, Nachiappan Valliappan, Shaolei Shen, Xinyu Ye, Kai Kohlhoff, Junfeng He

MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis

Tianhong Li*, Huiwen Chang, Shlok Kumar Mishra, Han Zhang, Dina Katabi, Dilip Krishnan

NeRF-Supervised Deep Stereo

Fabio Tosi, Alessio Tonioni, Daniele De Gregorio, Matteo Poggi

Omnimatte3D: Associating Objects and their Effects in Unconstrained Monocular Video

Mohammed Suhail, Erika Lu, Zhengqi Li, Noah Snavely, Leon Sigal, Forrester Cole

OpenScene: 3D Scene Understanding with Open Vocabularies

Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser

PersonNeRF: Personalized Reconstruction from Photo Collections

Chung-Yi Weng, Pratul Srinivasan, Brian Curless, Ira Kemelmacher-Shlizerman

Prefix Conditioning Unifies Language and Label Supervision

Kuniaki Saito*, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, Tomas Pfister

Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning (see blog post)

AJ Piergiovanni, Weicheng Kuo, Anelia Angelova

Burstormer: Burst Image Restoration and Enhancement Transformer

Akshay Dudhane, Syed Waqas Zamir, Salman Khan, Fahad Shahbaz Khan, Ming-Hsuan Yang

Decentralized Learning with Multi-Headed Distillation

Andrey Zhmoginov, Mark Sandler, Nolan Miller, Gus Kristiansen, Max Vladymyrov

GINA-3D: Learning to Generate Implicit Neural Assets in the Wild

Bokui Shen, Xinchen Yan, Charles R. Qi, Mahyar Najibi, Boyang Deng, Leonidas Guibas, Yin Zhou, Dragomir Anguelov

Grad-PU: Arbitrary-Scale Point Cloud Upsampling via Gradient Descent with Learned Distance Functions

Yun He, Danhang Tang, Yinda Zhang, Xiangyang Xue, Yanwei Fu

Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery from Sparse Image Ensemble

Chun-Han Yao*, Wei-Chih Hung, Yuanzhen Li, Michael Rubinstein, Ming-Hsuan Yang, Varun Jampani

Hyperbolic Contrastive Learning for Visual Representations beyond Objects

Songwei Ge, Shlok Mishra, Simon Kornblith, Chun-Liang Li, David Jacobs

Imagic: Text-Based Real Image Editing with Diffusion Models

Bahjat Kawar*, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, Michal Irani

Incremental 3D Semantic Scene Graph Prediction from RGB Sequences

Shun-Cheng Wu, Keisuke Tateno, Nassir Navab, Federico Tombari

IPCC-TP: Utilizing Incremental Pearson Correlation Coefficient for Joint Multi-Agent Trajectory Prediction

Dekai Zhu, Guangyao Zhai, Yan Di, Fabian Manhardt, Hendrik Berkemeyer, Tuan Tran, Nassir Navab, Federico Tombari, Benjamin Busam

Learning to Generate Image Embeddings with User-Level Differential Privacy

Zheng Xu, Maxwell Collins, Yuxiao Wang, Liviu Panait, Sewoong Oh, Sean Augenstein, Ting Liu, Florian Schroff, H. Brendan McMahan

NoisyTwins: Class-Consistent and Diverse Image Generation Through StyleGANs

Harsh Rangwani, Lavish Bansal, Kartik Sharma, Tejan Karmali, Varun Jampani, Venkatesh Babu Radhakrishnan

NULL-Text Inversion for Editing Real Images Using Guided Diffusion Models

Ron Mokady*, Amir Hertz*, Kfir Aberman, Yael Pritch, Daniel Cohen-Or*

SCOOP: Self-Supervised Correspondence and Optimization-Based Scene Flow

Itai Lang*, Dror Aiger, Forrester Cole, Shai Avidan, Michael Rubinstein

Shape, Pose, and Appearance from a Single Image via Bootstrapped Radiance Field Inversion

Dario Pavllo*, David Joseph Tan, Marie-Julie Rakotosaona, Federico Tombari

TexPose: Neural Texture Learning for Self-Supervised 6D Object Pose Estimation

Hanzhi Chen, Fabian Manhardt, Nassir Navab, Benjamin Busam

TryOnDiffusion: A Tale of Two UNets

Luyang Zhu*, Dawei Yang, Tyler Zhu, Fitsum Reda, William Chan, Chitwan Saharia, Mohammad Norouzi, Ira Kemelmacher-Shlizerman

A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning

Aishwarya Kamath*, Peter Anderson, Su Wang, Jing Yu Koh*, Alexander Ku, Austin Waters, Yinfei Yang*, Jason Baldridge, Zarana Parekh

CLIPPO: Image-and-Language Understanding from Pixels Only

Michael Tschannen, Basil Mustafa, Neil Houlsby

Controllable Light Diffusion for Portraits

David Futschik, Kelvin Ritland, James Vecore, Sean Fanello, Sergio Orts-Escolano, Brian Curless, Daniel Sýkora, Rohit Pandey

CUF: Continuous Upsampling Filters

Cristina Vasconcelos, Cengiz Oztireli, Mark Matthews, Milad Hashemi, Kevin Swersky, Andrea Tagliasacchi

Improving Zero-Shot Generalization and Robustness of Multi-modal Models

Yunhao Ge*, Jie Ren, Andrew Gallagher, Yuxiao Wang, Ming-Hsuan Yang, Hartwig Adam, Laurent Itti, Balaji Lakshminarayanan, Jiaping Zhao

LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding

Gen Li, Varun Jampani, Deqing Sun, Laura Sevilla-Lara

Nerflets: Local Radiance Fields for Efficient Structure-Aware 3D Scene Representation from 2D Supervision

Xiaoshuai Zhang, Abhijit Kundu, Thomas Funkhouser, Leonidas Guibas, Hao Su, Kyle Genova

Self-Supervised AutoFlow

Hsin-Ping Huang, Charles Herrmann, Junhwa Hur, Erika Lu, Kyle Sargent, Austin Stone, Ming-Hsuan Yang, Deqing Sun

Train-Once-for-All Personalization

Hong-You Chen*, Yandong Li, Yin Cui, Mingda Zhang, Wei-Lun Chao, Li Zhang

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning (see blog post)

Antoine Yang*, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid

VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining

Junjie Ke, Keren Ye, Jiahui Yu, Yonghui Wu, Peyman Milanfar, Feng Yang

You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model

Shengkun Tang, Yaqing Wang, Zhenglun Kong, Tianchi Zhang, Yao Li, Caiwen Ding, Yanzhi Wang, Yi Liang, Dongkuan Xu

Accidental Light Probes

Hong-Xing Yu, Samir Agarwala, Charles Herrmann, Richard Szeliski, Noah Snavely, Jiajun Wu, Deqing Sun

FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning

Yuanhao Xiong, Ruochen Wang, Minhao Cheng, Felix Yu, Cho-Jui Hsieh

FlexiViT: One Model for All Patch Sizes

Lucas Beyer, Pavel Izmailov, Alexander Kolesnikov, Mathilde Caron, Simon Kornblith, Xiaohua Zhai, Matthias Minderer, Michael Tschannen, Ibrahim Alabdulmohsin, Filip Pavetic

Iterative Vision-and-Language Navigation

Jacob Krantz, Shurjo Banerjee, Wang Zhu, Jason Corso, Peter Anderson, Stefan Lee, Jesse Thomason

MoDi: Unconditional Motion Synthesis from Diverse Data

Sigal Raab, Inbal Leibovitch, Peizhuo Li, Kfir Aberman, Olga Sorkine-Hornung, Daniel Cohen-Or

Multimodal Prompting with Missing Modalities for Visual Recognition

Yi-Lun Lee, Yi-Hsuan Tsai, Wei-Chen Chiu, Chen-Yu Lee

Scene-Aware Egocentric 3D Human Pose Estimation

Jian Wang, Diogo Luvizon, Weipeng Xu, Lingjie Liu, Kripasindhu Sarkar, Christian Theobalt

ShapeClipper: Scalable 3D Shape Learning from Single-View Images via Geometric and CLIP-Based Consistency

Zixuan Huang, Varun Jampani, Ngoc Anh Thai, Yuanzhen Li, Stefan Stojanov, James M. Rehg

Improving Image Recognition by Retrieving from Web-Scale Image-Text Data

Ahmet Iscen, Alireza Fathi, Cordelia Schmid

JacobiNeRF: NeRF Shaping with Mutual Information Gradients

Xiaomeng Xu, Yanchao Yang, Kaichun Mo, Boxiao Pan, Li Yi, Leonidas Guibas

Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos

Ziqian Bai*, Feitong Tan, Zeng Huang, Kripasindhu Sarkar, Danhang Tang, Di Qiu, Abhimitra Meka, Ruofei Du, Mingsong Dou, Sergio Orts-Escolano, Rohit Pandey, Ping Tan, Thabo Beeler, Sean Fanello, Yinda Zhang

NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis

Allan Zhou, Mo Jin Kim, Lirui Wang, Pete Florence, Chelsea Finn

Pic2Word: Mapping Pictures to Words for Zero-Shot Composed Image Retrieval

Kuniaki Saito*, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, Tomas Pfister

SCADE: NeRFs from Space Carving with Ambiguity-Aware Depth Estimates

Mikaela Uy, Ricardo Martin Brualla, Leonidas Guibas, Ke Li

Structured 3D Features for Reconstructing Controllable Avatars

Enric Corona, Mihai Zanfir, Thiemo Alldieck, Eduard Gabriel Bazavan, Andrei Zanfir, Cristian Sminchisescu

Token Turing Machines

Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab

TruFor: Leveraging All-Round Clues for Trustworthy Image Forgery Detection and Localization

Fabrizio Guillaro, Davide Cozzolino, Avneesh Sud, Nicholas Dufour, Luisa Verdoliva

Video Probabilistic Diffusion Models in Projected Latent Space

Sihyun Yu, Kihyuk Sohn, Subin Kim, Jinwoo Shin

Visual Prompt Tuning for Generative Transfer Learning

Kihyuk Sohn, Yuan Hao, Jose Lezama, Luisa Polania, Huiwen Chang, Han Zhang, Irfan Essa, Lu Jiang

Zero-Shot Referring Image Segmentation with Global-Local Context Features

Seonghoon Yu, Paul Hongsuck Seo, Jeany Son

AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR (see blog post)

Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

DC2: Dual-Camera Defocus Control by Learning to Refocus

Hadi Alzayer, Abdullah Abuolaim, Leung Chun Chan, Yang Yang, Ying Chen Lou, Jia-Bin Huang, Abhishek Kar

Edges to Shapes to Concepts: Adversarial Augmentation for Robust Vision

Aditay Tripathi*, Rishubh Singh, Anirban Chakraborty, Pradeep Shenoy

MetaCLUE: Towards Comprehensive Visual Metaphors Research

Arjun R. Akula, Brendan Driscoll, Pradyumna Narayana, Soravit Changpinyo, Zhiwei Jia, Suyash Damle, Garima Pruthi, Sugato Basu, Leonidas Guibas, William T. Freeman, Yuanzhen Li, Varun Jampani

Multi-Realism Image Compression with a Conditional Generator

Eirikur Agustsson, David Minnen, George Toderici, Fabian Mentzer

NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors

Congyue Deng, Chiyu Jiang, Charles R. Qi, Xinchen Yan, Yin Zhou, Leonidas Guibas, Dragomir Anguelov

On Calibrating Semantic Segmentation Models: Analyses and an Algorithm

Dongdong Wang, Boqing Gong, Liqiang Wang

Persistent Nature: A Generative Model of Unbounded 3D Worlds

Lucy Chai, Richard Tucker, Zhengqi Li, Phillip Isola, Noah Snavely

Rethinking Domain Generalization for Face Anti-spoofing: Separability and Alignment

Yiyou Sun*, Yaojie Liu, Xiaoming Liu, Yixuan Li, Wen-Sheng Chu

SINE: Semantic-Driven Image-Based NeRF Editing with Prior-Guided Editing Field

Chong Bao, Yinda Zhang, Bangbang Yang, Tianxing Fan, Zesong Yang, Hujun Bao, Guofeng Zhang, Zhaopeng Cui

Sequential Training of GANs Against GAN-Classifiers Reveals Correlated “Knowledge Gaps” Present Among Independently Trained GAN Instances

Arkanath Pathak, Nicholas Dufour

SparsePose: Sparse-View Camera Pose Regression and Refinement

Samarth Sinha, Jason Zhang, Andrea Tagliasacchi, Igor Gilitschenski, David Lindell

Teacher-Generated Spatial-Attention Labels Boost Robustness and Accuracy of Contrastive Models

Yushi Yao, Chang Ye, Gamaleldin F. Elsayed, Junfeng He

Workshops

Computer Vision for Mixed Reality

Speakers include: Ira Kemelmacher-Shlizerman

Workshop on Autonomous Driving (WAD)

Speakers include: Chelsea Finn

Multimodal Content Moderation (MMCM)

Organizers include: Chris Bregler

Speakers include: Mevan Babakar

Medical Computer Vision (MCV)

Speakers include: Shekoofeh Azizi

VAND: Visual Anomaly and Novelty Detection

Speakers include: Yedid Hoshen, Jie Ren

Structural and Compositional Learning on 3D Data

Organizers include: Leonidas Guibas

Speakers include: Andrea Tagliasacchi, Fei Xia, Amir Hertz

Fine-Grained Visual Categorization (FGVC10)

Organizers include: Kimberly Wilber, Sara Beery

Panelists include: Hartwig Adam

XRNeRF: Advances in NeRF for the Metaverse

Organizers include: Jonathan T. Barron

Speakers include: Ben Poole

OmniLabel: Infinite Label Spaces for Semantic Understanding via Natural Language

Organizers include: Golnaz Ghiasi, Long Zhao

Speakers include: Vittorio Ferrari

Large Scale Holistic Video Understanding

Organizers include: David Ross

Speakers include: Cordelia Schmid

New Frontiers for Zero-Shot Image Captioning Evaluation (NICE)

Speakers include: Cordelia Schmid

Computational Cameras and Displays (CCD)

Organizers include: Ulugbek Kamilov

Speakers include: Mauricio Delbracio

Gaze Estimation and Prediction in the Wild (GAZE)

Organizers include: Thabo Beeler

Speakers include: Erroll Wood

Face and Gesture Analysis for Health Informatics (FGAHI)

Speakers include: Daniel McDuff

Computer Vision for Animal Behavior Tracking and Modeling (CV4Animals)

Organizers include: Sara Beery

Speakers include: Arsha Nagrani

3D Vision and Robotics

Speakers include: Pete Florence

End-to-End Autonomous Driving: Perception, Prediction, Planning and Simulation (E2EAD)

Organizers include: Anurag Arnab

End-to-End Autonomous Driving: Emerging Tasks and Challenges

Speakers include: Sergey Levine

Multi-Modal Learning and Applications (MULA)

Speakers include: Aleksander Hołyński

Synthetic Data for Autonomous Systems (SDAS)

Speakers include: Lukas Hoyer

Vision Datasets Understanding

Organizers include: José Lezama

Speakers include: Vijay Janapa Reddi

Precognition: Seeing Through the Future

Organizers include: Utsav Prabhu

New Trends in Image Restoration and Enhancement (NTIRE)

Organizers include: Ming-Hsuan Yang

Generative Models for Computer Vision

Speakers include: Ben Mildenhall, Andrea Tagliasacchi

Adversarial Machine Learning on Computer Vision: Art of Robustness

Organizers include: Xinyun Chen

Speakers include: Deqing Sun

Media Forensics

Speakers include: Nicholas Carlini

Tracking and Its Many Guises: Tracking Any Object in Open-World

Organizers include: Paul Voigtlaender

3D Scene Understanding for Vision, Graphics, and Robotics

Speakers include: Andy Zeng

Computer Vision for Physiological Measurement (CVPM)

Organizers include: Daniel McDuff

Affective Behaviour Analysis In-the-Wild

Organizers include: Stefanos Zafeiriou

Ethical Considerations in Creative Applications of Computer Vision (EC3V)

Organizers include: Rida Qadri, Mohammad Havaei, Fernando Diaz, Emily Denton, Sarah Laszlo, Negar Rostamzadeh, Pamela Peter-Agbia, Eva Kozanecka

VizWiz Grand Challenge: Describing Images and Videos Taken by Blind People

Speakers include: Haoran Qi

Efficient Deep Learning for Computer Vision (see blog post)

Organizers include: Andrew Howard, Chas Leichner

Speakers include: Andrew Howard

Visual Copy Detection

Organizers include: Priya Goyal

Learning 3D with Multi-View Supervision (3DMV)

Speakers include: Ben Poole

Image Matching: Local Features and Beyond

Organizers include: Eduard Trulls

Vision for All Seasons: Adverse Weather and Lighting Conditions (V4AS)

Organizers include: Lukas Hoyer

Transformers for Vision (T4V)

Speakers include: Cordelia Schmid, Huiwen Chang

Scholars vs Big Models — How Can Academics Adapt?

Organizers include: Sara Beery

Speakers include: Jonathan T. Barron, Cordelia Schmid

ScanNet Indoor Scene Understanding Challenge

Speakers include: Tom Funkhouser

Computer Vision for Microscopy Image Analysis

Speakers include: Po-Hsuan Cameron Chen

Embedded Vision

Speakers include: Rahul Sukthankar

Sight and Sound

Organizers include: Arsha Nagrani, William Freeman

AI for Content Creation

Organizers include: Deqing Sun, Huiwen Chang, Lu Jiang

Speakers include: Ben Mildenhall, Tim Salimans, Yuanzhen Li

Computer Vision in the Wild

Organizers include: Xiuye Gu, Neil Houlsby

Speakers include: Boqing Gong, Anelia Angelova

Visual Pre-Training for Robotics

Organizers include: Mathilde Caron

Omnidirectional Computer Vision

Organizers include: Yi-Hsuan Tsai

Tutorials

All Things ViTs: Understanding and Interpreting Attention in Vision

Hila Chefer, Sayak Paul

Recent Advances in Anomaly Detection

Guansong Pang, Joey Tianyi Zhou, Radu Tudor Ionescu, Yu Tian, Kihyuk Sohn

Contactless Healthcare Using Cameras and Wireless Sensors

Wenjin Wang, Xuyu Wang, Jun Luo, Daniel McDuff

Object Localization for Free: Going Beyond Self-Supervised Learning

Oriane Simeoni, Weidi Xie, Thomas Kipf, Patrick Pérez

Prompting in Vision

Kaiyang Zhou, Ziwei Liu, Phillip Isola, Hyojin Bahng, Ludwig Schmidt, Sarah Pratt, Denny Zhou


* Work done while at Google

Categories
Offsites

Speed is all you need: On-device acceleration of large diffusion models via GPU-aware optimizations

The proliferation of large diffusion models for image generation has led to a significant increase in model size and inference workloads. On-device ML inference in mobile environments requires meticulous performance optimization and consideration of trade-offs due to resource constraints. Running inference of large diffusion models (LDMs) on-device, driven by the need for cost efficiency and user privacy, presents even greater challenges due to the substantial memory requirements and computational demands of these models.

We address this challenge in our work titled “Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations” (to be presented at the CVPR 2023 workshop for Efficient Deep Learning for Computer Vision), focusing on the optimized execution of a foundational LDM on a mobile GPU. In this blog post, we summarize the core techniques we employed to execute large diffusion models, such as Stable Diffusion, at full resolution (512×512 pixels) and 20 iterations on modern smartphones, running the original model without distillation in under 12 seconds. As discussed in our previous blog post, GPU-accelerated ML inference is often limited by memory performance, and execution of LDMs is no exception. Therefore, the central theme of our optimization is efficient memory input/output (I/O), even if it means choosing memory-efficient algorithms over those that prioritize arithmetic logic unit efficiency. Ultimately, our primary objective is to reduce the overall latency of the ML inference.

A sample output of an LDM on Mobile GPU with the prompt text: “a photo realistic and high resolution image of a cute puppy with surrounding flowers”.

Enhanced attention module for memory efficiency

An ML inference engine typically provides a variety of optimized ML operations. Despite this, achieving optimal performance can still be challenging as there is a certain amount of overhead for executing individual neural net operators on a GPU. To mitigate this overhead, ML inference engines incorporate extensive operator fusion rules that consolidate multiple operators into a single operator, thereby reducing the number of iterations across tensor elements while maximizing compute per iteration. For instance, TensorFlow Lite utilizes operator fusion to combine computationally expensive operations, like convolutions, with subsequent activation functions, like rectified linear units, into one.

A clear opportunity for optimization is the heavily used attention block adopted in the denoiser model in the LDM. The attention blocks allow the model to focus on specific parts of the input by assigning higher weights to important regions. There are multiple ways one can optimize the attention modules, and we selectively employ one of the two optimizations explained below depending on which optimization performs better.

The first optimization, which we call partially fused softmax, removes the need for extensive memory writes and reads between the softmax and the matrix multiplication in the attention module. Let the attention block be just a simple matrix multiplication of the form Y = softmax(X) * W where X and W are 2D matrices of shape a×b and b×c, respectively (shown below in the top half).

For numerical stability, T = softmax(X) is typically calculated in three passes:

  1. Determine the maximum value in the list, i.e., for each row in matrix X
  2. Sum up the exponentials of the differences between each list item and the maximum value (from pass 1)
  3. Divide the exponential of each item minus the maximum value by the sum from pass 2

Carrying out these passes naïvely would result in a huge memory write for the temporary intermediate tensor T holding the output of the entire softmax function. We bypass this large memory write if we only store the results of passes 1 and 2, labeled m and s, respectively, which are small vectors with a elements each, compared to T, which has a·b elements. With this technique, we reduce the memory consumption of the intermediate tensor by multiple orders of magnitude, saving tens or even hundreds of megabytes (shown below in the bottom half).

Attention modules. Top: A naïve attention block, composed of a SOFTMAX (with all three passes) and a MATMUL, requires a large memory write for the big intermediate tensor T. Bottom: Our memory-efficient attention block with partially fused softmax in MATMUL only needs to store two small intermediate tensors for m and s.
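To make the bookkeeping concrete, here is a small NumPy sketch of the same idea (illustrative only; the real implementation fuses pass 3 into the MATMUL shader on the GPU). It contrasts the naïve path, which materializes the full a×b tensor T, with one that stores only the per-row vectors m and s:

```python
import numpy as np

def attention_naive(X, W):
    # Materializes the full a-by-b softmax output T before the matrix multiplication.
    T = np.exp(X - X.max(axis=1, keepdims=True))
    T /= T.sum(axis=1, keepdims=True)
    return T @ W

def attention_partially_fused(X, W):
    m = X.max(axis=1, keepdims=True)              # pass 1: row maxima, a values
    s = np.exp(X - m).sum(axis=1, keepdims=True)  # pass 2: row sums, a values
    # Pass 3 is conceptually folded into the MATMUL: exp(X - m) / s is recomputed
    # tile by tile inside the kernel, so T is never written to memory.
    # It is shown unfused here only for readability.
    return (np.exp(X - m) / s) @ W

rng = np.random.default_rng(0)
X, W = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))
assert np.allclose(attention_naive(X, W), attention_partially_fused(X, W))
```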

The other optimization involves employing FlashAttention, which is an I/O-aware, exact attention algorithm. This algorithm reduces the number of GPU high-bandwidth memory accesses, making it a good fit for our memory bandwidth–limited use case. However, we found this technique to only work for SRAM with certain sizes and to require a large number of registers. Therefore, we only leverage this technique for attention matrices with a certain size on a select set of GPUs.

Winograd fast convolution for 3×3 convolution layers

The backbone of common LDMs heavily relies on 3×3 convolution layers (convolutions with filter size 3×3), comprising over 90% of the layers in the decoder. Despite increased memory consumption and numerical errors, we found Winograd fast convolution to be effective at speeding up these convolutions. Distinct from the filter size 3×3 used in convolutions, tile size refers to the size of a sub region of the input tensor that is processed at a time. Increasing the tile size enhances the efficiency of the convolution in terms of arithmetic logic unit (ALU) usage. However, this improvement comes at the expense of increased memory consumption. Our tests indicate that a tile size of 4×4 achieves the optimal trade-off between computational efficiency and memory utilization.

| Tile size | FLOPS savings | Memory usage: intermediate tensors | Memory usage: weights |
|-----------|---------------|------------------------------------|-----------------------|
| 2×2       | 2.25×         | 4.00×                              | 1.77×                 |
| 4×4       | 4.00×         | 2.25×                              | 4.00×                 |
| 6×6       | 5.06×         | 1.80×                              | 7.12×                 |
| 8×8       | 5.76×         | 1.56×                              | 11.1×                 |

Impact of Winograd with varying tile sizes for 3×3 convolutions.
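The FLOPS-savings column can be reproduced with a back-of-the-envelope count (ignoring the cost of the Winograd input, filter, and output transforms): a direct 3×3 convolution spends 9·m² multiplies per m×m output tile, while Winograd F(m×m, 3×3) spends (m+2)² multiplies on the transformed tile.

```python
# Back-of-the-envelope check of the FLOPS-savings column in the table above.
for m in (2, 4, 6, 8):
    direct = 9 * m * m        # multiplies per m-by-m output tile, direct 3x3 convolution
    winograd = (m + 2) ** 2   # multiplies per tile after the Winograd transform
    print(f"{m}x{m} tile: {direct / winograd:.2f}x fewer multiplies")
# Prints 2.25x, 4.00x, 5.06x, 5.76x, matching the table.
```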

Specialized operator fusion for memory efficiency

We discovered that performantly inferring LDMs on a mobile GPU requires significantly larger fusion windows for commonly employed layers and units in LDMs than current off-the-shelf on-device GPU-accelerated ML inference engines provide. Consequently, we developed specialized implementations that could execute a larger range of neural operators than typical fusion rules would permit. Specifically, we focused on two specializations: the Gaussian Error Linear Unit (GELU) and the group normalization layer.

An approximation of GELU with the hyperbolic tangent function requires writing to and reading from seven auxiliary intermediate tensors (shown as light orange rounded rectangles in the figure below), reading from the input tensor x three times, and writing to the output tensor y once, across eight GPU programs each implementing one of the labeled operations (light blue rectangles). A custom GELU implementation that performs all eight operations in a single shader (shown at the bottom) can bypass all the memory I/O for the intermediate tensors.

GELU implementations. Top: A naïve implementation with built-in operations would require 8 memory writes and 10 reads. Bottom: Our custom GELU only requires 1 memory read (for x) and 1 write (for y).
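For reference, the tanh approximation being fused is GELU(x) ≈ 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³))). The NumPy sketch below (illustrative only; the actual implementation is a single GPU shader) shows how the naïve graph produces seven auxiliary intermediates, while the fused form is one element-wise expression:

```python
import numpy as np

SQRT_2_OVER_PI = np.sqrt(2.0 / np.pi)

def gelu_tanh_unfused(x):
    # Each named intermediate below corresponds to a tensor an off-the-shelf
    # engine would write out and read back (seven auxiliaries plus the output).
    x3 = x * x * x
    a = 0.044715 * x3
    b = x + a
    c = SQRT_2_OVER_PI * b
    t = np.tanh(c)
    d = 1.0 + t
    e = 0.5 * d
    return x * e

def gelu_tanh_fused(x):
    # One element-wise expression: no intermediate tensors hit memory when this
    # is compiled as a single shader.
    return 0.5 * x * (1.0 + np.tanh(SQRT_2_OVER_PI * (x + 0.044715 * x * x * x)))

x = np.linspace(-3, 3, 7)
assert np.allclose(gelu_tanh_unfused(x), gelu_tanh_fused(x))
```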

Results

After applying all of these optimizations, we conducted tests of Stable Diffusion 1.5 (image resolution 512×512, 20 iterations) on high-end mobile devices. Running Stable Diffusion with our GPU-accelerated ML inference model uses 2,093MB for the weights and 84MB for the intermediate tensors. On the latest high-end smartphones, Stable Diffusion can be run in under 12 seconds.

Stable Diffusion runs on modern smartphones in under 12 seconds. Note that running the decoder after each iteration for displaying the intermediate output in this animated GIF results in a ~2× slowdown.

Conclusion

Performing on-device ML inference of large models has proven to be a substantial challenge, encompassing limitations in model file size, extensive runtime memory requirements, and protracted inference latency. By recognizing memory bandwidth usage as the primary bottleneck, we directed our efforts towards optimizing memory bandwidth utilization and striking a delicate balance between ALU efficiency and memory efficiency. As a result, we achieved state-of-the-art inference latency for large diffusion models. You can learn more about this work in the paper.

Acknowledgments

We’d like to thank Yu-Hui Chen, Jiuqiang Tang, Frank Barchard, Yang Zhao, Joe Zou, Khanh LeViet, Chuo-Ling Chang, Andrei Kulik, Lu Wang, and Matthias Grundmann.

Categories
Misc

NVIDIA Research Wins Autonomous Driving Challenge, Innovation Award at CVPR

NVIDIA will be showcased next week as the winner of the fiercely contested 3D Occupancy Prediction Challenge for autonomous driving development at the Computer Vision and Pattern Recognition Conference (CVPR), in Vancouver, Canada. The competition had more than 400 submissions from nearly 150 teams across 10 regions. 3D occupancy prediction is the process of forecasting…

Categories
Misc

Do Pass Go, Do Collect More Games: Xbox Game Pass Coming to GeForce NOW

Xbox Game Pass support is coming to GeForce NOW. Members will soon be able to play supported PC games from the Xbox Game Pass catalog through NVIDIA’s cloud gaming servers. Learn more about how support for Game Pass and Microsoft Store will roll out in the coming months. Plus, Age of Empires IV: Anniversary Edition…

Categories
Misc

NVIDIA AI Red Team: An Introduction

Machine learning has the promise to improve our world, and in many ways it already has. However, research and lived experiences continue to show this technology has risks. Capabilities that used to be restricted to science fiction and academia are increasingly available to the public. The responsible use and development of AI requires categorizing, assessing, and mitigating enumerated risks where practical. This is true from a pure AI standpoint but also from a standard information security perspective.

Until standards are in place and mature testing has taken hold, organizations are using red teams to explore and enumerate the immediate risks presented by AI. This post introduces the NVIDIA AI red team philosophy and the general framing of ML systems.

Assessment foundations

Our AI red team is a cross-functional team made up of offensive security professionals and data scientists. We use our combined skills to assess our ML systems to identify and help mitigate any risks from the perspective of information security.

Information security has a lot of useful paradigms, tools, and network access that enable us to accelerate responsible use in all areas. This framework is our foundation and directs assessment efforts toward a standard within the organization. We use it to guide assessments (Figure 1) toward the following goals:

  • The risks that our organization cares about and wants to eliminate are addressed.
  • Required assessment activities and the various tactics, techniques, and procedures (TTPs) are clearly defined. TTPs can be added without changing existing structures.
  • The systems and technologies in scope for our assessments are clearly defined. This helps us remain focused on ML systems and not stray into other areas.
  • All efforts live within a single framework that stakeholders can reference and immediately get a broad overview of what ML security looks like.

This helps us set expectations for what an assessment looks like, what systems we could potentially be affecting, and the risks that we address. This framework is not specific to red teaming, but some of these properties are the basis for a functional ML security program, of which red teaming is a small part.

Diagram shows threat model assessment including GRC, methodology, ML development phases, the tech stack, and the intra- and Internet networks.
Figure 1. AI red team assessment framework

The specific technologies—and which features go where—are not necessarily important. The important part is that there is a place for everything to go, whether you’re red teaming, vulnerability scanning, or doing any sort of assessment of an ML system.

This framework enables us to address specific issues in specific parts of the ML pipeline, infrastructure, or technologies. It becomes a place to communicate risk about issues to affected systems: up and down the stack and informing policy and technology.

Any given subsection can be isolated, expanded, and described within the context of the whole system. Here are some examples:

  • Evasion is expanded to include specific algorithms or TTPs that would be relevant for an assessment of a particular model type. Red teams can point to exactly the infrastructure components that are affected.
  • Technical vulnerabilities can affect any level of infrastructure or just a specific application. They can be dealt with in the context of their function and risk-rated accordingly.
  • Harm-and-abuse scenarios that are foreign to many information security practitioners are not only included but are integrated. In this way, we motivate technical teams to consider harm-and-abuse scenarios as they’re assessing ML systems. Or, they can provide ethics teams access to tools and expertise.
  • Requirements handed down can be integrated more quickly, both old and new.

There are many benefits to a framework like this. Consider how a disclosure process can benefit from this composed view. The core building blocks are governance, risk, and compliance (GRC) and ML development.

Governance, risk, and compliance

As in many organizations, GRC is the top level of information security efforts, ensuring that business security requirements are enumerated, communicated, and implemented. As an AI red team under the banner of information security, here are the high-level risks we’re interested in surfacing:

  • Technical risk: ML systems or processes are compromised as the result of a technical vulnerability or shortcoming.
  • Reputational risk: Model performance or behavior reflects poorly on the organization. In this new paradigm, this could include releasing a model that has a broad societal impact.
  • Compliance risk: The ML system is out of compliance, leading to fines or reduced market competitiveness, much like PCI or GDPR.

These high-level risk categories are present in all information systems, including ML systems. Think of these categories like individually colored lenses on a light. Using each colored lens provides a different perspective of risks with respect to the underlying system, and sometimes the risks can be additive. For example, a technical vulnerability that leads to a breach can cause reputational damage. Depending on where the breach occurred, compliance could also require breach notification, fines, and so on.

Even if ML didn’t come with its own vulnerabilities, it is still developed, stored, and deployed on an infrastructure that is subject to standards set by GRC efforts. All assets within an organization are subject to being compliant with GRC standards. And if they aren’t, it’s ideally only because management filed and approved an exception.

ML development

The bottom of the stack is the ML development lifecycle, as it is the activity that GRC wants insight into. We generally consider an ML system as any system that involves ML, inclusive of the processes and systems by which models are built. Components of an ML system might include a web server that hosts a model for inference, a data lake holding training data, or a system that uses the model output to make a decision.

Development pipelines span multiple and sometimes incongruent systems. Each phase of the lifecycle is both unique in function and dependent on the prior phase. Because of this, ML systems tend to be tightly integrated, and the compromise of any one part of the pipeline likely affects other upstream or downstream development phases.

There are more detailed MLOps pipelines, but the canonical example is sufficient to successfully group the supporting tools and services with their lifecycle phase (Table 1).

| Phase | Description | Model state |
|-------|-------------|-------------|
| Ideation | Discussions, meetings, and intention toward requirements. | Pre-development |
| Data collection | Models require data to be trained. Data is usually collected from both public and private sources with a specific model in mind. This is an ongoing process and data continues to be collected from these sources. | Train |
| Data processing | The collected data is processed in any number of ways before being introduced to an algorithm for both training and inference. | Train |
| Model training | The processed data is then ingested by an algorithm and a model is trained. | Train |
| Model evaluation | After a model is trained, it is validated to ensure accuracy, robustness, interpretability, or any number of other metrics. | Train |
| Model deployment | The trained model is embedded in a system for use in production. Machine learning is deployed in a wide variety of ways: inside autonomous vehicles, on a web API, or in client-side applications. | Inference |
| System monitoring | After the model has been deployed, the system is monitored. This includes aspects of the system that may not relate to the ML model directly. | Inference |
| End-of-life | Data shifts, business requirement changes, and innovations require that systems are discontinued properly. | Post-development |
Table 1. MLOps pipeline with supporting red team tools and services

This high-level structure enables risks to be put into the context of the whole ML system and provides some natural security boundaries to work with. For example, implementing privilege tiering between phases potentially prevents an incident from spanning an entire pipeline or multiple pipelines. Compromised or not, the purpose of the pipeline is to deploy models for use.

Methodology and use cases

This methodology attempts to cover all primary concerns related to ML systems. In our framework, any given phase can be handed to an appropriately skilled team:

  • Existing offensive security teams are likely equipped to perform reconnaissance and explore technical vulnerabilities.
  • Responsible AI teams are equipped to address harm-and-abuse scenarios.
  • ML researchers are equipped to handle model vulnerabilities.

Our AI red team prefers to aggregate those skill sets on the same or adjacent teams. The increased learning and effectiveness are undeniable: traditional red team members contribute to academic papers, and data scientists are credited on CVEs.

| Assessment phase | Description |
|------------------|-------------|
| Reconnaissance | This phase describes classic reconnaissance techniques found in MITRE ATT&CK or MITRE ATLAS. |
| Technical vulnerabilities | All the traditional vulnerabilities you know and love. |
| Model vulnerabilities | These vulnerabilities typically come out of research spaces and cover the following: extraction, evasion, inversion, membership inference, and poisoning. |
| Harm and abuse | Models are often trained and distributed such that they can be abused for malicious or other harmful tasks. Models can also be biased intentionally or unintentionally. Or, they don’t accurately reflect the environments in which they’re deployed. |
Table 2. Methodology assessment phases

Regardless of which team performs which assessment activity, it all remains within the same framework and feeds into the larger assessment effort. Here are some specific use cases:

  • Address new prompt-injection techniques
  • Examine and define security boundaries
  • Use privilege tiering
  • Conduct tabletop exercises

Address new prompt-injection techniques

In this scenario, outputs from large language models (LLMs) are clumsily put into Python exec or eval statements. Already, you can see how a composed view helps address multiple aspects, as input validation is a layer of defense against prompt injection.

Threat model diagram used earlier, but only showing the Methodology columns.
Figure 2. New prompt-injection vulnerability could span both model and technical columns
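As a hypothetical illustration (this code is not from an actual assessment), the sketch below shows the vulnerable pattern and one input-validation control that treats model output as data rather than code; the allowlisted actions are made up for the example:

```python
import json

# Stand-in for a prompt-injection payload returned by an LLM.
llm_output = "__import__('os').system('id')"

# Vulnerable pattern: model output treated as trusted code.
# eval(llm_output)  # arbitrary code execution -- never do this

ALLOWED_ACTIONS = {"summarize", "translate", "classify"}

def handle_model_output(raw: str) -> dict:
    """Parse model output as structured data and validate it against an allowlist."""
    try:
        request = json.loads(raw)  # data, never code
    except json.JSONDecodeError:
        raise ValueError("rejected: output is not well-formed JSON")
    if request.get("action") not in ALLOWED_ACTIONS:
        raise ValueError("rejected: action not in allowlist")
    return request
```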

Examine and define security boundaries

Compartmentalizing each phase with security controls reduces attack surfaces and increases visibility into ML systems. An example control might be that pickles (yes, that torch file has pickles) are blocked outside of development environments, and production models must be converted to something less prone to code execution, like ONNX. This enables R&D to continue using pickles during development but prevents them from being used in sensitive environments.

While not using pickles at all would be ideal, security is often about compromises. Organizations should seek to add mitigating controls where the complete avoidance of issues is not practical.

Threat model diagram used earlier but with red lines between the model phases showing security points.
Figure 3. Security between development phases
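A minimal sketch of the pickle-to-ONNX control described above, assuming a PyTorch model and the standard torch.onnx exporter; the network, input shape, and file names are placeholders:

```python
import torch
import torch.nn as nn

# Stand-in for a trained network; in practice this is your own model, with its
# pickled checkpoint loaded inside the development environment only.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)
model.eval()

# Export to ONNX so production serving never has to deserialize a pickle.
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model_prod.onnx", opset_version=17)
```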

Use privilege tiering

Inside a development flow, it’s important to understand the tools and their properties at each stage of the lifecycle. For example, MLFlow has no authentication by default. Starting an MLFlow server knowingly or unknowingly opens that host for exploitation through deserialization.

In another example, Jupyter servers are often started with arguments that remove authentication and TensorBoard has no authentication. This isn’t to say that TensorBoard should have authentication. Teams should just be aware of this fact and ensure that the appropriate network security rules are in place.

Consider the scope of all technologies within the development pipeline. This includes easy wins, such as enabling two-factor authentication on ML services like HuggingFace.

Threat-modeling diagram used earlier but with red lines separating the tech stack from the corporate network and the network from the Internet.
Figure 4. Privilege tiering between infrastructure layers

Conduct tabletop exercises

Consider emptying the ML development framework of its contents and filling it in with only your own technologies, where they live, and the TTPs that would apply. Work your way up and down the stack. Here are some quick scenarios to think through:

  • A Flask server was deployed with debug privileges enabled and exposed to the Internet. It was hosting a model that provided inference for HIPAA-protected data.
  • PII was downloaded as part of a dataset and several models have been trained on it. Now, a customer is asking about it.
  • A public bucket with several ML artifacts, including production models, was left open to the public. It has been improperly accessed and files have been changed.
  • Someone can consistently bypass content filters despite the models being accurate and up-to-date.
  • A model isn’t performing as well as it should in certain geographic regions.
  • Someone is scanning the internal network from an inference server used to host an LLM.
  • System monitoring services have detected someone sending a well-known dataset against the inference service.

These scenarios may be a little contrived, but spend some time putting various technologies in the right buckets and then working your way up through the methodology.

  • Does it sound like a technical vulnerability caused the issue?
  • Does this affect any ML processes?
  • Who is responsible for the models? Would they know if changes have been made?

These are all questions that must be answered. Some of these scenarios immediately seem like they fit into one section of the methodology. However, on closer inspection, you’ll find most of them span multiple areas.

Framework diagram with blank spaces for custom use: Tech stack, Corp Network, Internet.
Figure 5. Threat-model worksheet

Conclusion

This framework already provides you with several familiar paradigms that your organizations can start to strategize around. With a principled methodology, you can create foundations from which to build continuous security improvement, reaching toward standards and maturity from product design to production deployment. We invite you to adopt our methodology and adapt it to your own purposes.

Our methodology does not prescribe behaviors or processes. Instead, it aims to organize them. Your organization may already have mature processes to discover, manage, and mitigate risks associated with traditional applications. We hope that this framework and methodology similarly prepare you to identify and mitigate new risks from the ML components deployed in your organization.

If you have any questions, please comment below or contact threatops@nvidia.com.

Categories
Misc

Generative AI Research Empowers Creators with Guided Image Structure Control

New research is boosting the creative potential of generative AI with a text-guided image-editing tool. The innovative study presents a framework using plug-and-play diffusion features (PnP DFs) that guides realistic and precise image generation. With this work, visual content creators can transform images into visuals with just a single prompt image and a few descriptive words.

The ability to edit and generate content reliably and with ease has the potential to expand creative possibilities for artists, designers, and creators. It could also strengthen industries reliant on animation, visual design, and image editing.

“Recent text-to-image generative models mark a new era in digital content creation. However, the main challenge in applying them to real-world applications is the lack of user-controllability, which has been largely restricted to guiding the generation solely through input text. Our work is one of the first methods to provide users with control over the image layout,” said Narek Tumanyan, a lead author and Ph.D. candidate at the Weizmann Institute of Science.

Recent breakthroughs in generative AI are unlocking new approaches for developing powerful text-to-image models. However, complexities, ambiguity, and the need for custom content limit current rendering techniques.

The study introduces a novel approach using PnP DFs that improves the image editing and generation process, giving creators greater control over their final product.

The researchers start with a simple question: How is the shape, or the outline of an image, represented and captured by diffusion models? The study explores the internal representations of images as they evolve over the generation process and examines how these representations encode shape and semantic information.

The new method controls the generated layout without training a new diffusion model or tuning it, but rather by understanding how spatial information is encoded in a pretrained text-to-image model. During the generation process, the model extracts diffusion features from an introduced guidance image and injects them into each step of the generation process, resulting in fine-grained control over the structure of the new image.

By incorporating these spatial features, the diffusion model refines the new image to match the guidance structure. It does this iteratively, updating image features until it lands on a final image that preserves the guide image layout while also matching the text prompt.
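The general mechanics can be pictured with ordinary PyTorch forward hooks: record a block’s spatial features during a pass over the guidance image, then substitute them during the text-driven generation pass. The toy module below only stands in for the denoiser and is not the authors’ implementation; block choice, timestep handling, and text conditioning are all omitted:

```python
import torch
import torch.nn as nn

# Tiny stand-in for the denoiser; the real method hooks selected blocks of a
# pretrained text-to-image diffusion model at selected timesteps.
denoiser = nn.Sequential(
    nn.Conv2d(4, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 4, 3, padding=1)
)
shared_block = denoiser[0]  # the layer whose spatial features are shared

captured = {}
save = shared_block.register_forward_hook(
    lambda module, inputs, output: captured.update(feat=output.detach())
)
_ = denoiser(torch.randn(1, 4, 64, 64))   # "guidance" pass: features are recorded
save.remove()

# A forward hook that returns a tensor replaces the block's output, so the
# generation pass reuses the guidance features and inherits their layout.
inject = shared_block.register_forward_hook(
    lambda module, inputs, output: captured["feat"]
)
out = denoiser(torch.randn(1, 4, 64, 64))  # "generation" pass with injected features
inject.remove()
```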

“This results in a simple and effective approach, where features extracted from the guidance image are directly injected into the generation process of the translated image, requiring no training or fine-tuning,” the authors write. 

This method paves the way for more advanced controlled generation and manipulation methods.

Video 1. An overview of the study “Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation” being presented at the 2023 Conference on Computer Vision and Pattern Recognition (CVPR)

The researchers developed and tested the PnP model with the cuDNN-accelerated PyTorch framework on a single NVIDIA A100 GPU. According to the team, the large capacity of the GPU made it possible for them to focus on method development. The researchers were awarded an A100 as recipients of the NVIDIA Applied Research Accelerator Program.

Deployed on the A100, the framework transforms a new image from the guidance image and text in about 50 seconds.

The process is not only effective but also reliable, producing stunning imagery accurately. It also works beyond images, translating sketches, drawings, and animations, and can modify lighting, color, and backgrounds.

Four images showing the transformation of an origami hummingbird into a sculpture, then a realistic image, then into a sketch of a parrot.
Figure 1. Sample results of the method preserving the structure of the guidance origami image while matching the description of the target prompt (credit: Tumanyan, Narek, et al./CVPR 2023)

Their method also outperforms existing text-to-image models, achieving a superior balance between preserving the guidance layout and deviating from its appearance. 

A grid of images showing the source image and seven comparison outputs across different models. The PnP DFs outperform the others.
Figure 2. Sample results comparing the model with P2P, DiffuseIT, SDedit with three different noising levels, and VQ+CLIP models (credit: Tumanyan, Narek, et al./CVPR 2023)

However, the model does have limitations. It does not perform well when editing image sections with arbitrary colors, as the model cannot extract semantic information from the input image.

The researchers are currently working on extending this approach to text-guided video editing. The work is also proving valuable to other research harnessing the powers of analyzing the internal representations of images in diffusion models.

For instance, one study is employing the team’s research insights to improve computer vision tasks, such as semantic point correspondence. Another focuses on expanding text-to-image generation controls, including the shape, placement, and appearance of objects.

The research team, from the Weizmann Institute of Science, is presenting their study at CVPR 2023. The work is also open source on GitHub.

  • Learn more on the team’s project page.
  • Read the study, Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation.
  • See NVIDIA research delivering AI breakthroughs at CVPR 2023.