Categories
Offsites

The Technology Behind Cinematic Photos

Looking at photos from the past can help people relive some of their most treasured moments. Last December we launched Cinematic photos, a new feature in Google Photos that aims to recapture the sense of immersion felt the moment a photo was taken, simulating camera motion and parallax by inferring 3D representations in an image. In this post, we take a look at the technology behind this process, and demonstrate how Cinematic photos can turn a single 2D photo from the past into a more immersive 3D animation.

Camera 3D model courtesy of Rick Reitano.

Depth Estimation
Like many recent computational photography features such as Portrait Mode and Augmented Reality (AR), Cinematic photos requires a depth map to provide information about the 3D structure of a scene. Typical techniques for computing depth on a smartphone rely on multi-view stereo, a geometric method that solves for the depth of objects in a scene by simultaneously capturing multiple photos from different viewpoints, where the distances between the cameras are known. On Pixel phones, the views come from two cameras or dual-pixel sensors.
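
For reference, the textbook relation behind two-view stereo (assuming a rectified camera pair with focal length f, baseline B between the cameras, and per-pixel disparity d) recovers depth as

    Z = f * B / d

which is why knowing the distance between the cameras is essential to this approach.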

To enable Cinematic photos on existing pictures that were not taken in multi-view stereo, we trained a convolutional neural network with an encoder-decoder architecture to predict a depth map from just a single RGB image. Using only one view, the model learned to estimate depth from monocular cues, such as the relative sizes of objects, linear perspective, defocus blur, etc.
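
As a rough illustration of what such a model looks like, here is a minimal Keras sketch (the layer sizes and structure are assumptions for clarity, not the production network):

```python
import tensorflow as tf
from tensorflow.keras import layers

def tiny_depth_net(input_shape=(256, 256, 3)):
    """Toy encoder-decoder: strided convolutions downsample the RGB input,
    transposed convolutions upsample back to a one-channel relative-depth map."""
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128, 256):        # encoder
        x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    for filters in (128, 64, 32, 16):         # decoder
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same", activation="relu")(x)
    depth = layers.Conv2D(1, 3, padding="same")(x)   # unnormalized relative depth
    return tf.keras.Model(inputs, depth)
```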

Because monocular depth estimation datasets are typically designed for domains such as AR, robotics, and self-driving, they tend to emphasize street scenes or indoor room scenes instead of features more common in casual photography, like people, pets, and objects, which have different composition and framing. So, we created our own dataset for training the monocular depth model using photos captured on a custom 5-camera rig as well as another dataset of Portrait photos captured on Pixel 4. Both datasets included ground-truth depth from multi-view stereo that is critical for training a model.

Mixing several datasets in this way exposes the model to a larger variety of scenes and camera hardware, improving its predictions on photos in the wild. However, it also introduces new challenges, because the ground-truth depth from different datasets may differ from each other by an unknown scaling factor and shift. Fortunately, the Cinematic photo effect only needs the relative depths of objects in the scene, not the absolute depths. Thus we can combine datasets by using a scale-and-shift-invariant loss during training and then normalize the output of the model at inference.
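
A minimal sketch of one common formulation of such a loss (assumed here for illustration rather than being the exact production loss): fit a per-image scale and shift to the prediction in closed form, then penalize the residual.

```python
import tensorflow as tf

def scale_shift_invariant_loss(pred, gt, valid_mask):
    """Align each predicted depth map to ground truth with a per-image scale
    and shift (closed-form least squares over valid pixels), then take an L1
    loss on the residual, so datasets with different depth scales can be mixed."""
    b = tf.shape(pred)[0]
    p = tf.reshape(pred, [b, -1])
    g = tf.reshape(gt, [b, -1])
    m = tf.reshape(tf.cast(valid_mask, pred.dtype), [b, -1])

    n = tf.reduce_sum(m, axis=1, keepdims=True) + 1e-6
    mean_p = tf.reduce_sum(m * p, axis=1, keepdims=True) / n
    mean_g = tf.reduce_sum(m * g, axis=1, keepdims=True) / n
    var_p = tf.reduce_sum(m * tf.square(p - mean_p), axis=1, keepdims=True) / n
    cov = tf.reduce_sum(m * (p - mean_p) * (g - mean_g), axis=1, keepdims=True) / n

    scale = cov / (var_p + 1e-6)        # per-image scale
    shift = mean_g - scale * mean_p     # per-image shift

    residual = m * tf.abs(scale * p + shift - g)
    return tf.reduce_mean(tf.reduce_sum(residual, axis=1) / tf.squeeze(n, axis=1))
```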

The Cinematic photo effect is particularly sensitive to the depth map’s accuracy at person boundaries. An error in the depth map can result in jarring artifacts in the final rendered effect. To mitigate this, we apply median filtering to improve the edges, and also infer segmentation masks of any people in the photo using a DeepLab segmentation model trained on the Open Images dataset. The masks are used to pull forward pixels of the depth map that were incorrectly predicted to be in the background.
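
A hedged sketch of this kind of post-processing (the filter size, margin, and clamping rule below are illustrative assumptions, not the production logic):

```python
import numpy as np
from scipy.ndimage import median_filter

def refine_depth(depth, person_mask, filter_size=5, margin=1.2):
    """Median-filter the depth map to clean up noisy edges, then clamp pixels
    inside the person segmentation mask that are much farther than the person's
    median depth, pulling mis-predicted "background" pixels forward."""
    refined = median_filter(depth, size=filter_size)
    person = person_mask > 0.5
    if person.any():
        fg_depth = np.median(refined[person])        # representative person depth
        too_far = person & (refined > fg_depth * margin)
        refined[too_far] = fg_depth * margin
    return refined
```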

Camera Trajectory
There can be many degrees of freedom when animating a camera in a 3D scene, and our virtual camera setup is inspired by professional video camera rigs to create cinematic motion. Part of this is identifying the optimal pivot point for the virtual camera’s rotation, so that the resulting motion draws the viewer’s eye to the subject.

The first step in 3D scene reconstruction is to create a mesh by extruding the RGB image onto the depth map. In the resulting mesh, neighboring points can have large depth differences. While this is not noticeable from the “face-on” view, the more the virtual camera moves away from it, the more likely it is to see polygons spanning large changes in depth. In the rendered output video, this looks like the input texture being stretched. The biggest challenge when animating the virtual camera is finding a trajectory that introduces parallax while minimizing these “stretchy” artifacts.

The parts of the mesh with large depth differences become more visible (red visualization) once the camera is away from the “face-on” view. In these areas, the photo appears to be stretched, which we call “stretchy artifacts”.
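
For intuition, here is a minimal sketch of how a pixel grid plus a depth map becomes such a mesh (a simple pinhole unprojection with assumed intrinsics fx, fy, cx, cy; the production mesh generation may differ):

```python
import numpy as np

def depth_to_mesh(depth, fx, fy, cx, cy):
    """Lift every pixel to a 3D vertex with a pinhole camera model and connect
    neighboring pixels into two triangles per grid cell. Neighbors with very
    different depths yield the long, stretched triangles ("stretchy artifacts")
    that become visible once the virtual camera moves off-axis."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    vertices = np.stack([x, y, depth], axis=-1).reshape(-1, 3)

    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1], idx[:-1, 1:]
    bl, br = idx[1:, :-1], idx[1:, 1:]
    faces = np.concatenate([
        np.stack([tl, bl, tr], axis=-1).reshape(-1, 3),   # upper-left triangles
        np.stack([tr, bl, br], axis=-1).reshape(-1, 3),   # lower-right triangles
    ])
    return vertices, faces
```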

Because of the wide spectrum in user photos and their corresponding 3D reconstructions, it is not possible to share one trajectory across all animations. Instead, we define a loss function that captures how much of the stretchiness can be seen in the final animation, which allows us to optimize the camera parameters for each unique photo. Rather than counting the total number of pixels identified as artifacts, the loss function triggers more heavily in areas with a greater number of connected artifact pixels, which reflects a viewer’s tendency to more easily notice artifacts in these connected areas.

We utilize padded segmentation masks from a human pose network to divide the image into three different regions: head, body, and background. The loss function is normalized inside each region before computing the final loss as a weighted sum of the normalized losses. Ideally, the generated output video would be free of artifacts, but in practice this is rare. Weighting the regions differently biases the optimization toward trajectories that leave artifacts in the background regions rather than near the image subject.

During the camera trajectory optimization, the goal is to select a path for the camera with the least amount of noticeable artifacts. In these preview images, artifacts in the output are colored red while the green and blue overlay visualizes the different body regions.
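
A sketch of how such an objective can be scored for one candidate trajectory (the quadratic blob penalty and the example region weights are assumptions for illustration, not the exact production loss):

```python
import numpy as np
from scipy.ndimage import label

def artifact_score(artifact_map, region_masks, region_weights):
    """Penalize connected blobs of artifact pixels super-linearly, so one large
    blob costs more than the same number of scattered pixels, then normalize
    per region (head, body, background) and combine with per-region weights.
    `artifact_map` and the region masks are boolean arrays."""
    total = 0.0
    for name, mask in region_masks.items():
        blobs, _ = label(artifact_map & mask)        # connected components
        sizes = np.bincount(blobs.ravel())[1:]       # pixels per blob (skip label 0)
        region_loss = float(np.sum(sizes.astype(np.float64) ** 2))
        region_loss /= max(int(mask.sum()), 1)       # normalize by region area
        total += region_weights[name] * region_loss
    return total

# e.g. region_weights = {"head": 10.0, "body": 3.0, "background": 1.0}
```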

Framing the Scene
Generally, the reprojected 3D scene does not neatly fit into a rectangle with portrait orientation, so it was also necessary to frame the output with the correct aspect ratio while still retaining the key parts of the input image. To accomplish this, we use a deep neural network that predicts per-pixel saliency of the full image. When framing the virtual camera in 3D, the model identifies and captures as many salient regions as possible while ensuring that the rendered mesh fully occupies every output video frame. This sometimes requires the model to shrink the camera’s field of view.

Heatmap of the predicted per-pixel saliency. We want the creation to include as much of the salient regions as possible when framing the virtual camera.
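
A simplified sketch of this selection step (assuming we already have the saliency heatmap and a predicate telling us whether the rendered mesh fully covers a candidate framing; both names are assumptions):

```python
import numpy as np

def pick_framing(saliency, candidate_crops, mesh_covers):
    """Among candidate output framings (y0, x0, y1, x1) in saliency-map
    coordinates, keep only those the rendered mesh fully covers, then choose
    the one that captures the largest share of the total saliency mass.
    Trying smaller crops corresponds to shrinking the camera's field of view."""
    best, best_score = None, -1.0
    total = float(saliency.sum()) + 1e-8
    for crop in candidate_crops:
        if not mesh_covers(crop):                # no holes allowed in any frame
            continue
        y0, x0, y1, x1 = crop
        score = float(saliency[y0:y1, x0:x1].sum()) / total
        if score > best_score:
            best, best_score = crop, score
    return best
```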

Conclusion
Through Cinematic photos, we implemented a system of algorithms – with each ML model evaluated for fairness – that work together to allow users to relive their memories in a new way, and we are excited about future research and feature improvements. Now that you know how they are created, keep an eye open for automatically created Cinematic photos that may appear in your recent memories within the Google Photos app!

Acknowledgments
Cinematic Photos is the result of a collaboration between Google Research and Google Photos teams. Key contributors also include: Andre Le, Brian Curless, Cassidy Curtis, Ce Liu‎, Chun-po Wang, Daniel Jenstad, David Salesin, Dominik Kaeser, Gina Reynolds, Hao Xu, Huiwen Chang, Huizhong Chen‎, Jamie Aspinall, Janne Kontkanen, Matthew DuVall, Michael Kucera, Michael Milne, Mike Krainin, Mike Liu, Navin Sarma, Orly Liba, Peter Hedman, Rocky Cai‎, Ruirui Jiang‎, Steven Hickson, Tracy Gu, Tyler Zhu, Varun Jampani, Yuan Hao, Zhongli Ding.

Categories
Misc

Meet the Researcher: Lorenzo Baraldi, Artificial Intelligence for Vision, Language and Embodied AI

‘Meet the Researcher’ is a monthly series in which we spotlight different researchers in academia who are using NVIDIA technologies to accelerate their work. This month, we spotlight Lorenzo Baraldi, Assistant Professor at the University of Modena and Reggio Emilia in Italy.

Before working as a professor, Baraldi was a research intern at Facebook AI Research. He serves as an Associate Editor of the Pattern Recognition Letters journal and works on the integration of Vision, Language, and Embodied AI.

What are your research areas of focus?

I work within the AimageLab research group on Computer Vision and Deep Learning. I focus mainly on the integration of vision, language, and action. The final goal of our research is to develop agents that can perceive and act in our world while being capable of communicating with humans.

What motivated you to pursue this research area of focus?

Combining the ability to perceive the visual world around us with the abilities to act and to express ourselves in natural language is something humans do quite naturally, and it is one of the keys to human intelligence. In the last few years, we have witnessed tremendous achievements in areas that consider only one of these abilities: Computer Vision, Natural Language Processing, and Robotics. How to combine them, however, is still not well understood, and it is a thrilling field of research.

Tell us about your current research projects.

We are mainly working in three directions: 1 – we integrate vision and language, for example by developing algorithms that can describe images in natural language. A recent paper from this line of work, a Transformer-based model for image captioning, was presented at CVPR; 2 – we integrate vision and action, by developing agents for autonomous navigation. We are interested in agents moving in indoor and outdoor scenarios, possibly interacting with people, including in crowded situations; 3 – we integrate all of this with the ability to understand language, for instance by training agents that can move by following an instruction, or curiosity-driven agents that can describe what they see along their path.

Overview of Baraldi’s image captioning approach. Building on a Transformer-like encoder-decoder architecture, the approach includes a memory-aware region encoder that augments self-attention with memory vectors.
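
To make the idea in the caption above concrete, here is a minimal sketch of memory-augmented self-attention in Keras (this is not the authors' code; the layer name, sizes, and number of memory slots are assumptions): the keys and values of self-attention are extended with learned memory vectors, so attention over image regions can also retrieve prior knowledge.

```python
import tensorflow as tf
from tensorflow.keras import layers

class MemoryAugmentedSelfAttention(layers.Layer):
    """Self-attention over image region features whose keys and values are
    concatenated with a set of learnable memory slots."""
    def __init__(self, d_model=512, num_heads=8, num_memory=40, **kwargs):
        super().__init__(**kwargs)
        self.mha = layers.MultiHeadAttention(num_heads=num_heads,
                                             key_dim=d_model // num_heads)
        self.memory_k = self.add_weight("memory_k", shape=(1, num_memory, d_model))
        self.memory_v = self.add_weight("memory_v", shape=(1, num_memory, d_model))

    def call(self, region_features):
        b = tf.shape(region_features)[0]
        mem_k = tf.tile(self.memory_k, [b, 1, 1])
        mem_v = tf.tile(self.memory_v, [b, 1, 1])
        keys = tf.concat([region_features, mem_k], axis=1)
        values = tf.concat([region_features, mem_v], axis=1)
        # Queries come only from the regions; keys/values include the memory.
        return self.mha(query=region_features, key=keys, value=values)
```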

What problems or challenges does your research address?

I think one of the main challenges we need to solve is finding the right way of integrating multi-modal information, which can come from visual, textual, or motor perception. In other words, we need to find the right architecture for dealing with this information, which is why a lot of our research involves the design of new architectures. Secondly, most of the approaches we design are generative and sequential: we generate sentences, we generate actions or paths for robots, and so on. Again, how to generate sequences conditioned on multi-modal information is still a challenge.

Sentences generated on the ACVR Robotic Vision Challenge dataset.

What is the (expected) impact of your work on the field/community/world?

If the research efforts that the community is devoting to this area are successful, we will have algorithms that can understand us and help us in our daily lives, seeing with us and acting in the world on our behalf. I think in the long run this might also change the way we interact with computers, which might become a lot easier and more language-based.

How have you used NVIDIA technology either in your current or previous research?

Performing large-scale training on NVIDIA GPUs is one of the most important ingredients powering our research, and I am sure this will become even more important in the near future. We do that locally, with a distributed GPU cluster in our lab, and at a bigger scale in conjunction with CINECA, the Italian supercomputing center, and the NVIDIA AI Technical Centre (NVAITC) of Modena. The partnership we have with NVAITC and CINECA has not only increased our computational capacity, but has also provided us with the knowledge and support we needed to exploit the technologies NVIDIA provides to their fullest. I would say this collaboration is having a really important impact on our research capabilities.

Did you achieve any breakthroughs in that research or any interesting results using NVIDIA technology?

Most, if not all, of the research works we carry out are somehow powered by NVIDIA technologies. Apart from the results on the integration of vision, language, and action, we also have a few other research lines of which I am particularly proud. One is related to video understanding: detecting people and objects, understanding their relationships, and finding the best way of extracting spatio-temporal features is an important challenge. Sometimes we also like to apply our research to cultural heritage: using NVIDIA GPUs, we have developed algorithms for retrieving paintings from natural language queries, and generative networks for translating artworks to reality.

What is next for your research?

Even though things are evolving rapidly in our area, there are still a lot of key issues that need to be addressed, and that is what our lab is concentrating on. One is going beyond the limitations of traditional supervised learning and fighting dataset bias: in the end, we would like our algorithms to describe and understand any connection between images and text, not just those annotated in current datasets. To this end, we are working towards algorithms that can describe objects that are not present in the training dataset, and we constantly explore the new possibilities offered by self-supervised and weakly supervised learning. How to properly manage the temporal dimension is another key issue that has been central to our research, and it has brought advancements in terms of new architectural designs, not only for managing sequences of words, but also for understanding video streams.

Any advice for new researchers?

There are at least three capabilities I would recommend pursuing. One is learning to code well and elegantly, because translating ideas into reality always involves implementation. The second is learning to have good ideas: that is potentially the trickiest part, but it is even more important, because every valuable piece of research needs to start from a good idea. I think reading papers, especially older ones, and thinking openly, freely, and on a large scale is of great help in this sense. The third is time management: always focus on what is impactful.

Baraldi’s colleague, Matteo Tomei, will be presenting their lab’s recent work at NVIDIA GTC in April, “More Efficient and Accurate Video Networks: A New Approach to Maximize the Accuracy/Computation Trade-off”.

Categories
Misc

I saw some Tesla K80 graphics acceleration cards. They have no display port; they are meant for compute workloads. Are these any good for building AI with TensorFlow?

Price: $140

Specs:

  • CUDA cores: 4,992
  • Core speed: 562–875 MHz per card
  • RAM: 24 GB
  • RAM bandwidth: 480 GB/s

This is a dual-slot PCIe card, basically 2 cards in 1. No cooling is included (I’ve got a plan for that).

The display card will be my old GTX 950.

submitted by /u/isaiahii10
[visit reddit] [comments]

Categories
Misc

I’ve been working on a real-time object detection project and I’ve run into an error while trying to capture images to label and train. Please help.

submitted by /u/Field_Great
[visit reddit] [comments]
Categories
Misc

NVIDIA Clara Parabricks Pipelines v3.5 Accelerates Google’s DeepVariant v1.0

NVIDIA recently released NVIDIA Clara Parabricks Pipelines version 3.5, adding a set of new features to the software suite that accelerates end-to-end genome sequencing analysis.

With the release of v3.5, Clara Parabricks Pipelines now provides acceleration to Google’s DeepVariant 1.0, in addition to a suite of existing DNA and RNA tools. The addition of DeepVariant to Parabricks Pipelines brings highly-accurate variant calling for both short- and long-read sequencing data to the community. 

This new release also enables graphical reports of QC metrics from binary alignment map (BAM) files to variant call files (VCF). Researchers can use these graphical reports to better assess the quality of their sequencing data and the subsequent variant calling before moving the results for additional downstream analysis. 

Parabricks Pipelines is packaged with enterprise support for A100 and other NVIDIA GPUs, offering one of the industry’s fastest compute frameworks for whole genome and whole exome applications. For a whole genome at 30x coverage, a server with 32 virtual CPUs takes about 1,200 minutes to generate a variant call file (VCF), while a server with eight A100 Tensor Core GPUs running Clara Parabricks takes less than 25 minutes to go from FASTQ to VCF.

Start a free one month trial of NVIDIA Clara Parabricks Pipelines today and learn how to get set up in just 10 minutes with this step-by-step instructional video.

Categories
Misc

Webinar: Create Gesture-Based Interactions with a Robot

In this webinar, you will learn how to train your own gesture recognition deep learning pipeline. We’ll start with a pre-trained detection model, repurpose it for hand detection, and use it together with the purpose-built gesture recognition model.

NVIDIA pre-trained deep learning models and the Transfer Learning Toolkit (TLT) give you a rapid path to building your next AI project. Whether you’re a DIY enthusiast or building a next-gen product with AI, you can use these models out of the box or fine-tune with your own dataset. The purpose-built, pre-trained models are trained on the large datasets collected and curated by NVIDIA and can be applied to a wide range of use cases. TLT is a simple AI toolkit, shipped with Jupyter notebooks, that requires little to no coding for taking pre-trained models and customizing them with your own data.

Date: March 3, 2021
Time: 11:00am – 12:00pm PT
Duration: 1 hour

Join this webinar to explore:

  • Highly optimized pre-trained models for various industry use cases
  • How to fine-tune with your own data on new pre-trained models and use them to reduce your total development time
  • Developing an end-to-end training pipeline and deploying the trained model on NVIDIA SDKs

Join us after the presentation for a live Q&A session.

Register now >

Categories
Misc

The following shows up in the command prompt

I’m trying to create a chatbot using NeuralNine’s tutorial, but I ran into a problem.

C:\Users\chakk\Desktop\chatbot>python main.py
2021-02-21 16:14:30.544425: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2021-02-21 16:14:30.544542: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-02-21 16:14:31.738286: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-21 16:14:31.738724: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-02-21 16:14:31.757352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:01:00.0 name: GeForce RTX 2080 SUPER computeCapability: 7.5 coreClock: 1.815GHz coreCount: 48 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 462.00GiB/s
2021-02-21 16:14:31.757788: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2021-02-21 16:14:31.758162: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cublas64_11.dll'; dlerror: cublas64_11.dll not found
2021-02-21 16:14:31.759016: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cublasLt64_11.dll'; dlerror: cublasLt64_11.dll not found
2021-02-21 16:14:31.759344: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cufft64_10.dll'; dlerror: cufft64_10.dll not found
2021-02-21 16:14:31.759848: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'curand64_10.dll'; dlerror: curand64_10.dll not found
2021-02-21 16:14:31.760201: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cusolver64_10.dll'; dlerror: cusolver64_10.dll not found
2021-02-21 16:14:31.760504: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cusparse64_11.dll'; dlerror: cusparse64_11.dll not found
2021-02-21 16:14:31.760798: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cudnn64_8.dll'; dlerror: cudnn64_8.dll not found
2021-02-21 16:14:31.760831: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform. Skipping registering GPU devices...
2021-02-21 16:14:31.761295: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2. To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-02-21 16:14:31.761890: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-21 16:14:31.761963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]
2021-02-21 16:14:31.762260: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set

C:\Users\chakk\Desktop\chatbot>python main.py
2021-02-21 16:21:50.579668: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2021-02-21 16:21:50.579795: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-02-21 16:21:51.786031: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-02-21 16:21:51.786480: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-02-21 16:21:51.797963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:

submitted by /u/AviationAddiction21
[visit reddit] [comments]

Categories
Misc

[Help] How to optimize posenet or handpose javascript?

I’m working on an experiment where users can interact with a 3D object using their gestures or hands. PoseNet/HandPose are great libraries, but the performance is not up to par just yet: without any 3D object, the frame rate hovers around 10–12 FPS, which is not enough if you want to build an interactive installation.

Is there a way to optimize this, especially on macOS?

I’ve tried the following:

  • Using web worker (didn’t help much)
  • Using WebSocket and run TensorFlow on the server (Didn’t help much, because I can’t run the GPU backend)

What I haven’t tried:

  • Run a TPU server, a bit excessive and perhaps costly? Or is there an alternative for this?
  • Run it on an Nvidia platform (Might need to rent)

submitted by /u/buangakun3
[visit reddit] [comments]

Categories
Misc

A package to sizeably boost your performance

I am glad to present the TensorFlow implementation of “Gradient Centralization”, a new optimization technique to sizeably boost your performance 🚀, available as a ready-to-use Python package!

Project Repo: https://github.com/Rishit-dagli/Gradient-Centralization-TensorFlow

Please consider giving it a ⭐ if you like it😎. Here is an example showing the impact of the package!

https://preview.redd.it/69woozdxjui61.png?width=1280&format=png&auto=webp&s=0f3acbaf28a0dbc05455e1633eee9a82a95dae17
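
For readers curious what the technique itself does, here is a minimal sketch of gradient centralization in plain TensorFlow (this is not the package's own API, just the underlying idea, assuming the usual Keras layout where the last axis of a kernel is the output dimension; model, loss_fn, and optimizer in the usage comment are assumed to exist):

```python
import tensorflow as tf

def centralize_gradients(grads):
    """For each gradient tensor of rank > 1 (dense/conv kernels), subtract its
    mean over every axis except the output axis, as described in the paper."""
    centralized = []
    for g in grads:
        if g is not None and len(g.shape) > 1:
            axes = list(range(len(g.shape) - 1))
            g = g - tf.reduce_mean(g, axis=axes, keepdims=True)
        centralized.append(g)
    return centralized

# Usage inside a custom training step:
# with tf.GradientTape() as tape:
#     loss = loss_fn(y_true, model(x, training=True))
# grads = centralize_gradients(tape.gradient(loss, model.trainable_variables))
# optimizer.apply_gradients(zip(grads, model.trainable_variables))
```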

submitted by /u/Rishit-dagli
[visit reddit] [comments]