Categories
Misc

Creating a tensor using linspace function

Hello Guys,

I’m challenging myself to create a simple 1-dimensional tensor that consists of integers ranging from 1 to 10, using the linspace function and with a shape of 6. However, I haven’t been successful. How do I fix this?

My code:

[1,2,3,4,5,6,7,8,9,10]

torch.linspace(1, 1, 10)
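A minimal sketch of what is going on, written in plain Python so the arithmetic is visible. The helper `linspace` below is a hypothetical stand-in that mirrors the argument order of `torch.linspace(start, end, steps)`; it is not PyTorch code. With `start` and `end` both set to 1, all ten values come out as 1.

```python
def linspace(start, end, steps):
    """Return `steps` evenly spaced values from start to end, inclusive."""
    if steps == 1:
        return [float(start)]
    step = (end - start) / (steps - 1)
    return [start + i * step for i in range(steps)]

# The call from the post: start == end, so every value equals 1.
same = linspace(1, 1, 10)

# Six values spanning 1..10: the spacing is (10 - 1) / 5 = 1.8, so they are not integers.
six = linspace(1, 10, 6)

# Rounding gives integers, at the cost of uneven spacing.
six_int = [round(v) for v in six]
```

In PyTorch the corresponding call would be `torch.linspace(1, 10, steps=6)`. Six evenly spaced integers would need the span to be divisible by five, which 10 - 1 = 9 is not, so some rounding or a different range is unavoidable.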

submitted by /u/destin95

Categories
Misc

Why getting NaN values for custom Dice loss in Keras?

I am using Keras for boundary/contour detection with a U-Net. When I use binary cross-entropy as the loss, the losses decrease over time as expected and the predicted boundaries look reasonable.

However, I have tried a custom Dice loss with varying learning rates, and none of them work well.

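from keras import backend as K  # K is assumed to be the Keras backend module

smooth = 1e-6  # keeps the ratio defined when both masks are empty

def dice_coef(y_true, y_pred):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice(y_true, y_pred):
    # Loss to minimize: 1 minus the Dice coefficient
    return 1 - dice_coef(y_true, y_pred)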

The loss values don’t improve. That is, training shows something like:

loss: nan - dice: .9607 - val_loss: nan - val_dice: .9631 

I get NaNs for the losses, and values for dice and val_dice that barely change as the epochs iterate. This happens regardless of the learning rate, anywhere from .01 to 1e-6.

The dimensions of the training images/labels look like N x H x W x 1, where N is the number of images and H/W are the height/width of each image.

Can anyone help?
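The post does not show where the NaN originates, so the following is only a debugging sketch in plain Python, not Keras code. It mirrors the Dice formula above on flat lists and shows that, with non-negative inputs and a positive smoothing term, the loss stays finite, whereas a single NaN in the labels or predictions propagates to the result. The helper `all_finite` is a hypothetical name introduced for illustration.

```python
import math

SMOOTH = 1e-6  # same smoothing constant as in the post

def dice_coef(y_true, y_pred, smooth=SMOOTH):
    """Dice coefficient over two flat sequences of floats."""
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    return (2.0 * intersection + smooth) / (sum(y_true) + sum(y_pred) + smooth)

def dice_loss(y_true, y_pred):
    return 1.0 - dice_coef(y_true, y_pred)

def all_finite(values):
    """True if no value is NaN or infinite."""
    return all(math.isfinite(v) for v in values)

labels = [0.0, 1.0, 1.0, 0.0]
preds = [0.1, 0.9, 0.8, 0.2]
finite_loss = dice_loss(labels, preds)

# A single NaN in either input propagates to the loss.
bad_preds = [0.1, float("nan"), 0.8, 0.2]
nan_loss = dice_loss(labels, bad_preds)
```

Under that reading, a reasonable first step would be to check the training labels and the model outputs for non-finite values before looking at the loss itself.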

submitted by /u/74throwaway

Categories
Misc

Feelin’ Like a Million MBUX: AI Cockpit Featured in Popular Mercedes-Benz C-Class

It’s hard not to feel your best when your car makes every commute a VIP experience. This week, Mercedes-Benz launched the redesigned C-Class sedan and C-Class wagon, packed with new features for the next generation of driving. Both models prominently feature the latest MBUX AI cockpit, powered by NVIDIA, delivering an intelligent user interface for ...

The post Feelin’ Like a Million MBUX: AI Cockpit Featured in Popular Mercedes-Benz C-Class appeared first on The Official NVIDIA Blog.

Categories
Misc

Omniverse Assets Available for Download on TurboSquid

TurboSquid and NVIDIA are collaborating to curate thousands of USD models that are available today and ready to use with NVIDIA Omniverse.

Many developers using Omniverse are experiencing enhanced workflows with virtual collaboration and photorealistic simulation. The open platform, which is available now in open beta, enables teams around the world to simultaneously collaborate in real time, using their favorite 3D applications. 

TurboSquid has an extensive library of 3D models that users can easily drag and drop into Omniverse, allowing them to immediately start collaborating with others. This helps developers save time as they can immediately start exploring Omniverse without worrying about importing or exporting content, model preparation, or polycounts. Users can load TurboSquid’s USD models in Omniverse connectors, and Omniverse ensures consistent quality between teams, contractors, and ecosystems. 

To get started, download the NVIDIA Omniverse Launcher from nvidia.com/omniverse. Run the Omniverse Launcher and install the Omniverse Create or Omniverse View app, then import TurboSquid 3D content and start creating.

Learn more by visiting TurboSquid’s Omniverse page, and check out the 3D tool sets now available.

Categories
Offsites

The Technology Behind Cinematic Photos

Looking at photos from the past can help people relive some of their most treasured moments. Last December we launched Cinematic photos, a new feature in Google Photos that aims to recapture the sense of immersion felt the moment a photo was taken, simulating camera motion and parallax by inferring 3D representations in an image. In this post, we take a look at the technology behind this process, and demonstrate how Cinematic photos can turn a single 2D photo from the past into a more immersive 3D animation.

Camera 3D model courtesy of Rick Reitano.

Depth Estimation
Like many recent computational photography features such as Portrait Mode and Augmented Reality (AR), Cinematic photos requires a depth map to provide information about the 3D structure of a scene. Typical techniques for computing depth on a smartphone rely on multi-view stereo, a geometry method that solves for the depth of objects in a scene by simultaneously capturing multiple photos at different viewpoints, where the distances between the cameras are known. In the Pixel phones, the views come from two cameras or dual-pixel sensors.

To enable Cinematic photos on existing pictures that were not taken in multi-view stereo, we trained a convolutional neural network with encoder-decoder architecture to predict a depth map from just a single RGB image. Using only one view, the model learned to estimate depth using monocular cues, such as the relative sizes of objects, linear perspective, defocus blur, etc.

Because monocular depth estimation datasets are typically designed for domains such as AR, robotics, and self-driving, they tend to emphasize street scenes or indoor room scenes instead of features more common in casual photography, like people, pets, and objects, which have different composition and framing. So, we created our own dataset for training the monocular depth model using photos captured on a custom 5-camera rig as well as another dataset of Portrait photos captured on Pixel 4. Both datasets included ground-truth depth from multi-view stereo that is critical for training a model.

Mixing several datasets in this way exposes the model to a larger variety of scenes and camera hardware, improving its predictions on photos in the wild. However, it also introduces new challenges, because the ground-truth depth from different datasets may differ from each other by an unknown scaling factor and shift. Fortunately, the Cinematic photo effect only needs the relative depths of objects in the scene, not the absolute depths. Thus we can combine datasets by using a scale-and-shift-invariant loss during training and then normalize the output of the model at inference.
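The post does not give the exact form of the scale-and-shift-invariant loss. A minimal sketch of one common formulation, assumed here for illustration: fit a scale `s` and shift `t` by least squares so that `s * pred + t` best matches the ground truth, then measure the remaining squared error. The function names are hypothetical.

```python
def align_scale_shift(pred, target):
    """Least-squares scale s and shift t so that s * pred + t best matches target."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (g - mean_t) for p, g in zip(pred, target))
    s = cov / var_p if var_p > 0 else 0.0
    t = mean_t - s * mean_p
    return s, t

def scale_shift_invariant_loss(pred, target):
    """Mean squared error after the best scale-and-shift alignment."""
    s, t = align_scale_shift(pred, target)
    return sum((s * p + t - g) ** 2 for p, g in zip(pred, target)) / len(pred)

depth_a = [1.0, 2.0, 4.0, 8.0]
depth_b = [3.0 * d + 5.0 for d in depth_a]  # same scene, different scale and shift
loss_same = scale_shift_invariant_loss(depth_a, depth_b)

# A different depth ordering cannot be explained by scale and shift alone.
loss_diff = scale_shift_invariant_loss(depth_a, [8.0, 1.0, 4.0, 2.0])
```

Two depth maps that differ only by an unknown scale and shift score close to zero, which is what lets ground truth from different rigs be mixed in one training set.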

The Cinematic photo effect is particularly sensitive to the depth map’s accuracy at person boundaries. An error in the depth map can result in jarring artifacts in the final rendered effect. To mitigate this, we apply median filtering to improve the edges, and also infer segmentation masks of any people in the photo using a DeepLab segmentation model trained on the Open Images dataset. The masks are used to pull forward pixels of the depth map that were incorrectly predicted to be in the background.
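The two clean-up steps can be sketched on a tiny depth grid. This is an illustration under stated assumptions, not the production pipeline: larger depth values are taken to mean "farther away", the filter is a plain 3x3 median, and the pull-forward rule (clamp masked pixels to the median depth of the masked region) is a hypothetical stand-in for whatever rule the authors use.

```python
from statistics import median

def median_filter_3x3(depth):
    """3x3 median filter; border pixels use only the neighbors that exist."""
    h, w = len(depth), len(depth[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [depth[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = median(window)
    return out

def pull_person_forward(depth, mask):
    """Clamp masked pixels to at most the median depth of the masked region."""
    person = [d for drow, mrow in zip(depth, mask) for d, m in zip(drow, mrow) if m]
    if not person:
        return [row[:] for row in depth]
    limit = median(person)
    return [[min(d, limit) if m else d for d, m in zip(drow, mrow)]
            for drow, mrow in zip(depth, mask)]

# A single outlier in the middle of a flat region is removed by the filter.
noisy = [[1.0, 1.0, 1.0], [1.0, 9.0, 1.0], [1.0, 1.0, 1.0]]
smoothed = median_filter_3x3(noisy)

# The top row is the person; its third pixel was wrongly predicted as far away.
fixed = pull_person_forward([[2.0, 2.0, 9.0], [5.0, 5.0, 5.0]],
                            [[1, 1, 1], [0, 0, 0]])
```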

Camera Trajectory
There can be many degrees of freedom when animating a camera in a 3D scene, and our virtual camera setup is inspired by professional video camera rigs to create cinematic motion. Part of this is identifying the optimal pivot point for the virtual camera’s rotation in order to yield the best results by drawing one’s eye to the subject.

The first step in 3D scene reconstruction is to create a mesh by extruding the RGB image onto the depth map. By doing so, neighboring points in the mesh can have large depth differences. While this is not noticeable from the “face-on” view, the more the virtual camera is moved, the more likely it is to see polygons spanning large changes in depth. In the rendered output video, this will look like the input texture is stretched. The biggest challenge when animating the virtual camera is to find a trajectory that introduces parallax while minimizing these “stretchy” artifacts.

The parts of the mesh with large depth differences become more visible (red visualization) once the camera is away from the “face-on” view. In these areas, the photo appears to be stretched, which we call “stretchy artifacts”.
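A minimal sketch of the extrusion step and of how stretch-prone polygons could be flagged. It assumes one vertex per pixel and two triangles per pixel quad; the threshold `max_depth_gap` and both function names are hypothetical.

```python
def build_mesh(depth):
    """One vertex per pixel at (x, y, depth); two triangles per pixel quad."""
    h, w = len(depth), len(depth[0])
    vertices = [(x, y, depth[y][x]) for y in range(h) for x in range(w)]
    triangles = []
    for y in range(h - 1):
        for x in range(w - 1):
            a = y * w + x   # top-left
            b = a + 1       # top-right
            c = a + w       # bottom-left
            d = c + 1       # bottom-right
            triangles.append((a, b, c))
            triangles.append((b, d, c))
    return vertices, triangles

def stretchy_triangles(vertices, triangles, max_depth_gap):
    """Triangles whose vertices span more than max_depth_gap in depth."""
    flagged = []
    for tri in triangles:
        depths = [vertices[i][2] for i in tri]
        if max(depths) - min(depths) > max_depth_gap:
            flagged.append(tri)
    return flagged

# A foreground region (depth 1) next to a background column (depth 5).
depth = [[1.0, 1.0, 5.0],
         [1.0, 1.0, 5.0]]
vertices, triangles = build_mesh(depth)
flagged = stretchy_triangles(vertices, triangles, max_depth_gap=2.0)
```

Only the triangles that bridge the foreground and the background are flagged; these are the ones that would look stretched once the camera leaves the face-on view.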

Because of the wide spectrum in user photos and their corresponding 3D reconstructions, it is not possible to share one trajectory across all animations. Instead, we define a loss function that captures how much of the stretchiness can be seen in the final animation, which allows us to optimize the camera parameters for each unique photo. Rather than counting the total number of pixels identified as artifacts, the loss function triggers more heavily in areas with a greater number of connected artifact pixels, which reflects a viewer’s tendency to more easily notice artifacts in these connected areas.
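The idea of penalizing connected artifact regions more than scattered pixels can be sketched as follows. The exact loss is not given in the post; the version below assumes 4-connectivity and a superlinear penalty on region size, with the exponent chosen arbitrarily for illustration.

```python
def connected_component_sizes(mask):
    """Sizes of 4-connected regions of truthy pixels in a 2D mask."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    sizes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            stack, size = [(y, x)], 0
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                size += 1
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            sizes.append(size)
    return sizes

def artifact_loss(mask, exponent=2.0):
    """Superlinear penalty: one region of n pixels costs more than n isolated pixels."""
    return sum(size ** exponent for size in connected_component_sizes(mask))

# Both masks contain four artifact pixels.
scattered = [[1, 0, 1],
             [0, 0, 0],
             [1, 0, 1]]
clustered = [[1, 1, 0],
             [1, 1, 0],
             [0, 0, 0]]
```

With these inputs the scattered mask scores 4 and the clustered mask scores 16, even though a plain pixel count would treat them the same.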

We utilize padded segmentation masks from a human pose network to divide the image into three different regions: head, body and background. The loss function is normalized inside each region before computing the final loss as a weighted sum of the normalized losses. Ideally the generated output video is free from artifacts but in practice, this is rare. Weighting the regions differently biases the optimization process to pick trajectories that prefer artifacts in the background regions, rather than those artifacts near the image subject.

During the camera trajectory optimization, the goal is to select a path for the camera with the least amount of noticeable artifacts. In these preview images, artifacts in the output are colored red while the green and blue overlay visualizes the different body regions.
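A minimal sketch of the region-weighted loss, assuming each region's loss is normalized by the region's pixel count. The weights are hypothetical values picked only to show the effect; the post does not state the real ones or the exact normalization.

```python
REGION_WEIGHTS = {"head": 4.0, "body": 2.0, "background": 1.0}  # hypothetical values

def region_weighted_loss(artifacts, regions, weights=REGION_WEIGHTS):
    """Weighted sum of per-region artifact fractions.

    artifacts: 2D mask of artifact pixels; regions: 2D grid of region names.
    """
    counts = {name: 0 for name in weights}
    hits = {name: 0 for name in weights}
    for arow, rrow in zip(artifacts, regions):
        for a, r in zip(arow, rrow):
            counts[r] += 1
            hits[r] += 1 if a else 0
    return sum(weights[r] * hits[r] / counts[r] for r in weights if counts[r] > 0)

regions = [["background", "head", "background"],
           ["background", "body", "background"]]
artifact_near_head = [[0, 1, 0], [0, 0, 0]]
artifact_in_background = [[1, 0, 0], [0, 0, 0]]

loss_head = region_weighted_loss(artifact_near_head, regions)
loss_background = region_weighted_loss(artifact_in_background, regions)
```

A single artifact pixel on the head costs far more than one in the background, so an optimizer minimizing this loss would favor trajectories that push artifacts away from the subject.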

Framing the Scene
Generally, the reprojected 3D scene does not neatly fit into a rectangle with portrait orientation, so it was also necessary to frame the output with the right aspect ratio while still retaining the key parts of the input image. To accomplish this, we use a deep neural network that predicts per-pixel saliency of the full image. When framing the virtual camera in 3D, the model identifies and captures as many salient regions as possible while ensuring that the rendered mesh fully occupies every output video frame. This sometimes requires the model to shrink the camera’s field of view.

Heatmap of the predicted per-pixel saliency. We want the creation to include as much of the salient regions as possible when framing the virtual camera.
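The framing described above happens on a virtual camera in 3D. As a simplified 2D analogy only, the sketch below picks the fixed-size crop window that covers the most total saliency; the function name and the brute-force search are assumptions made for illustration.

```python
def best_crop(saliency, crop_h, crop_w):
    """Top-left corner (y, x) and score of the crop window with the most total saliency."""
    h, w = len(saliency), len(saliency[0])
    best, best_score = (0, 0), float("-inf")
    for y in range(h - crop_h + 1):
        for x in range(w - crop_w + 1):
            score = sum(saliency[j][i]
                        for j in range(y, y + crop_h)
                        for i in range(x, x + crop_w))
            if score > best_score:
                best, best_score = (y, x), score
    return best, best_score

# Salient content sits in the lower-right part of the image.
saliency = [[0.0, 0.1, 0.0, 0.0],
            [0.0, 0.2, 0.9, 0.8],
            [0.0, 0.1, 0.7, 0.9]]
corner, score = best_crop(saliency, crop_h=2, crop_w=2)
```

Because the window never extends past the image, every output pixel is covered by content, loosely mirroring the requirement that the rendered mesh fill every frame.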

Conclusion
Through Cinematic photos, we implemented a system of algorithms – with each ML model evaluated for fairness – that work together to allow users to relive their memories in a new way, and we are excited about future research and feature improvements. Now that you know how they are created, keep an eye open for automatically created Cinematic photos that may appear in your recent memories within the Google Photos app!

Acknowledgments
Cinematic Photos is the result of a collaboration between Google Research and Google Photos teams. Key contributors also include: Andre Le, Brian Curless, Cassidy Curtis, Ce Liu, Chun-po Wang, Daniel Jenstad, David Salesin, Dominik Kaeser, Gina Reynolds, Hao Xu, Huiwen Chang, Huizhong Chen, Jamie Aspinall, Janne Kontkanen, Matthew DuVall, Michael Kucera, Michael Milne, Mike Krainin, Mike Liu, Navin Sarma, Orly Liba, Peter Hedman, Rocky Cai, Ruirui Jiang, Steven Hickson, Tracy Gu, Tyler Zhu, Varun Jampani, Yuan Hao, Zhongli Ding.

Categories
Misc

Meet the Researcher: Lorenzo Baraldi, Artificial Intelligence for Vision, Language and Embodied AI

‘Meet the Researcher’ is a monthly series in which we spotlight different researchers in academia who are using NVIDIA technologies to accelerate their work. This month, we spotlight Lorenzo Baraldi, Assistant Professor at the University of Modena and Reggio Emilia in Italy.

Before working as a professor, Baraldi was a research intern at Facebook AI Research. He serves as an Associate Editor of the Pattern Recognition Letters journal and works on the integration of Vision, Language, and Embodied AI.

What are your research areas of focus?

I work within the AimageLab research group on Computer Vision and Deep Learning. I focus mainly on the integration of vision, language, and action. The final goal of our research is to develop agents that can perceive and act in our world while being capable of communicating with humans.

What motivated you to pursue this research area of focus?

Combining the ability to perceive the visual world around us with the abilities to act and to express ourselves in natural language is something that humans do quite naturally and is one of the keys to human intelligence. In the last few years, we have witnessed tremendous achievements in areas that consider only one of those abilities: Computer Vision, Natural Language Processing, and Robotics. How to combine these abilities, instead, still needs to be understood and is a thrilling field of research.

Tell us about your current research projects.

We are mainly working in three directions: (1) we integrate vision and language, for example by developing algorithms that can describe images in natural language; a recent paper from this line of work, on a Transformer-based model for image captioning, was presented at CVPR; (2) we integrate vision and action by developing agents for autonomous navigation, with an interest in agents that move in indoor and outdoor scenarios and possibly interact with people, including in crowded situations; (3) we integrate all of this with the ability to understand language, for instance by training agents that can move following an instruction or curiosity-driven agents that can describe what they see along their path.

Overview of Baraldi’s image captioning approach. Building on a Transformer-like encoder-decoder architecture, the approach includes a memory-aware region encoder that augments self-attention with memory vectors.
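One way to read "augments self-attention with memory vectors" is that extra memory slots are appended to the keys and values before attention is computed. The sketch below illustrates that reading for a single query in plain Python. It is an assumption made for illustration, not the published architecture; in a real model the memory slots would be learned parameters rather than the fixed numbers used here.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

def memory_augmented_attend(query, keys, values, mem_keys, mem_values):
    """Same attention, with memory slots appended to the keys and values."""
    return attend(query, keys + mem_keys, values + mem_values)

region_keys = [[1.0, 0.0], [0.0, 1.0]]
region_values = [[1.0, 0.0], [0.0, 1.0]]
mem_keys = [[1.0, 1.0]]      # hypothetical memory slot (learned in a real model)
mem_values = [[0.5, 0.5]]
output, weights = memory_augmented_attend(
    [1.0, 0.0], region_keys, region_values, mem_keys, mem_values)
```

The query now distributes attention over two image regions plus one memory slot, so the output can carry information that is not present in any input region.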

What problems or challenges does your research address?

I think one of the main challenges we need to solve is to find the right way of integrating multi-modal information, which can come from visual, textual, or motor perception. In other words, we need to find the right architecture for dealing with this information, which is why a lot of our research involves the design of new architectures. Secondly, most of the approaches we design are generative and sequential: we generate sentences, we generate actions or paths for robots, and so on. Again, how to generate sequences conditioned on multi-modal information is still a challenge.

Sentences generated on the ACVR Robotic Vision Challenge dataset.

What is the (expected) impact of your work on the field/community/world?

If the research efforts that the community is devoting to this area are successful, we will have algorithms that can understand us and help us in our daily lives, seeing with us and acting in the world to help us. I think in the long run this might also change the way we interact with computers, which might become a lot easier and language-based.

How have you used NVIDIA technology either in your current or previous research?

Performing large-scale training on NVIDIA GPUs is one of the most important ingredients powering our research, and I am sure this will become even more important in the near future. We do that locally, with a distributed GPU cluster in our lab, and we do that at a bigger scale in conjunction with CINECA, the Italian supercomputing center, and with the NVIDIA AI Technical Centre (NVAITC) of Modena. The partnership we have with NVAITC and CINECA has not only increased our computational capacity, but has also provided us with the knowledge and support we needed to exploit the technologies NVIDIA provides to their fullest. I would say this collaboration is really having an important impact on our research capabilities.

Did you achieve any breakthroughs in that research or any interesting results using NVIDIA technology?

Most, if not all, of the research works we carry out are somehow powered by NVIDIA technologies. Apart from the results on the integration of vision, language, and action, we also have a few other research lines of which I am particularly proud. One is related to video understanding: detecting people and objects, understanding their relationships, and finding the best way of extracting spatio-temporal features is an important challenge. Sometimes we also like to apply our research to cultural heritage: using NVIDIA GPUs we have developed algorithms for retrieving paintings in natural language, and generative networks for translating artworks to reality.

What is next for your research?

Even though things are evolving rapidly in our area, there are still a lot of key issues that need to be addressed, and that is what our lab is concentrating on. One is going beyond the limitations of traditional supervised learning and fighting dataset bias: in the end, we would like our algorithms to describe and understand any connection between images and text, not just those that are annotated in current datasets. To this end, we are working towards algorithms that can describe objects which are not present in the training dataset, and we constantly explore the new possibilities given by self-supervised and weakly-supervised learning. How to properly manage the temporal dimension is another key issue that has been central in our research, and which has brought advancements in terms of new architectural design, not only for managing sequences of words, but also for understanding video streams.

Any advice for new researchers?

There are at least three capabilities I would recommend pursuing. One is to learn to code well and elegantly, because translating ideas to reality is always going to involve implementation. The second is to learn to have good ideas: that is potentially the trickiest part, but it is even more important because every valuable piece of research needs to start from a good idea. I think reading papers, especially from the past, and thinking openly, freely, and on a large scale is of great help in this sense. The third is time management: always focus on what is impactful.

Baraldi’s colleague, Matteo Tomei, will be presenting their lab’s recent work at NVIDIA GTC in April, “More Efficient and Accurate Video Networks: A New Approach to Maximize the Accuracy/Computation Trade-off”.

Categories
Misc

I saw some Tesla K80 graphics acceleration cards. They have no display port, so they are meant for helping with workloads. Are these any good for building AI with TensorFlow?

Price: $140

Specs:

CUDA cores: 4,992

Core speed: 562-875 MHz per card

RAM 24GB

RAM speed: 480GB/s

This is a 2-slot PCI card, basically 2 cards in 1. No cooling included (I’ve got a plan for that).

The display card will be my old GTX 950.

submitted by /u/isaiahii10

Categories
Misc

I’ve been working on a real-time object detection project and I’ve run into an error while trying to capture images to label and train. Please help.

submitted by /u/Field_Great
Categories
Misc

NVIDIA Clara Parabricks Pipelines v3.5 Accelerates Google’s DeepVariant v1.0

NVIDIA recently released NVIDIA Clara Parabricks Pipelines version 3.5, adding a set of new features to the software suite that accelerates end-to-end genome sequencing analysis.

With the release of v3.5, Clara Parabricks Pipelines now provides acceleration to Google’s DeepVariant 1.0, in addition to a suite of existing DNA and RNA tools. The addition of DeepVariant to Parabricks Pipelines brings highly accurate variant calling for both short- and long-read sequencing data to the community.

This new release also enables graphical reports of QC metrics from binary alignment map (BAM) files to variant call files (VCF). Researchers can use these graphical reports to better assess the quality of their sequencing data and the subsequent variant calling before moving the results for additional downstream analysis. 

Parabricks Pipelines is packaged with enterprise support for A100 and other NVIDIA GPUs, offering one of the industry’s fastest compute frameworks for whole genome and whole exome applications. For a whole genome at 30x coverage, a server with 32 virtual CPUs takes about 1,200 minutes to generate a variant call file (VCF), while a server with eight A100 Tensor Core GPUs running Clara Parabricks takes less than 25 minutes to go from FASTQ to VCF.
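As a quick worked check of the figures quoted above (the variable names are illustrative, and the GPU time is treated as an upper bound because the text says "less than 25 minutes"):

```python
cpu_minutes = 1200   # 32-vCPU server, from the text
gpu_minutes = 25     # upper bound for eight A100 GPUs, from the text

# Lower bound on the speedup, since the GPU run takes less than 25 minutes.
speedup = cpu_minutes / gpu_minutes

# Wall-clock time saved per 30x whole genome, in hours.
hours_saved = (cpu_minutes - gpu_minutes) / 60
```

That works out to a speedup of at least 48x, or roughly 19.6 hours saved per genome.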

Start a free one-month trial of NVIDIA Clara Parabricks Pipelines today and learn how to get set up in just 10 minutes with this step-by-step instructional video.