NVIDIA today reported record revenue for the third quarter ended October 31, 2021, of $7.10 billion, up 50 percent from a year earlier and up 9 percent from the previous quarter, with record revenue from the company’s Gaming, Data Center and Professional Visualization market platforms.
Category: Misc
Error when installing tensorflow…
- Hey guys, I get this error message when I try to run:
- import tensorflow as tf
- print(tf.__version__)
- 2021-11-17 19:57:46.733325: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
- 2021-11-17 19:57:46.739099: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
- 2.7.0
- Can anybody help me figure out what the problem is?
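As the second log line notes, the cudart warning is expected and harmless on a machine without a CUDA-capable GPU. As an illustrative check (not part of the original post), TensorFlow can report which devices it actually sees:

```python
# Minimal sketch: confirm the install works and see which devices TensorFlow found.
# On a CPU-only machine the cudart64_110.dll warning above is expected and can be ignored.
import tensorflow as tf

print(tf.__version__)                          # e.g. 2.7.0
print(tf.config.list_physical_devices("GPU"))  # an empty list means no usable GPU was found
```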
submitted by /u/Davidescu-Vlad
How to use tensorflow with an AMD GPU
Learn more about the many ways scientists are applying advancements in Million-X computing and solving global challenges.
At NVIDIA GTC last week, Jensen Huang laid out the vision for realizing multi-Million-X speedups in computational performance. The breakthrough could solve the challenge of computational requirements faced in data-intensive research, helping scientists further their work.
Solving challenges with Million-X computing speedups
Million-X unlocks new worlds of potential and the applications are vast. Current examples from NVIDIA include accelerating drug discovery, accurately simulating climate change, and driving the future of manufacturing.
Drug discovery
Researchers at NVIDIA, Caltech, and the startup Entos blended machine learning and physics to create OrbNet, speeding up molecular simulations by many orders of magnitude. As a result, Entos can accelerate its drug discovery simulations by 1,000x, finishing in 3 hours what would have taken more than 3 months.
Climate change
Last week, Jensen Huang announced plans to build Earth 2, a digital twin of the Earth in Omniverse. The world's most powerful AI supercomputer will be dedicated to simulating climate models that predict the impacts of global warming in different places across the globe. Understanding these changes over time can help humanity plan for and mitigate them at a regional level.
Future manufacturing
The Earth is not the first digital twin project enabled by NVIDIA. Researchers are already building physically accurate digital twins of cities and factories. The simulation frontier is still young and full of potential, waiting for the catalyst that massive increases in computing will provide.
Share your Million-X challenge
Share how you are using Million-X computing on Facebook, LinkedIn, or Twitter using #MyMillionX and tagging @NVIDIAHPCDev.
The NVIDIA developer community is already changing the world, using technology to solve difficult challenges. Join the community. >>
Below are a handful of notable examples.
The community that is changing the world
Smart waterways, safer public transit, and eco-monitoring in Antarctica

The work of Johan Barthelemy is interdisciplinary and covers a variety of industries. As the head of the University of Wollongong’s Digital Living Lab, he aims to deliver innovative AIoT solutions that champion ethical and privacy-compliant AI.
Currently, Barthelemy is working on an assortment of projects, including a smart-waterways computer vision application that detects stormwater blockages in real time, helping cities prevent city-wide issues.
Another project, currently being deployed in multiple cities, is AI camera software that detects and reports violence on Sydney trains through aggressive-stance modeling.
An AIoT platform for remotely monitoring Antarctica’s terrestrial environment is also in the works. Built around an NVIDIA Jetson Xavier NX edge computer, the platform will be used to monitor the evolution of moss beds—their health being an early indicator of the impact of climate change. The data collected will also inform a variety of models developed by the Securing Antarctica’s Environmental Future community of researchers, in particular hydrology and microclimate models.
Connect: LinkedIn | Twitter | Digital Living Lab
Never-before-seen views of SARS-CoV-2

NVIDIA researchers and 14 partners successfully developed a platform to explore the composition, structure, and dynamics of aerosols and aerosolized viruses at the atomic level.
This work overcomes the previously limited ability to examine aerosols at the atomic and molecular level, which had obscured our understanding of airborne transmission. Leveraging the platform, the team produced a series of novel discoveries about the SARS-CoV-2 Delta variant.
These breakthroughs dramatically extend the capabilities of multiscale computational microscopy in experimental methods. The full impact of the project has yet to be realized.
Species recognition, environmental monitoring, and adaptive streaming

Dr. Albert Bifet is the Director of Te Ipu o te Mahara, the Artificial Intelligence Institute at the University of Waikato, and Professor of Big Data at Télécom Paris.
Bifet also leads the TAIAO project, a data science program using an NVIDIA DGX A100 to build deep learning models for species recognition. He is co-developing River, a new Python machine-learning library for online/streaming machine learning, and building a new data repository to improve reproducibility in environmental data science.
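As a flavor of the online/streaming approach River takes, here is a minimal sketch using the library's bundled Phishing dataset; the scaler-plus-logistic-regression pipeline and ROC AUC metric are illustrative choices, not TAIAO code.

```python
# Minimal sketch of online/streaming learning with River: samples arrive one at a time,
# the model predicts first, then the metric and the model are updated incrementally.
from river import datasets, linear_model, metrics, preprocessing

model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
metric = metrics.ROCAUC()

for x, y in datasets.Phishing():         # a small built-in binary classification stream
    y_pred = model.predict_proba_one(x)  # predict before seeing the label
    metric.update(y, y_pred)             # update the running ROC AUC
    model.learn_one(x, y)                # then learn from this single sample

print(metric)  # running ROC AUC over the whole stream
```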
Additionally, researchers at TAIAO are building new approaches to compute GPU-based SHAP values for XGBoost, and developing a new adaptive streaming XGBoost.
Connect: Website | LinkedIn | Twitter
Medical imaging, therapy robots, and NLP depression detection

The current interests of Dr. Ekapol Chuangsuwanich fall within the medical imaging domain, including chest X-ray and histopathology technology. Over the past few years, however, his work has spanned many areas, including NLP, ASR, and medical imaging.
Last year, Chuangsuwanich and his team developed the PYLON architecture, which can learn precise pixel-level object location with only image-level annotation. This is deployed across hospitals in Thailand to provide rapid COVID-19 severity assessments and to facilitate screening of tuberculosis in high-risk communities.
Additionally, he is working on NLP and ASR robots for medical use, including a speech therapy helper and call center robot with depression detection functionality. His startup, Gowajee, is also providing state-of-the-art ASR and TTS for the Thai language. These projects have been created using the NVIDIA NeMo framework and deployed on NVIDIA Jetson Nano devices.
Connect: Website | Org | Facebook
Trillion atom quantum-accurate molecular dynamics simulations

Researchers from the University of South Florida, NVIDIA, Sandia National Labs, NERSC, and the Royal Institute of Technology collaborated to produce a machine-learned interatomic potential for LAMMPS named SNAP (Spectral Neighbor Analysis Potential).
SNAP was found to be accurate across a huge pressure-temperature range, from 0 to 50 Mbar and 300 to 20,000 Kelvin. Peak molecular dynamics performance was more than 22x the previous record, achieved on a 20-billion-atom system simulated on Summit for 1 ns in a day.
The project qualified as a Gordon Bell Prize finalist, and the near-perfect weak scaling of SNAP MD highlights the potential to extend quantum-accurate MD to trillion-atom simulations on upcoming exascale platforms. This dramatically expands the scientific return of X-ray free-electron laser diffraction experiments.
Bioinformatics, smart cities, and translational research

Dr. Ng See-Kiong is constantly in search of big data. A practicing data scientist, See-Kiong is also a Professor of Practice and Director of Translational Research at the National University of Singapore.
Current projects on his desk leverage the NVIDIA NeMo framework, covering NLP for indigenous and vernacular languages across Singapore and New Zealand. See-Kiong is also working on intelligent COVID-19 contact tracing and outbreak detection, intelligent social event sensing, and assessing the credibility of information in new media.
Connect: Website
Learn about the optimizations and techniques used across the full stack in the NVIDIA AI platform that led to a record-setting performance in MLPerf HPC v1.0.
In MLPerf HPC v1.0, NVIDIA-powered systems won four of five new industry metrics focused on AI performance in HPC. Run by the industry-wide AI consortium MLCommons, MLPerf HPC evaluates a suite of performance benchmarks covering a range of widely used AI workloads.
In this round, NVIDIA delivered 5x better results for CosmoFlow, and 7x more performance on DeepCAM, compared to strong scaling results from MLPerf 0.7. The strong showing is the result of a mature NVIDIA AI platform with a full stack of software.
With such a rich and diverse set of libraries, SDKs, tools, compilers, and profilers, it can be difficult to know when and where to apply the right asset. This post details the tools, techniques, and benefits for various scenarios, and outlines the results achieved for the CosmoFlow and DeepCAM benchmarks.
We have published similar guides for MLPerf Training v1.0 and MLPerf Inference v1.1, which are recommended for other benchmark-oriented cases.
The tuning plan
We tuned our code with tools including NVIDIA DALI to accelerate data processing, and CUDA Graphs to reduce small-batch latency for efficiently scaling out to 1,024 or more GPUs. We also applied NVIDIA SHARP to accelerate communications by offloading some operations to the network switch.
The software used in our submissions is available from the MLPerf repository. We regularly add new tools along with new versions to the NGC catalog—our software hub for pretrained AI models, industry application frameworks, GPU applications, and other software resources.
Major performance optimizations
In this section, we dive into selected optimizations implemented for MLPerf HPC v1.0.
Using NVIDIA DALI library for data preprocessing
Data is fetched from disk and preprocessed before each iteration. We moved from the default data loader to the NVIDIA DALI library, which provides optimized data loading and preprocessing functions for GPUs.
Instead of performing data loading and preprocessing on the CPU and moving the result to the GPU, DALI uses a combination of CPU and GPU. This leads to more efficient preprocessing of the data for the upcoming iteration. The optimization results in a significant speedup for both CosmoFlow and DeepCAM; DeepCAM achieved over a 50% end-to-end performance gain.
DALI also provides asynchronous data loading for the upcoming iteration, which removes I/O overhead from the critical path. With this mode enabled, we saw an additional 70% gain on DeepCAM.
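As an illustration of the approach (a generic sketch, not the benchmark's actual pipeline), a DALI pipeline wrapped in a PyTorch iterator can look like the following; the directory layout, batch size, and image-style preprocessing are assumptions.

```python
# Minimal sketch: a DALI pipeline that reads files on the CPU, decodes and normalizes on the
# GPU, and feeds PyTorch through DALIGenericIterator in place of the default DataLoader.
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def(batch_size=16, num_threads=4, device_id=0)
def train_pipeline(data_dir):
    encoded, label = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    images = fn.decoders.image(encoded, device="mixed")   # CPU read, GPU decode
    images = fn.crop_mirror_normalize(                    # normalization done on the GPU
        images, dtype=types.FLOAT, output_layout="CHW",
        mean=[0.0, 0.0, 0.0], std=[255.0, 255.0, 255.0])
    return images, label

pipe = train_pipeline(data_dir="./train_data")  # hypothetical directory of samples
pipe.build()
loader = DALIGenericIterator(pipe, ["data", "label"], reader_name="Reader")

for batch in loader:
    x, y = batch[0]["data"], batch[0]["label"]  # CUDA tensors ready for the training step
    break
```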
Applying the channels-last NHWC layout
By default, the DeepCAM benchmark uses the NCHW layout for activation tensors. We used PyTorch's channels-last (NHWC layout) support to avoid extra transpose kernels. Most convolution kernels in cuDNN are optimized for the NHWC layout.
As a result, using the NCHW layout in the framework requires additional transpose kernels to convert from NCHW to NHWC for efficient convolution. Using the NHWC layout in the framework avoids these redundant copies and delivered about a 10% performance gain on the DeepCAM model. NHWC support is available in PyTorch in beta.
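A minimal sketch of the technique in PyTorch is shown below; the toy model and tensor shapes are illustrative, not DeepCAM's.

```python
# Minimal sketch: store weights and activations in channels-last (NHWC) memory format
# so cuDNN can select NHWC-optimized convolution kernels without extra transposes.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 3, kernel_size=3, padding=1),
).cuda().to(memory_format=torch.channels_last)      # convert the weights to NHWC

x = torch.randn(16, 3, 128, 128, device="cuda").to(memory_format=torch.channels_last)

with torch.cuda.amp.autocast():                      # mixed precision pairs well with NHWC kernels
    y = model(x)                                     # runs without NCHW<->NHWC transpose kernels

print(y.is_contiguous(memory_format=torch.channels_last))  # True
```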
CUDA Graphs
CUDA Graphs allow launching a single graph that consists of a sequence of kernels, instead of individually launching each of the kernels from CPU to GPU. This feature minimizes CPU involvement in each iteration, substantially improving performance by minimizing latencies—especially for strong scaling scenarios.
CUDA Graphs support was added to MXNet previously and, more recently, to PyTorch. In PyTorch, CUDA Graphs delivered around a 15% end-to-end performance gain in DeepCAM for the strong-scaling scenario, which is most sensitive to latency and jitter.
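A minimal sketch of whole-iteration capture with PyTorch CUDA Graphs (available in recent PyTorch releases) is shown below; the tiny model, optimizer, and static shapes are illustrative.

```python
# Minimal sketch: warm up, capture one full training iteration into a CUDA graph,
# then replay it so each iteration costs a single CPU launch instead of many.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
static_x = torch.randn(64, 1024, device="cuda")   # static input/target buffers reused every step
static_y = torch.randn(64, 1024, device="cuda")

# Warmup on a side stream so workspaces and autograd state exist before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        loss = (model(static_x) - static_y).pow(2).mean()
        loss.backward()
        opt.step()
torch.cuda.current_stream().wait_stream(s)

# Capture forward, backward, and the optimizer step into a single graph.
g = torch.cuda.CUDAGraph()
opt.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = (model(static_x) - static_y).pow(2).mean()
    static_loss.backward()
    opt.step()

for _ in range(10):
    static_x.copy_(torch.randn(64, 1024, device="cuda"))  # refill the static buffers
    static_y.copy_(torch.randn(64, 1024, device="cuda"))
    g.replay()                                            # one launch runs the whole iteration
```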
Efficient data staging with MPI
For weak scaling, the performance of the distributed file system cannot sustain the demand from the GPUs. To increase the aggregate storage bandwidth, we stage the dataset into node-local NVMe storage for DeepCAM.
Since the individual instances are small, we can shard the data statically, and thus only need to stage a fraction of the full dataset per node. This solution is depicted in Figure 1. Here we denote the number of instances with M and the number of ranks per instance with N.

Note that across instances, each rank with the same rank ID uses the same shard of data. This means that, naively, each data shard is read M times. To reduce pressure on the file system, we created subshards of the data orthogonal to the instances, as depicted in Figure 2.

This way, each file is read only once from the global file system. Finally, each instance needs to receive all the data. For this purpose, we created new MPI communicators orthogonal to the intra-instance communicator; that is, we combine all instance ranks with the same rank ID into the same inter-instance communicator. Then we can use MPI allgather to combine the individual subshards into M copies of the original shard.

Instead of performing these steps sequentially, we use batching to create a pipeline that overlaps data reading and the distribution of subshards. To further improve read and write performance, we implemented a small helper tool that uses O_DIRECT to increase I/O bandwidth.
The optimization resulted in more than 2x end-to-end speedup for the DeepCAM benchmark. This is available in the submission repository.
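A minimal sketch of the orthogonal-communicator idea using mpi4py is shown below; the instance size, rank layout, and placeholder sample names are assumptions, not the submission code.

```python
# Minimal sketch: split COMM_WORLD into intra-instance and inter-instance communicators,
# read one sub-shard per rank, then allgather across instances to rebuild the full shard.
from mpi4py import MPI

world = MPI.COMM_WORLD
N = 4                              # ranks per training instance (assumption)
instance_id = world.rank // N      # which of the M instances this rank belongs to
local_rank = world.rank % N        # rank ID inside the instance

intra_comm = world.Split(instance_id, local_rank)  # ranks of the same instance
inter_comm = world.Split(local_rank, instance_id)  # same local rank ID across instances

# Each rank reads only its own sub-shard from the global file system once ...
my_subshard = [f"sample_{local_rank}_{instance_id}_{i}" for i in range(8)]  # placeholder data

# ... and the allgather across instances gives every instance a full copy of the shard.
full_shard = [s for part in inter_comm.allgather(my_subshard) for s in part]
print(f"rank {world.rank}: staged {len(full_shard)} samples for shard {local_rank}")
```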
Loss hybridization
An imperative approach to model definition and execution is flexible: the model is defined and run like a standard Python program. Symbolic programming, on the other hand, declares the computation upfront, before execution. This allows the engine to perform various optimizations but loses the flexibility of the imperative approach.
Hybridization is a way of combining those two approaches in the MXNet framework: an imperatively defined calculation can be compiled into symbolic form and optimized where possible. CosmoFlow extends model hybridization to include the loss calculation.

This allows element-wise operations in the loss calculation to be fused with the scaled activation output of the CosmoFlow model, reducing overall iteration latency. The optimization resulted in close to a 5% end-to-end performance gain for CosmoFlow.
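A minimal sketch of the idea in MXNet Gluon is shown below; the tiny stand-in network, shapes, and mean-squared-error loss are assumptions, not the CosmoFlow model itself.

```python
# Minimal sketch: define model plus loss in one HybridBlock, then hybridize() so the whole
# computation, including the element-wise loss, is compiled into an optimizable symbolic graph.
import mxnet as mx
from mxnet.gluon import nn

class ModelWithLoss(nn.HybridBlock):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.body = nn.Dense(4)  # stand-in for the real network body

    def hybrid_forward(self, F, x, target):
        pred = self.body(x)
        return F.mean(F.square(pred - target))  # element-wise loss inside the hybridized graph

net = ModelWithLoss()
net.initialize(ctx=mx.gpu(0))
net.hybridize(static_alloc=True, static_shape=True)  # compile imperative code to symbolic form

x = mx.nd.random.uniform(shape=(16, 32), ctx=mx.gpu(0))
t = mx.nd.random.uniform(shape=(16, 4), ctx=mx.gpu(0))
with mx.autograd.record():
    loss = net(x, t)
loss.backward()
```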
Employing SHARP for internode all-reduce collective
SHARP allows offloading collective operations from the CPU to the switches in the internode network fabric. This effectively doubles the internode bandwidth of the InfiniBand network for allreduce operations. The optimization results in up to a 5% performance gain for the MLPerf HPC benchmarks, especially in strong-scaling scenarios.
Moving forward with MLPerf HPC
Scientists are making breakthroughs at an accelerated pace, in part because AI and HPC are combining to deliver insight faster and more accurately than could be done using traditional methods.
MLPerf HPC v1.0 reflects the supercomputing industry’s need for an objective, peer-reviewed method to measure and compare AI training performance for use cases relevant to HPC. In this round, the NVIDIA compute platform demonstrated clear leadership by winning all three benchmarks for performance, and also demonstrated the highest efficiency for both throughput measurements.
NVIDIA has also worked with several supercomputing centers around the world for their submissions with NVIDIA GPUs. One of them, the Jülich Supercomputing Centre, has the fastest submissions from Europe.
Read more stories of 2021 Gordon Bell finalists, as well as a discussion of how HPC and AI are making new types of science possible.
Learn more about the MLPerf benchmarks and results from NVIDIA.
Featured image of the JUWELS Booster, powered by NVIDIA A100 GPUs, courtesy of Forschungszentrum Jülich / Sascha Kreklau.
Disclaimer:
MLPerf v1.0 HPC Closed Strong & Weak Scaling – Result retrieved from https://mlcommons.org/en/training-hpc-10 on Nov. 16, 2021.
The MLPerf name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
NVIDIA-powered systems won four of five tests in MLPerf HPC 1.0, an industry benchmark for AI performance on scientific applications in high performance computing. They’re the latest results from MLPerf, a set of industry benchmarks for deep learning first released in May 2018. MLPerf HPC addresses a style of computing that speeds and augments simulations Read article >
The post MLPerf HPC Benchmarks Show the Power of HPC+AI appeared first on The Official NVIDIA Blog.

submitted by /u/lizziepika
The latest NVIDIA HPC SDK includes a variety of tools to maximize developer productivity, as well as the performance and portability of HPC applications.
Today, NVIDIA announced the upcoming HPC SDK 21.11 release with new Library enhancements. This software will be available free of charge in the coming weeks.
The NVIDIA HPC SDK is a comprehensive suite of compilers and libraries for high-performance computing development. It includes a wide variety of tools proven to maximize developer productivity, as well as the performance and portability of HPC applications.
The HPC SDK and its components are updated numerous times per year with new features, performance advancements, and other enhancements.
What’s new
This 21.11 release will include updates to HPC C++/Fortran compiler support and the developer environment, as well as new multinode, multiGPU library capabilities.
Compiler, build systems, and other enhancements
Introduced last year with version 20.11, the NVFORTRAN compiler automatically parallelizes code written using the DO CONCURRENT standard language feature as described in this post.
New in version 21.11, the programmer can use the REDUCE clause as described in the current working draft of the ISO Fortran Standard to perform reduction operations, a requirement of many scientific algorithms.
Starting with the 21.11 release, the HPC Compilers now support the --gcc-toolchain option, similar to the clang-based compilers. This is provided in addition to the existing rc-file method of specifying nondefault GNU Compiler Collection (GCC) versions. The HPC Compilers leverage open source GCC libraries for things like common system operations and C++ standard library support.
Sometimes, a developer needs a different version of the GCC toolchain than the system default. Now 21.11 has both command-line and file-based ways of making that specification. In addition to --gcc-toolchain, the 21.11 HPC Compilers add several GCC-compatible command-line flags for specifying x86-64 target architecture details.
The 21.11 release now includes two new Fortran modules that integrate with NVIDIA libraries, helping Fortran applications maximize the benefit of NVIDIA platforms and Fortran developers be as productive as possible. One module lets HPC applications written in Fortran directly use cufftX, a highly optimized multi-GPU FFT library from NVIDIA; the other enables easier use of the NVIDIA Tools Extension library (NVTX) for performance and profiling studies with Nsight.
Version 21.11 will ship with CMake config files that define CMake targets for the various components of the HPC SDK. This offers application packagers and developers a more seamless code integration with the NVIDIA HPC SDK.
New multinode, multiGPU Math Libraries
HPC SDK version 21.11 will include the first of our upcoming multinode, multiGPU Math Library functionality: cuSOLVERMp. Initial functionality will include Cholesky and LU decomposition, with and without pivoting. Future releases will include LU with multiple right-hand sides (RHS).
Learn more about:
- NVIDIA HPC SDK >>
- NVIDIA Math Libraries >>
- Download updates when available. >>
Meet The Spaghetti Detective, an AI-based failure-detection tool for 3D printer remote management and monitoring.
3D printing can be a quick and convenient way to prototype ideas and build useful everyday objects. But it can also be messy—and stressful—when a print job encounters an error that leaves your masterpiece buried in piles of plastic filament. Those tangles of twisted goop are known as “spaghetti monsters,” and they have the power to kill your project and raise your blood pressure.
Thankfully, there is a way to tame these monsters. Meet The Spaghetti Detective (TSD), an AI-based (deep learning) failure-detection tool for 3D printer remote management and monitoring. In other words, with TSD you can detect spaghetti monsters before they get out of hand. It issues an early warning that could save days of work and pounds and pounds of filament.
In fact, according to the team behind TSD, the tool has caught more than 560,000 failed prints by watching more than 47 million hours of 3D project printing time, saving more than 27,500 pounds of filament.
This short demo shows TSD in action:
Kenneth Jiang, founder of TSD, reported being “stunned” at just how outdated most 3D-printing software can be. So he and his team created TSD to bring new technologies to the world of 3D printing.
Every part of TSD is open source, including the plug-in, the backend, and the algorithm.
According to a post by Jiang in the NVIDIA Developer Forum, TSD is “based on a Convolutional Neural Network architecture called YOLO. It is essentially a super-fast object-detection model.”
The Spaghetti Detective also communicates with OctoPrint, an open-source web interface for your 3D printer. The private TSD server has an array of advanced settings for all requirements, including enabling NVIDIA GPU acceleration, reverse proxies, NGINX settings, and more.
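As a rough illustration of the mechanism described above (not TSD's actual code), the sketch below pulls an OctoPrint-style webcam snapshot and runs a YOLO-style detector on it with OpenCV's DNN module; the snapshot URL, model files, confidence threshold, and failure class are hypothetical placeholders.

```python
# Generic sketch: grab one webcam frame and run a Darknet/YOLO model on it with OpenCV.
import urllib.request

import cv2
import numpy as np

SNAPSHOT_URL = "http://octopi.local/webcam/?action=snapshot"   # typical OctoPrint snapshot URL (assumption)

net = cv2.dnn.readNetFromDarknet("yolo.cfg", "yolo.weights")   # hypothetical model files
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)             # use the GPU if OpenCV was built with CUDA
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

raw = urllib.request.urlopen(SNAPSHOT_URL, timeout=5).read()
frame = cv2.imdecode(np.frombuffer(raw, dtype=np.uint8), cv2.IMREAD_COLOR)

blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())      # raw YOLO detections

# Class 0 is assumed to be the "spaghetti" failure class; flag the frame if any detection is confident.
scores = np.concatenate([o[:, 5:] for o in outputs], axis=0)
if (scores[:, 0] > 0.5).any():
    print("Possible print failure detected; this is where TSD would alert the user or pause the print.")
```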
With more than 600 stars on GitHub, TSD is being used by hundreds of NVIDIA Jetson Nano fans who are also 3D printing enthusiasts. Inspired by their success, Jiang took it upon himself to set up TSD with Jetson Nano, and created a demo to show other users how to set it up.
The project requires an NVIDIA Jetson Nano with 4GB of memory (the team does not advise trying this setup with the 2GB model), an Ethernet cable to connect to your network router, an HDMI cable, a keyboard, and a mouse. TSD is installed using Docker and Docker Compose, and the server uses email delivery through SMTP for notifications. The web interface is written in Django, and you can log in and create a password-secured account. Notifications from TSD can also be sent by SMS.
TSD is available as a free service for occasional 3D-print monitoring. If you expect to be printing daily, there is also a paid option starting at $4 per month.
The team working on TSD also plans to add event-based recording for fluid video capture, improvements to model accuracy and capability, and functionality to enable local hosting for increased data privacy. At the same time, they are clearly having fun figuring it all out, as you can see from their very popular videos on TikTok.
If you are interested in learning more about how Jetson Nano can be used to run The Spaghetti Detective, check out the code in GitHub.
A partial differential equation is “the most powerful tool humanity has ever created,” Cornell University mathematician Steven Strogatz wrote in a 2009 New York Times opinion piece. This quote opened last week’s GTC talk AI4Science: The Convergence of AI and Scientific Computing, presented by Anima Anandkumar, director of machine learning research at NVIDIA and professor Read article >
The post A Revolution in the Making: How AI and Science Can Mitigate Climate Change appeared first on The Official NVIDIA Blog.