Recommender systems, the economic engines of the internet, are getting a new turbocharger: the NVIDIA Grace Hopper Superchip. Every day, recommenders serve up trillions of search results, ads, products, music and news stories to billions of people. They’re among the most important AI models of our time because they’re incredibly effective at finding in the…
NVIDIA and Booz Allen Hamilton (NYSE: BAH) today announced an expanded collaboration to bring an AI-enabled, GPU-accelerated cybersecurity platform to customers in the public and private sectors.
NVIDIA today announced the second generation of NVIDIA OVX™, powered by the NVIDIA Ada Lovelace GPU architecture and enhanced networking technology, to deliver groundbreaking real-time graphics, AI and digital twin simulation capabilities.
NVIDIA and Deloitte today announced an expansion of their alliance to help enable enterprises around the world to develop, implement and deploy hybrid-cloud solutions using the NVIDIA AI and NVIDIA Omniverse™ Enterprise platforms.
The latest release of NVIDIA Maxine is paving the way for real-time audio and video communications. Whether for a video conference, a call made to a customer service center, or a live stream, Maxine enables clear communications to enhance virtual interactions. NVIDIA Maxine is a suite of GPU-accelerated AI software development kits (SDKs) and cloud-native…
Meet Violet, an AI-powered customer service assistant ready to take your order. Unveiled this week at GTC, Violet is a cloud-based avatar that represents the latest evolution in avatar development through NVIDIA Omniverse Avatar Cloud Engine (ACE), a suite of cloud-native AI microservices that make it easier to build and deploy intelligent virtual assistants and…
New cloud services to support AI workflows and the launch of a new generation of GeForce RTX GPUs featured today in NVIDIA CEO Jensen Huang’s GTC keynote, which was packed with new systems, silicon, and software. “Computing is advancing at incredible speeds, the engine propelling this rocket is accelerated computing, and its fuel is AI,” …
Today, NVIDIA announced the Jetson Orin Nano series of system-on-modules (SoMs). They deliver up to 80X the AI performance of NVIDIA Jetson Nano and set the new standard for entry-level edge AI and robotics applications.
For the first time, the Jetson family includes NVIDIA Orin-based modules that span from the entry-level Jetson Orin Nano to the highest-performance Jetson AGX Orin. This gives customers the flexibility to scale their applications easily.
Jump-start your Jetson Orin Nano development today with full software emulation support provided by the Jetson AGX Orin Developer Kit.
The need for increased real-time processing capability continues to grow for everyday use cases across industries. Entry-level AI applications like smart cameras, handheld devices, service robots, intelligent drones, smart meters, and more all face similar challenges.
These applications require more low-latency processing on-device for the data flowing from their multimodal sensor pipelines while keeping within the constraints of a power-efficient, cost-optimized small form factor.
Jetson Orin Nano series
Jetson Orin Nano series production modules will be available in January starting at $199. The modules deliver up to 40 TOPS of AI performance in the smallest Jetson form factor, with power options from 5W to 15W. The series comes in two versions: Jetson Orin Nano 4GB and Jetson Orin Nano 8GB.
*The NVIDIA Orin architecture shown is that of Jetson Orin Nano 8GB; Jetson Orin Nano 4GB has 2 TPCs and 4 SMs.
As shown in Figure 1, Jetson Orin Nano showcases the NVIDIA Orin architecture with an NVIDIA Ampere Architecture GPU. It has up to eight streaming multiprocessors (SMs) composed of 1024 CUDA cores and up to 32 Tensor Cores for AI processing.
The NVIDIA Ampere Architecture third-generation Tensor Cores deliver better performance per watt than the previous generation and bring more performance with support for sparsity. With sparsity, you can take advantage of the fine-grained structured sparsity in deep learning networks to double the throughput for Tensor Core operations.
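As a rough illustration of fine-grained structured sparsity, here is a minimal sketch in plain Python. It assumes the 2:4 pattern used by NVIDIA Ampere architecture sparse Tensor Cores (at most two nonzero values in every group of four consecutive weights). The function name prune_2_to_4 and the sample weights are hypothetical, and real workflows prune and fine-tune with dedicated tooling rather than with code like this.

```python
def prune_2_to_4(weights):
    """Zero the two smallest-magnitude values in each group of four weights."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


# Hypothetical weights for one row of a layer (two groups of four).
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse_row = prune_2_to_4(row)
print(sparse_row)
```

Because every group of four is guaranteed to hold at most two nonzero values, hardware can skip the zeros in a predictable way, which is where the potential doubling of Tensor Core throughput comes from.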
To accelerate all parts of your application pipeline, Jetson Orin Nano also includes a 6-core Arm Cortex-A78AE CPU, video decode engine, ISP, video image compositor, audio processing engine, and video input block.
Within its small, 70x45mm 260-pin SODIMM footprint, the Jetson Orin Nano modules include various high-speed interfaces:
Up to seven lanes of PCIe Gen3
Three high-speed 10-Gbps USB 3.2 Gen2 ports
Eight lanes of MIPI CSI-2 camera ports
Various sensor I/O
To reduce your engineering effort, we’ve made the Jetson Orin Nano and Jetson Orin NX modules completely pin- and form-factor-compatible. Table 1 shows the differences between the Jetson Orin Nano 4GB and the Jetson Orin Nano 8GB.
Specification | Jetson Orin Nano 4GB | Jetson Orin Nano 8GB
AI Performance | 20 Sparse TOPS, 10 Dense TOPS | 40 Sparse TOPS, 20 Dense TOPS
GPU | 512-core NVIDIA Ampere Architecture GPU with 16 Tensor Cores | 1024-core NVIDIA Ampere Architecture GPU with 32 Tensor Cores
GPU Max Frequency | 625 MHz (both modules)
CPU | 6-core Arm Cortex-A78AE v8.2 64-bit CPU, 1.5 MB L2 + 4 MB L3 (both modules)
* For more information about additional compatibility to DisplayPort 1.4a and HDMI 2.1 and virtual channels, see the Jetson Orin Nano series data sheet.
† 1KU volume
For more information about supported features, see the Software Features section of the latest NVIDIA Jetson Linux Developer Guide.
Start your development today using the Jetson AGX Orin Developer Kit and emulation
The Jetson AGX Orin Developer Kit and all the Jetson Orin modules share one SoC architecture, enabling the developer kit to emulate any of the modules and make it easy for you to start developing your next product today.
You don’t have to wait for the Jetson Orin Nano hardware to be available before starting to port your applications to the new NVIDIA Orin architecture and latest NVIDIA JetPack. With the new overlay released today, you can emulate the Jetson Orin Nano modules with the developer kit, just as with the other Jetson Orin modules. With the developer kit configured to emulate Jetson Orin Nano 8GB or Jetson Orin Nano 4GB, you can develop and run your full application pipeline.
Performance benchmarks with Jetson Orin Nano
With Jetson AGX Orin, NVIDIA is leading the inference performance category of MLPerf. Jetson Orin modules provide a giant leap forward for your next-generation applications, and now the same NVIDIA Orin architecture is made accessible for entry-level AI devices.
We used emulation mode with NVIDIA JetPack 5.0.2 to run computer vision benchmarks with Jetson Orin Nano, and the results showcase how it sets the new standard. Testing included some of our dense INT8 and FP16 pretrained models from NGC, and a standard ResNet-50 model. We also ran the same models for comparison on Jetson Nano, TX2 NX, and Xavier NX.
Taking the geomean of these benchmarks, Jetson Orin Nano 8GB shows a 30x performance increase compared to Jetson Nano. With future software improvements, we expect this to approach a 45x performance increase. Other Jetson devices have increased performance 1.5x since their first supporting software release, and we expect the same with Jetson Orin Nano.
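The arithmetic behind these figures can be sketched in a few lines of Python. The per-model speedups below are made-up placeholders, not the measured results; only the 30x geomean and the 1.5x software uplift come from the text.

```python
import math


def geomean(values):
    """Geometric mean: the n-th root of the product of n positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-model speedups over Jetson Nano (placeholders only).
example_speedups = [20.0, 30.0, 45.0]
print(f"geomean of placeholders: {geomean(example_speedups):.1f}x")

# Figures quoted in the text: 30x today, 1.5x expected from software updates.
measured_geomean = 30.0
software_uplift = 1.5
projected = measured_geomean * software_uplift
print(f"projected speedup: {projected:.0f}x")
```

A geometric mean is the usual choice for averaging speedup ratios because a single outlier model does not dominate the result the way it would in an arithmetic mean.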
Jetson runs the NVIDIA AI software stack, and use-case-specific application frameworks are available, including NVIDIA Isaac for robotics, NVIDIA DeepStream for vision AI, and NVIDIA Riva for conversational AI. You can save significant time with the NVIDIA Omniverse Replicator for synthetic data generation (SDG), and with the NVIDIA TAO Toolkit for fine-tuning pretrained AI models from the NGC catalog.
Jetson compatibility with the overall NVIDIA AI accelerated computing platform makes for ease of development and seamless migration. For more information about the NVIDIA software technologies that we bring in Jetson Orin, join us for an upcoming webinar about NVIDIA JetPack 5.0.2.
Strengthen entry-level robots with NVIDIA Isaac ROS
The Jetson Orin platform is designed to solve the toughest robotics challenges and brings accelerated computing to over 700,000 ROS developers. Combined with the powerful hardware capabilities of Jetson Orin Nano, enhancements in the latest NVIDIA Isaac software for ROS deliver excellent performance and productivity in the hands of roboticists.
The new Isaac ROS DP release optimizes ROS2 node-processing pipelines that can be executed on the Jetson Orin platform and provides new DNN-based GEMs designed to increase throughput. The Jetson Orin Nano can take advantage of those highly optimized ROS2 packages for tasks such as localization, real-time 3D reconstruction, and depth estimation, which can be used for obstacle avoidance.
Unlike the original Jetson Nano, which can only process simple applications, the Jetson Orin Nano can run more complex applications. With a continuing commitment to improving NVIDIA Isaac ROS, you’ll see increased accuracy and throughput on the Jetson Orin platform over time.
For roboticists developing the next generation of service robots, intelligent drones, and more, the Jetson Orin Nano is the ideal solution with up to 40 TOPS for modern AI inference pipelines in a power-efficient and small form factor.
Supercomputers are used to model and simulate the most complex processes in scientific computing, often for insight into new discoveries that otherwise would be impractical or impossible to demonstrate physically.
The NVIDIA BlueField data processing unit (DPU) is transforming high-performance computing (HPC) resources into more efficient systems, while accelerating problem solving across a breadth of scientific research, from mathematical modeling and molecular dynamics to weather forecasting, climate research, and even renewable energy.
BlueField has already made a marked impact in the areas of cloud networking, security, telecommunications, and edge computing. In addition, there are several areas across high-performance computing where it is sparking innovations for application performance and system efficiency.
NVIDIA BlueField-3 provides powerful computing based on multiple Arm AArch64 cores, a multithreaded datapath accelerator, integrated NVIDIA ConnectX-7 400Gb/s networking, and a broad range of programmable acceleration engines in the I/O path. It’s equipped with dual DDR 6500MT/s DRAM controllers and comes standard with 64 GB onboard memory. BlueField-3 is the third-generation data center infrastructure-on-a-chip that enables incredibly efficient and powerful software-defined, hardware-accelerated infrastructures from cloud to core data center to edge.
So, what does all this mean for high-performance computing?
Boosting HPC application performance and scalability
HPC is all about increasing performance and scalability. For nearly two decades, InfiniBand networking has been the proven leader in terms of performance and application scalability for several reasons.
At a high level, InfiniBand offers the most efficient way to move data: direct data placement. Neither the CPU nor the operating system needs to be involved, and no extra copies of the data are made on the way from the network interface, through the system, to the application that needs it.
If InfiniBand is already so efficient, what benefit would BlueField provide?
One of the key challenges that InfiniBand has been addressing for years is moving network communication overhead away from the CPU, enabling it to spend its time focusing on what it does best: application computation and branching code.
The CPU in today’s mainstream servers is overly general-purpose, sharing its compute cycles, time, and resources across hundreds or thousands of processes that have little to nothing to do with actual computing.
BlueField is bringing unprecedented innovation and efficiency to supercomputing by offloading, accelerating, and isolating a broad range of advanced networking, storage, and security services.
Why the era of AI ushered in the need for the BlueField DPU
The field of artificial intelligence research was founded as an academic discipline in 1956. Even a decade before that, scientists began to discuss the possibility of creating an artificial brain. It was much later that the concepts became reality, with more modern computer hardware and software.
In 2006, NVIDIA introduced CUDA, the industry’s first C-compiler developer environment for the GPU, solving complex computing problems up to 100x faster than traditional approaches. Today, artificial intelligence is prolific and driving nearly every area of scientific research, changing our lives and shaping the industrial landscape.
Similarly, the first proposals for nonblocking collective operations appeared in mid-2006. The proposed nonblocking interfaces for the collective group communication functions of the Message Passing Interface (MPI) were certainly promising in theory. However, they were not implemented across many applications. Perhaps this was because, until the introduction of the DPU, the full benefits could not be realized.
Today, with BlueField-3, the technology has arrived, providing the fundamental elements needed for innovation, performance, and efficiency. There is renewed interest in nonblocking collective operations to increase application performance and scalability and to counter the effects of operating system jitter.
There are also several areas across scientific computing, including early examples, where BlueField is demonstrating how it can be used to transform HPC into highly efficient and sustainable computing.
Saving CPU cycles with in-network computing
NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology improves the performance of MPI operations by offloading many blocking collective operations from the CPU to the switch network and eliminating the need to send data multiple times between endpoints. This approach decreases the amount of data traversing the network as aggregation nodes are reached and dramatically reduces MPI operation time.
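To see why aggregating inside the network reduces traffic, consider the following toy model in Python. It is not the SHARP API or protocol, only a sketch under simple assumptions: each aggregation node sums the vectors it receives from its children and forwards a single vector upward. The function name, host count, and radix are hypothetical.

```python
def aggregate_in_tree(host_vectors, radix):
    """Reduce vectors level by level, as a tree of aggregation nodes would.

    Returns the final sum and the number of vectors sent upward at each level.
    """
    if radix < 2:
        raise ValueError("radix must be at least 2")
    level = [list(v) for v in host_vectors]
    sent_per_level = []
    while len(level) > 1:
        sent_per_level.append(len(level))  # one vector per child goes upward
        next_level = []
        for start in range(0, len(level), radix):
            children = level[start:start + radix]
            # Element-wise sum of the children's vectors at this node.
            next_level.append([sum(col) for col in zip(*children)])
        level = next_level
    return level[0], sent_per_level


hosts = [[rank, 1] for rank in range(16)]  # 16 hosts, 2-element vectors
total, sent = aggregate_in_tree(hosts, radix=4)
print(total, sent)
```

In this model the number of vectors in flight shrinks by the radix at every level of the tree, so the data volume falls as the reduction moves toward the root instead of every endpoint exchanging data with its peers.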
BlueField extends additional in-network computing capabilities by leveraging its Arm cores to implement the nonblocking operations. This enables the system host CPU to perform computation with peak overlap.
Figure 2 shows an example of this using the MVAPICH2-DPU library, which is being optimized to take advantage of the full potential of BlueField. It shows the capability to extract peak overlap between computation happening at the host and MPI_Ialltoall communication.
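One common way to quantify this kind of overlap is to compare the total time of a run that does communication and computation together against the computation time alone, relative to the time of the collective on its own. The sketch below assumes that definition; the text does not say how the results in Figure 2 were measured, and the timings are hypothetical.

```python
def overlap_percent(total_time, compute_time, comm_time):
    """Percentage of communication time hidden behind computation.

    total_time:   wall time with the nonblocking collective and compute together
    compute_time: wall time of the computation alone
    comm_time:    wall time of the collective alone (no computation)
    """
    if comm_time <= 0:
        raise ValueError("comm_time must be positive")
    exposed = total_time - compute_time  # communication the host still waits on
    return max(0.0, min(100.0, 100.0 * (1.0 - exposed / comm_time)))


# Hypothetical timings in microseconds, for illustration only.
mostly_hidden = overlap_percent(total_time=105.0, compute_time=100.0, comm_time=50.0)
not_hidden = overlap_percent(total_time=150.0, compute_time=100.0, comm_time=50.0)
print(mostly_hidden, not_hidden)
```

When the collective progresses entirely on the DPU while the host computes, the total time approaches the compute time and the overlap approaches 100 percent. When the host must drive the communication itself, the two times add up and the overlap falls toward zero.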
Computational storage for HPC workloads
Computational storage, or in-storage computing, brings HPC capabilities to traditional storage devices. In-storage computing enables you to perform selected computing tasks within or next to a storage device, offloading host processing and reducing data movement. BlueField provides the ability to combine in-storage and in-networking computing on a single card.
BlueField enables storage software stacks to be offloaded from compute nodes while also existing as a fabric-attached NVMe controller capable of accelerating critical storage functions, such as compression, checksum calculation, and parity generation. Such services are offered in parallel file systems.
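To make the three storage functions named above concrete, here is a host-side sketch using only the Python standard library. It does not use NVIDIA DOCA or any BlueField interface; it simply shows the kind of per-block work (compression, checksum calculation, and parity generation) that offloading removes from the host CPU. The stripe layout and the xor_parity helper are hypothetical.

```python
import zlib


def xor_parity(blocks):
    """XOR equal-sized data blocks into one parity block (RAID-5 style)."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(parity):
            raise ValueError("all blocks must have the same size")
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


# Hypothetical stripe of three 8-byte data blocks.
stripe = [b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"]

compressed = zlib.compress(b"".join(stripe))          # compression
checksums = [zlib.crc32(block) for block in stripe]   # checksum calculation
parity = xor_parity(stripe)                           # parity generation

# Parity lets a lost block be rebuilt from the surviving blocks.
rebuilt = xor_parity([stripe[0], stripe[2], parity])
print(rebuilt == stripe[1])
```

Each of these loops touches every byte of every block, which is why running them on the storage path rather than on the compute node can free a meaningful share of host cycles.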
The entire storage system stack is transparently offloaded within the Linux kernel while enabling simple NVIDIA DOCA implementations of standard storage functions on the NVMe target side.
The next-generation open storage architecture offers a new paradigm for accelerating, isolating, and securing high-performance storage systems. The system employs hardware and software co-design, making the DPU incredibly efficient and transparent to the user.
Acceleration of the file system means increasing the performance of critical functions within the storage system, with storage system performance being a key enabler of deep-learning-based scientific inquiry.
The ability to fully offload both the storage client and server onto DPUs leads to previously unrealizable levels of security and performance isolation. Critical data plane and control plane functions are moved to a separate domain on the DPU. This relieves the server CPU from the work and protects the functions in case the CPU or its software are compromised.
NVIDIA DOCA software framework
The NVIDIA DOCA SDK is the key to unlocking the potential of BlueField. Together, NVIDIA DOCA and BlueField enable the development of applications that deliver breakthrough networking, security, storage, and application performance with a comprehensive, open development platform.
NVIDIA DOCA supports a range of operating systems and distributions and includes drivers, libraries, tools, documentation, and example applications. The upcoming NVIDIA DOCA 1.5 and 2.0 releases introduce a broad range of networking, storage, and security capabilities and enhancements that deliver breakthrough performance and advanced programmability for HPC developers:
A new communication channel library
Fast access to host memory for UCX accelerations
Storage emulation (SNAP) including storage encryption
New NVIDIA DOCA services including UCC offload service and telemetry service
NVIDIA DOCA security SDK
Transforming HPC today and tomorrow
There are many areas of innovation already on the horizon where BlueField, NVIDIA DOCA, and the community will continue to transform HPC.
Some ideas are already past the whiteboard, such as enhanced performance isolation at a data center scale or enhancing job schedulers for more intelligent job placement.
Because scientific applications are often highly synchronized, the negative effects of system noise have a much greater impact on performance on a large-scale HPC system. Reducing system noise caused by other processes, such as storage, is critical.
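A simple probability model shows why noise matters more at scale. Assume, purely for illustration, that each rank is independently delayed by system noise in a given step with probability p, and that a synchronized step is delayed whenever any one rank is. The independence assumption and the value of p below are hypothetical.

```python
def prob_step_delayed(p_rank_delayed, num_ranks):
    """Probability that at least one of num_ranks is delayed in a step."""
    return 1.0 - (1.0 - p_rank_delayed) ** num_ranks


# Hypothetical per-rank noise probability of 0.1% per step.
for ranks in (1, 100, 1_000, 10_000):
    print(ranks, round(prob_step_delayed(0.001, ranks), 3))
```

Under these assumptions, an event that is rare on a single node becomes very likely to affect a synchronized step once thousands of ranks take part, so the whole job tends to run at the pace of its unluckiest rank.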
Telemetry information is powerful. It’s not just about collecting information about routers, switches, and network traffic. Rather, it is possible to gather and share information by workload and I/O characterization.
AI frameworks precisely tune the performance isolation algorithms within the NVIDIA Quantum-2 InfiniBand platform. Multi-application environments sharing common data center resources, such as the network and storage, are ensured the best possible performance, as if the applications were running on bare metal as a single instance.
BlueField is perfectly positioned to address the challenges presented by large-scale computing.
Announced at GTC, technical artists, software developers, and ML engineers can now build custom, physically accurate, synthetic data generation pipelines in the cloud with NVIDIA Omniverse Replicator.
Omniverse Replicator is a highly extensible framework built on the NVIDIA Omniverse platform that enables physically accurate 3D synthetic data generation to accelerate the training and accuracy of perception networks.
Omniverse Replicator is now deployable in the cloud through containers hosted on NVIDIA NGC and SaaS available for early access by application. The Replicator suite of tools and content also now features a new Replicator Insight app for enhanced viewing and inspecting of generated data, plus new SimReady content and guides for plug-and-play synthetic data workflows.
Numerous partners are integrating Omniverse Replicator in their existing tools to extend their synthetic data workflows. Siemens with their SynthAI software, SmartCow, Mirage, and Lightning AI are among the first to use Omniverse Replicator to accelerate high-quality synthetic data generation.
Synthetic data: From local to cloud
For developers and enterprises who want the flexibility and scalability of cloud deployment, Omniverse Replicator is now available as container deployments on AWS. You can become a member of NGC to access containers and self-service deploy on Amazon EC2 G5 instances featuring A10G Tensor Core GPUs.
Enhanced inspecting and viewing
Generating synthetic data and improving AI models is an iterative process requiring the ability to view and analyze generated datasets along the way. This process can be quite cumbersome as data is not easily navigable and annotations are not easily inspected.
At GTC, we released Omniverse Replicator Insight early access, an app that enables you to view, inspect, and analyze generated datasets with a range of annotations efficiently and intuitively. Replicator Insight lets you browse generated datasets from different sensors on a frame-by-frame basis, select points of interest to view, and inspect specific annotations of specific objects.
Replicator Insight lets developers and researchers take a leap towards data-centric AI training, integrating synthetic data more seamlessly into the model improvement process.
New SimReady assets
Omniverse Replicator SimReady Universal Scene Description (USD) assets help you get started generating synthetic data to narrow the gap between simulation and reality:
High-fidelity 3D assets that jump-start synthetic data generation at the pixel level.
Contextual content assets that help train data for diversity, context, and behaviors in a scene.
Conveyor belts, ramps, and cardboard boxes are just a few examples of SimReady assets available in the Omniverse Replicator library.
Several partners are using Omniverse Replicator to accelerate the training and performance of AI perception networks. Their applications span every phase of end-to-end synthetic data generation workflows. With Omniverse Replicator as a foundational platform for their applications, these partners are helping customers strengthen datasets and improve the accuracy of AI models for a variety of industry use cases.
Mirage is helping ML engineers understand where their dataset is weak and integrate synthetic data that fixes these weaknesses. Replicator is the backbone from which Mirage’s customers generate high-fidelity data to improve their ML models.
Lightning AI lets you build models and use or create Lightning Apps: powerful, end-to-end machine learning systems that are fully customizable. The Omniverse Replicator Lightning App lets you quickly generate synthetic data to reduce the cost and effort associated with gathering and labeling real-world data.
With Lightning AI, researchers and developers can run parallel AutoML jobs, find the best-performing object detection model, and verify performance on real-world data for synthetic data generation.
SmartCow leverages Omniverse Replicator to generate synthetic data with variations simply and effectively. Adding those variations through Replicator enables SmartCow to continuously improve model accuracy with ease. SmartCow uses Replicator in its iterative process to generate additional variations from data drift detections and create improved models.
Siemens is collaborating with NVIDIA to bring the Omniverse Replicator high-fidelity rendering capabilities and SDK to SynthAI’s cloud. This will ensure a simple, streamlined workflow from product design and collaboration to synthetic data generation and model training and ending with successful deployment.