I am new to machine learning.
I got the intermediate result of layer 31 of my CNN using the following code:
conv2d = Model(inputs=self.model_ori.input, outputs=self.model_ori.layers[31].output)
intermediateResult = conv2d.predict(img)
Let's say I have this output saved, but 10 days later I want to take this output, feed it back into the next layer (the 32nd), and get the final result.
Is that possible?
My model.summary():
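One way to do this later is to rebuild a "tail" model that starts at layer 32 and reuses the original weights. This is only a sketch: it assumes a purely sequential layer graph (no skip connections), `model_ori` stands for the poster's `self.model_ori`, and `layer31_output.npy` is a hypothetical placeholder for however the intermediate result was saved.

```python
import numpy as np
from tensorflow.keras import Input, Model

# Sketch: build a model covering layers 32..end of the original model.
# Assumes the layer graph is purely sequential (no branches/skip connections).
tail_input = Input(shape=model_ori.layers[31].output_shape[1:])
x = tail_input
for layer in model_ori.layers[32:]:
    x = layer(x)                       # reuses the original layers and weights
tail_model = Model(inputs=tail_input, outputs=x)

# Days later: load the saved layer-31 activation and run only the rest.
intermediate = np.load("layer31_output.npy")   # hypothetical file name
final_result = tail_model.predict(intermediate)
```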
MNIST and image segmentation
I’m very new to tensorflow and machine learning in general and wanted to make something that would take an image of an article of clothing and classify it and remove the background around it. I found a dataset called fashion-MNIST that looks like it might help me with classifying images.
Would it be possible to also use this dataset to get just the pixels of the clothing and remove any background around it? How would I go about doing it and are there any examples that would be helpful for me?
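A rough sketch of how that might look: Fashion-MNIST only provides class labels (no segmentation masks), so the classifier and the background removal are two separate steps here, and the threshold value is a hand-picked assumption. For real photos with arbitrary backgrounds, a segmentation model (for example a small U-Net) trained on images with mask labels would be needed instead.

```python
import numpy as np
import tensorflow as tf

# Classifier: Fashion-MNIST gives 28x28 grayscale images and 10 class labels.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

clf = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
clf.compile(optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"])
clf.fit(x_train, y_train, epochs=2, validation_data=(x_test, y_test))

# Crude "background removal": Fashion-MNIST backgrounds are black, so a simple
# intensity threshold works as a mask. Real photos need a real segmentation model.
img = x_test[0]
mask = img > 0.1                                   # hand-picked threshold
foreground = np.where(mask, img, 0.0)
pred_class = np.argmax(clf.predict(img[None, ...]), axis=-1)[0]
```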
Hi,
I am new to using tensorflow-probability. I am using a Categorical distribution to sample a value and then get its probability and entropy, but every time I sample from the distribution, I get the same entropy. This is for a policy gradient algorithm: the NN outputs logits, which are fed to a Categorical distribution, and then an action is sampled. Please let me know what I am missing here.
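For reference, the entropy is a property of the distribution (its logits), not of the sampled action, so repeated sampling from the same logits always returns the same entropy; it only changes when the network produces new logits. A minimal sketch with made-up logits:

```python
import tensorflow as tf
import tensorflow_probability as tfp

logits = tf.constant([[1.0, 0.5, -0.2]])        # hypothetical policy output
dist = tfp.distributions.Categorical(logits=logits)

for _ in range(3):
    action = dist.sample()
    print(action.numpy(),
          dist.log_prob(action).numpy(),   # varies with the sampled action
          dist.entropy().numpy())          # identical every time for fixed logits

# The entropy only changes when a new distribution is built from fresh logits,
# i.e. after another forward pass of the network.
```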
Error when using LSTM
Hello everybody, I’ve run into an error using tensorflow-gpu version 2.7.0 and I am looking for help. Whenever I try to use tensorflow.keras.layers.LSTM, the kernel of my Jupyter Notebook dies when trying to run model.fit(). I can compile the model and the cell executes without an error. I’ve gotten an error message only once, and it said:
NotImplementedError: Cannot convert a symbolic Tensor (lstm/strided_slice:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported
I’ve only ever gotten this error message once and never again since. I looked it up online and people said it was a compatibility issue with numpy > 1.19.5 so I downgraded numpy but my kernel still dies when trying to do model.fit(). I then tried to pass my training data as a tf tensor by converting the numpy array using tf.convert_to_tensor(). But that didn’t help either. Everything else using tf seems to work, it’s just LSTM giving me issues.
Has anyone an idea how I could fix the issue? Thank you.
Versions in use: tensorflow-gpu 2.7.0, NumPy 1.20.3/1.19.5, CUDA 11.3.1, and cuDNN 8.1.0.77; GPU: RTX 3090
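To help isolate the problem, a minimal, self-contained repro like the sketch below (made-up shapes) can be run from a plain Python shell instead of the notebook, so the full traceback is visible instead of a silent kernel death; printing the TensorFlow and NumPy versions also confirms which pairing is actually active in the environment.

```python
import numpy as np
import tensorflow as tf

# Confirm which versions the environment is really using, and that the GPU is seen.
print(tf.__version__, np.__version__)
print(tf.config.list_physical_devices("GPU"))

x = np.random.rand(64, 10, 8).astype("float32")   # (batch, timesteps, features)
y = np.random.rand(64, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(10, 8)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=1, batch_size=16)
```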
Introducing the NVIDIA HGX H100, a key GPU server building block powered by the Hopper architecture.
The NVIDIA mission is to accelerate the work of the Da Vincis and Einsteins of our time and empower them to solve the grand challenges of society. With the complexity of artificial intelligence (AI), high-performance computing (HPC), and data analytics increasing exponentially, scientists need an advanced computing platform that is able to drive million-X speedups in a single decade to solve these extraordinary challenges.
To answer this need, we introduce the NVIDIA HGX H100, a key GPU server building block powered by the NVIDIA Hopper Architecture. This state-of-the-art platform securely delivers high performance with low latency, and integrates a full stack of capabilities from networking to compute at data center scale, the new unit of computing.
In this post, I discuss how the NVIDIA HGX H100 is helping deliver the next massive leap in our accelerated compute data center platform.
HGX H100 8-GPU
The HGX H100 8-GPU represents the key building block of the new Hopper generation GPU server. It hosts eight H100 Tensor Core GPUs and four third-generation NVSwitches. Each H100 GPU has multiple fourth-generation NVLink ports and connects to all four NVSwitches. Each NVSwitch is a fully non-blocking switch that fully connects all eight H100 Tensor Core GPUs.
This fully connected topology from NVSwitch enables any H100 to talk to any other H100 concurrently. Notably, this communication runs at the NVLink bidirectional speed of 900 gigabytes per second (GB/s), more than 14x the bandwidth of the current PCIe Gen4 x16 bus (roughly 64 GB/s bidirectional).
The third-generation NVSwitch also provides new hardware acceleration for collective operations, with multicast and NVIDIA SHARP in-network reductions. Combined with the faster NVLink speed, the effective bandwidth for common AI collective operations such as all-reduce goes up by 3x compared to the HGX A100. The NVSwitch acceleration of collectives also significantly reduces the load on the GPU.
| | HGX A100 8-GPU | HGX H100 8-GPU | Improvement Ratio |
|---|---|---|---|
| FP8 | – | 32,000 TFLOPS | 6X (vs A100 FP16) |
| FP16 | 4,992 TFLOPS | 16,000 TFLOPS | 3X |
| FP64 | 156 TFLOPS | 480 TFLOPS | 3X |
| In-Network Compute | 0 | 3.6 TFLOPS | Infinite |
| Interface to host CPU | 8x PCIe Gen4 x16 | 8x PCIe Gen5 x16 | 2X |
| Bisection Bandwidth | 2.4 TB/s | 3.6 TB/s | 1.5X |
*Note: FP performance includes sparsity
HGX H100 8-GPU with NVLink-Network support
The emerging class of exascale HPC and trillion-parameter AI models for tasks like accurate conversational AI requires months to train, even on supercomputers. Compressing this to the speed of business and completing training within hours requires high-speed, seamless communication between every GPU in a server cluster.
To tackle these large use cases, the new NVLink and NVSwitch are designed to enable HGX H100 8-GPU to scale up and support a much larger NVLink domain with the new NVLink-Network. Another version of HGX H100 8-GPU features this new NVLink-Network support.
System nodes built with HGX H100 8-GPU with NVLink-Network support can fully connect to other systems through Octal Small Form Factor Pluggable (OSFP) LinkX cables and the new external NVLink Switch. This connection enables NVLink domains of up to 256 GPUs. Figure 3 shows the cluster topology.
| | 256 A100 GPU Pod | 256 H100 GPU Pod | Improvement Ratio |
|---|---|---|---|
| NVLink Domain | 8 GPU | 256 GPU | 32X |
| FP8 | – | 1,024 PFLOPS | 6X (vs A100 FP16) |
| FP16 | 160 PFLOPS | 512 PFLOPS | 3X |
| FP64 | 5 PFLOPS | 15 PFLOPS | 3X |
| In-Network Compute | 0 | 192 TFLOPS | Infinite |
| Bisection Bandwidth | 6.4 TB/s | 70 TB/s | 11X |
*Note: FP performance includes sparsity
Target use cases and performance benefit
With the dramatic increase in HGX H100 compute and networking capabilities, the performance of AI and HPC applications improves vastly.
Today’s mainstream AI and HPC models can fully reside in the aggregate GPU memory of a single node. For models such as BERT-Large and Mask R-CNN, the HGX H100 is the most performance-efficient training solution.
More advanced and larger AI and HPC models require multiple nodes of aggregate GPU memory to fit. For models such as a deep learning recommendation model (DLRM) with terabytes of embedding tables or a large mixture-of-experts (MoE) natural language processing model, the HGX H100 with NVLink-Network accelerates the key communication bottleneck and is the best solution for this class of workload.
Figure 4 from the NVIDIA H100 GPU Architecture whitepaper shows the extra performance boost enabled by the NVLink-Network.
All performance numbers are preliminary based on current expectations and subject to change in shipping products. A100 cluster: HDR IB network. H100 cluster: NDR IB network with NVLink-Network where indicated.
# GPUs: Climate Modeling 1K, LQCD 1K, Genomics 8, 3D-FFT 256, MT-NLG 32 (batch sizes: 4 for A100, 60 for H100 at 1 sec, 8 for A100 and 64 for H100 at 1.5 and 2sec), MRCNN 8 (batch 32), GPT-3 16B 512 (batch 256), DLRM 128 (batch 64K), GPT-3 16K (batch 512), MoE 8K (batch 512, one expert per GPU)
HGX H100 4-GPU
In addition to the 8-GPU version, the HGX family also features a 4-GPU version, which is directly connected with fourth-generation NVLink.
The H100-to-H100 point-to-point peer NVLink bandwidth is 300 GB/s bidirectional, which is about 5X faster than today’s PCIe Gen4 x16 bus.
The HGX H100 4-GPU form factor is optimized for dense HPC deployment:
- Multiple HGX H100 4-GPUs can be packed in a 1U high liquid cooling system to maximize GPU density per rack.
- A fully PCIe switch-less architecture directly connects the HGX H100 4-GPU to the CPU, lowering the system bill of materials and saving power.
- For workloads that are more CPU intensive, HGX H100 4-GPU can pair with two CPU sockets to increase the CPU-to-GPU ratio for a more balanced system configuration.
An accelerated server platform for AI and HPC
NVIDIA is working closely with our ecosystem to bring the HGX H100 based server platform to the market later this year. We are looking forward to putting this powerful computing tool in your hands, enabling you to innovate and fulfill your life’s work at the fastest pace in human history.
cuTENSOR is now able to distribute tensor contractions across multiple GPUs. This has been released as a new library called cuTENSORMg (multi-GPU).
Tensor contractions are at the core of many important workloads in machine learning, computational chemistry, and quantum computing. As scientists and engineers pursue ever-growing problems, the underlying data gets larger in size and calculations take longer and longer.
When a tensor contraction does not fit into a single GPU anymore, or if it takes too long on a single GPU, the natural next step is to distribute the contraction across multiple GPUs. We have been extending cuTENSOR with this new capability, and are releasing it as a new library called cuTENSORMg (multi-GPU). It provides single-process multi-GPU functionality on block-cyclic distributed tensors.
The copy and contraction operations for cuTENSORMg are broadly structured around handles, tensor descriptors, and operation descriptors. In this post, we explain the handle and the tensor descriptor, describe how copy operations work, and demonstrate how to perform a tensor contraction. We then show how to measure the performance of the contraction operation for various workloads and GPU configurations.
Library handle
The library handle represents the set of devices that participate in the computation. The handle also contains data and resources that are reused across calls. You can create a library handle by passing the list of devices to the cutensorMgCreate function:
cutensorMgCreate(&handle, numDevices, devices);
All objects in cuTENSORMg are heap allocated. As such, they must be freed with a matching destroy call. For brevity, we do not show these calls in this post, but production code should destroy all objects that it creates to avoid leaks.
cutensorMgDestroy(handle);
All library calls return an error code of type cutensorStatus_t. In production, you should always check the error code to detect failures or usage issues early. For brevity, we omit these checks in this post, but they are included in the corresponding example code.
In addition to error codes, cuTENSORMg provides logging capabilities similar to those of cuTENSOR. The logs can be activated by setting the CUTENSORMG_LOG_LEVEL environment variable appropriately. For instance, CUTENSORMG_LOG_LEVEL=1 provides additional information about a returned error code.
Tensor descriptor
The tensor descriptor describes how a tensor is laid out in memory and how it is distributed across devices. For each mode, there are three core concepts that determine the layout:

- extent: the logical size of the mode.
- blockSize: subdivides the extent into equal-sized chunks, except for the final remainder block.
- deviceCount: determines how the blocks are distributed across devices.
Figure 1 shows how extent and blockSize subdivide a two-dimensional tensor.
Blocks are distributed in a cyclic fashion, which means that consecutive blocks are assigned to different devices. Figure 2 shows a two-by-two distribution of blocks to devices, with the assignment of devices to blocks encoded in another array, devices. The array is a dense column-major tensor with extents matching the device counts.
Finally, the exact on-device data layout is determined by the elementStride and blockStride values for each mode. They determine, respectively, the displacement in linear memory, in units of elements, between two adjacent elements and between two adjacent blocks for a given mode (Figure 3).
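As a conceptual illustration of the block-cyclic distribution described above (this is not the cuTENSORMg API, just a sketch of how the assignment rule reads): the block at coordinates blockCoord is owned by the device stored at position (blockCoord[i] mod deviceCount[i]) of the column-major devices array.

```python
# Conceptual sketch only: compute which device owns a block under a
# block-cyclic distribution with a column-major `devices` array.
def owning_device(block_coord, device_count, devices):
    idx, stride = 0, 1
    for coord, count in zip(block_coord, device_count):
        idx += (coord % count) * stride    # cyclic along this mode
        stride *= count                    # column-major ordering of `devices`
    return devices[idx]

# Example: a 2D tensor distributed over a 2x2 device grid.
devices = [0, 1, 2, 3]                         # hypothetical device IDs
print(owning_device((0, 0), (2, 2), devices))  # -> 0
print(owning_device((1, 0), (2, 2), devices))  # -> 1
print(owning_device((2, 1), (2, 2), devices))  # -> 2 (mode 0 wraps around)
```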
These attributes are all set using the cutensorMgCreateTensorDescriptor call:
cutensorMgCreateTensorDescriptor(handle, &desc, numModes, extent, elementStride, blockSize, blockStride, deviceCount, numDevices, devices, type);
It is possible to pass NULL for elementStride, blockSize, blockStride, and deviceCount. If elementStride is NULL, the data layout is assumed to be dense, using a generalized column-major layout. If blockSize is NULL, it is equal to extent. If blockStride is NULL, it is equal to blockSize * elementStride, which results in an interleaved block format. If deviceCount is NULL, all device counts are set to 1. In this case, the tensor is not distributed and resides entirely in the memory of devices[0].
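These defaults can be summarized with a small conceptual sketch (plain Python, not the library's code), resolving each NULL parameter the way the paragraph above describes:

```python
# Conceptual sketch of the per-mode layout defaults described above.
def resolve_layout(extent, element_stride=None, block_size=None,
                   block_stride=None, device_count=None):
    if element_stride is None:                # dense, generalized column-major
        element_stride, s = [], 1
        for e in extent:
            element_stride.append(s)
            s *= e
    if block_size is None:                    # whole mode is a single block
        block_size = list(extent)
    if block_stride is None:                  # interleaved block format
        block_stride = [b * es for b, es in zip(block_size, element_stride)]
    if device_count is None:                  # not distributed
        device_count = [1] * len(extent)
    return element_stride, block_size, block_stride, device_count

print(resolve_layout([4, 6]))
# -> ([1, 4], [4, 6], [4, 24], [1, 1])
```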
By passing CUTENSOR_MG_DEVICE_HOST as the owning device, you can specify that the tensor is located on the host in pinned, managed, or regularly allocated memory.
Copy operation
The copy operation enables data layout changes, including the redistribution of the tensor to different devices. Its parameters are a source and a destination tensor descriptor (descSrc and descDst), as well as a source and destination mode list (modesSrc and modesDst). The two tensors' extents at coinciding modes must match, but everything else about them may be different: one may be located on the host and the other across devices, and they may have different blockings and different strides.
Like all operations in cuTENSORMg, it proceeds in three steps:

- cutensorMgCopyDescriptor_t: encodes what operation should be performed.
- cutensorMgCopyPlan_t: encodes how the operation will be performed.
- cutensorMgCopy: performs the operation according to the plan.
The first step is to create the copy descriptor:
cutensorMgCreateCopyDescriptor(handle, &desc, descDst, modesDst, descSrc, modesSrc);
With the copy descriptor in hand, you can query the amount of device-side and host-side workspace that is required. The deviceWorkspaceSize array has as many elements as there are devices in the handle: the i-th element is the amount of workspace required for the i-th device in the handle.
cutensorMgCopyGetWorkspace(handle, desc, deviceWorkspaceSize, &hostWorkspaceSize);
With the workspace sizes determined, plan the copy. You can pass a larger workspace size and the call may take advantage of more workspace, or you can try to pass a smaller size. The planning may be able to accommodate that or it may yield an error.
cutensorMgCreateCopyPlan(handle, &plan, desc, deviceWorkspaceSize, hostWorkspaceSize);
Finally, with the planning complete, execute the copy operation.
cutensorMgCopy(handle, plan, ptrDst, ptrSrc, deviceWorkspace, hostWorkspace, streams);
In this call, ptrDst and ptrSrc are arrays of pointers. They contain one pointer for each of the devices in the corresponding tensor descriptor. In this instance, ptrDst[0] corresponds to the device that was passed as devices[0] to cutensorMgCreateTensorDescriptor.
On the other hand, deviceWorkspace and streams are also arrays where each entry corresponds to a device. They are ordered according to the order of devices in the library handle, such that deviceWorkspace[0] and streams[0] correspond to the device that was passed as devices[0] to cutensorMgCreate. The workspaces must be at least as large as the workspace sizes that were passed to cutensorMgCreateCopyPlan.
Contraction operation
At the core of the cuTENSORMg library is the contraction operation. It currently implements tensor contractions of tensors located on one or multiple devices, but may support tensors located on the host in the future. As a refresher, a contraction is an operation of the following form:

$D_{\text{modes}_D} = \alpha \, A_{\text{modes}_A} \, B_{\text{modes}_B} + \beta \, C_{\text{modes}_C}$

where A, B, C, and D are tensors, and modes_A, modes_B, modes_C, and modes_D are mode lists that may be arbitrarily permuted and interleaved with each other.
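As one illustrative special case (not taken from the original post), choosing modes_A = (m, k), modes_B = (k, n), and modes_C = modes_D = (m, n) reduces the general form to an ordinary matrix multiply, with the contracted mode k summed over:

$D_{m,n} = \alpha \sum_{k} A_{m,k} \, B_{k,n} + \beta \, C_{m,n}$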
Like the copy operation, it proceeds in three stages:

- cutensorMgCreateContractionDescriptor: encodes the problem.
- cutensorMgCreateContractionPlan: encodes the implementation.
- cutensorMgContraction: uses the plan and performs the actual contraction.
First, you create a contraction descriptor based on the tensor descriptors, the mode lists, and the desired compute type, that is, the lowest precision that may be used during the calculation.
cutensorMgCreateContractionDescriptor(handle, &desc, descA, modesA, descB, modesB, descC, modesC, descD, modesD, compute);
As the contraction operation has more degrees of freedom, you must also initialize a find object that gives you finer control over plan creation for a given problem descriptor. For now, this find object only has a default setting:
cutensorMgCreateContractionFind(handle, &find, CUTENSORMG_ALGO_DEFAULT);
Then, you can query the workspace requirement along the lines of what you did for the copy operation. Compared to that operation, you also pass in the find object and a workspace preference:
cutensorMgContractionGetWorkspace(handle, desc, find, CUTENSOR_WORKSPACE_RECOMMENDED, deviceWorkspaceSize, &hostWorkspaceSize);
Create a plan:
cutensorMgCreateContractionPlan(handle, &plan, desc, find, deviceWorkspaceSize, hostWorkspaceSize);
Finally, execute the contraction using the plan:
cutensorMgContraction(handle, plan, alpha, ptrA, ptrB, beta, ptrC, ptrD, deviceWorkspace, hostWorkspace, streams);
In this call, alpha and beta are host pointers of the same type as the tensor, unless the tensor is of half or BFloat16 precision, in which case they are single precision. The order of pointers in the arrays ptrA, ptrB, ptrC, and ptrD corresponds to their order in each descriptor's devices array. The order of pointers in the deviceWorkspace and streams arrays corresponds to the order in the library handle's devices array.
Performance
You can find all these calls together in the CUDA Library Samples GitHub repo. We extended the sample to take two parameters: the number of GPUs and a scaling factor. It implements an almost GEMM-shaped tensor contraction and is written in such a way that it scales up M and N, and the block size in those dimensions, while keeping K fixed, so the load stays approximately balanced. The plot underneath shows the scaling relationship measured on a DGX A100. Feel free to experiment with other contractions, block sizes, and scaling regimes.
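For intuition only, here is a hypothetical NumPy sketch of what an "almost GEMM-shaped" contraction looks like: the M and N dimensions are each split into two modes while a single mode k is contracted; the extents below are made up and are not the sample's actual configuration.

```python
import numpy as np

# Hypothetical "almost GEMM-shaped" contraction: m and n are split into two
# modes each (a,b and c,d); only the single mode k is contracted away.
M0, M1, N0, N1, K = 64, 32, 64, 32, 128          # made-up extents
A = np.random.rand(M0, M1, K)
B = np.random.rand(K, N0, N1)
D = np.einsum("abk,kcd->abcd", A, B)             # sum over k only
print(D.shape)                                   # (64, 32, 64, 32)
```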
Get started with cuTENSORMg
Interested in trying out cuTENSORMg to scale tensor contractions beyond a single GPU?
We continue to work on improving cuTENSORMg, including out-of-core functionality. If you have questions or new feature requests, contact product manager Matthew Nicely.
Get started with developing all four Jetson Orin modules for a new era of robotics.
The pace of development and deployment of AI-powered robots and other autonomous machines continues to grow rapidly. The next generation of applications requires large increases in AI compute performance to handle multimodal AI applications running concurrently and in real time.
Human-robot interactions are increasing in retail spaces, food delivery, hospitals, warehouses, factory floors, and other commercial applications. These autonomous robots must concurrently perform 3D perception, natural language understanding, path planning, obstacle avoidance, pose estimation, and many more actions that require both significant computational performance and highly accurate trained neural models for each application.
NVIDIA Jetson AGX Orin modules are the highest-performing and newest members of the NVIDIA Jetson family. These modules deliver tremendous performance with class-leading energy efficiency. They run the comprehensive NVIDIA AI software stack to power the next generation of demanding edge AI applications.
Jetson AGX Orin and Jetson Orin NX series
At GTC Spring 2022, we announced that four Jetson Orin modules will be available in Q4 2022. With up to 275 tera operations per second (TOPS) of performance, the Jetson Orin modules can run server class AI models at the edge with end-to-end application pipeline acceleration. Compared to Jetson Xavier modules, Jetson Orin brings even higher performance, power efficiency, and inference capabilities to modern AI applications.
| Previous generation | AI performance | Power | Price (1KU+) | Jetson Orin module | AI performance | Power | Price (1KU+) |
|---|---|---|---|---|---|---|---|
| Jetson AGX Xavier 64GB | 21 dense INT8 TOPS | 10W to 20W | $499 | Jetson AGX Orin 64GB | 275 sparse / 138 dense INT8 TOPS | 15W to 60W | $1,599 |
| Jetson Xavier NX 8GB | 32 dense INT8 TOPS | 10W to 30W | $899 | Jetson AGX Orin 32GB | 200 sparse / 100 dense INT8 TOPS | 15W to 40W | $899 |
| Jetson Xavier NX 16GB | 21 dense INT8 TOPS | 10W to 20W | $499 | Jetson Orin NX 16GB | 100 sparse / 50 dense INT8 TOPS | 10W to 25W | $599 |
| Jetson Xavier NX 8GB | 21 dense INT8 TOPS | 10W to 20W | $399 | Jetson Orin NX 16GB | 100 sparse / 50 dense INT8 TOPS | 10W to 25W | $599 |
The Jetson AGX Orin series includes the Jetson AGX Orin 64GB and the Jetson AGX Orin 32GB modules.
- Jetson AGX Orin 64GB delivers up to 275 TOPS with power configurable between 15W and 60W.
- Jetson AGX Orin 32GB delivers up to 200 TOPS with power configurable between 15W and 40W.
These modules have the same compact form factor and are pin compatible with Jetson AGX Xavier series modules, offering you an 8x performance upgrade, or up to 6x the performance at the same price.
Edge and embedded systems continue to be driven by the increasing number, performance, and bandwidth of sensors. The Jetson AGX Orin series brings not only additional compute for processing these sensors, but also additional I/O:
- Up to 22 lanes of PCIe Gen4
- Four 10Gb Ethernet
- Higher speed CSI lanes
- Double the storage with 64GB eMMC 5.1
- 1.5X the memory bandwidth
For more information, see the Jetson Orin product page and the Jetson AGX Orin Series Data Sheet.
USB 3.2, UFS, MGBE, and PCIe share UPHY Lanes. For the supported UPHY configurations, see the Design Guide.
The NVIDIA Orin NX series includes the Jetson Orin NX 16GB with up to 100 TOPS of AI performance and the Jetson Orin NX 8GB with up to 70 TOPS. With this series, we followed a design philosophy similar to that of Jetson Xavier NX: we took the NVIDIA Orin architecture and brought it to the smallest Jetson form factor, the 260-pin SODIMM, with lower power consumption.
You can bring this higher class of performance to your next-generation, small form factor products like drones and handheld devices. Jetson Orin NX 16GB comes with power configurable between 10W and 25W, and Jetson Orin NX 8GB comes with power configurable between 10W and 20W.
The Orin NX Series is form factor compatible with the Jetson Xavier NX series, and delivers up to 5x the performance, or up to 3X the performance at the same price. The Orin NX series also brings additional high speed I/O capabilities with up to seven PCIe lanes and three 10Gbps USB 3.2 interfaces. For storage, you can leverage the additional PCIe lanes to connect to external NVMe. For more information, see the Jetson Orin product page.
Jetson AGX Xavier was designed around the NVIDIA Xavier SoC, our first architecture developed from the ground up for autonomous machines. The NVIDIA Orin architecture takes this class of product to the next level. It continues to showcase multiple different on-chip processors, but brings greater capability, higher performance, and more power efficiency.
The Jetson Orin modules contain the following:
- An NVIDIA Ampere Architecture GPU with up to 2048 CUDA cores and up to 64 Tensor Cores
- Up to 12 Arm A78AE CPU cores
- Two next-generation deep learning accelerators (DLA)
- A computer vision accelerator
- Various other processors to offload the GPU and CPU:
- Video encoder
- Video decoder
- Video image compositor
- Image signal processor
- Sensor processing engine
- Audio processing engine
Like the other Jetson modules, Jetson Orin is built using a system-on-module (SOM) design. All the processing, memory, and power rails are contained on the module. All the high-speed I/O is available through a 699-pin connector (Jetson AGX Orin series) or a 260-pin SODIMM connector (Jetson Orin NX series). This SOM design makes it easy for you to integrate the modules into your system designs.
Jetson AGX Orin Developer Kit
At GTC 2022, NVIDIA also announced availability of the Jetson AGX Orin Developer Kit. The developer kit contains everything needed for you to get up and running quickly. It includes a Jetson AGX Orin module with the highest performance and runs the world’s most advanced deep learning software stack. This kit delivers the flexibility to create sophisticated AI solutions now and well into the future.
Compact size, high-speed interfaces, and lots of connectors make this developer kit perfect for prototyping advanced AI-powered robots and edge applications for manufacturing, logistics, retail, service, agriculture, smart cities, healthcare, life sciences, and more.
Jetson AGX Orin Developer Kit features:
- An NVIDIA Ampere Architecture GPU and 12-core Arm Cortex-A78AE 64-bit CPU, together with next-generation deep learning and vision accelerators
- High-speed I/O, 204.8 GB/s of memory bandwidth, and 32 GB of DRAM capable of feeding multiple concurrent AI application pipelines
- Powerful NVIDIA AI software stack with support for SDKs and software platforms, including the following:
- NVIDIA JetPack
- NVIDIA Riva
- NVIDIA DeepStream
- NVIDIA Isaac
- NVIDIA TAO
The Jetson AGX Orin Developer Kit runs the latest NVIDIA JetPack 5.0 software. NVIDIA JetPack 5.0 supports emulating the performance and clock frequencies of Jetson Orin NX and Jetson AGX Orin series modules with a Jetson AGX Orin Developer Kit. You can kickstart your development for any of those modules today.
Jetson AGX Orin Developer Kit is available for purchase through NVIDIA authorized distributors worldwide. Get started today by following the Getting Started guide.
| | Developer Kit | AGX Orin 64GB | AGX Orin 32GB |
|---|---|---|---|
| AI Performance | 275 INT8 sparse TOPS | 275 INT8 sparse TOPS | 200 INT8 sparse TOPS |
| GPU | 2048-core NVIDIA Ampere Architecture GPU with 64 Tensor Cores | 2048-core NVIDIA Ampere Architecture GPU with 64 Tensor Cores | 1792-core NVIDIA Ampere Architecture GPU with 56 Tensor Cores |
| CPU | 12-core Arm Cortex-A78AE v8.2 64-bit CPU, 3MB L2 + 6MB L3 | 12-core Arm Cortex-A78AE v8.2 64-bit CPU, 3MB L2 + 6MB L3 | 8-core Arm Cortex-A78AE v8.2 64-bit CPU, 2MB L2 + 4MB L3 |
| Power | 15W-60W | 15W-60W | 15W-40W |
| Memory | 32 GB | 64 GB | 32 GB |
| MSRP | $1,999 | $1,599 | $899 |
Table 2. Summary comparison of Jetson AGX Orin series modules and Developer Kit
Best-in-class performance
Jetson Orin provides a giant leap forward for your next generation applications. Using the Jetson AGX Orin Developer Kit, we have taken the geometric mean of measured performance for our highly accurate, production-ready, pretrained models for computer vision and conversational AI. Testing included the following benchmarks:
- NVIDIA PeopleNet for people detection
- NVIDIA ActionRecognitionNet 2D and 3D models
- NVIDIA LPRNet for license plate recognition
- NVIDIA DashcamNet for object detection
- NVIDIA BodyPoseNet for multiperson human pose estimation
- Citrinet-1024 for speech recognition
- BERT-base for natural language processing
- FastPitchHifiGanE2E for text to speech
With the NVIDIA JetPack 5.0 Developer Preview, Jetson AGX Orin shows a 3.3X performance increase compared to Jetson AGX Xavier. With future software improvements, we expect this to approach a 5X performance increase. Jetson AGX Xavier performance has increased 1.5X since NVIDIA JetPack 4.1.1 Developer Preview, the first software release to support it.
The benchmarks have been run on our Jetson AGX Orin Developer Kit. PeopleNet and DashcamNet provide examples of dense models that can be run concurrently on the GPU and the two DLAs. The DLA can be used to offload some AI applications from the GPU and this concurrent capability enables them to operate in parallel.
PeopleNet, LPRNet, DashcamNet, and BodyPoseNet provide examples of dense INT8 benchmarks run on Jetson. ActionRecognitionNet 2D and 3D and the conversational AI benchmarks provide examples of dense FP16 performance. All these models can be found on NVIDIA NGC.
Moreover, Jetson Orin continues to raise the bar for AI at the edge, adding to the NVIDIA overall top rankings in the latest MLPerf industry inference benchmarks. Jetson AGX Orin provides up to a 5X performance increase on these MLPerf benchmarks compared to previous results on Jetson AGX Xavier, while delivering an average of 2x better energy efficiency.
Accelerate time-to-market with Jetson software
The class-leading performance and energy efficiency of Jetson Orin is backed by the same powerful NVIDIA AI software that is deployed in GPU-accelerated data centers, hyperscale servers, and powerful AI workstations.
NVIDIA JetPack is the foundational SDK for the Jetson platform. NVIDIA JetPack provides a full development environment for hardware-accelerated AI-at-the-edge development. Jetson Orin is supported by NVIDIA JetPack 5.0, which includes the following:
- LTS Kernel 5.10
- A root file system based on Ubuntu 20.04
- A UEFI-based bootloader
- The latest compute stack with CUDA 11.4, TensorRT 8.4, and cuDNN 8.3
NVIDIA JetPack 5.0 also supports Jetson Xavier modules.
For you to develop fully accelerated applications quickly on the Jetson platform, NVIDIA provides application frameworks for various use cases:
- With DeepStream, rapidly develop and deploy vision AI applications and services. DeepStream offers hardware acceleration beyond inference, as it offers hardware accelerated plug-ins for end-to-end AI pipeline acceleration.
- NVIDIA Isaac provides hardware-accelerated ROS packages that make it easier for ROS developers to build high-performance robotics solutions.
- NVIDIA Isaac Sim, powered by Omniverse, is a tool that powers photo-realistic, physically accurate virtual environments to develop, test, and manage AI-based robots.
- NVIDIA Riva provides state-of-the-art pretrained models for automatic speech recognition (ASR) and text-to-speech (TTS) that can be easily customized. These models enable you to quickly develop GPU-accelerated conversational AI applications.
To reduce the time needed to develop production-ready, highly accurate AI models, NVIDIA provides various tools to generate training data, train and optimize models, and quickly create ready-to-deploy AI models.
The NVIDIA Omniverse Replicator for synthetic data generation helps create high-quality datasets to boost model training. With Omniverse Replicator, you can create large and diverse synthetic datasets that are not only hard but sometimes impossible to create in the real world. Using synthetic data along with real data for training, you can significantly improve model accuracy.
NVIDIA pretrained models from NGC start you off with highly accurate and optimized models and model architectures for various use cases. Pretrained models are production-ready. You can further customize these models by training with your own real or synthetic data, using the NVIDIA TAO (Train-Adapt-Optimize) workflow to quickly build an accurate, ready-to-deploy model.
Watch these NVIDIA technologies coming together on Jetson AGX Orin for a robotic use case.
Learn about everything in the Jetson AGX Orin Developer Kit in this getting started video:
For more information about all the NVIDIA technologies that we bring in NVIDIA Jetson Orin modules, watch a webinar on Jetson software.
Usher in the new era of autonomous machines and robotics
Get started with developing all four Jetson Orin modules by placing an order for the Jetson AGX Orin Developer Kit, and downloading the NVIDIA JetPack 5.0 SDK. Additional documentation for Jetson AGX Orin can be found at the download center. For information and support, visit the NVIDIA Embedded Developer page and forums for help from community experts.
A team of scientists has created a new AI-based tool to help lock up greenhouse gases like CO2 in porous rock formations faster and more precisely than ever before. Carbon capture technology, also referred to as carbon sequestration, is a climate change mitigation method that redirects CO2 emitted from power plants back underground.
The post Rock On: Scientists Use AI to Improve Sequestering Carbon Underground appeared first on NVIDIA Blog.
This post describes different ways to compile an application using various development environments for the BlueField DPU.
Step-A
Step-B
Go get a cup of coffee…
Step-C
How often have you seen “Go get a coffee” in the instructions? As a developer, I found early on that this pesky quip is the bane of my life. Context switches, no matter the duration, are a high cost to pay in the application development cycle. Of all the steps that require you to step away, waiting for an application to compile is the hardest to shake off.
As we all enter the new world of NVIDIA Bluefield DPU application development, it is important to set up the build-step efficiently, to allow you to {code => compile => unit-test}
seamlessly. In this post, I go over different ways to compile an application for the DPU.
Free range routing with the DOCA dataplane plugin
In the DPU application development series, I talked about creating a DOCA dataplane plugin in FRR for offloading policies. FRR’s code count is close to a million lines (789,678 SLOC), which makes it a great candidate for measuring build times.
Developing directly on the Bluefield DPU
The DPU has an Arm64 architecture, and one quick way to get started on DPU applications is to develop directly on the DPU. This test used an NVIDIA BlueField-2 with 8 GB of RAM and eight Cortex-A72 CPUs.
I installed the BlueField boot file (BFB), which provides the Ubuntu 20.04.3 OS image for the DPU. It also includes the libraries for DOCA-1.2 and DPDK-20.11.3. To build an application with the DOCA libraries, I add the DPDK pkgconfig location to the PKG_CONFIG_PATH.
root@dpu-arm:~# export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/opt/mellanox/dpdk/lib/aarch64-linux-gnu/pkgconfig
Next, I set up my code workspace on the DPU by cloning FRR and switching to the DOCA dataplane plugin branch.
root@dpu-arm:~/code# git clone https://github.com/AnuradhaKaruppiah/frr.git
root@dpu-arm:~/code# cd frr
root@dpu-arm:~/code/frr# git checkout dp-doca
FRR requires a list of constantly evolving prerequisites that are enumerated in the FRR community docs. With those dependencies installed, I configured FRR to include the DPDK and DOCA dataplane plugins.
root@dpu-arm:~/code/frr# ./bootstrap.sh
root@dpu-arm:~/code/frr# ./configure --build=aarch64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/aarch64-linux-gnu --libexecdir=${prefix}/lib/aarch64-linux-gnu --disable-maintainer-mode --disable-dependency-tracking --enable-exampledir=/usr/share/doc/frr/examples/ --localstatedir=/var/run/frr --sbindir=/usr/lib/frr --sysconfdir=/etc/frr --with-vtysh-pager=/usr/bin/pager --libdir=/usr/lib/aarch64-linux-gnu/frr --with-moduledir=/usr/lib/aarch64-linux-gnu/frr/modules "LIBTOOLFLAGS=-rpath /usr/lib/aarch64-linux-gnu/frr" --disable-dependency-tracking --disable-dev-build --enable-systemd=yes --enable-rpki --with-libpam --enable-doc --enable-doc-html --enable-snmp --enable-fpm --disable-zeromq --enable-ospfapi --disable-bgp-vnc --enable-multipath=128 --enable-user=root --enable-group=root --enable-vty-group=root --enable-configfile-mask=0640 --enable-logfile-mask=0640 --disable-address-sanitizer --enable-cumulus=yes --enable-datacenter=yes --enable-bfdd=no --enable-sharpd=yes --enable-dp-doca=yes --enable-dp-dpdk=yes
As I used the DPU as my development environment, I built and installed the FRR binaries in place:
root@dpu-arm:~/code# make -j12 all; make install
Here’s how the build times fared. I measured them in multiple ways:
- Time to build and install the binaries using make -j12 all and make install
- Time to build the same binaries but also assemble them into a Debian package using dpkg-buildpackage -j12 -uc -us
The first method is used for coding and unit testing. The second method of generating debs is needed to compare with build times on other external development environments.
| DPU-Arm build times | Real | User | Sys |
|---|---|---|---|
| DPU Arm (complete make) | 2min 40.529sec | 16min 29.855sec | 2min 1.534sec |
| DPU Arm (Debian package) | 5min 23.067sec | 20min 33.614sec | 2min 49.628sec |
The difference in times is expected. Generating a package involves several additional steps.
There are some clear advantages to using the DPU as your development environment.
- You can code, build and install, and then unit-test without leaving your workspace.
- You can optimize the build for incremental code changes.
The last option is usually a massive reduction in build time compared to a complete build. For example, I modified the DOCA dataplane code in FRR and rebuilt with these results:
root@dpu-arm:~/code/frr# time make -j12
>>>>>>>>>>>>> snipped make output >>>>>>>>>>>>
real 0m3.119s
user 0m2.794s
sys 0m0.479s
While that may make things easier, it requires reserving a DPU indefinitely for every developer for the sole purpose of application development or maintenance. Your development environment may also require more memory and horsepower, making this a less viable option long-term.
Developing on an x86 server
My Bluefield2 DPU was hosted by an x86-64 Ubuntu 20.04 server, and I used this server for my development environment.
root@server1-x86:~# lscpu | grep -E "CPU\(s\):|Model name"
CPU(s): 32
Model name: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
root@server1-x86:~# grep MemTotal /proc/meminfo
MemTotal: 131906300 kB
In this case, the build-machine is x86 and the host-machine where the app is going to run is DPU-Arm64. There are several ways to do this:
- Use an Arm emulation on the x86 build-machine. A DOCA development container is available as a part of the DOCA packages.
- Use a cross-compilation toolchain.
In this test, I used the first option as it was the easiest. The second option can give you different build performance, but creating that toolchain has its own challenges.
I downloaded and loaded the bfb_builder_doca_ubuntu_20.04 container on my x86 server and fired it up.
root@server1-x86:~# sudo docker load -i bfb_builder_doca_ubuntu_20.04-mlnx-5.4.tar
root@server1-x86:~# docker run -v ~/code:/code --privileged -it -e container=docker doca_v1.11_bluefield_os_ubuntu_20.04-mlnx-5.4:latest
The DOCA and DPDK libraries come preinstalled in this container, and I just had to add them to the PKG_CONFIG_PATH.
root@86b87b0ab0c2:/code # export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/opt/mellanox/dpdk/lib/aarch64-linux-gnu/pkgconfig
I set up the workspace and FRR prerequisites within the container, same as with the previous option.
root@86b87b0ab0c2:/code # git clone https://github.com/AnuradhaKaruppiah/frr.git
root@86b87b0ab0c2:/code # cd frr
root@86b87b0ab0c2:/code/frr # git checkout dp-doca
I could build my application within this DOCA container, but I couldn’t test it in place. So, the FRR binaries had to be built and packaged into debs, which are then copied over to the Bluefield DPU for testing. I set up the FRR Debian rules to match the FRR build configuration used in the previous option and generated the package:
root@86b87b0ab0c2:/code/frr # dpkg-buildpackage -j12 -uc -us
Table 2 shows how the build time compares with previous methods.
| DPU-Arm & X86 build times | Real | User | Sys |
|---|---|---|---|
| DPU Arm (complete make) | 2min 40.529sec | 16min 29.855sec | 2min 1.534sec |
| DPU Arm (Debian package) | 5min 23.067sec | 20min 33.614sec | 2min 49.628sec |
| X86 + DOCA dev container (Debian package) | 24min 19.051sec | 139min 39.286sec | 3min 58.081sec |
The giant jump in build time surprised me because I have an amply stocked x86 server and no Docker limits. So, it seems throwing CPUs and RAM at a problem doesn’t always help! This performance degradation comes from the cross-architecture (Arm-on-x86) emulation, as you can see with the next option.
Developing in an AWS Graviton instance
Next, I tried building my app natively on Arm but this time on an external server with more horsepower. I used an Amazon EC2 Graviton instance for this purpose with specs comparable to my x86 server.
- Arm64 arch, Ubuntu 20.04 OS
- 128G RAM
- 32 vCPUs
root@ip-172-31-28-243:~# lscpu | grep -E "CPU\(s\):|Model name"
CPU(s): 32
Model name: Neoverse-N1
root@ip-172-31-28-243:~# grep MemTotal /proc/meminfo
MemTotal: 129051172 kB
To set up the DOCA and DPDK libraries in this instance, I installed the DOCA SDK repo meta package.
root@ip-172-31-28-243:~# dpkg -i doca-repo-aarch64-ubuntu2004-local_1.1.1-1.5.4.2.4.1.3.bf.3.7.1.11866_arm64.deb
root@ip-172-31-28-243:~# apt update
root@ip-172-31-28-243:~# apt install doca-sdk
The remaining steps for cloning and building the FRR Debian package are the same as the previous option.
Table 3 shows how the build fared on the AWS Arm instance.
| DPU-Arm, X86 & AWS-Arm build times | Real | User | Sys |
|---|---|---|---|
| DPU Arm (complete make) | 2min 40.529sec | 16min 29.855sec | 2min 1.534sec |
| DPU Arm (Debian package) | 5min 23.067sec | 20min 33.614sec | 2min 49.628sec |
| X86 + DOCA dev container (generate Debian package) | 24min 19.051sec | 139min 39.286sec | 3min 58.081sec |
| AWS-Arm (generate Debian package) | 1min 30.480sec | 6min 6.056sec | 0min 35.921sec |
This is a clear winner, no coffee needed.
Figure 1 shows the compile times in these environments.
Summary
In this post, I discussed several development environments for DPU applications:
- Bluefield DPU
- DOCA dev container on an x86 server
- AWS Graviton compute instance
You can prototype your app directly on the DPU, experiment with developing in the x86 DOCA development container, and grab an AWS Graviton instance with DOCA to punch it into hyperspeed!
For more information, see the following resources: