Categories
Misc

Splitting strings during graph execution – Tensorflow

submitted by /u/-is-it-tho

Categories
Misc

Trying to start TensorFlow from an intermediate layer with the output of previous layer that I have saved?

I am new to machine learning.

I got the intermediate result of layer 31 of my CNN using the following code:

conv2d = Model(inputs=self.model_ori.input, outputs=self.model_ori.layers[31].output)
intermediateResult = conv2d.predict(img)

Let's say I have this output saved, but 10 days later I want to take this output, feed it back into the next layer (the 32nd), and get the final result.

Is that possible?

My model.summary():

https://pastebin.com/8y0mYahB
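One way to do this (a minimal sketch, assuming the layers after layer 31 form a simple chain with no branching) is to build a second model that starts at a new Input matching the shape of the saved activation and reuses the remaining trained layers:

from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

# Hypothetical sketch: rebuild the "tail" of the network from layer 32 onward.
tail_input = Input(shape=intermediateResult.shape[1:])
x = tail_input
for layer in self.model_ori.layers[32:]:
    x = layer(x)  # reuse each remaining layer with its trained weights
tail_model = Model(inputs=tail_input, outputs=x)

final_result = tail_model.predict(intermediateResult)  # feed the saved activation back in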

submitted by /u/lulzintosh123

Categories
Misc

MNIST and image segmentation

I’m very new to TensorFlow and machine learning in general and wanted to make something that takes an image of an article of clothing, classifies it, and removes the background around it. I found a dataset called Fashion-MNIST that looks like it might help me with classifying images.

Would it be possible to also use this dataset to get just the pixels of the clothing and remove any background around it? How would I go about doing it and are there any examples that would be helpful for me?
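For illustration only, here is a minimal sketch (not a full segmentation solution): Fashion-MNIST images are 28x28 grayscale with black backgrounds, so loading the dataset and thresholding pixel intensities already yields a crude clothing mask, while segmenting real photos would need a proper segmentation dataset and model.

import numpy as np
import tensorflow as tf

# Minimal sketch: load Fashion-MNIST and build a naive foreground mask.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

image = x_train[0]                        # one 28x28 uint8 image
mask = image > 20                         # hypothetical intensity threshold
clothing_only = np.where(mask, image, 0)  # zero out pixels outside the mask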

submitted by /u/razz-daddy

Categories
Misc

Can’t understand entropy in tensorflow-probability.

Hi,

I am new to using tensorflow-probability. I am using a Categorical distribution to sample a value and then get its probability and entropy, but every time I sample from the distribution, I get the same entropy. This is for a policy gradient algorithm: the NN outputs logits, which are fed to a Categorical distribution, and then an action is sampled. Please let me know what I am missing here.
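For reference, a minimal sketch of the pipeline described above (the logits are made-up placeholders). Note that entropy() is a property of the distribution as a whole, so it only changes when the logits change, not from sample to sample.

import tensorflow as tf
import tensorflow_probability as tfp

logits = tf.constant([1.0, 0.5, -0.5, 2.0])       # e.g. output of the policy network
dist = tfp.distributions.Categorical(logits=logits)

action = dist.sample()            # sampled action, varies from call to call
log_prob = dist.log_prob(action)  # log-probability of that particular action
entropy = dist.entropy()          # depends only on the logits, so it is identical
                                  # for every sample drawn from the same distribution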

submitted by /u/Better-Ad8608

Categories
Misc

Error when using LSTM

Hello everybody, I’ve run into an error using tensorflow-gpu version 2.7.0 and I am looking for help. Whenever I try to use tensorflow.keras.layers.LSTM, the kernel of my Jupyter Notebook dies when trying to run model.fit(). I can compile the model and the cell gets executed without an error. I’ve gotten an error message only once and it said:

NotImplementedError: Cannot convert a symbolic Tensor (lstm/strided_slice:0) to a NumPy array. This error may indicate that you’re trying to pass a Tensor to a NumPy call, which is not supported

I’ve only ever gotten this error message once and never again since. I looked it up online and people said it was a compatibility issue with numpy > 1.19.5 so I downgraded numpy but my kernel still dies when trying to do model.fit(). I then tried to pass my training data as a tf tensor by converting the numpy array using tf.convert_to_tensor(). But that didn’t help either. Everything else using tf seems to work, it’s just LSTM giving me issues.
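For reference, this is roughly the conversion pattern described above, as a minimal sketch with made-up shapes rather than the actual model:

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

# Made-up training data: (samples, timesteps, features)
x_np = np.random.rand(100, 10, 8).astype("float32")
y_np = np.random.rand(100, 1).astype("float32")

# Convert the NumPy arrays to tensors before calling fit()
x = tf.convert_to_tensor(x_np)
y = tf.convert_to_tensor(y_np)

model = Sequential([LSTM(32, input_shape=(10, 8)), Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=1, batch_size=16)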

Has anyone an idea how I could fix the issue? Thank you.

Versions in use: tensorflow-gpu 2.7.0, NumPy 1.20.3/1.19.5, CUDA 11.3.1, and cuDNN 8.1.0.77; GPU: RTX 3090

submitted by /u/0stkreutz

Categories
Misc

Introducing NVIDIA HGX H100: An Accelerated Server Platform for AI and High-Performance Computing

Introducing the NVIDIA HGX H100, a key GPU server building block powered by the Hopper architecture.

The NVIDIA mission is to accelerate the work of the Da Vincis and Einsteins of our time and empower them to solve the grand challenges of society. With the complexity of artificial intelligence (AI), high-performance computing (HPC), and data analytics increasing exponentially, scientists need an advanced computing platform that is able to drive million-X speedups in a single decade to solve these extraordinary challenges.

To answer this need, we introduce the NVIDIA HGX H100, a key GPU server building block powered by the NVIDIA Hopper Architecture. This state-of-the-art platform securely delivers high performance with low latency, and integrates a full stack of capabilities from networking to compute at data center scale, the new unit of computing.

In this post, I discuss how the NVIDIA HGX H100 is helping deliver the next massive leap in our accelerated compute data center platform.

HGX H100 8-GPU

The HGX H100 8-GPU represents the key building block of the new Hopper generation GPU server. It hosts eight H100 Tensor Core GPUs and four third-generation NVSwitches. Each H100 GPU has multiple fourth-generation NVLink ports and connects to all four NVSwitches. Each NVSwitch is a fully non-blocking switch that fully connects all eight H100 Tensor Core GPUs.

Figure 1. High-level block diagram of HGX H100 8-GPU

This fully connected topology from NVSwitch enables any H100 to talk to any other H100 concurrently. Notably, this communication runs at the NVLink bidirectional speed of 900 gigabytes per second (GB/s), which is more than 14x the bandwidth of the current PCIe Gen4 x16 bus.
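As a rough sanity check of that ratio (assuming the commonly cited figure of roughly 64 GB/s of bidirectional bandwidth for a PCIe Gen4 x16 link):

\frac{900 \text{ GB/s}}{\approx 64 \text{ GB/s}} \approx 14\times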

The third-generation NVSwitch also provides new hardware acceleration for collective operations with multicast and NVIDIA SHARP in-network reductions. Combined with the faster NVLink speed, the effective bandwidth for common AI collective operations such as all-reduce goes up by 3x compared to the HGX A100. The NVSwitch acceleration of collectives also significantly reduces the load on the GPU.

                        HGX A100 8-GPU      HGX H100 8-GPU      Improvement Ratio
FP8                     N/A                 32,000 TFLOPS       6X (vs A100 FP16)
FP16                    4,992 TFLOPS        16,000 TFLOPS       3X
FP64                    156 TFLOPS          480 TFLOPS          3X
In-Network Compute      0                   3.6 TFLOPS          Infinite
Interface to host CPU   8x PCIe Gen4 x16    8x PCIe Gen5 x16    2X
Bisection Bandwidth     2.4 TB/s            3.6 TB/s            1.5X
Table 1. Comparing HGX A100 8-GPU with the new HGX H100 8-GPU

*Note: FP performance includes sparsity

HGX H100 8-GPU with NVLink-Network support

The emerging class of exascale HPC and trillion parameter AI models for tasks like accurate conversational AI require months to train, even on supercomputers. Compressing this to the speed of business and completing training within hours requires high-speed, seamless communication between every GPU in a server cluster.

To tackle these large use cases, the new NVLink and NVSwitch are designed to enable HGX H100 8-GPU to scale up and support a much larger NVLink domain with the new NVLink-Network. Another version of HGX H100 8-GPU features this new NVLink-Network support.

Figure 2. High-level block diagram of HGX H100 8-GPU with NVLink-Network support

System nodes built with HGX H100 8-GPU with NVLink-Network support can fully connect to other systems through the Octal Small Form Factor Pluggable (OSFP) LinkX cables and the new external NVLink Switch. This connection enables an NVLink domain of up to 256 GPUs. Figure 3 shows the cluster topology.

Figure 3. 256 H100 GPU Pod
                        256 A100 GPU Pod    256 H100 GPU Pod    Improvement Ratio
NVLink Domain           8 GPU               256 GPU             32X
FP8                     N/A                 1,024 PFLOPS        6X (vs A100 FP16)
FP16                    160 PFLOPS          512 PFLOPS          3X
FP64                    5 PFLOPS            15 PFLOPS           3X
In-Network Compute      0                   192 TFLOPS          Infinite
Bisection Bandwidth     6.4 TB/s            70 TB/s             11X
Table 2. Comparing 256 A100 GPU Pod vs. 256 H100 GPU Pod

*Note: FP performance includes sparsity

Target use cases and performance benefit

With the dramatic increase in HGX H100 compute and networking capabilities, the performance of AI and HPC applications is vastly improved.

Today’s mainstream AI and HPC models, such as BERT-Large and Mask R-CNN, can fully reside in the aggregate GPU memory of a single node. For these workloads, the HGX H100 is the most performance-efficient training solution.

More advanced and larger AI and HPC models require the aggregate GPU memory of multiple nodes to fit. Examples include a deep learning recommendation model (DLRM) with terabytes of embedding tables and a large mixture-of-experts (MoE) natural language processing model. For this class of workload, the HGX H100 with NVLink-Network accelerates the key communication bottleneck and is the best solution.

Figure 4 from the NVIDIA H100 GPU Architecture whitepaper shows the extra performance boost enabled by the NVLink-Network.

HPC, AI Inference, and AI Training diagrams all show the extra performance boost enabled by the NVLink-Network.
Figure 4. Application performance gain comparing different system configurations

All performance numbers are preliminary based on current expectations and subject to change in shipping products. A100 cluster: HDR IB network. H100 cluster: NDR IB network with NVLink-Network where indicated.

# GPUs: Climate Modeling 1K, LQCD 1K, Genomics 8, 3D-FFT 256, MT-NLG 32 (batch sizes: 4 for A100, 60 for H100 at 1 sec, 8 for A100 and 64 for H100 at 1.5 and 2sec), MRCNN 8 (batch 32), GPT-3 16B 512 (batch 256), DLRM 128 (batch 64K), GPT-3 16K (batch 512), MoE 8K (batch 512, one expert per GPU)​

HGX H100 4-GPU

In addition to the 8-GPU version, the HGX family also features a 4-GPU version, directly connected with fourth-generation NVLink.

Figure 5. High-level block diagram of HGX H100 4-GPU

The H100-to-H100 point-to-point peer NVLink bandwidth is 300 GB/s bidirectional, which is about 5X faster than today’s PCIe Gen4 x16 bus.

The HGX H100 4-GPU form factor is optimized for dense HPC deployment:

  • Multiple HGX H100 4-GPUs can be packed in a 1U high liquid cooling system to maximize GPU density per rack.
  • The fully PCIe switch-less architecture connects HGX H100 4-GPU directly to the CPU, lowering the system bill of materials and saving power.
  • For workloads that are more CPU intensive, HGX H100 4-GPU can pair with two CPU sockets to increase the CPU-to-GPU ratio for a more balanced system configuration.

An accelerated server platform for AI and HPC

NVIDIA is working closely with our ecosystem to bring the HGX H100 based server platform to the market later this year. We are looking forward to putting this powerful computing tool in your hands, enabling you to innovate and fulfill your life’s work at the fastest pace in human history.

Categories
Misc

Extending Block-Cyclic Tensors for Multi-GPU with NVIDIA cuTENSORMg

cuTENSOR is now able to distribute tensor contractions across multiple GPUs. This capability has been released as a new library called cuTENSORMg (multi-GPU).

Tensor contractions are at the core of many important workloads in machine learning, computational chemistry, and quantum computing. As scientists and engineers pursue ever-growing problems, the underlying data gets larger in size and calculations take longer and longer.

When a tensor contraction does not fit into a single GPU anymore, or if it takes too long on a single GPU, the natural next step is to distribute the contraction across multiple GPUs. We have been extending cuTENSOR with this new capability, and are releasing it as a new library called cuTENSORMg (multi-GPU). It provides single-process multi-GPU functionality on block-cyclic distributed tensors. 

The copy and contraction operations for cuTENSORMg are broadly structured around handles, tensor descriptors, and operation descriptors. In this post, we explain the handle and the tensor descriptor, show how copy operations work, and demonstrate how to perform a tensor contraction. We then show how to measure the performance of the contraction operation for various workloads and GPU configurations.

Library handle

The library handle represents the set of devices that participate in the computation. The handle also contains data and resources that are reused across calls. You can create a library handle by passing the list of devices to the cutensorMgCreate function:

cutensorMgCreate(&handle, numDevices, devices);

All objects in cuTENSORMg are heap allocated. As such, they must be freed with a matching destroy call. For brevity, we do not show these in this post, but production code should destroy all objects that it creates to avoid leaks.

cutensorMgDestroy(handle);

All library calls return an error code of type cutensorStatus_t. In production, you should always check the error code to detect failures or usage issues early. For brevity, we omit these checks in this post, but they are included in the corresponding example code.

In addition to error codes, cuTENSORMg also provides similar logging capabilities as cuTENSOR. Those logs can be activated by setting the CUTENSORMG_LOG_LEVEL environment variable appropriately. For instance, CUTENSORMG_LOG_LEVEL=1 would provide you with additional information about a returned error code.

Tensor descriptor

The tensor descriptor describes how a tensor is laid out in memory and how it is distributed across devices. For each mode, there are three core concepts to determine the layout:

  • extent: Logical size of each mode.
  • blockSize: Subdivides the extent into equal-sized chunks, except for the final remainder block.
  • deviceCount: Determines how the blocks are distributed across devices.

Figure 1 shows how extent and block size subdivide a two-dimensional tensor.

Figure 1. Tensor data layout with extents and blocks. Green is the two-dimensional tensor, and blue shows the blocks that the block size induces.
Figure 2. Distribution of a blocked tensor across devices in a block-cyclic fashion; different colors represent different devices.

Blocks are distributed in a cyclic fashion, which means that consecutive blocks are assigned to different devices. Figure 2 shows a two-by-two distribution of blocks to devices, with the assignment of devices to blocks being encoded with another array, devices. The array is a dense column-major tensor with the same extents as the device counts.

Figure 3. On device data layout using element strides and block strides.

Finally, the exact on-device data layout is determined by the elementStride and the blockStride values for each mode. Respectively, they determine the displacement, in linear memory in units of elements, of two adjacent elements and adjacent blocks for a given mode (Figure 3).

These attributes are all set using the cutensorMgCreateTensorDescriptor call:

cutensorMgCreateTensorDescriptor(handle, &desc, numModes, extent, elementStride, blockSize, blockStride, deviceCount, numDevices, devices, type);

It is possible to pass NULL to the elementStride, blockSize, blockStride, and deviceCount.

If the elementStride is NULL, the data layout is assumed to be dense using a generalized column-major layout. If blockSize is NULL, it is equal to extent. If blockStride is NULL, it is equal to blockSize * elementStride, which results in an interleaved block format. If deviceCount is NULL, all device counts are set to 1. In this case, the tensor is not distributed and resides entirely in the memory of devices[0].

By passing CUTENSOR_MG_DEVICE_HOST as the owning device, you can specify that the tensor is located on the host in pinned, managed, or regularly allocated memory.

Copy operation

The copy operation enables data layout changes including the redistribution of the tensor to different devices. Its parameters are a source and a destination tensor descriptor (descSrc and descDst), as well as a source and destination mode list (modesSrc and modesDst). The two tensors’ extents at coinciding modes must match, but everything else about them may be different. One may be located on the host, the other across devices, and they may have different blockings and different strides.

Like all operations in cuTENSORMg, it proceeds in three steps:

  • cutensorMgCopyDescriptor_t: Encodes what operation should be performed.
  • cutensorMgCopyPlan_t: Encodes how the operation will be performed.
  • cutensorMgCopy: Performs the operation according to the plan.

The first step is to create the copy descriptor:

cutensorMgCreateCopyDescriptor(handle, &desc, descDst, modesDst, descSrc, modesSrc);

With the copy descriptor in hand, you can query the amount of device-side and host-side workspace that is required. The deviceWorkspaceSize array has as many elements as there are devices in the handle. The i-th element is the amount of workspace required for the i-th device in the handle.

cutensorMgCopyGetWorkspace(handle, desc, deviceWorkspaceSize, &hostWorkspaceSize);

With the workspace sizes determined, plan the copy. You can pass a larger workspace size and the call may take advantage of more workspace, or you can try to pass a smaller size. The planning may be able to accommodate that or it may yield an error.

cutensorMgCreateCopyPlan(handle, &plan, desc, deviceWorkspaceSize, hostWorkspaceSize);

Finally, with the planning complete, execute the copy operation.

cutensorMgCopy(handle, plan, ptrDst, ptrSrc, deviceWorkspace, hostWorkspace, streams);

In this call, ptrDst and ptrSrc are arrays of pointers. They contain one pointer for each of the devices in the corresponding tensor descriptor. In this instance, ptrDst[0] corresponds to the device that was passed as devices[0] to cutensorMgCreateTensorDescriptor.

On the other hand, deviceWorkspace and streams are also arrays where each entry corresponds to a device. They are ordered according to the order of devices in the library handle, such that deviceWorkspace[0] and streams[0] correspond to the device that was passed as devices[0] to cutensorMgCreate. The workspaces must be at least as large as the workspace sizes that were passed to cutensorMgCreateCopyPlan.

Contraction operation

At the core of the cuTENSORMg library is the contraction operation. It currently implements tensor contractions of tensors located on one or multiple devices, but may support tensors located on the host in the future. As a refresher, a contraction is an operation of the following form:

D_{M,N,L} \leftarrow \alpha \sum_{K} A_{K,M,L} \cdot B_{K,N,L} + \beta \, C_{M,N,L}

Where A, B, C, and D are tensors, and M, N, L, and K are mode lists that may be arbitrarily permuted and interleaved with each other.

Like the copy operation, it proceeds in three stages:

  • cutensorMgCreateContractionDescriptor: Encodes the problem.
  • cutensorMgCreateContractionPlan: Encodes the implementation.
  • cutensorMgContraction: Uses the plan and performs the actual contraction.

First, you create a contraction descriptor based on the tensor descriptors, mode lists, and the desired compute type, such as the lowest precision data that may be used during the calculation. 

cutensorMgCreateContractionDescriptor(handle, &desc, descA, modesA, descB, modesB, descC, modesC, descD, modesD, compute);

As the contraction operation has more degrees of freedom, you must also initialize a find object that gives you finer control over the plan creation for a given problem descriptor. For now, this find object only has a default setting:

cutensorMgCreateContractionFind(handle, &find, CUTENSORMG_ALGO_DEFAULT);

Then, you can query the workspace requirement along the lines of what you did for the copy operation. Compared to that operation, you also pass in the find and a workspace preference:

cutensorMgContractionGetWorkspace(handle, desc, find, CUTENSOR_WORKSPACE_RECOMMENDED, deviceWorkspaceSize, &hostWorkspaceSize);

Create a plan:

cutensorMgCreateContractionPlan(handle, &plan, desc, find, deviceWorkspaceSize, hostWorkspaceSize);

Finally, execute the contraction using the plan:

cutensorMgContraction(handle, plan, alpha, ptrA, ptrB, beta, ptrC, ptrD, deviceWorkspace, hostWorkspace, streams);

In this call, alpha and beta are host pointers of the same type as the D tensor, unless the D tensor is half or BFloat16 precision, in which case it is single precision. The order of pointers in the different arrays ptrA, ptrB, ptrC, and ptrD correspond to their order in their descriptor’s devices array. The order of pointers in the deviceWorkspace and streams arrays corresponds to the order in the library handle’s devices array.

Performance

You can find all these calls together in the CUDA Library Samples GitHub repo. We extended it to take two parameters: The number of GPUs and a scaling factor. Feel free to experiment with other contractions, block sizes, and scaling regimes. It is written in such a way that it scales up M and N while keeping K fixed. It implements an almost GEMM-shaped tensor contraction of the shape:

C_{M^{0}N^{0}M^{1}N^{1}M^{2}N^{2}} \leftarrow A_{M^{0}K^{0}M^{1}K^{1}M^{2}K^{2}} \cdot B_{K^{0}N^{0}K^{1}N^{1}K^{2}N^{2}}

The example scales up M^1 and N^1 as well as the block size in those dimensions, keeping the load approximately balanced. Figure 4 shows the measured performance on a DGX A100 across GPU counts and scaling factors.

Figure 4. Performance of the example contraction across various GPU counts and scaling factors, measured on a DGX A100 node

Get started with cuTENSORMg

Interested in trying out cuTENSORMg to scale tensor contractions beyond a single GPU?

We continue working on improving cuTENSORMg, including out-of-core functionality. If you have questions or new feature requests, contact product manager Matthew Nicely.

Categories
Misc

Delivering Server-Class Performance at the Edge with NVIDIA Jetson Orin

Get started with developing all four Jetson Orin modules for a new era of robotics.

The pace for development and deployment of AI-powered robots and other autonomous machines continues to grow rapidly. The next generation of applications require large increases in AI compute performance to handle multimodal AI applications running concurrently in real time.

Human-robot interactions are increasing in retail spaces, food delivery, hospitals, warehouses, factory floors, and other commercial applications. These autonomous robots must concurrently perform 3D perception, natural language understanding, path planning, obstacle avoidance, pose estimation, and many more actions that require both significant computational performance and highly accurate trained neural models for each application.

NVIDIA Jetson AGX Orin modules are the highest-performing and newest members of the NVIDIA Jetson family. These modules deliver tremendous performance with class-leading energy efficiency. They run the comprehensive NVIDIA AI software stack to power the next generation of demanding edge AI applications.

Figure 1. Jetson AGX Orin module

Jetson AGX Orin and Jetson Orin NX series

At GTC Spring 2022, we announced that four Jetson Orin modules will be available in Q4 2022. With up to 275 tera operations per second (TOPS) of performance, the Jetson Orin modules can run server-class AI models at the edge with end-to-end application pipeline acceleration. Compared to Jetson Xavier modules, Jetson Orin brings even higher performance, power efficiency, and inference capabilities to modern AI applications.

Jetson AGX Xavier 64GB: 21 dense INT8 TOPS, 10W to 20W, $499 (1KU+) vs. Jetson AGX Orin 64GB: 275 sparse | 138 dense INT8 TOPS, 15W to 60W, $1,599 (1KU+)
Jetson Xavier NX 8GB: 32 dense INT8 TOPS, 10W to 30W, $899 (1KU+) vs. Jetson AGX Orin 32GB: 200 sparse | 100 dense INT8 TOPS, 15W to 40W, $899 (1KU+)
Jetson Xavier NX 16GB: 21 dense INT8 TOPS, 10W to 20W, $499 (1KU+) vs. Jetson Orin NX 16GB: 100 sparse | 50 dense INT8 TOPS, 10W to 25W, $599 (1KU+)
Jetson Xavier NX 8GB: 21 dense INT8 TOPS, 10W to 20W, $399 (1KU+) vs. Jetson Orin NX 16GB: 100 sparse | 50 dense INT8 TOPS, 10W to 25W, $599 (1KU+)
Table 1. Jetson Xavier vs. Jetson Orin capability and price comparison
Figure 2. Jetson Xavier and Jetson Orin modules AI TOPS performance comparison

The Jetson AGX Orin series includes the Jetson AGX Orin 64GB and the Jetson AGX Orin 32GB modules.

  • Jetson AGX Orin 64GB delivers up to 275 TOPS with power configurable between 15W and 60W.
  • Jetson AGX Orin 32GB delivers up to 200 TOPS with power configurable between 15W and 40W.

These modules have the same compact form factor and are pin compatible with Jetson AGX Xavier series modules, offering you an 8x performance upgrade, or up to 6x the performance at the same price.

Edge and embedded systems continue to be driven by the increasing number, performance, and bandwidth of sensors. The Jetson AGX Orin series brings not only additional compute for processing these sensors, but also additional I/O:

  • Up to 22 lanes of PCIe Gen4
  • Four 10Gb Ethernet
  • Higher speed CSI lanes
  • Double the storage with 64GB eMMC 5.1
  • 1.5X the memory bandwidth

For more information, see the Jetson Orin product page and the Jetson AGX Orin Series Data Sheet.

Figure 3. Jetson AGX Orin series block diagram, showing the GPU, CPU, DLA, PVA, multimedia blocks, power subsystem, and I/O

USB 3.2, UFS, MGBE, and PCIe share UPHY Lanes. For the supported UPHY configurations, see the Design Guide.

The NVIDIA Orin NX series includes Jetson Orin NX 16GB with up to 100 TOPS of AI performance, and Jetson Orin NX 8GB with up to 70 TOPS. With this series, we followed a similar design philosophy as with Jetson Xavier NX: we took the NVIDIA Orin architecture and brought it to the smallest Jetson form factor, the 260-pin SODIMM, with lower power consumption.

You can bring this higher class of performance to your next-generation, small form factor products like drones and handheld devices. Jetson Orin NX 16GB comes with power configurable between 10W and 25W, and Jetson Orin NX 8GB comes with power configurable between 10W and 20W.

The Orin NX Series is form factor compatible with the Jetson Xavier NX series, and delivers up to 5x the performance, or up to 3X the performance at the same price. The Orin NX series also brings additional high speed I/O capabilities with up to seven PCIe lanes and three 10Gbps USB 3.2 interfaces. For storage, you can leverage the additional PCIe lanes to connect to external NVMe. For more information, see the Jetson Orin product page.

Figure 4. Jetson Orin NX Series Block

Jetson AGX Xavier was designed around the NVIDIA Xavier SoC, our first architecture developed from the ground up for autonomous machines. The NVIDIA Orin architecture takes this class of product to the next level. It continues to showcase multiple different on-chip processors, but brings greater capability, higher performance, and more power efficiency.

The Jetson Orin modules contain the following:

  • An NVIDIA Ampere Architecture GPU with up to 2048 CUDA cores and up to 64 Tensor Cores
  • Up to 12 Arm A78AE CPU cores
  • Two next-generation deep learning accelerators (DLA)
  • A computer vision accelerator
  • Various other processors to offload the GPU and CPU:
    • Video encoder
    • Video decoder
    • Video image compositor
    • Image signal processor
    • Sensor processing engine
    • Audio processing engine

Like the other Jetson modules, Jetson Orin is built using a system-on-module (SOM) design. All the processing, memory, and power rails are contained on the module. All the high-speed I/O is available through a 699-pin connector (Jetson AGX Orin series) or a 260-pin SODIMM connector (Jetson Orin NX series). This SOM design makes it easy for you to integrate the modules into your system designs.

Jetson AGX Orin Developer Kit

At GTC 2022, NVIDIA also announced availability of the Jetson AGX Orin Developer Kit. The developer kit contains everything needed for you to get up and running quickly. It includes a Jetson AGX Orin module with the highest performance and runs the world’s most advanced deep learning software stack. This kit delivers the flexibility to create sophisticated AI solutions now and well into the future.

Compact size, high-speed interfaces, and lots of connectors make this developer kit perfect for prototyping advanced AI-powered robots and edge applications for manufacturing, logistics, retail, service, agriculture, smart cities, healthcare, life sciences, and more.

Figure 5. Jetson AGX Orin Developer Kit

Jetson AGX Orin Developer Kit features:

  • An NVIDIA Ampere Architecture GPU and 12-core Arm Cortex-A78AE 64-bit CPU, together with next-generation deep learning and vision accelerators
  • High-speed I/O, 204.8 GB/s of memory bandwidth, and 32 GB of DRAM capable of feeding multiple concurrent AI application pipelines
  • Powerful NVIDIA AI software stack with support for SDKs and software platforms, including the following:
    • NVIDIA JetPack
    • NVIDIA Riva
    • NVIDIA DeepStream
    • NVIDIA Isaac
    • NVIDIA TAO

The Jetson AGX Orin Developer Kit runs the latest NVIDIA JetPack 5.0 software. NVIDIA JetPack 5.0 supports emulating the performance and clock frequencies of Jetson Orin NX and Jetson AGX Orin series modules with a Jetson AGX Orin Developer Kit. You can kickstart your development for any of those modules today.

Jetson AGX Orin Developer Kit is available for purchase through NVIDIA authorized distributors worldwide. Get started today by following the Getting Started guide.

AI Performance: 275 INT8 sparse TOPS (Developer Kit and AGX Orin 64GB); 200 INT8 sparse TOPS (AGX Orin 32GB)
GPU: 2048-core NVIDIA Ampere Architecture GPU with 64 Tensor Cores (Developer Kit and AGX Orin 64GB); 1792-core NVIDIA Ampere Architecture GPU with 56 Tensor Cores (AGX Orin 32GB)
CPU: 12-core Arm Cortex-A78AE v8.2 64-bit CPU, 3MB L2 + 6MB L3 (Developer Kit and AGX Orin 64GB); 8-core Arm Cortex-A78AE v8.2 64-bit CPU, 2MB L2 + 4MB L3 (AGX Orin 32GB)
Power: 15W-60W (Developer Kit and AGX Orin 64GB); 15W-40W (AGX Orin 32GB)
Memory: 32 GB (Developer Kit); 64 GB (AGX Orin 64GB); 32 GB (AGX Orin 32GB)
MSRP: $1,999 (Developer Kit); $1,599 (AGX Orin 64GB); $899 (AGX Orin 32GB)

Table 2. Summary comparison of Jetson AGX Orin series modules and Developer Kit

Best-in-class performance

Jetson Orin provides a giant leap forward for your next-generation applications. Using the Jetson AGX Orin Developer Kit, we have taken the geometric mean of measured performance for our highly accurate, production-ready, pretrained models for computer vision and conversational AI. The benchmarks tested are described below and shown in Figure 6.

With the NVIDIA JetPack 5.0 Developer Preview, Jetson AGX Orin shows a 3.3X performance increase compared to Jetson AGX Xavier. With future software improvements, we expect this to approach a 5X performance increase. Jetson AGX Xavier performance has increased 1.5X since NVIDIA JetPack 4.1.1 Developer Preview, the first software release to support it.

Figure 6. Pretrained models performance benchmark chart

The benchmarks have been run on our Jetson AGX Orin Developer Kit. PeopleNet and DashcamNet provide examples of dense models that can be run concurrently on the GPU and the two DLAs. The DLA can be used to offload some AI applications from the GPU and this concurrent capability enables them to operate in parallel.

PeopleNet, LPRNet, DashcamNet, and BodyPoseNet provide examples of dense INT8 benchmarks run on Jetson. ActionRecognitionNet 2D and 3D and the conversational AI benchmarks provide examples of dense FP16 performance. All these models can be found on NVIDIA NGC.

Moreover, Jetson Orin continues to raise the bar for AI at the edge, adding to the NVIDIA overall top rankings in the latest MLPerf industry inference benchmarks. Jetson AGX Orin provides up to a 5X performance increase on these MLPerf benchmarks compared to previous results on Jetson AGX Xavier, while delivering an average of 2x better energy efficiency.

Figure 7. Jetson AGX Orin inference performance and energy efficiency compared with Jetson AGX Xavier

Accelerate time-to-market with Jetson software

The class-leading performance and energy efficiency of Jetson Orin is backed by the same powerful NVIDIA AI software that is deployed in GPU-accelerated data centers, hyperscale servers, and powerful AI workstations.

Figure 8. Jetson software overview: AI model development, application frameworks, and the NVIDIA JetPack SDK

NVIDIA JetPack is the foundational SDK for the Jetson platform. NVIDIA JetPack provides a full development environment for hardware-accelerated AI-at-the-edge development. Jetson Orin is supported by NVIDIA JetPack 5.0, which includes the following:

  • LTS Kernel 5.10
  • A root file system based on Ubuntu 20.04
  • A UEFI-based bootloader
  • The latest compute stack with CUDA 11.4, TensorRT 8.4, and cuDNN 8.3

NVIDIA JetPack 5.0 also supports Jetson Xavier modules.

For you to develop fully accelerated applications quickly on the Jetson platform, NVIDIA provides application frameworks for various use cases:

  • With DeepStream, rapidly develop and deploy vision AI applications and services. DeepStream offers hardware acceleration beyond inference, as it offers hardware accelerated plug-ins for end-to-end AI pipeline acceleration.
  • NVIDIA Isaac provides hardware-accelerated ROS packages that make it easier for ROS developers to build high-performance robotics solutions.
  • NVIDIA Isaac Sim, powered by Omniverse, is a tool that powers photo-realistic, physically accurate virtual environments to develop, test, and manage AI-based robots.
  • NVIDIA Riva provides state-of-the-art, pretrained models for automatic speech recognition (ASR) and text-to-speech (TTS), which can be easily customizable. The models enable you to quickly develop GPU-accelerated conversational AI applications.

To accelerate the time to develop production-ready and highly accurate AI models, NVIDIA provides various tools to generate training data, train and optimize models, and quickly create ready to deploy AI models.

NVIDIA Omniverse Replicator for synthetic data generation helps in creating high-quality datasets to boost model training. With Omniverse Replicator, you can create large and diverse synthetic datasets that are not only hard but sometimes impossible to create in the real world. Using synthetic data along with real data for training the model, you can significantly improve model accuracy.

NVIDIA pretrained models from NGC start you off with highly accurate and optimized models and model architectures for various use cases. Pretrained models are production-ready. You can further customize these models by training with your own real or synthetic data, using the NVIDIA TAO (Train-Adapt-Optimize) workflow to quickly build an accurate and ready to deploy model.

Watch these NVIDIA technologies coming together on Jetson AGX Orin for a robotic use case.

Video 1. NVIDIA Jetson AGX Orin: Next-level AI Performance for Next-Gen Robotics

Learn about everything in the Jetson AGX Orin Developer Kit in this getting started video:

Video 2. Getting Started with Jetson AGX Orin Developer Kit

For more information about all the NVIDIA technologies that we bring in NVIDIA Jetson Orin modules, watch a webinar on Jetson software.

Usher in the new era of autonomous machines and robotics

Get started with developing all four Jetson Orin modules by placing an order for the Jetson AGX Orin Developer Kit, and downloading the NVIDIA JetPack 5.0 SDK. Additional documentation for Jetson AGX Orin can be found at the download center. For information and support, visit the NVIDIA Embedded Developer page and forums for help from community experts.

Categories
Misc

Rock On: Scientists Use AI to Improve Sequestering Carbon Underground

A team of scientists has created a new AI-based tool to help lock up greenhouse gases like CO2 in porous rock formations faster and more precisely than ever before. Carbon capture technology, also referred to as carbon sequestration, is a climate change mitigation method that redirects CO2 emitted from power plants back underground.

The post Rock On: Scientists Use AI to Improve Sequestering Carbon Underground appeared first on NVIDIA Blog.

Categories
Misc

Choosing a Development Environment for NVIDIA BlueField DPU Applications

NVIDIA DOCA libraries simplify the development process of BlueField DPU applications. This post describes different ways to compile an application using various development environments for the BlueField DPU.

Step-A 

Step-B 

Go get a cup of coffee… 

Step-C 

How often have you seen “Go get a coffee” in the instructions? As a developer, I found early on that this pesky quip is the bane of my life. Context switches, no matter the duration, are a high cost to pay in the application development cycle. Of all the steps that require you to step away, waiting for an application to compile is the hardest to shake off. 

As we all enter the new world of NVIDIA BlueField DPU application development, it is important to set up the build step efficiently, to allow you to {code => compile => unit-test} seamlessly. In this post, I go over different ways to compile an application for the DPU. 

Free range routing with the DOCA dataplane plugin 

In the DPU application development series, I talked about creating a DOCA dataplane plugin in FRR for offloading policies. FRR’s code count is close to a million lines (789,678 SLOC), which makes it a great candidate for measuring build times.  

Developing directly on the Bluefield DPU 

The DPU has an Arm64 architecture and one quick way to get started on DPU applications is to develop directly on the DPU. This test is with an NVIDIA BlueField2 with 8G RAM and 8xCortex-A72 CPUs. 

I installed the Bluefield boot file (BFB), which provides the Ubuntu 20.04.3 OS image for the DPU. It also includes the libraries for DOCA-1.2 and DPDK-20.11.3. To build an application with the DOCA libraries, I add the DPDK pkgconfig location to the PKG_CONFIG path.

root@dpu-arm:~# export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/opt/mellanox/dpdk/lib/aarch64-linux-gnu/pkgconfig 

Next, I set up my code workspace on the DPU by cloning FRR and switching to the DOCA dataplane plugin branch.

root@dpu-arm:~/code# git clone https://github.com/AnuradhaKaruppiah/frr.git 
root@dpu-arm:~/code# cd frr 
root@dpu-arm:~/code/frr# git checkout dp-doca 

FRR requires a list of constantly evolving prerequisites that are enumerated in the FRR community docs. With those dependencies installed, I configured FRR to include the DPDK and DOCA dataplane plugins.

root@dpu-arm:~/code/frr# ./bootstrap.sh 

root@dpu-arm:~/code/frr# ./configure --build=aarch64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/aarch64-linux-gnu --libexecdir=${prefix}/lib/aarch64-linux-gnu --disable-maintainer-mode --disable-dependency-tracking --enable-exampledir=/usr/share/doc/frr/examples/ --localstatedir=/var/run/frr --sbindir=/usr/lib/frr --sysconfdir=/etc/frr --with-vtysh-pager=/usr/bin/pager --libdir=/usr/lib/aarch64-linux-gnu/frr --with-moduledir=/usr/lib/aarch64-linux-gnu/frr/modules "LIBTOOLFLAGS=-rpath /usr/lib/aarch64-linux-gnu/frr" --disable-dependency-tracking --disable-dev-build --enable-systemd=yes --enable-rpki --with-libpam --enable-doc --enable-doc-html --enable-snmp --enable-fpm --disable-zeromq --enable-ospfapi --disable-bgp-vnc --enable-multipath=128 --enable-user=root --enable-group=root --enable-vty-group=root --enable-configfile-mask=0640 --enable-logfile-mask=0640 --disable-address-sanitizer --enable-cumulus=yes --enable-datacenter=yes --enable-bfdd=no --enable-sharpd=yes --enable-dp-doca=yes --enable-dp-dpdk=yes 

As I used the DPU as my development environment, I built and installed the FRR binaries in place:

root@dpu-arm:~/code# make -j12 all; make install 

Here’s how the build times fared. I measured this in two ways:

  • Time to build and install the binaries using make -j12 all and make install
  • Time to build the same binaries but also assemble them into a Debian package using dpkg-buildpackage -j12 -uc -us 

The first method is used for coding and unit testing. The second method of generating debs is needed to compare with build times on other external development environments.

                           Real              User               Sys
DPU Arm (complete make)    2min 40.529sec    16min 29.855sec    2min 1.534sec
DPU Arm (Debian package)   5min 23.067sec    20min 33.614sec    2min 49.628sec

Table 1. DPU-Arm build times

The difference in times is expected. Generating a package involves several additional steps. 

There are some clear advantages to using the DPU as your development environment.

  • You can code, build and install, and then unit-test without leaving your workspace.
  • You can optimize the build for incremental code changes.

Incremental builds usually give a massive reduction in build time compared to a complete build. For example, I modified the DOCA dataplane code in FRR and rebuilt with these results:

root@dpu-arm:~/code/frr# time make -j12 

>>>>>>>>>>>>> snipped make output >>>>>>>>>>>> 

real    0m3.119s 

user   0m2.794s 

sys     0m0.479s 

While that may make things easier, it requires reserving a DPU indefinitely for every developer for the sole purpose of application development or maintenance. Your development environment may also require more memory and horsepower, making this a less viable option long-term. 

Developing on an x86 server 

My Bluefield2 DPU was hosted by an x86-64 Ubuntu 20.04 server, and I used this server for my development environment.

root@server1-x86:~# lscpu | grep -E "^CPU\(s\):|Model name" 

CPU(s):               32 

Model name:    Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz 

root@server1-x86:~# grep MemTotal /proc/meminfo 

MemTotal:       131906300 kB 

In this case, the build-machine is x86 and the host-machine where the app is going to run is DPU-Arm64. There are several ways to do this:

  • Use an Arm emulation on the x86 build-machine. A DOCA development container is available as a part of the DOCA packages.
  • Use a cross-compilation toolchain. 

In this test, I used the first option as it was the easiest. The second option can give you different build performance, but creating that toolchain has its challenges.

I downloaded and loaded the bfb_builder_doca_ubuntu_20.04 container on my x86 server and fired it up.

root@server1-x86:~# sudo docker load -i bfb_builder_doca_ubuntu_20.04-mlnx-5.4.tar 
root@server1-x86:~# docker run -v ~/code:/code --privileged -it -e container=docker doca_v1.11_bluefield_os_ubuntu_20.04-mlnx-5.4:latest 

The DOCA and DPDK libraries come preinstalled in this container, and I just had to add them to the PKG_CONFIG path.

root@86b87b0ab0c2:/code # export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/opt/mellanox/dpdk/lib/aarch64-linux-gnu/pkgconfig 

I set up the workspace and FRR prerequisites within the container, same as with the previous option.

root@86b87b0ab0c2:/code # git clone https://github.com/AnuradhaKaruppiah/frr.git 
root@86b87b0ab0c2:/code # cd frr 
root@86b87b0ab0c2:/code/frr # git checkout dp-doca 

I could build my application within this DOCA container, but I couldn’t test it in place. So, the FRR binaries had to be built and packaged into debs, which are then copied over to the Bluefield DPU for testing. I set up the FRR Debian rules to match the FRR build configuration used in the previous option and generated the package:

root@86b87b0ab0c2:/code/frr # dpkg-buildpackage -j12 -uc -us 

Table 2 shows how the build time compares with previous methods.

                                            Real               User                Sys
DPU Arm (complete make)                     2min 40.529sec     16min 29.855sec     2min 1.534sec
DPU Arm (Debian package)                    5min 23.067sec     20min 33.614sec     2min 49.628sec
X86 + DOCA dev container (Debian package)   24min 19.051sec    139min 39.286sec    3min 58.081sec

Table 2. DPU-Arm and X86 build times

The giant jump in build time surprised me because I have an amply stocked x86 server and no Docker limits. So, it seems throwing CPUs and RAM at a problem doesn’t always help! This performance degradation is because of the cross-architecture emulation, as you can see with the next option. 

Developing in an AWS Graviton instance 

Next, I tried building my app natively on Arm but this time on an external server with more horsepower. I used an Amazon EC2 Graviton instance for this purpose with specs comparable to my x86 server. 

  • Arm64 arch, Ubuntu 20.04 OS
  • 128G RAM 
  • 32 vCPUs 
root@ip-172-31-28-243:~# lscpu | grep -E "^CPU\(s\):|Model name" 
CPU(s):              32 
Model name:   Neoverse-N1 
root@ip-172-31-28-243:~# grep MemTotal /proc/meminfo 
MemTotal:       129051172 kB 

To set up the DOCA and DPDK libraries in this instance, I installed the DOCA SDK repo meta package.

root@ip-172-31-28-243:~#  dpkg -i doca-repo-aarch64-ubuntu2004-local_1.1.1-1.5.4.2.4.1.3.bf.3.7.1.11866_arm64.deb 
root@ip-172-31-28-243:~#  apt update 
root@ip-172-31-28-243:~# apt install doca-sdk 

The remaining steps for cloning and building the FRR Debian package are the same as the previous option.  

Table 3 shows how the build fared on the AWS Arm instance.

                                            Real               User                Sys
DPU Arm (complete make)                     2min 40.529sec     16min 29.855sec     2min 1.534sec
DPU Arm (Debian package)                    5min 23.067sec     20min 33.614sec     2min 49.628sec
X86 + DOCA dev container (Debian package)   24min 19.051sec    139min 39.286sec    3min 58.081sec
AWS-Arm (Debian package)                    1min 30.480sec     6min 6.056sec       0min 35.921sec

Table 3. DPU-Arm, X86 and AWS-Arm build times

 This is a clear winner, no coffee needed.

Figure 1 shows the compile times in these environments.

Figure 1. FRR build times with different options

Summary 

In this post, I discussed several development environments for DPU applications:

  • Bluefield DPU 
  • DOCA dev container on an x86 server
  • AWS Graviton compute instance 

You can prototype your app directly on the DPU, experiment with developing in the x86 DOCA development container, and grab an AWS Graviton instance with DOCA to punch it into hyperspeed! 

For more information, see the following resources: