Categories
Misc

TensorFlow 2 training with TFRecords: TypeError: 'NoneType' object is not callable

Hi everyone,

I have a question about TFRecords and how to train tf.keras models with them. To explore this, I built a toy example: loading the iris dataset, writing the data to a TFRecord, reading it back in, and trying to train a simple MLP I found in a tutorial.

For encoding/writing/reading/decoding the TFRecords I mostly followed the official [Tutorial](https://www.tensorflow.org/tutorials/load_data/tfrecord). As far as I can tell, I can recover the original data, so I thought I could plug the dataset into the MLP, since the [fit method](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) should be able to work with tf.data datasets. In an example notebook with the MNIST dataset it worked fine, but in my case fit throws the following error:

TypeError: 'NoneType' object is not callable 

I copied the notebook to a gist.

Does anybody know how to solve this problem, or whether I am doing something wrong?

P.S.: I already tried to reshape the tensors, as some posts mentioned that the shapes are not always recovered.
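
For reference, a stripped-down version of the pipeline I am trying to build looks roughly like this (not the exact notebook code; the feature names, file path, and model are just placeholders):

import tensorflow as tf
from sklearn.datasets import load_iris  # assumes scikit-learn is installed

# Write the iris data to a TFRecord file.
iris = load_iris()

def serialize_example(features, label):
    example = tf.train.Example(features=tf.train.Features(feature={
        "features": tf.train.Feature(float_list=tf.train.FloatList(value=features)),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))
    return example.SerializeToString()

with tf.io.TFRecordWriter("iris.tfrecord") as writer:
    for x, y in zip(iris.data, iris.target):
        writer.write(serialize_example(x, int(y)))

# Read it back; fit expects the dataset to yield (features, label) tuples
# with fully defined shapes.
feature_spec = {
    "features": tf.io.FixedLenFeature([4], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    return parsed["features"], parsed["label"]

dataset = (
    tf.data.TFRecordDataset("iris.tfrecord")
    .map(parse_example)
    .shuffle(150)
    .batch(32)
)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(dataset, epochs=10)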

submitted by /u/fXb0XTC3

Categories
Misc

NVIDIA and Palo Alto Networks Deliver Unprecedented Firewall Performance for 5G Cloud-Native Security with DPU Acceleration

Watch the replay of this joint GTC 21 session to learn how DPU acceleration enables a next-generation firewall to run at near-line-rate speeds in a highly efficient 5G-native security solution.

5G is unlike earlier generations of wireless networks: it offers many new capabilities such as lower latency, higher reliability and throughput, agile service deployment through cloud-native architectures, greater device density, and more. The adoption of 5G and its expanded capabilities drives the bandwidth requirements of mobile networks to 100 Gbps and beyond.

With 5G and the increasingly widespread adoption of cloud computing, a new direction in cyber security is required to maintain adequate protection. Today's cyber-attack methods are growing more sophisticated and targeting larger attack surfaces, and modern cloud environments, which are more vulnerable than on-premises deployments, make proper security enforcement difficult.

Figure 1. Properly securing 5G networks is becoming increasingly challenging

Next-generation 5G Firewall

Palo Alto Networks and NVIDIA have collaborated to create a scalable, adaptive security solution that combines the Palo Alto Networks Next-Generation Firewall with the NVIDIA BlueField-2 Data Processing Unit (DPU). Integrating the two raises the bar for high-performance security in virtualized, software-defined networks. The NVIDIA BlueField-2 DPU provides a rich set of network offload engines designed to address evolving security needs within demanding markets such as 5G and the cloud. Palo Alto Networks has applied its expertise in securing enterprise and mobile networks to 5G, implementing a 5G-native security initiative that includes a virtual firewall. The virtual firewall is designed to meet the stringent security needs of 5G cloud-native environments, offering scale, operational simplicity, and automation so that customers gain strong security protection.

Figure 2. Palo Alto Networks’ Next-Generation Firewall provides native 5G security

For data centers looking to modernize their security infrastructure within 5G and cloud environments, the power of a software-defined, hardware-accelerated security architecture from NVIDIA and Palo Alto Networks provides increased infrastructure efficiency, granular zero-trust security across the entire solution stack, and streamlined security and management operations.

Figure 3. Intelligent traffic offload provided by BlueField-2 DPU

The solution has intelligent traffic offload built in, so it adapts to real-time threats without requiring changes to the network infrastructure. The NVIDIA ASAP2 VNF offload technology filters or steers traffic for elephant flows identified by AppID. The firewall inspects the first few packets of each flow to determine whether it contains a threat or can be offloaded. If a packet is not suitable for offload, it is sent to the firewall for inspection. If the firewall determines the session poses no threat, it is passed to the PAN gRPCd process, which calls the DPU daemon to add the session to the DPU session table for future offloading. The DPU then handles all subsequent packets in the flow without consuming any server CPU cycles for firewall processing. With 80% of traffic offloaded to the DPU, the solution delivers up to 100 Gb/s of throughput without utilizing the host CPU, a 5X throughput increase compared to traditional host-based firewall security solutions.

At GTC 21, NVIDIA and Palo Alto Networks jointly presented the intelligent traffic offload use case for 5G-native security. Watch the replay of this joint session to learn how DPUs enable a next-generation firewall to achieve near-line-rate speeds for a highly efficient 5G-native security solution. Don't miss the demonstration showcasing the flexibility, programmability, and agility of the Palo Alto Networks and NVIDIA joint cyber-security solution. GA of this solution is targeted for May 2021. Please connect with your NVIDIA or Palo Alto Networks sales representative to learn more.


Categories
Misc

Accelerating Edge Computing with a Smarter Network

As computing power moves to edge computing, NVIDIA announces the EGX platform and ecosystem, including server-vendor certifications, hybrid cloud partners, and new GPU Operator software.

Demand for edge computing is growing rapidly because people increasingly need to analyze and use data where it’s created instead of trying to send it back to a data center. New applications cannot wait for the data to travel all the way to a centralized server, wait for it to be analyzed, then wait for the results to make the return trip. They need the data analyzed RIGHT HERE, RIGHT NOW! 

To meet this need, NVIDIA just announced an expansion of the NVIDIA EGX platform and ecosystem, which includes server vendor certifications, hybrid cloud partners, and new GPU Operator software that enables a cloud-native deployment for GPU servers. As computing power moves to the edge, we find that smarter edge computing needs smarter networking, and so the EGX ecosystem includes NVIDIA networking solutions.

IoT drives the need for edge computing

The growth of the Internet of Things (IoT), 5G wireless, and AI is driving the move of compute to the edge. IoT means more—and smarter—devices generating and consuming more data in the field, far from traditional data centers. Autonomous vehicles, digital video cameras, kiosks, medical equipment, building sensors, smart cash registers, location tags, and of course phones will soon generate data from billions of endpoints. This data must be collected, filtered, and analyzed, and often the distilled results are transferred to another data center or endpoint somewhere else. Sending all the data back to the data center without any edge processing not only adds significant latency, it is often simply too much data to transmit over WAN connections. Data centers often don't even have enough room to store all the unfiltered, uncompressed data coming from the edge.

5G brings higher bandwidth and lower latency to the edge, enabling faster data acquisition and new applications for IoT devices. Data that previously wasn’t collected or which couldn’t be shared is now available over the air. The faster wireless connectivity enables new applications that use and respond to data at the edge, in real time. That’s instead of waiting for it to be stored centrally then analyzed later, if it’s analyzed at all.

AI means more useful information can be derived from all the new data, driving quick decisions. The flood of IoT data is too voluminous to be analyzed by humans. It requires AI technology to separate the wheat from the chaff (the signal from the noise). The decision and insights from AI then feed applications both at the edge and back in the central data center.

Figure 1. IoT and 5G wireless drive increased deployment of AI computing at the edge, enabling new services for telco operators.

NVIDIA EGX delivers AI at the edge

Many edge AI workloads—such as image recognition, video processing, and robotics—require massive parallel processing power, an area where NVIDIA GPUs are unmatched. To meet the need for more advanced AI processing at the edge, NVIDIA introduced the EGX platform. The EGX  platform supports a hyper-scalable range of GPU servers, from a single NVIDIA Jetson Nano system up to a full rack of NVIDIA T4 or V100 Tensor Core servers. The Jetson Nano delivers up to half a trillion operations per second (1/2 TOPS), while a full rack of T4 servers can handle ten thousand trillion operations per second (10,000 TOPS).

NVIDIA EGX also includes container-based tools, drivers, and NVIDIA CUDA-X libraries to support AI applications at the edge. EGX is supported by major server vendors and includes integration with Red Hat OpenShift to provide enterprise-class container orchestration based on Kubernetes. This is all critical because so many of the edge computing locations—retail stores, hospitals, self-driving cars, homes, factories, cell phones, and so on—are supported by enterprises, local government, and telcos, not by hyperscalers.

Recently, NVIDIA announced new EGX features and customer implementations, along with strong support for hybrid cloud solutions. The NGC-Ready server certification program has been expanded to include tests for edge security and remote management, and the new NVIDIA GPU Operator simplifies management and operation of AI across widely distributed edge devices.

Figure 2. NVIDIA EGX platform includes GPU, CUDA-X interfaces, container management, and certified hardware partners.

Smarter edge needs smarter networking

But there is another class of technology and partners needed to make EGX—and AI at the edge—as smart and efficient as it can be: networking. As the amount of GPU-processing power at the edge and the number of containers increases, the amount of network traffic can also increase exponentially.

Before AI, the analyzable edge data traffic (not counting streamed graphics, video, and music going out to phones) probably flowed about 95% inbound: from cameras to digital video recorders, from cars to servers, or from retail stores to a central data center. Any analysis or insight was often human-driven, as a person can only concentrate on and observe a single stream of video at a time. The data might simply be stored for a later date, removing the ability to make instant decisions.

Now, with AI solutions like EGX deployed at the edge, applications must talk with IoT devices, back to servers in the data center, and with each other. AI applications trade data and results with standard CPUs, data from the edge is synthesized with data from the corporate data center or public cloud, and the results get pushed back to the kiosks, cars, appliances, MRI scanners, and phones.

The result is a massive amount of N-way data traffic between containers, IoT devices, GPU servers, the cloud, and traditional centralized servers. Software-defined networking (SDN) and network virtualization play a larger role. This expanded networking brings new security concerns, as the potential attack surface for hackers and malware is much larger than before and cannot be contained inside a firewall.

As networking becomes more complex, the network must become smarter in many ways. Some examples of this are:

  • Packets must be routed efficiently between containers, VMs, and bare metal servers.
  • Network function virtualization (NFV) and SDN demand accelerated packet switching, which could be in user space or kernel space.
  • The use of RDMA requires hardware offloads on the NICs and intelligent traffic management on the switches.
  • Security requires that data be encrypted at rest or in flight, or both. Whatever is encrypted must also be decrypted at some point.
  • The growth in IoT data, combined with the switch from spinning disk to flash, calls for compression and deduplication of data to control storage costs.

These increased network complexity and security concerns impose a growing burden on the edge servers as well as on the corporate and cloud servers that interface with them. With more AI power and faster network speeds, handling the network virtualization, SDN rules, and security filtering sucks up an expensive share of CPU cycles, unless you have the right kind of smart network. As the network connections get faster, that network’s smarts must be accelerated in hardware instead of running in software.

SmartNICs save edge compute cycles

Smarter edge computing requires smarter networking. If this networking is handled by the CPUs or GPUs, then valuable cycles are consumed by moving the data instead of analyzing and transforming it. Someone must encode and decode overlay network headers, determine which packet goes to which container, and ensure that SDN rules are followed. Software-defined firewalls and routers impose additional CPU burdens as packets must be filtered based on source, destination, headers, or even on the internal content of the packets. Then, the packets are forwarded, mirrored, rerouted, or even dropped, depending on the network rules.

Fortunately, there is a class of affordable SmartNICs, such as the NVIDIA ConnectX family, which offload all this work from the CPU. These adapters have hardware-accelerated functions to handle overlay networks, Remote Direct Memory Access, container networking, virtual switching, storage networking, and video streaming. They also accelerate the adapter side of network congestion management and QoS.

The newest adapters, such as the ConnectX-6 Dx, can perform in-line encryption and decryption in hardware at high speeds, supporting IPsec and TLS. With these important but repetitive network tasks safely accelerated by the NIC, the CPUs and GPUs at the edge connect quickly and efficiently with each other and the IoT, all the while focusing their core cycles on what they do best—running applications and parallelized processing of complex data.

BlueField DPU adds extra protection against malware and overwork

An even more advanced network option for edge compute efficiency is a data processing unit (DPU), such as the NVIDIA BlueField DPU. A DPU combines all the high-speed networking and offloads of a SmartNIC with programmable cores that can handle additional networking, storage, or security functions.

  • It can offload both SDN data plane and control plane functions.
  • It can virtualize flash storage for CPUs or GPUs.
  • It can implement security in a separate domain to provide very high levels of protection against malware.

On the security side, DPUs such as BlueField provide security domain isolation. Without isolation, any security software is running in the same domain as the OS, container management, and application. If an attacker compromises any of those, the security software is at risk of being bypassed, removed, or corrupted. With BlueField, the security software continues running on the DPU where it can continue to detect, isolate, and report malware or breaches on the server. By running in a separate domain—protected by a hardware root of trust—the security features can sound the alarm to intrusion detection and prevention mechanisms and also prevent malware on the infected server from spreading.

The newest BlueField-2 DPU also adds regular expression (RegEx) matching that can quickly detect patterns in network traffic or server memory, so it can be used for threat identification. It also adds hardware offloads for data efficiency using deduplication through a SHA-2 hash and compression/decompression.

Smarter networking at the edge

With the increasing use of AI solutions at the edge, like the NVIDIA EGX platform, the edge becomes infinitely smarter. However, networking and security also get more complex and threaten to slow down servers, just when the growth of the IoT and 5G wireless requires more compute power. This can be solved with the deployment of SmartNICs and DPUs, such as the ConnectX and BlueField product families. These network solutions offload important network and security tasks, such as SDN, network virtualization, and software-defined firewall functions. This allows AI at the edge to run more efficiently and securely.

For more information, see the NVIDIA EGX Platform for Edge Computing webinar.

Categories
Misc

IDC Business Value White Paper: The Business Value of NVIDIA Ethernet Switch Solutions for Managing and Optimizing Network Performance

IDC analysts Brad Casemore and Harsh Singh interviewed IT organizations with real world experience deploying and managing Cumulus Linux and NVIDIA Spectrum switches in mission critical data centers over a significant time period.

NVIDIA recently commissioned IDC to conduct research into the business value and technical benefits of the NVIDIA Ethernet switch solution. IDC analysts Brad Casemore and Harsh Singh interviewed IT organizations with real world experience deploying and managing Cumulus Linux and NVIDIA Spectrum switches in mission critical data centers over a significant time period. The research included interviews with organizations that were using the combined solution set and had first-hand knowledge about the costs and benefits of this solution. 

During the interviews, companies were asked a variety of quantitative and qualitative questions about NVIDIA’s impact on their IT and network operations. The results highlighted several key benefits that NVIDIA customers are realizing, including: 

  • Higher efficiency:  Helping network operations and management staff be more efficient while allowing them to spend more time on innovation and business-related projects  
  • Better performance and security:  Improving overall network performance including increasing the efficiency of network security and reducing application latency  
  • Improved cost/performance:  Leveraging improved network and application performance while lowering costs to bolster business operations and results 
  • Increased reliability and productivity:  Reducing the occurrences and rate of unplanned downtime, thereby lowering business risk and increasing productivity 

“The biggest benefit is operational efficiencies — it’s easier to deploy and manage from an IT standpoint. Any time we need to make changes to the infrastructure, for whatever reason, we can do it more simply and quicker. And the raw IT budget has been reduced,” said one NVIDIA Ethernet switching solutions customer.

In this paper, IDC concluded that a modern data center can be a significant contributor to business value when the datacenter network combines all the traditional attributes of switch hardware — such as scalability, reliability, performance, and low latency — with a commensurate degree of software-based network automation, programmability, flexibility, and actionable analytics and insights.   

In the IDC Business Value White Paper, IDC documents the business value of NVIDIA Spectrum Ethernet switches combined with Cumulus Linux, driven by the ease of overall network management and operations. Interviewees also reported that they were able to support their business with a more cost-effective and better-performing network, reducing the cost of the network itself while still providing enough bandwidth for their organization's business users.

Join guest IDC analysts Brad Casemore and Harsh Singh on May 27 at 10 a.m. PT for a full readout on the details of their findings in a special webinar. Please register here.

All webinar attendees will get exclusive early access to the full IDC Business Value White Paper*, entitled “The Business Value of NVIDIA Ethernet Switch Solutions for Managing and Optimizing Network Performance”.  You’ll also receive the IDC Executive Overview and IDC Snapshot/Infographic. 

*IDC Business Value Whitepaper, Sponsored by NVIDIA, The Business Value of NVIDIA Ethernet Switch Solutions for Managing and Optimizing Network Performance, IDC Doc. #US47556921, April 2021 

Categories
Offsites

KELM: Integrating Knowledge Graphs with Language Model Pre-training Corpora

Large pre-trained natural language processing (NLP) models, such as BERT, RoBERTa, GPT-3, T5, and REALM, leverage natural language corpora derived from the Web, are fine-tuned on task-specific data, and have made significant advances in various NLP tasks. However, natural language text alone represents a limited coverage of knowledge, and facts may be expressed in wordy sentences in many different ways. Furthermore, the existence of non-factual information and toxic content in text can eventually cause biases in the resulting models.

Alternate sources of information are knowledge graphs (KGs), which consist of structured data. KGs are factual in nature because the information is usually extracted from more trusted sources, and post-processing filters and human editors ensure that inappropriate and incorrect content is removed. Therefore, models that can incorporate them carry the advantages of improved factual accuracy and reduced toxicity. However, their different structural format makes it difficult to integrate them with the existing pre-training corpora in language models.

In “Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training” (KELM), accepted at NAACL 2021, we explore converting KGs to synthetic natural language sentences to augment existing pre-training corpora, enabling their integration into the pre-training of language models without architectural changes. To that end, we leverage the publicly available English Wikidata KG and convert it into natural language text in order to create a synthetic corpus. We then augment REALM, a retrieval-based language model, with the synthetic corpus as a method of integrating natural language corpora and KGs in pre-training. We have released this corpus publicly for the broader research community.

Converting KG to Natural Language Text
KGs consist of factual information represented explicitly in a structured format, generally in the form of [subject entity, relation, object entity] triples, e.g., [10×10 photobooks, inception, 2012]. A group of related triples is called an entity subgraph. An example of an entity subgraph that builds on the previous example of a triple is { [10×10 photobooks, instance of, Nonprofit Organization], [10×10 photobooks, inception, 2012] }, which is illustrated in the figure below. A KG can be viewed as interconnected entity subgraphs.

Converting subgraphs into natural language text is a standard task in NLP known as data-to-text generation. Although there have been significant advances in data-to-text generation on benchmark datasets such as WebNLG, converting an entire KG into natural text has additional challenges. The entities and relations in large KGs are more vast and diverse than in small benchmark datasets. Moreover, benchmark datasets consist of predefined subgraphs that can form fluent, meaningful sentences. With an entire KG, such a segmentation into entity subgraphs needs to be created as well.
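
As a toy illustration only (this is not the TEKGEN pipeline), a naive template-based verbalizer over such triples might look like the following; its stilted output is exactly the kind of problem that motivates using a trained text-to-text model such as T5 instead:

# Toy sketch: verbalize an entity subgraph of (subject, relation, object) triples
# with a naive template. Not the TEKGEN pipeline, which uses a trained T5 model.
subgraph = [
    ("10x10 photobooks", "instance of", "Nonprofit Organization"),
    ("10x10 photobooks", "inception", "2012"),
]

def verbalize(triples):
    # Group the triples of one subject into a single templated sentence.
    subject = triples[0][0]
    clauses = [f"{relation} {obj}" for _, relation, obj in triples]
    return f"{subject}: {', and '.join(clauses)}."

print(verbalize(subgraph))
# Prints: "10x10 photobooks: instance of Nonprofit Organization, and inception 2012."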

An example illustration of how the pipeline converts an entity subgraph (in bubbles) into synthetic natural sentences (far right).

In order to convert the Wikidata KG into synthetic natural sentences, we developed a verbalization pipeline named “Text from KG Generator” (TEKGEN), which is made up of the following components: a large training corpus of heuristically aligned Wikipedia text and Wikidata KG triples, a text-to-text generator (T5) to convert the KG triples to text, an entity subgraph creator for generating groups of triples to be verbalized together, and finally, a post-processing filter to remove low quality outputs. The result is a corpus containing the entire Wikidata KG as natural text, which we call the Knowledge-Enhanced Language Model (KELM) corpus. It consists of ~18M sentences spanning ~45M triples and ~1500 relations.

Converting a KG to natural language, which is then used for language model augmentation

Integrating Knowledge Graph and Natural Text for Language Model Pre-training
Our evaluation shows that KG verbalization is an effective method of integrating KGs with natural language text. We demonstrate this by augmenting the retrieval corpus of REALM, which includes only Wikipedia text.

To assess the effectiveness of verbalization, we augment the REALM retrieval corpus with the KELM corpus (i.e., “verbalized triples”) and compare its performance against augmentation with concatenated triples without verbalization. We measure the accuracy with each data augmentation technique on two popular open-domain question answering datasets: Natural Questions and Web Questions.

Augmenting REALM with even the concatenated triples improves accuracy, potentially adding information not expressed in text explicitly or at all. However, augmentation with verbalized triples allows for a smoother integration of the KG with the natural language text corpus, as demonstrated by the higher accuracy. We also observed the same trend on a knowledge probe called LAMA that queries the model using fill-in-the-blank questions.

Conclusion
With KELM, we provide a publicly-available corpus of a KG as natural text. We show that KG verbalization can be used to integrate KGs with natural text corpora to overcome their structural differences. This has real-world applications for knowledge-intensive tasks, such as question answering, where providing factual knowledge is essential. Moreover, such corpora can be applied in pre-training of large language models, and can potentially reduce toxicity and improve factuality. We hope that this work encourages further advances in integrating structured knowledge sources into pre-training of large language models.

Acknowledgements
This work has been a collaborative effort involving Oshin Agarwal, Heming Ge, Siamak Shakeri and Rami Al-Rfou. We thank William Woods, Jonni Kanerva, Tania Rojas-Esponda, Jianmo Ni, Aaron Cohen and Itai Rolnick for rating a sample of the synthetic corpus to evaluate its quality. We also thank Kelvin Guu for his valuable feedback on the paper.

Categories
Misc

GFN Thursday Plunges Into ‘Phantom Abyss,’ the New Adventure Announced by Devolver Digital

GFN Thursday returns with a brand new adventure, exploring the unknown in Phantom Abyss, announced just moments ago by Devolver Digital and Team WIBY. The game launches on PC this summer, and when it does, it'll be streaming instantly to GeForce NOW members. No GFN Thursday would be complete without new games.


Categories
Misc

Tips: Acceleration Structure Compaction

Learn how to compact the acceleration structure in DXR and what to know before you start implementing.

In ray tracing, more geometry can reside in GPU memory than with rasterization because rays may hit geometry outside the view frustum. You can let the GPU compact acceleration structures to save memory. For some games, compaction reduces the memory footprint of a bottom-level acceleration structure (BLAS) by at least 50%. BLASes usually take more GPU memory than top-level acceleration structures (TLASes), but this post is also valid for TLASes.

In this post, I discuss how to compact the acceleration structure in DXR and what to know before you start implementing. Do you have your acceleration structure already working but you want to keep the video memory usage as small as possible? Read Managing Memory for Acceleration Structures in DirectX Raytracing first and then come back.

I assume that you already have your acceleration structures suballocated from larger resources and want to save more video memory by compacting them. I use the DXR API in this post, but the process is similar in Vulkan.

How does compaction work?

BLAS compaction is not as trivial as adding a new flag to the acceleration structure build input. In your implementation, you can treat this process as a kind of state machine that runs over a few frames (Figure 1): the compacted size isn't known until after the initial build is completed, and you must wait until the compaction copy is completed on the GPU before using the result. Here is the brief process to compact a BLAS.

Figure 1. Compaction workflow.
  1. Add the compaction flag when building the acceleration structure. For BuildRaytracingAccelerationStructure, you must specify the _ALLOW_COMPACTION build flag for the source BLAS from which to compact.
  2. Read the compaction sizes:
    • Call EmitRaytracingAccelerationStructurePostbuildInfo with the compaction size buffer, _POSTBUILD_INFO_COMPACTED_SIZE flag and the source BLASes that are built with the _ALLOW_COMPACTION flag. This computes the compaction buffer size on the GPU, which is then used to allocate the compaction buffer. The compaction size buffer is a buffer that holds the size values when it’s ready.
    • You can pass the post build info structure in your source BuildRaytracingAccelerationStructure instead of calling EmitRaytracingAccelerationStructurePostbuildInfo.
    • The API doesn’t directly return the size that you want to use, as it’s calculated from the GPU.
    • Use appropriate synchronization (for example, fence/signal) to make sure that you’re OK to read back the compaction size buffer.
    • You can use CopyResource and Map to read back the content of the compaction size buffer from GPU to CPU. There could be a couple of frames of delay for reading the size if you execute the command buffer and submit the queue one time per frame.
    • If the compaction size buffer isn’t ready to be read, then you can keep using the original BLAS for the rest of your rendering pipeline. In the next frames, you keep checking the readiness and continue the following steps.
  3. Create a new target BLAS resource with the known compaction size. Now you know the size and you can make your target BLAS resource ready.
  4. Compact it with Copy:
    • Copy from the source BLAS to the target BLAS using CopyRayTracingAccelerationStructure with the _COPY_MODE_COMPACT flag. Your target BLAS has the compacted content when it’s finished in GPU.
    • Make sure that your source BLAS has been built in the GPU already before running CopyRayTracingAccelerationStructure.
    • Wait for compaction to be finished using fence/signal.
    • You can also run compactions in parallel with other compactions and with other builds and refits.
  5. (Optional) Build a TLAS that points to the new compacted BLAS.
  6. Use it with DispatchRays. You are now OK to call DispatchRays or use inline ray tracing that uses the compacted BLAS.

Tips

Here are a few tips to help you deal with crashes, corruption, and performance issues.

Compaction count

You don’t need to compact all the BLASes in one frame because you can still call DispatchRays with the source BLASes while they’re being compacted. Limit your per-frame BLAS compaction count based on your frame budget.

Animating BLAS

It’s possible to compact animating BLASes, like for characters or particles. However, you pay the compaction cost and the delay of updates. I don’t recommend using compaction on particles and exploding meshes.

Your compacted BLAS could be outdated when it’s ready. In this case, you can refit on the compacted BLAS, if you can.

Don't add the _ALLOW_COMPACTION flag to BLASes that won't be compacted, because adding this flag isn't free even though the cost is small.

Crashes or corruptions

If you have crashes or corruptions after your compaction-related changes, then try replacing the _COPY_MODE_COMPACT mode in your CopyRaytracingAccelerationStructure  with _COPY_MODE_CLONE instead.

Specify the initial acceleration structure size instead of the compacted size. If you still have the same issues, that tells you the corrupted data is not a result of the actual compaction; it could be from using wrong or invalid resources, or from being out of sync due to missing barriers or fence waits.

Null UAV barriers

Use null UAV barriers to find why GPU crash issues happen. Keep in mind that using null UAV barriers is suboptimal and try to use more specific options. If you do use null UAV barriers, add them as follows:

  • Before the Emit..PostbuildInfo call
  • Before reading back the sizes
  • After calling CopyRaytracingAccelerationStructure for compaction

A null UAV barrier is just the easiest way to make sure that you got all the resources covered, like if you accidentally use the same scratch resource for multiple BLAS builds. The null barrier should prevent those from clobbering each other. Or if you have a preceding skinning shader, you’ll be sure that the vertex positions are updated.

Figure 2 shows two usage patterns that explain where to add barriers. Those barriers are all necessary but you can try replacing them with null barriers to make sure that you didn’t miss anything.

Figure 2. Two barrier usage patterns: (a) is for BLASes built without post build info, (b) is with post build info.

Destroy the source BLAS

Do not destroy the source BLAS while it's still being used. GPU memory savings are achieved only after you delete the source BLAS; until then, you are keeping two versions of the BLAS in GPU memory. Destroy the resource as soon as possible after compacting it. Keep in mind that you can't destroy it even after CopyRaytracingAccelerationStructure has completed if a previously submitted DispatchRays still uses the source BLAS on the GPU.

Max compaction

If you don’t use the PREFER_FAST_BUILD or ALLOW_UPDATE flags, then you should get max compaction.

  • PREFER_FAST_BUILD uses its own compaction method and results can differ from ALLOW_COMPACTION.
  • ALLOW_UPDATE must leave room for updated triangles.

Conclusion

It might be more complicated than you thought, but it's worth doing. I hope you are pleased with the compaction rate and the total GPU memory savings. I recommend adding debug features to your app that visualize the memory savings and the compaction rate, to track how compaction performs per content.

Categories
Misc

Accelerating AI Modules for ROS and ROS 2 on NVIDIA Jetson Platform

In this post, we showcase our support for open-source robotics frameworks including ROS and ROS 2 on NVIDIA Jetson developer kits.

NVIDIA Jetson developer kits serve as a go-to platform for roboticists because of their ease of use, system support, and comprehensive support for accelerating AI workloads. In this post, we showcase our support for open-source robotics frameworks, including ROS and ROS 2, on NVIDIA Jetson developer kits.

Figure 1. ROS and ROS 2 with AI acceleration on NVIDIA Jetson platform.

This post includes the following helpful resources:

ROS and ROS 2 Docker containers

We offer different Docker images for ROS and ROS 2 with machine learning libraries. We also provide Dockerfiles for you to build your own Docker images according to your custom requirements.

ROS and ROS 2 Docker images

We provide support for ROS 2 Foxy Fitzroy, ROS 2 Eloquent Elusor, and ROS Noetic with AI frameworks such as PyTorch, NVIDIA TensorRT, and the DeepStream SDK, as well as machine learning (ML) libraries such as scikit-learn, numpy, and pillow. The containers are packaged with ROS 2 AI packages accelerated with TensorRT.

ROS 2 Foxy, ROS 2 Eloquent, and ROS Noetic with PyTorch and TensorRT Docker image:

Table 1 shows the pull commands for these Docker images.

Docker image | Pull command
ROS 2 Foxy with PyTorch and TensorRT | $ docker pull nvidiajetson/l4t-ros2-foxy-pytorch:r32.5
ROS 2 Foxy with DeepStream SDK | $ docker pull nvidiajetson/deepstream-ros2-foxy:5.0.1
ROS 2 Eloquent with PyTorch and TensorRT | $ docker pull nvidiajetson/l4t-ros2-eloquent-pytorch:r32.5
ROS 2 Eloquent with DeepStream SDK | $ docker pull nvidiajetson/deepstream-ros2-eloquent:5.0.1
ROS Noetic with PyTorch and TensorRT | $ docker pull nvidiajetson/l4t-ros-noetic-pytorch:r32.5
Table 1. Pull commands for ROS 2 Docker images.

ROS and ROS 2 DockerFiles

To enable you to easily run different versions of ROS and ROS 2 on Jetson, we released Dockerfiles and build scripts for ROS 2 Eloquent, ROS 2 Foxy, ROS Melodic, and ROS Noetic. These containers provide an automated and reliable way to install ROS and ROS 2 on Jetson and to build your own ROS-based applications.

Because Eloquent and Melodic already provide prebuilt packages for Ubuntu 18.04, the Dockerfiles install these versions of ROS into the containers. In contrast, Foxy and Noetic are built from source inside the container, as those versions only come prebuilt for Ubuntu 20.04. With the containers, using these versions of ROS and ROS 2 is the same, regardless of the underlying OS distribution.

To build the containers, clone the repo on your Jetson device running NVIDIA JetPack 4.4 or newer, and run the ROS build script:

$ git clone https://github.com/dusty-nv/jetson-containers
$ cd jetson-containers
$ ./scripts/docker_build_ros.sh all       # build all: melodic, noetic, eloquent, foxy
$ ./scripts/docker_build_ros.sh melodic   # build only melodic
$ ./scripts/docker_build_ros.sh noetic    # build only noetic
$ ./scripts/docker_build_ros.sh eloquent  # build only eloquent
$ ./scripts/docker_build_ros.sh foxy      # build only foxy 

Accelerated AI ROS and ROS 2 packages

GitHub: NVIDIA-AI-IOT/ros2_torch_trt

We've put together bundled packages with all the materials needed to run various GPU-accelerated AI applications with ROS and ROS 2. There are applications for object detection, human pose estimation, gesture classification, semantic segmentation, and NVAprilTags.

The repository provides four different packages for classification and object detection using PyTorch and TensorRT. This repository serves as a starting point for AI integration with ROS 2. The main features of the packages are as follows:

  • For classification, select from various ImageNet pretrained models, including Resnet18, AlexNet, SqueezeNet, and Resnet50.
  • For detection, MobileNetV1-based SSD is currently supported, trained on the COCO dataset.
  • The TensorRT packages provide a significant speedup in carrying out inference relative to the PyTorch models performing inference directly on the GPU.
  • The inference results are published in the form of vision_msgs.
  • On running the node, a window is also shown with the inference results visualized.
  • A Jetson-based Docker image and launch file is provided for ease of use.

For more information, see Implementing Robotics Applications with ROS 2 and AI on the NVIDIA Jetson Platform.
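
Because these packages publish their results as vision_msgs, a minimal rclpy subscriber might look like the following sketch; the topic name and exact message fields are assumptions and can differ between package and vision_msgs versions, so check the package's launch files:

import rclpy
from rclpy.node import Node
from vision_msgs.msg import Detection2DArray  # requires the vision_msgs package

class DetectionListener(Node):
    def __init__(self):
        super().__init__("detection_listener")
        # The topic name "detections" is an assumption; check the launch file.
        self.create_subscription(Detection2DArray, "detections", self.on_detections, 10)

    def on_detections(self, msg):
        # Log the bounding box of every detected object.
        for det in msg.detections:
            c = det.bbox.center
            self.get_logger().info(
                f"object at ({c.x:.0f}, {c.y:.0f}), "
                f"bbox {det.bbox.size_x:.0f} x {det.bbox.size_y:.0f}")

def main():
    rclpy.init()
    rclpy.spin(DetectionListener())
    rclpy.shutdown()

if __name__ == "__main__":
    main()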

ROS and ROS 2 packages for accelerated deep learning nodes

GitHub: dusty-nv/ros_deep_learning

This repo contains deep learning inference nodes and camera/video streaming nodes for ROS and ROS 2 with support for Jetson Nano, TX1, TX2, Xavier NX, NVIDIA AGX Xavier, and TensorRT.

The nodes use the image recognition, object detection, and semantic segmentation DNNs from the jetson-inference library and NVIDIA Hello AI World tutorial. Both come with several built-in pretrained networks for classification, detection, and segmentation and the ability to load customized user-trained models.

The camera/video streaming nodes support the following I/O interfaces:

  • MIPI CSI cameras
  • V4L2 cameras
  • RTP / RTSP
  • Videos and images
  • Image sequences
  • OpenGL windows

ROS Melodic and ROS 2 Eloquent are supported. We recommend the latest version of NVIDIA JetPack.

ROS 2 package for human pose estimation

GitHub: NVIDIA-AI-IOT/ros2_trt_pose

In this repository, we accelerate human-pose estimation using TensorRT. We use the widely adopted NVIDIA-AI-IOT/trt_pose repository. To understand human pose, pretrained models infer 17 body parts based on the categories from the COCO dataset. Here are the key features of the ros2_trt_pose package:

  • Publishes pose_msgs, such as the person count and person_id. For each person_id, it publishes 17 body parts.
  • Provides a launch file for easy usage and visualizations on Rviz2:
    • Image messages
    • Visual markers: body_joints, body_skeleton
  • Contains a Jetson-based Docker image for easy install and usage.

For more information, see Implementing Robotics Applications with ROS 2 and AI on the NVIDIA Jetson Platform.

ROS 2 package for accelerated NVAprilTags

GitHub: NVIDIA-AI-IOT/ros2-nvapriltags

This ROS 2 node uses the NVIDIA GPU-accelerated AprilTags library to detect AprilTags in images and publish the poses, IDs, and additional metadata. This has been tested on ROS 2 (Foxy) and should run on x86_64 and aarch64 (Jetson hardware). It is modeled after and comparable to the ROS 2 node for CPU AprilTags detection.

For more information about the NVIDIA Isaac GEM on which this node is based, see April Tags in the NVIDIA Isaac SDK 2020.2 documentation. For more information, see AprilTags Visual Fiducial System.

ROS 2 package for hand pose estimation and gesture classification

GitHub: NVIDIA-AI-IOT/ros2_trt_pose_hand

The ROS 2 package takes advantage of the recently released NVIDIA-AI-IOT/trt_pose_hand repo: real-time hand pose estimation and gesture classification using TensorRT. It provides the following key features:

  • Hand pose message with 21 key points
  • Hand pose detection image message
  • std_msgs for gesture classification with six classes:
    • fist
    • pan
    • stop
    • fine
    • peace
    • no hand
  • Visualization markers
  • Launch file for RViz2

ROS 2 package for text detection and monocular depth estimation

GitHub: NVIDIA-AI-IOT/ros2_torch2trt_examples

In this repository, we demonstrate the use of torch2trt, an easy-to-use PyTorch-to-TensorRT converter, for two different applications: text detection and monocular depth estimation. A minimal torch2trt conversion sketch follows the list below.

For easy integration and development, the ROS 2 package performs the following steps:

  1. Subscribes to the image_tools cam2image image message.
  2. Optimizes the model to TensorRT.
  3. Publishes the image message.
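
As referenced above, here is a minimal torch2trt conversion sketch (not taken from this repository); it assumes a Jetson with PyTorch, torchvision, TensorRT, and torch2trt installed, and uses resnet18 only as an arbitrary example model:

import torch
from torch2trt import torch2trt
from torchvision.models import resnet18

# Load an example model and an example input used to build the TensorRT engine.
model = resnet18(pretrained=True).eval().cuda()
x = torch.ones((1, 3, 224, 224)).cuda()

# Convert to a TensorRT-optimized module (FP16 here is an optional choice).
model_trt = torch2trt(model, [x], fp16_mode=True)

# Compare the original and optimized outputs.
y = model(x)
y_trt = model_trt(x)
print(torch.max(torch.abs(y - y_trt)))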

ROS and ROS 2 package for Jetson stats

GitHub:  NVIDIA-AI-IOT/ros2_jetson_stats

The jetson-stats package is for monitoring and controlling your NVIDIA Jetson [Xavier NX, Nano, NVIDIA AGX Xavier, TX1, or TX2]. In this repository, we provide a ROS 2 package for jetson_stats so that you can monitor system status in deployment; a minimal jetson-stats usage sketch follows the lists below. The ROS package, developed by Jetson Champion Raffaello Bonghi, PhD, can be found at rbonghi/ros_jetson_stats.

The ros2_jetson_stats package features the following ROS 2 diagnostic messages:

  • GPU/CPU usage percentage
  • EMC/SWAP/memory status (% usage)
  • Power and temperature of the SoC

You can now control the following through the ROS 2 command line:

  • Fan (mode and speed)
  • Power model (nvpmodel)
  • jetson_clocks
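
As mentioned above, a minimal monitoring sketch using the jtop Python API from jetson-stats might look like this (assuming the package is installed, for example with sudo -H pip install -U jetson-stats):

from jtop import jtop

# Open a connection to the jetson-stats service and poll the board statistics.
with jtop() as jetson:
    while jetson.ok():
        # jetson.stats is a dictionary of CPU/GPU/memory/temperature values.
        print(jetson.stats)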

ROS 2 packages for the DeepStream SDK

The DeepStream SDK delivers a complete streaming analytics toolkit for building full AI-based solutions using multisensor processing, video, and image understanding. It offers support for popular object detection and segmentation models such as state-of-the-art SSD, YOLO, FasterRCNN, and MaskRCNN.

In this repository, we provide ROS 2 nodes based on the NVIDIA-AI-IOT/deepstream_python_apps repo to perform two inference tasks, object detection and attribute classification:

  • Object detection: Four classes of objects are detected: Vehicle, Person, RoadSign, and TwoWheeler.
  • Attribute classification: Three types of attributes are classified for objects of class Vehicle: Color, Make, and Type.

We also provide sample ROS 2 subscriber nodes that subscribe to these topics and display results in the vision_msgs format. Each inference task also spawns a visualization window with bounding boxes and labels around detected objects.

For more information, see Implementing Robotics Applications with ROS 2 and AI on the NVIDIA Jetson Platform.

ROS-based Chameleon project: Understanding semantic obstacles with deep learning

GitHub:

This promising work looks at the potential to use the power of robotics and deep learning together. We use FCN-AlexNet, a segmentation network, to perform several real-world applications such as detecting stairs, potholes, or other hazards to robots in unstructured environments.

CUDA-accelerated Point Cloud Library

GitHub: NVIDIA-AI-IOT/cuda-pcl

Many Jetson users choose lidars as their major sensors for localization and perception in autonomous solutions. CUDA-PCL 1.0 includes three CUDA-accelerated PCL libraries:

  • CUDA-ICP
  • CUDA-Segmentation
  • CUDA-Filter

For more information, see Accelerating Lidar for Robotics with NVIDIA CUDA-based PCL.

NVIDIA Isaac Sim for robotics applications

Figure 2. Waveshare JetBot in NVIDIA Isaac Sim.

For more information, see: Training Your NVIDIA JetBot to Avoid Collisions Using NVIDIA Isaac Sim.

Building

Here are sample projects to leverage the NVIDIA Jetson platform for both the open-source developer community, such as building an autonomous model-scale car, and enterprises, such as implementing human pose estimation for robot arm solutions. All are enabled by ROS, ROS 2 and NVIDIA Jetson.

ROS 2-based NanoSaur

NanoSaur is an open-source project designed and made by Raffaello Bonghi. It's a fully 3D-printable robot, made to work on your desk, that uses a simple camera and two OLED "eyes". It measures 10x12x6 cm and weighs only 500 g. With a simple power bank, it can wander your desktop autonomously. It's a little robot for robotics and AI education.

For more information, see About NanoSaur.

ROS and ROS 2 integration with Comau North America

This package demonstrates using a ROS 2 package to control the e.DO by bridging messages to ROS1, where the e.DO core package resides.

Video 1. Using NVIDIA Jetson and GPU accelerated gesture classification AI package with the Comau e.DO robot arm.

To test the Human Hand Pose Estimation package, the team used a Gazebo simulation of the Comau e.DO from Stefan Profanter’s open source repository. This enabled control of the e.DO in simulation with the help of MoveIt Motion Planning software. A ROS 2 node in the hand pose package publishes the hand pose classification message.

Because MoveIt 1.0 works only with ROS1, a software bridge was used to subscribe to the message from ROS1. Based on the hand pose detected and classified, a message with robot pose data is published to a listener, which sends the movement command to MoveIt. The resulting change in the e.DO robot pose can be seen in Gazebo.

ROS-based Yahboom DOFBOT

DOFBOT is the best partner for AI beginners, programming enthusiasts, and Jetson Nano fans. It is designed around the Jetson Nano and contains six HQ servos, an HD camera, and a multifunction expansion board. The whole body is made of green oxidized aluminum alloy, which is attractive and durable. Through the ROS robot system, the motion control of the serial bus servos is simplified.

For more information, see Yahboom DOFBOT AI Vision Robotic Arm with ROS Python programming for Jetson Nano 4GB B01.

ROS package for JetBot

GitHub: dusty-nv/jetbot_ros

JetBot is an open-source robot based on NVIDIA Jetson Nano:

  • Affordable: Less than $150 as an add-on to Jetson Nano.
  • Educational: Includes tutorials from basic motion to AI-based collision avoidance.
  • Fun: Interactively programmed from your web browser.

Building and using JetBot gives you practical experience for creating entirely new AI projects. To get started, read the JetBot documentation.

Summary

Keep yourself updated with ROS and ROS 2 support on NVIDIA Jetson.

Categories
Misc

GTC Sessions Now Available on NVIDIA On-Demand

With over 1600 sessions on the latest in AI, data center, accelerated computing, healthcare, intelligent networking, game development, and more – there is something for everyone.


Categories
Misc

Any reason to use a Coral Edge TPU or Jetson Nano/NX/etc if desktop CPU is available?


Let's say for object detection on a video feed, would there be any value in using Google Coral Edge USB TPUs or running TensorFlow on a Jetson device if the alternative is something like an Intel NUC 10 i7 (Core i7-10710U, Passmark score: 10.1k) with NVMe storage?

submitted by /u/GoingOffRoading