Categories
Misc

Meet the Omnivore: Developer Builds Bots With NVIDIA Omniverse and Isaac Sim

While still in grad school, Antonio Serrano-Muñoz has helped author papers spanning planetary gravities, AI-powered diagnosis of rheumatoid arthritis, and robots that precisely track millimeter-sized walkers, like ants.

Categories
Misc

Turbocharging Multi-Cloud Security and Application Delivery with VirtIO Offloading

By accelerating VirtIO-net in hardware, poor network performance can be avoided while maintaining a transparent software implementation, including full support for VM live migration.

The rapid growth of traffic within data centers, along with the increased adoption of virtualization, is straining traditional data center infrastructure.

Customarily, virtual machines rely on software interfaces such as VirtIO to connect with the hypervisor. Although VirtIO is significantly more flexible than SR-IOV, it can use up to 50% more compute power in the host, thus reducing the servers’ overall efficiency.

Similarly, the adoption of software-defined data centers is on the rise. Both virtualization and software-defined workloads are extremely CPU-intensive. This creates inefficiencies that reduce overall performance system-wide. Furthermore, infrastructure security is potentially compromised as the application domain and networking domain are not separated.

F5 and NVIDIA recently presented how to solve these challenges at NVIDIA GTC. F5 discussed accelerating its BIG-IP Virtual Edition (VE) virtualized appliance portfolio by offloading VirtIO to the NVIDIA BlueField-2 data processing unit (DPU) and ConnectX-6 Dx SmartNIC. In the session, they explained how the DPU provides optimal acceleration and offload through its onboard networking ASIC and Arm processor cores, freeing CPU cores to focus on application workloads.

Offloading to the DPU also provides domain isolation to secure resources more tightly. Support for VirtIO also enables dynamic composability, creating a software-defined, hardware-accelerated solution that significantly decreases reliance on the CPU while maintaining the flexibility that VirtIO offers.

Virtual switching acceleration

Figure 1. Offloading VirtIO moves the virtual datapath out of software and into the hardware of the SmartNIC or DPU, where it can be accelerated

Virtual switching was born as a consequence of server virtualization. Hypervisors need the ability to enable transparent traffic switching between VMs and with the outside world.

One of the most commonly used virtual switching software solutions is Open vSwitch (OVS). NVIDIA Accelerated Switching and Packet Processing (ASAP2) technology accelerates virtual switching to improve performance in software-defined networking environments.

ASAP2 supports using vDPA to offload virtual switching (the OVS data plane) to hardware while keeping the control plane in software. This permits flow rules to be programmed into the eSwitch within the network adapter or DPU and allows the use of standard APIs and common libraries such as DPDK, delivering significantly higher OVS performance without the associated CPU load.
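As a rough illustration of the configuration involved (not an official procedure), hardware offload is typically enabled with standard ethtool and Open vSwitch commands; the small Python helper below simply shells out to them. Interface and service names are placeholders and vary by NIC, driver, and distribution.

```python
import subprocess

# Hedged sketch: enable OVS hardware offload on a host with a supported
# SmartNIC/DPU. Interface and service names are placeholders; consult your
# NIC and OVS documentation for the exact procedure.
def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

PF = "enp3s0f0"  # placeholder uplink/physical function name

# Enable TC flower hardware offload on the uplink.
run(["ethtool", "-K", PF, "hw-tc-offload", "on"])

# Tell OVS to program matched flows into the NIC/DPU eSwitch.
run(["ovs-vsctl", "set", "Open_vSwitch", ".", "other_config:hw-offload=true"])

# OVS must be restarted for the setting to take effect
# (service name varies by distribution).
run(["systemctl", "restart", "openvswitch-switch"])

# Later, verify which flows the hardware datapath is handling.
run(["ovs-appctl", "dpctl/dump-flows", "type=offloaded"])
```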

ASAP2 also supports SR-IOV for hardware acceleration of the data plane. The combination of the two capabilities provides a software-defined, hardware-accelerated solution that resolves the performance issues associated with virtual SDN vSwitching solutions.

Accelerated networking

Earlier this year, NVIDIA released NVIDIA DOCA, a framework that simplifies application development for BlueField DPUs. DOCA makes it easier to program and manage the BlueField DPU. Applications developed using DOCA for BlueField will also run without changes on future versions, ensuring forward compatibility.

DOCA consists of industry-standard APIs, libraries, and drivers. One of these drivers is DOCA VirtIO-net, which provides VirtIO interface acceleration. When using BlueField, the VirtIO interface runs on the DPU hardware. This reduces the CPU’s involvement and accelerates VirtIO performance while enabling features such as live migration.

Figure 2. Performance advantages of VirtIO offloading: higher throughput, shorter processing time, and more packets processed

BIG-IP VE results

During the joint GTC session, F5 demonstrated the advantages of hardware acceleration versus running without it. The demonstration showed BIG-IP VE performing SSL termination for NGINX, with the Tsung traffic generator sending 512K-byte packets through multiple instances of BIG-IP VE.

With VirtIO running on the host, maximum throughput reached only 5 Gbps, the test took 187 seconds to complete, and only 80% of the packets were processed.

The same scenario with hardware acceleration reached 16 Gbps of throughput, completed in only 62 seconds, and processed 100% of the packets.
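As a quick sanity check in Python, using only the figures quoted above:

```python
# Quick check of the relative gains, using only the figures quoted above.
host_gbps, dpu_gbps = 5, 16
host_secs, dpu_secs = 187, 62

print(f"Throughput increase: {dpu_gbps / host_gbps:.1f}x")              # 3.2x
print(f"Processing time reduction: {1 - dpu_secs / host_secs:.1%}")     # ~66.8%
print("Packets processed: 80% (host-only) vs 100% (hardware-accelerated)")
```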

Summary

Increasing network speeds, virtualization, and software-defined networking are adding strain on data center systems and creating a need for efficiency improvements.

VirtIO is a well-established I/O virtualization interface but has a software-only framework. SR-IOV technology was developed precisely to support high performance and efficient offload and acceleration of network functionality, but it requires a specific driver in each VM. By accelerating VirtIO-net in hardware, you can avoid poor network performance while maintaining transparent software implementation, including full support for VM live migration.

The demonstration with F5 Networks showed a 3.2x increase in throughput, a 66% reduction in processing time, and 100% of packets processed. This is evidence that the way forward is hardware vDPA, which combines the out-of-the-box availability of VirtIO drivers with the performance gains of DPU hardware acceleration.

This session was presented simulive at NVIDIA GTC and can be replayed. For more information about the joint F5-NVIDIA solution, which demonstrates reduced CPU utilization while achieving high performance with VirtIO, see the GTC session Multi-cloud Security and Application Delivery with VirtIO.

Categories
Offsites

Enhancing Backpropagation via Local Loss Optimization

While model design and training data are key ingredients in a deep neural network’s (DNN’s) success, less-often discussed is the specific optimization method used for updating the model parameters (weights). Training DNNs involves minimizing a loss function that measures the discrepancy between the ground truth labels and the model’s predictions. Training is carried out by backpropagation, which adjusts the model weights via gradient descent steps. Gradient descent, in turn, updates the weights by using the gradient (i.e., derivative) of the loss with respect to the weights.

The simplest weight update corresponds to stochastic gradient descent, which, in every step, moves the weights in the direction of the negative gradient (with an appropriate step size, a.k.a. the learning rate). More advanced optimization methods modify the direction of the negative gradient before updating the weights by using information from the past steps and/or the local properties (such as the curvature information) of the loss function around the current weights. For instance, a momentum optimizer encourages moving along the average direction of past updates, and the AdaGrad optimizer scales each coordinate based on the past gradients. These optimizers are commonly known as first-order methods since they generally modify the update direction using only information from the first-order derivative (i.e., gradient). Importantly, the components of the weight parameters are treated independently of each other.
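For concreteness, here is a minimal NumPy sketch of these per-coordinate, first-order update rules (plain SGD, momentum, and AdaGrad); the hyperparameter values are illustrative only.

```python
import numpy as np

# Minimal sketch of the first-order update rules described above.
# Hyperparameters are illustrative, not tuned values.
lr, beta, eps = 0.1, 0.9, 1e-8
w = np.zeros(3)                    # weights
m = np.zeros_like(w)               # momentum buffer
s = np.zeros_like(w)               # AdaGrad accumulated squared gradients

def sgd_step(w, g):
    return w - lr * g                        # move along the negative gradient

def momentum_step(w, g):
    global m
    m = beta * m + g                         # average direction of past updates
    return w - lr * m

def adagrad_step(w, g):
    global s
    s += g ** 2                              # per-coordinate gradient statistics
    return w - lr * g / (np.sqrt(s) + eps)   # scale each coordinate independently

g = np.array([0.5, -1.0, 2.0])               # a stand-in gradient
print(sgd_step(w, g), momentum_step(w, g), adagrad_step(w, g))
```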

More advanced optimizers, such as Shampoo and K-FAC, capture the correlations between gradients of parameters and have been shown to improve convergence, reducing the number of iterations and improving the quality of the solution. These methods capture information about the local changes of the derivatives of the loss, i.e., changes in gradients. Using this additional information, higher-order optimizers can discover much more efficient update directions for training models by taking into account the correlations between different groups of parameters. On the downside, calculating higher-order update directions is computationally more expensive than first-order updates. The operation uses more memory for storing statistics and involves matrix inversion, thus hindering the applicability of higher-order optimizers in practice.
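The toy sketch below, a simplified full-matrix preconditioned update rather than Shampoo or K-FAC themselves, shows where the extra cost comes from: a d x d statistics matrix must be stored and an inverse matrix root computed at every step.

```python
import numpy as np

# Simplified illustration (not Shampoo or K-FAC): a full-matrix preconditioned
# update. Storing the d x d statistics matrix and computing its inverse square
# root is what makes higher-order methods memory- and compute-hungry.
rng = np.random.default_rng(0)
d = 256
w = rng.normal(size=d)
H = np.eye(d) * 1e-3                      # accumulated gradient statistics (d x d)

for step in range(10):
    g = rng.normal(size=d)                # stand-in for a mini-batch gradient
    H += np.outer(g, g)                   # update second-moment statistics
    vals, vecs = np.linalg.eigh(H)        # eigendecomposition: O(d^3) per step
    precond = vecs @ np.diag(vals ** -0.5) @ vecs.T
    w -= 0.01 * precond @ g               # preconditioned update direction
```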

In “LocoProp: Enhancing BackProp via Local Loss Optimization”, we introduce a new framework for training DNN models. Our new framework, LocoProp, conceives neural networks as a modular composition of layers. Generally, each layer in a neural network applies a linear transformation on its inputs, followed by a non-linear activation function. In the new construction, each layer is allotted its own weight regularizer, output target, and loss function. The loss function of each layer is designed to match the activation function of the layer. Using this formulation, training minimizes the local losses for a given mini-batch of examples, iteratively and in parallel across layers. Our method performs multiple local updates per batch of examples using a first-order optimizer (like RMSProp), which avoids computationally expensive operations such as the matrix inversions required for higher-order optimizers. However, we show that the combined local updates look rather like a higher-order update. Empirically, we show that LocoProp outperforms first-order methods on a deep autoencoder benchmark and performs comparably to higher-order optimizers, such as Shampoo and K-FAC, without the high memory and computation requirements.

Method
Neural networks are generally viewed as composite functions that transform model inputs into output representations, layer by layer. LocoProp adopts this view while decomposing the network into layers. In particular, instead of updating the weights of the layer to minimize the loss function at the output, LocoProp applies pre-defined local loss functions specific to each layer. For a given layer, the loss function is selected to match the activation function, e.g., a tanh loss would be selected for a layer with a tanh activation. Each layerwise loss measures the discrepancy between the layer’s output (for a given mini-batch of examples) and a notion of a target output for that layer. Additionally, a regularizer term ensures that the updated weights do not drift too far from the current values. The combined layerwise loss function (with a local target) plus regularizer is used as the new objective function for each layer.

Similar to backpropagation, LocoProp applies a forward pass to compute the activations. In the backward pass, LocoProp sets per neuron “targets” for each layer. Finally, LocoProp splits model training into independent problems across layers where several local updates can be applied to each layer’s weights in parallel.

Perhaps the simplest loss function one can think of for a layer is the squared loss. While the squared loss is a valid choice of a loss function, LocoProp takes into account the possible non-linearity of the activation functions of the layers and applies layerwise losses tailored to the activation function of each layer. This enables the model to emphasize regions at the input that are more important for the model prediction while deemphasizing the regions that do not affect the output as much. Below we show examples of tailored losses for the tanh and ReLU activation functions.

Loss functions induced by the (left) tanh and (right) ReLU activation functions. Each loss is more sensitive to the regions affecting the output prediction. For instance, ReLU loss is zero as long as both the prediction (â) and the target (a) are negative. This is because the ReLU function applied to any negative number equals zero.

After forming the objective in each layer, LocoProp updates the layer weights by repeatedly applying gradient descent steps on its objective. The update typically uses a first-order optimizer (like RMSProp). However, we show that the overall behavior of the combined updates closely resembles higher-order updates (shown below). Thus, LocoProp provides training performance close to what higher-order optimizers achieve without the high memory or computation needed for higher-order methods, such as matrix inverse operations. We show that LocoProp is a flexible framework that allows the recovery of well-known algorithms and enables the construction of new algorithms via different choices of losses, targets, and regularizers. LocoProp’s layerwise view of neural networks also allows updating the weights in parallel across layers.
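As a rough illustration (a simplified sketch, not the implementation from the paper), the following PyTorch code performs local updates for a single tanh layer using a plain squared local loss rather than an activation-matched loss; the target construction, regularizer strength, and step count are illustrative assumptions.

```python
import torch

# Heavily simplified LocoProp-style local updates for one tanh layer.
# The squared local loss and all hyperparameters are illustrative assumptions.
torch.manual_seed(0)
x = torch.randn(32, 64)                       # mini-batch of layer inputs
W = torch.randn(64, 32, requires_grad=True)   # this layer's weights

def layer(inp, weight):
    return torch.tanh(inp @ weight)

# Forward pass, then a stand-in for the gradient arriving from the layers above.
y = layer(x, W)
downstream_loss = y.pow(2).mean()             # placeholder for the network loss
(g,) = torch.autograd.grad(downstream_loss, y)
target = (y - 1.0 * g).detach()               # per-neuron target: output nudged along -gradient

W0 = W.detach().clone()                       # anchor for the proximity regularizer
opt = torch.optim.RMSprop([W], lr=1e-3)       # first-order optimizer for the local steps

for _ in range(10):                           # several local updates on the same mini-batch
    opt.zero_grad()
    local_loss = (layer(x, W) - target).pow(2).mean()        # layerwise loss
    local_loss = local_loss + 1e-3 * (W - W0).pow(2).sum()   # stay close to the current weights
    local_loss.backward()
    opt.step()
```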

Experiments
In our paper, we describe experiments on the deep autoencoder model, which is a commonly used baseline for evaluating the performance of optimization algorithms. We perform extensive tuning on multiple commonly used first-order optimizers, including SGD, SGD with momentum, AdaGrad, RMSProp, and Adam, as well as the higher-order Shampoo and K-FAC optimizers, and compare the results with LocoProp. Our findings indicate that LocoProp performs significantly better than first-order optimizers and is comparable to higher-order methods, while being significantly faster when run on a single GPU.

Train loss vs. number of epochs (left) and wall-clock time, i.e., the real time that passes during training, (right) for RMSProp, Shampoo, K-FAC, and LocoProp on the deep autoencoder model.

Summary and Future Directions
We introduced a new framework, called LocoProp, for optimizing deep neural networks more efficiently. LocoProp decomposes neural networks into separate layers with their own regularizer, output target, and loss function and applies local updates in parallel to minimize the local objectives. While using first-order updates for the local optimization problems, the combined updates closely resemble higher-order update directions, both theoretically and empirically.

LocoProp provides flexibility to choose the layerwise regularizers, targets, and loss functions. Thus, it allows the development of new update rules based on these choices. Our code for LocoProp is available online on GitHub. We are currently working on scaling up ideas induced by LocoProp to much larger scale models; stay tuned!

Acknowledgments
We would like to thank our co-author, Manfred K. Warmuth, for his critical contributions and inspiring vision. We would like to thank Sameer Agarwal for discussions looking at this work from a composite functions perspective, Vineet Gupta for discussions and development of Shampoo, Zachary Nado on K-FAC, Tom Small for development of the animation used in this blogpost and finally, Yonghui Wu and Zoubin Ghahramani for providing us with a nurturing research environment in the Google Brain Team.

Categories
Misc

Getting Started with the Deep Learning Accelerator on NVIDIA Jetson Orin

Learn how to free your Jetson GPU for additional tasks by deploying neural network models on the NVIDIA Jetson Orin Deep Learning Accelerator (DLA).

If you’re an active Jetson developer, you know that one of the key benefits of NVIDIA Jetson is that it combines a CPU and GPU into a single module, giving you the expansive NVIDIA software stack in a small, low-power package that can be deployed at the edge. 

Jetson also features a variety of other processors, including hardware accelerated encoders and decoders, an image signal processor, and the Deep Learning Accelerator (DLA). 

The DLA is available on Jetson AGX Xavier, Xavier NX, Jetson AGX Orin and Jetson Orin NX modules. The recent NVIDIA DRIVE Xavier and Orin-based platforms also have DLA cores. 

If you use the GPU for deep learning execution, read on to learn more about DLA, why it’s useful, and how to use it.  

Overview of the Deep Learning Accelerator

The DLA is an application-specific integrated circuit that is capable of efficiently performing fixed operations, like convolutions and pooling, that are common in modern neural network architectures. Though the DLA doesn’t have as many supported layers as the GPU, it still supports a wide variety of layers used in many popular neural network architectures.  

In many instances, the layer support may cover the requirements of your model. For example, the NVIDIA TAO Toolkit includes a wide variety of pre-trained models that are supported by the DLA, ranging from object detection to action recognition.  

While it’s important to note that the DLA throughput is typically lower than that of the GPU, it is power-efficient and allows you to offload deep learning workloads, freeing the GPU for other tasks. Alternatively, depending on your application, you can run the same model on the GPU and DLA simultaneously to achieve higher net throughput.

Many NVIDIA Jetson developers are already using the DLA to successfully optimize their applications. Postmates optimized their delivery robot application on Jetson AGX Xavier leveraging the DLA along with the GPU. The Cainiao ET Lab used the DLA to optimize their logistics vehicle.  If you’re looking to fully optimize your application, the DLA is an important piece in the Jetson repertoire to consider. 

How to use the Deep Learning Accelerator

Figure 1. A coarse architecture diagram highlighting the Deep Learning Accelerators on Jetson Orin

To use the DLA, you first need to train your model with a deep learning framework like PyTorch or TensorFlow. Next, you need to import and optimize your model with NVIDIA TensorRT. TensorRT is responsible for generating the DLA engines and can also be used as a runtime for executing them. Finally, you should profile your model and make optimizations where possible to maximize DLA compatibility.
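To illustrate the TensorRT step, the sketch below builds a DLA-targeted engine from an ONNX export of a trained model. This is a generic TensorRT recipe rather than the Jetson_dla_tutorial code, and the file names are hypothetical; GPU fallback covers any layers the DLA does not support.

```python
import tensorrt as trt

# Minimal sketch (not the Jetson_dla_tutorial code) of building a TensorRT engine
# that targets the DLA from an ONNX export. File names are hypothetical.
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:                  # hypothetical exported model
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.default_device_type = trt.DeviceType.DLA      # run supported layers on the DLA
config.DLA_core = 0                                  # select a DLA core
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)        # fall back to GPU for unsupported layers
config.set_flag(trt.BuilderFlag.FP16)                # DLA requires FP16 or INT8 precision

engine_bytes = builder.build_serialized_network(network, config)
with open("model_dla.engine", "wb") as f:
    f.write(engine_bytes)
```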

Get started with the Deep Learning Accelerator

Ready to dive in? The Jetson_dla_tutorial GitHub project demonstrates a basic DLA workflow to help you in your journey toward optimizing your application for Jetson. 

With the tutorial, you can learn how to define the model in PyTorch, import the model with TensorRT, analyze the performance using the NVIDIA Nsight System profiler, modify the model for better DLA compatibility, and calibrate for INT8 execution. Note that the CIFAR10 dataset is used as a toy example to facilitate reproducing the steps.  

Explore the Jetson_dla_tutorial to get started.

Categories
Misc

What Is a QPU?

Just as GPUs and DPUs enable accelerated computing today, they’re also helping a new kind of chip, the QPU, boot up the promise of quantum computing. In your hand, a quantum processing unit might look and feel very similar to a graphics or a data processing unit. They’re all typically chips, or modules with multiple…

Categories
Misc

Evaluating Data Lakes and Data Warehouses as Machine Learning Data Repositories

Data lakes can ingest a wide range of data types for big data and AI repositories. Data warehouses use structured data, mainly from business applications, with a focus on data transformation.

Data is the lifeblood of modern enterprises, whether you’re a retailer, financial service company, or digital advertiser. Across industries, organizations are recognizing the importance of their data for business analytics, machine learning, and AI.

Smart businesses are investing in new ways to extract value from their data: to better understand customer needs and behaviors, tailor new products and services, and make strategic decisions that will deliver competitive advantages in the years to come.

For decades, enterprise data warehouses have been used for all types of business analytics, with a robust ecosystem around SQL and relational databases. Now, a challenger has emerged.

Data lakes were created to store big data for training AI models and predictive analytics. This post covers the pros and cons of each repository: how they are used and, ultimately, which delivers the best outcomes for ML projects.

Key to this puzzle is processing data for AI and ML workflows. AI projects require massive amounts of data for training models and running predictive analytics. Technical teams must evaluate how to capture, process, and store data so that it is scalable, affordable, and readily available.

What’s a data warehouse?

Data warehouses were created in the 1980s to help enterprise companies organize high data volumes for the purpose of making better business decisions. Data warehouses are used with legacy sources such as enterprise resource planning (ERP), customer relationship management (CRM) software, inventory, and point-of-sale systems.

The primary goal is to provide operational reporting across lines of business, product analytics, and business intelligence.

Data warehouses have used ETL (extract, transform, load) for decades, with a bias toward transforming and cleaning data before loading it. Traditional data warehouses have stringent standards for data structure and require advance planning to meet schema requirements.

  • Data is only stored in data warehouses after it has been processed and refined. ETL first cleans the data and then loads it into a relational database. The upside is that the data is in good shape and ready to use. However, there is processing overhead that you pay up front, which is wasted if the data is never used.
  • Data analysts must create a predetermined data structure and fixed schema before they can run their queries. This blocker is a huge pain point for data scientists, analysts, and other lines of business, as it takes months or longer to be able to run new queries.
  • Typically, the data in a warehouse is read-only, so it can be difficult to add, update, or delete data files.

Upside: Data quality

With any system, there are tradeoffs. The upside of data warehouses is their data is in good shape at ingest and will likely stay that way due to the discipline of data cleansing and data governance.

Traditional data warehouses excel as ledgers, providing clean, structured, and normalized data that serves as a single source of truth for an organization. Using relational databases, managers and business analysts across the organization can query massive amounts of corporate data quickly and accurately, to guide critical business strategies.

Downside: Schema requirement

Data warehouses are more likely to use ETL for operational analytics and machine learning workloads.

However, traditional data warehouses require a fixed schema for structuring the data, which could take months or years to agree across all teams and lines of business managers. By the time a schema gets implemented, its users have new queries, taking them back to square one.

It’s fair to say that data warehouse schema drove immense interest in data lakes.

Why use a data lake?

In the early 2000s, Apache Hadoop introduced a new paradigm for storing data in a distributed file system (HDFS) so enterprises could more easily mine their data for competitive advantage. The idea of a data lake came from Hadoop, enabling ingest of a wide spectrum of data types stored in low-cost blob or object storage.

Over the last decade, organizations have flocked to data lakes to capture diverse data types from the web, social media, sensors, Internet of Things, weather data, purchased lists, and so on. As big data got bigger, data lakes became a popular way to store petabytes of raw data using elastic technologies.

Data lakes have two main draws: the easy ingest of a wide spectrum of data types and ready access to that data for ad hoc queries.

  • Using ELT (extract, load, transform), data lakes can ingest almost any data type: structured, unstructured, semistructured, and binary for images and video.
  • Data going into a data lake does not have to be transformed before it is stored. Ingest is efficient, without the overhead of cleansing and normalizing data by type.
  • Data lakes make it easy to store all types of data (PDFs, audio, JSON documents) without knowing how that data might be used in the future.

Upside: Ad hoc queries

The upside of data lakes is that teams can access diverse data and run arbitrary queries on demand. The need to have data analytics available immediately is the main driver for adoption of data lakes. 

Downside: Data degradation over time

Raw data goes bad fast in a data lake. There are few tools to tame raw data, making it hard to do merges, deduplication, and data continuity.

What do data warehouses and data lakes have in common?

Data warehouses and data lakes both function as large data repositories and have common characteristics and drawbacks, especially around cost and complexity.

  • Scale: Both have the capability to retain massive amounts of data, using both batch and streaming.
  • High costs: Both are expensive, often costing more than a million dollars a year to maintain.
  • Complexity:  Data centers are managing dozens of unique data sources, with rapid volume growth of 50% a year or more. Storage infrastructure is taking more IT person hours, raising storage costs and driving down overall efficiency.
  • Data processing: Both can use ETL and ELT processing.
  • Shared use cases: As data scientists prioritize ML techniques to derive new insights from their data, many organizations are now getting the best of both worlds: AI-enabled data analytics and a wide range of diverse data types.

What’s the difference between data warehouses and data lakes?

Comparing data warehouses to data lakes is a bit like comparing apples and oranges. They offer different things.

  • A data warehouse organizes, cleans, and stores data for analysis.
  • A data lake stores many data types and transforms them on demand.

As teams become more focused on AI projects, the gaps in functionality, manageability, and data quality issues come to light, causing both approaches to evolve and improve.

Deployment

Data warehouses are more likely to be on-premises or in a hybrid cloud. Data lakes are more likely to be cloud-based to take advantage of more affordable storage options.

Data processing

Data warehouses are more likely to use ETL for operational analytics and machine learning workloads. Data lakes ingest raw data using ELT pipelines, in case it is needed in the future. Data lakes also do not require a schema, so teams can pose ad hoc queries without significant delay.
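Here is a toy Python sketch of the difference; the schema, file names, and cleaning rules are hypothetical, and real pipelines would target warehouse and lake services rather than SQLite and local JSON files.

```python
import json
import sqlite3
import pandas as pd

# Toy contrast between ETL and ELT. Schema, file names, and cleaning rules are
# hypothetical; real pipelines target warehouse/lake services, not local files.
raw = [{"user": "a", "amount": "19.99 "}, {"user": "b", "amount": None}]

# ETL (warehouse-style): clean and conform records *before* loading them
# into a relational store with a fixed schema.
cleaned = pd.DataFrame(raw).dropna()
cleaned["amount"] = cleaned["amount"].str.strip().astype(float)
with sqlite3.connect("warehouse.db") as con:
    cleaned.to_sql("sales", con, if_exists="replace", index=False)

# ELT (lake-style): land the raw records as-is; transform only when a query needs them.
with open("lake_sales.json", "w") as f:
    json.dump(raw, f)                          # raw data remains the source of truth
with open("lake_sales.json") as f:
    landed = pd.DataFrame(json.load(f))
on_demand = landed.dropna().assign(
    amount=lambda d: d["amount"].str.strip().astype(float))
print(on_demand)
```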

Tools

Data lakes lack the robustness of a data warehouse, in terms of a functioning programming model and mature, enterprise-ready software and tools. Data lakes have a myriad of pain points, including no support for transactions, atomicity, or data governance.

Data quality

It’s always a problem. It’s a bigger problem for data lakes. Expect to do a lot of monitoring and maintenance on data in a data lake. If you can’t manage raw data efficiently, you can end up with a data swamp, where performance is poor and storage costs are out of control.

Roughly 85% of data lakes fail, Gartner estimates, due to low-quality data. As the adage says: Data pipelines are only as good as the data that flows through them.

Buy compared to build

Companies like Teradata, Oracle, and IBM can sell you a data warehouse for millions of dollars. Storage is one of the most expensive components, as average companies are seeing data volumes growing more than 50% a year.

To get a data lake, most companies build their own on a free PaaS using open-source Apache Spark, Kafka, or ZooKeeper. This does not mean that building and maintaining a data lake is less expensive, however.

By one estimate, it can cost upwards of a million dollars each year to hire the people for deploying a production data lake with cloud storage. Standing up a data lake can take 6 months to a year, if you can get the expertise.

What’s best for ML workloads?

The short answer is both. Most companies will use both a data warehouse and a data lake for AI projects. Here’s why.

Data lakes are popular because they can scale up for big data—petabytes or exabytes—without breaking the bank. However, data lakes do not offer an end-to-end solution for ML workloads, due to constraints in their programming model.

Many organizations adopted the Hadoop paradigm, only to find that it was nearly impossible to get highly skilled talent to extract data from a data lake using MapReduce. The introduction and development of Apache Spark has kept data lakes afloat, making it easier to access data.

Still, the Hadoop model has not fulfilled its promise for ML. Data lakes’ ongoing pain points include a lack of atomicity, poor performance, lack of semantic updates, and an evolving Spark engine for SQL.

Compare that to a data warehouse, which is compatible with an entire SQL ecosystem. Any software written for a SQL backend can access enterprise software. The methods range from WYSIWYG frontends and drag-and-drop interfaces to automatically generated dashboards to fully automated cube and hypercube analysis, to name a few.

All the business intelligence and data analytics work of the last 30 years is inherited in SQL databases. None of it works on Hadoop or in data lakes.

More data warehouses now support the ELT approach commonly used by data lakes. A primary use case for data lakes is to ingest data into a data warehouse, so that data can be extracted and structured for ML projects. ELT enables data scientists both to define a way to structure data and to query it, while keeping raw data as a source of truth.

The promise of a data lakehouse

For data engineers looking for a more robust data solution for their big data needs, a data lakehouse (a combination of a data lake and data warehouse) promises to address the drawbacks of data lakes.

A data lakehouse can offer data integrity and governance, support for transactions, and ongoing high performance, with help from an open-source framework called Delta Lake.

Hybrid cloud options

If you’re just starting with AI data architectures, companies like Amazon and Google are offering cloud-based data warehouses (Amazon Redshift, Google BigQuery) to help lower storage and deployment costs.

CoreDB is an open-source database service that functions as a data lake as a service under an Apache license.

Conclusion

Data warehouses and data lakes are both useful approaches for taming big data and advancing ML analytics. Data lakes are a recent approach to storing massive amounts of data in commercial clouds, such as Amazon S3 and Azure Blob Storage.

The definitions of data warehouse and data lakes are evolving. Each approach is experimenting with new data processes and models for novel use cases. Going forward, techniques for optimizing performance will be critical, both for managing costs and for monitoring data hygiene in large repositories.

A data lake offers a more flexible solution for data analytics and can process and store data at a low price. However, the Hadoop data lake paradigm does not offer a fully functional solution for machine learning at scale today. Many organizations are forging new tactics and trying new tools to enable better functionality for both data warehouses and data lakes in the near future.

Categories
Misc

1,650+ Global Interns Gleam With NVIDIA Green

A record number of interns calls for a record-sized celebration. In our largest contingent ever, over 1,650 interns from 350+ schools started with NVIDIA worldwide over the past year. Amidst busy work days tackling real-world projects across engineering, automation, robotics and more, the group’s also finishing up a three-day celebration, culminating today with National Intern Day.

Categories
Misc

Experience the Ease of AI Model Creation with the TAO Toolkit on LaunchPad

The TAO Toolkit lab on LaunchPad has everything you need to experience the end-to-end process of fine-tuning and deploying an object detection application.

Building AI models from scratch is incredibly difficult, requiring mountains of data and an army of data scientists. With the NVIDIA TAO Toolkit, you can use the power of transfer learning to fine-tune NVIDIA pretrained models with your own data and optimize them for inference—without AI expertise or large training datasets.

You can now experience the TAO Toolkit through NVIDIA LaunchPad, a free program that provides short-term access to a large catalog of hands-on labs. 

LaunchPad helps developers, designers, and IT professionals speed up the creation and deployment of modern, data-intensive applications. LaunchPad is the best way to enjoy and experience the transformative power of the NVIDIA hardware and software stack working in unison to power your AI applications.  

TAO Toolkit on LaunchPad 

The TAO Toolkit lab on LaunchPad has everything you need to experience the end-to-end process of fine-tuning and deploying an object detection application. 

Object detection is a popular computer vision task that involves classifying objects and drawing bounding boxes around them in images or video frames. It can be used for real-world applications in retail (self-checkout, for example), transportation, manufacturing, and more.
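As a toy illustration of what an object detection result looks like (made-up labels, scores, and box coordinates, rendered with OpenCV):

```python
import numpy as np
import cv2

# Toy illustration of an object detection result: a class label, a confidence
# score, and a bounding box per detection. All values here are made up.
frame = np.zeros((480, 640, 3), dtype=np.uint8)          # stand-in for a video frame
detections = [("person", 0.92, (100, 80, 220, 400)),
              ("basket", 0.81, (300, 250, 420, 380))]

for label, score, (x1, y1, x2, y2) in detections:
    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.putText(frame, f"{label} {score:.2f}", (x1, y1 - 6),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

cv2.imwrite("annotated.jpg", frame)
```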

With the TAO Toolkit, you can also: 

  • Achieve up to 4x inference speedup with built-in model optimization
  • Generalize your model with offline and online data augmentation
  • Scale up and out across multiple GPUs and nodes to speed up model training
  • Visualize and understand model training performance in TensorBoard

The TAO Toolkit lab is preconfigured with the datasets, GPU-optimized pretrained models, Jupyter notebooks, and the necessary SDKs for you to seamlessly accomplish your task. 

Ready to get started? Apply now to access the free lab.  

Learn more about the TAO Toolkit.

Categories
Misc

Pony.ai Express: New Autonomous Trucking Collaboration Powered by NVIDIA DRIVE Orin

More than 160 years after the legendary Pony Express delivery service completed its first route, a new generation of “Pony”-emblazoned vehicles are taking an AI-powered approach to long-haul delivery. Autonomous driving company Pony.ai announced today a partnership with SANY Heavy Truck (SANY), China’s largest heavy equipment manufacturer, to jointly develop level 4 autonomous trucks.

Categories
Misc

Enabling Enterprise Cybersecurity Protection with a DPU-Accelerated, Next-Generation Firewall

Palo Alto Networks and NVIDIA have developed an Intelligent Traffic Offload (ITO) solution to solve the scaling, efficiency, and economic challenges that growing network traffic creates.

Cyberattacks are gaining sophistication and are presenting an ever-growing challenge. This challenge is compounded by an increase in remote workforce connections driving growth in secure tunneled traffic at the edge and core, the expansion of traffic encryption mandates for the federal government and healthcare networks, and an increase in video traffic.

In addition, an increase in mobile and IoT traffic is being generated by the introduction of 5G speeds and the addition of billions of connected devices.

These trends are creating new security challenges that require a new direction in cybersecurity to maintain adequate protection. IT Departments—and firewalls—must inspect exponentially more data and take deeper looks inside traffic flows to address new threats. They must be able to check traffic between virtual machines and containers that run on the same host, traffic that traditional firewall appliances cannot see.

Operators must deploy enough firewalls capable of handling the total traffic throughput, but doing so without sacrificing performance can be extremely cost-prohibitive. This is because general-purpose processors (server CPUs) are not optimized for packet inspection and cannot handle the higher network speeds. This results in suboptimal performance, poor scalability, and increased consumption of expensive CPU cores.

Security applications such as next-generation firewalls (NGFW) are struggling to keep up with higher traffic loads. While software-defined NGFWs offer the flexibility and agility to place firewalls anywhere in modern data centers, scaling them for performance, efficiency, and economics is challenging for today’s enterprises.

Next-generation firewalls

To address these challenges, NVIDIA partnered with Palo Alto Networks to accelerate their VM-Series Next-Generation Firewalls with the NVIDIA BlueField data processing unit (DPU). The DPU accelerates packet filtering and forwarding by offloading traffic from the host processor to dedicated accelerators and Arm cores on the BlueField DPU.

The solution delivers the intrusion prevention and advanced security capabilities of Palo Alto Networks’ virtual NGFWs to every server without sacrificing network performance or consuming the CPU cycles needed for business applications. This hardware-accelerated, software-defined NGFW is a milestone in boosting firewall performance and maximizing data center security coverage and efficiency.

The DPU operates as an intelligent network filter to parse and steer traffic flows based on predefined policies with zero CPU overhead, enabling the NGFW to support close to 100 Gb/s throughput for typical use cases. This is a 5x performance boost versus running the VM-Series firewall on a CPU alone, and up to 150 percent CapEx savings compared to legacy hardware.

Intelligent traffic offload service

The joint Palo Alto Networks-NVIDIA solution creates an intelligent traffic offload (ITO) service that overcomes the challenges of performance, scalability, and efficiency. Integration of the VM-Series NGFWs with the NVIDIA BlueField DPUs turbocharges the NGFW solution to improve cost economics while improving threat detection and mitigation. 

Figure 1. ITO using the Palo Alto Networks NGFW with the BlueField DPU helps enterprises balance performance, security, and cost: roughly 20% of traffic benefits from security inspection, while 80% (video, VoIP, and so on) does not

In certain customer environments, up to 80% of network traffic doesn’t need to be—or can’t be—inspected by a firewall, such as encrypted traffic or streaming traffic from video, gaming, and conferencing. NVIDIA and Palo Alto Networks’ joint solution addresses this through the ITO service, which examines network traffic to determine whether each session would benefit from deep security inspection. 

ITO optimizes firewall resources by checking all control packets but only checking payload flows that require deep security inspection. If the firewall determines that a session would not benefit from security inspection, it inspects the initial packets of the flow, and then ITO instructs the DPU to forward all subsequent packets in that session directly to their destination without sending them through the firewall (Figure 2).

Figure 2. DPU acceleration of the NGFW delivers a 5x performance improvement and reduces the number of CPU cores required for security inspection

By only examining flows that can benefit from security inspection and offloading the rest to the DPU, the overall load on the firewall and the host CPU is reduced, and performance increases without sacrificing security.
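The flow-level logic can be pictured with a small, purely conceptual Python sketch; the class names, the packet threshold, and the offload bookkeeping are hypothetical stand-ins, not Palo Alto Networks or NVIDIA APIs.

```python
from dataclasses import dataclass

# Conceptual sketch of the ITO decision flow described above. Class names, the
# packet threshold, and the "offload" bookkeeping are hypothetical stand-ins.
INSPECT_FIRST_N = 8                       # packets per flow that the firewall examines

@dataclass
class Packet:
    flow_id: str
    payload: bytes

@dataclass
class FlowState:
    inspected: int = 0

flows: dict[str, FlowState] = {}
offloaded_to_dpu: set[str] = set()        # stands in for rules programmed into the eSwitch

def needs_deep_inspection(pkt: Packet) -> bool:
    # Placeholder policy: pretend streaming/encrypted flows are identified here.
    return not pkt.flow_id.startswith("video-")

def handle_packet(pkt: Packet) -> str:
    if pkt.flow_id in offloaded_to_dpu:
        return "forwarded by DPU, firewall bypassed"
    state = flows.setdefault(pkt.flow_id, FlowState())
    state.inspected += 1                  # firewall inspects this packet
    if state.inspected >= INSPECT_FIRST_N and not needs_deep_inspection(pkt):
        offloaded_to_dpu.add(pkt.flow_id) # ITO tells the DPU to take over the flow
    return "inspected by firewall"

for _ in range(10):
    print(handle_packet(Packet("video-1", b"...")))
```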

ITO empowers enterprises to protect end users with an NGFW that can run on every host in a zero-trust environment, helping expedite their digital transformation while keeping them safe from a myriad of cyber threats.

First NGFW to market

To stay ahead of emerging threats, Palo Alto Networks jointly developed the first virtual NGFW to be accelerated by BlueField DPU. The VM-Series firewall enables application-aware segmentation, prevents malware, detects new threats, and stops data exfiltration, all at higher speeds and with less CPU consumption, by offloading these tasks from the host processor to the BlueField DPU.

The DPU operates as an intelligent network filter to parse, classify and steer traffic flows with zero CPU overhead, enabling the NGFW to support close to 100 Gb/s throughput per server for typical use cases. The recently announced DPU-enabled Palo Alto Networks VM-Series NGFW uses zero-trust network security principles.

The ITO solution was presented at NVIDIA GTC during a joint session with Palo Alto Networks. For more information about the ITO service’s role in delivering a software-defined, hardware-accelerated NGFW that addresses ever-evolving cybersecurity threats for enterprise data centers, see the Accelerating Enterprise Cybersecurity with Software-Defined DPU-Powered Firewall GTC session.