Categories
Misc

3D Illustrator Juliestrator Makes Marvelous Mushroom Magic This Week ‘In the NVIDIA Studio’

The warm, friendly animation Mushroom Spirit, modeled by talented 3D illustrator Julie Greenberg, aka Juliestrator, is featured In the NVIDIA Studio this week.

The post 3D Illustrator Juliestrator Makes Marvelous Mushroom Magic This Week ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.

Categories
Misc

Open Source Simulation Expands with NVIDIA PhysX 5 Release

NVIDIA today released the latest version of the NVIDIA PhysX 5 SDK under the same open source license terms as PhysX 4 to help expand simulation workflows and…

NVIDIA today released the latest version of the NVIDIA PhysX 5 SDK under the same open source license terms as PhysX 4 to help expand simulation workflows and applications across global industries. We are pleased to release this much-anticipated update on the NVIDIA-Omniverse/PhysX GitHub repository. 

A longtime GameWorks technology, PhysX has become the primary physics engine and a key foundational technology pillar of NVIDIA Omniverse. It is a powerful simulation engine currently used by industry leaders for robotics, deep reinforcement learning, autonomous driving, factory automation, and visual effects. For next-generation robotics applications, it will enable the high-fidelity, real-time simulations needed to test autonomous machines.

“Having a powerful, open-source tool for physics like NVIDIA’s new PhysX 5 library is a critical part of the realism delivered by the Open 3D Engine,” said Royal O’Brien, Executive Director at the Open 3D Foundation and General Manager of Digital Media and Games at the Linux Foundation.

“As PhysX use cases spread to other important 3D domains like simulation and digital twins, we are excited to see NVIDIA working with open source, allowing everyone to harness the innovation and collaboration that these communities can bring,” O’Brien said.

PhysX has become a key reference implementation of the similarly open source Pixar Universal Scene Description (USD) Physics standard available at PixarAnimationStudios/USD on GitHub. This informed the decision to return to the more permissive licensing terms used for PhysX 4. All CPU source code is available under the simple BSD3 open source license, and NVIDIA GPU binaries are included at no cost.

“This release of the PhysX SDK goes hand in hand with USD Physics, a description of a scene’s physical properties that was co-developed with Pixar,” said Dave Eberle, Tools-Sim Lead at Pixar. “Pixar’s ongoing USD collaboration with NVIDIA and other parties is aimed at enabling creators to imbue physics into their scenes with more ease, and we are excited that the open sourcing of the SDK will accelerate the adoption of simulation behaviors in more creative tools.”
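
To make the USD Physics description concrete, the snippet below is a minimal sketch of how a scene’s physical properties can be authored with the UsdPhysics schemas through the open source pxr Python bindings. It assumes a USD build that includes the UsdPhysics module; the prim paths and file name are illustrative only.

from pxr import Usd, UsdGeom, UsdPhysics

# Author a minimal stage whose physical properties are described with the
# UsdPhysics schemas that PhysX implements as a reference.
stage = Usd.Stage.CreateNew("falling_cube.usda")
UsdPhysics.Scene.Define(stage, "/physicsScene")               # gravity and simulation settings live here

cube = UsdGeom.Cube.Define(stage, "/World/Cube")
UsdPhysics.RigidBodyAPI.Apply(cube.GetPrim())                 # simulate the cube as a rigid body
UsdPhysics.CollisionAPI.Apply(cube.GetPrim())                 # give it a collider
UsdPhysics.MassAPI.Apply(cube.GetPrim()).CreateMassAttr(1.0)  # mass in kilograms

stage.GetRootLayer().Save()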

What’s new in PhysX 5 open source

The NVIDIA Flow and NVIDIA Blast libraries, while technically not dependent on PhysX, are now a part of the PhysX product family and licensed together. They are available in the same GitHub repo.

PhysX 5 SDK now supports the capabilities of NVIDIA Flex, which enables various new features. These features include finite element model-based soft body dynamics as well as liquid, cloth, and inflatable objects using position-based dynamics, optimized to run on GPUs. A signed distance field collision feature on GPU has also been added, which allows the user to perform collision detection using a voxelized version of the source mesh, eliminating the need to create a convex decomposition.

Video 1. An NVIDIA Flow dust emitter moving around a scene in Omniverse Create

In terms of new CPU features, PhysX 5 users can now define custom geometries, meaning cylinder shapes or implicit block-based worlds can now be supported. Both CPU and GPU parallel computing performance for large simulations has been significantly improved.

The evolved role of PhysX also brings some fundamental technical changes. Formerly a game physics engine with optimized ports available for a broad range of video game consoles, PhysX is now a high-fidelity GPU-accelerated physics simulation engine used in robotics, deep reinforcement learning, autonomous driving, factory automation, and visual effects, just to name a few.  As a result, video game console ports are no longer available from NVIDIA, though given our permissive licensing, the community is now able to create and maintain ports to such platforms.

Video 2. A digital twin of a kinetic sculpture simulated using gears and cams modeled with PhysX 5

As part of the update, some of the tools and utilities such as digital content creation tool exporters, debugging telemetry and diagnostics, demos, and samples have now been merged into the Omniverse platform.

Advanced demos are no longer bundled with the SDK. Visit the physics demos in NVIDIA Omniverse at NVIDIA On-Demand for more advanced examples of what is possible with PhysX. NVIDIA Omniverse is also where you should look for any content creation tools. NVIDIA is investing in creating the best possible physics toolset in Omniverse, which will continue to evolve and improve.  

The future of PhysX

NVIDIA continues to embrace open source in support of building an inclusive ecosystem. This is a first step in the process of opening up more and more Omniverse source code. As you browse through the source, you might come across some files that have existed as far back as 2001 and can still be used today. 

“PhysX is essential in making video game worlds feel more realistic and believable, not to mention fun. We are excited to see NVIDIA going open source with the latest version,” said Mika Vehkala, Director of Technology at Remedy.

In the near future, watch for source code releases showing how to build a user-modified version of this PhysX SDK into a custom Omniverse extension. NVIDIA also plans to have a full reference implementation of a USD Physics parser and simulation stack available with full source. 

You can access the open source code by visiting the NVIDIA-Omniverse/PhysX GitHub repository, which also includes the NVIDIA Flow library. Watch the latest tutorials on PhysX at NVIDIA On-Demand.

Visit the Omniverse Developer Resource Center and the USD page for additional resources, view the latest tutorials on Omniverse, and check out the forums for support. Join the Omniverse community, Discord server, and Twitch Channel to chat with the community, and subscribe to get the latest Omniverse news.

Follow NVIDIA Omniverse on Instagram, Twitter, YouTube, and Medium for additional resources and inspiration.

Categories
Misc

Anyone Can Build Metaverse Applications With New Beta Release of Omniverse

The new beta release of NVIDIA Omniverse is now available with major updates to core reference applications and tools for developers, creators, and novices…

The new beta release of NVIDIA Omniverse is now available with major updates to core reference applications and tools for developers, creators, and novices looking to build metaverse applications.

Each of the core components of the Omniverse platform has been updated to make the platform even faster, more accessible, and more flexible for collaborative workflows across applications. These updates empower developers of any background to easily build their custom applications, connections, and extensions anywhere. Learn more about how to develop on NVIDIA Omniverse.

Powered by support for new NVIDIA Ada Generation GPUs and advances in NVIDIA simulation technology, this new beta release focuses on maximizing ease of ingesting large, complex scenes from multiple third-party applications, and maximizing real-time rendering, path tracing, and physics simulation.

Figure 1. The five core components of NVIDIA Omniverse: Nucleus, Connect, Kit, Simulation, and RTX Renderer

Nucleus, the central database and collaboration engine of NVIDIA Omniverse, now enables faster live collaboration and copying between servers. Nucleus Navigator 3.2 makes it possible to move files and folders seamlessly between servers located on-premises and in the cloud. It also adds enhanced search functionality to quickly retrieve images, objects, and other assets. OmniObjects with Omniverse Live 2.0 allows faster collaboration between Connectors.

New and updated Connectors for popular apps are available through Omniverse Connect, the libraries that allow you to create Connectors from your favorite apps to the Omniverse platform. The beta release includes new and updated Connectors for PTC Creo, Autodesk Alias, Kitware ParaView, Siemens JT, and Autodesk Maya, among others.

PhysX 5, the flagship tool of Omniverse Simulation, has been open sourced so you can easily modify, build, and distribute your own physics simulation applications. The new version of PhysX comes with exciting new features like support for multiple scenes, collision-triggered audio, and an inspector for robotic applications. Experience Omniverse Simulation by downloading Omniverse and testing technical demos in Omniverse Showroom to see the power of PhysX 5 and real-time RTX Rendering.

New features and capabilities across Omniverse applications are driven by Omniverse Kit 104, which now allows novice or experienced Python and C++ developers to more easily develop, package, and publish their own custom metaverse applications and extensions to accelerate industry-specific workflows.

Connecting to Omniverse with Universal Scene Description

Our software partners are leading the way building useful extensions and Connectors on Omniverse Kit. Some of the more recently published extensions and Connectors include:

  • Updates to Omniverse Connectors for Autodesk 3ds Max, Autodesk Maya, Autodesk Revit, Epic Games’ Unreal Engine, McNeel Rhino, Trimble SketchUp, Graphisoft Archicad, and Kitware’s ParaView
  • New Omniverse Connectors for Autodesk Alias and PTC Creo
  • Reallusion iClone 8.1.0 live sync Connector for seamless interactions between Omniverse apps and iClone 8
  • The OTOY OctaneRender hydra render delegate, which enables Omniverse users to use OctaneRender directly in the Omniverse Create or View viewport
  • The Nextspace digital twin platform extension for normalizing data and geometry to drive the use of AI, analytics, and simulation
  • SmartCow’s Omniverse extension for synthetic data generation of large datasets of license plates for license plate recognition AI

More extensions and Connectors are on the way from companies like Lumirithmic, which is connecting its Hollywood-grade avatar scanning technology to Omniverse.

“We’ve been using NVIDIA Omniverse as our primary content delivery engine to serve our enterprise customers,” said Jayanth Kannan, VP of Software Engineering at Lumirithmic. “NVIDIA Omniverse does all the heavy lifting and enables seamless integration of our Avatars with industry standard DCC tools, helping our customers readily use our assets in their commercial projects.”

Move.ai, another partner extending the Omniverse, will soon be publishing an extension to put markerless motion capture in the hands of Omniverse users. 

“We’re excited by the potential to enable users to enhance their creative pipelines with our Move extension, which will allow users of Omniverse to access our free Motion Library,” said Niall Hendry, Head of Partnerships & Delivery at Move.ai. “The Omniverse team has been super responsive, guiding us every step of the way.”

Developers are invited to apply for early access to the new Omniverse Exchange Publishing Portal, which offers a new channel to distribute their custom tools and applications.

A new foundation for developing metaverse tools with Omniverse Kit 104

Omniverse Kit is the SDK on which every Omniverse microservice (like DeepSearch) or reference application (such as Omniverse Create, View, or Isaac Sim) is built. These microservices and reference applications are built as samples for developers to copy and customize.

Most Omniverse development work is exposed in Python workflows. This Omniverse Kit 104 beta release includes a new set of extension templates for C++ developers and technical artists to build extensions using C++. 

Omniverse Kit extension templates contain various example extensions to act as references for developing UI widgets, Universal Scene Description (USD) interactions, and more. These templates remove the need to create extensions from scratch and speed your application development. 
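
As an illustration of what the templates scaffold, below is a minimal sketch of a Kit extension written in Python against the public omni.ext interface. The extension ID and print messages are placeholders; a template-generated extension adds configuration (extension.toml) and packaging around the same two callbacks.

import omni.ext

class ExampleExtension(omni.ext.IExt):
    """Minimal Kit extension: Kit calls these hooks when the extension is enabled or disabled."""

    def on_startup(self, ext_id):
        # Create UI, subscribe to events, or register commands here
        print(f"[example.extension] startup: {ext_id}")

    def on_shutdown(self):
        # Release any resources created in on_startup
        print("[example.extension] shutdown")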

New Omniverse Kit app templates are also now available to make it easier than ever to build advanced 3D tools similar to NVIDIA’s reference applications that leverage core Omniverse technologies like RTX, PhysX, OmniGraph, and USD.

Figure 2. Use the new Omniverse Kit application template to create your own apps leveraging technologies from the Omniverse platform like RTX, PhysX, Nucleus, OmniGraph, and USD

Other key updates in Omniverse Kit include the following:

  • Viewport 2.0 for fully customizable, open workflows 
  • New navigation possibilities for user interfaces in Omni.ui.menu
  • The ability to encapsulate extension features in Actions
  • A centralized API and UI to manage Hotkeys

To learn more about Omniverse Kit 104, see Create Your Own Metaverse Applications with C++ and Python in Omniverse Kit 104. You can also watch the GTC session, How to Build Extensions and Apps for Virtual Worlds with NVIDIA Omniverse on demand.

See Omniverse Kit 104 in action with Omniverse reference applications

Omniverse Code is the integrated development environment (IDE) where developers can take advantage of all the new features of Kit 104. All the latest documentation and samples for building Omniverse applications, extensions, and microservices are integrated in Omniverse Code, making it easy for developers of all backgrounds to learn to develop and use Kit extensions. Omniverse Code makes it easier than ever to leverage Omniverse’s extensibility so that non-traditional developers can quickly build tools and applications to make their workflows more efficient and personalized.

The Omniverse Create application has been updated as part of the beta release with animation improvements and better capabilities for large world authoring. Creators can collaborate more seamlessly on large worlds with layer-based live workflows and Viewport icons showing locations of other users in a scene. 

This release also supports the new DLSS 3 included in the Ada Generation GeForce RTX and NVIDIA RTX GPUs, enabling massive improvements in performance and quality in the RTX renderer by generating additional high-quality frames in real time. 

You can also use many new PhysX extensions in Omniverse Create, including PhysX Authoring Toolbar and Signed Distance Field (SDF) Colliders.

  • PhysX Authoring Toolbar – A simple authoring toolbar to make all your content behave correctly in a simulated environment.
  • SDF Colliders – SDF-based collision detection can now be used for physics objects, enabling direct real-time simulation of gears and cams.

This year, Omniverse Create has launched over 300 extensions built in Kit, including the following:

  • ActionGraph – A special type of OmniGraph in Create that allows you to create event-driven behaviors and logic inside scenes with node-based visual programming.
  • Omni.ui.scene – An extension in Omni.ui that allows you to build interactive UI widgets and manipulators directly inside the viewport or 3D environment (a minimal sketch follows Figure 3).
  • DeepSearch – An AI-powered microservice that enables instant natural language or 2D image-based search into Omniverse Nucleus’s asset database to retrieve images, objects, or other assets.

Figure 3. Use Action Graph to add event-driven behaviors to an asset. For the car shown, you can open/close doors, raise/lower the spoiler, and change paint colors
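
As referenced in the Omni.ui.scene item above, here is a minimal sketch of building UI with the omni.ui framework that underpins it. It assumes you run the snippet inside a Kit-based app such as Omniverse Create (for example, from the Script Editor); the window title and labels are placeholders.

import omni.ui as ui

# A small omni.ui window with a label and a button; omni.ui.scene builds on the
# same framework to place widgets and manipulators inside the viewport itself.
window = ui.Window("Example Panel", width=300, height=120)
with window.frame:
    with ui.VStack(spacing=4):
        label = ui.Label("No clicks yet")

        def _on_click():
            label.text = "Button clicked"

        ui.Button("Click me", clicked_fn=_on_click)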

“For architectural design/visualization workloads, normally we use software out of the box, but you can run into limitations with those out of the box implementations,” said Eric Craft, XR & Visualization Program Manager at architectural firm Mead & Hunt. “NVIDIA’s Omniverse development platform gives me the ability to easily tweak and customize their tools, so I can build a more efficient, more effective toolkit for our company.” 

“Since it’s based on USD,” Craft added, “the platform interconnects with other popular industry tools, which means I can build a custom Omniverse tool in one place, but use it across our multi-app workflows. And because of the USD layer-based workflow, changes in Omniverse stay even when the design export is updated.” 

Audio2Gesture, an AI-powered tool that creates realistic body gestures based on an audio file, is now available in Omniverse Machinima.

Omniverse View, a simple review and approval app, now features a focused, collaborative review and markup experience. 

NVIDIA Omniverse Replicator, an SDK for generating 3D synthetic data for AI and simulation workflows, is now available as a container for easy deployment on your preferred Cloud Service Provider (CSP). AWS users can leverage the Omniverse GPU-Optimized AMI available on AWS marketplace and deploy the replicator container seamlessly on an EC2 instance. 

Get started with NVIDIA Omniverse

With a new set of diverse tools and updated applications now available in Omniverse, there has never been a better time to get started. Download the Omniverse free license for individuals to start building with the beta release of Omniverse. 

The Omniverse team is eager to hear your feedback about the beta release and actively looking for input in our Omniverse forums to improve the experience for individual users. Join our community Omniverse livestream on Wednesday, November 9 to learn more about the beta release of Omniverse and get ideas for how to take advantage of the new features.

Subscribe to the Omniverse newsletter to receive updates about Omniverse Enterprise. Follow us on Instagram, Twitter, YouTube, and Medium to stay up to date with the next releases and use cases.

Visit the Omniverse Developer Resource Center and the USD page for additional resources, view the latest tutorials on Omniverse, and check out the forums for support. Join the Omniverse community, Discord server, and Twitch Channel to chat with the community, and subscribe to get the latest Omniverse news.

Categories
Misc

Enabling Enterprise AI Transformations for Telcos with NVIDIA and VMware

AI has the power to transform every industry, but transformation takes time, and it’s rarely easy. For enterprises across industries to be as successful as…

AI has the power to transform every industry, but transformation takes time, and it’s rarely easy. For enterprises across industries to be as successful as possible in their own transformations, they need access to AI-ready technology platforms. They also must be able to use 5G connectivity at the edge to harness valuable data and inform their AI and ML models.

Sign up for the latest telecommunications news from NVIDIA.

The advantages of 5G, such as lower latency and improved mobility as well as data throughput, also increase the application footprint of AI/ML applications within enterprises. According to an analysis by Verified Market Research, the market size for enterprise AI is projected to hit over $88 billion by 2030, up from $7 billion in 2022.

The future lives on the edge with AI, and any business that doesn’t stake its claim now risks falling behind. Industries across the spectrum are only beginning to unlock the tremendous value of AI when it is operationalized:

  • Banks are looking to understand the behavior of customers using AI-powered mobile apps to customize the experience and provide personalized service.
  • Manufacturers are beginning to use real-time data to prevent issues and challenges in their processes proactively, lowering maintenance costs and optimizing their operations.
  • Educators are using AI learning platforms to give students uninterrupted access to lessons from any device or location.

One of the most powerful of these potential applications combines AI and visual computing to begin pushing the boundaries of the metaverse, the 3D evolution of the internet.

By analyzing the unending stream of data generated by connected devices, it is possible to generate a digital twin of anything from a car to the factory in which the car is built. These digital twins are virtual simulations of the physical objects that can be manipulated and altered using AI. Digital twins can simulate how these objects would behave before applying changes in the real world.

Across all these use cases, a common factor is the need to combine AI-ready, digital compute platforms with 5G connectivity at the edge to best leverage the data that increasingly resides there.

As the telecommunications industry has tackled the evolution to 5G connectivity, it has also recognized the value of helping other industries embrace AI to transform their own businesses.

For enterprises that want to realize the immense value of AI but lack the necessary IT infrastructure, telcos represent the best option to provide a managed offering to deliver these services. Telcos are uniquely positioned to take the core connectivity services they’ve perfected and combine them with AI-ready infrastructure to provide enterprises with a managed, connected, and end-to-end AI offering.

For telcos looking to offer new B2B services outside of their core connectivity services, this represents an enormous opportunity to grow revenue and increase profitability.

Most telcos today are not necessarily experts in IT infrastructure platforms or AI. Thankfully, they don’t have to be.

The AI-Ready Enterprise Platform, running VMware’s Cloud Director cloud service delivery platform with NVIDIA AI Enterprise, offers telcos a suite of data science tools and frameworks they can use to harness countless AI applications and reduce time to ROI while overcoming the problems posed by unplanned implementations. All of this helps telcos transform their business and capture the opportunity that AI represents.

Challenges of AI implementation

Companies that don’t adopt AI risk being left behind. But even those that embrace its potential find that it requires a high degree of operational effectiveness and cooperation between AI development teams and business stakeholders.

A recent report by Gartner predicted that 85% of AI projects would fail to deliver on their promises, due in part to a lack of internal skill within the implementing enterprise. It’s one thing to initiate an AI trial, but several factors have led enterprises to find that scaling any trials to generate financial impact is beyond their means:

  • A well-trained AI system requires quality data to function; poor data will give bad results.
  • The cost of replacing outdated hardware with AI-based systems can be prohibitive.
  • Gaps in AI enablement for network and device performance monitoring lead to problems in gathering real-time insights.

Without a concerted, platform-based approach, costs can quickly spin out of control and a company’s return on investment is slowed. That’s why many enterprises looking to embrace AI will be on the lookout for a managed solution.

As a telco, if you can overcome these challenges yourself and deliver on that promise, you can set yourself up as the long-term AI and connectivity platform provider for enterprises looking to transform.

AI-Ready Enterprise Platform value

NVIDIA and VMware have partnered to make it as easy as possible for telcos to surmount any potential hurdles and begin offering AI as a service.

By supplying the application frameworks—including SDKs, tools, APIs, and documentation—NVIDIA and VMware enable telcos to become true SaaS players with the AI-Ready Enterprise Platform. NVIDIA GPUs and DPUs enable these new applications while VMware provides a unified, multi-cloud infrastructure for networking, security, and compute services out to the edge. This combination enables operators and enterprises to start from any compute workload and expand to other workloads on the same infrastructure.

AI-Ready Enterprise Platform with VMware Cloud Director and NVIDIA AI Enterprise unlocks the power of AI by delivering an end-to-end enterprise platform optimized for AI workloads. VMware Cloud Director virtualizes the GPU and enables multiple tenants to share and consume the GPU as a service. When telcos implement AI-Ready Enterprise Platform, the full value of AI and ML applications is achievable and can be delivered alongside a telco’s connectivity services as a true managed AI offer.

Ultimately, this sets up telcos to provide end-to-end infrastructure including the connectivity, edge computing, and applications that are key for AI democratization. Enterprises are free to scale without compromise, enabling more complex AI training and data analytics. This can include services like the following:

  • Intelligent video analytics: Helps retailers keep a closer eye on shopper experiences and merchandise loss.
  • Immersive digital twins over 5G VR: Helps teams virtually collaborate on product designs without being limited by location.
  • AI-enabled traffic monitoring systems: Help municipalities take advantage of a telco’s subscriber base to improve congestion.

Conclusion

As new functions for AI continue to spread across every sector of the economy, late adopters may find themselves at a competitive disadvantage.

Telcos, with their development of 5G edge connectivity and vast troves of consumer data, find themselves on the front lines of this burgeoning frontier. With access to a growing ecosystem of AI applications built around the NVIDIA platform, telcos have a unique opportunity to deliver AI services. They can drive profitability to enterprises in the world’s largest industries, from transportation and healthcare to retail.

Categories
Misc

Accelerating Load Times for DirectX Games and Apps with GDeflate for DirectStorage

Load times. They are the bane of any developer trying to construct a seamless experience. Trying to hide loading in a game by forcing a player to shimmy…

Load times. They are the bane of any developer trying to construct a seamless experience. Trying to hide loading in a game by forcing a player to shimmy through narrow passages or take extremely slow elevators breaks immersion.

Now, developers have a better solution. NVIDIA collaborated with Microsoft and IHV partners to develop GDeflate for DirectStorage 1.1, an open standard for GPU compression. The current Game Ready Driver (version 526.47) contains NVIDIA RTX IO technology, including optimizations for GDeflate.

GDeflate: An Open GPU Compression Standard

GDeflate is a high-performance, scalable, GPU-optimized data compression scheme that can help applications make use of the sheer amount of data throughput available on modern NVMe devices. It makes streaming decompression from such devices practical by eliminating CPU bottlenecks from the overall I/O pipeline. GDeflate also provides bandwidth amplification effects, further improving the effective throughput of the I/O subsystem.

GDeflate will be released as open source on GitHub under a permissive license for IHVs and ISVs. We want to encourage quick adoption of GDeflate as a data-parallel compression standard across the PC ecosystem and on other platforms.

To show the benefits of GDeflate, we measured system performance without compression, with standard CPU-side decompression, and with GPU-accelerated GDeflate decompression on a representative game-focused dataset, containing texture and geometry data.

Figure 1. Data throughput achieved with no compression, Zlib, a CPU implementation of GDeflate, and GPU GDeflate, across varying staging buffer sizes
Figure 2. Processing cycles consumed by the same configurations across varying staging buffer sizes

As you can see from Figures 1 and 2, the data throughput of uncompressed streaming is limited by the system bus bandwidth at about 3 GB/s, which happens to be the limit of a Gen3 PCIe interconnect.

When applying traditional compression with decompression happening on the CPU, it’s the CPU that becomes the overall bottleneck, resulting in lower throughput than would otherwise be possible with uncompressed streaming. Not only does it underutilize available I/O resources of the system, but it also takes away CPU cycles from other tasks needing CPU resources.

With GPU-accelerated GDeflate decompression, the system can deliver effective bandwidth well in excess of what’s possible without applying compression. It is effectively multiplying data throughput by its compression ratio. The CPU remains fully available for performing other important tasks, maximizing system-level performance.

GDeflate is available as a standard GPU decompression option in DirectStorage 1.1—a modern I/O streaming API from Microsoft. We’re looking forward to next-generation game engines benefiting from GDeflate by dramatically reducing loading times.

Resource streaming and data compression

Today’s video games feature extremely detailed interactive environments, requiring the management of enormous assets. This data must be delivered first to the end user’s system, and then, at runtime, actively streamed to the GPU for processing. The bulk of a game’s content package is made up of resources that naturally target the GPU: textures, materials, and geometry data.

Traditional data compression techniques are applicable to game content that rarely changes. For example, a texture that is authored only one time may have to be loaded multiple times as the player advances through a game level. Such assets are usually compressed when they are packaged for distribution and decompressed on demand when the game is played. It has become standard practice to apply compression to game assets to reduce the size of the downloadable (and its installation footprint).

However, most data compression schemes are designed for CPUs and assume serial execution semantics. In fact, the process of data compression is usually described in fundamentally serial terms: a stream of data is scanned serially while looking for redundancies or repeated patterns, and multiple occurrences of such patterns are replaced with references to their previous occurrences. As a result, such algorithms can’t easily scale to data-parallel architectures or accommodate the need for faster decompression rates demanded by modern game content.

At the same time, recent advances in I/O technology have dramatically improved available I/O bandwidth on the end user system. It’s typical to see a consumer system equipped with a PCIe Gen3 or Gen4 NVMe device, capable of delivering up to 7 GB/s of data bandwidth.

To put this in perspective, at this rate, it is possible to fill the entire 24 GB of frame buffer memory on the high-end NVIDIA GeForce RTX 4090 GPU in a little over 3 seconds!

To keep up with these system-level I/O speed improvements, we need dramatic advances in data compression technology. At these rates, it is no longer practical to use the CPU for data decompression on the end user’s system. That requires an unacceptably large fraction of precious CPU cycles to be spent on this auxiliary task. It may also slow down the entire system.

The CPU shouldn’t become the bottleneck that holds back the I/O subsystem.

Data-parallel decompression and GDeflate architecture

With Moore’s law ending, we can no longer expect to get “free” performance improvements from serial processors.

High-performance systems have long embraced large-scale data parallelism to continue scaling performance for many applications. On the other hand, parallelizing the traditional data compression algorithms has been challenging, due to fundamental serial assumptions “baked” into their design.

What we need is a GPU-friendly data compression approach that can scale performance as GPUs become wider and more parallel.

This is the problem that we set out to address with GDeflate, a novel data-parallel compression scheme optimized for high-throughput GPU decompression. We designed GDeflate with the following goals:

  • High-performance GPU-optimized decompression to support the fastest NVMe devices
  • Offloading the CPU to avoid making it the bottleneck during I/O operations
  • Portability to a variety of data-parallel architectures, including CPUs and GPUs
  • Cheap implementation in fixed-function hardware, using existing IP
  • Establishment as a data-parallel data compression standard

As you could guess from its name, GDeflate builds upon the well-established RFC 1951 DEFLATE algorithm, expanding and adapting it for data-parallel processing. While more sophisticated compression schemes exist, the simplicity and robustness of the original DEFLATE data coding make it an appealing choice for highly tuned GPU-based implementations.

Existing fixed-function implementations of DEFLATE can also be easily adapted to support GDeflate for improved compatibility and performance.

Two-level parallelism

The GDeflate bitstream is designed to be consumed by a many-core SIMD machine, explicitly exposing parallelism at two levels.

First, the original data stream is segmented into 64 KB tiles, which are processed independently. This coarse-grained decomposition provides thread-level parallelism, enabling multiple tiles to be processed concurrently on multiple cores of the target processor. This also enables random access to the compressed data at tile granularity. For example, a streaming engine may request a sparse set of tiles to be decompressed in accordance with the required working set for a given frame.

Also, 64 KB happens to be the standard tile size for tiled or sparse resources in graphics APIs (DirectX and Vulkan), which makes GDeflate compatible with future on-demand streaming architectures leveraging these API features.

Second, the bitstream within tiles is specifically formatted to expose finer-grained, SIMD-level parallelism. We expect that a cooperative group of threads will process individual tiles, as the group can directly parse the GDeflate bitstream using hardware-accelerated data-parallel operations, commonly available on most SIMD architectures.

All threads in the SIMD group share the decompression state. The formatting of the bitstream is carefully constructed to enable highly optimized cooperative processing of compressed data.

This two-level parallelization strategy enables GDeflate implementations to scale easily across a wide range of data-parallel architectures, also providing necessary headroom for supporting future, even wider data-parallel machines without compromising decompression performance.
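
The coarse-grained level of this scheme can be illustrated without any GPU code. The sketch below is a conceptual Python example, not the actual GDeflate bitstream: it uses zlib (an implementation of DEFLATE, which GDeflate builds on) as a stand-in per-tile codec, compresses 64 KB tiles independently, and decompresses them concurrently, mirroring how independent tiles expose thread-level parallelism and random access at tile granularity.

import zlib
from concurrent.futures import ThreadPoolExecutor

TILE_SIZE = 64 * 1024  # 64 KB tiles, matching GDeflate's coarse-grained granularity

def compress_tiles(data: bytes) -> list:
    # Compress each tile independently so tiles can be decoded in any order
    tiles = [data[i:i + TILE_SIZE] for i in range(0, len(data), TILE_SIZE)]
    return [zlib.compress(tile) for tile in tiles]

def decompress_tiles(compressed_tiles, workers=8) -> bytes:
    # Each tile is an independent work item, so a pool of workers can decode them concurrently
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, compressed_tiles))

if __name__ == "__main__":
    payload = bytes(range(256)) * 4096           # ~1 MB of sample data
    tiles = compress_tiles(payload)
    assert decompress_tiles(tiles) == payload    # the round trip preserves the original data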

NVIDIA RTX IO supports DirectStorage 1.1

NVIDIA RTX IO is now included in the current Game Ready Driver (version 526.47), which offers accelerated decompression throughput.

Both DirectStorage and RTX IO leverage the GDeflate compression standard.

“Microsoft is delighted to partner with NVIDIA to bring the benefits of next-generation I/O to Windows gamers. DirectStorage for Windows will enable games to leverage NVIDIA’s cutting-edge RTX IO and provide game developers with a highly efficient and standard way to get the best possible performance from the GPU and I/O system. With DirectStorage, game sizes are minimized, load times reduced, and virtual worlds are free to become more expansive and detailed, with smooth and seamless streaming.”

Bryan Langley, Group Program Manager for Windows Graphics and Gaming

Getting started with DirectStorage in RTX IO drivers

We have a few more recommendations to help ensure the best possible experience using DirectStorage with GPU decompression on NVIDIA GPUs.

Preparing your application for DirectStorage

Achieving maximum end-to-end throughput with DirectStorage and GPU decompression requires enqueuing enough read requests to keep the pipeline fully saturated.

In preparation for DirectStorage integration, applications should group resource I/O and creation requests close together in time. Ideally, resource I/O and creation operations occur in their own CPU thread, separate from threads doing other loading screen activities like shader creation.

Assets on disk should also be packaged together in large enough chunks so that DirectStorage API call frequency is kept to a minimum and CPU costs are minimized. This ensures that enough work can be submitted to DirectStorage to keep the pipeline fully saturated.

For more information about general best practices, see Using DirectStorage and the DirectStorage 1.1 Now Available Microsoft post.

Deciding the staging buffer size

  • Make sure to change the default staging buffer size whenever GPU decompression is used. The current 32 MB default isn’t sufficient to saturate modern GPU capabilities.
  • Make sure to benchmark different platforms with varying NVMe, PCIe, and GPU capabilities when deciding on the staging buffer size. We found that a 128-MB staging buffer size is a reasonable default. Smaller GPUs may require less and larger GPUs may require more.

Compression ratio considerations

  • Make sure to measure the impact that different resource types have on compression savings and GPU decompression performance.
  • In general, various data types, such as texture and geometry, compress at different ratios. This can cause some variation in GPU decompression execution performance.
  • This won’t have a significant effect on end-to-end throughput. However, it may result in variation in latency when delivering the resource contents to their final locations.

Windows File System

  • Try to keep disk files accessed by DirectStorage separate from files accessed by other I/O APIs. Shared file use across different I/O APIs may result in the loss of bypass I/O improvements.

Command queue scheduling when background streaming

  • In Windows 10, command queue scheduling contention can occur between DirectStorage copy and compute command queues, and application-managed copy and compute command queues.
  • The NVIDIA Nsight Systems, PIX, and GPUView tools can assist in determining whether background streaming with DirectStorage is in contention with important application-managed command queues.
  • In Windows 11, overlapped execution between DirectStorage and application command queues is fully expected.
  • If overlapped execution results in suboptimal performance of application workloads, we recommend throttling back DirectStorage reads. This helps maintain critical application performance while background streaming is occurring.

Summary

Next-generation game engines require streaming huge amounts of data, aiming to create increasingly realistic, detailed game worlds. Given that, it’s necessary to rethink game engines’ resource streaming architecture, and fully leverage improvements in I/O technology.

Using the GPU as an accelerator for compute-intensive data decompression becomes critical for maximizing system performance and reducing load times.

The NVIDIA RTX IO implementation of GDeflate is a scalable GPU-optimized compression technology that enables applications to benefit from the computational power of the GPU for I/O acceleration. It acts as a bandwidth amplifier for high-performance I/O capabilities of today and future systems.

Categories
Misc

Data Storytelling Best Practices for Data Scientists and AI Practitioners

Storytelling with data is a crucial soft skill for AI and data professionals. To ensure that stakeholders understand the technical requirements, value, and…

Storytelling with data is a crucial soft skill for AI and data professionals. To ensure that stakeholders understand the technical requirements, value, and impact of data science team efforts, it is necessary for data scientists, data engineers, and machine learning (ML) engineers to communicate effectively.

This post provides a framework and tips you can adopt to incorporate key elements of data storytelling into your next presentation, pitch, or proposal. It aims to accomplish the following:

  • Introduce storytelling within the context of data science and machine learning
  • Highlight the benefits of effective storytelling for data science practitioners
  • Provide tips on how to cultivate data storytelling skills

What is storytelling with data

Data storytelling is the ability to add contextual information to key data and insights to help develop viewpoints and realizations for project stakeholders. Data scientists and AI practitioners must effectively convey the impact of data-driven action or reasoning.  

Data and machine learning practitioners can use data storytelling to more effectively communicate with clients, project stakeholders, team members, and other business entities. A compelling narrative can help your audience understand complex concepts and can help win new projects.

Data storytelling case study

This section explores the key structural components of a data-driven story. 

The article, What Africa Will Look Like in 100 Years, leverages data and visualizations to tell a narrative of the ongoing transformation occurring in Africa from the viewpoint of major African cities such as Lagos, Dakar, and Cairo.

The strategic composition of this article presents the problem, background, and solution. This approach provides a strong foundation for any data-driven narrative. The article also includes facts, anecdotes, data, and charts and graphs. Together, these produce a free-flowing, well-structured, engaging, and informative account of the subject matter.

The opening sections of this article describe the context and main point: “Can Africa translate its huge population growth into economic development and improved quality of life?” 

Information such as key dates, figures, and first-person statements creates a picture grounded in reality, allowing the reader to form a full understanding of the subject matter. The presentation of data using charts and graphs allows for the visualization of the transformations of Africa’s major cities. Specific data points include population growth, education rate, and life expectancy. Personal experiences and first-hand accounts from citizens of the focus cities provide additional context.

An effective framework for storytelling in data science

This section explores how storytelling in the data science field should be structured and presented. The goal is to equip you with an easy-to-follow framework for your next presentation, article, or video to stakeholders. 

The recipe for success when storytelling can be distilled into three individual components: context, dispute, and solution (Figure 1). These components can be combined with other methods to tell a compelling story with data. 

  • Context: Lay the foundation for your narrative and provide some background
  • Dispute: Discuss the problem associated with the context
  • Solution: Explain and discuss the solution that either ends or mitigates the identified problem

Figure 1. The components of storytelling: context, dispute, and solution

Context

In storytelling, context involves providing information to reinforce, support, and reveal the key findings extracted from data samples. Without context, collated data are only collections of alphanumeric representations of information that alone don’t provide any actionable insight into the issue or topic. Presenting data together with reinforcing context and other supporting elements can aid understanding and help audiences reach meaningful conclusions. 

You can use many different methods to create context when storytelling. Context within data is produced by leveraging a collection of reinforcing materials such as actors, anecdotes, visualizations, data labels, diagrams, and more.

To provide an example, consider the sentence below:

“200,000 plug-in electric vehicles were sold in the United Kingdom in 2021, representing an approximate 140% year-on-year increase.” 

Adding contextual information and supporting anecdotes can increase relatability, as shown in the paragraph below: 

“James’s interest in electric vehicles was sparked by a conversation he overheard on the radio about climate change. He did some research and found that a Volkswagen ID.3 would be a great choice for him. James decided to buy the car and by mid-2021, he was one of the many UK residents who had made the switch to electric vehicles. Sales of electric vehicles in 2021 more than doubled what they were in 2020, due to the public’s increasing awareness of climate change and its effects.”

Charts and diagrams are also important to include. They visualize data to aid understanding and provide additional support (Figure 2).

Figure 2. A bar chart showing 2021 plug-in electric vehicle sales in selected European countries, an example of data visualization that helps provide context in data storytelling
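
A chart like Figure 2 can be produced with a few lines of plotting code. The sketch below uses matplotlib with hypothetical sales figures (only the UK number echoes the article’s example); swap in your own dataset when building a real narrative.

import matplotlib.pyplot as plt

# Hypothetical 2021 plug-in electric vehicle sales, in thousands, for illustration only
countries = ["Germany", "United Kingdom", "France", "Norway", "Sweden"]
sales_thousands = [680, 305, 300, 175, 160]

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(countries, sales_thousands, color="steelblue")
ax.set_ylabel("Plug-in EV sales (thousands)")
ax.set_title("Plug-in electric vehicle sales in selected European countries, 2021")
ax.bar_label(bars)       # annotate each bar so the key figures are readable at a glance
plt.tight_layout()
plt.show()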

Dispute

Dispute, in the context of data storytelling, is a problem, conflict, argument, debate, or issue. To drive the impact of introducing a new tool or adopting a new methodology, it helps to include mention of the key dispute. 

Below is an example of a dispute that helps drive the point of the initial electric vehicle data:

“The United Kingdom is a net importer of fossil fuels for the use of energy and electricity generation. Fossil fuels power our transportation, electrical, and technological services, and even domestic items heavily reliant on fossil fuels’ energy output. The problem is that the UK is determined to significantly reduce its dependence on fossil fuels by 2050. Hence, the question is how the UK can reduce its fossil fuel consumption and move to low-carbon energy sources as an alternative. In addition, fossil fuels are a massive contributor to climate change and extreme weather.”

Solution

The third, and final element to consider when connecting storytelling with data is the solution. The solution can come in many forms, such as reconfiguring an existing system, implementing new methodologies, or becoming aware of educational materials and how to best use them.

The proposed solution should be direct, obvious, and memorable. If proposed solutions are ambiguous, stakeholders will ask more questions. A direct solution, on the other hand, allows for action and the formation of future steps.

Below is an example of a proposed solution:

“Awareness is the first step to making the national UK goal of reducing fossil fuel dependency by 2050. To reach more people like James, we propose a scale-up of the WWF Carbon footprint app to include AI-powered functionality that enables services such as energy consumption prediction per household based on historical data and predicted energy demands. This scale-up initiative will require funding of £100 million and will be delivered to the public a year after project approval.”

The proposed solution contains a reference to the story to make it easier to remember. It also includes information about the project cost and timeline to show that it is direct. 

Sample outline 

Use the sample outline below as a reference for your next data storytelling project.

Opening section

  • Start with a factual statement of your key data point or dataset summary that highlights the impact of the dispute, lack of solution, or the impact of a possible solution. For example, “305,300 plug-in electric vehicles were sold in the United Kingdom in 2021, representing an approximate 140% year-on-year increase.”
  • Expand on the initial opening section by including several paragraphs introducing, explaining, and expanding on the context.

Middle section

  • Introduce, explain, and expand on the dispute.
  • Include anecdotes, facts, figures, charts, and diagrams to contextualize the dispute and present the problem.
  • Introduce, explain, and expand on the solution to the dispute.
  • Include anecdotes, facts, figures, charts, and diagrams to illustrate the impact and value of the proposed solution.

Closing section

  • Summarize your main points. Show the benefits a solution would bring, and the undesired consequences of not having a solution.
  • Include a call to action as a next step that encapsulates the desired outcome of the story told with data.

Figure 3. The key components and accompanying attributes of effective data storytelling

Summary

Companies and organizations are becoming more data-driven every day. As a result, AI and data professionals of all levels need to develop data storytelling skills to bridge gaps of understanding related to technicalities, datasets, and technologies. The information in this post will give you a strong foundation from which to start building your data storytelling skills.

Categories
Misc

Tiny Computer, Huge Learnings: Students at SMU Build Baby Supercomputer With NVIDIA Jetson Edge AI Platform

“DIY” and “supercomputer” aren’t words typically used together. But a do-it-yourself supercomputer is exactly what students built at Southern Methodist University, in Dallas, using 16 NVIDIA Jetson Nano modules, four power supplies, more than 60 handmade wires, a network switch and some cooling fans. The project, dubbed SMU’s “baby supercomputer,” aims to help educate those…

The post Tiny Computer, Huge Learnings: Students at SMU Build Baby Supercomputer With NVIDIA Jetson Edge AI Platform appeared first on NVIDIA Blog.

Categories
Misc

Deploying a 1.3B GPT-3 Model with NVIDIA NeMo Megatron

Large language models (LLMs) are some of the most advanced deep learning algorithms that are capable of understanding written language. Many modern LLMs are…

Large language models (LLMs) are some of the most advanced deep learning algorithms that are capable of understanding written language. Many modern LLMs are built using the transformer network introduced by Google in 2017 in the Attention Is All You Need research paper.

NVIDIA NeMo Megatron is an end-to-end GPU-accelerated framework for training and deploying transformer-based LLMs up to a trillion parameters. In September 2022, NVIDIA announced that NeMo Megatron is now available in Open Beta, allowing you to train and deploy LLMs using your own data. With this announcement, several pretrained checkpoints have been uploaded to HuggingFace, enabling anyone to deploy LLMs locally using GPUs.

This post walks you through the process of downloading, optimizing, and deploying a 1.3 billion parameter GPT-3 model using NeMo Megatron. It includes NVIDIA Triton Inference Server, a powerful open-source, inference-serving software that can deploy a wide variety of models and serve inference requests on both CPUs and GPUs in a scalable manner.

System requirements

While training LLMs requires massive amounts of compute power, trained models can be deployed for inference at a much smaller scale for most use cases.

The models from HuggingFace can be deployed on a local machine with the following specifications:

  • Running a modern Linux OS (tested with Ubuntu 20.04).
  • An NVIDIA Ampere architecture GPU or newer with at least 8 GB of GPU memory.
  • At least 16 GB of system memory.
  • Docker version 19.03 or newer with the NVIDIA Container Runtime.
  • Python 3.7 or newer with PIP.
  • A reliable Internet connection for downloading models.
  • Permissive firewall, if serving inference requests from remote machines.

Preparation

NeMo Megatron is now in Open Beta and available for anyone who completes the free registration form. Registration is required to gain access to the training and inference containers, as well as helper scripts to convert and deploy trained models.

Several trained NeMo Megatron models are hosted publicly on HuggingFace, including 1.3B, 5B, and 20B GPT-3 models. These models have been converted to the .nemo format which is optimized for inference.

Converted models cannot be retrained or fine-tuned, but they enable fully trained models to be deployed for inference. These models are significantly smaller in size compared to the pre-conversion checkpoints and are supported by FasterTransformer (FT), a backend in Triton Inference Server that runs LLMs across GPUs and nodes.

For the purposes of this post, we used the 1.3B model, which has the quickest inference speeds and can comfortably fit in memory for most modern GPUs.

To convert the model, run the following steps.

Download the 1.3B model to your system. Run the following command in the directory where you want to keep the converted models for NVIDIA Triton to read:

wget https://huggingface.co/nvidia/nemo-megatron-gpt-1.3B/resolve/main/nemo_gpt1.3B_fp16.nemo

Make a note of the folder to which the model was copied, as it is used throughout the remainder of this post.

Verify the MD5sum of the downloaded file:

$ md5sum nemo_gpt1.3B_fp16.nemo
38f7afe7af0551c9c5838dcea4224f8a  nemo_gpt1.3B_fp16.nemo

Use a web browser to log in to NGC at ngc.nvidia.com. Enter the Setup menu by selecting your account name. Select Get API Key followed by Generate API Key to create the token. Make a note of the key as it is only shown one time.

In the terminal, add the token to Docker:

$ docker login nvcr.io
Username: $oauthtoken
Password: 

When prompted for the password, enter the token that was generated. The username must be exactly $oauthtoken, as this indicates that a personal access token is being used.

Pull the latest training and inference images for NeMo Megatron:

$ docker pull nvcr.io/ea-bignlp/bignlp-training:22.08.01-py3
$ docker pull nvcr.io/ea-bignlp/bignlp-inference:22.08-py3

At the time of publication, the latest image tags are 22.08.01-py3 for training and 22.08-py3 for inference. We recommend checking for newer tags on NGC and pulling those, if available.

Verify that the images were pulled successfully, as the IDs might change with different tags:

$ docker images | grep "ea-bignlp/bignlp"
nvcr.io/ea-bignlp/bignlp-training                       22.08.01-py3                         d591b7488a47   11 days ago     17.3GB
nvcr.io/ea-bignlp/bignlp-inference                      22.08-py3                            77a6681df8d6   2 weeks ago     12.2GB

Model conversion

To optimize throughput and latency of the model, it can be converted to the FT format, which contains performance modifications to the encoder and decoder layers in the transformer architecture.

FT can serve inference requests with 3x quicker latencies or more compared to their non-FT counterparts. The NeMo Megatron training container includes the FT framework as well as scripts to convert a .nemo file to the FT format.

Triton Inference Server expects models to be stored in a model repository. Model repositories contain checkpoints and model-specific information that Triton Inference Server reads to tune the model at deployment time. As with the FT framework, the NeMo Megatron training container includes scripts to convert the FT model to a model repository for Triton.

Converting a model to the FT format and creating a model repository for the converted model can be done in one pass in a Docker container. To create an FT-based model repository, run the following command. Items that might have to change for your setup are explained in the parameter list after the command.

docker run --rm \
    --gpus all \
    --shm-size=16GB \
    -v /path/to/checkpoints:/checkpoints \
    -v /path/to/checkpoints/output:/model_repository \
    nvcr.io/ea-bignlp/bignlp-training:22.08.01-py3 \
    bash -c 'export PYTHONPATH=/opt/bignlp/FasterTransformer:${PYTHONPATH} && \
    cd /opt/bignlp && \
    python3 FasterTransformer/examples/pytorch/gpt/utils/nemo_ckpt_convert.py \
        --in-file /checkpoints/nemo_gpt1.3B_fp16.nemo \
        --infer-gpu-num 1 \
        --saved-dir /model_repository/gpt3_1.3b \
        --weight-data-type fp16 \
        --load-checkpoints-to-cpu 0 && \
    python3 /opt/bignlp/bignlp-scripts/bignlp/collections/export_scripts/prepare_triton_model_config.py \
        --model-train-name gpt3_1.3b \
        --template-path /opt/bignlp/fastertransformer_backend/all_models/gpt/fastertransformer/config.pbtxt \
        --ft-checkpoint /model_repository/gpt3_1.3b/1-gpu \
        --config-path /model_repository/gpt3_1.3b/config.pbtxt \
        --max-batch-size 256 \
        --pipeline-model-parallel-size 1 \
        --tensor-model-parallel-size 1 \
        --data-type bf16'

This command launches a Docker container to run both conversions. The following list describes a few important parameters and their functions:

  • -v /path/to/checkpoints:/checkpoints: Specify the local directory where checkpoints were saved. This is the directory that was mentioned during the checkpoint download step earlier. The final :/checkpoints directory in the command should stay the same.
  • -v /path/to/checkpoints/output:/model_repository: Specify the local directory to save the converted checkpoints to. Make a note of this location as it is used in the deployment step later. The final :/model_repository directory in the command should stay the same.
  • nvcr.io/ea-bignlp/bignlp-training:22.08.01-py3: If a newer image exists on NGC, replace the image tag with the newer version.
  • --in-file /checkpoints/nemo_gpt1.3B_fp16.nemo: The name of the downloaded checkpoint to convert. If you are using a different version, replace the name here.
  • --infer-gpu-num 1: This is the number of GPUs to use for the deployed model. If using more than one GPU, increase this number to the desired amount. The remainder of this post assumes that the value of 1 was used here.
  • --model-train-name gpt3_1.3b: The name of the deployed model. If you are using a different model name, make a note of the new name as NVIDIA Triton requests require the name to be specified.
  • --tensor-model-parallel-size 1: If you are using a different GPU count for inference, this number must be updated. The value should match that of --infer-gpu-num from earlier.

After running the command, verify that the model has been converted by viewing the specified output directory. The output should be similar to the following (truncated for brevity):

$ ls -R output/
output/:
gpt3_1.3b

output/gpt3_1.3b:
1-gpu  config.pbtxt

output/gpt3_1.3b/1-gpu:
config.ini
merges.txt
model.final_layernorm.bias.bin
model.final_layernorm.weight.bin
...

Model deployment

Now that the model has been converted to a model repository, it can be deployed with Triton Inference Server. Do this using the NeMo Megatron Inference container, which has NVIDIA Triton built in.

By default, NVIDIA Triton uses three ports for HTTP, gRPC, and metric requests. To deploy the model, run the following command:

docker run --rm \
    --name triton-inference-server \
    -d \
    --gpus all \
    -p 8000-8002:8000-8002 \
    -v /path/to/checkpoints/output:/model_repository \
    nvcr.io/ea-bignlp/bignlp-inference:22.08-py3 \
    bash -c 'export CUDA_VISIBLE_DEVICES=0 && \
    tritonserver --model-repository /model_repository'

The following list describes a few important parameters and their functions:

  • -d: This tells Docker to run the container in the background. The server remains online and available for requests until the container is stopped (see the note after this list).
  • -p 8000-8002:8000-8002: NVIDIA Triton communicates using ports 8000 for HTTP requests, 8001 for gRPC requests, and 8002 for metrics information. These ports are mapped from the container to the host, allowing the host to handle requests directly and route them to the container.
  • -v /path/to/checkpoints/output:/model_repository: Specify the location where the converted checkpoints were saved to on the machine. This should match the model repository location from the conversion step earlier.
  • nvcr.io/ea-bignlp/bignlp-inference:22.08-py3: If a newer version exists on NGC, replace the image tag with the newer version.
  • export CUDA_VISIBLE_DEVICES=0: Specify which devices to use. If the model was converted to use multiple GPUs earlier, this should be a comma-separated list of the GPUs up to the desired number. For example, if you are using four GPUs, this should be CUDA_VISIBLE_DEVICES=0,1,2,3.
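
Later, when the deployment is no longer needed, the server can be shut down by stopping the container using the name assigned with --name:

$ docker stop triton-inference-server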

To verify that the container was launched successfully, run docker ps, which should show output similar to the following:

CONTAINER ID   IMAGE                                          COMMAND                  CREATED              STATUS              PORTS                                                           NAMES
f25cf23b75b7   nvcr.io/ea-bignlp/bignlp-inference:22.08-py3   "/opt/nvidia/nvidia_…"   About a minute ago   Up About a minute   0.0.0.0:8000-8002->8000-8002/tcp, :::8000-8002->8000-8002/tcp   triton-inference-server

Check the logs to see if the model was deployed and ready for requests (output truncated for brevity).

$ docker logs triton-inference-server
I0928 14:29:34.011299 1 server.cc:629] 
+-----------+---------+--------+
| Model     | Version | Status |
+-----------+---------+--------+
| gpt3_1.3b | 1       | READY  |
+-----------+---------+--------+

I0928 14:29:34.131430 1 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA A100-SXM4-80GB
I0928 14:29:34.132280 1 tritonserver.cc:2176] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                       |
| server_version                   | 2.24.0                                                                                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | /model_repository                                                                                                                                                                            |
| model_control_mode               | MODE_NONE                                                                                                                                                                                    |
| strict_model_config              | 0                                                                                                                                                                                            |
| rate_limit                       | OFF                                                                                                                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                    |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                     |
| response_cache_byte_size         | 0                                                                                                                                                                                            |
| min_supported_compute_capability | 6.0                                                                                                                                                                                          |
| strict_readiness                 | 1                                                                                                                                                                                            |
| exit_timeout                     | 30                                                                                                                                                                                           |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0928 14:29:34.133520 1 grpc_server.cc:4608] Started GRPCInferenceService at 0.0.0.0:8001
I0928 14:29:34.133751 1 http_server.cc:3312] Started HTTPService at 0.0.0.0:8000
I0928 14:29:34.174655 1 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002

If the output is similar to what’s shown here, the model is ready to receive inference requests.
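
Optionally, the same readiness and model information can be queried over HTTP through Triton's standard endpoints on the ports mapped earlier. The readiness check should return HTTP status 200, the model endpoint returns the model's metadata, and port 8002 serves Prometheus-formatted metrics:

$ curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
$ curl -s localhost:8000/v2/models/gpt3_1.3b
$ curl -s localhost:8002/metrics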

Sending inference requests

With a local Triton Inference Server running, you can start sending inference requests to the server. NVIDIA Triton’s client API supports multiple languages including Python, Java, and C++. For the purposes of this post, we provide a sample Python application.

from argparse import ArgumentParser
import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype
from transformers import GPT2Tokenizer

def fill_input(name, data):
    # Wrap a NumPy array in a Triton InferInput with a matching shape and datatype.
    infer_input = httpclient.InferInput(name, data.shape, np_to_triton_dtype(data.dtype))
    infer_input.set_data_from_numpy(data)
    return infer_input

def build_request(query, host, output):
    with httpclient.InferenceServerClient(host) as client:
        request_data = []
        # Tokenized prompt, its length, and the requested number of output tokens.
        request = np.array([query]).astype(np.uint32)
        request_len = np.array([[len(query)]]).astype(np.uint32)
        request_output_len = np.array([[output]]).astype(np.uint32)
        # Sampling settings: top_k=1 always selects the most likely token (greedy decoding).
        top_k = np.array([[1]]).astype(np.uint32)
        top_p = np.array([[0.0]]).astype(np.float32)
        temperature = np.array([[1.0]]).astype(np.float32)

        request_data.append(fill_input('input_ids', request))
        request_data.append(fill_input('input_lengths', request_len))
        request_data.append(fill_input('request_output_len', request_output_len))
        request_data.append(fill_input('runtime_top_k', top_k))
        request_data.append(fill_input('runtime_top_p', top_p))
        request_data.append(fill_input('temperature', temperature))
        # Send the request to the deployed gpt3_1.3b model and return the generated token IDs.
        result = client.infer('gpt3_1.3b', request_data)
        output = result.as_numpy('output_ids').squeeze()
        return output

def main():
    parser = ArgumentParser('Simple Triton Inference Requestor')
    parser.add_argument('query', type=str, help='Enter a text query to send to '
                        'the Triton Inference Server in quotes.')
    parser.add_argument('--output-length', type=int, help='Specify the desired '
                        'length for output.', default=30)
    parser.add_argument('--server', type=str, help='Specify the host:port that '
                        'Triton is listening on. Defaults to localhost:8000',
                        default='localhost:8000')
    args = parser.parse_args()

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    query = tokenizer(args.query).input_ids
    request = build_request(query, args.server, args.output_length)
    print(tokenizer.decode(request))

if __name__ == '__main__':
    main()

At a high level, the script does the following:

  1. Takes an input request from the user, such as, “Hello there! How are you today?”
  2. Tokenizes the input using a pretrained GPT-2 tokenizer from HuggingFace.
  3. Builds an inference request using several required and optional parameters, such as request, temperature, output length, and so on.
  4. Sends the request to NVIDIA Triton.
  5. Decodes the response using the tokenizer from earlier.

To run the code, several Python dependencies are required. These packages can be installed by running the following command:

$ pip3 install numpy tritonclient[http] transformers

After the dependencies are installed, save the code to a local file and name it infer.py. Next, run the application as follows:

$ python3 infer.py "1 2 3 4 5 6"

This sends the prompt “1 2 3 4 5 6” to the local inference server and should output the following to complete the sequence up to the default response token limit of 30:

"1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36"

The server can now respond to any HTTP requests following this basic pattern and can support multiple concurrent requests, both local and remote.
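
For example, here is a minimal sketch of sending several prompts concurrently from Python. It assumes the sample client above was saved as infer.py in the current directory; the thread pool and the example prompts are illustrative and not part of the NeMo Megatron tooling.

from concurrent.futures import ThreadPoolExecutor

from transformers import GPT2Tokenizer

from infer import build_request  # the sample client above, saved as infer.py

HOST = 'localhost:8000'
OUTPUT_LENGTH = 30
PROMPTS = ['1 2 3 4 5 6', 'Hello there! How are you today?', 'The quick brown fox']

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def run(prompt):
    # Tokenize the prompt, send one request, and decode the generated token IDs.
    token_ids = tokenizer(prompt).input_ids
    output_ids = build_request(token_ids, HOST, OUTPUT_LENGTH)
    return tokenizer.decode(output_ids)

# Each call to build_request opens its own HTTP connection, so the worker
# threads share no client state and the requests overlap on the server.
with ThreadPoolExecutor(max_workers=len(PROMPTS)) as pool:
    for prompt, completion in zip(PROMPTS, pool.map(run, PROMPTS)):
        print(f'{prompt!r} -> {completion!r}')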

Summary

Large language models are powering a growing number of applications. With the public release of several NeMo Megatron models, it’s now possible to deploy trained models locally.

This post outlined how to deploy public NeMo Megatron models using a simple Python script. You can test more robust models and use cases by downloading the larger models hosted on HuggingFace.

For more information about using NeMo Megatron, see the NeMo Megatron documentation and NVIDIA/nemo GitHub repo.

Categories
Misc

Explainer: What Are Graph Neural Networks?

GNNs apply the predictive power of deep learning to rich data structures that depict objects and their relationships as points connected by lines in a graph.

Categories
Offsites

Researchers thought this was a bug (Borwein integrals)