
Build Tools for the 3D World with the Extend the Omniverse Contest

Announcing our first Omniverse developer contest for building an Omniverse Extension. Show us how you’re extending Omniverse to transform 3D workflows and virtual worldbuilding.

Developers across industries are building 3D tools and applications to help teams create virtual worlds in art, design, manufacturing, and more. NVIDIA Omniverse, an extensible platform for full-fidelity design, simulation, and USD-based workflow development, has an ever-growing ecosystem of developers building Python-based extensions. In the past, we’ve launched contests for building breathtaking 3D simulations using the Omniverse Create app.

Today, we’re announcing our first NVIDIA Omniverse contest specifically for developers, engineers, technical artists, hobbyists, and researchers to develop Python tools for 3D worlds. The contest runs from July 11 to August 19, 2022. The overall winner will be awarded an NVIDIA RTX A6000, and the runners-up in each category will win a GeForce RTX 3090 Ti.

The challenge? Build an Omniverse Extension using Omniverse Kit and Omniverse Code, the developer-centric Omniverse application. For the Extend the Omniverse contest, contestants can create Python extensions in one of the following categories:

  • Layout and scene authoring tools
  • Omni.ui with Omniverse Kit
  • Scene modifier and manipulator tools

Layout and scene authoring tools

The demand for 3D content and environments is growing exponentially. Layout and scene authoring tools help scale workflows for world-building, leveraging rules-based algorithms and AI to generate assets procedurally.

Instead of tediously placing every component by hand, creators can paint in broader strokes and automatically generate physical objects like books, lamps, or fences to populate a scene. With the ability to iterate layout and scenes more freely, creators can accelerate their workflows and free up time to focus on creativity. 

Universal Scene Description (USD) is at the foundation of the layout and scene authoring tools contestants can develop in Omniverse. The powerful, easily extensible scene description handles incredibly large 3D datasets without skipping a beat, enabling creation, editing, querying, rendering, and collaboration in 3D worlds.

Video 1. How to build a tool using Omniverse Code that programmatically creates a scene
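Video 1 walks through the full workflow. As a small, hedged illustration of what programmatic scene authoring with the USD Python API can look like (the stage path, prim names, and spacing below are arbitrary, not code from the contest materials):

```python
from pxr import Gf, Usd, UsdGeom

# Create a new USD stage and author a simple layout programmatically.
stage = Usd.Stage.CreateNew("generated_scene.usda")
UsdGeom.Xform.Define(stage, "/World")

# Place a row of cubes; a real layout tool might drive placement from rules or AI.
for i in range(5):
    cube = UsdGeom.Cube.Define(stage, f"/World/Cube_{i}")
    cube.GetSizeAttr().Set(50.0)
    cube.AddTranslateOp().Set(Gf.Vec3d(i * 100.0, 0.0, 0.0))

stage.GetRootLayer().Save()
```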

Omni.ui with Omniverse Kit

Well-crafted user interfaces provide a superior experience for artists and developers alike. They can boost productivity and enable nontechnical and technical users to harness the power of complex algorithms. 

Building custom user interfaces has never been simpler than with Omni.ui, Omniverse’s UI toolkit for creating beautiful and flexible graphical user interfaces. Omni.ui was designed using modern asynchronous technologies and UI design patterns to be reactive and responsive.

Using Omniverse Kit, you can deeply customize the final look of applications with widgets for creating visual components, receiving user input, and creating data models. With its style sheet architecture that feels akin to HTML or CSS, you can change the look of your widgets or create a new color scheme for an entire app.

Existing widgets can be combined and new ones can be defined to build the interface that you’ve always wanted. These extensions can range from floating panels in the navigation bar to markup tools in Omniverse View and Showroom. You can also create data models, views, and delegates to build robust and flexible interfaces.

Video 2. How to use Omniverse Kit and Omni.ui, the toolkit to create custom UIs in Python
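Video 2 covers the workflow in detail. As a rough, hedged sketch of the shape of a Kit extension that opens an omni.ui window (the window title, labels, button, and style values are placeholders, not code from the contest materials):

```python
import omni.ext
import omni.ui as ui


class ExampleWindowExtension(omni.ext.IExt):
    """A minimal Kit extension that opens a small omni.ui window."""

    def on_startup(self, ext_id):
        # Create a floating window with a simple vertical layout.
        self._window = ui.Window("Example Tools", width=300, height=120)
        with self._window.frame:
            with ui.VStack(spacing=4):
                ui.Label("Hello from omni.ui", style={"font_size": 18})
                ui.Button("Do Something", clicked_fn=self._on_click)

    def _on_click(self):
        print("Button clicked")

    def on_shutdown(self):
        # Release the window when the extension unloads.
        self._window = None
```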

Scene modifier and manipulator tools

Scene modifier and manipulator tools offer new ways for artists to interact with their scenes. Whether they’re changing an object’s geometry, adjusting a scene’s lighting, or creating animations, these tools enable artists to modify and manipulate scenes with limited manual work.

Using omni.ui.scene, Omniverse’s low-code module for building UIs in 3D space, you can develop 3D widgets and manipulators to create and move shapes in a 3D projected scene with Python. Many primitive objects are available, including text, image, rectangle, arc, line, curve, and mesh, with more regularly being added.

Video 3. How to build a scene modifier tool in Omniverse
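As a rough sketch of what a minimal omni.ui.scene manipulator can look like (class names and argument signatures should be checked against the omni.ui.scene documentation; this is illustrative, not code from the contest materials):

```python
from omni.ui import scene as sc
from omni.ui import color as cl


class AxisManipulator(sc.Manipulator):
    """Illustrative manipulator that draws X/Y/Z axis lines at the origin."""

    def on_build(self):
        # Each call adds a line item to the manipulator's scene graph.
        sc.Line([0, 0, 0], [1, 0, 0], color=cl.red, thickness=2)
        sc.Line([0, 0, 0], [0, 1, 0], color=cl.green, thickness=2)
        sc.Line([0, 0, 0], [0, 0, 1], color=cl.blue, thickness=2)
```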

We can’t wait to see what extensions you’ll create to contribute to the ecosystem of extensions that are expanding what’s possible in the Omniverse. Read more about the contest, or watch the video below for a step-by-step guide on how to enter. You can also visit the GitHub contest page for sample code and other resources to get started. 

Video 4. How to submit to the contest

Don’t miss these upcoming events:

  • Join the Omniverse community on Discord July 13, 2022 for the Getting Started – #ExtendOmniverse Developer Contest livestream
  • Join us at SIGGRAPH for hands-on developer labs where you can learn how to build extensions in Omniverse.  

Learn more in the Omniverse Resource Center, which details how developers can build custom applications and extensions for the platform. 

Follow Omniverse on Instagram, Twitter, YouTube, and Medium for additional resources and inspiration. Check out the Omniverse forums and join our Discord Server to chat with the community.


Advancing Robotic Assembly with a Novel Simulation Approach Using NVIDIA Isaac

A breakthrough in the simulation and learning of contact-rich interactions provides tools and methods to accelerate robotic assembly and simulation research.

NVIDIA robotics and simulation researchers presented Factory: Fast Contact for Robotic Assembly at the 2022 Robotics: Science and Systems (RSS) conference. This work is a novel breakthrough in the simulation and learning of contact-rich interactions, which are ubiquitous in robotics research. Its aim is to greatly accelerate research and development in robotic assembly, as well as serve as a powerful tool for contact-rich simulation of any kind.

Robotic assembly: What, why, and challenges

Assembly is essential across the automotive, aerospace, electronics, and medical industries. Examples include tightening nuts and bolts, soldering, peg insertion, and cable routing.

However, robotic assembly remains one of the oldest and most challenging tasks in robotics. It has been exceptionally difficult to automate because of the physical complexity of the tasks, part variability, and the high accuracy and reliability requirements.

In industry, robotic assembly methods may achieve high precision, accuracy, and reliability but often require expensive equipment and custom fixtures that can be time-consuming to set up and maintain (preprogrammed trajectories and careful tuning, for example). Tasks that involve robustness to variation (part types, appearance, and locations) and complex manipulation are frequently done using manual labor.

Research methods may achieve lower cost, higher adaptivity, and improved robustness but are often less reliable and slower.

Simulation: A tool for solving the challenges in robotic assembly

Simulation has been used for decades to verify, validate, and optimize robot designs and algorithms in robotics. This includes ensuring the safety of deploying these algorithms. It has also been used to generate large-scale datasets for deep learning, perform system identification, and develop planning and control methods.

In reinforcement learning (RL) research, we have recently seen how simulation results can be transferred to a real system. The importance of accurate physics simulation for robotics development cannot be overemphasized.

Figure 1. ANYmal Demo Training in NVIDIA Isaac Gym, a high-performance GPU-accelerated physics simulator for robot learning

Physics-based simulators like MuJoCo and NVIDIA Isaac Gym have been used to train virtual agents to perform manipulation and locomotion tasks, such as solving a Rubik’s Cube or walking on uneven terrain using ANYmal. The policies have successfully transferred to real-world robots.

However, the power of a fast and accurate simulator has not substantially impacted robotic assembly. Developing such simulators for complex bodies with different variations and motions is a difficult task.

For example, a simple nut-and-bolt assembly requires more than pure helical motion. There are finite clearances between the threads of the nut and bolt, which allow the nut to move with six degrees of freedom. Even humans require some level of carefulness to ensure that the nut has proper initial alignment with the bolt and does not get stuck during tightening. 

However, simulating the task with traditional methods may require meshes with tens of thousands of triangles. Detecting collisions between these meshes, generating contact points and normals, and solving non-penetration constraints are major computational challenges.

Although threaded fasteners are abundant in the world, no existing robotics simulator has been able to simulate even a single nut-and-bolt assembly in real time, that is, as fast as the underlying physical dynamics.

In Factory, the researchers developed methods to overcome the challenges in robotic assembly and other contact-rich interactions.

What is Factory?

Factory (Fast Contact for Robotic Assembly) is a set of physics simulation methods and robot learning tools for achieving real-time and faster simulation of a wide range of contact-rich interactions. One of the Factory applications is robotic assembly.

Factory offers the following central contributions:

  • A set of methods for fast, accurate physical simulation of contact-rich interactions through a novel GPU-based synthesis of signed distance function (SDF)-based collisions, contact reduction, and a Gauss-Seidel solver.
  • A robot learning suite consisting of: 
    • 60 high-quality assets, including a Franka robot and all rigid-body assemblies from the NIST Assembly Task Board 1, the established benchmark for robotic assembly
    • Three Isaac Gym-style learning environments for robotic assembly
    • Seven classical robot controllers
  • Proof-of-concept reinforcement learning policies for robots performing contact-rich tasks (a simulated Franka Robot solving the most contact-rich task on the NIST board, nut-and-bolt assembly)

The physics simulation methods in the Factory paper have been integrated into the PhysX physics engine used by Isaac Gym. The asset suite and reinforcement learning policies are available with the latest version of Isaac Gym and the Isaac Gym Environments GitHub repo. The simulation methods are also available in the Omniverse Isaac Sim simulator, with reinforcement learning examples coming later this summer.

Simulation methods and results

Using fast GPU-based implementations of SDF collisions, contact reduction algorithms that cut down the contacts produced by those collisions, and custom numerical solvers, the researchers were able to simulate not just a single M16 nut and bolt in real time, but 1,024 of them in parallel environments, still in real time. This is essentially 20,000x faster than the prior state of the art.

The researchers demonstrated the simulator’s performance in a wide range of challenging scenes, including the following:

  • 512 bowls falling into a pile in the same environment
  • A pile of nuts fed into a feeder mechanism vibrating at 60 Hz
  • A Franka robot executing a hand-scripted trajectory to grasp and tighten a nut onto a bolt, with 128 instances of this environment executing in real time
Figure 2. The M16 nut-and-bolt assemblies scene, consisting of 1,024 parallel nut-and-bolt interactions executing in real time
Figure 3. The Franka robot plus M16 nut-and-bolt assemblies scene, consisting of 128 parallel Franka robots retrieving nuts from a vibratory feeder mechanism and tightening them onto a bolt
Figure 4. 1,024 bowls falling into a pile in the same environment, executing in real time

Robot learning tools

The most established benchmark for robotic assembly is the NIST assembly task board, the focus of an annual robotics competition since 2017. The NIST Task Board 1 consists of 38 unique parts. However, the provided CAD models are not ideal for physics simulation: they lack real-world clearances, contain interferences between parts, rely on hand-derived measurements, and so on. Realistic models are hard to find.

Figure 5. The real NIST Task Board 1, with assembly components such as a nut and bolt, USB connector, and Wi-Fi connector. Compare to the simulated board in Figure 6.

Factory uses 60 high-quality, simulation-ready part models, each with an Onshape CAD model, one or more OBJ meshes, a URDF description, and estimated material properties. The models conform to international standards (ISO 724, ISO 965, and ISO 286) or are based on models sourced from manufacturers. They include all parts on the NIST Assembly Task Board 1, with dimensional variations that span real-world tolerance bands. Clearance between parts ranges from 0 to a maximum of 2.66 mm, with many parts within the 0.1-0.5 mm range.

Figure 6. Rendering of a simulated NIST Task Board 1, demonstrating the provided assets

Factory provides three robotic assembly scenes for Isaac Gym that can be used for developing planning and control algorithms, collecting simulated sensor data for supervised learning, and training RL agents. Each scene contains a Franka robot and disassembled assemblies from the NIST Task Board 1.

The assets can be randomized in types and locations across all environments. All scenes have been tested with up to 128 simultaneous environments on an NVIDIA RTX 3090 GPU. The scenes are shown below:

Figure 7. Factory robotic assembly environments: a robot arm with a nut and bolt, with a USB connector, and with gears

The seven robot controllers available in the learning environments include a joint-space inverse differential kinematics (IK) motion controller, a joint-space inverse dynamics (ID) controller, a task-space impedance controller, an operational space motion controller, an open-loop force controller, a closed-loop proportional force controller, and a hybrid force-motion controller.
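The controller implementations themselves ship with the Factory release; purely as a hedged illustration of what one of them (a task-space impedance controller) computes, a NumPy sketch follows. The function name and gain values are placeholders, not the Factory code.

```python
import numpy as np


def task_space_impedance_torque(jacobian, pose_error, ee_velocity, kp=200.0, kd=30.0):
    """Illustrative task-space impedance law: a virtual spring-damper at the
    end effector, mapped to joint torques through the Jacobian transpose.

    jacobian:     (6, n_dof) geometric Jacobian of the end effector
    pose_error:   (6,) desired-minus-current position/orientation error
    ee_velocity:  (6,) current end-effector linear/angular velocity
    """
    wrench = kp * pose_error - kd * ee_velocity  # desired spring-damper wrench
    return jacobian.T @ wrench                   # joint torques (gravity compensation omitted)
```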

The researchers intend for the models, environments, and controllers to grow continuously through contributions from themselves and the community.

Proof-of-concept RL policies

Factory employs GPU-accelerated on-policy RL to solve the most contact-rich task on NIST Task Board 1: assembling a nut onto a bolt. Like many assembly tasks, such a procedure is long-horizon and challenging to learn end-to-end. The problem was separated into three phases:

  1. Pick: The robot grasps the nut with a parallel-jaw gripper from a random location on a work surface.
  2. Place: The robot transports the nut to the top of a bolt fixed to the surface.
  3. Screw: The robot brings the nut into contact with the bolt, engages the mating threads, and tightens the nut until it contacts the base of the bolt head.
Figure 8. A trained robot arm picking up a nut, one of the achieved goal states of the trained subpolicies for FrankaNutBoltEnv
Figure 9. A trained robot arm placing a nut onto a bolt, one of the achieved goal states of the trained subpolicies for FrankaNutBoltEnv
Figure 10. A trained robot arm screwing a nut onto a bolt, one of the achieved goal states of the trained subpolicies for FrankaNutBoltEnv
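Conceptually, the three trained subpolicies are chained at execution time. A hedged sketch of that chaining, with hypothetical policy and environment objects rather than the actual Factory or Isaac Gym interfaces:

```python
def run_nut_bolt_assembly(env, pick_policy, place_policy, screw_policy):
    """Illustrative chaining of the Pick, Place, and Screw subpolicies.
    The environment, policies, and phase checks are placeholders."""
    obs = env.reset()
    phases = [(pick_policy, env.nut_grasped),
              (place_policy, env.nut_above_bolt),
              (screw_policy, env.nut_tightened)]
    for policy, phase_done in phases:
        while not phase_done():
            action = policy(obs)            # each subpolicy maps observations to actions
            obs, reward, done, info = env.step(action)
```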

Training was done on a single GPU. Large randomizations were applied to the initial positions and orientations of the objects, and batches of 3-4 policies were trained simultaneously using proximal policy optimization (PPO). Each batch took 1-1.5 hours to train, and each subpolicy was trained over 128 environments with a maximum of 1,024 policy updates to allow rapid experimentation. The success rate at test time was 98.4%.

Finally, to evaluate the potential for sim-to-real transfer (transferring the policy learned in simulation to real-world robotics systems), the researchers compared the contact forces generated during these interactions in simulation to contact forces measured in the real world by humans performing the same task with a wrench. For more information, see the R-PAL Daily Interactive Manipulation (DIM) dataset.

The figure below shows that the histogram of the simulated Fasten Nut contact forces lies in the middle of the histogram of the real Fasten Nut forces, indicating strong consistency with real-world values.

Figure 11. Comparison of simulated contact forces during screw subpolicy execution with analogous real-world contact forces from the Daily Interactive Manipulation (DIM) dataset

Conclusion and future directions

Although Factory was developed with robotic assembly as a motivating application, there are no limitations on using the methods for entirely different tasks within robotics, such as grasping complex non-convex shapes in home environments, locomotion on uneven outdoor terrain, and non-prehensile manipulation of aggregates of objects.

The future direction of this work is to realize full end-to-end simulation for complex physical interactions, including techniques for efficiently transferring the trained policies to real-world robotic systems. This can potentially minimize cost and risk, improve safety, and achieve efficient behaviors.

One day, every advanced industrial manufacturing robot might be trained in simulation using such techniques for seamless transfer to the real world.

Towards this end, NVIDIA developers are working to refine the physics simulation methods used by the Factory research so that they can be used within Omniverse Isaac Sim. Limited functionality is already present, and will become more robust over time.

Get started with Factory

The Factory simulation methods, asset suite, and proof-of-concept RL policies are available today through the latest version of Isaac Gym and the Isaac Gym Environments GitHub repo, and the Factory RL environments will also be available in future versions of the Omniverse Isaac Gym Environments for Isaac Sim.


Deep Hierarchical Planning from Pixels

Research into how artificial agents can make decisions has evolved rapidly through advances in deep reinforcement learning. Compared to generative ML models like GPT-3 and Imagen, artificial agents can directly influence their environment through actions, such as moving a robot arm based on camera inputs or clicking a button in a web browser. While artificial agents have the potential to be increasingly helpful to people, current methods are held back by the need to receive detailed feedback in the form of frequently provided rewards to learn successful strategies. For example, despite large computational budgets, even powerful programs such as AlphaGo are limited to a few hundred moves until receiving their next reward.

In contrast, complex tasks like making a meal require decision making at all levels, from planning the menu, navigating to the store to pick up groceries, and following the recipe in the kitchen to properly executing the fine motor skills needed at each step along the way based on high-dimensional sensory inputs. Hierarchical reinforcement learning (HRL) promises to automatically break down such complex tasks into manageable subgoals, enabling artificial agents to solve tasks more autonomously from fewer rewards, also known as sparse rewards. However, research progress on HRL has proven to be challenging; current methods rely on manually specified goal spaces or subtasks, and no general solution exists.

To spur progress on this research challenge and in collaboration with the University of California, Berkeley, we present the Director agent, which learns practical, general, and interpretable hierarchical behaviors from raw pixels. Director trains a manager policy to propose subgoals within the latent space of a learned world model and trains a worker policy to achieve these goals. Despite operating on latent representations, we can decode Director’s internal subgoals into images to inspect and interpret its decisions. We evaluate Director across several benchmarks, showing that it learns diverse hierarchical strategies and enables solving tasks with very sparse rewards where previous approaches fail, such as exploring 3D mazes with quadruped robots directly from first-person pixel inputs.

Director learns to solve complex long-horizon tasks by automatically breaking them down into subgoals. Each panel shows the environment interaction on the left and the decoded internal goals on the right.

How Director Works
Director learns a world model from pixels that enables efficient planning in a latent space. The world model maps images to model states and then predicts future model states given potential actions. From predicted trajectories of model states, Director optimizes two policies: The manager chooses a new goal every fixed number of steps, and the worker learns to achieve the goals through low-level actions. However, choosing goals directly in the high-dimensional continuous representation space of the world model would be a challenging control problem for the manager. Instead, we learn a goal autoencoder to compress the model states into smaller discrete codes. The manager then selects discrete codes and the goal autoencoder turns them into model states before passing them as goals to the worker.

Left: The goal autoencoder (blue) compresses the world model (green) state (st) into discrete codes (z). Right: The manager policy (orange) selects a code that the goal decoder (blue) turns into a feature space goal (g). The worker policy (red) learns to achieve the goal from future trajectories (s1, …, s4) predicted by the world model.

All components of Director are optimized concurrently, so the manager learns to select goals that are achievable by the worker. The manager learns to select goals to maximize both the task reward and an exploration bonus, leading the agent to explore and steer towards remote parts of the environment. We found that preferring model states where the goal autoencoder incurs high prediction error is a simple and effective exploration bonus. Unlike prior methods, such as Feudal Networks, our worker receives no task reward and learns purely from maximizing the feature space similarity between the current model state and the goal. This means the worker has no knowledge of the task and instead concentrates all its capacity on achieving goals.
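A simplified sketch of this decision loop (with made-up object and method names, not the actual Director code) might look like the following:

```python
def director_episode(env, world_model, manager, goal_decoder, worker, goal_every=8):
    """Illustrative control loop: the manager picks a discrete goal code every
    `goal_every` steps, the goal decoder turns it into a feature-space goal,
    and the worker acts to bring the current model state close to that goal."""
    obs = env.reset()
    state = world_model.encode(obs)              # map image observation to latent model state
    goal = None
    for t in range(1000):
        if t % goal_every == 0:
            code = manager.select_code(state)    # discrete code, trained on task + exploration reward
            goal = goal_decoder.decode(code)     # feature-space goal passed to the worker
        action = worker.act(state, goal)         # worker maximizes state-goal similarity
        obs, task_reward, done, _ = env.step(action)
        state = world_model.encode(obs)
        if done:
            break
```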

Benchmark Results
Whereas prior work in HRL often resorted to custom evaluation protocols — such as assuming diverse practice goals, access to the agents’ global position on a 2D map, or ground-truth distance rewards — Director operates in the end-to-end RL setting. To test the ability to explore and solve long-horizon tasks, we propose the challenging Egocentric Ant Maze benchmark. This challenging suite of tasks requires finding and reaching goals in 3D mazes by controlling the joints of a quadruped robot, given only proprioceptive and first-person camera inputs. The sparse reward is given when the robot reaches the goal, so the agents have to autonomously explore in the absence of task rewards throughout most of their learning.

The Egocentric Ant Maze benchmark measures the ability of agents to explore in a temporally-abstract manner to find the sparse reward at the end of the maze.

We evaluate Director against two state-of-the-art algorithms that are also based on world models: Plan2Explore, which maximizes both task reward and an exploration bonus based on ensemble disagreement, and Dreamer, which simply maximizes the task reward. Both baselines learn non-hierarchical policies from imagined trajectories of the world model. We find that Plan2Explore results in noisy movements that flip the robot onto its back, preventing it from reaching the goal. Dreamer reaches the goal in the smallest maze but fails to explore the larger mazes. In these larger mazes, Director is the only method to find and reliably reach the goal.

To study the ability of agents to discover very sparse rewards in isolation and separately from the challenge of representation learning of 3D environments, we propose the Visual Pin Pad suite. In these tasks, the agent controls a black square, moving it around to step on differently colored pads. At the bottom of the screen, the history of previously activated pads is shown, removing the need for long-term memory. The task is to discover the correct sequence for activating all the pads, at which point the agent receives the sparse reward. Again, Director outperforms previous methods by a large margin.

The Visual Pin Pad benchmark allows researchers to evaluate agents under very sparse rewards and without confounding challenges such as perceiving 3D scenes or long-term memory.

In addition to solving tasks with sparse rewards, we study Director’s performance on a wide range of tasks common in the literature that typically require no long-term exploration. Our experiment includes 12 tasks that cover Atari games, Control Suite tasks, DMLab maze environments, and the research platform Crafter. We find that Director succeeds across all these tasks with the same hyperparameters, demonstrating the robustness of the hierarchy learning process. Additionally, providing the task reward to the worker enables Director to learn precise movements for the task, fully matching or exceeding the performance of the state-of-the-art Dreamer algorithm.

Director solves a wide range of standard tasks with dense rewards with the same hyperparameters, demonstrating the robustness of the hierarchy learning process.

Goal Visualizations
While Director uses latent model states as goals, the learned world model allows us to decode these goals into images for human interpretation. We visualize the internal goals of Director for multiple environments to gain insights into its decision making and find that Director learns diverse strategies for breaking down long-horizon tasks. For example, on the Walker and Humanoid tasks, the manager requests a forward leaning pose and shifting floor patterns, with the worker filling in the details of how the legs need to move. In the Egocentric Ant Maze, the manager steers the ant robot by requesting a sequence of different wall colors. In the 2D research platform Crafter, the manager requests resource collection and tools via the inventory display at the bottom of the screen, and in DMLab mazes, the manager encourages the worker via the teleport animation that occurs right after collecting the desired object.

Left: In Egocentric Ant Maze XL, the manager directs the worker through the maze by targeting walls of different colors. Right: In Visual Pin Pad Six, the manager specifies subgoals via the history display at the bottom and by highlighting different pads.
Left: In Walker, the manager requests a forward leaning pose with both feet off the ground and a shifting floor pattern, with the worker filling in the details of leg movement. Right: In the challenging Humanoid task, Director learns to stand up and walk reliably from pixels and without early episode terminations.
Left: In Crafter, the manager requests resource collection via the inventory display at the bottom of the screen. Right: In DMLab Goals Small, the manager requests the teleport animation that occurs when receiving a reward as a way to communicate the task to the worker.

Future Directions
We see Director as a step forward in HRL research and are preparing its code to be released in the future. Director is a practical, interpretable, and generally applicable algorithm that provides an effective starting point for the research community's future development of hierarchical artificial agents, for example by allowing goals to correspond to only subsets of the full representation vectors, dynamically learning the duration of the goals, and building hierarchical agents with three or more levels of temporal abstraction. We are optimistic that future algorithmic advances in HRL will unlock new levels of performance and autonomy of intelligent agents.


No Fueling Around: Designers Collaborate in Extended Reality on Porsche Electric Race Car

A one-of-a-kind electric race car revved to life before it was manufactured — or even prototyped — thanks to GPU-powered extended reality technology. At the Automotive Innovation Forum in May, NVIDIA worked with Autodesk VRED to showcase a photorealistic Porsche electric sports car in augmented reality, with multiple attendees collaborating in the same immersive environment. Read article >



Designing Arithmetic Circuits with Deep Reinforcement Learning

Learn how NVIDIA researchers use AI to design better arithmetic circuits that power our AI chips.

As Moore’s law slows down, it becomes increasingly important to develop other techniques that improve the performance of a chip at the same technology process node. Our approach uses AI to design smaller, faster, and more efficient circuits to deliver more performance with each chip generation.

Vast arrays of arithmetic circuits have powered NVIDIA GPUs to achieve unprecedented acceleration for AI, high-performance computing, and computer graphics. Thus, improving the design of these arithmetic circuits would be critical in improving the performance and efficiency of GPUs.

What if AI could learn to design these circuits? In PrefixRL: Optimization of Parallel Prefix Circuits using Deep Reinforcement Learning, we demonstrate that not only can AI learn to design these circuits from scratch, but AI-designed circuits are also smaller and faster than those designed by state-of-the-art electronic design automation (EDA) tools. The latest NVIDIA Hopper GPU architecture has nearly 13,000 instances of AI-designed circuits.

Two circuit layouts are shown side by side. The layout on the left is smaller in height and width than the layout on the right.
Figure 1. 64b adder circuits designed by PrefixRL AI (left) are up to 25% smaller than those designed by a state-of-the-art EDA tool (right) while being just as fast and functionally equivalent

In Figure 1, the circuit corresponds to the (31.4µm², 0.186ns) point in the PrefixRL curve in Figure 5.

The circuit design game

Arithmetic circuits in computer chips are constructed using a network of logic gates (like NAND, NOR, and XOR) and wires. The desirable circuit should have the following characteristics:

  • Small: A lower area so that more circuits can fit on a chip.
  • Fast: A lower delay to improve the performance of the chip.
  • Power-efficient: A lower power consumption of the chip.

In our paper, we focus on circuit area and delay. We find that power consumption is well-correlated with area for our circuits of interest. Circuit area and delay are often competing properties, so we want to find the Pareto frontier of designs that effectively trades off these properties. Put simply, we desire the minimum area circuit at every delay.

In PrefixRL, we focus on a popular class of arithmetic circuits called (parallel) prefix circuits. Various important circuits in the GPU such as adders, incrementors, and encoders are prefix circuits that can be defined at a higher level as prefix graphs.

In this work, we specifically ask the question: can an AI agent design good prefix graphs? The state space of all prefix graphs is large, O(2^(n^2)), and cannot be explored using brute-force methods.

a block diagram showing the Qnetwork block observes a prefix graph with four nodes and proposes another node (3,2) to be added. The prefix graphs before and after the node addition are shown. There are arrows from both nodes with the label “circuit synthesis” pointing to corresponding layouts of synthesized circuits. The difference in area and delay between the synthesized circuits are labeled as reward.
Figure 2. One iteration of PrefixRL with a 4b circuit example

A prefix graph is converted into a circuit with wires and logic gates using a circuit generator. These generated circuits are then further optimized by a physical synthesis tool using physical synthesis optimizations such as gate sizing, duplication, and buffer insertion.

The final circuit properties (delay, area, and power) do not directly translate from the original prefix graph properties, such as level and node count, due to these physical synthesis optimizations. This is why the AI agent learns to design prefix graphs but optimizes for the properties of the final circuit generated from the prefix graph.

We pose arithmetic circuit design as a reinforcement learning (RL) task, where we train an agent to optimize the area and delay properties of arithmetic circuits. For prefix circuits, we design an environment where the RL agent can add or remove a node from the prefix graph, after which the following steps happen:

  1. The prefix graph is legalized to always maintain a correct prefix sum computation.
  2. A circuit is generated from the legalized prefix graph.
  3. The circuit undergoes physical synthesis optimizations using a physical synthesis tool.
  4. The area and delay properties of the circuit are measured.

During an episode, the RL agent builds up the prefix graph step-by-step by adding or removing nodes. At each step, the agent receives the improvement in the corresponding circuit area and delay as rewards.
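A hedged sketch of such an environment step (the circuit generator, synthesis call, and graph helpers below are placeholders, not NVIDIA's implementation):

```python
class PrefixGraphEnv:
    """Illustrative RL environment for prefix graph design."""

    def step(self, action):
        self.graph.apply(action)                       # add or remove the chosen node
        self.graph.legalize()                          # restore a valid prefix-sum computation
        circuit = generate_circuit(self.graph)         # prefix graph -> gates and wires
        area, delay = run_physical_synthesis(circuit)  # slow external synthesis call
        # Reward is the step-to-step improvement in area and delay; the two
        # components are kept separate so the agent can learn separate Q-values.
        reward = (self.prev_area - area, self.prev_delay - delay)
        self.prev_area, self.prev_delay = area, delay
        return self.graph.as_grid(), reward
```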

State and action representation and the deep reinforcement learning model

We use the Q-learning algorithm to train the circuit designer agent. We use a grid representation for prefix graphs where each element in the grid uniquely maps to a prefix node. This grid representation is used at both the input and output of the Q-network. Each element in the input grid represents whether a node is present or absent. Each element in the output grid represents the Q-values for adding or removing a node.

We use a fully convolutional neural network architecture for the agent as the input and output of the Q-learning agent are grid representations. The agent separately predicts the Q values for the area and delay properties because the rewards for area and delay are separately observable during training.

there is a left and right panel in this image. The left panel has two columns and three rows. Each row corresponds to a different prefix graph. The two columns show the graph structure and the corresponding grid representation respectively. The second row, for example, has the nodes (3:3), (2:2), (1:1), (0:0), (3:0), (2:0), (1:0), (3:2) in (msb:lsb) format. The grid representation plots these nodes on a grid where rows are msb and columns are lsbs. The graph representation has node (3:0) with parents (3:2) and (1:0), node (2:0) with parents (2:2) and (1:0), node (3:2) with parents (3:3) and (2:2), node (1:0) with parents (1:1) and (0:0). The right panel shows a block diagram of a neural network where blocks are layers. From input to output the blocks are CONV 3X3, STRIDE1, BATCHNORM, RELU, a few RESIDUAL blocks, CON1X1, STRIDE 1, BATCHNORM, LRELU, CONV1X1, STRIDE 1. The input to the neural network is the grid representation. The outputs are Q of {area, delay}X{add, delete} for nodes on the same grid representation.
Figure 3. Representations of certain 4b prefix graphs (left) and fully convolutional Q-learning agent architecture (right)
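A rough PyTorch sketch in the spirit of Figure 3, simplified (no residual blocks) and not the exact NVIDIA architecture:

```python
import torch.nn as nn


class PrefixQNetwork(nn.Module):
    """Fully convolutional Q-network over the prefix-node grid. Input: a grid
    marking which prefix nodes are present. Output: per-cell Q-values for
    {area, delay} x {add, remove}."""

    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )
        # Four output maps: Q_area/add, Q_area/remove, Q_delay/add, Q_delay/remove
        self.head = nn.Conv2d(channels, 4, kernel_size=1, stride=1)

    def forward(self, grid):
        # grid: (batch, 1, n, n) with 1 where a prefix node exists, 0 elsewhere
        return self.head(self.body(grid))
```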

Distributed training with Raptor

PrefixRL is a computationally demanding task: physical simulation required 256 CPUs for each GPU and training the 64b case took over 32,000 GPU hours.

We developed Raptor, an in-house distributed reinforcement learning platform that takes special advantage of NVIDIA hardware for this kind of industrial reinforcement learning (Figure 4).

Raptor has several features that enhance scalability and training speed such as job scheduling, custom networking, and GPU-aware data structures. In the context of PrefixRL, Raptor makes the distribution of work across a mix of CPUs, GPUs, and Spot instances possible.

Networking in this reinforcement learning application is diverse, and Raptor switches between the following mechanisms depending on the traffic:

  • NCCL for point-to-point transfers, to send model parameters directly from the learner GPU to inference GPUs.
  • Redis for asynchronous and smaller messages such as rewards or statistics.
  • A JIT-compiled RPC to handle high volume and low latency requests such as uploading experience data.

Finally, Raptor provides GPU-aware data structures such as a replay buffer that has a multithreaded server to receive experience from multiple workers, and batches data in parallel and prefetches it onto the GPU.

Figure 4 shows that our framework powers concurrent training and data collection, and takes advantage of NCCL to efficiently send actors the latest parameters.

A flow diagram with blocks for actors and optimizers on GPUs, an arrow showing NN parameter transfer using NCCL from optimizers to actors. Block for the environment has actions flow from actors and states flow to actors. States also flow to block with circuit synthesis CPU and synthesis cache. Action and states also flow to the experience buffer. Rewards from circuit synthesis flow to experience buffer. States, actions, and rewards are sampled from the experience buffer and flow to optimizers.
Figure 4. We use Raptor for decoupled and parallelized training and reward calculation to overcome circuit synthesis latency

Reward computation

We use a tradeoff weight w from [0,1] to combine the area and delay objectives. We train various agents with various weights to obtain a Pareto frontier of designs that balance the tradeoff between area and delay.

The physical synthesis optimizations in the RL environment can generate various solutions to tradeoff between area and delay. We should drive the physical synthesis tool with the same tradeoff weight for which a particular agent is trained.

Performing physical synthesis optimizations in the loop for reward computation has several advantages.

  • The RL agent learns to directly optimize the final circuit properties for a target technology node and library.
  • The RL agent can optimize the properties of the target arithmetic circuit and its surrounding logic jointly by including the surrounding logic during physical synthesis.

However, performing physical synthesis is a slow process (~35 seconds for 64b adders), which can greatly slow RL training and exploration.

We decouple reward calculation from the state update because the agent needs only the current prefix graph state to take actions, not the circuit synthesis results or previous rewards. Thanks to Raptor, we can offload the lengthy reward calculation onto a pool of CPU workers to perform physical synthesis in parallel, while actor agents step through the environment without needing to wait.

When rewards are returned by the CPU workers, the transitions can then be inserted into the replay buffer. Synthesis rewards are cached to avoid redundant computation whenever a state is reencountered.
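A minimal sketch of such a cache (keying on the set of prefix nodes is an assumption for illustration, not the actual implementation):

```python
class SynthesisRewardCache:
    """Illustrative cache so a previously seen prefix graph state does not
    trigger another slow synthesis run."""

    def __init__(self):
        self._cache = {}

    def get_or_compute(self, graph, synthesize_fn):
        key = frozenset(graph.nodes)                 # canonical, order-independent key
        if key not in self._cache:
            self._cache[key] = synthesize_fn(graph)  # (area, delay) from synthesis
        return self._cache[key]
```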

Results

The RL agents learn to design circuits tabula rasa, purely through feedback from synthesized circuit properties. Figure 5 shows the latest results*: 64b adder circuits designed by PrefixRL Pareto-dominate adder circuits from a state-of-the-art EDA tool in area and delay.

The best PrefixRL adder achieved a 25% lower area than the EDA tool adder at the same delay. The prefix graphs that map to Pareto-optimal adder circuits after physical synthesis optimization have irregular structures.

Animation with a fixed right panel showing Pareto curves of PrefixRL and EDA tool on area and delay axes. PrefixRL curve is lower area and delay throughout. Animated left panel displays various prefix graph architectures at different times and an arrow point to the corresponding point on the PrefixRL curve.
Figure 5. PrefixRL designs arithmetic circuits that are smaller and faster than circuits designed by a state-of-the-art EDA tool. (left) The circuit architectures; (right) the corresponding 64b adder circuit properties plots

Conclusion

To the best of our knowledge, this is the first method using a deep reinforcement learning agent to design arithmetic circuits. We hope that this method can be a blueprint for applying AI to real-world circuit design problems: constructing action spaces, state representations, RL agent models, optimizing for multiple competing objectives, and overcoming slow reward computation processes such as physical synthesis.

For more information and comparisons against other approaches, see PrefixRL: Optimization of Parallel Prefix Circuits using Deep Reinforcement Learning (preprint).


Mission-Driven: Takeaways From Our Corporate Responsibility Report

NVIDIA’s latest corporate responsibility report shares our efforts in empowering employees and putting to work our technologies for the benefit of humanity. Amid ongoing global economic concerns and pandemic challenges, this year’s report highlights our ability to attract and retain talent that come here to do their life’s work while tackling some of the world’s Read article >



Enabling Creative Expression with Concept Activation Vectors

Advances in computer vision and natural language processing continue to unlock new ways of exploring billions of images available on public and searchable websites. Today’s visual search tools make it possible to search with your camera, voice, text, images, or multiple modalities at the same time. However, it remains difficult to input subjective concepts, such as visual tones or moods, into current systems. For this reason, we have been working collaboratively with artists, photographers, and image researchers to explore how machine learning (ML) might enable people to use expressive queries as a way of visually exploring datasets.

Today, we are introducing Mood Board Search, a new ML-powered research tool that uses mood boards as a query over image collections. This enables people to define and evoke visual concepts on their own terms. Mood Board Search can be useful for subjective queries, such as “peaceful”, or for words and individual images that may not be specific enough to produce useful results in a standard search, such as “abstract details in overlooked scenes” or “vibrant color palette that feels part memory, part dream“. We developed, and will continue to develop, this research tool in alignment with our AI Principles.

Search Using Mood Boards
With Mood Board Search, our goal is to design a flexible and approachable interface so people without ML expertise can train a computer to recognize a visual concept as they see it. The tool interface is inspired by mood boards, commonly used by people in creative fields to communicate the “feel” of an idea using collections of visual materials.

With Mood Board Search, users can train a computer to recognize visual concepts in image collections.

To get started, simply drag and drop a small number of images that represent the idea you want to convey. Mood Board Search returns the best results when the images share a consistent visual quality, so results are more likely to be relevant with mood boards that share visual similarities in color, pattern, texture, or composition.

It’s also possible to signal which images are more important to a visual concept by upweighting or downweighting images, or by adding images that are the opposite of the concept. Then, users can review and inspect search results to understand which part of an image best matches the visual concept. Focus mode does this by revealing a bounding box around part of the image, while AI crop cuts in directly, making it easier to draw attention to new compositions.

Supported interactions, like AI crop, allow users to see which part of an image best matches their visual concept.

Powered by Concept Activation Vectors (CAVs)
Mood Board Search takes advantage of pre-trained computer vision models, such as GoogLeNet and MobileNet, and a machine learning approach called Concept Activation Vectors (CAVs).

CAVs are a way for machines to represent images (what we understand) using numbers or directions in a neural net’s embedding space (which can be thought of as what machines understand). CAVs can be used as part of a technique, Testing with CAVs (TCAV), to quantify the degree to which a user-defined concept is important to a classification result; e.g., how sensitive a prediction of “zebra” is to the presence of stripes. This is a research approach we open-sourced in 2018, and the work has since been widely applied to medical applications and science to build ML applications that can provide better explanations for what machines see. You can learn more about embedding vectors in general in this Google AI blog post, and our approach to working with TCAVs in Been Kim’s Keynote at ICLR.

In Mood Board Search, we use CAVs to find a model’s sensitivity to a mood board created by the user. In other words, each mood board creates a CAV — a direction in embedding space — and the tool searches an image dataset, surfacing images that are the closest match to the CAV. However, the tool takes it one step further, by segmenting each image in the dataset in 15 different ways, to uncover as many relevant compositions as possible. This is the approach behind features like Focus mode and AI crop.
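In the spirit of the original CAV work, a concept direction can be computed by separating the mood-board embeddings from a set of random image embeddings, then ranking the dataset by alignment with that direction. A hedged NumPy/scikit-learn sketch, not the Mood Board Search implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def mood_board_cav(board_embeddings, random_embeddings):
    """Fit a linear classifier separating mood-board embeddings from random
    ones and use the normal of its decision boundary as the concept direction."""
    X = np.vstack([board_embeddings, random_embeddings])
    y = np.array([1] * len(board_embeddings) + [0] * len(random_embeddings))
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    cav = clf.coef_[0]
    return cav / np.linalg.norm(cav)


def rank_by_concept(dataset_embeddings, cav):
    """Rank dataset images by how strongly their embeddings align with the CAV."""
    scores = dataset_embeddings @ cav
    return np.argsort(-scores)   # indices of best-matching images first
```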

Three artists created visual concepts to share their way of seeing, shown here in an experimental app by design invention studio, Nord Projects.

Because embedding vectors can be learned and re-used across models, tools like Mood Board Search can help us express our perspective to other people. Early collaborations with creative communities have shown value in being able to create and share subjective experiences with others, resulting in feelings of being able to “break out of visually-similar echo chambers” or “see the world through another person’s eyes”. Even misalignment between model and human understanding of a concept frequently resulted in unexpected and inspiring connections for collaborators. Taken together, these findings point towards new ways of designing collaborative ML systems that embrace personal and collective subjectivity.

Conclusions and Future Work
Today, we’re open-sourcing the code to Mood Board Search, including three visual concepts made by our collaborators, and a Mood Board Search Python Library for people to tap the power of CAVs directly into their own websites and apps. While these tools are early-stage prototypes, we believe this capability can have a wide-range of applications from exploring unorganized image collections to externalizing ways of seeing into collaborative and shareable artifacts. Already, an experimental app by design invention studio Nord Projects, made using Mood Board Search, investigates the opportunities for running CAVs in camera, in real-time. In future work, we plan to use Mood Board Search to learn about new forms of human-machine collaboration and expand ML models and inputs — like text and audio — to allow even deeper subjective discoveries, regardless of medium.

If you’re interested in a demo of this work for your team or organization, email us at cav-experiments-support@google.com.

Acknowledgments
This blog presents research by (in alphabetical order): Kira Awadalla, Been Kim, Eva Kozanecka, Alison Lentz, Alice Moloney, Emily Reif, and Oliver Siy, in collaboration with design invention studio Nord Projects. We thank our co-author, Eva Kozanecka, our artist collaborators, Alexander Etchells, Tom Hatton, Rachel Maggart, the Imaging team at The British Library for their participation in beta previews, and Blaise Agüera y Arcas, Jess Holbrook, Fernanda Viegas, and Martin Wattenberg for their support of this research project.


Wordle for AI: Santiago Valderrama on Getting Smarter on Machine Learning

Want to learn about AI and machine learning? There are plenty of resources out there to help — blogs, podcasts, YouTube tutorials — perhaps too many. Machine learning engineer Santiago Valderrama has taken a far more focused approach to helping us all get smarter about the field. He’s created a following by posing one machine Read article >



GFN Thursday Brings New Games to GeForce NOW for the Perfect Summer Playlist

Nothing beats the summer heat like GFN Thursday. Get ready for four new titles streaming at GeForce quality across nearly any device. Buckle up for some great gaming, whether poolside, in the car for a long road trip, or in the air-conditioned comfort of home. Speaking of summer, it’s also last call for this year’s Read article >



MLGO: A Machine Learning Framework for Compiler Optimization

The question of how to compile faster and smaller code arose together with the birth of modern computers. Better code optimization can significantly reduce the operational cost of large datacenter applications. The size of compiled code matters the most to mobile and embedded systems or software deployed on secure boot partitions, where the compiled binary must fit in tight code size budgets. With advances in the field, the headroom has been heavily squeezed with increasingly complicated heuristics, impeding maintenance and further improvements.

Recent research has shown that machine learning (ML) can unlock more opportunities in compiler optimization by replacing complicated heuristics with ML policies. However, adopting ML in general-purpose, industry-strength compilers remains a challenge.

To address this, we introduce “MLGO: a Machine Learning Guided Compiler Optimizations Framework”, the first industrial-grade general framework for integrating ML techniques systematically in LLVM (an open-source industrial compiler infrastructure that is ubiquitous for building mission-critical, high-performance software). MLGO uses reinforcement learning (RL) to train neural networks to make decisions that can replace heuristics in LLVM. We describe two MLGO optimizations for LLVM: 1) reducing code size with inlining; and 2) improving code performance with register allocation (regalloc). Both optimizations are available in the LLVM repository, and have been deployed in production.

How Does MLGO Work? With Inlining-for-Size As a Case Study
Inlining helps reduce code size by making decisions that enable the removal of redundant code. In the example below, the caller function foo() calls the callee function bar(), which itself calls baz(). Inlining both callsites returns a simple foo() function that reduces the code size.

Inlining reduces code size by removing redundant code.

In real code, thousands of functions call each other and thus form a call graph. During the inlining phase, the compiler traverses the call graph over all caller-callee pairs and decides whether or not to inline each pair. It is a sequential decision process, as previous inlining decisions alter the call graph, affecting later decisions and the final result. In the example above, the call graph foo() → bar() → baz() needs a “yes” decision on both edges to make the code size reduction happen.

Before MLGO, the inline / no-inline decision was made by a heuristic that, over time, became increasingly difficult to improve. MLGO substitutes the heuristic with an ML model. During the call graph traversal, the compiler seeks advice from a neural network on whether to inline a particular caller-callee pair by feeding in relevant features (i.e., inputs) from the graph, and executes the decisions sequentially until the whole call graph is traversed.

Illustration of MLGO during inlining. “#bbs”, “#users”, and “callsite height” are example caller-callee pair features.

MLGO trains the decision network (policy) with RL using policy gradient and evolution strategies algorithms. While there is no ground truth about best decisions, online RL iterates between training and running compilation with the trained policy to collect data and improve the policy. In particular, given the current model under training, the compiler consults the model for inline / no-inline decision making during the inlining stage. After the compilation finishes, it produces a log of the sequential decision process (state, action, reward). The log is then passed to the trainer to update the model. This process repeats until we obtain a satisfactory model.

Compiler behavior during training. The compiler compiles the source code foo.cpp to an object file foo.o with a sequence of optimization passes, one of which is the inline pass.
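A very rough sketch of that outer training loop (the helper names below are made up for illustration, not the MLGO implementation):

```python
def train_inlining_policy(policy, corpus, num_iterations=100):
    """Illustrative online RL loop: compile with the current policy, log the
    (state, action, reward) trajectories, then update the policy."""
    for _ in range(num_iterations):
        trajectories = []
        for module in corpus:
            # The compiler queries `policy` at every caller-callee pair and logs
            # the features it saw, the inline / no-inline decisions, and the
            # final code-size reward for the module.
            log = compile_with_policy(module, policy)
            trajectories.append(log)
        policy = update_policy(policy, trajectories)   # e.g., a policy gradient or ES step
    return policy
```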

The trained policy is then embedded into the compiler to provide inline / no-inline decisions during compilation. Unlike the training scenario, the policy does not produce a log. The TensorFlow model is embedded with XLA AOT, which converts the model into executable code. This avoids TensorFlow runtime dependency and overhead, minimizing the extra time and memory cost introduced by ML model inference at compilation time.

Compiler behavior in production.

We trained the inlining-for-size policy on a large internal software package containing 30k modules. The trained policy is generalizable when applied to compile other software and achieves a 3% ~ 7% size reduction. In addition to the generalizability across software, generalizability across time is also important — both the software and compiler are under active development so the trained policy needs to retain good performance for a reasonable time. We evaluated the model’s performance on the same set of software three months later and found only slight degradation.

Inlining-for-size policy size reduction percentages. The x-axis presents different software and the y-axis represents the percentage size reduction. “Training” is the software on which the model was trained and “Infra[1|2|3]” are different internal software packages.

The MLGO inlining-for-size training has been deployed on Fuchsia — a general purpose open source operating system designed to power a diverse ecosystem of hardware and software, where binary size is critical. Here, MLGO showed a 6.3% size reduction for C++ translation units.

Register-Allocation (for performance)
As a general framework, we used MLGO to improve the register allocation pass, which improves the code performance in LLVM. Register Allocation solves the problem of assigning physical registers to live ranges (i.e., variables).

As the code executes, different live ranges are completed at different times, freeing up registers for use by subsequent processing stages. In the example below, each “add” and “multiply” instruction requires all operands and the result to be in physical registers. The live range x is allocated to the green register and is completed before either of the live ranges in the blue or yellow registers. After x is completed, the green register becomes available and is assigned to live range t.

Register allocation example.

When it’s time to allocate live range q, there are no available registers, so the register allocation pass must decide which (if any) live range can be “evicted” from its register to make room for q. This is referred to as the “live range eviction” problem, and is the decision for which we train the model to replace original heuristics. In this particular example, it evicts z from the yellow register, and assigns it to q and the first half of z.

We now consider the unassigned second half of live range z. We have a conflict again; this time the live range t is evicted and split, and the first half of t and the final part of z end up using the green register. The middle part of z corresponds to the instruction q = t * y, where z is not being used, so z is not assigned to any register there; instead, its value is stored to the stack from the yellow register and later reloaded into the green register. The same happens to t. These extra load/store instructions degrade performance, and the goal of the register allocation algorithm is to reduce such inefficiencies as much as possible. This is used as the reward to guide RL policy training.

Similar to the inlining-for-size policy, the register allocation (regalloc-for-performance) policy is trained on a large Google internal software package, and is generalizable across different software, with 0.3% ~1.5% improvements in queries per second (QPS) on a set of internal large-scale datacenter applications. The QPS improvement has persisted for months after its deployment, showing the model’s generalizability across the time horizon.

Conclusion and Future Work
We propose MLGO, a framework for integrating ML techniques systematically in an industrial compiler, LLVM. MLGO is a general framework that can be expanded to be: 1) deeper, e.g., adding more features, and applying better RL algorithms; and 2) broader, by applying it to more optimization heuristics beyond inlining and regalloc. We are enthusiastic about the possibilities MLGO can bring to the compiler optimization domain and look forward to its further adoption and to future contributions from the research community.

Try it Yourself
Check out the open-sourced end-to-end data collection and training solution on GitHub and a demo that uses policy gradient to train an inlining-for-size policy.

Acknowledgements
We’d like to thank MLGO’s contributors and collaborators Eugene Brevdo, Jacob Hegna, Gaurav Jain, David Li, Zinan Lin, Kshiteej Mahajan, Jack Morris, Girish Mururu, Jin Xin Ng, Robert Ormandi, Easwaran Raman, Ondrej Sykora, Maruf Zaber, Weiye Zhao. We would also like to thank Petr Hosek, Yuqian Li, Roland McGrath, Haowei Wu for trusting us and deploying MLGO in Fuchsia as MLGO’s very first customer; thank David Blaikie, Eric Christopher, Brooks Moses, Jordan Rupprecht for helping to deploy MLGO in Google internal large-scale datacenter applications; and thank Ed Chi, Tipp Moseley for their leadership support.