Hello! So I am trying to create a multiclass classifier using VGG16 in transfer learning to classify users’ emotions. The data is sorted into 4 classes, each with its own directory, so I can use the ‘image_dataset_from_directory’ function.
My code correctly identifies X images belonging to the 4 classes, but when I try to execute model.fit() it returns this error:
ValueError: in user code:

    File "/home/blabs/.local/lib/python3.9/site-packages/keras/engine/training.py", line 878, in train_function *
        return step_function(self, iterator)
    File "/home/blabs/.local/lib/python3.9/site-packages/keras/engine/training.py", line 867, in step_function **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/blabs/.local/lib/python3.9/site-packages/keras/engine/training.py", line 860, in run_step **
        outputs = model.train_step(data)
    File "/home/blabs/.local/lib/python3.9/site-packages/keras/engine/training.py", line 809, in train_step
        loss = self.compiled_loss(
    File "/home/blabs/.local/lib/python3.9/site-packages/keras/engine/compile_utils.py", line 201, in __call__
        loss_value = loss_obj(y_t, y_p, sample_weight=sw)
    File "/home/blabs/.local/lib/python3.9/site-packages/keras/losses.py", line 141, in __call__
        losses = call_fn(y_true, y_pred)
    File "/home/blabs/.local/lib/python3.9/site-packages/keras/losses.py", line 245, in call **
        return ag_fn(y_true, y_pred, **self._fn_kwargs)
    File "/home/blabs/.local/lib/python3.9/site-packages/keras/losses.py", line 1664, in categorical_crossentropy
        return backend.categorical_crossentropy(
    File "/home/blabs/.local/lib/python3.9/site-packages/keras/backend.py", line 4994, in categorical_crossentropy
        target.shape.assert_is_compatible_with(output.shape)

    ValueError: Shapes (None, 1) and (None, 4) are incompatible
How can I approach solving this issue? Thank you for your help.
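Update: the shapes in the error suggest my labels are integer-encoded (shape (None, 1)) while categorical_crossentropy expects one-hot targets (shape (None, 4)). Below is a minimal sketch of the two remedies I have seen suggested; the directory path, image size, and layer sizes are placeholders rather than my exact code.

import tensorflow as tf

# label_mode="categorical" makes image_dataset_from_directory emit one-hot labels of shape
# (None, 4); the default label_mode="int" yields integer labels, hence (None, 1) vs (None, 4).
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train",                  # placeholder path
    label_mode="categorical",
    image_size=(224, 224),
    batch_size=32,
)

# simple VGG16 transfer-learning head (preprocessing such as
# tf.keras.applications.vgg16.preprocess_input omitted for brevity)
base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(224, 224, 3), pooling="avg")
base.trainable = False
model = tf.keras.Sequential([base, tf.keras.layers.Dense(4, activation="softmax")])

# Alternative: keep the default integer labels and use a sparse loss instead.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",   # or "sparse_categorical_crossentropy" with integer labels
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)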
We have a Tensorflow workflow and model that works great when used in a local environment (Python) – however, we now need to push it to production (Heroku). So we’re thinking we need to move our model into some type of Cloud hosting.
If possible, I’d like to upload the model directory (not an H5 file) to a cloud service/storage provider and then load that model into Tensorflow.
Here is how we’re currently loading in a model, and what we’d like to be able to do:
# Current setup loads the model from a local directory
dnn_model = tf.keras.models.load_model('./neural_network/true_overall')

# We'd like to be able to load the model from a cloud service/storage
dnn_model = tf.keras.models.load_model('https://some-kinda-storage-service.com/neural_network/true_overall')
Downloading the directory and running it from a temp directory isn’t an option with our setup – so we’ll need to be able to run the model from the cloud. We don’t necessarily need to “train” the model in the cloud, we just need to be able to load it.
I’ve looked into some things like TensorServe and TensorCloud, but I’m not 100% sure if that’s what we need (we’re super new to TensorFlow and AI in general).
What’s the best way to get the models (as a directory) into the cloud so we can load them into our code?
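For reference, the pattern we are hoping for looks roughly like this; the bucket name and path are made up, and this assumes TensorFlow's Google Cloud Storage filesystem support is available in our build (other providers such as S3 typically need the tensorflow-io filesystem plugins):

import tensorflow as tf

# Hypothetical bucket/path; gs:// works because TensorFlow's file layer can read GCS directly,
# so the SavedModel directory never has to be copied to local disk.
GCS_MODEL_DIR = "gs://my-bucket/neural_network/true_overall"

# Optional sanity check that the SavedModel is visible before loading
print(tf.io.gfile.exists(GCS_MODEL_DIR + "/saved_model.pb"))

dnn_model = tf.keras.models.load_model(GCS_MODEL_DIR)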
Posted by Mohammad Saleh, Software Engineer, Google Research, Brain Team and Anjuli Kannan, Software Engineer, Google Docs
For many of us, it can be challenging to keep up with the volume of documents that arrive in our inboxes every day: reports, reviews, briefs, policies and the list goes on. When a new document is received, readers often wish it included a brief summary of the main points in order to effectively prioritize it. However, composing a document summary can be cognitively challenging and time-consuming, especially when a document writer is starting from scratch.
To help with this, we recently announced that Google Docs now automatically generates suggestions to aid document writers in creating content summaries, when they are available. Today we describe how this was enabled using a machine learning (ML) model that comprehends document text and, when confident, generates a 1-2 sentence natural language description of the document content. However, the document writer maintains full control — accepting the suggestion as-is, making necessary edits to better capture the document summary or ignoring the suggestion altogether. Readers can also use this section, along with the outline, to understand and navigate the document at a high level. While all users can add summaries, auto-generated suggestions are currently only available to Google Workspace business customers. Building on grammar suggestions, Smart Compose, and autocorrect, we see this as another valuable step toward improving written communication in the workplace.
A blue summary icon appears in the top left corner when a document summary suggestion is available. Document writers can then view, edit, or ignore the suggested document summary.
Abstractive text summarization, which combines the individually challenging tasks of long document language understanding and generation, has been a long-standing problem in NLU and NLG research. A popular method for combining NLU and NLG is training an ML model using sequence-to-sequence learning, where the inputs are the document words, and the outputs are the summary words. A neural network then learns to map input tokens to output tokens. Early applications of the sequence-to-sequence paradigm used recurrent neural networks (RNNs) for both the encoder and decoder.
The introduction of Transformers provided a promising alternative to RNNs because Transformers use self-attention to provide better modeling of long input and output dependencies, which is critical in document summarization. Still, these models require large amounts of manually labeled data to train sufficiently, so the advent of Transformers alone was not enough to significantly advance the state-of-the-art in document summarization.
The combination of Transformers with self-supervised pre-training (e.g., BERT, GPT, T5) led to a major breakthrough in many NLU tasks for which limited labeled data is available. In self-supervised pre-training, a model uses large amounts of unlabeled text to learn general language understanding and generation capabilities. Then, in a subsequent fine-tuning stage, the model learns to apply these abilities on a specific task, such as summarization or question answering.
The Pegasus work took this idea one step further, by introducing a pre-training objective customized to abstractive summarization. In Pegasus pre-training, also called Gap Sentence Prediction (GSP), full sentences from unlabeled news articles and web documents are masked from the input and the model is required to reconstruct them, conditioned on the remaining unmasked sentences. In particular, GSP attempts to mask sentences that are considered essential to the document through different heuristics. The intuition is to make the pre-training as close as possible to the summarization task. Pegasus achieved state-of-the-art results on a varied set of summarization datasets. However, a number of challenges remained to apply this research advancement into a product.
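To give a concrete sense of what a pre-trained abstractive summarizer does, a publicly released Pegasus checkpoint can be run through the open-source Hugging Face Transformers library in a few lines. This is purely an illustration, not the production Docs model; the checkpoint name and sample document are arbitrary choices:

from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# Publicly released checkpoint fine-tuned on the XSum dataset (illustrative choice)
model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

document = (
    "The quarterly report covers revenue growth across all regions, outlines hiring plans "
    "for the next two quarters, and proposes a revised budget for the infrastructure team."
)

# Encode the document, generate a short abstractive summary with beam search, and decode it
batch = tokenizer(document, truncation=True, padding="longest", return_tensors="pt")
summary_ids = model.generate(**batch, num_beams=4, max_length=64)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])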
Applying Recent Research Advances to Google Docs
Data
Self-supervised pre-training results in an ML model that has general language understanding and generation capabilities, but a subsequent fine-tuning stage is critical for the model to adapt to the application domain. We fine-tuned early versions of our model on a corpus of documents with manually-generated summaries that were consistent with typical use cases.
However, early versions of this corpus suffered from inconsistencies and high variation because they included many types of documents, as well as many ways to write a summary — e.g., academic abstracts are typically long and detailed, while executive summaries are brief and punchy. This led to a model that was easily confused because it had been trained on so many different types of documents and summaries that it struggled to learn the relationships between any of them.
Fortunately, one of the key findings in the Pegasus work was that an effective pre-training phase required less supervised data in the fine-tuning stage. Some summarization benchmarks required as few as 1,000 fine-tuning examples for Pegasus to match the performance of Transformer baselines that saw 10,000+ supervised examples — suggesting that one could focus on quality rather than quantity.
We carefully cleaned and filtered the fine-tuning data to contain training examples that were more consistent and represented a coherent definition of summaries. Despite the fact that we reduced the amount of training data, this led to a higher-quality model. The key lesson, consistent with recent work in domains like dataset distillation, was that it was better to have a smaller, high-quality dataset than a larger, high-variance dataset.
Serving
Once we trained the high quality model, we turned to the challenge of serving the model in production. While the Transformer version of the encoder-decoder architecture is the dominant approach to train models for sequence-to-sequence tasks like abstractive summarization, it can be inefficient and impractical to serve in real-world applications. The main inefficiency comes from the Transformer decoder where we generate the output summary token by token through autoregressive decoding. The decoding process becomes noticeably slow when summaries get longer since the decoder attends to all previously generated tokens at each step. RNNs are a more efficient architecture for decoding since there is no self-attention with previous tokens as in a Transformer model.
We used knowledge distillation, which is the process of transferring knowledge from a large model to a smaller more efficient model, to distill the Pegasus model into a hybrid architecture of a Transformer encoder and an RNN decoder. To improve efficiency we also reduced the number of RNN decoder layers. The resulting model had significant improvements in latency and memory footprint while the quality was still on par with the original model. To further improve the latency and user experience, we serve the summarization model using TPUs, which provide significant speed ups and allow more requests to be handled by a single machine.
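For readers unfamiliar with the technique, here is a sketch of a generic logit-distillation objective written in TensorFlow; it illustrates the general idea of training a smaller student to match a larger teacher's softened outputs, and is not the recipe used for the production summarization model:

import tensorflow as tf

def distillation_loss(y_true, teacher_logits, student_logits, temperature=2.0, alpha=0.5):
    # soften both distributions with a temperature and penalize their cross-entropy
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    log_soft_student = tf.nn.log_softmax(student_logits / temperature)
    kd = -tf.reduce_mean(tf.reduce_sum(soft_teacher * log_soft_student, axis=-1)) * temperature ** 2
    # standard cross-entropy against the ground-truth labels keeps the student anchored to the task
    ce = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(y_true, student_logits, from_logits=True))
    return alpha * kd + (1.0 - alpha) * ce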
Ongoing Challenges and Next Steps
While we are excited by the progress so far, there are a few challenges we are continuing to tackle:
Document coverage: Developing a set of documents for the fine-tuning stage was difficult due to the tremendous variety that exists among documents, and the same challenge is true at inference time. Some of the documents our users create (e.g., meeting notes, recipes, lesson plans and resumes) are not suitable for summarization or can be difficult to summarize. Currently, our model only suggests a summary for documents where it is most confident, but we hope to continue broadening this set as our model improves.
Evaluation: Abstractive summaries need to capture the essence of a document while being fluent and grammatically correct. A specific document may have many summaries that can be considered correct, and different readers may prefer different ones. This makes it hard to evaluate summaries with automatic metrics alone; user feedback and usage statistics will be critical for us to understand and keep improving quality.
Long documents: Long documents are some of the toughest documents for the model to summarize because it is harder to capture all the points and abstract them in a single summary, and it can also significantly increase memory usage during training and serving. However, long documents are perhaps most useful for the model to automatically summarize because it can help document writers get a head start on this tedious task. We hope we can apply the latest ML advancements to better address this challenge.
Conclusion
Overall, we are thrilled that we can apply recent progress in NLU and NLG to continue assisting users with reading and writing. We hope the automatic suggestions now offered in Google Workspace make it easier for writers to annotate their documents with summaries, and help readers comprehend and navigate documents more easily.
Acknowledgements
The authors would like to thank the many people across Google that contributed to this work: AJ Motika, Matt Pearson-Beck, Mia Chen, Mahdis Mahdieh, Halit Erdogan, Benjamin Lee, Ali Abdelhadi, Michelle Danoff, Vishnu Sivaji, Sneha Keshav, Aliya Baptista, Karishma Damani, DJ Lick, Yao Zhao, Peter Liu, Aurko Roy, Yonghui Wu, Shubhi Sareen, Andrew Dai, Mekhola Mukherjee, Yinan Wang, Mike Colagrosso, and Behnoosh Hariri.
Turn on your TV. Fire up your favorite streaming service. Grab a Coke. A demo of the most important visual technology of our time is as close as your living room couch. Propelled by an explosion in computing power over the past decade and a half, path tracing has swept through visual media.
I would like to share my project and show you how to apply the TinyML approach to detect broken tooth conditions in a gearbox based on recorded vibration data.
I used a Raspberry Pi Pico, the Arduino IDE, and Neuton TinyML software. I will answer questions such as:
Is it possible to make an AI-driven system that predicts gearbox failure on a simple $4 MCU?
How can you automatically build a compact model that does not require any additional compression?
Can a non-data scientist implement such projects successfully?
Introduction and Business Constraint
In industry (e.g., wind power, automotive), gearboxes often operate under random speed variations. A condition monitoring system is expected to detect faults such as broken tooth conditions and assess their severity using vibration signals collected under different speed profiles.
Modern cars have hundreds of thousands of parts and systems in which it is necessary to predict breakdowns and monitor temperature, pressure, and so on. As such, in the automotive industry it is critically important to create and embed TinyML models that can run right on the sensors and open up a set of technological advantages, such as:
Internet independence
No waste of energy and money on data transfer
Advanced privacy and security
In my experiment I want to show how to easily create such a technology prototype to popularize the TinyML approach and use its incredible capabilities for the automotive industry.
Neuton TinyML: I selected this solution since it is free to use and automatically creates tiny machine learning models deployable even on 8-bit MCUs. According to Neuton's developers, you can create a compact model in one iteration without compression.
Raspberry Pi Pico: The chip has two Arm Cortex-M0+ cores running at 133 MHz, paired with 256 KB of on-chip RAM. The device supports up to 16 MB of off-chip flash storage, has a DMA controller, and includes two UARTs, two SPIs, two I2Cs, and a USB 1.1 controller. It also provides 16 PWM channels and 30 GPIO pins, four of which are suitable for analog input, all with a net $4 price tag.
The goal of this tutorial is to demonstrate how you can easily build a compact ML model to solve a multi-class classification task to detect broken tooth conditions in the gearbox.
Dataset Description
The Gearbox Fault Diagnosis dataset includes vibration data recorded using SpectraQuest’s Gearbox Fault Diagnostics Simulator.
The dataset was recorded using 4 vibration sensors placed in four different directions, under load varying from 0 to 90 percent. Two different scenarios are included: 1) healthy condition and 2) broken tooth condition.
There are 20 files in total, 10 for a healthy gearbox and 10 for a broken one. Each file corresponds to a given load from 0% to 90% in steps of 10%. You can find this dataset through the link: https://www.kaggle.com/datasets/brjapon/gearbox-fault-diagnosis
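To get a feel for the data before uploading it, here is a small sketch of how the CSV files could be merged and labeled with pandas; the folder name and the h/b file-name prefixes for healthy/broken recordings are assumptions about the archive layout:

import glob
import os
import pandas as pd

# Assumed layout: one CSV per load level, extracted into ./gearbox-fault-diagnosis/,
# with file names starting with 'h' (healthy) or 'b' (broken tooth).
frames = []
for path in glob.glob("gearbox-fault-diagnosis/*.csv"):
    label = 0 if os.path.basename(path).lower().startswith("h") else 1
    df = pd.read_csv(path)
    df["target"] = label
    frames.append(df)

data = pd.concat(frames, ignore_index=True)
data.to_csv("gearbox_train.csv", index=False)   # single CSV ready for upload to the training platform
print(data["target"].value_counts())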
The experiment will be conducted on a $4 MCU, with no cloud computing carbon footprints 🙂
Step 1: Model training
For model training, I’ll use the free-of-charge Neuton TinyML platform. Once a solution is created, proceed to dataset upload (keep in mind that the currently supported format is CSV only).
Number of coefficients = 397, file size for embedding = 2.52 KB. That’s super cool! It is a really small model! Upon completion of model training, click on the Prediction tab, and then click the Download button next to Model for Embedding to download the model library file that we are going to use for our device.
Step 2: Embedding on Raspberry Pico
Once you have downloaded the model files, it’s time to add our custom functions and actions. I am using Arduino IDE to program Raspberry Pico.
Note: Since we are going to run classification on the test dataset, we will use the CSV utility provided by Neuton to run inference on the data sent to the MCU via USB.
I tried to build the same model with TensorFlow and TensorFlow Lite as well. My model built with Neuton TinyML turned out to be 4.3% better in terms of accuracy and 15.3 times smaller in terms of model size than the one built with TF Lite. Speaking of the number of coefficients, TensorFlow’s model has 9,330 coefficients, while Neuton’s model has only 397 (23.5 times fewer than TF!).
The resultant model footprint and inference time are as follows:
Autonomous vehicle development and validation require the ability to replicate real-world scenarios in simulation. At GTC, NVIDIA founder and CEO Jensen Huang showcased new AI-based tools for NVIDIA DRIVE Sim that accurately reconstruct and modify actual driving scenarios. These tools are enabled by breakthroughs from NVIDIA Research that leverage technologies such as the NVIDIA Omniverse platform.
This week at GTC, we’re celebrating – celebrating the amazing and impactful work that developers and startups are doing around the world. Nowhere is that more apparent than among the members of our global NVIDIA Inception program, designed to nurture cutting-edge startups who are revolutionizing industries. The program is free for startups of all sizes.
Warp is a Python API framework for writing GPU graphics and simulation code, especially within Omniverse.
Typically, real-time physics simulation code is written in low-level CUDA C++ for maximum performance. In this post, we introduce NVIDIA Warp, a new Python framework that makes it easy to write differentiable graphics and simulation GPU code in Python. Warp provides the building blocks needed to write high-performance simulation code, but with the productivity of working in an interpreted language like Python.
By the end of this post, you learn how to use Warp to author CUDA kernels in your Python environment and make use of some of the built-in high-level functionality that makes it easy to write complex physics simulations, such as an ocean simulation (Figure 1).
Installation
Warp is available as an open-source library from GitHub. When the repository has been cloned, you can install it using your local package manager. For pip, use the following command:
pip install warp
Initialization
After importing, you must explicitly initialize Warp:
import warp as wp
wp.init()
Launching kernels
Warp uses the concept of Python decorators to mark functions that can be executed on the GPU. For example, you could write a simple semi-implicit particle integration scheme as follows:
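A sketch of such a kernel is shown below; the argument names (positions, velocities, forces, inverse masses) and the gravity/timestep parameters are illustrative rather than taken verbatim from the original example:

import warp as wp

wp.init()

@wp.kernel
def integrate(x: wp.array(dtype=wp.vec3),     # positions
              v: wp.array(dtype=wp.vec3),     # velocities
              f: wp.array(dtype=wp.vec3),     # forces
              w: wp.array(dtype=float),       # inverse masses
              gravity: wp.vec3,
              dt: float):

    # one thread per particle
    tid = wp.tid()

    x0 = x[tid]
    v0 = v[tid]

    # semi-implicit Euler: update velocity first, then position with the new velocity
    v1 = v0 + (f[tid] * w[tid] + gravity) * dt
    x1 = x0 + v1 * dt

    x[tid] = x1
    v[tid] = v1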
Because Warp is strongly typed, you should provide type hints to kernel arguments. To launch a kernel, use the following syntax:
wp.launch(kernel=simple_kernel, # kernel to launch
dim=1024, # number of threads
inputs=[a, b, c], # parameters
device="cuda") # execution device
Unlike tensor-based frameworks such as NumPy, Warp uses a kernel-based programming model. Kernel-based programming more closely matches the underlying GPU execution model. It is often a more natural way to express simulation code that requires fine-grained conditional logic and memory operations. However, Warp exposes this thread-centric model of programming in an easy-to-use way that does not require low-level knowledge of GPU architecture.
Compilation model
Launching a kernel triggers a just-in-time (JIT) compilation pipeline that automatically generates C++/CUDA kernel code from Python function definitions.
All kernels belonging to a Python module are runtime compiled into dynamic libraries and PTX. Figure 2 shows the compilation pipeline, which involves traversing the function AST and converting it to straight-line CUDA code that is then compiled and loaded back into the Python process.
The result of this JIT compilation is cached. If the input kernel source is unchanged, then the precompiled binaries are loaded in a low-overhead fashion.
Memory model
Memory allocations in Warp are exposed through the warp.array type. Arrays wrap an underlying memory allocation that may live in either host (CPU), or device (GPU) memory. Unlike tensor frameworks, arrays in Warp are strongly typed and store a linear sequence of built-in structures (vec3, matrix33, quat, and so on).
You can construct arrays from Python lists or NumPy arrays, or initialize them directly, using a syntax similar to NumPy and PyTorch:
# allocate an uninitialized array of vec3s
v = wp.empty(length=n, dtype=wp.vec3, device="cuda")
# allocate a zero-initialized array of quaternions
q = wp.zeros(length=n, dtype=wp.quat, device="cuda")
# allocate and initialize an array from a numpy array
# will be automatically transferred to the specified device
v = wp.from_numpy(array, dtype=wp.vec3, device="cuda")
Warp supports the __array_interface__ and __cuda_array_interface__ protocols, which allow zero-copy data views between tensor-based frameworks. For example, to convert data to NumPy, use the following command:
# automatically bring data from device back to host
view = device_array.numpy()
Features
Warp includes several higher-level data structures that make implementing simulation and geometry processing algorithms easier.
Meshes
Triangle meshes are ubiquitous in simulation and computer graphics. Warp provides a built-in type for managing mesh data that provides support for geometric queries, such as closest point, ray cast, and overlap checks.
The following example shows how to use Warp to compute the closest point on a mesh to an array of input positions. This type of computation is the building block for many algorithms in collision detection (Figure 3). Warp’s mesh queries make it simple to implement such methods.
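The sketch below paraphrases such a query; the single-triangle mesh, the maximum search distance, and the array names are illustrative stand-ins for real asset data:

import numpy as np
import warp as wp

wp.init()

# a single triangle standing in for a real mesh (points/indices would normally come from an asset)
mesh = wp.Mesh(points=wp.array(np.array([[0.0, 0.0, 0.0],
                                         [1.0, 0.0, 0.0],
                                         [0.0, 1.0, 0.0]], dtype=np.float32),
                               dtype=wp.vec3, device="cuda"),
               indices=wp.array(np.array([0, 1, 2], dtype=np.int32),
                                dtype=int, device="cuda"))

@wp.kernel
def project(mesh_id: wp.uint64,
            positions: wp.array(dtype=wp.vec3),
            closest: wp.array(dtype=wp.vec3)):

    tid = wp.tid()
    x = positions[tid]

    # outputs written by the mesh query
    sign = float(0.0)
    face = int(0)
    bary_u = float(0.0)
    bary_v = float(0.0)

    # find the closest point on the mesh within a maximum search distance of x
    if wp.mesh_query_point(mesh_id, x, 10.0, sign, face, bary_u, bary_v):
        closest[tid] = wp.mesh_eval_position(mesh_id, face, bary_u, bary_v)

n = 1024
positions = wp.array(np.random.rand(n, 3).astype(np.float32), dtype=wp.vec3, device="cuda")
closest = wp.zeros(length=n, dtype=wp.vec3, device="cuda")

wp.launch(kernel=project, dim=n, inputs=[mesh.id, positions, closest], device="cuda")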
Sparse volumes
Sparse volumes are incredibly useful for representing grid data over large domains, such as signed distance fields (SDFs) for complex objects or velocities for large-scale fluid flow. Warp includes support for sparse volumes defined using the NanoVDB standard. Construct volumes using standard OpenVDB tools such as Blender, Houdini, or Maya, and then sample inside Warp kernels.
You can create volumes directly from binary grid files on disk or in-memory, and then sample them using the volumes API:
wp.volume_sample_world(vol, xyz, mode) # world space sample using interpolation mode
wp.volume_sample_local(vol, uvw, mode) # volume space sample using interpolation mode
wp.volume_lookup(vol, ijk) # direct voxel lookup
wp.volume_transform(vol, xyz) # map point from voxel space to world space
wp.volume_transform_inv(vol, xyz) # map point from world space to volume space
Using volume queries, you can efficiently collide against complex objects with minimal memory overhead.
Hash grids
Many particle-based simulation methods, such as the discrete element method (DEM) or smoothed particle hydrodynamics (SPH), involve iterating over spatial neighbors to compute force interactions. Hash grids are a well-established data structure to accelerate these nearest neighbor queries and are particularly well suited to the GPU.
Hash grids are constructed from point sets as follows:
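A sketch of the construction step is shown below; the grid dimensions, point count, and query radius are arbitrary choices:

import numpy as np
import warp as wp

wp.init()

num_points = 100000
radius = 0.1

# random particle positions standing in for a real simulation state
points = wp.array(np.random.rand(num_points, 3).astype(np.float32), dtype=wp.vec3, device="cuda")

# allocate the hash grid once, then (re)build it whenever the points move
grid = wp.HashGrid(dim_x=128, dim_y=128, dim_z=128, device="cuda")
grid.build(points=points, radius=radius)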
Once a hash grid has been created, you can query it directly from within user kernel code, as shown in the following example, which computes the sum of all neighbor particle positions:
@wp.kernel
def sum(grid: wp.uint64,
        points: wp.array(dtype=wp.vec3),
        output: wp.array(dtype=wp.vec3),
        radius: float):

    tid = wp.tid()

    # query point
    p = points[tid]

    # create grid query around point
    query = wp.hash_grid_query(grid, p, radius)
    index = int(0)

    sum = wp.vec3()

    while(wp.hash_grid_query_next(query, index)):
        neighbor = points[index]

        # compute distance to neighbor point
        dist = wp.length(p - neighbor)

        # accumulate neighbors that fall inside the query radius
        if (dist <= radius):
            sum = sum + neighbor

    output[tid] = sum
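Continuing the construction sketch above, launching this kernel passes the grid id alongside the point and output arrays (the output allocation here is assumed):

output = wp.zeros(length=num_points, dtype=wp.vec3, device="cuda")

wp.launch(kernel=sum,
          dim=num_points,
          inputs=[grid.id, points, output, radius],
          device="cuda")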
Figure 5 shows an example of a DEM granular material simulation for a cohesive material. Using the built-in hash-grid data structure allows you to write such a simulation in fewer than 200 lines of Python that runs at interactive rates for more than 100K particles.
Using the Warp hash-grid data allows you to easily evaluate the pairwise force interactions between neighboring particles.
Differentiability
Tensor-based frameworks, such as PyTorch and JAX, provide gradients of tensor computations and are well-suited for applications like ML training.
A unique feature of Warp is the ability to generate forward and backward versions of kernel code. This makes it easy to write differentiable simulations that can propagate gradients as part of a larger training pipeline. A common scenario is to use traditional ML frameworks for network layers, and Warp to implement simulation layers allowing for end-to-end differentiability.
When gradients are required, you should create arrays with requires_grad=True. For example, the warp.Tape class can record kernel launches and replay them to compute the gradient of a scalar loss function with respect to the kernel inputs:
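A sketch of that pattern is shown below; the loss kernel and array shapes are illustrative rather than taken from the original post:

import warp as wp

wp.init()

@wp.kernel
def compute_loss(a: wp.array(dtype=wp.vec3),
                 b: wp.array(dtype=wp.vec3),
                 loss: wp.array(dtype=float)):
    tid = wp.tid()
    # simple scalar loss: sum of squared distances between corresponding points
    d = a[tid] - b[tid]
    wp.atomic_add(loss, 0, wp.dot(d, d))

n = 1024
# inputs must be created with requires_grad=True so their adjoints are tracked
a = wp.zeros(length=n, dtype=wp.vec3, device="cuda", requires_grad=True)
b = wp.zeros(length=n, dtype=wp.vec3, device="cuda", requires_grad=True)
loss = wp.zeros(length=1, dtype=float, device="cuda", requires_grad=True)

tape = wp.Tape()

# record the forward kernel launches on the tape
with tape:
    wp.launch(kernel=compute_loss, dim=n, inputs=[a, b, loss], device="cuda")

# replay the recorded launches in reverse to accumulate gradients of the scalar loss
tape.backward(loss=loss)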
After the backward pass has completed, the gradients with respect to the inputs are available through a mapping in the Tape object:
# gradient of loss with respect to input a
print(tape.gradients[a])
Summary
In this post, we presented NVIDIA Warp, a Python framework that makes it easy to write differentiable simulation code for the GPU. We encourage you to download the Warp preview release, share results, and give us feedback.
For more information, see the Warp repository on GitHub and the documentation and examples included with it.
Learn how servers can be built to deliver immersive workloads and take advantage of compute power to combine streaming XR applications with AI and other computing functions.
The current distribution of extended reality (XR) experiences is limited to desktop setups and local workstations, which contain the high-end GPUs necessary to meet computing requirements. For XR solutions to scale past their currently limited user base and support higher-end functionality such as AI services integration and on-demand collaboration, we need a purpose-built platform.
NVIDIA Project Aurora is a hardware and software platform that simplifies the deployment of enterprise XR applications onto corporate on-premises networks. This platform is also designed to support the integration of AI services within XR workloads.
Project Aurora leverages the XR streaming backbone of NVIDIA CloudXR and NVIDIA RTX Virtual Workstation (RTX vWS), bringing the horsepower of NVIDIA RTX A6000 and NVIDIA A40 to the edge to stream rich, real-time graphics from a machine room over a private 5G network.
Opportunities ahead
For XR developers, Project Aurora’s streamlined delivery of NVIDIA CloudXR streaming dramatically broadens the customer base from users tethered to high-power workstations to anyone with a simple headset or handheld device.
Users no longer need to leave their work areas to use a dedicated, attended, tethered “VR room” setup to experience high-power, high-quality immersion. Collaborators around the world, using their own separate local on-prem networks, can access and alter the same virtual environments at the same time.
Current use cases
Powerful XR use cases exist across multiple industries, allowing you to optimize workflows.
Medical professionals can explore immersive representations of anatomical models or even real patient data to train and work.
Manufacturing designers and engineers can shorten project lifecycles by leveraging digital twins of parts, assemblies, or entire manufacturing floors and plants.
Those same personnel can train in a virtual manufacturing environment without the need of physical resources or floor downtime.
In collaboration with an ever-growing list of partners, NVIDIA has shown the value of moving these use cases into Project Aurora for virtualized distribution of XR over high performance networks.
BT and Ericsson
BT and Ericsson deployed a VR digital twin manufacturing solution on a 5G mobile private network using the world’s first 5G-enabled VR headset powered by the Qualcomm Snapdragon XR2 Platform. The experience runs on Masters of Pie’s Radical SDK, enabling cloud-based virtual reality within computer-aided design (CAD) software.
By seamlessly integrating the high-performance edge rendering provided by the Project Aurora platform, existing factory operations have benefited from a high-fidelity VR experience that is available on the manufacturing floor.
EE 5G
Using augmented reality, The Green Planet AR Experience, powered by EE 5G, takes guests on an immersive journey into the secret kingdom of plants. Visitors travel through changing seasons on six digitally enhanced worlds, including rainforests, deserts, freshwater, and saltwater. The worlds are all powered by an Ericsson 5G standalone private network. Audiences engage and interact with the plant life by using a handheld mobile device, which acts as a window into the natural world.
The compute power of a handheld device on its own would not be nearly enough to render these volumetric models under normal conditions, but with Project Aurora, the experience is delivered seamlessly.
AT&T
AT&T recently teamed with Warner Bros., Ericsson, Qualcomm, Dreamscape, NVIDIA, and Wevr on an immersive, location-based VR experience, Chaos at Hogwarts. This proof-of-concept offers a peek into how Project Aurora combined with 5G can enhance future user-generated experiences.
By using the high-bandwidth and low-latency characteristics of 5G paired with Project Aurora’s XR infrastructure, we can change today’s architecture to one that is more comfortable for fans, more productive for creators, and more profitable for venue operators.
Project Aurora components
Project Aurora is an EGX-certified server powered by a GPU/CPU configuration optimized for XR, with a virtualization system using RTX vWS. The build is available through multiple OEMs (HPE, Dell) and supports multiple orchestration tools, including Linux KVM and VMware vSphere.
NVIDIA CloudXR is layered on this virtualized workstation server to form the Project Aurora base platform. Although it is built specifically to stream XR applications, the Project Aurora server also supports any graphics-based workloads natively supported by RTX vWS.
The Project Aurora hardware and software stack is a scalable design built with the help of multiple partners. The base unit of an Aurora build consists of four highly optimized servers from Dell or HPE that support multiple NVIDIA A40 GPUs with RTX Virtual Workstation software and high-performance, low-latency NVIDIA networking components.
Network traffic has been carefully separated. It runs across multiple NVIDIA ConnectX network interface cards (NICs) and can be broken down into the following main areas:
Internal
External
Storage
These traffic flows are distributed across the three NICs to provide performance, security, and redundancy in each area in the event of any NIC failure. Project Aurora is designed with a “performance first” approach, and to scale from the base level to any size client deployment.
Project Aurora partners
Project Aurora is more than just a scalable hardware and software platform. To make the Project Aurora platform easy for any IT crew to implement, NVIDIA has partnered with NVIDIA Partner Network (NPN) integration experts to ensure that delivery and installation are simple, repeatable, and fully “white-glove.” Our sales distribution channel was built with the following partners:
The Grid Factory: Immersive technology integrator specializing in NVIDIA-based technologies, such as NVIDIA CloudXR, vGPU, Omniverse, and the EGX Server architecture.
Enterprise Integration: Value-added reseller coordinating between clients and various partners, including OEMs such as Dell and HPE and distributors such as Arrow.
NVIDIA Pro Services (NVPS): Delivers white-glove service to facilitate the standup of durable XR experiences.
Get started with Project Aurora
To activate Project Aurora’s white-glove service, backed by the NPN partners, just contact one of our team members. The Project Aurora team works from application sizing and site survey through the ‘last-mile’ integration steps (authentication, profiling, server/network setup, tuning parameters, etc.) to achieve a successful XR distribution system.
Project Aurora deployments are actively scaling toward hundreds, even thousands, of users. If you are interested in learning more about Project Aurora or would like to get involved with our distribution channel to evaluate a new use case, contact Aurora_Outreach@nvidia.com.
This CUDA post examines the effectiveness of methods to hide memory latency using explicit prefetching.
NVIDIA GPUs have enormous compute power and typically must be fed data at high speed to deploy that power. That is possible, in principle, because GPUs also have high memory bandwidth, but sometimes they need your help to saturate that bandwidth.
In this post, we examine one specific method to accomplish that: prefetching. We explain the circumstances under which prefetching can be expected to work well, and how to find out whether these circumstances apply to your workload.
Context
NVIDIA GPUs derive their power from massive parallelism. Many warps of 32 threads can be placed on a streaming multiprocessor (SM), awaiting their turn to execute. When one warp is stalled for whatever reason, the warp scheduler switches to another with zero overhead, making sure the SM always has work to do.
On the high-performance NVIDIA Ampere Architecture A100 GPU, up to 64 active warps can share an SM, each with its own resources. On top of that, A100 has 108 SMs that can all execute warp instructions simultaneously.
Most instructions must operate on data, and that data almost always originates in the device memory (DRAM) attached to the GPU. One of the main reasons why even the abundance of warps on an SM can run out of work is because they are waiting for data to arrive from memory.
If this happens, and the bandwidth to memory is not fully utilized, it may be possible to reorganize the program to improve memory access and reduce warp stalls, which in turn makes the program complete faster. This is called latency hiding.
Prefetching
A technology commonly supported in hardware on CPUs is called prefetching. The CPU sees a stream of requests from memory arriving, figures out the pattern, and starts fetching data before it is actually needed. While that data travels to the execution units of the CPU, other instructions can be executed, effectively hiding the travel costs (memory latency).
Prefetching is a useful technique but expensive in terms of silicon area on the chip. These costs would be even higher, relatively speaking, on a GPU, which has many more execution units than the CPU. Instead, the GPU uses excess warps to hide memory latency. When that is not enough, you may employ prefetching in software. It follows the same principle as hardware-supported prefetching but requires explicit instructions to fetch the data.
To determine if this technique can help your program run faster, use a GPU profiling tool such as NVIDIA Nsight Compute to check the following:
Confirm that not all memory bandwidth is being used.
Confirm the main reason warps are blocked is Stall Long Scoreboard, which means that the SMs are waiting for data from DRAM.
Confirm that these stalls are concentrated in sizeable loops whose iterations do not depend on each other.
Unrolling
Consider the simplest possible optimization of such a loop, called unrolling. If the loop is short enough, you can tell the compiler to unroll it completely and the iterations are expanded explicitly. Because the iterations are independent, the compiler can issue all requests for data (“loads”) upfront, provided that it assigns distinct registers to each load.
These requests can be overlapped with each other, so that the whole set of loads experiences only a single memory latency, not the sum of all individual latencies. Even better, part of the single latency is hidden by the succession of load instructions itself. This is a near-optimal situation, but it may require a lot of registers to receive the results of the loads.
If the loop is too long, it could be unrolled partially. In that case, batches of iterations are expanded, and then you follow the same general strategy as before. Work on your part is minimal (but you may not be that lucky).
If the loop contains many other instructions whose operands need to be stored in registers, even just partial unrolling may not be an option. In that case, and after you have confirmed that the earlier conditions are satisfied, you must make some decisions based on further information.
Prefetching means bringing data closer to the SMs’ execution units. Registers are closest of all. If enough are available, which you can find out using the Nsight Compute occupancy view, you can prefetch directly into registers.
Consider the following loop, where array arr is stored in global memory (DRAM). It implicitly assumes that just a single, one-dimensional thread block is being used, which is not the case for the motivating application from which it was derived. However, it reduces code clutter and does not change the argument.
In all code examples in this post, uppercase variables are compile-time constants. BLOCKDIMX assumes the value of the predefined variable blockDim.x. For some purposes, it must be a constant known at compile time whereas for other purposes, it is useful for avoiding computations at run time.
for (i = threadIdx.x; i < imax; i += BLOCKDIMX) {   // imax: number of elements (loop bound assumed)
    double locvar = arr[i];
    // ... many compute instructions that use locvar ...
}
Imagine that you have eight registers to spare for prefetching. This is a tuning parameter. The following code fetches four double-precision values occupying eight 4-byte registers at the start of each fourth iteration and uses them one by one, until the batch is depleted, at which time you fetch a new batch.
To keep track of the batches, introduce a counter (ctr) that increments with each successive iteration executed by a thread. For convenience, assume that the number of iterations per thread is divisible by 4.
double v0, v1, v2, v3;
for (i = threadIdx.x, ctr = 0; i < imax; i += BLOCKDIMX, ctr++) {   // imax: loop bound (assumed)
    // every 4th iteration (ctr%4 == 0): load arr[i], arr[i+BLOCKDIMX], arr[i+2*BLOCKDIMX], and
    // arr[i+3*BLOCKDIMX] into v0..v3, then consume one of them per iteration as the working value
    // ... compute instructions that use the working value ...
}
Typically, the more values that can be prefetched, the more effective the method is. While the preceding example is not complex, it is a little cumbersome. If the number of prefetched values (PDIST, or prefetch distance) changes, you have to add or delete lines of code.
It is easier to store the prefetched values in shared memory, because you can use array notation and vary the prefetch distance without any effort. However, shared memory is not as close to the execution units as registers. It requires an extra instruction to move the data from there into a register when it is ready for use. For convenience, we introduce macro vsmem to simplify indexing the array in shared memory:
#define vsmem(index) v[index + PDIST*threadIdx.x]
__shared__ double v[PDIST*BLOCKDIMX];
for (i = threadIdx.x, ctr = 0; i < imax; i += BLOCKDIMX, ctr++) {   // imax: loop bound (assumed)
    // every PDIST-th iteration (ctr%PDIST == 0): fill vsmem(0..PDIST-1) with the next PDIST
    // values of arr, then read vsmem(ctr%PDIST) into a register and use it in the computation
    // ... compute instructions ...
}
Instead of prefetching in batches, you can also do a “rolling” prefetch. In that case, you fill the prefetch buffer before entering the main loop and subsequently prefetch exactly one value from memory during each loop iteration, to be used PDIST iterations later. The next example implements rolling prefetching, using array notation and shared memory.
__shared__ double v[PDIST*BLOCKDIMX];
for (k = 0; k < PDIST; k++) vsmem(k) = arr[threadIdx.x + k*BLOCKDIMX];   // prefill the buffer
for (i = threadIdx.x, ctr = 0; i < imax; i += BLOCKDIMX, ctr++) {   // imax: loop bound (assumed)
    // use vsmem(ctr%PDIST), then prefetch the value needed PDIST iterations later into that slot
    // ... compute instructions ...
}
Contrary to the batched method, the rolling prefetch does not suffer any further memory latencies during the execution of the main loop for a sufficiently large prefetch distance. It also uses the same amount of shared memory or register resources, so it would appear to be preferred. However, a subtle issue may limit its effectiveness.
A synchronization within the loop—for example, syncthreads—constitutes a memory fence and forces the loading of arr to complete at that point within the same iteration, not PDIST iterations later. The fix is to use asynchronous loads into shared memory, the simplest version of which is explained in the Pipeline interface section of the CUDA programmer guide. These asynchronous loads do not need to complete at a synchronization point, but only when they are explicitly waited on.
Here’s the corresponding code:
#include <cuda_pipeline_primitives.h>   // pipeline primitives (header name assumed)
__shared__ double v[PDIST*BLOCKDIMX];
for (k = 0; k < PDIST; k++) {   // prefill the buffer with asynchronous copies from global memory
    __pipeline_memcpy_async(&vsmem(k), &arr[threadIdx.x + k*BLOCKDIMX], sizeof(double));
    __pipeline_commit();
}
// main loop: call __pipeline_wait_prior(PDIST-1) before using vsmem(ctr%PDIST), then issue the
// next __pipeline_memcpy_async and __pipeline_commit for the value needed PDIST iterations later
As each __pipeline_wait_prior instruction must be matched by a __pipeline_commit instruction, we put the latter inside the loop that prefills the prefetch buffer, before entering the main computational loop, to keep bookkeeping of matching instruction pairs simple.
Performance results
Figure 1 shows, for various prefetch distances, the performance improvement of a kernel taken from a financial application under the five algorithmic variations described earlier.
Batched prefetch into registers (scalar batched)
Batched prefetch into shared memory (smem batched)
Rolling prefetch into registers (scalar rolling)
Rolling prefetch into shared memory (smem rolling)
Rolling prefetch into shared memory using asynchronous memory copies (smem rolling async)
Clearly, the rolling prefetching into shared memory with asynchronous memory copies gives good benefit, but it is uneven as the prefetch buffer size grows.
A closer inspection of the results, using Nsight Compute, shows that bank conflicts occur in shared memory, which cause a warp’s worth of asynchronous loads to be split into more successive memory requests than strictly necessary. The classical optimization approach of padding the array size in shared memory to avoid bad strides works in this case. The value of PADDING is chosen such that the sum of PDIST and PADDING equals a power of two plus 1. Apply it to all variations that use shared memory.
This leads to the improved shared memory results shown in Figure 2. A prefetch distance of just 6, combined with asynchronous memory copies in a rolling fashion, is sufficient to obtain optimal performance at almost 60% speedup over the original version of the code. We could actually have arrived at this performance improvement without resorting to padding by changing the indexing scheme of the array in shared memory, which is left as an exercise for the reader.
A variation of prefetching not yet discussed moves data from global memory to the L2 cache, which may be useful if space in shared memory is too small to hold all data eligible for prefetching. This type of prefetching is not directly accessible in CUDA and requires programming at the lower PTX level.
Summary
In this post, we showed you examples of localized changes to source code that may speed up memory accesses. These do not change the amount of data being moved from memory to the SMs, only their timing. You may be able to optimize more by rearranging memory accesses such that data is reused many times after it arrives on the SM.