Tourer vehicles just became a little more grand. Electric vehicle maker Human Horizons provided a detailed glimpse earlier this month of its latest production model: the GT HiPhi Z. The intelligent EV is poised to redefine the grand tourer vehicle category with innovative, software-defined capabilities that bring luxurious cruising to the next level.
NVIDIA today announced a unified computing platform for speeding breakthroughs in quantum research and development across AI, HPC, health, finance and other disciplines.
Kristel Michielsen was into quantum computing before quantum computing was cool. The computational physicist simulated quantum computers as part of her Ph.D. work in the Netherlands in the early 1990s. Today, she manages one of Europe’s largest facilities for quantum computing, the Jülich Unified Infrastructure for Quantum Computing (JUNIQ).
NVIDIA introduces QODA, a new platform for hybrid quantum-classical computing, enabling easy programming of integrated CPU, GPU, and QPU systems.
The past decade has seen quantum computing leap out of academic labs into the mainstream. Efforts to build better quantum computers proliferate at both startups and large companies. And while it is still unclear how far away we are from achieving quantum advantage on common problems, it is clear that now is the time to build the tools needed to deliver valuable quantum applications.
To start, we need to make progress in our understanding of quantum algorithms. Last year, NVIDIA announced cuQuantum, a software development kit (SDK) for accelerating simulations of quantum computing. Simulating quantum circuits using cuQuantum on GPUs enables algorithms research with performance and scale far beyond what can be achieved on quantum processing units (QPUs) today. This is paving the way for breakthroughs in understanding how to make the most of quantum computers.
In addition to improving quantum algorithms, we also need to use QPUs to their fullest potential alongside classical computing resources: CPUs and GPUs. Today, NVIDIA is announcing the launch of Quantum Optimized Device Architecture (QODA), a platform for hybrid quantum-classical computing whose mission is to enable exactly this kind of integration.
As quantum computing progresses, all valuable quantum applications will be hybrid, with the quantum computer working alongside high-performance classical computing. GPUs, originally created purely for graphics, became essential hardware for high-performance computing (HPC). That transformation required new software to enable powerful and straightforward programming. Turning quantum computers from science experiments into useful accelerators also requires new software.
This new era of quantum software will enable performant hybrid computation and increase the accessibility of quantum computers for the broader group of scientists and innovators.
Quantum programming landscape
The last five years have seen the development of quantum programming approaches targeting small-scale, noisy quantum computing architectures. This development has been great for algorithm developers and has enabled early prototyping of both standard quantum algorithms and hybrid variational approaches.
Due to the scarcity of quantum resources and practicalities of hardware implementations, most of these programming approaches have been at the pure Python level supporting a remote, cloud-based execution model.
As quantum architectures improve and algorithm developers consider true quantum acceleration of existing classical heterogeneous computing, the question arises: How should we support quantum coprocessing in the traditional HPC context?
NVIDIA has been a true pioneer in the development of HPC programming models, heterogeneous compiler platforms, and high-level application libraries that accelerate traditional scientific computing workflows with one or many NVIDIA GPUs.
We see quantum computing as another element of a heterogeneous HPC system architecture and envision a programming model that seamlessly incorporates quantum coprocessing into our existing CUDA ecosystem. Current approaches that start at the Python language level are not sufficient in this regard and will ultimately limit performant integration of classical and quantum compute resources.
QODA for HPC
NVIDIA is developing an open specification for programming hybrid quantum-classical compute architectures in an HPC context. We are announcing the QODA programming model specification and corresponding NVQ++ compiler platform enabling a backend-agnostic (physical, simulated), single-source, modern C++ approach to quantum-accelerated high-performance computing.
QODA is inherently interoperable with existing classical parallel programming models such as CUDA, OpenMP, and OpenACC. This compiler implementation also lowers quantum-classical C++ source code representations to binary executables that natively target cuQuantum-enabled simulation backends.
This programming and compilation workflow enables a performant programming environment for accelerating hybrid algorithm research and development activities through standard interoperability with GPU processing and circuit simulation that scales from laptops to distributed multi-node, multi-GPU architectures.
```cpp
// Prepare an N-qubit GHZ state and measure all qubits
auto ghz = [](const int N) __qpu__ {
  qoda::qreg q(N);
  h(q[0]);
  for (auto i : qoda::irange(N - 1)) {
    cnot(q[i], q[i + 1]);
  }
  mz(q);
};

// Sample a GHZ state on 30 qubits
auto counts = qoda::sample(ghz, 30);
counts.dump();
```
As shown in the code example, QODA provides a CUDA-like kernel-based programming approach, with a modern C++ focus. You can define quantum device code as standalone function objects or lambdas annotated with __qpu__ to indicate that this is to be compiled to and executed on the quantum device.
Relying on function objects rather than free functions (the CUDA kernel approach) makes it efficient to build generic standard quantum library functions that can take any quantum kernel expression as input.
One simple example of this is the standard QODA sampling function (qoda::sample(...)). It takes as input a quantum kernel instance and any concrete arguments with which the kernel is to be evaluated. It returns the familiar mapping of observed qubit measurement bit strings to the number of times each was observed.
QODA kernel programmers have access to certain built-in types pertinent for quantum computing (qoda::qubit, qoda::qreg, qoda::spin_op, and so on), quantum gate operations, and all traditional classical control flow inherited from C++.
An interesting aspect of the language compilation approach detailed earlier is the ability to compile QODA codes that contain CUDA kernels, OpenMP and OpenACC pragmas, and higher-level CUDA library API calls. This feature will enable hybrid quantum-classical application developers to truly take advantage of multi-GPU processing in tandem with quantum computing.
Future quantum computing use cases will require classical parallel processing for things like data preprocessing and postprocessing, standard quantum compilation tasks, and syndrome decoding for quantum error correction.
An early look at quantum-classical applications
A prototypical hybrid quantum-classical algorithm targeting noisy, near-term quantum computing architectures is the variational quantum eigensolver (VQE). The goal for VQE is to compute the minimum eigenvalue for a given quantum mechanical operator, such as a Hamiltonian, with respect to a parameterized state preparation circuit by relying on the variational principle from quantum mechanics.
You execute the state preparation circuit for a given set of gate rotational parameters and perform a set of measurements dictated by the structure of the quantum mechanical operator to compute the expectation value at those concrete parameters. A user-specified classical optimizer is then used to iteratively search for the minimal expectation value by varying these parameters.
You can see what a general VQE-like algorithm looks like with the QODA programming model:
```cpp
// Define your state prep ansatz…
auto ansatz = [](std::vector<double> thetas) __qpu__ {
  // … Use C++ control flow and quantum intrinsic ops …
};

// Define the Hamiltonian
qoda::spin_op H = /* … use x, y, z to build up Hamiltonian … */;

// Create a specific function optimization strategy
int n_params = /* … */;
qoda::nlopt::lbfgs optimizer;
optimizer.initial_parameters = qoda::random_vector(-1, 1, n_params);

// Run the VQE algorithm with QODA
auto [opt_val, opt_params] =
    qoda::vqe(ansatz, H, optimizer, n_params);
printf("Optimal = %lf\n", opt_val);
```
The first component required is the parameterized ansatz QODA kernel expression, shown in the code example as a lambda taking a std::vector<double>.
The actual body of this lambda is dependent on the problem at hand, but you are free to build up this function with standard C++ control flow, in-scope quantum kernel invocations, and the logical set of quantum intrinsic operations.
The next component required is the operator whose expectation value you want to compute. QODA represents these as the built-in spin_op type, and you can build them up programmatically with Pauli x(int), y(int), and z(int) function calls.
Next, you need a classical function optimizer. This is a general concept within the QODA language specification, meant to be subclassed for specific optimization strategies, either gradient-based or gradient-free.
Finally, the language exposes a standard library function, qoda::vqe, for invoking the entire VQE workflow. It is parameterized on the following:
The QODA kernel instance modeling the state preparation ansatz
The operator whose minimal eigenvalue you need
The classical optimizer instance
The total number of variational parameters
The function returns the optimal eigenvalue and the corresponding optimal parameters for the state preparation circuit, which the example unpacks with a C++17 structured binding.
The preceding workflow is extremely general and lends itself to the development of variational algorithms that are ultimately generic with respect to quantum kernel expressions, spin operators of interest, and classical optimization routines.
But it also demonstrates the underlying philosophy of the QODA programming model: To provide core concepts to describe quantum code expressions, and then promote the utility of a standard library of generic functions enabling hybrid quantum-classical algorithmic composability.
QODA Early Interest program
Quantum computers hold great promise to help us solve some of our most important problems. We’re opening up quantum computing to scientists and experts in domains where HPC and AI already play a critical role, and enabling easy integration of today’s best existing software with quantum software. This will dramatically accelerate the pace at which quantum computers realize their potential.
QODA provides an open platform to do just that, and NVIDIA is excited to work with the entire quantum community to make useful quantum computing a reality. Apply to the QODA Early Interest program to stay up-to-date on NVIDIA quantum computing developments.
Posted by Qihang Yu, Student Researcher, and Liang-Chieh Chen, Research Scientist, Google Research
Panoptic segmentation is a computer vision problem that serves as a core task for many real-world applications. Due to its complexity, previous work often divides panoptic segmentation into semantic segmentation (assigning semantic labels, such as “person” and “sky”, to every pixel in an image) and instance segmentation (identifying and segmenting only countable objects, such as “pedestrians” and “cars”, in an image), and further divides it into several sub-tasks. Each sub-task is processed individually, and extra modules are applied to merge the results from each sub-task stage. This process is not only complex, but it also introduces many hand-designed priors when processing sub-tasks and when combining the results from different sub-task stages.
Recently, inspired by Transformer and DETR, MaX-DeepLab proposed an end-to-end solution for panoptic segmentation with mask transformers (an extension of the Transformer architecture that is used to generate segmentation masks). This solution adopts three components: a pixel path (consisting of either convolutional neural networks or vision transformers) to extract pixel features, a memory path (consisting of transformer decoder modules) to extract memory features, and a dual-path transformer for interaction between pixel features and memory features. However, the dual-path transformer, which utilizes cross-attention, was originally designed for language tasks, where the input sequence consists of dozens or hundreds of words. In vision tasks, and specifically segmentation problems, the input sequence instead consists of tens of thousands of pixels. Pixels are therefore both a much larger input scale and a lower-level embedding than language words.
In “CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation”, presented at CVPR 2022, and “kMaX-DeepLab: k-means Mask Transformer”, to be presented at ECCV 2022, we propose to reinterpret and redesign cross-attention from a clustering perspective (i.e., grouping pixels with the same semantic labels together), which better adapts to vision tasks. CMT-DeepLab is built upon the previous state-of-the-art method, MaX-DeepLab, and employs a pixel clustering approach to perform cross-attention, leading to a denser and more plausible attention map. kMaX-DeepLab further redesigns cross-attention to be more like a k-means clustering algorithm, with a simple change to the activation function. We demonstrate that CMT-DeepLab achieves significant performance improvements. kMaX-DeepLab not only simplifies the modification but also further pushes the state of the art by a large margin, without test-time augmentation. We are also excited to announce the open-source release of kMaX-DeepLab, our best performing segmentation model, in the DeepLab2 library.
Overview
Instead of directly applying cross-attention to vision tasks without modifications, we propose to reinterpret it from a clustering perspective. Specifically, we note that the mask Transformer object queries can be considered cluster centers (which aim to group pixels with the same semantic labels). The process of cross-attention is then similar to the k-means clustering algorithm, which iterates two steps. (1) It assigns pixels to cluster centers; multiple pixels can be assigned to a single cluster center, and some cluster centers may have no assigned pixels. (2) It updates the cluster centers by averaging the pixels assigned to the same cluster center; a cluster center is not updated if no pixel is assigned to it.
In CMT-DeepLab and kMaX-DeepLab, we reformulate the cross-attention from the clustering perspective, which consists of iterative cluster-assignment and cluster-update steps.
Given the popularity of the k-means clustering algorithm, in CMT-DeepLab we redesign cross-attention so that the spatial-wise softmax operation (i.e., the softmax operation that is applied along the image spatial resolution) that in effect assigns cluster centers to pixels is instead applied along the cluster centers. In kMaX-DeepLab, we further simplify the spatial-wise softmax to cluster-wise argmax (i.e., applying the argmax operation along the cluster centers). We note that the argmax operation is the same as the hard assignment (i.e., a pixel is assigned to only one cluster) used in the k-means clustering algorithm.
Reformulating the cross-attention of the mask transformer from the clustering perspective significantly improves segmentation performance and makes the complex mask transformer pipeline simpler and more interpretable. First, pixel features are extracted from the input image with an encoder-decoder structure. Then, a set of cluster centers is used to group pixels, and the centers are updated based on the clustering assignments. Finally, the clustering assignment and update steps are performed iteratively, with the last assignment directly serving as the segmentation prediction.
To convert a typical mask Transformer decoder (consisting of cross-attention, multi-head self-attention, and a feed-forward network) into our proposed k-means cross-attention, we simply replace the spatial-wise softmax with cluster-wise argmax.
The meta architecture of our proposed kMaX-DeepLab consists of three components: pixel encoder, enhanced pixel decoder, and kMaX decoder. The pixel encoder is any network backbone, used to extract image features. The enhanced pixel decoder includes transformer encoders to enhance the pixel features, and upsampling layers to generate higher resolution features. The series of kMaX decoders transform cluster centers into (1) mask embedding vectors, which multiply with the pixel features to generate the predicted masks, and (2) class predictions for each mask.
The meta architecture of kMaX-DeepLab.
Results
We evaluate CMT-DeepLab and kMaX-DeepLab using the panoptic quality (PQ) metric on two of the most challenging panoptic segmentation datasets, COCO and Cityscapes, against MaX-DeepLab and other state-of-the-art methods. CMT-DeepLab achieves a significant performance improvement. kMaX-DeepLab not only simplifies the modification but also further pushes the state of the art by a large margin. On the COCO val set it reaches 58.0% PQ. On the Cityscapes val set it reaches 68.4% PQ, 44.0% mask Average Precision (mask AP), and 83.5% mean Intersection-over-Union (mIoU). These results use neither test-time augmentation nor an external dataset.
Designed from a clustering perspective, kMaX-DeepLab not only achieves higher performance but also produces a more plausible visualization of the attention map, which helps in understanding its working mechanism. In the example below, kMaX-DeepLab iteratively performs clustering assignments and updates, which gradually improves mask quality.
kMaX-DeepLab’s attention map can be directly visualized as a panoptic segmentation, which makes the model's working mechanism more plausible and easier to interpret (image credit and license: COCO).
Conclusions
We have demonstrated a way to better design mask transformers for vision tasks. With simple modifications, CMT-DeepLab and kMaX-DeepLab reformulate cross-attention to be more like a clustering algorithm. As a result, the proposed models achieve state-of-the-art performance on the challenging COCO and Cityscapes datasets. We hope that the open-source release of kMaX-DeepLab in the DeepLab2 library will facilitate future research on designing vision-specific transformer architectures.
Acknowledgements
We are thankful for the valuable discussions and support from Huiyu Wang, Dahun Kim, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Florian Schroff, Hartwig Adam, and Alan Yuille.
Accelerate your AI-based simulations using NVIDIA Modulus. The 22.07 release brings advancements with weather modeling, novel network architectures, geometry modeling, performance, and more.
Visual effects savant Surfaced Studio steps In the NVIDIA Studio this week to share his clever film sequences, Fluid Simulation and Destruction, as well as his creative workflows. These sequences feature quirky visual effects that Surfaced Studio is renowned for demonstrating on his YouTube channel.
Discover how the new PennyLane simulator device, lightning.gpu, offloads quantum gate calls to the NVIDIA cuQuantum software development kit to speed up the simulation of quantum circuits.
The release by U.S. President Joe Biden Monday of the first full-color image from the James Webb Space Telescope is already astounding, and delighting, humans around the globe. “We can see possibilities nobody has ever seen before, we can go places nobody has ever gone before,” Biden said.