Categories
Misc

Using Semaphore and Memory Sharing Extensions for Vulkan Interop with NVIDIA OpenCL

Learn about new OpenCL support for Vulkan interoperability using semaphores and memory sharing extensions.

Developers often use OpenCL for compute together with other APIs, such as OpenGL, to access functionality including graphics rendering. OpenCL has long enabled the sharing of implicit buffer and image objects with OpenGL, OpenGL ES, EGL, Direct3D 10, and Direct3D 11 through extensions:

  • cl_khr_gl_sharing
  • cl_khr_gl_event
  • cl_khr_egl_image
  • cl_khr_egl_event
  • cl_khr_d3d10_sharing
  • cl_khr_d3d11_sharing

Download sample code now.

New generation GPU APIs such as Vulkan use explicit references to external memory together with semaphores to coordinate access to shared resources. Until now, there have been no OpenCL extensions to enable external memory and semaphore sharing with this new class of API.

Interop between OpenCL and Vulkan has been in strong demand for both mobile and desktop platforms. NVIDIA has worked closely with the Khronos OpenCL Working Group to release a set of provisional cross-vendor KHR extensions. The extensions enable applications to efficiently share data between OpenCL and APIs such as Vulkan, with significantly more flexibility than previous-generation interop APIs that use implicit resources.

This set of new external memory and semaphore sharing extensions provides a generic framework that enables OpenCL to import external memory and semaphore handles exported by external APIs, using a methodology that will be familiar to Vulkan developers. OpenCL then uses those semaphores to synchronize with the external runtime, coordinating the use of shared memory.
 

Diagram shows how OpenCL imports memory and semaphore handles from Vulkan, and uses semaphores to synchronize memory ownership and access.
Figure 1. Interoperability relationship between OpenCL and Vulkan software

External API-specific interop extensions can then be added to handle the details of interacting with specific APIs. Vulkan interop is available today, and additional APIs, such as DirectX 12, are planned. 

The new OpenCL external semaphore and memory sharing functionality consists of two separate, carefully structured sets of extensions.

Semaphore extensions

This set of extensions adds the ability to create OpenCL semaphore objects from OS-specific semaphore handles. 

The following extensions extend cl_khr_external_semaphore with handle-type-specific behavior:

  • cl_khr_external_semaphore_opaque_fd—Shares external semaphores using Linux fd handles with reference transference, similar to VK_KHR_external_semaphore_fd.
  • cl_khr_external_semaphore_win32—Shares external semaphores using win32 NT and KMT handles with reference transference, similar to VK_KHR_external_semaphore_win32.

Memory extensions 

These extensions add the ability to create OpenCL memory objects from OS-specific memory handles. They have a similar design to the Vulkan external memory extension VK_KHR_external_memory. 

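The following extensions extend cl_khr_external_memory with handle-type-specific behavior:

  • cl_khr_external_memory_opaque_fd: Shares external memory using Linux fd handles with reference transference, similar to VK_KHR_external_memory_fd.
  • cl_khr_external_memory_win32: Shares external memory using win32 NT and KMT handles with reference transference, similar to VK_KHR_external_memory_win32.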

Using OpenCL

The typical interop use case consists of the following steps.

Check if the required support is available:

  • Check that the required extensions cl_khr_external_semaphore and cl_khr_external_memory are supported by the underlying OpenCL platform and devices, using clGetPlatformInfo and clGetDeviceInfo.
  • To use Win32 semaphore and memory handles, check that the cl_khr_external_semaphore_win32 and cl_khr_external_memory_win32 extensions are present.
  • To use FD semaphore and memory handles, check that the cl_khr_external_semaphore_opaque_fd and cl_khr_external_memory_opaque_fd extensions are present. This can also be checked by querying the supported handle types.
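The checks in the list above amount to testing set membership on the space-separated extension string that a device reports. The following is a minimal sketch in plain Python, not part of the sample code: the helper name usable_handle_types and the example string are hypothetical, and the extension names are the ones given in the extension sections earlier in this post.

```python
# Hypothetical helper: given the space-separated extension string reported by
# a device, decide which external handle types can be used for interop.
REQUIRED = {"cl_khr_external_semaphore", "cl_khr_external_memory"}
HANDLE_TYPES = {
    "opaque_fd": {"cl_khr_external_semaphore_opaque_fd",
                  "cl_khr_external_memory_opaque_fd"},
    "win32": {"cl_khr_external_semaphore_win32",
              "cl_khr_external_memory_win32"},
}


def usable_handle_types(extension_string):
    """Return the sorted handle-type names whose extensions are all present."""
    present = set(extension_string.split())
    if not REQUIRED <= present:
        return []  # the base extensions are missing, so nothing can be imported
    return sorted(name for name, exts in HANDLE_TYPES.items() if exts <= present)


# Made-up extension string for illustration only.
example = ("cl_khr_external_semaphore cl_khr_external_memory "
           "cl_khr_external_semaphore_opaque_fd cl_khr_external_memory_opaque_fd")
print(usable_handle_types(example))
```

Splitting the string into whole tokens matters here, because a plain substring search would report cl_khr_external_memory as present whenever only cl_khr_external_memory_win32 is listed.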

Importing external semaphores requires cl_khr_external_semaphore. If cl_khr_external_semaphore_opaque_fd is supported, you can import external semaphores exported by Vulkan using FD handles in OpenCL with clCreateSemaphoreWithPropertiesKHR.

// Get cl_devices of the platform.
clGetDeviceIDs(..., &devices, &deviceCount);
// Create cl_context with just the first device.
cl_context context = clCreateContext(..., 1, devices, ...);
// Obtain the fd handle for the external semaphore exported by the other API.
int fd = getFdForExternalSemaphore();
// Create clSema of type cl_semaphore_khr, usable on the only available device,
// assuming the semaphore was imported from the same device.
cl_semaphore_properties_khr sema_props[] =
        {(cl_semaphore_properties_khr)CL_SEMAPHORE_TYPE_KHR,
         (cl_semaphore_properties_khr)CL_SEMAPHORE_TYPE_BINARY_KHR,
         (cl_semaphore_properties_khr)CL_SEMAPHORE_HANDLE_OPAQUE_FD_KHR,
         (cl_semaphore_properties_khr)fd, 0};
cl_int errcode_ret = 0;
cl_semaphore_khr clSema = clCreateSemaphoreWithPropertiesKHR(context,
                                                             sema_props,
                                                             &errcode_ret);

Importing images additionally requires cl_khr_external_memory and support for images; that case is covered further down. For Win32 handles, import external semaphores exported by Vulkan in OpenCL with clCreateSemaphoreWithPropertiesKHR.

// Get cl_devices of the platform.
clGetDeviceIDs(..., &devices, &deviceCount);
// Create cl_context with just the first device.
cl_context context = clCreateContext(..., 1, devices, ...);
// Obtain the Win32 handle for the external semaphore exported by the other API.
void *handle = getWin32HandleForExternalSemaphore();
// Create clSema of type cl_semaphore_khr, usable on the only available device,
// assuming the semaphore was imported from the same device.
cl_semaphore_properties_khr sema_props[] =
        {(cl_semaphore_properties_khr)CL_SEMAPHORE_TYPE_KHR,
         (cl_semaphore_properties_khr)CL_SEMAPHORE_TYPE_BINARY_KHR,
         (cl_semaphore_properties_khr)CL_SEMAPHORE_HANDLE_OPAQUE_WIN32_KHR,
         (cl_semaphore_properties_khr)handle, 0};
cl_int errcode_ret = 0;
cl_semaphore_khr clSema = clCreateSemaphoreWithPropertiesKHR(context,
                                                             sema_props,
                                                             &errcode_ret);

In OpenCL, import external memory exported by Vulkan using the FD handle as buffer memory with clCreateBufferWithProperties.

// Get cl_devices of the platform.
clGetDeviceIDs(..., &devices, &deviceCount);

// Create cl_context with just the first device.
cl_context clContext = clCreateContext(..., 1, devices, ...);

// Obtain the fd handle for the external memory exported by the other API.
int fd = getFdForExternalMemory();

// Create extMemBuffer of type cl_mem from fd.
// The property list is a zero-terminated sequence of key/value pairs.
cl_mem_properties_khr extMemProperties[] =
{   (cl_mem_properties_khr)CL_EXTERNAL_MEMORY_HANDLE_OPAQUE_FD_KHR,
    (cl_mem_properties_khr)fd,
    0
};
cl_int errcode_ret = 0;
// 'size' is the size in bytes of the memory exported by the other API.
cl_mem extMemBuffer = clCreateBufferWithProperties(/*context*/          clContext,
                                                   /*properties*/       extMemProperties,
                                                   /*flags*/            0,
                                                   /*size*/             size,
                                                   /*host_ptr*/         NULL,
                                                   /*errcode_ret*/      &errcode_ret);

In OpenCL, import external memory exported by Vulkan as image memory using clCreateImageWithProperties.

// Obtain fd/win32 or similar handle for external memory to be imported from the other API.
int fd = getFdForExternalMemory();

// Set cl_image_format based on external image info.
cl_image_format clImgFormat = { };
clImgFormat.image_channel_order = CL_RGBA;
clImgFormat.image_channel_data_type = CL_UNORM_INT8;

// Set cl_image_desc based on external image info.
cl_image_desc image_desc = { };
image_desc.image_type = CL_MEM_OBJECT_IMAGE2D_ARRAY;
image_desc.image_width = width;
image_desc.image_height = height;
// For a 2D image array, the layer count goes in image_array_size
// (image_depth is only used for 3D images).
image_desc.image_array_size = depth;

cl_mem_properties_khr extMemProperties[] =
{   (cl_mem_properties_khr)CL_EXTERNAL_MEMORY_HANDLE_OPAQUE_FD_KHR,
    (cl_mem_properties_khr)fd,
    0
};

// Create img of type cl_mem from fd.
cl_int errcode_ret = 0;
cl_mem img = clCreateImageWithProperties(/*context*/        clContext,
                                         /*properties*/     extMemProperties,
                                         /*flags*/          0,
                                         /*image_format*/   &clImgFormat,
                                         /*image_desc*/     &image_desc,
                                         /*host_ptr*/       NULL,
                                         /*errcode_ret*/    &errcode_ret);

Synchronize between OpenCL and Vulkan using semaphore wait and signal.

// Create clSema using one of the above examples of external semaphore creation. 
int errcode_ret = 0; 
cl_semaphore_khr clSema = clCreateSemaphoreWithPropertiesKHR(context, 
                                                             sema_props, 
                                                             &errcode_ret); 
while (true) { 
    // (not shown) Signal the semaphore from the other API,
    // Wait for the semaphore in OpenCL 
    clEnqueueWaitSemaphoresKHR(  /*command_queue*/           command_queue, 
                                 /*num_sema_objects*/        1, 
                                 /*sema_objects*/            &clSema, 
                                 /*num_events_in_wait_list*/ 0, 
                                 /*event_wait_list*/         NULL, 
                                 /*event*/                   NULL); 
    clEnqueueNDRangeKernel(command_queue, ...); 
    clEnqueueSignalSemaphoresKHR(/*command_queue*/           command_queue, 
                                 /*num_sema_objects*/        1, 
                                 /*sema_objects*/            &clSema, 
                                 /*num_events_in_wait_list*/ 0, 
                                 /*event_wait_list*/         NULL, 
                                 /*event*/                   NULL); 
    // (not shown) Launch work in the other API that waits on 'clSema'
}
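
The wait/work/signal pattern in the loop above can be illustrated without a GPU. The following is a rough CPU-only analogy in Python, not OpenCL code: two threads stand in for the two APIs, and threading.Semaphore stands in for the imported semaphore. It uses one semaphore per direction, which is a simplification chosen for this sketch, and all names are hypothetical.

```python
import threading


def run_handoff(frames):
    """Simulate two APIs taking turns on one shared buffer."""
    buffer_ready = threading.Semaphore(0)  # signaled by the "graphics" side
    buffer_free = threading.Semaphore(1)   # signaled by the "compute" side
    shared = {"value": None}
    log = []

    def graphics_side():
        for i in range(frames):
            buffer_free.acquire()          # wait until compute released the buffer
            shared["value"] = i            # write to the shared memory
            log.append(f"graphics wrote {i}")
            buffer_ready.release()         # signal: the buffer is ready for compute

    def compute_side():
        for _ in range(frames):
            buffer_ready.acquire()         # plays the role of the enqueued wait
            log.append(f"compute read {shared['value']}")
            buffer_free.release()          # plays the role of the enqueued signal

    threads = [threading.Thread(target=graphics_side),
               threading.Thread(target=compute_side)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


for line in run_handoff(3):
    print(line)
```

Because each side touches the shared value only between its own wait and signal, the log always alternates strictly between a write and the matching read, which is the ownership handoff the semaphores are there to guarantee.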

Again, for more information, download the vk_ocl_interop_samples.zip samples file.

Try it out today!

Try out the NVIDIA OpenCL implementation of Vulkan interop by downloading the R510 (or later) drivers.

For more information, see Khronos Releases OpenCL 3.0 Extensions for Neural Network Inferencing and OpenCL/Vulkan Interop.

Categories
Misc

Speeding up Numerical Computing in C++ with a Python-like Syntax in NVIDIA MatX

MatX is an experimental library that allows you to write high-performance GPU code in C++, with high-level syntax and a common data type across all functions.

Rob Smallshire once said, “You can write faster code in C++, but write code faster in Python.” Since its release more than a decade ago, CUDA has given C and C++ programmers the ability to maximize the performance of their code on NVIDIA GPUs.

More recently, libraries such as CuPy and PyTorch allowed developers of interpreted languages to leverage the speed of the optimized CUDA libraries from other languages. These interpreted languages have many excellent properties, including easy-to-read syntax, automatic memory management, and common types across all functions.

However, sometimes having these features means paying a performance penalty due to memory management and other factors outside your control. The decrease in performance is often worth it to save development time. Still, it may ultimately require rewriting portions of the application later when performance becomes an issue.

What if you could achieve maximum performance in C++ while still reaping the benefits of the interpreted languages?

MatX overview

MatX is an experimental, GPU-accelerated, numerical computing C++ library. It aims to give users who want the highest possible performance the same easy syntax and common types across all CUDA libraries. Using the C++17 support added in CUDA 11.0, MatX allows you to write the same natural algebraic expressions that you would in a high-level language like Python, without the performance penalty that may come with it.

Tensor types

MatX includes interfaces to many of the popular math libraries, such as cuBLAS, CUTLASS, cuFFT, and CUB, but uses a common data type (tensor_t) across all these libraries. This greatly simplifies the API to these libraries by deducing information that it knows about the tensor type and calling the correct APIs based on that.

The following code examples show an FFT-based resampler.

Python

import numpy as np

N = min(ns, ns_resamp)
nyq = N // 2 + 1

# Create an empty vector
sv = np.empty(ns)

# Real to complex FFT
svc = np.fft.rfft(sv)

# Slice
sv = svc[0:nyq]

# Complex to real IFFT
rsv = np.fft.irfft(sv, ns_resamp)

MatX

uint32_t N = std::min(ns, ns_resamp);
uint32_t nyq = N / 2 + 1;

// Element types are assumed here: real float samples, complex float spectrum.
auto sv  = make_tensor<float>({ns});
auto svc = make_tensor<cuda::std::complex<float>>({ns / 2 + 1});
auto rsv = make_tensor<float>({ns_resamp});

// Real to complex FFT
fft(svc, sv, stream);

// Slice the vector
auto svs = svc.Slice({0}, {nyq});

// Complex to real IFFT
ifft(rsv, svs, stream);

While the code length and readability are similar, the MatX version on an A100 runs about 2100x faster than the NumPy version running on CPU. The MatX version also has many hidden benefits over directly using the CUDA libraries, such as type checking, input and output size checking, and slicing a tensor without pointer manipulation.
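
To make the resampling steps concrete, here is a small self-contained sketch of the same real FFT, slice, inverse FFT sequence. It is an illustration rather than the code used for the timing above: it swaps in a naive O(n^2) DFT from the Python standard library so that it runs without dependencies, the function names are hypothetical, and the final amplitude scaling by ns_resamp / ns is an addition that the snippets above leave out.

```python
import cmath
import math


def rfft(x):
    """Naive real-to-complex DFT: returns bins 0 .. len(x)//2."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]


def irfft(bins, n):
    """Naive complex-to-real inverse DFT of length n (missing bins count as zero)."""
    half = n // 2
    bins = list(bins[:half + 1]) + [0j] * max(0, half + 1 - len(bins))
    out = []
    for t in range(n):
        acc = bins[0].real
        for k in range(1, half + 1):
            term = (bins[k] * cmath.exp(2j * math.pi * k * t / n)).real
            # Every bin except DC (and Nyquist for even n) also stands for its mirror image.
            acc += term if (n % 2 == 0 and k == half) else 2 * term
        out.append(acc / n)
    return out


def resample(x, ns_resamp):
    """Resample x to ns_resamp points by truncating or zero-padding its spectrum."""
    ns = len(x)
    nyq = min(ns, ns_resamp) // 2 + 1
    spectrum = rfft(x)[:nyq]          # the "slice" step
    scale = ns_resamp / ns            # keeps the signal amplitude unchanged
    return [scale * v for v in irfft(spectrum, ns_resamp)]


# Downsample four cycles of a sine wave from 64 to 32 samples.
signal = [math.sin(2 * math.pi * 4 * n / 64) for n in range(64)]
resampled = resample(signal, 32)
print(len(resampled), round(resampled[1], 6))
```

For a band-limited input such as this sine wave, the resampled values match the original waveform sampled at the new rate.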

The tensor types are not limited to FFTs, though, and the same variables can be used inside of other libraries and expressions. For example, if you wanted to perform a GEMM using CUTLASS on the resampler output, you could write the following:

matmul(resampOut, resampView, B, stream);

In this code, resampOut and B are appropriately sized tensors for the GEMM operation. As in the preceding FFT sample, types, sizes, batches, and strides are all inferred from the tensor metadata. Using a strongly typed C++ API also means that many runtime and compile-time errors can be caught without additional debugging.

In addition to supporting the optimized CUDA libraries as backends, these same tensor types can be used in algebraic expressions to perform element-wise operations:

(C = A * B + (D / 5.0) + cos(E)).run(stream);

Lazy evaluation

MatX uses lazy evaluation to create a GPU kernel at compile time representing the expression in parentheses. Only when the run function is called on the expression does the operation execute on the GPU. Over 40 different types of operators are supported and can be mixed and matched across different size and type tensors with compatible parameters. If you look at the earlier expression written as a CUDA kernel, it would look something like this:

__global__ void Expression( float *C, 
                            const float *A,
                            const float *B,
                            const float *D,
                            const float *E,
                            int length)
{
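    // Grid-stride loop: each thread handles every (blockDim.x * gridDim.x)-th element.
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < length;
         idx += blockDim.x * gridDim.x) {
        C[idx] = A[idx] * B[idx] + (D[idx] / 5.0f) + cosf(E[idx]);
    }
}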

While the earlier code is not complicated, it’s hiding several problems:

  • The data types are hard-coded to floats. To change to another type, you must edit the kernel signature. Astute readers would say to use templates and let the compiler deduce types for you. While that may work for some types, it won’t work for all types you may want to use. For example, cosf is not defined for half precision types, so you must use compile-time conditionals to handle different types.
  • Any small change to the function signature needs a completely different function. For example, what if you wanted to add a tensor F in some cases, but still retain this original signature? That would be two functions to maintain that are nearly identical.
  • While a grid-stride loop is good practice and is used to handle different sizes of blocks and grids, you must still have code ensuring that during kernel launch there are enough threads to keep the GPU busy.
  • All inputs are assumed to be 1D vectors; higher dimensions could break with non-unity strides.

There are numerous other deficiencies not listed, including the inability to broadcast different-sized tensors, no checking on sizes, requiring contiguous memory layouts, and more.
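
The grid-stride loop mentioned in the list above is easy to check on paper. The following Python sketch simulates which indices each CUDA thread would visit; it is an illustration of the indexing only, and the function names are hypothetical.

```python
def grid_stride_indices(block_idx, thread_idx, block_dim, grid_dim, length):
    """Indices that one thread visits in a grid-stride loop over `length` elements."""
    start = block_idx * block_dim + thread_idx
    stride = block_dim * grid_dim      # total number of threads in the grid
    return list(range(start, length, stride))


def covered(block_dim, grid_dim, length):
    """All indices visited by all threads, sorted."""
    visited = []
    for b in range(grid_dim):
        for t in range(block_dim):
            visited.extend(grid_stride_indices(b, t, block_dim, grid_dim, length))
    return sorted(visited)


# 2 blocks of 4 threads cover 19 elements, each exactly once.
print(grid_stride_indices(1, 2, 4, 2, 19))
print(covered(4, 2, 19) == list(range(19)))
```

Any launch configuration covers every element exactly once, which is why the loop is correct for all grid sizes. It does not guarantee that the chosen configuration keeps the GPU busy, which is the point made in the list.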

Obviously, this code only works under specific conditions, while the MatX version solves all these issues and more while typically maintaining the same performance as writing the kernel directly.
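
The lazy-evaluation idea described in this section can be sketched in a few lines of plain Python. This is a toy model and not how MatX is implemented: MatX builds its expression types at compile time with C++ templates, while this sketch builds a tree of objects at run time. All class and function names are hypothetical, and because Python cannot overload assignment, an assign method stands in for the = in the MatX expression.

```python
import math


def wrap(value):
    """Turn plain numbers into expression nodes."""
    return value if isinstance(value, Expr) else Scalar(value)


class Expr:
    """Node in a lazily evaluated element-wise expression tree."""
    def __add__(self, other):
        return BinOp(lambda a, b: a + b, self, wrap(other))

    def __mul__(self, other):
        return BinOp(lambda a, b: a * b, self, wrap(other))

    def __truediv__(self, other):
        return BinOp(lambda a, b: a / b, self, wrap(other))


class Scalar(Expr):
    def __init__(self, value):
        self.value = value

    def at(self, i):
        return self.value              # a scalar broadcasts to every index


class Tensor(Expr):
    def __init__(self, data):
        self.data = list(data)

    def at(self, i):
        return self.data[i]

    def assign(self, expr):
        return Assignment(self, wrap(expr))   # nothing is computed yet


class BinOp(Expr):
    def __init__(self, fn, lhs, rhs):
        self.fn, self.lhs, self.rhs = fn, lhs, rhs

    def at(self, i):
        return self.fn(self.lhs.at(i), self.rhs.at(i))


class UnaryOp(Expr):
    def __init__(self, fn, arg):
        self.fn, self.arg = fn, arg

    def at(self, i):
        return self.fn(self.arg.at(i))


def cos(expr):
    return UnaryOp(math.cos, wrap(expr))


class Assignment:
    def __init__(self, target, expr):
        self.target, self.expr = target, expr

    def run(self):
        # One fused loop over the output, with no temporaries for sub-expressions.
        for i in range(len(self.target.data)):
            self.target.data[i] = self.expr.at(i)


A, B = Tensor([1.0, 2.0]), Tensor([3.0, 4.0])
D, E = Tensor([5.0, 10.0]), Tensor([0.0, 0.0])
C = Tensor([0.0, 0.0])

op = C.assign(A * B + (D / 5.0) + cos(E))
print(C.data)   # still the initial values: building the expression does no work
op.run()
print(C.data)
```

Adding another term, such as a tensor F, only changes the expression that is built, and no second hand-written loop has to be maintained.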

Additional MatX features

Other key features of MatX include the following:

  • Creating zero-copy tensor views by slicing, cloning, and permuting existing tensors.
  • Supporting arbitrary-dimension tensors.
  • Providing generators that produce data on the fly without storing it in memory. Common examples are a linearly spaced vector, a Hamming window, or a diagonal matrix.
  • Supporting almost every type used in CUDA, including half precision (both FP16 and BF16) and complex numbers (both full and half precision).
  • Offering linear solver functions through cuSolver, sorting and scanning using CUB, random number generation using cuRAND, reductions, and more.

Summary

MatX is open-sourced under the BSD 3-Clause license.

Categories
Misc

From Earth Sciences to Factory Production: GPU Hackathon Optimizes Modeling Results

The recent Taiwan Computing Cloud GPU Hackathon helped 12 teams advance their HPC and AI projects, using innovative technologies to address pressing global challenges.

Group image of participants of the digital TWCC GPU Hackathon

While the world is continuously changing, one constant is the ongoing drive of developers to tackle challenges using innovative technologies. The recent Taiwan Computing Cloud (TWCC) GPU Hackathon exemplified such a drive, serving as a catalyst for developers and engineers to advance their HPC and AI projects using GPUs. 

The event was a collaboration between the National Center for High-Performance Computing, Taiwan Web Service Corporation, NVIDIA, and OpenACC. Twelve teams, working with 15 NVIDIA mentors, accelerated projects ranging from an AI-driven manufacturing scheduling model to a rapid flood prediction model.

Tapping AI to optimize production efficiency 

One of the key areas of smart manufacturing is optimizing and automating production line processes. Team AI Scheduler, with members from the Computational Intelligence Technology Center (CITC) of the Industrial Technology Research Institute (ITRI), came to the hackathon to work on their manufacturing scheduling model using machine learning.

Traditional scheduling models mostly employ heuristic rules, which can respond to dynamic events instantly. However, their short-term approach often does not lead to the optimal solution and proves inflexible when dealing with changing variables, which limits their ongoing viability.

The team’s approach uses a Monte Carlo Tree Search (MCTS) method, combining classic tree search with the reinforcement learning principles of machine learning. This method addresses the limitations of existing heuristics and improves the efficiency of the overall scheduling model.
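
The post does not describe the team's implementation, so the following is only a generic illustration of one common building block of MCTS: the UCB1 rule often used in the selection step to balance well-performing moves against rarely tried ones. The function names, the exploration constant, and the example numbers are all assumptions.

```python
import math


def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: average reward plus an exploration bonus."""
    if visits == 0:
        return math.inf                # always try unvisited children first
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits):
    """children: list of (name, total_reward, visits) tuples."""
    return max(children, key=lambda ch: ucb1(ch[1], ch[2], parent_visits))[0]


# Made-up statistics for three candidate scheduling decisions.
children = [("job_a_first", 9.0, 10), ("job_b_first", 1.0, 1), ("job_c_first", 0.0, 0)]
print(select_child(children, parent_visits=11))
```

In a full search, the selected child would then be expanded and simulated. The simulation step is the part of the algorithm that the team accelerated on GPUs.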

Working with their mentor, Team AI Scheduler learned to use NVIDIA Nsight Systems to identify bottlenecks and use GPUs to parallelize their code. At the conclusion of the event, the team was able to accelerate the simulation step of their MCTS algorithm. This reduced the scheduling time from 6 hours to 30 minutes and achieved a speedup of 11.3x in overall scheduling efficiency.  

“Having proved the feasibility of using GPUs to accelerate our model at this hackathon, the next step is to adopt it into our commercial models for industry use,” said Dr. Tsan-Cheng Su and Hao-Che Huang of CITC, ITRI. 

Using GPUs to see the big picture in Earth sciences 

Located between the Eurasian and Philippine Sea Plates, Taiwan is one of the most tectonically active places in the world and an important base for global seismological research. The time scale of geological research and tectonic activity is often measured in units of thousands, or tens of thousands, of years. Analyzing it efficiently requires massive amounts of data and adequate compute power.

Hackathon Team IES-Geodynamics, led by Dr. Tan, is pictured.
Figure 1. Led by Dr. Tan (center), Team IES-Geodynamics pictured. 

The IES-Geodynamics team, led by Dr. Tan from the Institute of Earth Sciences (IES), Academia Sinica, came to the GPU Hackathon to accelerate their numerical geodynamical model. Named DynEarthSol, it simulates mantle convection, subduction, mountain building, and tectonics. Previously, the team handled large volumes of data by reducing the number of calculations and steps, chunking the data into pieces, and restricting the computing processes to fit the limited computing power of the CPU. This made it very difficult to see the full picture of the research.

Over the course of the hackathon, the team used a new data input method that leveraged the GPU to calculate the data and multiple steps. Using OpenACC, Team IES-Geodynamics was able to port 80% of their model to GPUs and achieved a 13.6X speedup. 

“This is my second time attending a GPU Hackathon and I will definitely attend the next one,” said Professor Eh Tan, Research Fellow from IES, Academia Sinica. “We have learned the appropriate way to adopt GPUs and the user-friendly profiling tool gives us a great idea for how to optimize our model.” 

The team will continue to work towards porting the remaining 20% of their model. They look forward to running more high-resolution models using GPUs to gain a deeper understanding of formation activities in Taiwan. 

Rapid flood assessment for emergency planning and response 

Flooding is among the most devastating natural disasters. Causing massive casualties and economic losses, floods affect an average of 21 million people worldwide each year with numbers expected to rise due to climate change and other factors. Preventing and mitigating these hazards is a critical endeavor. 

THINKLAB, a team from National Yang Ming Chiao Tung University (NYCU), is developing a model that can provide fast and accurate results for emergency purposes while remaining simple to operate. The proposed hybrid inundation model (HIM) solves the zero-inertia equation through the Cellular Automata approach and works with subgrid-scale interpolation strategies to generate higher-resolution results.

Simulating flood extents using the hybrid inundation model (HIM).
Figure 2. Example of flood extents produced by the HIM.

Developed using Python and NumPy libraries, the HIM ran without parallel or GPU computations at the onset of the hackathon. During the event, Team THINKLAB used CuPy to parallelize their code to run on GPUs, then focused on applying user-defined CUDA kernels to the parameters. The result was a 672x speedup, bringing the computation time from 2 weeks to approximately 30 minutes.

“We learned so many techniques during this event and highly recommend these events to others,” said Obaja Wijaya, team member of THINKLAB. “NVIDIA is the expert in this field and by working with their mentors we have learned how to optimize models/codes using GPU programming.” 

Additional hackathons and boot camps are scheduled throughout 2022. For more information on GPU Hackathons and future events, visit https://www.gpuhackathons.org

Categories
Misc

Is it possible to embed a TensorFlow Lite model into an 8-bit microcontroller?

Hey guys, need your help, please 🙂

submitted by /u/markwatsn

Categories
Misc

How many times does TensorFlow Lite usually compress a model built with TensorFlow?

What do u think about it?

submitted by /u/markwatsn

Categories
Misc

Colab script for object detection with tensorflow and keras – ValueError: Unexpected result of `train_function` (Empty logs)

Hello to everyone,

I am trying to adapt the script from this link keras example to my custom dataset but I run into the following issue:

ValueError: Unexpected result of `train_function` (Empty logs). Please use `Model.compile(..., run_eagerly=True)`, or `tf.config.run_functions_eagerly(True)` for more information of where went wrong, or file a issue/bug to `tf.keras`.

My dataset is as follows (I flattened it to get past an error when converting a dict to a TensorFlow dataset):

<TensorSliceDataset element_spec={'image/filename': TensorSpec(shape=(), dtype=tf.string, name=None), 'image/id': TensorSpec(shape=(), dtype=tf.int32, name=None), 'is_crowd': TensorSpec(shape=(), dtype=tf.bool, name=None), 'area': TensorSpec(shape=(), dtype=tf.float32, name=None), 'bbox': TensorSpec(shape=(1, 4), dtype=tf.float32, name=None), 'id': TensorSpec(shape=(), dtype=tf.int32, name=None), 'image': TensorSpec(shape=(480, 640, 3), dtype=tf.float32, name=None), 'label': TensorSpec(shape=(), dtype=tf.int32, name=None)}> 

while the example dataset is

<PrefetchDataset element_spec={'image': TensorSpec(shape=(None, None, 3), dtype=tf.uint8, name=None), 'image/filename': TensorSpec(shape=(), dtype=tf.string, name=None), 'image/id': TensorSpec(shape=(), dtype=tf.int64, name=None), 'objects': {'area': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'bbox': TensorSpec(shape=(None, 4), dtype=tf.float32, name=None), 'id': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'is_crowd': TensorSpec(shape=(None,), dtype=tf.bool, name=None), 'label': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}}> 

My script is publicly available here. If anyone can help with what I am doing wrong (i.e. input images, tensors, model building), I would be so grateful!!

submitted by /u/agristats

Categories
Misc

Any advice on how to deploy a deep-learning model on mobile devices?

We currently have an app built on Xamarin and C#. My aim is to provide an analytics platform (which I’ve built in TF), however what would be the best way to deploy it? I’ve done some readings of the docs, but I’d love to hear your guys experience / thoughts?

submitted by /u/PrijNaidu

Categories
Misc

Can you add YOLO to the top of a pretrained model?

I have a InceptionResNetV2 model that is trained for identification of insects. I was wondering if I could change the base identification part of YOLO to use my model? My understanding is that YOLO trains identification based on Darknet,VGG, or other small networks and then moves to a partitioning method for the object detection so based on my limited knowledge I’m guessing it should theoretically be possible to replace these small base models but I am not sure if it is this simple or if my neural network architecture could work. I couldn’t find much information about this online.

submitted by /u/188_888

Categories
Misc

Machine Learning in Scraping with Rails

submitted by /u/Kagermanov

Categories
Offsites

4D-Net: Learning Multi-Modal Alignment for 3D and Image Inputs in Time

While not immediately obvious, all of us experience the world in four dimensions (4D). For example, when walking or driving down the street we observe a stream of visual inputs, snapshots of the 3D world, which, when taken together in time, create a 4D visual input. Today’s autonomous vehicles and robots are able to capture much of this information through various onboard sensing mechanisms, such as LiDAR and cameras.

LiDAR is a ubiquitous sensor that uses light pulses to reliably measure the 3D coordinates of objects in a scene. However, it is also sparse and has a limited range: the farther one is from a sensor, the fewer points will be returned. This means that far-away objects might only get a handful of points, or none at all, and might not be seen by LiDAR alone. At the same time, images from the onboard camera, which is a dense input, are incredibly useful for semantic understanding, such as detecting and segmenting objects. With high resolution, cameras can be very effective at detecting objects far away, but are less accurate in measuring the distance.

Autonomous vehicles collect data from both LiDAR and onboard camera sensors. Each sensor measurement is recorded at regular time intervals, providing an accurate representation of the 4D world. However, very few research algorithms use both of these in combination, especially when taken “in time”, i.e., as a temporally ordered sequence of data, mostly due to two major challenges. When using both sensing modalities simultaneously, 1) it is difficult to maintain computational efficiency, and 2) pairing the information from one sensor to another adds further complexity since there is not always a direct correspondence between LiDAR points and onboard camera RGB image inputs.

In “4D-Net for Learned Multi-Modal Alignment”, published at ICCV 2021, we present a neural network that can process 4D data, which we call 4D-Net. This is the first attempt to effectively combine both types of sensors, 3D LiDAR point clouds and onboard camera RGB images, when both are in time. We also introduce a dynamic connection learning method, which incorporates 4D information from a scene by performing connection learning across both feature representations. Finally, we demonstrate that 4D-Net is better able to use motion cues and dense image information to detect distant objects while maintaining computational efficiency.

4D-Net
In our scenario, we use 4D inputs (3D point clouds and onboard camera image data in time) to solve a very popular visual understanding task, the 3D box detection of objects. We study the question of how one can combine the two sensing modalities, which come from different domains and have features that do not necessarily match — i.e., sparse LiDAR inputs span the 3D space and dense camera images only produce 2D projections of a scene. The exact correspondence between their respective features is unknown, so we seek to learn the connections between these two sensor inputs and their feature representations. We consider neural network representations where each of the feature layers can be combined with other potential layers from other sensor inputs, as shown below.

4D-Net effectively combines 3D LiDAR point clouds in time with RGB images, also streamed in time as video, learning the connections between different sensors and their feature representations.

Dynamic Connection Learning Across Sensing Modalities
We use a light-weight neural architecture search to learn the connections between both types of sensor inputs and their feature representations, to obtain the most accurate 3D box detection. In the autonomous driving domain it is especially important to reliably detect objects at highly variable distances, with modern LiDAR sensors reaching several hundreds of meters in range. This implies that more distant objects will appear smaller in the images and the most valuable features for detecting them will be in earlier layers of the network, which better capture fine-scale features, as opposed to close-by objects represented by later layers. Based on this observation, we modify the connections to be dynamic and select among features from all layers using self-attention mechanisms. We apply a learnable linear layer, which is able to apply attention-weighting to all other layer weights and learn the best combination for the task at hand.
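
As a rough illustration of the attention-weighting idea, and not the implementation from the paper, the following Python sketch combines feature vectors from several layers using softmax weights. The function names are hypothetical, and the logits are passed in directly as a stand-in for the output of the learnable linear layer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(layer_features, layer_logits):
    """Attention-weighted sum of same-sized feature vectors, one per source layer."""
    weights = softmax(layer_logits)
    dim = len(layer_features[0])
    return [sum(w * feats[d] for w, feats in zip(weights, layer_features))
            for d in range(dim)]


# Two layers with equal scores contribute equally.
early_layer = [1.0, 2.0]   # stands in for fine-scale features
late_layer = [3.0, 4.0]    # stands in for coarse, high-level features
print(combine_layers([early_layer, late_layer], [0.0, 0.0]))
```

Raising the score of the early layer shifts the combined features toward it, which mirrors the observation above that distant objects are better served by earlier layers.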

Connection learning approach schematic, where connections between features from the 3D point cloud inputs are combined with the features from the RGB camera video inputs. Each connection learns the weighting for the corresponding inputs.

Results
We evaluate our results against state-of-the-art approaches on the Waymo Open Dataset benchmark, for which previous models have only leveraged 3D point clouds in time or a combination of a single point cloud and camera image data. 4D-Net uses both sensor inputs efficiently, processing 32 point clouds in time and 16 RGB frames within 164 milliseconds, and performs well compared to other methods. In comparison, the next best approach is both less efficient and less accurate: its neural net computation takes 300 milliseconds, and it uses fewer sensor inputs than 4D-Net.

Results on a 3D scene. Top: 3D boxes, corresponding to detected vehicles, are shown in different colors; dotted line boxes are for objects that were missed. Bottom: The boxes are shown in the corresponding camera images for visualization purposes.

Detecting Far-Away Objects
Another benefit of 4D-Net is that it takes advantage of both the high resolution provided by RGB, which can accurately detect objects on the image plane, and the accurate depth that the point cloud data provides. As a result, objects at a greater distance that were previously missed by point cloud-only approaches can be detected by 4D-Net. This is due to the fusion of camera data, which makes it possible to detect distant objects and efficiently propagate this information to the 3D part of the network to produce accurate detections.

Is Data in Time Valuable?
To understand the value of the 4D-Net, we perform a series of ablation studies. We find that substantial improvements in detection accuracy are obtained if at least one of the sensor inputs is streamed in time. Considering both sensor inputs in time provides the largest improvements in performance.

4D-Net performance for 3D object detection measured in average precision (AP) when using point clouds (PC), point clouds in time (PC + T), RGB image inputs (RGB) and RGB images in time (RGB + T). Combining both sensor inputs in time is best (rightmost columns in blue) compared to the left-most columns (green), which use a PC without RGB inputs. All joint methods use our 4D-Net multi-modal learning.

Multi-stream 4D-Net
Since the 4D-Net dynamic connection learning mechanism is general, we are not limited to only combining a point cloud stream with an RGB video stream. In fact, we find that it is very cost-effective to provide a high-resolution single-image stream and a low-resolution video stream in conjunction with 3D point cloud stream inputs. Below, we demonstrate examples of a four-stream architecture, which performs better than the two-stream one with point clouds in time and images in time.

Dynamic connection learning selects specific feature inputs to connect together. With multiple input streams, 4D-Net has to learn connections between multiple target feature representations, which is straightforward as the algorithm does not change and simply selects specific features from the union of inputs. This is an incredibly light-weight process that uses a differentiable architecture search, which can discover new wiring within the model architecture itself and thus effectively find new 4D-Net models.

Example multi-stream 4D-Net, which consists of a stream of 3D point clouds in time (PC+T) and multiple image streams: a high-resolution single-image stream, a medium-resolution single-image stream, and a video stream of even lower-resolution images.
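
To illustrate why adding streams does not change the algorithm, here is a small sketch of selecting connections from the union of all stream features. Everything in it is hypothetical: the stream and layer names, the scores, and the simple "keep the top k" rule stand in for whatever the differentiable search actually learns. The point is only that the same selection routine works for two streams or four.

```python
import math


def softmax(logits):
    """Turn raw connection scores into weights that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def union_of_features(streams):
    """Flatten {stream: {layer: feature}} into a list of (name, feature)."""
    candidates = []
    for stream_name, layers in streams.items():
        for layer_name, feature in layers.items():
            candidates.append((f"{stream_name}/{layer_name}", feature))
    return candidates


def select_connections(candidates, logits, k):
    """Return the names of the k candidates with the largest weights.

    logits is a hypothetical vector of learned scores, one per candidate.
    """
    weights = softmax(logits)
    ranked = sorted(zip(weights, candidates), key=lambda pair: pair[0], reverse=True)
    return [name for _, (name, _) in ranked[:k]]


# Two-stream setup: point clouds in time plus an RGB video stream.
two_streams = {
    "point_cloud": {"early": [1.0], "late": [2.0]},
    "rgb_video": {"early": [3.0], "late": [4.0]},
}
# Four-stream setup: the same code, with two extra single-image streams.
four_streams = dict(two_streams)
four_streams["rgb_high_res"] = {"early": [5.0]}
four_streams["rgb_mid_res"] = {"early": [6.0]}

two = union_of_features(two_streams)
four = union_of_features(four_streams)
picked = select_connections(two, [0.1, 2.0, 1.5, -1.0], k=2)
print(picked)
```

Adding a stream only lengthens the candidate list; the scoring and selection steps stay the same.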

Summary
While deep learning has made tremendous advances in real-life applications, the research community is just beginning to explore learning from multiple sensing modalities. We present 4D-Net, which learns how to combine 3D point clouds in time and RGB camera images in time for the popular application of 3D object detection in autonomous driving. We demonstrate that 4D-Net is an effective approach for detecting objects, especially at distant ranges. We hope this work will provide researchers with a valuable resource for future 4D data research.

Acknowledgements
This work was done by AJ Piergiovanni, Vincent Casser, Michael Ryoo and Anelia Angelova. We thank our collaborators, Vincent Vanhoucke, Dragomir Anguelov and our colleagues at Waymo and Robotics at Google for their support and discussions. We also thank Tom Small for the graphics animation.