Categories
Misc

Speeding up Numerical Computing in C++ with a Python-like Syntax in NVIDIA MatX

MatX is an experimental library that allows you to write high-performance GPU code in C++, with high-level syntax and a common data type across all functions.

Rob Smallshire once said, “You can write faster code in C++, but write code faster in Python.” Since its release more than a decade ago, CUDA has given C and C++ programmers the ability to maximize the performance of their code on NVIDIA GPUs.

More recently, libraries such as CuPy and PyTorch have allowed developers of interpreted languages to leverage the speed of the optimized CUDA libraries from within those languages. These interpreted languages have many excellent properties, including easy-to-read syntax, automatic memory management, and common types across all functions.

However, sometimes having these features means paying a performance penalty due to memory management and other factors outside your control. The decrease in performance is often worth it to save development time. Still, it may ultimately require rewriting portions of the application later when performance becomes an issue.

What if you could achieve maximum performance using C++ while still reaping all the benefits of the interpreted languages?

MatX overview

MatX is an experimental, GPU-accelerated, numerical computing C++ library aimed at bridging the gap for users who want the highest performance possible together with an easy, consistent syntax and common types across all CUDA libraries. Using the C++17 support added in CUDA 11.0, MatX allows you to write the same natural algebraic expressions that you would in a high-level language like Python, without the performance penalty that may come with it.

Tensor types

MatX includes interfaces to many of the popular math libraries, such as cuBLAS, CUTLASS, cuFFT, and CUB, but uses a common data type (tensor_t) across all of them. This greatly simplifies the API to these libraries: MatX deduces what it needs from the tensor type and calls the correct library APIs based on that information.

The following code examples show an FFT-based resampler.

Python

N = min(ns, ns_resamp)
nyq = N // 2 + 1

# Create an empty vector
sv = np.empty(ns)

# Real to complex FFT
svc = np.fft.rfft(sv)

# Slice
sv = svc[0:nyq]

# Complex to real IFFT
rsv = np.fft.irfft(sv, ns_resamp)

MatX

uint32_t N = std::min(ns, ns_resamp);
uint32_t nyq = N / 2 + 1;

auto sv  = make_tensor<float>({ns});
auto svc = make_tensor<cuda::std::complex<float>>({ns / 2 + 1});
auto rsv = make_tensor<float>({ns_resamp});

// Real to complex FFT
fft(svc, sv, stream);

// Slice the vector
auto svs = svc.Slice({0}, {nyq});

// Complex to real IFFT
ifft(rsv, svs, stream);

While the code length and readability are similar, the MatX version on an A100 runs about 2100x faster than the NumPy version running on CPU. The MatX version also has many hidden benefits over directly using the CUDA libraries, such as type checking, input and output size checking, and slicing a tensor without pointer manipulation.

The tensor types are not limited to FFTs, though, and the same variables can be used inside of other libraries and expressions. For example, if you wanted to perform a GEMM using CUTLASS on the resampler output, you could write the following:

matmul(resampOut, resampView, B, stream);

In this code, resampOut and B are appropriately sized tensors for the GEMM operation. As in the preceding FFT sample, types, sizes, batches, and strides are all inferred from the tensor metadata. Using a strongly typed C++ API also means that many runtime and compile-time errors can be caught without additional debugging.
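
For context, the sketch below fills in plausible surrounding code for that call. The sizes m, k, and n, the float element type, and treating the operands as freshly created 2D tensors (rather than views of the earlier resampler output) are all illustrative assumptions, not part of the original example.

// Hypothetical sizes for an (m x k) * (k x n) = (m x n) GEMM
constexpr int m = 128, k = 256, n = 64;

auto resampView = make_tensor<float>({m, k});
auto B          = make_tensor<float>({k, n});
auto resampOut  = make_tensor<float>({m, n});

// Shapes, strides, and element types come from the tensor metadata,
// so a size mismatch is reported rather than silently corrupting memory.
matmul(resampOut, resampView, B, stream);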

In addition to supporting the optimized CUDA libraries as backends, these same tensor types can be used in algebraic expressions to perform element-wise operations:

(C = A * B + (D / 5.0) + cos(E)).run(stream);

Lazy evaluation

MatX uses lazy evaluation to create a GPU kernel at compile time representing the expression in parentheses. Only when the run function is called on the expression does the operation execute on the GPU. Over 40 different types of operators are supported, and they can be mixed and matched across tensors of different sizes and types as long as their parameters are compatible. Written as a CUDA kernel, the earlier expression would look something like this:

__global__ void Expression( float *C, 
                            const float *A,
                            const float *B,
                            const float *D,
                            const float *E,
                            int length)
{
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; 
         idx < length; 
         idx += blockDim.x * gridDim.x) {
        C[idx] = A[idx] * B[idx] + (D[idx] / 5.0f) + cosf(E[idx]);
    }
}

While the earlier code is not complicated, it’s hiding several problems:

  • The data types are hard-coded to floats. To change to another type, you must edit the kernel signature. Astute readers would say to use templates and let the compiler deduce types for you. While that may work for some types, it won’t work for all types you may want to use. For example, cosf is not defined for half precision types, so you must use compile-time conditionals to handle different types.
  • Any small change to the function signature needs a completely different function. For example, what if you wanted to add a tensor F in some cases, but still retain this original signature? That would be two functions to maintain that are nearly identical.
  • While a grid-stride loop is good practice and is used to handle different sizes of blocks and grids, you must still have code ensuring that during kernel launch there are enough threads to keep the GPU busy.
  • All inputs are assumed to be 1D vectors; higher dimensions could break with non-unity strides.

There are numerous other deficiencies not listed, including the inability to broadcast different-sized tensors, no checking on sizes, requiring contiguous memory layouts, and more.

Obviously, this code only works under specific conditions. The MatX version solves all of these issues and more, while typically maintaining the same performance as writing the kernel directly.
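
To make this concrete, here is a hedged sketch of what the same expression looks like in MatX when the element type is switched to half precision: only the type argument to make_tensor changes, and the expression itself is untouched. The type name matxFp16, the one-dimensional shape, and the assumption that MatX promotes the plain scalar 5.0 for half-precision operands are illustrative, so check the MatX documentation for the exact details.

// Same element-wise expression as above, now on half-precision tensors.
// matxFp16 is assumed to be MatX's FP16 element type; N is an arbitrary length.
auto A = make_tensor<matxFp16>({N});
auto B = make_tensor<matxFp16>({N});
auto C = make_tensor<matxFp16>({N});
auto D = make_tensor<matxFp16>({N});
auto E = make_tensor<matxFp16>({N});

(C = A * B + (D / 5.0) + cos(E)).run(stream);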

Additional MatX features

Other key features of MatX include the following:

  • Zero-copy tensor views created by slicing, cloning, and permuting existing tensors (see the sketch after this list).
  • Support for arbitrary-dimension tensors.
  • Generators for producing data on the fly without storing it in memory; common examples are a linearly spaced vector, a Hamming window, and a diagonal matrix.
  • Support for almost every type used in CUDA, including half precision (both FP16 and BF16) and complex numbers (both full and half precision).
  • Linear solver functions through cuSolver, sorting and scanning using CUB, random number generation using cuRAND, reductions, and more.
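
The sketch below illustrates the first of those points by permuting and slicing a small 2D tensor and then using the resulting zero-copy view in an expression. It assumes the 2D forms of Permute and Slice follow the same pattern as the 1D Slice call shown earlier; the sizes and exact signatures are illustrative, so treat this as a sketch rather than reference usage.

// A small 2D tensor; the sizes are arbitrary and chosen for illustration
auto A = make_tensor<float>({8, 16});
auto B = make_tensor<float>({4, 8});

// Zero-copy views: no data is moved, only the view metadata changes
auto At   = A.Permute({1, 0});         // transposed 16x8 view of A
auto Atop = At.Slice({0, 0}, {4, 8});  // upper-left 4x8 block of that view

// The view behaves like any other tensor inside an expression
(B = Atop * 2.0f).run(stream);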

Summary

MatX is open-sourced under the BSDv3 license.

Categories
Misc

From Earth Sciences to Factory Production: GPU Hackathon Optimizes Modeling Results

Group image of participants of the digital TWCC GPU Hackathon.

The recent Taiwan Computing Cloud GPU Hackathon helped 12 teams advance their HPC and AI projects, using innovative technologies to address pressing global challenges.

While the world is continuously changing, one constant is the ongoing drive of developers to tackle challenges using innovative technologies. The recent Taiwan Computing Cloud (TWCC) GPU Hackathon exemplified such a drive, serving as a catalyst for developers and engineers to advance their HPC and AI projects using GPUs. 

A collaboration between the National Center for High-Performance Computing, Taiwan Web Service Corporation, NVIDIA, and OpenACC, the event paired 12 teams with 15 NVIDIA mentors to accelerate projects ranging from an AI-driven manufacturing scheduling model to a rapid flood prediction model. 

Tapping AI to optimize production efficiency 

One of the key areas of smart manufacturing is optimizing and automating production line processes. Team AI Scheduler, with members from the Computational Intelligence Technology Center (CITC) of the Industrial Technology Research Institute (ITRI), came to the hackathon to work on their manufacturing scheduling model using machine learning.  

Traditional scheduling models mostly employ heuristic rules, which can respond to dynamic events instantly. However, this short-term approach often does not lead to the optimal solution and proves inflexible when dealing with changing variables, which limits its ongoing viability. 

The team’s approach uses a Monte Carlo Tree Search (MCTS) method, combining classic tree search implementations with the reinforcement learning principles of machine learning. This method addresses the limitations of existing heuristics and improves the efficiency of the overall scheduling model. 

Working with their mentor, Team AI Scheduler learned to use NVIDIA Nsight Systems to identify bottlenecks and use GPUs to parallelize their code. At the conclusion of the event, the team was able to accelerate the simulation step of their MCTS algorithm. This reduced the scheduling time from 6 hours to 30 minutes and achieved a speedup of 11.3x in overall scheduling efficiency.  

“Having proved the feasibility of using GPUs to accelerate our model at this hackathon, the next step is to adopt it into our commercial models for industry use,” said Dr. Tsan-Cheng Su and Hao-Che Huang of CITC, ITRI. 

Using GPUs to see the big picture in Earth sciences 

Located between the Eurasian Plate and the Philippine Sea Plate, Taiwan is one of the most tectonically active places in the world and an important base for global seismological research. In geological research, the time scale of tectonic activity is often measured in units of thousands, or tens of thousands, of years. This requires massive amounts of data and adequate compute power to analyze efficiently. 

Figure 1. Team IES-Geodynamics, led by Dr. Tan (center).

The IES-Geodynamics team, led by Dr. Tan from the Institute of Earth Sciences, Academia Sinica, came to the GPU Hackathon to accelerate their numerical geodynamic model, DynEarthSol, which simulates mantle convection, subduction, mountain building, and tectonics. Previously, the team handled large volumes of data by reducing the number of calculations and steps, chunking the data into pieces and restricting the computing processes to fit the limited computing power of the CPU. This made it very difficult to see the full picture of the research. 

Over the course of the hackathon, the team adopted a new data input method that offloaded the data and the multiple computation steps to the GPU. Using OpenACC, Team IES-Geodynamics was able to port 80% of their model to GPUs and achieved a 13.6x speedup. 

“This is my second time attending a GPU Hackathon and I will definitely attend the next one,” said Professor Eh Tan, Research Fellow from IES, Academia Sinica. “We have learned the appropriate way to adopt GPUs and the user-friendly profiling tool gives us a great idea for how to optimize our model.” 

The team will continue to work towards porting the remaining 20% of their model. They look forward to running more high-resolution models using GPUs to gain a deeper understanding of formation activities in Taiwan. 

Rapid flood assessment for emergency planning and response 

Flooding is among the most devastating natural disasters. Causing massive casualties and economic losses, floods affect an average of 21 million people worldwide each year with numbers expected to rise due to climate change and other factors. Preventing and mitigating these hazards is a critical endeavor. 

THINKLAB, a team from National Yang Ming Chiao Tung University (NYCU), is working on the development of a model that can provide fast and accurate results for emergency purposes while maintaining simplicity in operation. The proposed hybrid inundation model (HIM) solves the zero-inertia equation through a Cellular Automata approach and works with subgrid-scale interpolation strategies to generate higher-resolution results.

Figure 2. Example of flood extents simulated by the hybrid inundation model (HIM).

Developed using Python and NumPy, the HIM ran without parallel or GPU computation at the onset of the hackathon. During the event, Team THINKLAB used CuPy to parallelize their code to run on GPUs, then focused on applying user-defined CUDA kernels to the parameters. The result was a 672x speedup, bringing the computation time from 2 weeks down to approximately 30 minutes. 

“We learned so many techniques during this event and highly recommend these events to others,” said Obaja Wijaya, team member of THINKLAB. “NVIDIA is the expert in this field and by working with their mentors we have learned how to optimize models/codes using GPU programming.” 

Additional hackathons and boot camps are scheduled throughout 2022. For more information on GPU Hackathons and future events, visit https://www.gpuhackathons.org

Categories
Misc

Is it possible to embed a TensorFlow Lite model into an 8-bit microcontroller?

Hey guys, need your help, please 🙂

submitted by /u/markwatsn

Categories
Misc

How many times does TensorFlow Lite usually compress a model built with TensorFlow?

What do you think about it?

submitted by /u/markwatsn

Categories
Misc

Colab script for object detection with tensorflow and keras – ValueError: Unexpected result of `train_function` (Empty logs)

Hello to everyone,

I am trying to adapt the script from this link keras example to my custom dataset but I run into the following issue:

ValueError: Unexpected result of train_function (Empty logs). Please use Model.compile(…, run_eagerly=True), or tf.config.run_functions_eagerly(True) for more information of where went wrong, or file a issue/bug to tf.keras.

My dataset is as follows (I flattened it in order to get past an error when converting the dict to TensorFlow):

<TensorSliceDataset element_spec={'image/filename': TensorSpec(shape=(), dtype=tf.string, name=None), 'image/id': TensorSpec(shape=(), dtype=tf.int32, name=None), 'is_crowd': TensorSpec(shape=(), dtype=tf.bool, name=None), 'area': TensorSpec(shape=(), dtype=tf.float32, name=None), 'bbox': TensorSpec(shape=(1, 4), dtype=tf.float32, name=None), 'id': TensorSpec(shape=(), dtype=tf.int32, name=None), 'image': TensorSpec(shape=(480, 640, 3), dtype=tf.float32, name=None), 'label': TensorSpec(shape=(), dtype=tf.int32, name=None)}> 

while the example dataset is

<PrefetchDataset element_spec={'image': TensorSpec(shape=(None, None, 3), dtype=tf.uint8, name=None), 'image/filename': TensorSpec(shape=(), dtype=tf.string, name=None), 'image/id': TensorSpec(shape=(), dtype=tf.int64, name=None), 'objects': {'area': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'bbox': TensorSpec(shape=(None, 4), dtype=tf.float32, name=None), 'id': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'is_crowd': TensorSpec(shape=(None,), dtype=tf.bool, name=None), 'label': TensorSpec(shape=(None,), dtype=tf.int64, name=None)}}> 

My script is publicly available here. If anyone can help with what I am doing wrong (i.e. input images, tensors, model building), I would be so grateful!!

submitted by /u/agristats

Categories
Misc

Any advice on how to deploy a deep-learning model on mobile devices?

We currently have an app built on Xamarin and C#. My aim is to provide an analytics platform (which I’ve built in TF), but what would be the best way to deploy it? I’ve done some reading of the docs, but I’d love to hear your experiences / thoughts.

submitted by /u/PrijNaidu

Categories
Misc

Can you add YOLO to the top of a pretrained model?

I have an InceptionResNetV2 model that is trained for identification of insects. I was wondering if I could change the base identification part of YOLO to use my model? My understanding is that YOLO trains identification based on Darknet, VGG, or other small networks and then moves to a partitioning method for the object detection, so based on my limited knowledge I’m guessing it should theoretically be possible to replace these small base models, but I am not sure if it is this simple or whether my neural network architecture would work. I couldn’t find much information about this online.

submitted by /u/188_888

Categories
Misc

Machine Learning in Scraping with Rails

submitted by /u/Kagermanov
Categories
Offsites

4D-Net: Learning Multi-Modal Alignment for 3D and Image Inputs in Time

While not immediately obvious, all of us experience the world in four dimensions (4D). For example, when walking or driving down the street we observe a stream of visual inputs, snapshots of the 3D world, which, when taken together in time, creates a 4D visual input. Today’s autonomous vehicles and robots are able to capture much of this information through various onboard sensing mechanisms, such as LiDAR and cameras.

LiDAR is a ubiquitous sensor that uses light pulses to reliably measure the 3D coordinates of objects in a scene. However, it is also sparse and has a limited range — the farther one is from a sensor, the fewer points will be returned. This means that far-away objects might only get a handful of points, or none at all, and might not be seen by LiDAR alone. At the same time, images from the onboard camera, which is a dense input, are incredibly useful for semantic understanding, such as detecting and segmenting objects. With high resolution, cameras can be very effective at detecting objects far away, but are less accurate in measuring the distance.

Autonomous vehicles collect data from both LiDAR and onboard camera sensors. Each sensor measurement is recorded at regular time intervals, providing an accurate representation of the 4D world. However, very few research algorithms use both of these in combination, especially when taken “in time”, i.e., as a temporally ordered sequence of data, mostly due to two major challenges. When using both sensing modalities simultaneously, 1) it is difficult to maintain computational efficiency, and 2) pairing the information from one sensor to another adds further complexity since there is not always a direct correspondence between LiDAR points and onboard camera RGB image inputs.

In “4D-Net for Learned Multi-Modal Alignment”, published at ICCV 2021, we present a neural network that can process 4D data, which we call 4D-Net. This is the first attempt to effectively combine both types of sensors, 3D LiDAR point clouds and onboard camera RGB images, when both are in time. We also introduce a dynamic connection learning method, which incorporates 4D information from a scene by performing connection learning across both feature representations. Finally, we demonstrate that 4D-Net is better able to use motion cues and dense image information to detect distant objects while maintaining computational efficiency.

4D-Net
In our scenario, we use 4D inputs (3D point clouds and onboard camera image data in time) to solve a very popular visual understanding task, the 3D box detection of objects. We study the question of how one can combine the two sensing modalities, which come from different domains and have features that do not necessarily match — i.e., sparse LiDAR inputs span the 3D space and dense camera images only produce 2D projections of a scene. The exact correspondence between their respective features is unknown, so we seek to learn the connections between these two sensor inputs and their feature representations. We consider neural network representations where each of the feature layers can be combined with other potential layers from other sensor inputs, as shown below.

4D-Net effectively combines 3D LiDAR point clouds in time with RGB images, also streamed in time as video, learning the connections between different sensors and their feature representations.

Dynamic Connection Learning Across Sensing Modalities
We use a light-weight neural architecture search to learn the connections between both types of sensor inputs and their feature representations, to obtain the most accurate 3D box detection. In the autonomous driving domain it is especially important to reliably detect objects at highly variable distances, with modern LiDAR sensors reaching several hundreds of meters in range. This implies that more distant objects will appear smaller in the images and the most valuable features for detecting them will be in earlier layers of the network, which better capture fine-scale features, as opposed to close-by objects represented by later layers. Based on this observation, we modify the connections to be dynamic and select among features from all layers using self-attention mechanisms. We apply a learnable linear layer, which is able to apply attention-weighting to all other layer weights and learn the best combination for the task at hand.

Connection learning approach schematic, where connections between features from the 3D point cloud inputs are combined with the features from the RGB camera video inputs. Each connection learns the weighting for the corresponding inputs.

Results
We evaluate our results against state-of-the-art approaches on the Waymo Open Dataset benchmark, for which previous models have only leveraged 3D point clouds in time or a combination of a single point cloud and camera image data. 4D-Net uses both sensor inputs efficiently, processing 32 point clouds in time and 16 RGB frames within 164 milliseconds, and performs well compared to other methods. In comparison, the next best approach is less efficient and accurate because its neural net computation takes 300 milliseconds, and uses fewer sensor inputs than 4D-Net.

Results on a 3D scene. Top: 3D boxes, corresponding to detected vehicles, are shown in different colors; dotted line boxes are for objects that were missed. Bottom: The boxes are shown in the corresponding camera images for visualization purposes.

Detecting Far-Away Objects
Another benefit of 4D-Net is that it takes advantage of both the high resolution provided by RGB, which can accurately detect objects on the image plane, and the accurate depth that the point cloud data provides. As a result, objects at a greater distance that were previously missed by point cloud-only approaches can be detected by a 4D-Net. This is due to the fusion of camera data, which is able to detect distant objects, and efficiently propagate this information to the 3D part of the network to produce accurate detections.

Is Data in Time Valuable?
To understand the value of the 4D-Net, we perform a series of ablation studies. We find that substantial improvements in detection accuracy are obtained if at least one of the sensor inputs is streamed in time. Considering both sensor inputs in time provides the largest improvements in performance.

4D-Net performance for 3D object detection measured in average precision (AP) when using point clouds (PC), Point Clouds in Time (PC + T), RGB image inputs (RGB) and RGB images in Time (RGB + T). Combining both sensor inputs in time is best (rightmost columns in blue) compared to the left-most columns (green) which use a PC without RGB inputs. All joint methods use our 4D-Net multi-modal learning.

Multi-stream 4D-Net
Since the 4D-Net dynamic connection learning mechanism is general, we are not limited to only combining a point cloud stream with an RGB video stream. In fact, we find that it is very cost-effective to provide a large resolution single-image stream, and a low-resolution video stream in conjunction with 3D point cloud stream inputs. Below, we demonstrate examples of a four-stream architecture, which performs better than the two-stream one with point clouds in time and images in time.

Dynamic connection learning selects specific feature inputs to connect together. With multiple input streams, 4D-Net has to learn connections between multiple target feature representations, which is straightforward as the algorithm does not change and simply selects specific features from the union of inputs. This is an incredibly light-weight process that uses a differentiable architecture search, which can discover new wiring within the model architecture itself and thus effectively find new 4D-Net models.

Example multi-stream 4D-Net which consists of a stream of 3D point clouds in time (PC+T), and multiple image streams: a high-resolution single image stream, a medium-resolution single image stream and a video stream (of even lower resolution) images.

Summary
While deep learning has made tremendous advances in real-life applications, the research community is just beginning to explore learning from multiple sensing modalities. We present 4D-Net which learns how to combine 3D point clouds in time and RGB camera images in time, for the popular application of 3D object detection in autonomous driving. We demonstrate that 4D-Net is an effective approach for detecting objects, especially at distant ranges. We hope this work will provide researchers with a valuable resource for future 4D data research.

Acknowledgements
This work is done by AJ Piergiovanni, Vincent Casser, Michael Ryoo and Anelia Angelova. We thank our collaborators, Vincent Vanhoucke, Dragomir Anguelov and our colleagues at Waymo and Robotics at Google for their support and discussions. We also thank Tom Small for the graphics animation.

Categories
Misc

Painting classification ESP32

Hi there! Hope you’re doing good.

First of all, I’m an Embedded Systems engineer trying to use TensorFlow Lite Micro to classify paintings, or pictures in general. I started out with Edge Impulse because, well, it’s an easy way to get started. I managed to get the public car detection project to work on an ESP32-cam in not too much time.

But then: classifying paintings… I don’t know why, but it appears there is one dominant class which gets classified more easily than the others. That’s the case with three portraits and one unknown class. The confusion matrix, the test dataset, and tests with my mobile phone (with which I made the dataset) all look good and work well. When I quantize the model and deploy it to an ESP32, everything appears to work, but one class specifically overshadows another one.

In short: training a MobileNetV1 96×96 RGB network results in a great confusion matrix identifying three portraits. When deployed to an ESP32, one specific class appears not to work. The other three (unknown and two portraits) seem to work correctly.

What could be wrong here? Oh and btw, if anyone knows good resources for an embedded systems engineer to get to know ML better, that’s more than welcome.

submitted by /u/JVKran