Categories
Misc

Aligning Time Series at the Speed of Light

To say it with the words of Eamonn Keogh: “Time series is a ubiquitous and increasingly prevalent type of data […]”. Virtually any incrementally measured signal, be it along a time axis or a linearly ordered set, can be treated as time series. Examples include electrocardiograms, temperature or voltage measurements, audio, server logs, but also … Continued

To say it with the words of Eamonn Keogh: “Time series is a ubiquitous and increasingly prevalent type of data […]”. Virtually any incrementally measured signal, be it along a time axis or a linearly ordered set, can be treated as time series. Examples include electrocardiograms, temperature or voltage measurements, audio, server logs, but also heavy-weight data such as video and time-resolved MRI volumes. Hence, the efficient yet exact processing of the ever-increasing amount of time series data is crucial for every data scientist.

In this blog, we introduce rapidAligner – a CUDA-accelerated library to align a short time series snippet (query) in an exceedingly long stream of time series (subject) using the following three popular lock-step measures for the local alignment of uniformly sampled time series:

  • Rolling Euclidean distance (sdist)
  • Rolling mean-adjusted Euclidean distance (mdist)
  • Rolling mean and amplitude-adjusted Euclidean distance (zdist)

The rapidAligner library is free software that can be integrated with a broad variety of popular data science and machine learning frameworks such as NumPy, CuPy, RAPIDS, Numba, and Pytorch. The source code is publicly available under NVIDIA rapidAligner.

The rest of the article is structured as follows: Section one provides a brief introduction on popular lock-step measures and (local) normalization techniques. Section two demonstrates the usage of the rapidAligner library. Section 3 concludes this blog post.

A brief introduction to time series data mining

Time series are sequences of pairs (t[i], x[i]) where the real-valued time stamps t[i] are linearly ordered and their corresponding values x[i] are quantities measured at time t[i]. If all timestamps are equally spaced, i.e., t[i+1]-t[i] = const for all i, then you can neglect time and call the sequence of measurements x[i] a uniformly sampled time series. In the following, we will simply refer to uniformly sampled time series with real-valued scalars x[i] as time series without fancy attributes.

Assume you want to compare two time series Q=(q[0], q[1], …, q[m-1]) and  S=(s[0], s[1], …, s[m-1]) of same length |Q|=|S|=m. An obvious way would be to interpret Q and S as m-dimensional vectors and compute the Lp norm of their difference.

Popular choices for the parameter p are p=2 for so-called Euclidean distance and p=1 for so-called Manhattan or taxicab distance (see Figure 1). In this blog post, we address similarity measures that compare residues q[i]-s[i] using a one-to-one assignment i->i of indices – so-called lock-step measures. In a future post, we will discuss CUDA-accelerated measures using dynamic assignments of indices such as q[i]-s[j], also known as the class of elastic measures.

: A plot consisting of two graphs depicting two similar and approximately aligned heart beats from an electrocardiogram (ECG) measurement with straight lines indicating the one-to-one correspondence between indices.
Figure 1: Two electrocardiogram (ECG) measurements Q (blue signal) and S (orange signal) both of length |Q|=|S|=421 and their index-wise residues (grey vertical bars) down-sampled by a factor of 4.

However, when aligning a short query Q of length |Q|=m in a long stream S of length |S|=n, i.e., 0

For each alignment position j you have to sum over m contributions. As a result, the asymptotic worst-case complexity to compute all lock-step alignments is proportional to the product of the time series lengths m and n — O((n-m+1) * m) to be precise. This number can be huge even for moderately sized queries and streams which may render large scale time series alignment computationally intractable when performed in a naïve way. In Section 3 we will discuss for the special case p=2 how to implement a CUDA-accelerated scheme which runs in blazingly fast log-linear time.

When looking at larger portion of an ECG stream (see Figure 2) you may observe a temporal drift in the average signal value, also known as baseline wandering. This artifact often occurs in continuously measured time series and may be caused by a broad variety of external factors such as change of skin conductivity due to sweat in ECGs, body movement affecting the electrodes in ECGs, drift of electric resistance and thus voltage due to temperature variation in power supplies, temperature drift when recording environmental quantities, seasonal effects such as Christmas, or the temporal drift of stock prices amidst a global pandemic.

A plot consisting of two graphs depicting a short heartbeat sequence aligned in a much longer stream of continuously measured heartbeats. The locally averaged stream values are drifting over time.
Figure 2: A short ECG query Q (blue signal) aligned in a longer stream S of heartbeats (orange signal) using Euclidean distance as rolling similarity measure. Note the temporal drift in the values of S.

Baseline wandering is problematic when mining a stream for similar shapes – two similar shapes with different offsets on the measurement axis may have a larger distance than two dissimilar ones with similar offsets. A surprisingly simple and effective countermeasure is to introduce a normalization procedure for the query and candidate sequences. As an example, you could compute the mean value of the query muQ and for each of the n-m+1 candidate sequences muS[j] to remove the offset in the corresponding window (see Figure 3). In the following, we will call locally mean-adjusted rolling Euclidean distance mdist:

A plot depicting two signals on with non-vanishing mean on the left and the same signal translated to have vanishing mean on the right to visualize mean adjustment.
Figure 3: A heartbeat with a non-vanishing mean (blue signal on the left) and its mean adjusted variant (orange signal on the right).

A closer look at Figure 1 and Figure 2 further reveals a temporal variation in amplitudes. The range of values in the blue query is significantly smaller than the amplitude of the orange candidate sequence in Figure 1. Temporal drift in the scale might lead to meaningless matches when mining shapes.  A straightforward solution is to normalize the scale by dividing the values by the standard deviation of the query sigmaQ and alignment candidates sigmaS[j], respectively. The proposed mean and amplitude adjustment is called z-normalization referring to z-scores of normal random variables with vanishing mean and unit variance (see Figure 4). The corresponding rolling measure shall be called zdist:

The library rapidAligner supports the CUDA-accelerated computation of the three aforementioned rolling measures sdist, mdist, and zdist in a massively parallel fashion. In the next section, you will see its simple usage from within JupyterLab.

rapidAligner in action

Let’s start crunching numbers. In this section, you will align a single heartbeat in a 22 hour ECG stream using the three discussed measures sdist, mdist, and zdist. The data set is part of the experiments listed on the  website of the award-winning UCR-Suite. After cloning the rapidAligner repository you immediately import the rapidAligner library alongside with CuPY, NumPy, and Matplotlib for later validation and visualization.

In the next step, you load the ECG data. The length of the query is rather short with 421 entries, yet the stream exhibits roughly 20 million time ticks. A first inspection of the full query (blue) and the initial 1000 values of the stream (orange) reveals temporal drift in both offsets and amplitudes.

In the following, you align the query in the stream using the sdist measure computing all 20,140,000-421+1 distance scores a subsequent argmin-reduction to determine the best alignment position. For our experiments we choose a single A100 GPU in a DGX A100 server. Note that if you are only interested in the best match one could further accelerate the already rapid computation by employing a lower-bound cascade as demonstrated here. In contrast, rapidAligner’s sdist call returns alignment scores for all positions to allow for later processing such as computing a non-overlapping partition of ranked matches. We further repeat the alignment a few times for robust runtime measurement. Both compute modes “fft” and “naive” are blazingly fast and return indistinguishable results:

  • “naive”: all n-m+1 alignment candidates are (optionally) normalized and compared individually with a minimal memory footprint but O(n*m) asymptotic computational complexity. This mode is still reasonably fast using warp-aggregated statistics and accumulation schemes.
  • “fft”: If m > log_2(n), we can exploit the Convolution Theorem to accelerate the computation significantly resulting in O(n * log n) runtime but a higher memory footprint. This compute mode is fully independent of the query’s length and thus advisable for large input. The higher memory usage is mainly caused by computationally fast but out-of-place primitives such as CUDA-accelerated Fast Fourier Transforms and Prefix Scans.

Both the query and stream (subject) are stored as plain NumPy arrays in double precision on the CPU. As a result, the measured runtimes include costly memory transfers between the CPU and GPU. rapidAligner allows for seamless interoperability with all CUDA array interface compliant frameworks such as PyTorch, CuPy, Numba, RAPIDS, and Jax. Hence, you can further reduce the runtime when caching the data in fast GPU RAM to avoid unnecessary memory movement between CPU and GPU. Now, it becomes even more obvious that the Fourier based mode outperforms the naïve one even on short queries

That corresponds to a whopping 2.5 billion full alignments per second on a single GPU. In the following, we report runtimes exclusively using input data residing on the GPU. The match produced by sdist is already a good one but let us check if mean-adjustment improves the result:

As expected, mean adjustment returns a better match since now candidates are considered with a non-vanishing relative offset along the measurement axis. The execution times remain low with 10 ms for 20 million alignment positions. That corresponds to 2 billion full alignments per second. You could still improve the amplitude mismatch using zdist:

Et voilà, z-normalized rolling Euclidean distance reveals a doppelgänger in the database which is almost indistinguishable from the query. Performance remains high in more than 1.6 billion alignments per second. A stunning property of the Fourier mode is that the runtime is effectively independent of the query length, i.e., the runtime is constant for fixed stream length and varying query size. This becomes handy when aligning extra-ordinarily long queries.

Conclusion

Finding shapes in long streams of time series data is a computationally demanding task frequently occurring as a standalone routine or embedded as a subroutine in high-level algorithms, e.g., for the detection of anomalies. Hence, massively parallel accelerators with their unprecedented memory bandwidth such as the NVIDIA A100 GPU are ideally suited to address this challenge. rapidAligner is lightweight library processing billions of alignments per second while supporting common normalization modes for the candidate sequences. You can further employ highly optimized FFT routines cuFFT and prefix scans CUB from the CUDA-X software stack to provide an alignment mode that is independent from the query length. The source code and notebooks are publicly available under NVIDIA rapidAligner

Happy time series mining!

Categories
Misc

Training MaskRCNN Models on Custom Dataset with Datature

Training MaskRCNN Models on Custom Dataset with Datature submitted by /u/xusty
[visit reddit] [comments]
Categories
Misc

GFN Thursday: Rolling in the Deep (Silver) with Major ‘Metro Exodus’ and ‘Iron Harvest’ Updates

GFN Thursday reaches a fever pitch this week as we take a deeper look at two major updates coming to GeForce NOW from Deep Silver in the weeks ahead. Catching Even More Rays Metro Exodus was one of the first RTX games added to GeForce NOW. It’s still one of the most-played RTX games on Read article >

The post GFN Thursday: Rolling in the Deep (Silver) with Major ‘Metro Exodus’ and ‘Iron Harvest’ Updates appeared first on The Official NVIDIA Blog.

Categories
Misc

Update Complete: GFN Thursday Brings New Features, Games and More

No Thursday is complete without GFN Thursday, our weekly celebration of the news, updates and great games GeForce NOW members can play — all streaming from the cloud across nearly all of your devices. This week’s exciting updates to the GeForce NOW app and experience Include updated features, faster session loading and a bunch of Read article >

The post Update Complete: GFN Thursday Brings New Features, Games and More appeared first on The Official NVIDIA Blog.

Categories
Misc

Weird phenomenon with the Dataset API

I am developing a training pipeline, for which the tf nodes look like as follows:

index = tf.data.Dataset.from_tensor_slices(self.indices) if self.shuffle: index = index.shuffle(buffer_size=len(self.indices) images = index.map(self.make_image) coordinates = index.map(self.get_coordinates) ground_truth = coordinates.map(self.make_ground_truth) images = images.padded_batch(...) ground_truth = ground_truth.batch(...) return tf.data.Dataset.zip((images, ground_truth)) 

If executing the above code with shuffle == False, everything works fine. If shuffle is set to True, it seems the images and ground truths are somehow shuffled differently.

Is this an intended behaviour? How could this be easily solved?

Edit: I am using TensorFlow 2.0, but 2.1 also produces this behaviour

Edit2: Further investigation revealed further weirdness. So it seems, it is not specific to the Dataset.shuffle() method, its root cause is that there is branching in the chain of transformations. Correct me if I’m wrong, but somehow the branching causes the dataset to re-sample the same index as many times as many branches originate from the given node. If there is no shuffling, the re-sampling works as expected, but shuffling causes different indices to be fed to different branches.

I also switched to Dataset.from_generator() and I shuffle the indices in the generator (in NumPy) and it still produces the same bug.

Do you guys think this is a bug in TF? should I file an issue about this?

Or is my approach completely wrong? How could this situation be handled differently?

submitted by /u/CarpenterAcademic
[visit reddit] [comments]

Categories
Misc

Idea about Feature extraction

I am currently working on a system that extracts certain features out of 3D-objects (Voxelgrids to be precise), and i would like to compare those features to automatically made features when it comes to performance (classification) in a tensorflow cNN with some other data, but that is not the point here, just for background. My idea now was, to take a dataset (modelnet10), train a tensorflow cNN to classify them, and then use what it learned there on my dataset – not to classify, but to extract features.

So i want to throw away everything the cnn does,except for what it takes from the objects.

Is there anyway to get these features? and how do i do that? i certainly have no idea.

any ideas would be greatly appreciated.

submitted by /u/schlorkyy
[visit reddit] [comments]

Categories
Offsites

Model-Based RL for Decentralized Multi-agent Navigation

As robots become more ubiquitous in day-to-day life, the complexity of their interactions with each other and with the environment grows. In a controlled environment, such as a lab, multiple robots can coordinate their actions and efforts through a centralized planner that facilitates communication between individual agents. And while much research has been done to address reliable sensor-informed goal navigation, in many real-world applications aligning goals across independent robotic agents must be done without a centralized planner, which poses non-trivial challenges.

An example of such a challenging decentralized task is the rendezvous task, in which multiple agents must agree upon a time and place at which they can meet, without explicitly communicating with one another. This goal alignment task plays an important role in real world multiagent and human-robot settings, e.g., performing object handovers or determining goals on the fly. Solving the decentralized rendezvous task in this situation depends not just on the obstacles in the environment, but also the policies and dynamics of each agent. Addressing potential miscoordination and dealing with noisy sensor data depends on the agents’ ability to model the motions of other agents as well as their own, and to adapt to diverging intentions while using limited information.

An example of two independently controlled robots separated by obstacles that share the objective of meeting each other. How should they move in order to meet? Example trajectories are illustrated in red and blue arrows for each robot. Each robot makes an independent decision of where to go based on their own observations.

In “Model-based Reinforcement Learning for Decentralized Multiagent Rendezvous”, presented at CoRL 2020, we propose an holistic approach to address the challenges of the decentralized rendezvous task, which we call hierarchical predictive planning (HPP). This is a decentralized, model-based reinforcement learning (RL) system that enables agents to align their goals on the fly in the real world. We evaluate HPP in a mixture of real-world and simulated environments and compare it to several learning-based planning and centralized baselines. In those evaluations, we show that HPP is able to more effectively predict and align trajectories, avoid miscoordinations, and directly transfer to the real world without additional fine-tuning.

Putting Together Prediction, Planning and Control
Akin to a standard navigation pipeline, our learning-based system consists of three modules: prediction, planning, and control. Each agent employs the prediction model to learn agent motion and to predict the future positions of itself (the ego-agent) and others based on its own observations (e.g., from LiDAR and team position information) of other agents’ behaviors and navigation patterns. So, each agent learns two prediction models, one for its own motion and one for the other agent. These motion predictors constitute the prediction module, and are used by each agent’s planning module.

The output of the prediction module — the estimate of where each agent, both the ego-agent and the other agents, is most likely to be given the ego-agent’s own sensor observations — is useful information for the planning module, which evaluates different goal locations and maintains a belief distribution over where the team should converge. The belief distribution is periodically updated using evaluations provided by the prediction model. An agent samples from this belief distribution to update the goal to which it should navigate.

The selected goal is passed to the agent’s control module, which is equipped with a pre-trained, imperfect navigation policy that can navigate to a given location in the obstacle-laden environment. The control policy then determines what action the robot should execute.

This process of observing other agents, updating belief distributions and navigating to an updated goal repeats until agents have successfully rendezvoused. While the hierarchical planning and control setup are not unusual, our work closes the loop between the control and planning for decentralized multiagent systems by use of the sensor-informed prediction module.

Training the Prediction Models
HPP trains motion predictors in simulation, assuming that each agent is controlled by a hidden, perhaps suboptimal, control policy capable of avoiding obstacles. The key difficulty lies in training prediction models without access to other agents’ sensor observations and control policies.

The predictors are trained via self-supervision. To collect the training data, we randomly place all the agents and obstacles in an environment, and each agent is given a random goal (unknown to other agents). As the agents move toward their respective goals, each agent records the experience — its sensor observations and the poses of all agents (itself and other agents). Next, from the recorded experience, the agent learns a separate predictor for each agent in the team including itself (target agent). The training dataset consists of ego-agent initial sensor observations, target agent’s pose and goal, labeled with future ego-observations and target agent poses. The goal and labels are inferred from the recorded experience.

As a result, the predictors learn temporal causality of the present and future ego-agent’s observations and target agent’s poses, conditioned on the target agent’s assumed goals — in other words the models predict where each agent will be in the future based on the present. The predictor training is done only with the information available to agents at the runtime, and in environments independent from the deployment environments.

The training environment for the model prediction models. The environment is filled with randomly filled obstacles. All agents (left in blue, upper right in red) are given the same random goal (center in green) and move with their own control modules towards it.

Selecting Goals for Alignment
A model-based RL planner for each agent uses the learned predictors in the deployment environments to guide the agents towards the rendezvous point. The planner takes into account what it believes the other agents would do when also completing the rendezvous task.

HPP illustration. Each robot independently considers several potential rendezvous points, and evaluates each point based how close it believes that the agents can get.

To perform this reasoning, each agent independently samples a series of potential goals and selects the goal that it believes it would be the most likely to succeed. This process effectively simulates a centralized planner for fictitious agents by using the prediction models to predict trajectories of those agents moving to a fixed goal. Conditioned on a proposed goal, the algorithm predicts the poses of the agents in the future, which are generated from sequential roll outs of the prediction models. Each goal is then evaluated by scoring the anticipated system state using the task reward favoring goals that bring agents closer together. We use the cross-entropy method (CEM) to convert these goal evaluations into belief updates over potential rendezvous points. Finally, the agent’s planner selects a goal for itself from this new belief distribution and passes this goal to the agent’s control module.

A simple illustration of the goal evaluation. At the end of a simulated trajectory, the agents (red, left, and blue, right) are either far (top) or close (bottom) to each other. The goal in the bottom image is better than the goal on top because agents end up closer to each other.

Results
We compare HPP against several baselines — MADDPG (learning-based), RRT (planning) with CEM, and centralized baselines that use heuristics for selecting the agent’s rendezvous point — in a mixture of real-world and simulated environments.

Evaluation environments, each of which are independent of the training environment for the agent’s control policy and prediction modules.

There are two main takeaways from our results. One is that HPP enables agents to predict and align trajectories, avoiding miscoordinations. For example:

The second takeaway is that HPP transfers directly into the real world without additional training. For example:

Conclusion
This work presents HPP, a model-based RL approach for decentralized multiagent coordination. Agents first learn to predict where they and their teammates are going to be from their own sensors and decide and navigate to a common goal. Our experiments demonstrate the method generalizes to new environments and handles miscoordination while making no assumptions about the dynamics of other agents. This may be of interest to the larger multiagent research community as a real-world example of a decentralized task using noisy sensors and imperfect controllers, to the motion planning community as an example of a learning-based planning system that closes the loop between the planner and controller, and to the RL community as an example of model-based RL as feedback in a hierarchical, self-supervised prediction setting.

Acknowledgements
This research was done by Rose E. Wang, J. Chase Kew, Dennis Lee, Tsang-Wei Edward Lee, Tingnan Zhang, Brian Ichter, Jie Tan, Aleksandra Faust with special thanks to Michael Everett, Oscar Ramirez and Igor Mordatch for the insightful discussions.

Categories
Misc

Perceiving with Confidence: How AI Improves Radar Perception for Autonomous Vehicles

Autonomous vehicles don’t just need to detect the moving traffic that surrounds them — they must also be able to tell what isn’t in motion.

The post Perceiving with Confidence: How AI Improves Radar Perception for Autonomous Vehicles appeared first on The Official NVIDIA Blog.

Categories
Misc

cuTENSOR v1.3.0 Now Available: Up to 2x Performance

Today, NVIDIA is announcing the availability of cuTENSOR version 1.3.0. This software can be downloaded now free for members of the NVIDIA Developer Program.

Today, NVIDIA is announcing the availability of cuTENSOR version 1.3.0. This software can be downloaded now free for members of the NVIDIA Developer Program.

Download Now

What’s New

  • Support for up to 40-dimensional tensors
  • Support 64-bit strides
  • Support for BFloat16 Element-wise operations
  • Improved performance for direct Tensor Contractions
  • Bug fixes

See the cuTENSOR Release Notes for more information.

About cuTENSOR

cuTENSOR is a high-performance CUDA library for tensor primitives; its key features are:

Learn more:

Recent Developer Blog posts:

Categories
Misc

HPL-AI Now Runs 2x Faster on NVIDIA DGX A100

NVIDIA announced its latest update to the HPL-AI Benchmark version 2.0.0, which will reside in the HPC-Benchmarks container version 21.4.

NVIDIA announced its latest update to the HPL-AI Benchmark version 2.0.0, which will reside in the HPC-Benchmarks container version 21.4. The HPL-AI (High Performance Linpack – Artificial Intelligence) benchmark helps evaluate the convergence of HPC and data-driven AI workloads.

Historically, HPC workloads are benchmarked at double-precision, representing the accuracy requirements in computational astrophysics, computational fluid dynamics, nuclear engineering, and quantum computing. AI workloads on the other hand can deliver acceptable results using much lower precision for training and inference. This distinctive characteristic led to the development of technology such as the Tensor Cores. Tensor Cores provide substantial speedups in mixed-precision workloads by accelerating precisions such as TF32, BF16, FP16, INT8, and INT4.

Many vendors provide specially tuned versions of HPL for their hardware. NVIDIA is releasing the NVIDIA HPC-Benchmarks container on NGC. This container includes three different versions of the HPL benchmark for double precision arithmetic; HPL-AI, applied to mixed precision workloads, and HPCG, which performs a fixed number of multigrid preconditioned conjugate gradient (PCG) iterations at double precision.

With this latest release, the HPL-AI benchmark deliveries double the performance over the initial container released in Fall 2020. This is largely due to major improvements to load balancing at the communication layers. Multi-node communications are usually handled by the CPU, which can lead to performance bottlenecks if GPU become idle. MPI-aware communication between GPUs allows the CPU to be bypassed and keep the GPU busy. When possible data transfers are sent at a lower precision reducing the time GPUs are waiting for data. After minimizing communication, GPUs can process larger datasets to provide maximum compute efficiency. Lastly, the latest version of the NVIDIA Math Libraries are used to deliver optimal performance on an A100.

The plot below shows that with 128 DGX A100s, or 1024 NVIDIA A100 GPUs, the latest release of HPL-AI performs 2.6 times faster.

To checkout these improvements, download the latest NVIDIA HPC-Benchmark container, version 21.4, from NGC. The HPC-Benchmark landing page includes detailed instructions and additional resources. If you have any questions or issues, please send email to HPCBenchmarks@nvidia.com.