At the latest UEFA Champions League Finals, one of the world’s most anticipated annual soccer events, pop stars Marshmello, Khalid and Selena Gomez shared the stage for a dazzling opening ceremony at Portugal’s third-largest football stadium — without ever stepping foot in it. The stunning video performance took place in a digital twin of the Read article >
Learn how the Time Series Prediction Platform provides an end-to-end framework that enables users to train, tune, and deploy time series models.
In this post, we detail the recently released NVIDIA Time Series Prediction Platform (TSPP), a tool designed to compare easily and experiment with arbitrary combinations of forecasting models, time-series datasets, and other configurations. The TSPP also provides functionality to explore the hyperparameter search space, run accelerated model training using distributed training and Automatic Mixed Precision (AMP), and deploy and run inference on accelerated model formats on the NVIDIA Triton Inference Server.
Accurately forecasting future time series values using previous values has proven pivotal in understanding and managing complex systems, including but not limited to power grids, supply chains, and financial markets. In these forecasting applications, single-digit percentage improvements in predictive accuracy can have vast financial, ecological, and social impacts. In addition to needing to be accurate, forecasting models also must be able to function on real-time timescale.
Figure 1: A depiction of the typical sliding-window time series forecasting problem. Each sliding window consists of time-sequential data that is split into two parts, the past, and the future.
The sliding window forecasting problem, shown preceding in Figure 1, involves using prior data and knowledge of future values to predict future target values. Traditional statistical methods, such as ARIMA and its variants, or Holt-Winters Regression have long been used to perform regression for these tasks. However, as data volume has increased and the problems to solve with regression have become increasingly complex, deep learning approaches have proven their ability to represent effectively and understand these problems.
Despite the advent of deep learning forecasting models, there historically has not been a way to effectively experiment with and compare the performance and accuracy of time series models across an arbitrary set of datasets. To this end, we’re delighted to publicly open-source the NVIDIA Time Series Prediction Platform.
What is the TSPP?
The Time Series Prediction Platform is an end-to-end framework that enables users to train, tune, and deploy time series models. Its hierarchical configuration system and rich feature specification API allow for new models, datasets, optimizers, and metrics to be easily integrated and experimented with. The TSPP is designed for use with vanilla PyTorch models and is agnostic to the cloud or local platforms.
Figure 2: The basic architecture of the NVIDIA Time Series Prediction Platform. The CLI feeds the input to the TSPP launcher, which instantiates the objects required for training (model, dataset, etc.) and runs the specified experiment to generate performance and accuracy results.
The TSPP, pictured in Figure 2, is centered around a command line-controlled launcher. Based on user input to the CLI, the launcher either instantiates a hyperparameter manager, which can run a set of training experiments in parallel or will run a single experiment by creating the described components, such as model, dataset, metric, etc.
Models supported
The TSPP supports the NVIDIA Optimized Temporal Fusion Transformer (TFT) by default. Within the TSPP, TFT training can be accelerated using multi-GPU training, automatic mixed precision, and exponential moving weight averaging. The model can be deployed using the aforementioned inference and deployment pipeline.
The TFT model is a hybrid architecture joining LSTM encoding and interpretable transformer attention layers. Prediction is based on three types of variables: static (constant for a given time series), known (known in advance for whole history and future), observed (known only for historical data). All of these variables come in two flavors: categorical, and continuous. In addition to historical data, we feed the model with historical values of the time series itself.
All variables are embedded in high-dimensional space by learning an embedding vector. Categorical variables embeddings are learned in the classical sense of embedding discrete values. The model learns a single vector for each continuous variable, which is then scaled by this variable’s value for further processing. The next step is to filter variables through the Variable Selection Network (VSN), which assigns weights to the inputs in accordance with their relevance to the prediction. Static variables are used as a context for variable selection of other variables and as an initial state of LSTM encoders.
After encoding, variables are passed to multi-head attention layers (decoder), which produce the final prediction. The whole architecture is interwoven with residual connections with gating mechanisms that allow the architecture to adapt to various problems.
Figure 3: Diagram of the TFT architecture: Bryan Lim, Sercan O. Arik, Nicolas Loeff, Tomas Pfister from Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting, 2019.
Accelerated training
When experimenting with deep learning models, training acceleration can greatly increase the number of experimental iterations you can make in a given time. The Time Series Prediction Platform provides the ability to accelerate training with any combination of Automatic Mixed Precision, Multi-GPU Training, and Exponential Moving Weight Averaging.
Training quick start
Once one is inside the TSPP container, running the TSPP is as simple as calling the launcher with the combination of a dataset, model, and other components that you want to use. For example, to train TFT with the Electricity dataset, we simply call:
The resulting logs, checkpoints, and initial config will be saved to outputs. For examples that include more complex workflows, please reference the repository documentation.
Automatic mixed precision
Automatic Mixed Precision (AMP) is a mode of execution for deep learning training, where applicable calculations are computed in 16-bit precision instead of 32-bit precision. AMP execution can greatly accelerate deep learning training without loss of accuracy. AMP is included in the TSPP and can be enabled by simply adding a flag to the launch call.
Multi-GPU training
Multi-GPU data parallel training provides acceleration for model training by increasing the global batch size by running model computations in parallel on all available GPUs. This approach can greatly improve model training time without loss of model accuracy, especially when many GPUs are used. It is included in the TSPP through PyTorch DistributedDataParallel and can be enabled by simply adding an element to the launch call.
Exponential moving weight averaging
Exponential Moving Weight Averaging is a technique that maintains two copies of a model, one that is being trained through backpropagation, and a second model that is the weighted average of the weights of the first model. At test and inference time, the averaged weights are used to compute outputs. This approach has proven to decrease time to convergence and increase convergence accuracy in practice, at the cost of doubling model GPU memory requirements. EMWA is included in the TSPP and can be enabled by simply adding a flag to the launch call.
Hyperparameter tuning
Model hyperparameter tuning is an essential part of the model development and experimentation process for deep learning models. For this purpose, the TSPP includes a rich integration with the Optuna hyperparameter search library. Users can run extensive hyperparameter searches by specifying hyperparameter names and distributions to search on. Once this is done, the TSPP can run multi-GPU or single-GPU trials in parallel until the desired number of hyperparameter options have been explored.
At search completion, the TSPP will return the hyperparameters of the best single run, as well as the log files of all runs. For ease of comparison, the log files are generated with the NVIDIA DLLogger and are easily searchable and compatible with Tensorboard plotting.
Configurability
Configurability in the TSPP is driven by Hydra, an open-source library provided by Facebook. Hydra allows users to define a hierarchical configuration system using YAML files that are combined at runtime, making launching runs as simple as stating ‘I want to try this model with this dataset’.
Feature specification
The feature specification, which is included in the dataset portion of configuration, is a standard description language for time-series datasets. It encodes the attributes of each tabular feature with information about whether it is known, observed, or static in the future, whether or not the feature is categorical or continuous, and many more optional attributes. This description language provides a framework for models to automatically configure themselves based on arbitrary described input.
Component integration
Adding a new dataset to the TSPP is as simple as creating a feature specification for it and describing the dataset itself. Once the feature specification and a few other key values have been defined, models integrated with the TSPP will be able to configure themselves to the new dataset.
Adding a new model to the TSPP simply requires that the model expects the data presented by the feature specification to be in the correct channel. If the model correctly interprets the feature spec, the model should work with all datasets integrated into the TSPP, past, and future.
In addition to models and datasets, the TSPP also supports the integration of arbitrary components, such as criterion, optimizers, and goal metrics. Through the use of Hydra’s direct instantiation of objects using config, users can integrate their own custom components and use them simply using the specification in a launch of the TSPP.
Inference and deployment
Inference is a key component of any Machine Learning pipeline. To this end, the TSPP has built-in support for inference that integrates seamlessly with the platform. In addition to supporting native inference, the TSPP also supports single-step deployment of converted models to NVIDIA Triton Inference Servers.
NVIDIA Triton model navigator
The TSPP provides full support for the NVIDIA Triton Model Navigator. Compatible models can be easily converted to optimized formats including TorchScript, ONNX, and NVIDIA TensorRT. In the same step, these converted models will be deployed to an NVIDIA Triton Inference Server. There is even an option to profile, analyze, and generate helm charts for a given model as part of a single step. For example, given a TFT output folder, we can convert and deploy a model to NVIDIA TensorRT format in fp16 by exporting to ONNX with the following command:
We benchmarked TFT within the TSPP on two datasets: the Electrical Load (Electricity) dataset from the UCI dataset repository, and the PEMs Traffic dataset (Traffic). TFT achieves strong results on both datasets, achieving the lowest seen error on both datasets and confirming the assessment of the authors of the TFT paper.
Dataset
Mean Absolute Error
Root Mean Squared Error
Electricity
43.807
142307.152
Traffic
0.005081
0.018
Table 1:
Training Performance
Figures 4 and 5 demonstrate the per-second throughput of TFT on the Electricity and Traffic datasets respectively. Each batch, with batch size 1024, contains a variety of time windows from different time series within the same dataset. The A100 runs were computed using Automatic Mixed Precision. As is evident, TFT has excellent performance and scaling on A100 GPUs, especially when compared with execution on a 96-core CPU.
Figure 4: TFT training throughput on Electricity dataset on GPU versus CPU. GPUs: 8x Tesla A100 80 GB. CPU: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz (96 threads).
Figure 5. TFT training throughput on Traffic dataset on GPU versus CPU. GPUs: 8x Tesla A100 80 GB. CPU: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz (96 threads).
Training time
Figures 6 and 7 demonstrate the end-to-end training time of TFT on the Electricity and Traffic datasets respectively. Each batch, with batch size 1024, contains a variety of time windows from different time series within the same dataset. The A100 completed runs were computed using Automatic Mixed Precision. In these experiments, on GPU, TFT is trained in minutes, while the CPU runs trained in approximately half a day.
Figure 6: TFT end-to-end training time on Electricity dataset on GPU compared to CPU. GPUs: 8x Tesla A100 80 GB. CPU: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz (96 threads).
Figure 7: TFT end-to-end training time on Traffic dataset on GPU compared to CPU. GPUs: 8x Tesla A100 80 GB. CPU: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz (96 threads).
Inference Performance
Figures 8 and 9 demonstrate the relative single-device inference throughput and average latency of an A100 80GB GPU vs a 96-core CPU across a variety of batch sizes on the electricity dataset. Since larger batch sizes generally generate greater inference throughput, we consider the 1024 element batch results, where it is apparent that the A100 GPU has incredible performance, processing approximately 50,000 samples a second. Furthermore, larger batch sizes tend to lead to higher latency as is evident from the CPU values, which seem to scale proportionally with the batch size. In contrast, the A100 GPU has a near constant average latency when compared to the CPU.
Figure 8: TFT throughput on Electricity dataset when deployed to NVIDIA Triton Inference Server Container 21.12 on GPU vs CPU. GPUs: 1x Tesla A100 80 GB deployed using TensorRT 8.2. CPU: Dual AMD Rome 7742, 128 cores total @ 2.25 GHz (base), 3.4 GHz (max boost) (256 threads) deployed using ONNX.
Figure 9: TFT average latency on Electricity dataset when deployed to NVIDIA Triton Inference Server Container 21.12 on GPU vs CPU. GPUs: 1x Tesla A100 80 GB deployed using TensorRT 8.2. CPU: Dual AMD Rome 7742, 128 cores total @ 2.25 GHz (base), 3.4 GHz (max boost)(256 threads) deployed using ONNX.
End-to-end example
Tying together the preceding examples, we demonstrate a simple training and deployment of the TFT model on the Electricity dataset. We begin by building and launching the TSPP container from the source:
cd DeeplearningExamples/Tools/PyTorch/TimeSeriesPredictionPlatform
source scripts/setup.sh
docker build -t tspp .
docker run -it --gpus all --ipc=host --network=host -v /your/datasets/:/workspace/datasets/ tspp bash
Next, we launch the TSPP with the electricity dataset, TFT, and a quantile loss. We also overload the number of epochs to train for 10. Once the model has been trained, logs, config files, and a trained checkpoint will be created in outputs/{date}/{time}, in this case, outputs/01-02-2022/:
The NVIDIA Time Series Prediction Platform provides end-to-end GPU acceleration from training to inference for time series models. The reference example included in the platform is optimized and certified to run on NVIDIA DGX A100 and NVIDIA-Certified Systems. For a deeper dive into the performance achieved see our Temporal Fusion Transformer benchmarks.
Organizations can start training, comparing model architectures with their own datasets, and deploying the models in production today.
Secure, robust, tinyML based delivery of agricultural products to its valid consumers is the most important step for our AREC.
Secure, robust delivery of agricultural products to its valid consumers is the most important step for our AREC-Agricultural Records on Electronic Contracts solution. Not only it should be streamed toward getting accurate regional data analytics but should also make it nearly impossible to counterfeit goods, I took up this challenge because I am very aware of the situation in India, for example: during Covid-19 fake pesticides made up almost a quarter of the market exposing Himachal Pradesh farmers, their apple crops and the environment to many harmful side effects.
One of the biggest reasons behind the growth of fake products in the agrochemical industry is the lack of knowledge about genuine products among consumers. Illiterate farmers usually do not check the product authentication features and just go for the cheap price.
Rare adoption of anti-counterfeiting solutions by companies: Many agrochemical manufacturers do not implement anti-counterfeiting solutions in their products. This makes their products highly susceptible to duplication, tampering and even diversion.
Lack of stringent regulatory measures: Lack of strict regulatory measures against counterfeit products and poor laws against counterfeiters has led to the rapid rise of counterfeit agro-chemical products. Challenge: Develop an easily accessible solution for farmers that can provide detailed product-specific information, their usage, and the delivery details, and help them increase productivity and profitability.
Our idea is to bring a secure de-centralized, tamperproof unified blockchain ledger with a secure authentication device protocol to streamline all manufacturers and logistics data. The device sends and updates the following data to the ledger:
location
temp&humidity
seal/packet opened or not
tinyML based condition status (For instance, if there is a sudden change in temperature that would damage the goods – abnormal environmental conditions during transit of agrochemicals like pesticides can result in separation of active ingredients, a notification is sent that triggers an action to resolve the situation.)
Using the device ID barcode, the farmer can scan and get to know all details, in fact, it would be one time process to onboard the farmer to the Blockchain ledger, the product ID barcode would serve as farmer login and it would allow him to trace and lodge complain in case of any unsatisfactory results.
Used on a variety of зкщвгсеы – Information will be presented on the packaging. (Taking into account a variety of products (crop protection and seeds) and packaging (eg bags, 250ml, 20L, etc).
Customers can scan/read and interact with the Digital Label
Provides simple and actionable information about seeds, traits, and crop protection products
Uses scannable resources like QR codes to provide information about the product (nutrients and pesticides) like geography, dosage, applicability, authenticity, usage, and disposal guidelines
Supports customer feedback, and complaints – Customers can communicate with the company regarding a specific product
Can create a dynamic label that can be adjusted to cover important legal requirements like safety statements and product’s identification information
Product traceability (localization and expected delivery date- information flow from farmer to customers)
Provides product security, or anti-counterfeiting – Making sure that the product is not forged/Copied and is an original company product.
Has flexible Data Model to support the different data inputs and maintenance, since we need to keep legal requirements data, label artwork models, product usage data, product data, security data, delivery data, etc.
To build flexible and extensible data/information models that can handle vast and diverse information related to individual products.
Technology Overview:
MCU choice: Our application requires robust performance over a long time more than 6 months with minimum power consumption (here, focused on driving down cost, power consumption and size so 8bit MCU would be a great option too).
tinyML framework choice: Recently, explored the Neuton ai platform and was blown away by the flexibility it provides for embedding huge ML models with lower memory footprints even on an 8bit MCU. The training is easy with greatly optimized C codes for MCUs. It offers various model explainability tools that allow users to evaluate the trained model quality and understand the cause of prediction for the inference data.
Blockchain frameworkchoice:Daml idea is to abstract the key concepts of blockchains that allow them to provide consistency guarantees on distributed systems of record. Daml, a virtual shared ledger, governed by smart contracts, takes the place of a concrete blockchain. Adding this layer of abstraction allows Daml to realize a whole range of capabilities and advantages that are sorely lacking in individual blockchain and distributed ledger technologies.
Fleet Monitoring: There are several reasons to consider this to allow better regulation of tracking devices. I have tried Golioth and Toit and both seem to be fantastic in terms of security and visibility but I will lean more towards C/C++ support. (I will work on this in future)
Below is what our label would look like and the essential data printed on it. It would only contain a battery, microprocessor, SIM, modem and antenna under its unassuming surface.
Testing DAML blockchain with our product (I used toit to test a use case for fleet monitoring and remote logging), check the below images for understanding our Blockchain ledger pipeline. With blockchain, we have selected visibility of immutable data which can be accessed under specific roles by scanning a QR code on the label.
Than I build the model using Neuton Platfrom
Building TinyML model for our smart labels in 3 steps:
Step 1: Creating Solution and Adding Dataset
Step 2: TinyML model setting and Training
Our model has Coefficients: 120 Model Size Kb: 0.46 File Size for Embedding Kb: 0.553, pretty tiny and smart 🙂
Step 3: Make Predictions on MCUs
Looking for a challenge? Try maneuvering a Kenyan minibus through traffic or dropping seed balls on deforested landscapes. Or download Africa’s Legends and battle through fiendishly difficult puzzles with Ghana’s Ananse or Nigeria’s Oya by your side. Games like these are connecting with a hyper-connected African youth population that’s growing fast. Africa is the youngest Read article >
I’m working on a project where basically I would like to take a relatively small binary image and train a DL model to produce a vector of 360 outputs, one for each angle (i.e. the amplitude response at every angle produced by the image in question). I’m trying to figure out how to best build this from a model perspective. I think some convolutional layers might make sense, but I’m not sure whether I should just build a dense layer with 360 outputs, one for each angle, or something else entirely (perhaps a convolutional AE or RNN/LSTM or something like that, since we’re looking at sequence data?) And I’m not sure how the fact angles are involved might change anything either. Any ideas are appreciated!
Hi, I am working on a simple ML project and I am getting an error I can’t seem to resolve. I am getting: AssertionError: Duplicate registrations for type ‘experimentalOptimizer’ when I try to run any TensorFlow program that has an optimizer. I uninstalled Python, TensorFlow, and Keras and then reinstalled all three. The Adam optimizer file is present as well as several other optimizers. I couldn’t find any other documentation on this error elsewhere so any help would be greatly appreciated, thanks.