Categories
Offsites

Large Motion Frame Interpolation

Frame interpolation is the process of synthesizing in-between images from a given set of images. The technique is often used for temporal up-sampling to increase the refresh rate of videos or to create slow motion effects. Nowadays, with digital cameras and smartphones, we often take several photos within a few seconds to capture the best picture. Interpolating between these “near-duplicate” photos can lead to engaging videos that reveal scene motion, often delivering an even more pleasing sense of the moment than the original photos.

Frame interpolation between consecutive video frames, which often have small motion, has been studied extensively. Unlike videos, however, the temporal spacing between near-duplicate photos can be several seconds, with commensurately large in-between motion, which is a major failing point of existing frame interpolation methods. Recent methods attempt to handle large motion by training on datasets with extreme motion, albeit with limited effectiveness on smaller motions.

In “FILM: Frame Interpolation for Large Motion”, published at ECCV 2022, we present a method to create high quality slow-motion videos from near-duplicate photos. FILM is a new neural network architecture that achieves state-of-the-art results in large motion, while also handling smaller motions well.

FILM interpolating between two near-duplicate photos to create a slow motion video.

FILM Model Overview
The FILM model takes two images as input and outputs a middle image. At inference time, we recursively invoke the model to output in-between images. FILM has three components: (1) A feature extractor that summarizes each input image with deep multi-scale (pyramid) features; (2) a bi-directional motion estimator that computes pixel-wise motion (i.e., flows) at each pyramid level; and (3) a fusion module that outputs the final interpolated image. We train FILM on regular video frame triplets, with the middle frame serving as the ground-truth for supervision.
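
As a rough illustration, the recursive mid-frame inference can be sketched as follows; film_model is a hypothetical callable standing in for the trained network, not part of the released API.

# A minimal sketch of recursive mid-frame inference, assuming a hypothetical
# `film_model(frame_a, frame_b)` callable that returns the middle frame.
def interpolate_recursively(frame_a, frame_b, num_recursions):
    """Return all frames between frame_a and frame_b (exclusive), in time order."""
    if num_recursions == 0:
        return []
    mid = film_model(frame_a, frame_b)
    left = interpolate_recursively(frame_a, mid, num_recursions - 1)
    right = interpolate_recursively(mid, frame_b, num_recursions - 1)
    return left + [mid] + right

# Example: 3 levels of recursion produce 2**3 - 1 = 7 in-between frames,
# i.e., an 8x temporal up-sampling between the two input photos.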

A standard feature pyramid extraction on two input images. Features are processed at each level by a series of convolutions, which are then downsampled to half the spatial resolution and passed as input to the deeper level.

Scale-Agnostic Feature Extraction
Large motion is typically handled with hierarchical motion estimation using multi-resolution feature pyramids (shown above). However, this method struggles with small and fast-moving objects because they can disappear at the deepest pyramid levels. In addition, there are far fewer available pixels to derive supervision at the deepest level.

To overcome these limitations, we adopt a feature extractor that shares weights across scales to create a “scale-agnostic” feature pyramid. This feature extractor (1) allows the use of a shared motion estimator across pyramid levels (next section) by equating large motion at shallow levels with small motion at deeper levels, and (2) creates a compact network with fewer weights.

Specifically, given two input images, we first create an image pyramid by successively downsampling each image. Next, we use a shared U-Net convolutional encoder to extract a smaller feature pyramid from each image pyramid level (columns in the figure below). As the third and final step, we construct a scale-agnostic feature pyramid by horizontally concatenating features from different convolution layers that have the same spatial dimensions. Note that from the third level onwards, the feature stack is constructed with the same set of shared convolution weights (shown in the same color). This ensures that all features are similar, which allows us to continue to share weights in the subsequent motion estimator. The figure below depicts this process using four pyramid levels, but in practice, we use seven.
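
A minimal PyTorch-style sketch of this idea follows; the channel counts, pyramid depth, and pooling choices are illustrative assumptions rather than the released FILM implementation.

# A rough sketch of a scale-agnostic feature pyramid with a shared encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoder(nn.Module):
    def __init__(self, depth=4, channels=32):
        super().__init__()
        # One conv block per depth; the SAME blocks are reused for every image
        # pyramid level, which is what makes the features "scale agnostic".
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3 if d == 0 else channels, channels, 3, padding=1),
                          nn.ReLU())
            for d in range(depth)
        ])

    def forward(self, image):
        feats, x = [], image
        for block in self.blocks:
            x = block(x)
            feats.append(x)            # feature at the current resolution
            x = F.avg_pool2d(x, 2)     # halve resolution before the next block
        return feats

def scale_agnostic_pyramid(image, levels=4, depth=4):
    encoder = SharedEncoder(depth)
    # Image pyramid: successively downsampled copies of the input.
    images = [image]
    for _ in range(levels - 1):
        images.append(F.avg_pool2d(images[-1], 2))
    # Per-level feature columns from the shared encoder.
    columns = [encoder(im) for im in images]
    # Concatenate features that share the same spatial resolution:
    # column i at depth d lives at resolution level i + d.
    pyramid = []
    for k in range(levels + depth - 1):
        same_res = [columns[i][d] for i in range(levels)
                    for d in range(depth) if i + d == k]
        pyramid.append(torch.cat(same_res, dim=1))
    return pyramid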

Bi-directional Flow Estimation
After feature extraction, FILM performs pyramid-based residual flow estimation to compute the flows from the yet-to-be-predicted middle image to the two inputs. The flow estimation is done once for each input, starting from the deepest level, using a stack of convolutions. We estimate the flow at a given level by adding a residual correction to the upsampled estimate from the next deeper level. This approach takes the following as its input: (1) the features from the first input at that level, and (2) the features of the second input after it is warped with the upsampled estimate. The same convolution weights are shared across all levels, except for the two finest levels.

Shared weights allow the interpretation of small motions at deeper levels to be the same as large motions at shallow levels, boosting the number of pixels available for large motion supervision. Additionally, shared weights not only enable the training of powerful models that may reach a higher peak signal-to-noise ratio (PSNR), but are also needed to enable models to fit into GPU memory for practical applications.

The impact of weight sharing on image quality. Left: no sharing, Right: sharing. For this ablation we used a smaller version of our model (called FILM-med in the paper) because the full model without weight sharing would diverge as the regularization benefit of weight sharing was lost.

Fusion and Frame Generation
Once the bi-directional flows are estimated, we warp the two feature pyramids into alignment. We obtain a concatenated feature pyramid by stacking, at each pyramid level, the two aligned feature maps, the bi-directional flows and the input images. Finally, a U-Net decoder synthesizes the interpolated output image from the aligned and stacked feature pyramid.

FILM Architecture. FEATURE EXTRACTION: we extract scale-agnostic features. The features with matching colors are extracted using shared weights. FLOW ESTIMATION: we compute bi-directional flows using shared weights across the deeper pyramid levels and warp the features into alignment. FUSION: A U-Net decoder outputs the final interpolated frame.

Loss Functions
During training, we supervise FILM by combining three losses. First, we use the absolute L1 difference between the predicted and ground-truth frames to capture the motion between input images. However, this produces blurry images when used alone. Second, we use a perceptual loss to improve image fidelity, minimizing the L1 difference between the ImageNet pre-trained VGG-19 features extracted from the predicted and ground-truth frames. Third, we use a style loss to minimize the L2 difference between the Gram matrices of those VGG-19 features. The style loss enables the network to produce sharp images and realistic inpaintings of large pre-occluded regions. Finally, the losses are combined with weights empirically selected such that each loss contributes equally to the total loss.
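
As a rough illustration of how the three terms combine, here is a minimal PyTorch-style sketch; the VGG layer choice and the unit loss weights are illustrative assumptions, not the exact FILM configuration.

# A minimal sketch of the combined training loss.
import torch
import torch.nn.functional as F
import torchvision

# Frozen ImageNet pre-trained VGG-19 feature extractor (first ~22 layers).
vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:22].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def gram_matrix(feat):
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def film_loss(pred, target, w_l1=1.0, w_vgg=1.0, w_style=1.0):
    vgg_pred, vgg_tgt = vgg(pred), vgg(target)
    l1 = F.l1_loss(pred, target)                                     # pixel loss
    perceptual = F.l1_loss(vgg_pred, vgg_tgt)                        # VGG feature loss
    style = F.mse_loss(gram_matrix(vgg_pred), gram_matrix(vgg_tgt))  # Gram/style loss
    return w_l1 * l1 + w_vgg * perceptual + w_style * style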

Shown below, the combined loss greatly improves sharpness and image fidelity when compared to training FILM with L1 loss and VGG losses. The combined loss maintains the sharpness of the tree leaves.

FILM’s combined loss functions. L1 loss (left), L1 plus VGG loss (middle), and Style loss (right), showing significant sharpness improvements (green box).

Image and Video Results
We evaluate FILM on an internal near-duplicate photos dataset that exhibits large scene motion. Additionally, we compare FILM to recent frame interpolation methods: SoftSplat and ABME. FILM performs favorably when interpolating across large motion. Even in the presence of motion as large as 100 pixels, FILM generates sharp images consistent with the inputs.

Frame interpolation with SoftSplat (left), ABME (middle) and FILM (right) showing favorable image quality and temporal consistency.
Large motion interpolation. Top: 64x slow motion video. Bottom (left to right): The two input images blended, SoftSplat interpolation, ABME interpolation, and FILM interpolation. FILM captures the dog’s face while maintaining the background details.

Conclusion
We introduce FILM, a large motion frame interpolation neural network. At its core, FILM adopts a scale-agnostic feature pyramid that shares weights across scales, which allows us to build a “scale-agnostic” bi-directional motion estimator that learns from frames with normal motion and generalizes well to frames with large motion. To handle wide disocclusions caused by large scene motion, we supervise FILM by matching the Gram matrix of ImageNet pre-trained VGG-19 features, which results in realistic inpainting and crisp images. FILM performs favorably on large motion, while also handling small and medium motions well, and generates temporally smooth high quality videos.

Try It Out Yourself
You can try out FILM on your photos using the source code, which is now publicly available.

Acknowledgements
We would like to thank Eric Tabellion, Deqing Sun, Caroline Pantofaru, and Brian Curless for their contributions. We thank Marc Comino Trinidad for his contributions on the scale-agnostic feature extractor; Orly Liba and Charles Herrmann for feedback on the text; Jamie Aspinall for the imagery in the paper; and Dominik Kaeser, Yael Pritch, Michael Nechyba, William T. Freeman, David Salesin, Catherine Wah, and Ira Kemelmacher-Shlizerman for support.

Categories
Misc

CUDA Toolkit 11.8 New Features Revealed


NVIDIA announces the newest CUDA Toolkit software release, 11.8. This release is focused on enhancing the programming model and CUDA application speedup through new hardware capabilities.

New architecture-specific features in NVIDIA Hopper and Ada Lovelace are initially being exposed through libraries and framework enhancements. The full programming model enhancements for the NVIDIA Hopper architecture will be released starting with the CUDA Toolkit 12 family.

CUDA 11.8 has several important features. This post offers an overview of the key capabilities.

NVIDIA Hopper and NVIDIA Ada architecture support

CUDA applications can immediately benefit from increased streaming multiprocessor (SM) counts, higher memory bandwidth, and higher clock rates in new GPU families.

CUDA and CUDA libraries expose new performance optimizations based on GPU hardware architecture enhancements.

Lazy module loading

Building on the lazy kernel loading feature in 11.7, NVIDIA added lazy loading on the CPU module side. This means that functions and libraries load faster on the CPU, sometimes with substantial reductions in memory footprint. The tradeoff is a small amount of latency at the point in the application where a function is first loaded, which is still lower overall than the total latency without lazy loading.

All libraries used with lazy loading must be built with 11.7+ to be eligible for lazy loading.

Lazy loading is not enabled in the CUDA stack by default in this release. To evaluate it for your application, run with the environment variable CUDA_MODULE_LOADING=LAZY set.
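
For example, in a Python application you might set the variable before anything initializes CUDA; the CuPy import below is an assumed stand-in for any CUDA-backed library. Equivalently, export the variable in your shell before launching the application.

# A minimal sketch: the variable must be set before the CUDA runtime initializes.
import os
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

import cupy as cp  # assumed stand-in for any CUDA-backed library

x = cp.arange(10) ** 2   # the first kernel/library use triggers the (lazy) load
print(x)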

Improved MPS signal handling

You can now terminate any application running in an MPS environment with SIGINT or SIGKILL without affecting other running processes. While not true error isolation, this enhancement enables more fine-grained application control, especially in bare-metal data center environments.

NVIDIA JetPack installation simplification

NVIDIA JetPack provides a full development environment for hardware-accelerated AI-at-the-edge on Jetson platforms. Starting from CUDA Toolkit 11.8, Jetson users on NVIDIA JetPack 5.0 and later can upgrade to the latest CUDA versions without updating the NVIDIA JetPack version or Jetson Linux BSP (board support package) to stay on par with the CUDA desktop releases.

For more information, see Simplifying CUDA Upgrades for NVIDIA Jetson Developers.

CUDA developer tool updates

Compute developer tools are designed in lockstep with the CUDA ecosystem to help you identify and correct performance issues.

Nsight Compute

In Nsight Compute, you can expose low-level performance metrics, debug API calls, and visualize workloads to help optimize CUDA kernels. New compute features are being introduced in CUDA 11.8 to aid performance tuning activity on the NVIDIA Hopper architecture.

You can now profile and debug NVIDIA Hopper thread block clusters, which provide performance boosts and increased control over the GPU. Cluster tuning is being released in combination with profiling support for the Tensor Memory Accelerator (TMA), the NVIDIA Hopper rapid data transfer system between global and shared memory.

A new sample is included in Nsight Compute for CUDA 11.8 as well. The sample provides source code and precollected results that walk you through an entire workflow to identify and fix an uncoalesced memory access problem. Explore more CUDA samples to equip yourself with the knowledge to use toolkit features and solve similar cases in your own application.

Nsight Systems

Profiling with Nsight Systems can provide insight into issues such as GPU starvation, unnecessary GPU synchronization, insufficient CPU parallelizing, and expensive algorithms across the CPUs and GPUs. Understanding these behaviors and the load of deep learning frameworks, such as PyTorch and TensorFlow, helps you tune your models and parameters to increase overall single or multi-GPU utilization.

Other tools

Also included in the CUDA Toolkit, both CUDA-GDB, for CPU and GPU thread debugging, and Compute Sanitizer, for functional correctness checking, now support the NVIDIA Hopper architecture.

Summary

This release of the CUDA 11.8 Toolkit has the following features:

  • First release supporting NVIDIA Hopper and NVIDIA Ada Lovelace GPUs
  • Lazy module loading extended to support lazy loading of CPU-side modules in addition to device-side kernels
  • Improved MPS signal handling for interrupting and terminating applications
  • NVIDIA JetPack installation simplification
  • CUDA developer tool updates


Categories
Misc

Searidge Technologies Offers a Safety Net for Airports

Planes taxiing for long periods due to ground traffic — or circling the airport while awaiting clearance to land — don’t just make travelers impatient. They burn fuel unnecessarily, harming the environment and adding to airlines’ costs. Searidge Technologies, based in Ottawa, Canada, has created AI-powered software to help the aviation industry avoid such issues.


Categories
Misc

Creator EposVox Shares Streaming Lessons, Successes This Week ‘In the NVIDIA Studio’

TwitchCon, the world’s top gathering of live streamers, kicks off Friday with the new line of GeForce RTX 40 Series GPUs bringing incredible new technology, from AV1 to AI, to elevate live streams for aspiring and professional Twitch creators alike.


Categories
Misc

Optimizing Fraud Detection in Financial Services with Graph Neural Networks and NVIDIA GPUs


Fraud is a major problem for many financial services firms, costing billions of dollars each year, according to a recent Federal Trade Commission report. Financial fraud, fake reviews, bot assaults, account takeovers, and spam are all examples of online fraud and harmful activity.

Although these firms employ techniques to combat online fraud, the methods can have severe limitations. Simple rule-based techniques and feature-based algorithm techniques (logistic regression, Bayesian belief networks, CART, and others) aren’t adaptable enough to detect the full range of fraudulent or suspicious online behaviors. 

Fraudsters, for example, might set up many coordinated accounts to avoid triggering limitations on individual accounts. In addition, detecting fraudulent behavior patterns at scale is difficult due to the huge volume of data to sift through (billions of rows, tens of terabytes), the complexity of continually improving methodologies, and the scarcity of real cases of fraudulent activity required for training classification algorithms. For more details, see Intelligent Financial Fraud Detection Practices: An Investigation.

Although the cost of fraud is billions of dollars per year, there are very few fraudulent transactions among many legitimate transactions, leading to an imbalance in labeled data, when it is even available.  Detecting fraud becomes even more complex in the financial services industry, due to security concerns around personal data and the need for transparency in the methods used to detect the fraudulent activity. 

An explainable model enables fraud analysts to understand what inputs the algorithm used in the analysis and the reason(s) for flagging the transaction, building a stronger trust in the system. Additional benefits include the ability to communicate feedback to internal teams and provide customers with an explanation.

In recent years, graph neural networks (GNNs) have gained traction for fraud detection problems, revealing suspicious nodes (accounts and transactions, for example) by aggregating their neighborhood information through different relations; in other words, by checking whether a given account has sent a transaction to a suspicious account in the past. 

In the context of fraud detection, the ability of GNNs to aggregate information contained within the local neighborhood of a transaction enables them to identify larger patterns that may be missed by just looking at a single transaction. 

To enable developers to quickly take advantage of GNNs to optimize and accelerate fraud detection, NVIDIA partnered with the Deep Graph Library (DGL) team and the PyTorch Geometric (PyG) team to provide a GNN framework containerized solution that includes the latest DGL or PyG, PyTorch, NVIDIA RAPIDS, and a set of tested dependencies. The NVIDIA-optimized GNN Framework containers are performance-tuned and tested for NVIDIA GPUs. 

This approach eliminates the need to manage packages and dependencies or build the framework from source. We are actively contributing to enhance the performance of these top GNN frameworks. We have added GPU support for unified virtual addressing (UVA), FP16 operations, neighborhood sampling, subgraph operations for minibatches and optimized sparse embeddings, sparse adam optimizer, graph batching, CSR-to-COO conversions, and much more.

This post first addresses the unique problems in credit card fraud detection and the most widely used detection techniques. It also highlights how GNNs accelerated by GPUs have a unique approach to addressing these issues. We walk through an end-to-end workflow showcasing best practices for preprocessing, training, and deployment for detecting fraud on a financial fraud dataset using graph neural networks. Last, we show benchmarks of end-to-end workflows on two industry scale datasets utilizing the optimizations contributed in DGL by NVIDIA engineers.

Overview of fraud detection

Fraud detection is a set of processes and analyses that allow firms to identify and prevent unauthorized activity. It has become one of the major challenges for most organizations, particularly those in banking, finance, retail, and e-commerce. 

Any kind of fraud negatively affects an organization’s bottom line and market reputation, and deters both future prospects and current customers. Given the scale and reach of these vulnerable organizations, it has become crucial for them to prevent fraud from happening and even predict suspicious actions in real time.

Fraud detection poses unique problems for machine learning researchers and engineers, a few of which are detailed below.

Complex and evolving fraud patterns 

Fraudsters update their knowledge and develop sophisticated techniques to cheat the system, often involving complex chains of transactions to avoid detection. 

Traditional rule-based systems and tabular machine learning (ML) models like SVMs and XGBoost can often only consider the immediate edges of a transaction (who sent money to whom), missing fraud patterns with more complex context. Rule-based systems also need to be hand-tuned over time as patterns of fraud change and new exploits emerge.

Label quality

Available fraud datasets are often both imbalanced and without exhaustive labels. In the real world, only a small percentage of people intend to commit fraud. Domain experts typically classify transactions as either fraudulent or not, but cannot guarantee that all fraud has been captured in the dataset. 

This class imbalance and lack of exhaustive labels make it difficult to develop supervised models, as models trained on the labels we do have may incur higher rates of false negatives, and the imbalanced dataset can lead to models that also generate more false positives. Thus, training GNNs with alternative objectives and using their latent representations downstream can have beneficial effects.

Model explainability 

Predicting whether a transaction is fraudulent or not is not sufficient for transparency expectations in the financial services industry. It is also necessary to understand why certain transactions are flagged as fraud. This explainability is important for understanding how fraud happens, for implementing policies to reduce fraud, and for making sure the process isn’t biased. Therefore, fraud detection models are required to be interpretable and explainable, which limits the selection of models that analysts can use. 

Graph approaches for fraud detection

A series of transactions can be accurately described as a graph, with users being represented as nodes, and transactions between them being represented as edges. While feature-based algorithms like XGBoost and deep feature-based models like DLRM focus on the features of a single node or edge, graph-based approaches can take the features and structure of the local graph context (neighbors and neighbors of neighbors, for example) into account in their predictions.

In the traditional (non-GNN) graph domain, there are many approaches to generating salient predictions based on the graph structure. Statistical approaches that aggregate features from adjacent neighboring nodes or edges, or even their neighbors, can be used to provide information about locality to feature-based tabular algorithms like XGBoost. 

Algorithms like the Louvain method and InfoMap can detect communities and denser clusters of users on the graph, which can then be used to detect communities and generate features that represent graph structure as a hierarchy.

While these approaches can generate adequate results, the problem remains that the algorithms used lack expressivity with respect to the graph itself, as they do not consider the graph in its native format.

Graph neural networks build on the concept of representing local structural and feature context natively within the model. Information from both edge and node features is propagated through aggregation and message passing to neighboring nodes. 

When multiple layers of graph convolution are performed, this results in a node’s state containing some information from nodes multiple layers away, effectively allowing the GNN to have a “receptive field” of nodes or edges multiple jumps away from the node or edge in question. 

In the context of the fraud detection problem, this large receptive field of GNNs can account for more complex or longer chains of transactions that fraudsters can use for obfuscation. Additionally, changing patterns can be accounted for by iterative retraining of the model.

Graph neural networks also benefit from being able to encode meaningful representations of nodes or edges while training on an unsupervised or self-supervised task, such as Bootstrapped Graph Latents (BGRL) or link prediction with negative sampling. This allows GNN users to pre-train a model without labels, and to fine-tune the model on the much sparser labels later in the pipeline, or to output strong representations of the graph. The representation output can be used for downstream models like XGBoost, other GNNs, or clustering techniques.

GNNs also have a suite of tools to enable explainability with respect to the input graph. Certain GNN models like heterogeneous graph transformer (HGT) and graph attention network (GAT) enable an attention mechanism across the adjacent edges of a node at each layer of the GNN, allowing the user to identify the path of messages that the GNN is using to derive its final state. Even if GNN models have no attention mechanism, a variety of approaches have been proposed in order to explain GNN output in the context of the entire subgraph, including GNNExplainer, PGExplainer, and GraphMask.

The next section walks through an end-to-end credit card fraud detection workflow. This workflow uses TabFormer, a card transaction fraud dataset, and trains an R-GCN (relational graph convolutional network) model on a variation of the link prediction task in order to generate enriched node embeddings. These node embeddings are passed to a downstream XGBoost model, which is trained and subsequently performs fraud detection. 

This XGBoost model can then be easily deployed. The embeddings trained can subsequently be used for other unsupervised techniques like clustering to identify undiscovered patterns of use without needing labels. Last, we will show benchmarks of end-to-end workflows on two industry scale datasets utilizing the optimizations contributed in DGL by NVIDIA engineers.

Building an end-to-end fraud detection workflow with GNNs

Data preprocessing

We are using the TabFormer dataset provided by IBM to demonstrate this workflow. The TabFormer dataset is a synthetic close approximation of a real-world financial fraud-detection dataset, consisting of:

  • 24 million unique transactions
  • 6,000 unique merchants
  • 100,000 unique cards
  • 30,000 fraudulent samples (0.1% of total transactions)

To begin, preprocess the dataset using a predefined workflow. The workflow leverages cuDF, a GPU DataFrame library, to perform feature transformations on the original dataset to prepare it for graph construction. cuDF is a drop-in replacement for pandas that enables the preprocessing of data directly on GPUs. 

In this dataset, the card_id is defined as one card by one user. A specific user can have multiple cards, which would correspond to multiple different card_ids for this graph. The merchant_id is the categorical encoding of the feature, ‘Merchant Name’. The data is split such that the training data is all transactions before the year 2018, the validation data is all transactions during the year 2018, and the test data is all transactions after the year 2018. 

# Read the dataset
data = cudf.read_csv(self.source_path)
data["card_id"] = data["user"].astype("str") + data["card"].astype("str")

# Split the data based on the year
data["split"] = cudf.Series(np.zeros(data["year"].size), dtype=np.int8)
data.loc[data["year"] == 2018, "split"] = 1
data.loc[data["year"] > 2018, "split"] = 2
train_card_id = data.loc[data["split"] == 0, "card_id"]
train_merch_id = data.loc[data["split"] == 0, "merchant_id"]

Strip the ‘$’ from the ‘Amount’ to cast that value as a float. Keep card_id and merchant_id in the validation and test datasets only if they are included in the train datasets.  

The graph is constructed with transaction edges between card_id and merchant_id.

Further preprocessing includes one-hot encoding the ‘Use chip’ feature, label encoding the ‘Is Fraud?’ feature, and target encoding the categorical representations of Merchant State, Merchant City, Zip, and MCC. In addition, the possible values of ‘Errors?’ are one-hot encoded.

# Target encoding
high_card_cols = ["merchant_city", "merchant_state", "zip", "mcc"]
for col in high_card_cols:
    tgt_encoder = TargetEncoder(smooth=0.001)
    train_df[col] = tgt_encoder.fit_transform(
        train_df[col], train_df["is_fraud"])
    valtest_df[col] = tgt_encoder.transform(valtest_df[col])

# One hot encoding `use_chip`
oneh_enc_cols = ["use_chip"]
data = cudf.concat([data, cudf.get_dummies(data[oneh_enc_cols])], axis=1)

# Label encoding `is_fraud`
label_encoder = LabelEncoder()
train_df["is_fraud"] = label_encoder.fit_transform(train_df["is_fraud"])
valtest_df["is_fraud"] = label_encoder.transform(valtest_df["is_fraud"])

# One hot encoding the errors
exploded = data["errors"].str.strip(",").str.split(",").explode()
raw_one_hot = cudf.get_dummies(exploded, columns=["errors"])
errs = raw_one_hot.groupby(raw_one_hot.index).sum()

Once the dataset is preprocessed, transform the tabular format of the dataset into a graph.

Modeling tabular data as a graph

Transforming a table (or multiple tables) into a graph centers around mapping the existing table(s) into the edges, nodes, and features for both structures. In the case of this dataset, we begin by using the transaction table to create edges between the cards and the merchants. In contemporary GNN frameworks, graph edges are represented at a basic level by pairs of node IDs. Nodes are implicit based on the IDs included in the edge lists.

# Defining node types
for ntype in ["card", "merchant"]:
    node_type = {MetadataKeys.NAME: ntype, MetadataKeys.FEAT: []}
    self.node_types.append(node_type)

# Adding attributes of edge data
self.edge_data = dict()
self.edge_data["transaction"] = cudf.DataFrame({
    MetadataKeys.SRC_ID: data["card_id"],
    MetadataKeys.DST_ID: data["merchant_id"],
})

# Defining features
features = []
for key in data.keys():
    if key not in ["card_id", "merchant_id"]:
        self.edge_data["transaction"][key] = data[key]
        feat = {
            MetadataKeys.NAME: key,
            MetadataKeys.DTYPE: str(self.edge_data["transaction"][key].dtype),
            MetadataKeys.SHAPE: self.edge_data["transaction"][key].shape,
        }
        if key in ["is_fraud"]:
            feat[MetadataKeys.LABEL] = True
        features.append(feat)

With the base graph created, it’s time to add the transaction features onto the edges in the graph. Note that in this case, the transaction data is only edge-specific, so the output graph has no node features.

Once the graph is created and populated with features, the model can be applied to it. 

Training the GNN model

Given the label imbalance and imperfect labeling of the dataset, we elected to use an unsupervised task, link prediction, to train the model to create meaningful representations of the nodes. The objective of link prediction is to predict the probability that an edge exists between two nodes. In financial services, this is translated to predicting the probability that a transaction exists between an individual and a merchant. 

Some target nodes within the batch are true edges, which are actual edges that exist in the graph, while others, generated by a negative sampler, are negative edges that do not truly exist. Negative edges are necessary in this case because our training task is the classification between real and fake. There are a variety of proposed ways in which to generate negative edges, but simply uniformly sampling the nodes to get the node endpoints is widely employed and achieves good results. While it is possible to negatively sample actual edges with this approach, most graphs are sparse enough that the probability of this is almost negligible.

Since most transaction graphs are too large to represent in GPU memory, we need to employ a subsampling technique in order to generate smaller localities for our graph to process. Sampling is usually done in two phases in DGL. 

First, perform seed sampling in order to identify the edges or nodes targeted for the GNN to predict on. Next, perform block sampling, also known as neighborhood sampling, to generate the subgraphs surrounding the seeds to use as input to the GNN.

The graph contains edges and nodes that could leak future information from the test set, so we must create an individual data loader and sampling routine for our train, validation, and test sets. The train dataloader is moderately simple, utilizing just edges in the training set for seed sampling, and the train set graph for block sampling. 

For the validation data loader, use the validation edges for seed sampling, but use only the training set graph for block sampling in order to prevent the leakage of information. Apply the same idea to the test set, where the test edges are used for seed sampling and the graph defined by the union of the training and validation sets for block sampling.

In order to accelerate dataloading, use a feature called unified virtual addressing (UVA), which allows us to instantiate the graph such that it can be directly accessed by all the GPUs instead of through the host. When the graph carries many features, UVA can increase model throughput by a factor of up to 5x.
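
Putting the sampling setup and UVA together, a minimal DGL sketch of the train dataloader might look like the following; the fanouts, batch size, number of negatives, and the graph and train_eids variables are illustrative assumptions, not the exact configuration used in this workflow.

# Seed sampling plus neighborhood (block) sampling for link prediction
# with uniformly sampled negative edges.
import dgl
import torch

device = torch.device("cuda")
neighbor_sampler = dgl.dataloading.NeighborSampler([15, 10])       # 2-layer fanout
edge_sampler = dgl.dataloading.as_edge_prediction_sampler(
    neighbor_sampler,
    negative_sampler=dgl.dataloading.negative_sampler.Uniform(5),  # 5 negatives per edge
)
train_dataloader = dgl.dataloading.DataLoader(
    graph,          # graph used for block sampling (train-set graph)
    train_eids,     # seed edges: edge IDs in the training split
    edge_sampler,
    device=device,
    batch_size=1024,
    shuffle=True,
    use_uva=True,   # unified virtual addressing for fast feature access
)

for input_nodes, pos_graph, neg_graph, blocks in train_dataloader:
    # pos_graph holds the true seed edges, neg_graph the sampled negative edges,
    # and blocks are the sampled neighborhoods fed to the R-GCN.
    pass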

With data loaders defined and the graph built, instantiate the R-GCN model. Graph convolutional networks are known for encoding features from structured neighborhoods, assigning the same weight to edges connected to the source node. R-GCN builds on top of this and provides relation-specific transformations that depend on the type and direction of an edge. 

Each edge’s type information supplements the message calculated for each node. Node features and edge types are passed as input to the R-GCN model, which transforms them into an embedding. R-GCN layers can extract high-level node representations by message passing and graph convolutions.
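
A minimal sketch of such a model using DGL’s RelGraphConv is shown below; layer sizes are illustrative, and the per-edge relation type is assumed to be stored as blocks[i].edata["etype"].

# A two-layer R-GCN encoder operating on sampled blocks.
import torch.nn as nn
import torch.nn.functional as F
import dgl.nn as dglnn

class RGCN(nn.Module):
    def __init__(self, in_feats=64, hidden_feats=64, out_feats=64, num_rels=2):
        super().__init__()
        self.conv1 = dglnn.RelGraphConv(in_feats, hidden_feats, num_rels, activation=F.relu)
        self.conv2 = dglnn.RelGraphConv(hidden_feats, out_feats, num_rels)

    def forward(self, blocks, node_feats):
        # Each relation-specific transform is selected by the edge-type tensor.
        h = self.conv1(blocks[0], node_feats, blocks[0].edata["etype"])
        h = self.conv2(blocks[1], h, blocks[1].edata["etype"])
        return h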

Figure 1. R-GCN architecture as featured in Modeling Relational Data with Graph Convolutional Networks: (a) input and output of the R-GCN layer, (b) R-GCN used for entity classification, and (c) R-GCN used for link prediction with an additional decoder

Begin by creating a learnable node-level embedding that stores a 64-element representation tensor for each node. Given that edge features cannot be used here (negative edges are featureless) and the graph has no node features, the node embeddings serve as numerical features on the nodes in addition to the pure structure of the graph. This embedding table is used as input to the R-GCN model, which is defined using standardized hyperparameters. 

The specified model output is of width 64. Note that this number is not reflective of a number of classes: using link prediction, the R-GCN model should generate a node representation that can be used by a downstream operation to predict the probability of an edge between two nodes. There are many proposed ways to do this, including multi-layer perceptrons. This example uses the cosine similarity of the two nodes in order to generate the probability that nodes are actually connected by an edge. Thus, the model is wrapped in a link predictor module to output probabilities given input representations.
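
A minimal sketch of such a cosine-similarity link predictor might look like this; it is an illustration of the idea rather than the exact module used in the workflow.

# Scores candidate edges by the cosine similarity of their endpoint representations.
import torch.nn as nn
import torch.nn.functional as F

class CosineLinkPredictor(nn.Module):
    def forward(self, edge_graph, node_repr):
        src, dst = edge_graph.edges()
        # Cosine similarity in [-1, 1], rescaled to a probability-like score in [0, 1].
        sim = F.cosine_similarity(node_repr[src], node_repr[dst], dim=-1)
        return (sim + 1.0) / 2.0

# The same predictor scores both pos_graph and neg_graph from the dataloader;
# a binary cross-entropy loss then separates true edges from sampled negatives.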

Next, define two optimizers, one for the model itself and one for the embedding table. This two-optimizer setup is common in other contexts involving embedding tables, and it helps improve model convergence here.
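
A minimal sketch of this two-optimizer setup, assuming a plain PyTorch embedding table; num_nodes and model come from the surrounding training script.

import torch

node_emb = torch.nn.Embedding(num_nodes, 64, sparse=True)          # learnable node features
model_opt = torch.optim.Adam(model.parameters(), lr=1e-3)          # R-GCN weights
emb_opt = torch.optim.SparseAdam(node_emb.parameters(), lr=1e-2)   # embedding table

# Each training step calls .zero_grad(), .backward(), and .step() on both optimizers.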

With the components defined, it is now time to train the model. Not unlike other domains, the model can be trained on a single node using distributed data parallel (DDP) to further accelerate the model on multiple GPUs.

Using GNN embeddings for downstream tasks

Once the R-GCN model has been trained, generate robust node embeddings using the network. To do this, perform graph convolution at a one-hop scale for each of the graph layers for the entire graph, and use the late-stage activations generated by the model as embeddings of the nodes of the graph.

With the node embeddings generated, join the embeddings onto the original preprocessed dataset on the respective node IDs. Next, fit an XGBoost model to the edge feature dataset augmented with the extracted embedding values from the upstream GNN model. 

First, create a Dask client by connecting to a LocalCUDACluster, a Dask-based CUDA cluster capable of executing Python processes on multiple GPUs. Then the edge feature dataset is read into Dask and sampled such that the size of the final training dataset (edge features augmented with embedding values) does not exceed 40% of total GPU memory. This is necessary for Dask XGBoost because the full training data must reside in GPU memory, and the process of creating the DMatrix consumes the remaining memory. 

Next, the embeddings from the upstream model are read and joined to their corresponding node IDs. Finally, the XGBoost model is trained to predict ‘Is Fraud?’ and achieves an AUCPR score of 0.9 on the test set. To demonstrate the efficacy of the GNN-created node embeddings, the best XGBoost model trained on the transactions without them achieves an AUCPR score of only 0.79 on the test set.
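
A minimal sketch of this Dask-based XGBoost training step is shown below; the parameters are illustrative, and X and y stand for the Dask dataframes holding the embedding-augmented edge features and the ‘Is Fraud?’ labels.

# Multi-GPU XGBoost training with Dask.
import xgboost as xgb
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()
client = Client(cluster)

dtrain = xgb.dask.DaskDMatrix(client, X, y)
params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",     # GPU-accelerated histogram algorithm
    "eval_metric": "aucpr",
}
output = xgb.dask.train(client, params, dtrain, num_boost_round=500)
booster = output["booster"]        # saved and later served with the FIL backend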

The model checkpoints can further be used to deploy this model on NVIDIA Triton Inference Server.

Deployment

Once the XGBoost model has been trained, deploy the model and spin up an inference server using a Python backend to handle embedding lookup, and a Forest Inference Library (FIL) backend to perform GPU-accelerated forest library inference. 

The deployment pipeline comprises three parts:

  • A Python backend model, referred to as the embedding model. It reads in the embedding tensors. This backend accepts the card IDs and merchant IDs as input and returns their embeddings.
  • A FIL backend model, referred to as the XGBoost model. It loads in the saved XGB model from training. This backend accepts the augmented data (features plus embeddings) and returns the XGB prediction for each row.
  • Another Python backend model which we refer to as the downstream model. This model unifies the full deployment. This backend accepts the card IDs, merchant IDs, and the features. First it calls the embedding model using business logic scripting (BLS) to get the embeddings. Next, it joins the features and embeddings to create the augmented data. It then calls the XGB model, again using BLS, and returns its predictions.

Query this service with a data sample to get the probability of the transaction being fraudulent. This probability can then be used for developing subsequent business logic.
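
As an illustration, a query against the deployed service using Triton’s HTTP client might look like the following; the model, input, and output names are placeholders for whatever names the deployment configuration defines.

# Query the downstream Triton model with one sample transaction.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

card_id = np.array([[1234]], dtype=np.int64)
merchant_id = np.array([[567]], dtype=np.int64)
features = np.random.rand(1, 28).astype(np.float32)   # preprocessed edge features

inputs = [
    httpclient.InferInput("CARD_ID", list(card_id.shape), "INT64"),
    httpclient.InferInput("MERCHANT_ID", list(merchant_id.shape), "INT64"),
    httpclient.InferInput("FEATURES", list(features.shape), "FP32"),
]
inputs[0].set_data_from_numpy(card_id)
inputs[1].set_data_from_numpy(merchant_id)
inputs[2].set_data_from_numpy(features)

result = client.infer("downstream", inputs,
                      outputs=[httpclient.InferRequestedOutput("PREDICTION")])
fraud_probability = result.as_numpy("PREDICTION")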

Benchmarks

We have performed extensive tests on one fraud detection and one benchmark dataset: TabFormer and MAG240M, respectively. To make our experiments reproducible, we used a DGX A100 (80 GB) for all the benchmark runs. This server has dual-socket, 64-core AMD EPYC 7742 CPUs and eight NVIDIA A100 (80 GB SXM4) GPUs.

The next section presents the speedup achieved by optimizing the end-to-end workflow for GPUs.

TabFormer dataset

Comparing the time it takes to preprocess the dataset using pandas on a CPU versus cuDF on a GPU, with a batch size of 8,192, shows a 39x speedup on the GPU (Figure 2).

Figure 2. A comparison of preprocessing time on TabFormer: pandas on CPU versus cuDF on GPU

Next, comparing the training time per epoch before and after enabling UVA shows its advantage. With the same batch size and a fanout configuration of [5, 5], a 2.8x speedup is achieved on a single GPU (Figure 3).

Figure 3. A comparison of training time per epoch on TabFormer, with and without UVA

Finally, comparing the training time with the same batch size and fanout configuration, but on a CPU and a GPU (with UVA on) shows a speedup of 5.63x on a single GPU (Figure 4).

Figure 4. A comparison of training time per epoch on a 64-core, dual-socket AMD CPU and an NVIDIA A100 (80 GB SXM4) GPU

MAG240M dataset

The MAG240M dataset is a part of the OGB Large Scale Challenge. It is the largest public benchmark dataset for node-level tasks with ~245 million nodes and ~1.7 billion edges.

For this dataset, we first look at the total workflow time: the time it takes to preprocess the data, load and construct the graph, and train the R-GCN model. With a batch size of 4,096 and a fanout of [150, 100] (used to achieve the best results in a hyperparameter search), we observe a ~9x speedup, with the CPU taking 1,514 minutes and one NVIDIA A100 GPU taking 169 minutes (Figure 5).

Figure 5. A comparison of total workflow time (preprocessing, loading plus constructing the graph, and training the GNN) on a 64-core, dual-socket AMD CPU and an NVIDIA A100 (80 GB SXM4) GPU for the MAG240M dataset

As this is a large dataset, the workflow has been scaled across multiple GPUs in the same node. We observed a 20% reduction in total time when scaling from one to two GPUs and a 50% reduction from one to eight GPUs (Figure 6).

Figure 6. Total workflow time (preprocessing, loading plus constructing the graph, and training the GNN) when scaling from one to eight NVIDIA A100 80 GB GPUs

Summary

NVIDIA has partnered with DGL and PyG to add support for graph operations on GPU and optimize preprocessing and training operations. Learn more about how NVIDIA is actively contributing to enhance these top GNN frameworks.

This post has presented an end-to-end fraud detection workflow with GNNs, including preprocessing, modeling tabular data as a graph, training the GNN, using GNN embeddings for downstream tasks, and deployment. The approach makes use of NVIDIA-optimized DGL, with a set of dependencies including RAPIDS cuDF and NVIDIA Triton Inference Server. We further demonstrated benchmarks on two datasets, observing a ~9x end-to-end speedup for R-GCN on the MAG240M dataset on one NVIDIA A100 GPU versus CPU.

To learn more, watch the GTC session, Accelerate and Scale GNNs with Deep Graph Library and GPUs with Da Zheng, a senior applied scientist at AWS. See also Accelerating GNNs with Deep Graph Library and GPUs and Accelerating GNNs with PyTorch Geometric and GPUs, hosted by NVIDIA engineers. 

If you have DGL early access or PyG early access, you can now try containers that are performance-tuned and tested for NVIDIA GPUs. 

Categories
Misc

New Course: Introduction to Robotic Simulations in Isaac Sim

Learn how to use NVIDIA Isaac Sim to tap into the simulation loop of a 3D engine and initialize experiments with objects, robots, and physics logic.


Categories
Misc

Evolving from Network Simulation to Data Center Digital Twin


Digital twins are attracting increasing attention across industries. While the concept is relatively new to many, digital twins are not new to IT, which has for some time recognized the benefits. One such benefit is the value of simulating a network environment. Network operators have been chasing network simulators for years.

Cisco’s Packet Tracer was an early industry network simulator that was quite popular. This simple tool provided a first exposure to network simulation for countless classically trained network admins. Packet Tracer only offered the capability to simulate a handful of generic network devices with a limited list of supported features. Even then, it was easy to see the value network simulation offered to operators.

The rise of data center infrastructure simulation

The capabilities of network simulators have grown immensely over the years, aided by the move to the cloud. Many infrastructure appliances were re-envisioned as cloud-native offerings and run as VMs and containers in the public cloud. They were also ideally suited for data center infrastructure simulation. 

Armed with a plethora of new simulated device images, the value that can be extracted from simulations has increased. What started as network simulation has grown into a new category of holistic data center infrastructure simulation. This increasingly complex environment is also increasingly driven by automation. The adoption of automation is another key driver for the use of network simulation.

Business leaders are realizing that critical business applications sit directly on top of these brittle sets of interacting software systems, and the value of simulation is becoming prominent to business productivity.

The value of data center simulation

The benefits of data center simulation are evident in the data center lifecycle, from planning to building and maintaining (Figure 1).

Figure 1. The data center deployment lifecycle (Day 0, Day 1, and Day 2) and various use cases for a digital twin at each phase

Planning

On Day 0 of the data center deployment lifecycle, you can get ahead of supply chain challenges and model your environment while your hardware is on order. In the time which would normally be spent waiting for equipment to arrive, you can accomplish many preliminary tasks, including:

  • Define cabling architecture
  • Create initial configurations
  • Automate and deploy the entire virtual data center

At this stage, simulation can help you build confidence that your solution is going to work as intended and model the interaction surfaces between your different software systems. For example, you can model your DCIM, your automation platform and the applications themselves, as well as other tools. 

You can also get ahead of multi-vendor issues by verifying interoperability in your virtual Proof of Concept (vPoC). You can train staff on your new solution and build familiarity with your specific deployment even before the first devices have arrived on the loading dock. To learn more, see Close Knowledge Gaps and Elevate Training with Digital Twin NVIDIA Air.

Building 

On Day 1 of deployment, you directly benefit from the lessons learned from using simulation during planning. The resulting configuration, automation and topology information generated in the digital twin can be leveraged to accelerate the deployment of your physical data center. For larger deployments, you can use technology such as Prescriptive Topology Manager to validate the cable plans in physical deployments against the topology built ahead of time in your digital twin to spot cabling issues. 

In environments with many thousands of cables, just answering the question “is it plugged in” can be a monumental task. With the digital twin, cable validation for the whole data center can be done in seconds. Aside from layer one, the digital twin can be used as a reference for the physical deployment to verify the initial state of the control plane in layers one through four.

Maintaining

With the data center deployed, the operational phase of the lifecycle begins. During this phase, you can use the digital twin to model the changes in the environment prior to deployment. You can attach the digital twin to your CI/CD pipeline to automatically validate any configuration or topology change prior to deployment and any resulting impact on connectivity of your applications. 

Additional benefits of the digital twin during this phase are focused on operators, specifically the ability to troubleshoot the virtual environment in ways that would never be allowed in production. This kind of deep troubleshooting and chaos engineering can get ahead of numerous problems that may be hiding in your architecture. It can also rapidly accelerate the onboarding of new personnel, giving them a risk-free learning environment that identically matches production.

Data center digital twins

With current simulation technology, it is now possible to simulate thousands of routers, switches, and data center infrastructure devices fully loaded with configurations. 

The most important aspects of the future of data center digital twins include:

  • Connecting the digital twin with the physical twin and synchronizing the two
  • Increasing the accuracy of simulation such that all relevant behaviors of the data center and network can be simulated completely 

Increased accuracy and synchronization are what separate network simulation from a true digital twin. And every IT administrator should be striving to achieve a true digital twin.

Check out NVIDIA Air to start building your own data center digital twin. Enhance your data center operations with the open network operating system NVIDIA Cumulus Linux.

Categories
Offsites

Have you seen more math videos in your feed recently? (SoME2 results)

Categories
Misc

Detecting Objects in Point Clouds Using ROS 2 and TAO-PointPillars


Accurate, fast object detection is an important task in robotic navigation and collision avoidance. Autonomous agents need a clear map of their surroundings to navigate to their destination while avoiding collisions. For example, in warehouses that use autonomous mobile robots (AMRs) to transport objects, avoiding hazardous machines that could potentially damage robots has become a challenging problem.

This post presents a ROS 2 node for detecting objects in point clouds using a pretrained model from NVIDIA TAO Toolkit based on PointPillars. The node takes point clouds as input from real or simulated lidar scans, performs TensorRT-optimized inference to detect objects in this input data, and outputs the resulting 3D bounding boxes as a Detection3DArray message for each point cloud. 

While multiple ROS nodes exist for object detection from images, the advantages of performing object detection from lidar input include the following:

  • Lidar can calculate accurate distances to many detected objects simultaneously. With object distance and direction information provided directly from lidar, it’s possible to get an accurate 3D map of the environment. To obtain the same information in camera/image-based systems, a separate distance estimation process is required which demands more compute power.
  • Lidar is not sensitive to changing lighting conditions (including shadows and bright light), unlike cameras.

An autonomous system can be made more robust by using a combination of lidar and cameras. This is because cameras can perform tasks that lidar cannot, such as detecting text on a sign. 

TAO-PointPillars is based on work presented in the paper, PointPillars: Fast Encoders for Object Detection from Point Clouds, which describes an encoder to learn features from point clouds organized in vertical columns (or pillars). TAO-PointPillars uses both the encoded features as well as the downstream detection network described in the paper.

For our work, a PointPillar model was trained on a point cloud dataset collected by a solid state lidar from Zvision. The PointPillar model detects objects of three classes: Vehicle, Pedestrian, and Cyclist. You can train your own detection model following the TAO Toolkit 3D Object Detection steps, and use it with this node.

For details on running the node, visit NVIDIA-AI-IOT/ros2_tao_pointpillars on GitHub. You can also check out NVIDIA Isaac ROS for more hardware-accelerated ROS 2 packages provided by NVIDIA for various perception tasks. 

Figure 1. Block diagram of the ROS 2 TAO-PointPillars node, showing the ROS 2 topics it subscribes to and publishes

ROS 2 TAO-PointPillars node 

This section provides more details about using the ROS 2 TAO-PointPillars node with your robotic application, including the input/output formats and how to visualize results. 

Node Input: The node takes point clouds as input in the PointCloud2 message format. Among other information, point clouds must contain four features for each point, (x, y, z, r), where x, y, z, and r represent the X coordinate, Y coordinate, Z coordinate, and reflectance (intensity), respectively.

Reflectance represents the fraction of a laser beam reflected back at some point in 3D space. Note that the range for reflectance values should be the same in the training data and inference data. Parameters including the intensity range, class names, and NMS IoU threshold can be set from the node’s launch file.

You can find ROS 2 bags for testing the node by visiting ZVISION-lidar/zvision_ugv_data on GitHub.

Figure 2. An example of the Zvision camera point of view (left), corresponding point cloud from the lidar (center), and the result after inference using TAO-PointPillars (right). The people and truck are detected correctly.

Node Output: The node outputs 3D bounding box information, object class ID, and score for each object detected in a point cloud, in the Detection3DArray message format. Each 3D bounding box is represented by (x, y, z, dx, dy, dz, yaw), where x, y, and z are the coordinates of the object center; dx, dy, and dz are the length (X direction), width (Y direction), and height (Z direction); and yaw is the orientation in 3D Euclidean space. 
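
As an illustration, a minimal rclpy subscriber that consumes these results might look like the following; the topic name is a placeholder for whatever your launch file configures, and exact hypothesis field names vary slightly between vision_msgs releases.

# A sketch of a ROS 2 node subscribing to Detection3DArray results.
import rclpy
from rclpy.node import Node
from vision_msgs.msg import Detection3DArray

class DetectionListener(Node):
    def __init__(self):
        super().__init__("detection_listener")
        self.create_subscription(Detection3DArray, "tao_pointpillars/detections",
                                 self.on_detections, 10)

    def on_detections(self, msg):
        for det in msg.detections:
            center = det.bbox.center.position   # object center (x, y, z)
            size = det.bbox.size                # dx, dy, dz
            self.get_logger().info(
                f"box at ({center.x:.2f}, {center.y:.2f}, {center.z:.2f}) "
                f"size ({size.x:.2f}, {size.y:.2f}, {size.z:.2f})")

def main():
    rclpy.init()
    rclpy.spin(DetectionListener())

if __name__ == "__main__":
    main()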

The coordinate system used by the model during training and that used by the input data during inference must be the same for meaningful results. Figure 3 shows the coordinate system used by the TAO-PointPillars model.

Figure 3. The coordinate system used by the TAO-PointPillars model. The origin is at the center of the lidar; the X axis points forward, the Y axis to the left, and the Z axis upward. Yaw is the counterclockwise rotation in the X-Y plane, so the X axis corresponds to yaw = 0 and the Y axis to yaw = pi/2.

Since Detection3DArray messages cannot currently be visualized on RViz, you can find a simple tool to visualize results by visiting NVIDIA-AI-IOT/viz_3Dbbox_ros2_pointpillars on GitHub.

For the example shown in Figure 4 below, the frequency of input point clouds is ~10 FPS and of output Detection3DArray messages is ~10 FPS on Jetson AGX Orin.

Figure 4. Counterclockwise from top left: An image from the Zvision camera point of view; the point cloud from the Zvision lidar; and the detection results using TAO-PointPillars

Summary

Accurate object detection in real time is necessary for an autonomous agent to navigate its environment safely. This post showcases a ROS 2 node that can detect objects in point clouds using a pretrained TAO-PointPillars model. (Note that the TensorRT engine for the model currently only supports a batch size of one.) This model performs inference directly on lidar input, which maintains advantages over using image-based methods. For performing inference on lidar data, a model trained on data from the same lidar must be used. There will be a significant drop in accuracy otherwise, unless a method like statistical normalization is implemented.

Categories
Misc

Google Colab’s ‘Pay As You Go’ Offers More Access to Powerful NVIDIA Compute for Machine Learning


Colab’s new Pay As You Go option helps you accomplish more with machine learning.
Access additional time on NVIDIA GPUs with the ability to upgrade to NVIDIA A100 Tensor Core GPUs when you need more power for your ML project.