Categories
Misc

Detecting Financial Fraud Using GANs at Swedbank with Hopsworks and NVIDIA GPUs

Recently, one of Sweden’s largest banks trained generative adversarial neural networks (GANs) using NVIDIA GPUs as part of its fraud and money-laundering prevention strategy. Financial fraud and money laundering pose immense challenges to financial institutions and society. Financial institutions invest huge amounts of resources in both identifying and preventing suspicious and illicit activities. There are … Continued

Recently, one of Sweden’s largest banks trained generative adversarial neural networks (GANs) using NVIDIA GPUs as part of its fraud and money-laundering prevention strategy. Financial fraud and money laundering pose immense challenges to financial institutions and society. Financial institutions invest huge amounts of resources in both identifying and preventing suspicious and illicit activities. There are large institutions reportedly saving $150 million in a single year through the use of AI fraud detection.

Existing approaches to identifying financial fraud and money laundering rely on databases of human-engineered rules that match suspicious patterns in financial transactions. As new schemes are identified, new rules are added to the rules base.

Swedbank has developed new solutions to these problems using combinations of deep learning techniques on GPUs, producing new state-of-the-art solutions for identifying suspicious activities. The approach is to model problems in a semi-supervised fashion using anomaly detection via GANs. The solution requires software and hardware that can scale to process and train models on large volumes of data. Hopsworks has trained models on a dataset as large as 40 terabytes (TBs) in size. To this end, the Hopsworks software platform was used with NVIDIA V100 GPUs to engineer features at scale and efficiently train GANs using many GPUs in parallel.

Rules-based vs. model-based fraud detection

Existing approaches to identifying fraud and money laundering rely on databases of human-engineered rules that attempt to match patterns that are indicative of fraud. As new fraud schemes are identified, new rules are added to the rule engines. For example, in money laundering, there are well-known patterns of smurfing money at lots of accounts and aggregating that money using small, under-the-radar transactions at hubs for later spending.

In the Rules-Based Fraud Detection code example, you can see the rule-based approach to identifying suspicious financial transactions. Here, you define a large set of rules that are applied to all financial transactions. If a transaction matches any of the rules, an alert is triggered. If the alert was incorrectly triggered (false positive), it induces costs. If no alert was triggered, but one should have been (false negative), you must design a new rule to identify the fraud scheme. Companies maintain these rule databases and routinely ship updates to customers.

Rules-Based Fraud Detection

# Rule 1
IF transfersLastDay > 10 && amount > $5k
THEN
 alert
END

# Rule 2
IF country is LISTED && amount > $1k
THEN
 alert
END
… 
# Rule N
… 
Train Fraud Detection Model

dataset=tf.data(“financial_transactions”)
model = …
model.compile(…)
model.fit(dataset, …)

Detect Fraud with Model 

IF model.predict(amount,transfersLastDay,
           country, ….) == TRUE
THEN
 alert
END

Given enough historical financial transaction data, model-based approaches are better at pattern matching than rule-based approaches, as they can generalize to learn fraud schemes that are like existing fraud schemes. In the Train Fraud Detection Model code example, you can see that you must first curate a labeled training dataset: financial_transactions. With that dataset, you can train the model and then the trained model can then be used on new financial transactions to predict if they are fraud or not-fraud. An alert is sent if a financial transaction is suspected of fraud.

GANs are a natural choice for financial fraud prediction as they can learn the patterns of lawful transactions from historical data. For every new financial transaction, the model computes an anomaly score; financial transactions with high scores are labeled as suspicious transactions.

GANs are difficult to both train and deploy in production, needing lots of GPUs and parallel hyperparameter search as well as distributed training support. Great care must be exercised and advanced machine learning experience is certainly desired. One of the GAN implementations is based on the Unsupervised Learning of Anomaly Detection from Contaminated Image Data using Simultaneous Encoder Training paper, which describes a architecture for anomaly detection that tolerates small amounts of incorrectly labeled samples and supports parallel encoder training.

Understanding fraud using a graph representation of entities and transactions

To detect fraudulent patterns and trigger alerts, you can use graph and tabular features and the DL-based GAN techniques described earlier. Graphs consist of nodes, also known as vertices, and edges, also known as arcs. In financial applications, graphs can model transactional interactions of businesses and individuals.

To show the utility of graphs, here’s an example. Mark the businesses and individuals with different titles: businesses are marked as “Corp” and individuals are marked as “Indiv”. The edges are used to represent transactions with associated dates and amounts and the arrows represent the direction of transactions.

There are various expected graph patterns, such as a normal scatter pattern, also known as a dandelion, that happens when an organization pays salaries. Such a pattern occurs on certain dates, salaries are relatively fixed, and the money flow is outbound from a single payer. An anomalous scatter pattern has a sudden burst of transactions that has never been seen previously for involved nodes or bidirectional money flows.

Figure 1 shows a gather-scatter pattern, where money flows initially inbound to the central node in the month of January. These flows are subsequently outbound to other nodes in the month of February. In the world of money-laundering, this gather-scatter pattern is used to hide the distribution of funds from financial institutions. Similarly, Figure 2 shows a scatter-gather pattern that again has a bidirectional flow of money on different dates. In this case, the source and destination of the money are two different central entities.

Central node is a corporation and there are four close nodes (three corporations, one individual) sending money to the central node in January up to February 2. There are four further out nodes (two corporations, two individuals) receiving money later in February from the central node.
Figure 1. Gather-scatter pattern through a central entity.
Originating node is an individual and there are receiving nodes before August 17. Upon August 17 and after those nodes send money to the second central individual node.
Figure 2. Scatter-gather pattern through two central entities. Outflow occurs before August 17 and inflow thereafter.

Based on tabular features as well as graph features, DL-based GAN methods can detect such fraud patterns, with an example based on using Hopsworks on NVIDIA GPUs. Such methods coexist with rule-based techniques to lead to better results, accuracy, and a confusion matrix.

Challenges in modelling fraud as a binary classification problem

Figure 3 shows the confusion matrix of a financial fraud binary classifier. For problems such as money laundering, false negatives should be weighed significantly higher. Use a variant of the F1 score to evaluate models: precision, recall, and fallout should not be weighted equally.

A 2x2 table with true positive of fraud being correct in the upper left and true negative of not-fraud in the lower right. In the upper right is the case of a false positive where fraud is predicted when there is not-fraud. In the lower left is the case of a false negative where no fraud is predicted when fraud is present.
Figure 3. Confusion matrix of a financial fraud binary classifier.

There are other challenges in detecting money laundering patterns:

  • Massive class imbalance—Transactions labeled as suspicious may be less than 0.0001% of total historical transactions.
  • Non-stationarity—New money-laundering schemes are constantly being invented. To identify new patterns as they appear, techniques must be able to adapt themselves or be easily adapted.

Logical Clocks, developers of the open-source Hopsworks platform, have published as open source a full end-to-end example for detecting fraud:

  • A sample raw dataset of financial transactions
  • Feature engineering programs to compute complex features such as graph embedding and store them in a feature store
  • Notebooks to find good hyperparameters for the GANs
  • Distributed training of a GAN using many GPUs.

The code can be reproduced on any Hopsworks cluster, including managed Hopsworks clusters available on AWS, Microsoft Azure, and on-premises installations of Hopsworks with NVIDIA GPUs. Hopsworks clusters can manage up to thousands of GPUs and allocate them to applications on-demand.

NVIDIA GPUs for accelerating financial data science

Identifying fraud and money laundering among large numbers of customer records is a classic financial machine learning (ML) and deep learning (DL) use case. Because it requires many trillion floating point operations (TOPS), applying GPUs accelerates the neural network training process significantly. Many data scientists know that NVIDIA GPUs have been helping in the ML training process for several years.

When the neural network training is complete and the inference phase becomes more important, the recently introduced open-source software NVIDIA Triton Inference Server can help simplify and manage inference acceleration and production model deployment. Triton Server can run as a Docker container, on bare metal, or inside a virtual machine in a virtualized environment. Hopsworks supports serving models on Triton Server using KFServing.

Hopsworks supports ML/DL training using TensorFlow, PyTorch, and Scikit-Learn with additional support for transparent data-parallel training, hyperparameter tuning, and parallel ablation studies on TensorFlow and PyTorch using the Maggy framework. Hopsworks works with multi-GPU, single-node systems as well as clusters of multi-GPU systems. DGX A100 systems are now the universal systems for AI infrastructure for distributed training on the GPU. Each DGX A100 system provides the following configuration:

  • 8x NVIDIA 100 Tensor Core GPUs
  • 80 GB of GPU memory for a total of 640 GB
  • SXM (NVLink) form factor
  • Connected with NVIDIA NV Switches
  • 5 AI petaFLOPS or 10 petaOPS INT8, respectively

Multi-GPU, multi-node DGX A100 systems constitute superpods on the Hopsworks platform, which can accelerate DL training and inferencing workloads considerably. You can achieve similar configurations on NVIDIA GPUs by working with the NVIDIA Partner Network (NPN) of OEM partners, system integrators, and value-added resellers.

Figure 4 shows the architecture of DL systems using Hopsworks that can leverage data-parallel distributed GPU training using TensorFlow CollectiveAllReduceStrategy. The Maggy framework, supported in Hopsworks, facilitates the ease of development with TensorFlow CollectiveAllReduceStrategy on multi-GPU, multi-node systems, through the transparent management of distribution using Spark. Large clusters additionally benefit from GPU interlinks using NVSwitch. In the future, such an architecture will also be possible using the NVIDIA Rapids.ai framework and Spark on GPUs.

Optimizing distributed  training using Hopsworks on NVIDIA-certified multi-GPU, multi-node systems

Image shows four CPUs with pairs communicating and four GPUs per CPU node typically using NVIDIA NVLINK for high-speed communicating.
Figure 4. Multi-GPU, multi-node architecture to train GANs for the anomaly detection task. Note the places where NVLINK is used for high bandwidth.

For inferencing of trained models using other architectures, NVIDIA supports multiple inferencing workloads, concurrent applications, and instances of DL models with the NVIDIA Triton Inferencing Server framework, increasing GPU utilization. Hopsworks clients have used GANs, vision, and other DL models requiring extensive distributed training on the GPU to develop cutting-edge AI systems.

In the following end-to-end money-laundering example from LogicalClocks, a GAN model for anomaly detection was trained on DGX systems using a setup on a multi-GPU, multi-node framework. Training times using such setups can provide almost linear scaling, also known as strong scaling, to accelerate your DL training. Also, inferencing of such models using the Triton Server framework can use the GPUs efficiently.

The number of GPUs from 1 to 4 is on the bottom axis. The samples per second rate of GAN training is on the left side or y-axis. The plotted line goes from 9,000 at 1 GPU to 31,000 at 4 GPUs.
Figure 5. Close to linear speedup when applying NVIDIA V100 GPUs to the problem of GAN training.

You can also use other frameworks including RAPIDS.ai, Spark on GPU, and the NVIDIA GRAPH framework CuGraph for accelerating such features on the GPU on the Hopsworks platform.

Get in touch with us

Teamwork is the key to engineering accurate financial fraud and money laundering solutions. Going from rule-based to model-based approaches is a common technology objective. The goal is to reduce the number of falsely classified outcomes that financial institutions may receive when using fraud detection or money laundering models in production. Nowadays, customers expect more accuracy from their financial firms in terms of preventing fraud and limiting false alarms.

For more information and to share your experiences with this important use case and state-of-the-art approach, please leave a comment below.

Categories
Misc

Best Practices for Using AI to Develop the Most Accurate Retail Forecasting Solution

Jupyter Notebook Best Practices for Using RAPIDS  A leading global retailer has invested heavily in becoming one of the most competitive technology companies around.  Accurate and timely demand forecasting for millions of item-by-store combinations is critical to serving their millions of weekly customers. Key to their success in forecasting is RAPIDS, an open-source suite of GPU-accelerated libraries. RAPIDS helps them tear through their massive-scale data and … Continued

Jupyter Notebook Best Practices for Using RAPIDS 

A leading global retailer has invested heavily in becoming one of the most competitive technology companies around. 

Accurate and timely demand forecasting for millions of item-by-store combinations is critical to serving their millions of weekly customers. Key to their success in forecasting is RAPIDS, an open-source suite of GPU-accelerated libraries. RAPIDS helps them tear through their massive-scale data and has improved forecasting accuracy by several percentage points – it now runs orders of magnitude faster on a reduced infrastructure GPU footprint.  This enables them to respond in real-time to shopper trends and have more of the right products on the shelves, fewer out-of-stock situations, and increased sales.

With RAPIDS, data practitioners can accelerate pipelines on NVIDIA GPUs, reducing data operations including data loading, processing, and training from days to minutes.  RAPIDS abstracts the complexities of accelerated data science by building on and integrating with popular analytics ecosystems like PyData and Apache Spark, enabling users to see benefits immediately. Compared to similar CPU-based implementations, RAPIDS delivers 50x performance improvements for classical data analytics and machine learning (ML) processes at scale which drastically reduces the total cost of ownership (TCO) for large data science operations. 

End-to-end accelerated data science. Data preparation to model training to visualization.
Figure 1. Data science pipeline with GPUs and RAPIDS

To learn and solve complex data science and AI challenges, leaders in retail often leverage what are called ‘Kaggle competitions’. Kaggle is a platform that brings together data scientists and other developers to solve challenging and interesting problems posted by companies. In fact, there have been over 20 competitions for solving retail challenges within the past year. 

Leveraging RAPIDS and best practices for a forecasting competition, NVIDIA Kaggle Grandmaster Kazuki Onodera won 2nd place in the Instacart Market Basket Analysis Kaggle competition using complex feature engineering, gradient boosted tree models, and special modeling of the competition’s F1 evaluation metric.  Along the way, we documented the best practices for ETL, feature engineering, building and customizing the best models for building an AI based Retail forecasting solution.  

This blog post will walk readers through the components of a Kaggle competition to explain data science best practices for improving forecasting in retail.  Specifically, the blog post explains the Instacart Market Basket Analysis Kaggle competition goals, introduces RAPIDS, then offers a workflow to show how to explore the data visually, develop features, train the model, and run a forecasting prediction. Then, the post will dive into some advanced techniques for feature engineering with model explainability and hyperparameter optimization (HPO). 

For an even more detailed look into the methodology, see Kazuki Onodera’s fantastic interview with Medium.com.

Join Paul Hendricks at NVIDIA GTC 2021, where he hosts a session on Best Practices for ETL, Feature Engineering, and Model Development for Retail Forecasting Using NVIDIA RAPIDS Data Science Libraries.

Access this Jupyter notebook, where we share these best practices for GPU-accelerated forecasting within the context of the Instacart Market Basket Analysis Kaggle competition.

The Forecasting Challenge 

Instacart Market Basket Analysis competition challenged Kagglers to predict which grocery products a consumer will purchase again and when. Imagine, for example, having milk ready to be added to your cart right when you run out, or knowing that it’s time to stock up again on your favorite ice cream. 

This focus on understanding temporal behavior patterns makes the problem fairly different from standard item recommendation, where user needs and preferences are often assumed to be relatively constant across short windows of time. Whereas Netflix might be fine assuming you want to watch another movie like the one you just watched, it’s less clear that you’ll want to reorder a fresh batch of almond butter or toilet paper if you bought them yesterday. 

Problem Overview 

The goal of this competition was to predict grocery reorders: given a user’s purchase history (a set of orders, and the products purchased within each order), which of their previously purchased products will they repurchase in their next order? 

The problem is a little different from the general recommendation problem, where we often face a cold start issue of making predictions for new users and new items that we’ve never seen before. For example, a movie site may need to recommend new movies and make recommendations for new users. 

The sequential and time-based nature of the problem also makes it interesting: how do we take the time since a user last purchased an item into account? Do users have specific purchase patterns and do they buy different kinds of items at different times of the day? 

To get started, we’ll first load some of the modules we’ll be using in this notebook and set the random seed for any random number generator we’ll be using. 

Modules to predict grocery reorders.

RAPIDS Overview 

Data scientists typically work with two types of data: unstructured and structured. Unstructured data often comes in the form of text, images, or videos. Structured data – as the name suggests – comes in a structured form, often represented by a table or CSV. We’ll focus the majority of the tutorials on working with these types of data.

There are many tools in the Python ecosystem for structured, tabular data but few are as widely used as pandas. pandas represents data in a table and allows a data scientist to manipulate the data to perform a number of useful operations such as filtering, transforming, aggregating, merging, visualizing, and many more.

For more information on pandas, check out the excellent documentation here: http://pandas.pydata.org/pandas-docs/stable/

pandas is fantastic for working with small datasets that fit into your system’s memory. However, datasets are growing larger and data scientists are working with increasingly complex workloads – the need for accelerated compute arises.

cuDF is a package within the RAPIDS ecosystem that allows data scientists to easily migrate their existing pandas workflows from CPU to GPU, where computations can leverage the immense parallelization that GPUs provide. 

Getting familiar with the data

The dataset for this competition contains several files capturing orders from Instacart users over time, with the goal of the competition to predict if a user will re-order a product and specifically, which products will those customers will re-order. From the Kaggle data description (https://www.kaggle.com/c/instacart-market-basket-analysis/data), we see that we have over three million grocery orders with a customer base of over 200,000 Instacart users. And that for each user, we are provided between 4 and 100 of their orders, with the sequence of products purchased in each order as well as the time of their orders and a relative measure of time between orders. Also provided are the week and hour of the day the order was placed, and a relative measure of time between orders.

Our products, aisles, and departments datasets are composed of metadata about our products, aisles, and departments respectively. Each dataset (products, aisles, departments, and orders, etc.) has a unique identifier mapping for each entity in that dataset e.g. order_id represents a unique order within the orders dataset, product_id represents a unique product within the products dataset, etc. We’ll use these unique identifiers later to combine all of these separate datasets into one coherent view for exploratory data analysis, feature engineering, and modeling.

Below, we will read in our data and inspect our different tables using cuDF. 

Products, aisles and departments datasets using cuDF.
Datasets for days since prior order and order hour of the day.

Additionally, we’ll read in our orders datasets. The first indicates to which set (prior, train, test) an order belongs. Additional files specify which products were purchased in each order. Again, from the Kaggle description of the data, we see that the order_products__prior.csv contains previous order contents for all customers. And that the column ‘reordered’ indicates that the customer has a previous order that contains the product. We are informed that some orders will have no reordered items.

Datasets for orders including days since prior order and order hour of the day.
Datasets for orders including days since prior order and order hour of the day.

Exploring the data

When we think about our data science workflow, one of the most important steps is Exploratory Data Analysis. This is where we examine our data and look for clues and insights into which features we can use (or need to create) to feed our model. There are many ways to explore the data and each Exploratory Data Analysis is different for each problem – however, it still remains incredibly important as it informs our feature engineering process, ultimately determining how accurate our model will be. 

In the notebook, we look at a couple different cross sections of the day. Specifically, we examine the distribution of the order counts, the days of week and times customers typically place orders, the distribution of number of days since the last order, and the most popular items across all orders and unique customers (de-duplicating so as to ignore customers who have a “favorite” item that they place repeated orders for). 

Distribution of the order counts, the days of week and times customers typically place orders.
Figure 2. Exploring the data

From this we see respectively that: 

  • There are no orders less than 4 and max is capped at 100. 
  • The orders are high Saturday and Sunday (days 0 and 1) and low during Wednesday. 
  • The majority of orders are made during the daytime. And customers primarily order once a week or month (see the peaks at days 7 and 30). 

Similar exploratory analysis for product popularity is provided in the notebook.

Feature Engineering

If Exploratory Data Analysis is the most important part of our data science workflow, Feature Engineering is a close second. This is where we identify which features should be fed into the model and create features where we believe they might be able to help the model do a better job of predicting. 

Feature Engineering is where we identify which features should be fed into the model and create features where we believe they might be able to help the model do a better job of predicting.
Figure 3: Machine Learning is an iterative process.

We start by just identifying our unique User X Item combinations and sorting them. We’ll create a dataset where each user maps to their most recent order number, day of week and hour, and how many days it’s been since that order. And we’ll extend our dataset, creating labels and features to be used later in our machine learning model, such as: 

  • How many kinds of products have the user ordered? 
  • How many products have the user ordered within one cart? 
  • From which departments have the user-ordered products? 
  • When has the user ordered products (day of the week)?  
  • Has this user ordered this product at least once before? 
  • How many orders have a user placed that have included this item? 

Solving for the business problem (train and predict)

The mathematical operations underlying many machine learning algorithms are often matrix multiplications. These types of operations are highly parallelizable and can be greatly accelerated using a GPU. RAPIDS makes it easy to build machine learning models in an accelerated fashion while still using a nearly identical interface to Scikit-Learn and XGBoost. 

There are many ways to create a model – one can use Linear Regression models, SVMs, tree-based models like Random Forest and XGBoost, or even Neural Networks. In general, tree-based models tend to work better with tabular data for forecasting than Neural Networks. Neural Networks work by mapping the input (feature space) to another complex boundary space and determining what values should belong to those points within that boundary space (regression, classification). Tree-based models on the other hand work by taking the data, identifying a column, and then finding a split point in that column to map a value to, all the while optimizing the accuracy. We can create multiple trees using different columns, and even different columns within each tree.

For a more detailed description of tree-based models XGBoost, see this fantastic documentation: https://xgboost.readthedocs.io/en/latest/tutorials/model.html 

In addition to their better accuracy performance, tree-based models are very easy to interpret (important for when predictions or decisions resulting from the predictions must be explained and justified, maybe for compliance and legal reasons e.g. finance, insurance, healthcare). Tree-based models are very robust and work well even when there is a small set of data points.  

In the section below, we’ll set the different parameters for our XGBoost model and train five different models – each on a different subset of users to avoid overfitting to a particular set of users. 

import xgboost as xgb

NFOLD = 5
PARAMS = {
    'max_depth':8, 
    'eta':0.1,
    'colsample_bytree':0.4,
    'subsample':0.75,
    'silent':1,
    'nthread':40,
    'eval_metric':'logloss',
    'objective':'binary:logistic',
    'tree_method':'gpu_hist'
         }

models = []
for i in range(NFOLD):
    train_ = train[train.user_id % NFOLD != i]
    valid_ = train[train.user_id % NFOLD == i]
    dtrain = xgb.DMatrix(train_.drop(['user_id', 'product_id', 'label'], axis=1), train_['label'])
    dvalid = xgb.DMatrix(valid_.drop(['user_id', 'product_id', 'label'], axis=1), valid_['label'])    
    model = xgb.train(PARAMS, dtrain, 9999, [(dtrain, 'train'),(dvalid, 'valid')],
                      early_stopping_rounds=50, verbose_eval=5)
    models.append(model) 
    break

There are several parameters that should be set before XGBoost can be run. 

  • General parameters relate to which booster we are using to do boosting, commonly tree or linear model.
  • Booster parameters depend on which booster you have chosen. 
  • Learning task parameters decide on the learning scenario. For example, regression tasks may use different parameters with ranking tasks. 

For more information on the configurable parameters within the XGBoost module, see the documentation here: https://xgboost.readthedocs.io/en/latest/parameter.html 

Feature Importance

Once we’ve trained our models, we might want to look at the internal workings and understand which of the features we’ve crafted are contributing the most to the predictions. This is called Feature Importance. One of the advantages for tree-based models for forecasting is that understanding the differing importance of our features is very easy. 

With understanding how our features contribute to the model accuracy, we can choose to remove features that aren’t important or try to iterate and create new features, re-train, and re-assess if those new features are more important. Ultimately, being able to iterate quickly and try new things in this workflow will lead to the most accurate model and the greatest ROI (for forecasting, oftentimes cost-savings from reduced out-of-stock and poorly placed inventory). Iteration traditionally can take a significant amount of time due to computational intensity. RAPIDS allows users to churn through model iteration with NVIDIA accelerated computing so users can iterate quickly and determine the best performing model. 

In the Feature Importance section of the notebook, we define convenience code to access the importance of the features in each model. We then pass in our list of models that we trained, iterate over them one by one, and average the importance of each variable across all the models. Lastly, we visualize feature importance using a horizontal bar chart. 

We see specifically that three of our features are contributing the most to our predictions: 

  • user_product_size – How many orders has a user placed that has included this item?
  • user_product_t-1 – Has this user ordered this product at least once before?
  • order_number – The number of orders that user has created.
Figure 4: Determining top features.

All of this makes sense and aligns with our understanding of the problem. Customers who have placed an order for an item before are more likely to repeat an order for that product, and users who place multiple orders of that product are even more likely to re-order. Additionally, the number of orders a customer has created correlates with their likelihood of re-ordering. 

The code uses the default XGBoost implementation of feature importance – but we are free to choose any implementation or technique. A wonderful technique (also developed by an NVIDIA Kaggle Grandmaster, Ahmet Erdem) is called LOFO

From the description of the LOFO GitHub page, we see that LOFO (Leave One Feature Out) Importance calculates the importance of a set of features based on a metric of choice, for a model of choice, by iteratively removing each feature from the set, and evaluating the performance of the model, with a validation scheme of choice, based on the chosen metric. And that LOFO first evaluates the performance of the model with all the input features included, then iteratively removes one feature at a time, retrains the model, and evaluates its performance on a validation set.

This methodology allows us to effectively determine which features are important for the model. LOFO has several advantages compared to other importance types:

  • It does not favor granular features.
  • It generalizes well to unseen test sets. 
  • It is model agnostic. 
  • It gives negative importance to features that hurt performance upon inclusion. 

For more information on LOFO, see here: https://github.com/aerdem4/lofo-importance 

Hyperparamater Optimization (HPO) 

When we trained our XGBoost models, we used the following parameters: 

PARAMS = {    'max_depth':8,     'eta':0.1,    'colsample_bytree':0.4,    'subsample':0.75,    'silent':1,    'nthread':40,    'eval_metric':'logloss',    'objective':'binary:logistic',    'tree_method':'gpu_hist'         } 

Of these, only a few may be changed and effect the accuracy of our model: max_depth, eta, colsample_bytree, and subsample. However, these may not be the most optimal parameters. The art and science of identifying and training models with the model optimal hyperparamers is called hyperparameter optimization. 

While there is no magic button one can press to automatically identify the most optimal hyperparameters, there are techniques that allow you to explore the range of all possible hyperparameter values, quickly test them, and find values that are closest. 

A full exploration of these techniques is beyond the scope of this notebook. However, RAPIDS is integrated into many Cloud ML Frameworks for doing HPO as well as with many of the different open source tools. And being able to use the incredible speedups from RAPIDS allows you to go through your ETL, feature engineering, and model training workflow very quickly for each possible experiment – ultimately resulting in fast HPO explorations through large hyperparameter spaces and a significant reduction in Total Cost of Ownership (TCO). 

Conclusion 

In this blog, we walked through the components of a Kaggle competition to explained data science best practices for improving forecasting in retail.  Specifically, the blog post explained the Instacart Market Basket Analysis Kaggle competition goals, introduced RAPIDS, then offered a workflow to show how to explore the data visually, develop features, train the model, and run a forecasting prediction. Then, the post reviewed techniques for feature engineering with model explainability and hyperparameter optimization (HPO).  

To learn more, be sure to: 

  • See this Jupyter notebook on forecasting
    where we show best practices for GPU accelerated forecasting within the context of the Instacart Market Basket Analysis Kaggle competition in which NVIDIA Kaggle Grandmaster Kazuki Onodera won 2nd place, using complex feature engineering, gradient boosted tree models and special modeling of the competition’s F1 evaluation metric.   
  • Read Kazuki Onodera’s detailed interview at Medium.com
Categories
Misc

Beginner’s Guide to GPU- Accelerated Event Stream Processing in Python

This tutorial is the seventh installment of introductions to the RAPIDS ecosystem. The series explores and discusses various aspects of RAPIDS that allow its users solve ETL (Extract, Transform, Load) problems, build ML (Machine Learning) and DL (Deep Learning) models, explore expansive graphs, process geospatial, signal, and system log data, or use SQL language via … Continued

This tutorial is the seventh installment of introductions to the RAPIDS ecosystem. The series explores and discusses various aspects of RAPIDS that allow its users solve ETL (Extract, Transform, Load) problems, build ML (Machine Learning) and DL (Deep Learning) models, explore expansive graphs, process geospatial, signal, and system log data, or use SQL language via BlazingSQL to process data.

In the age of the Internet, abundant IoT devices, social media, web servers, and more, data flows at incredible speeds. In 2019, Forbes reported that every minute, Americans use approximately 4.4PB of internet data: which converts to roughly 1MB of data per Internet user per minute.

Not only is the volume of data increasing over time, but so are the speeds at which data arrives. Over the years, we went from dial-up modem connections with speeds up to 56kbit in the early 1990s to contemporary 10Gbit networks starting gaining some popularity. 1Gbit networks are still the most widely used type of interconnecting devices at home and in the office unless you are on a WiFi network. 

Many of the Internet services offered these days rely on prompt and fast processing of this constant waterfall of data. cuStreamz is one of the newer additions to the RAPIDS stack. It aims to take the streaming data processing historically done on CPU and accelerate on the GPU. Thanks to GPUs’ immense parallelism, processing streaming data has now become much faster with a friendly Python interface.

In the previous posts we showcased other areas: 

Today, we talk about cuStreamz – a library that uses GPUs to process streaming data. To help get familiar with cuStreamz, we also published a cheat sheet that can be downloaded here cuStreamz cheatsheet, and an interactive notebook with all the current functionality of cuStreamz showcased here.

Streaming frameworks

First released in 2011, Apache Kafka has quickly become a standard for managing vast quantities of fast-moving data with low latency and high-level APIs. Kafka is a distributed platform that maintains a list of topics that systems can subscribe to (the so-called, consumers), and publish their data onto (the producers). Data in Kafka, like many other distributed systems, is replicated among multiple workers (or brokers): if any of the brokers disconnects from the cluster, or otherwise dies, the data is not lost and still available from other brokers. This improves the resiliency and availability of the system that is required by today’s Internet service companies.

Streamz is a Python framework that focuses on processing high-velocity data and allows for branching, joining, controlling the flow of messages, and sinking the data to disk or other streams. Here’s what a set of distinct pipelines might look like;

Figure 1: Source: https://streamz.readthedocs.io/en/latest/_images/complex.svg

The pipeline can branch into multiple branches. A popular Lambda architecture also implements two branches: one to process fast-moving, near real-time data, and another one to provide batch processing.

RAPIDS cuStreamz builds on top of the streamz framework and allows the messages to be batched into cuDF DataFrames instead of text messages. This, on its own, enables significant speed-ups of processing of messages that purport to the same schema by tapping into the power of GPUs. Also, if the data volume cannot fit on a single machine, cuStreams supports pushing the data processing using Dask-cuDF DataFrames.

Setting up locally

It is easy to get started. In this section, we will show you how to set up your own mini-Kafka cluster using Docker. To use cuStreamz, you will, of course, need an NVIDIA GPU with Pascal architecture (GTX 1000-series) or newer as required by RAPIDS.

To get started with Kafka, you need to install Docker and Docker-compose: the installation instructions for Docker can be found here https://docs.docker.com/engine/install/ while the Docker-compose installation manual is here https://docs.docker.com/compose/install/. Please note that you will need a Linux machine to run this as neither Windows nor MacOSX is officially supported: https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#is-microsoft-windows-supported.

To use cuStreamz, your machine will need NVIDIA drivers and CUDA environment present (instructions to follow can be found here https://developer.nvidia.com/cuda-downloads) and NVIDIA-docker addition so Docker can connect to your GPU: find it here https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker.

Kafka cluster

Next, let’s set up our Kafka cluster. If you clone the Github repository, inside the cheatsheets/cuStreamz folder navigate to Kafka and open docker-compose.yaml file.

Docker-compose uses the YAML configuration files to set up the whole cluster. The first service we start is the zookeeper. Zookeeper is a service used to track naming and configuration data for Kafka; it maintains information about the cluster nodes’ status and their topics, partitions, replication, etc. Besides, the Zookeeper service allows multiple clients to carry out concurrent reads and writes to the service to keep up with the volume and velocity of the incoming and outgoing data calls.

services:
 zookeeper:
   image: 'confluentinc/cp-zookeeper:5.4.3'
   hostname: zookeeper
   networks:
     - kafka
   environment:
     - ZOO_MY_ID=1
     - ZOOKEEPER_CLIENT_PORT=2181
     - ZOO_SERVERS=zookeeper:2888:3888
   ports:
     - 2181:2181
   volumes:
     - ./data/zookeeper/data:/data
     - ./data/zookeeper/datalog:/datalog

In this example, we use the cp-zookeeper:5.4.3 image from Confluent to start our Zookeeper service; the server started will be named zookeeper. The Zookeeper service can be replicated among multiple servers, so it can become resilient; the Zookeeper servers talk to each other on port 2888, and the leader-of-the-pack runs on port 3888. Clients that want to use the Zookeeper connect to the service on port 2181, and that port gets forwarded to the host via the config ports. We also map some host folders to the container so the data that Zookeeper stores is persisted.

Next, we start two Kafka worker nodes (one shown here for brevity).

kafka0:
 image: confluentinc/cp-kafka:5.4.3
 hostname: kafka0
 networks:
   - kafka
 ports:
   - "9092:9092"
 environment:
   KAFKA_LISTENERS: LISTENER_DOCKER_INTERNAL://kafka0:19092,LISTENER_DOCKER_EXTERNAL://${DOCKER_HOST_IP:-127.0.0.1}:9092
   KAFKA_ADVERTISED_LISTENERS: LISTENER_DOCKER_INTERNAL://kafka0:19092,LISTENER_DOCKER_EXTERNAL://${DOCKER_HOST_IP:-127.0.0.1}:9092
   KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: LISTENER_DOCKER_INTERNAL:PLAINTEXT,LISTENER_DOCKER_EXTERNAL:PLAINTEXT
   KAFKA_INTER_BROKER_LISTENER_NAME: LISTENER_DOCKER_INTERNAL
   KAFKA_ZOOKEEPER_CONNECT: "zookeeper:2181"
   KAFKA_BROKER_ID: 0
   KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
 volumes:
   - ./data/kafka0/data:/var/lib/kafka/data
 depends_on:
   - zookeeper

The cp-kafka image comes from the Confluent’s Docker Hub; here, we also use version 5.4.3. There are plenty of environmental variables but let’s review just the most important from our point of view:

  • KAFKA_LISTENERS identifies a list of server names and ports the server will be listening to. Note that the external and internal ports are different: to facilitate communication between multiple docker containers the server will be placed on the Docker internal network (in our case kafka_kafka) and the kafka0 server will be listening on port 19092. If you would like to connect to this service from the host you can use the localhost and port 9092. The same list is provided in the KAFKA_ADVERTISED_LISTENERS environmental variable.
  • KAFKA_INTER_BROKER_LISTENER_NAME tells the Docker which server name to use for internal communication between containers: in our case, this is LISTENER_DOCKER_INTERNAL but any recognizable name should work. Should you, however, change this name you will have to change the KAFKA_LISTENERS and the KAFKA_ADVERTISED_LISTENERS.
  • KAFKA_ZOOKEEPER_CONNECT specifies the address of the zookeeper to connect to; in our case, that is zookeeper: 2181.
  • KAFKA_BROKER_ID is a unique identifier of the kafka node and by convention should be included in the name of the service and server name.

We also identify the zookeeper as a service this container depends on.

To start all these services, simply navigate to the folder where the docker-compose.yaml file is saved and run docker-compose up in the terminal (if you want to stop the service press Ctrl-C or from another terminal window type docker-compose down). Once the services are running, you can check the list of all containers by running docker ps command.

With all the services running, let’s create a sample topic. Run the following command in the terminal.

docker exec -ti  bash

Once inside, run the following command.

kafka-topics.sh --create --zookeeper zookeeper:2181 --replication-factor  --partitions  --topic test

Now, you should be able to subscribe to the topic test to either sink or consume the messages. Your Kafka service is running!!!

Let’s get streaming!

In this example, we will be using the official RAPIDS container. Go ahead and pull the latest one following the examples here https://rapids.ai/start.html. Start the container using the command listed on the RAPIDS website. You should now be able to navigate to https://localhost:8888 and access JupyterLab.

Before we move forward, we need to connect this container to the kafka_kafka network: do so with the following command from the terminal.

docker network connect kafka_kafka 

From now on, we should be able to access the kafka0:19092 server from the RAPIDS container.

Note that if you do not have custreamz available in your container, you can install it using the following command.

conda install -c rapidsai -c rapidsai-nightly -c nvidia -c conda-forge -c defaults custreamz python=3.7 cudatoolkit=11

Next, let’s subscribe to our test topic with cuStreamz!

import streamz

consumer_conf = {'bootstrap.servers': 'kafka0:19092',
                 'group.id': 'custreamz',
                 'session.timeout.ms': '60000'
                }

source = streamz.Stream.from_kafka_batched(
    'test'
    , consumer_conf
    , poll_interval='2s'
    , asynchronous=True
    , dask=False
    , engine="cudf"
    , start=False
)

We will be using the .from_kafka_batched(...) method to subscribe as this allows us to use the CUDA Kafka connector and return the messages in the form of a cudf DataFrame. The first parameter specifies the topic name and is followed by the dictionary with configuration. Next, we set up the interval the stream object will be checking the Kafka topic for new messages; 2 seconds in this example. The engine set cudf specifies that the messages should be returned as DataFrames. We can now provide the rest of the pipeline and start the listener.

from streamz.dataframe import DataFrame

def process_batch(messages):
    batch_df = cudf.DataFrame()
    
    for message in messages:
        df_split = messages[message].str.tokenize()
        df_split = (
            df_split
            .to_frame('word')
            .reset_index()
            .groupby(by='word')
            .agg({'index': 'count'})
            .rename(columns={'index': 'count'})
            .reset_index()
        )
        print("nWord Count for this batch:")
        
        batch_df = cudf.concat([batch_df, df_split])
    
    return batch_df

stream_df = source.map(process_batch)

# Create a streamz dataframe to get stateful word count
sdf = DataFrame(stream_df, example=cudf.DataFrame({'word':[], 'count':[]}))

# Formatting the print statements
def print_format(sdf):
    print("nGlobal Word Count:")
    return sdf

# Print cumulative word count from the start of the stream, after every batch. 
# One can also sink the output to a list.
sdf.groupby('word').sum().stream.gather().map(print_format)

After this run;

source.start()

Et voila! We now have a running listener to the test topic!

The code here is pretty self-explanatory, but at the high level, we expect the message to come as a DataFrame. We will count all the words occurring in the message by using the .tokenize() functionality of RAPIDS cudf and then count the number of individual words. Finally, we create a Streamz DataFrame that we use to produce the final tally of words by summing the occurrences of each word.

With the consumer running now, let’s produce some messages! Open a new notebook and install kafka-python package by running in a cell.

 !pip install kafka-python  

Next, we start a producer.

from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers='kafka0:19092'
    , value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

The bootstrap_servers is the address of our kafka0 server. Every message we will emit will be JSON string UTF-8 encoded. Now we can start pushing the messages onto the topic message bus:

producer.send('test',{'text': 'RAPIDS rocks!'})

What your notebook with cuStreamz consumer running should produce is a DataFrame with index being RAPIDS and rocks! rows, and a count 1 against each of these words. You can now play more with it!

With the introduction of cuStreamz, the RAPIDS ecosystem can speed up the processing of fast-moving data. You can try the above examples and more for yourself at app.blazingsql.com and download the cuStreamz cheatsheet here.

Categories
Misc

Noob looking for help

Hello.

I’m looking to put on a project for school, a RNN to predict stock prices (not really innovative i know).

All the programs i see to learn how to make this model has a very strange feedback 50% goes like “you rock this is great” the other 50% goes like “this makes no sense”.

Is this project something hard to do with Keras+TensorFlow?

Any 1 can help me navigate the immensive amount of documents about TF to get me started?

Where i can find the maths behind this models?

submitted by /u/Impossible-Hunter224
[visit reddit] [comments]

Categories
Misc

Why I am getting less accuracy and big loss?

I chose the TensorFlow estimator for implementation due to having a distributed training API support. Well, honestly saying, I found a code, which was quite understandable. So I chose that to implement sensor-based signal recognition on multiple GPUs.

The code was basically for MNIST data set training on multiple GPUs. I got the error after executing the code, and that error was coming due to not working mnist dataset API downloading. Here is the link to that code. https://github.com/shu-yusa/tensorflow-mirrored-strategy-sample/blob/master/cnn_mnist.py

I could not found any solution on google. There might be are multiple issues behind that; implementing in Tensorflow1 can be one of them. I tried to convert that code into Tensorflow2. Mostly code is transformed; however, tf.contrib related things did not restore. So I decided to edit that code for sensor-based signal (Time series).

However, when I ran the code, the accuracy was 30%, and the lost value was bigger. On the other hand, when I implemented CNN on the same data set on Low-level tensor API, I received 95% accuracy. Now I do not know why it is giving low accuracy on the tf estimator. In my opinion, one of the reasons can be wrong input feeding to the CNN. Here is the code:

def cnn_model_fn(features, labels, mode):

“””Model function for CNN.”””

# Input Layer

# Reshape X to 4-D tensor: [batch_size, width, height, channels]

# input 1 * segment_size, and have three channel, in accelrometer we have x, y, z

input_layer = tf.reshape(features[“x”], [-1, 1, segment_size, num_input_channels])

# Convolutional Layer #1

# Computes 32 features using a 5×5 filter with ReLU activation.

# Padding is added to preserve width and height.

# Input Tensor Shape: [batch_size, 28, 28, 1]

# Output Tensor Shape: [batch_size, 28, 28, 32]

conv1 = tf.compat.v1.layers.conv2d(

inputs=input_layer,

filters=32,

kernel_size=[1, 12],

padding=”same”,

activation=tf.nn.relu)

# Pooling Layer #1

# First max pooling layer with a 2×2 filter and stride of 2

# Input Tensor Shape: [batch_size, 28, 28, 32]

# Output Tensor Shape: [batch_size, 14, 14, 32]

pool1 = tf.compat.v1.layers.max_pooling2d(inputs=conv1, pool_size=[1, 4], strides=2, padding=’same’)

# Convolutional Layer #2

# Computes 64 features using a 5×5 filter.

# Padding is added to preserve width and height.

# Input Tensor Shape: [batch_size, 14, 14, 32]

# Output Tensor Shape: [batch_size, 14, 14, 64]

conv2 = tf.compat.v1.layers.conv2d(

inputs=pool1,

filters=64,

kernel_size=[1, 12],

padding=”same”,

activation=tf.nn.relu)

# Pooling Layer #2

# Second max pooling layer with a 2×2 filter and stride of 2

# Input Tensor Shape: [batch_size, 14, 14, 64]

# Output Tensor Shape: [batch_size, 7, 7, 64]

pool2 = tf.compat.v1.layers.max_pooling2d(inputs=conv2, pool_size=[1, 4], strides=2, padding=’same’)

# Flatten tensor into a batch of vectors

# Input Tensor Shape: [batch_size, 7, 7, 64]

# Output Tensor Shape: [batch_size, 7 * 7 * 64]

pool2_flat = tf.reshape(pool2, [-1, 1 * 50 * 64])

# Dense Layer

# Densely connected layer with 1024 neurons

# Input Tensor Shape: [batch_size, 7 * 7 * 64]

# Output Tensor Shape: [batch_size, 1024]

dense = tf.compat.v1.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)

# Add dropout operation; 0.6 probability that element will be kept

dropout = tf.compat.v1.layers.dropout(

inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN)

# Logits layer

# Input Tensor Shape: [batch_size, 1024]

# Output Tensor Shape: [batch_size, 10]

logits = tf.compat.v1.layers.dense(inputs=dropout,

units=6) # unit =10 in our case we have 6 classes so will 6 units at last layer

predictions = {

# Generate predictions (for PREDICT and EVAL mode)

“classes”: tf.argmax(input=logits, axis=1),

# Add softmax_tensor` to the graph. It is used for PREDICT and by the`

# logging_hook`.`

“probabilities”: tf.nn.softmax(logits, name=”softmax_tensor”)

}

if mode == tf.estimator.ModeKeys.PREDICT:

return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

# labels = tf.argmax(tf.cast(labels, dtype=tf.int32), 1)

# Calculate Loss (for both TRAIN and EVAL modes)

loss = tf.compat.v1.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

# here we define how we calculate our accuracy

# if you want to monitor your training accuracy you need these two lines

# accuracy = tf.compat.v1.metrics.accuracy(labels=labels, predictions=predictions[‘classes’], name=’acc_op’)

# tf.summary.scalar(‘accuracy’, accuracy[1])

# Configure the Training Op (for TRAIN mode)

if mode == tf.estimator.ModeKeys.TRAIN:

optimizer = tf.compat.v1.train.GradientDescentOptimizer(learning_rate=0.001)

train_op = optimizer.minimize(

loss=loss,

global_step=tf.compat.v1.train.get_global_step())

return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

# Add evaluation metrics (for EVAL mode)

eval_metric_ops = {

“accuracy”: tf.compat.v1.metrics.accuracy(labels,

predictions=predictions[“classes”])}

return tf.estimator.EstimatorSpec(

mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)

After debugging:

in `def cnn_model_fn(features, labels, mode):` I am getting {‘x’: <tf.Tensor ‘IteratorGetNext:0’ shape=(?, 600) dtype=float64>} and in labels Tensor(“IteratorGetNext:1”, shape=(?,), dtype=int64) and in mode {str} train.

**Here the result on test data:**

Saving ‘checkpoint_path’ summary for global step 1000: /tmp/tmp77ffy2i9/model.ckpt-1000

{‘accuracy’: 0.3959022, ‘loss’: 1.698279, ‘global_step’: 1000}**

Can anyone help why my model giving less accuracy and huge loss value?

submitted by /u/Nafees060
[visit reddit] [comments]

Categories
Misc

Shackling Jitter and Perfecting Ping, How to Reduce Latency in Cloud Gaming

Looking to improve your cloud gaming experience? First, become a master of your network. Twitch-class champions in cloud gaming shred Wi-Fi and broadband waves. They cultivate good ping to defeat two enemies — latency and jitter. What Is Latency?  Latency or lag is a delay in getting data from the device in your hands to Read article >

The post Shackling Jitter and Perfecting Ping, How to Reduce Latency in Cloud Gaming appeared first on The Official NVIDIA Blog.

Categories
Misc

LSTM or Transformer for multi input/output time Series prediction?

Hello everybody, I’m about to write a kind of weather forecasting neural network and do not really find something good about transformers… My initial thought was a classic LSTM but in the last days I heard a lot of good stuff about Transformers, especially with multiple Input variables. Does Anybody here know more about that topic and would love to explain to me why I would use LSTM, Transformers or something else?

The data input will be a list of variables like temperature, humidity, elevation, wind speed, wind direction, …

The data output should be a 0-100 possibility of a specific event to occur.

I have some Billion labeled Data Points of from historical data.

Thx for your Help!

submitted by /u/deep-and-learning
[visit reddit] [comments]

Categories
Misc

GAN + TFLite + Golang: A simple example

submitted by /u/i8code
[visit reddit] [comments]

Categories
Misc

Unable to decode bytes as JPEG, PNG, GIF, or BMP

I get this error using this repo https://github.com/aydao/stylegan2-surgery . I checked all Images in the dataset, their not corrupted and all have the right file format

submitted by /u/shinysamurzl
[visit reddit] [comments]

Categories
Misc

Converting PyTorch and TensorFlow Models into Apple Core ML using CoreMLTools

Converting PyTorch and TensorFlow Models into Apple Core ML using CoreMLTools submitted by /u/analyticsindiam
[visit reddit] [comments]