
TensorFlow 1.14, Fix: “google.protobuf.message.DecodeError”: Error parsing message

With protobuf v3.15 I get google.protobuf.message.DecodeError when using tf.Graph() to load the TensorFlow model into memory. After changing the tf.Graph() snippet below to TensorFlow v2 calls, I still get the same error.

I have tried protobuf 3.12.4 (the same version as on Colab), and the same error appeared.

https://stackoverflow.com/questions/66842689/tensorflow-1-14-fix-google-protobuf-message-decodeerror-error-parsing-mess

Traceback (most recent call last):
  File "object_detection/webcam.py", line 25, in <module>
    od_graph_def.ParseFromString(serialized_graph)
google.protobuf.message.DecodeError: Error parsing message
[ WARN:0] global C:\projects\opencv-python\opencv\modules\videoio\src\cap_msmf.cpp (674) SourceReaderCB::~SourceReaderCB terminating async callback

I have reinstalled different protobuf versions and am still getting the same error.

I have trained an “SSD MobileNet” model using TensorFlow 1.14 (CPU) for webcam object detection with OpenCV. After installing the required TensorFlow libraries, I ran model_builder_tf1.py and it successfully passed all 21 tests.

Snippet to load the TensorFlow model into memory using tf.Graph():

detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.compat.v1.GraphDef()
    with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, 'rb') as fid:
        serialized_graph = fid.read()
        od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name='')
    sess = tf.compat.v1.Session(graph=detection_graph)

Note that TensorFlow 1.14 is installed in a conda environment.

Using protobuf==3.8, another error appeared:

AttributeError: module 'google.protobuf.descriptor' has no attribute '_internal_create_key'

Can someone please give a solution to this problem?

submitted by /u/jhivesh


Doubling Network File System Performance with RDMA-Enabled Networking


This post was originally published on the Mellanox blog.

Network File System (NFS) is a ubiquitous component of most modern clusters. It was initially designed as a work-group filesystem, making a central file store available to and shared among several client servers. As NFS became more popular, it was used for mission-critical applications, which required access to storage. Next, NFS migrated to higher-performing networks to improve client-to-NFS communications. In addition to higher networking speeds (today 100 GbE and soon 200 GbE), the industry has been looking for technologies that offload stateless networking functions that run on the CPU to the IO subsystems. This leaves more CPU cycles free to run business applications and maximizes data center efficiency.

One of the more popular networking offload technologies is RDMA (Remote Direct Memory Access). RDMA makes data transfers more efficient and enables fast data movement between servers and storage without involving the CPU. Throughput is increased, latency reduced, and CPU power is freed up for the applications. RDMA technology is already widely used for efficient data transfer in render farms and large cloud deployments, including the following:

  • Microsoft Azure
  • HPC solutions (including machine learning and deep learning)
  • iSER and NVMe-oF-based storage
  • Mission-critical SQL database solutions such as Oracle RAC (Exadata)
  • IBM DB2 pureScale
  • Microsoft SQL solutions and Teradata
Figure 1. Data communication over TCP vs. RDMA.

Figure 1 shows why IT managers have been deploying RoCE (RDMA over Converged Ethernet). RoCE uses advances in Ethernet to enable more efficient RDMA over Ethernet and enables widespread deployment of RDMA technologies in mainstream data center applications.

The growing deployment of RDMA-enabled networking solutions in public and private clouds—like RoCE, which enables running RDMA over Ethernet—plus the recent NFS protocol extensions, enables NFS communication over RoCE. For more information, see the Open Source NFS/RDMA Roadmap presentation given at the OpenFabrics Workshop in 2017 by Chuck Lever, an upstream Linux contributor and Linux kernel architect at Oracle. For more information about how to run NFS over RoCE, see How to Configure NFS over RDMA (RoCE).

To evaluate the boost that RoCE enables compared to TCP, we ran the IOzone test and measured the read/write IOPS and throughput of multi-threaded read and write tests. The tests were performed on a single client against a Linux NFS server using tmpfs, so that storage latency was removed from the picture and transport behavior exposed.

  • Client server: Intel Core i5-3450S CPU @ 2.80GHz, one socket, four cores, HT disabled, 16-GB RAM, 1333 MHz DDR3, non-ECC; HCA: NVIDIA Mellanox ConnectX-5 100 GbE NIC (SW version 16.20.1010) plugged into a PCIe 3.0 x16 slot.
  • NFS server: Intel Xeon CPU E5-1620 v4 @ 3.50GHz, one socket, four cores, HT disabled, 64-GB RAM, 2400 MHz DDR4; HCA: ConnectX-5 100 GbE NIC (16.20.1010) plugged into a PCIe 3.0 x16 slot.

The client and the NFS server were connected over a single 100-GbE NVIDIA Mellanox LinkX copper cable to the NVIDIA Mellanox Spectrum switch using the SN2700 model with its 32 x 100-GbE ports, which is the lowest latency Ethernet switch available in the market today. This makes it ideal for running latency-sensitive applications over Ethernet.

The following charts show the bandwidth and IOPS measured for performance over RoCE vs. TCP, running the IOzone test.

Figure 2. Running NFS over RoCE enables 2X to 3X higher bandwidth (using a 128-KB block size, read and write with 16 threads, aggregate throughput).
Figure 3. NFS over RoCE enables up to 140% higher IOPS (using an 8-KB block size, read and write with 16 threads, aggregate IOPS).
Figure 4. NFS over RoCE enables up to 150% higher IOPS (using a 2-KB block size, read and write with 16 threads, aggregate IOPS).

Conclusion

Running NFS over RDMA-enabled networks—such as RoCE, which offloads the CPU from performing the data communication job—generates a significant performance boost. As a result, Mellanox expects that NFS over RoCE will eventually replace NFS over TCP and become the leading transport technology in data centers.


Integrating with Telephone Networks to Enable Real-Time AI Services


Many of you may not recognize my company, Ribbon Communications. We are best known for building and securing large telecom networks for communication service providers (also known as phone companies). However, there’s a good chance that in the next day or two, you’ll place a call that traverses a piece of our gear somewhere in the world. In addition to service providers, we have substantial practice working with large enterprises, the kinds of organizations that need carrier-grade services, either because of their size or the critical nature of their communications. That includes universities, healthcare institutions, financial services, government agencies, and so on.

A short while ago, one of our customers, one of the largest investment banks in the world, approached Ribbon with a problem. They wanted to use advanced AI to analyze their contact center calls, in real-time, so that they could make immediate business decisions based on AI-based observations. They wanted to be able to ingest the audio stream, immediately transcribe it into text, and then also immediately analyze the text to look for issues such as customer satisfaction, threatening behavior, and fraud attempts. The sooner the text was transcribed, the easier it would be to store and search. Our customer could also use it for other forms of trend analysis that could spot upcoming issues, for example, customer sentiment with a certain agent.

Anyone that has ever tried to search a recording can appreciate why a bank with thousands of calls a day would rather store transcriptions than audio and would rather use AI tools to search for issues compared to traditional search tools. Unfortunately, the bank was stymied by several common technical issues that stood in their way:

  • The bank needed a secure element that could sit in the middle of thousands of contact center calls and replicate all the call media streams so the streams could be sent to an AI engine.
  • Because the element is in the middle of these calls, it can’t ever fail, and it can’t degrade the calls. It also had to be extremely secure such that a third party couldn’t find a way to intercept the streams. Nor could it be compromised or overloaded using a DoS attack.
  • The telephone network uses a different media format than AI engines accept: Real-time Transport Protocol (RTP). The bank could not just send raw audio streams of all calls to an AI engine.
  • The bank wanted to use the real-time audio streams to execute multiple AI-based services at the same time. That means that they needed multiple copies of the real-time audio sent to different AI services simultaneously to enable different constituencies in the bank to analyze the data and use the results for their own purposes.

Because the bank could not overcome these issues, they were forced to record calls in another format, store them, and then send the recordings to an AI engine for analysis. Recording was not acceptable as it introduced two drawbacks:

  • The transcription and analysis are not real-time so there’s no way to leverage AI to react to issues happening right now. That dramatically reduces the value.
  • Recordings can reduce audio quality. As you all know, lower audio quality inherently reduces the transcription accuracy of an AI platform.
Ribbon’s AI Gateway integrates into the public phone network, converting phone conversations into linear audio streams so those streams can be sent to NVIDIA’s Jarvis platform where they are translated into text and used in real-time by data analytics tools.
Figure 1. Ribbon AI gateway concept.

Ribbon, working in collaboration with NVIDIA, created a solution. We used our extensive experience in managing telephone network audio and signaling and combined that with the NVIDIA Jarvis advanced conversational AI platform, powered by GPU technology.

Ribbon is well-known for its telecom network security software—session border controllers (SBCs)—that provides wire-speed packet inspection and media manipulation. We took that know-how and created a secure interface to the telecom network so that we could securely access and replicate thousands of high-quality streams of telephone audio from the bank’s contact center (or any telephony source).

In real-time, we convert those streams from RTP into AI-acceptable audio. The audio goes to Jarvis, to be transcribed in real-time. Line-of-business owners can then use that data for many different applications. The bank already has distinct use cases in mind but it’s obvious that developers could find thousands of use cases and the value could be applied across hundreds of different industries. Any organization that receives a high volume of calls is a potential target. Target applications include:

  • Regulatory compliance
  • Real-time security or fraud analysis
  • Real-time sentiment analysis
  • Real-time translation 

Figure 1 shows that the Ribbon AI gateway becomes a secure bridge between the telephone network and the AI data analytics domain. After the audio is converted into text, the breadth of potential applications grows exponentially. The ability to get that text almost instantaneously expands the potential opportunities to use the data and the value of that data.

Ribbon’s AI gateway architecture

Ribbon’s AI Gateway has multiple internal services, including a SIP Call Control Interface to integrate with a Session Border Controller. It has a Media Relay Function that converts the audio. It also has a REST Server and File System that handle REST requests from business applications and store data, respectively.
Figure 2. AI gateway architecture, interface view.

Figure 2 provides a view of the AI gateway components, how they connect to the contact center and integrate with the Jarvis AI engine.

In the diagram, the Ribbon SBC acts as a secure spigot that delivers thousands of high-quality streams of audio to the AI gateway, using standard telecoms protocols: Session Initiation Protocol (SIP) for call signaling and RTP with various codecs for call audio. Each call participant has a separate audio stream sent to the AI gateway, to ensure the quality of the audio.

The Media Relay component then converts each audio stream from RTP to AI-acceptable audio and delivers it to the AI engine for conversion to text or other Jarvis application functions. The text for each audio stream is then sent back to the AI gateway.

The AI gateway is controlled using REST APIs. This allows a business application to dynamically instruct and control how each call is handled. For example, a business application might be set up to focus on the quality of engagements for an organization’s premium customers. The application would match incoming caller ID to the premium customers’ phone number. When there is a match, those calls would be selected for transcription and real-time analysis. The same type of filtering could be used to look for new customers, customers in a certain geography, time of day, and so on. Alternatively, they could target all calls or only a percentage to sample calls, based on defined rules.
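As a purely illustrative sketch of that kind of selection rule (the gateway’s actual REST API is not documented here, so the endpoint, payload fields, and action names below are hypothetical), the logic might look something like this in Python:

import requests

# Hypothetical list of premium customers' phone numbers.
PREMIUM_NUMBERS = {"+14155550100", "+14155550101"}

# Hypothetical AI gateway REST endpoint; the real URL and schema will differ.
GATEWAY_URL = "https://ai-gateway.example.com/api/v1/calls"

def handle_incoming_call(call_id, caller_id):
    """Ask the AI gateway to transcribe and analyze calls from premium customers."""
    if caller_id not in PREMIUM_NUMBERS:
        return  # This example rule ignores non-premium callers.
    rule = {
        "call_id": call_id,
        "actions": ["speech_to_text", "sentiment_analysis"],  # hypothetical action names
        "stream_results": True,
    }
    requests.post(f"{GATEWAY_URL}/{call_id}/rules", json=rule, timeout=5)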

An application can instruct the AI gateway whether to use the Jarvis AI engine to convert a call’s audio to text from speech or use some other Jarvis AI function. It is even possible to instruct the AI gateway to perform different functions on the same audio stream.

Finally, the application can instruct the AI gateway to stream the AI data output either in real-time or at the end of the call. It can choose one or multiple destinations. The AI gateway can provide multiple AI streams from an individual audio call to different business functions, in parallel. Businesses often have siloed organizations that have distinct requirements. They want their own feed of data so that they can unilaterally act on it. This allows different departments—like compliance, operations, or security—to use the call data to address their own specific business needs.

To demonstrate the AI gateway capabilities, we deployed a single Amazon EC2 instance in AWS. For benchmarking performance, we deployed a separate test harness and drove hundreds of simultaneous voice calls at the AI gateway instance. Using the g4dn.2xlarge EC2 instance type running Jarvis EA2 ASR, with T4 GPU, we generated 220 simultaneous voice streams in 110 simultaneous calls. Each GPU provided an order of magnitude capacity improvement over CPUs.

The AI gateway can direct call traffic to multiple GPUs to scale well beyond 100 simultaneous calls, to support the thousands of concurrent calls that a large contact center would field.

Conclusion

The speech-to-text use case is only the beginning. By providing the ability to convert from text back to speech and inject this into the call path to the caller, the AI gateway can provide a basis for real-time conversational AI agents to engage directly with contact center customers.

This AI gateway capability opens literally thousands of potential application use cases that can be tailored to fit specific business verticals and go beyond the confines of the contact center environment.

If you are interested in learning more, look for our conference talk at the upcoming GTC session, Real-time Integration of Telephony Network with AI Speech-to-Text Translation or contact me directly at Ribbon.


Reducing Costs with One-Pass Reverse Time Migration


Reverse time migration (RTM) is a powerful seismic migration technique, providing geophysicists with the ability to create accurate 3D images of the subsurface. Steep dips? Complex salt structure? High velocity contrast? No problem. By splitting the upgoing and downgoing wavefields and combining them with an accurate velocity model, RTM can image even the most complex geologic formations.

The algorithm migrates each shot independently using this basic workflow (a minimal sketch of the imaging step follows the list):

  1. Compute the downgoing wavefield.
  2. Reconstruct the upgoing wavefield and reverse it in time.
  3. Correlate up and down wavefields at each image point.
  4. Repeat for all shots and combine in a 3D image.
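As a minimal illustration of step 3 (a NumPy sketch, not the production GPU kernel; the array shapes and names are assumptions), the zero-lag cross-correlation imaging condition for a single shot can be written as:

import numpy as np

def image_single_shot(src_wavefield, rcv_wavefield):
    """Zero-lag cross-correlation imaging condition for one shot.

    Both arguments are assumed to have shape (nt, nx, ny, nz): the downgoing
    (source) wavefield and the time-reversed upgoing (receiver) wavefield.
    """
    # Multiply the two wavefields at every image point and sum over time.
    return np.einsum('txyz,txyz->xyz', src_wavefield, rcv_wavefield)

# The final volume is the stack of all per-shot images (step 4):
# image = sum(image_single_shot(S_i, R_i) for each shot i)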

While simple in concept, the computational costs made RTM economically unviable until the early 2010s, when parallel processing with NVIDIA GPUs dramatically reduced the migration time and hardware footprint needed.

Reducing RTM costs by increasing computational efficiency

There are several factors driving computational requirements for tilted transversely isotropic (TTI) RTM. One is the calculation of first, second, and cross-derivatives along x, y, and z. Earlier GPU generations, such as Fermi and Kepler, had limited streaming multiprocessors (SMs), shared memory, and compiler technology.

Paulius Micikevicius famously overcame these issues by splitting the derivative calculations into two or three passes, with each pass computing a set of derivatives. This major breakthrough allowed seismic processors to run RTM in an economical and time-efficient manner. However, each pass requires a round-trip to memory. Each round-trip to memory hinders performance and drives up costs.

While multi-pass RTM was the best you could do in 2012, you can do much better today with the NVIDIA Volta or NVIDIA Ampere Architecture generations. If your RTM kernel hasn’t been tuned since the days of Paulius, you are leaving significant value on the table.

Moving to a one-pass RTM

A one-pass TTI RTM kernel reads the wavefield one time, computes all necessary derivatives, and writes the updated wavefields to global memory one time. By eliminating multiple read/write roundtrips to memory, this implementation dramatically increases the performance gained on GPUs. It also helps the algorithm scale linearly across multiple GPUs in a node. Figures 1-3 show the performance and strong scaling gained by reducing the number of passes on A100, T4, and V100 GPUs.
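To make the memory-traffic argument concrete, here is a schematic NumPy sketch (purely illustrative CPU code, not the GPU kernel): the multi-pass version re-reads the full wavefield on every pass, while the one-pass version visits each grid point once and computes all derivatives there.

import numpy as np

def derivatives_multi_pass(u):
    """Three passes: each call re-reads the whole wavefield from memory."""
    dx = np.gradient(u, axis=0)  # pass 1
    dy = np.gradient(u, axis=1)  # pass 2
    dz = np.gradient(u, axis=2)  # pass 3
    return dx, dy, dz

def derivatives_one_pass(u):
    """One sweep: every grid point is visited once and all derivatives computed."""
    dx, dy, dz = np.zeros_like(u), np.zeros_like(u), np.zeros_like(u)
    nx, ny, nz = u.shape
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                dx[i, j, k] = 0.5 * (u[i + 1, j, k] - u[i - 1, j, k])
                dy[i, j, k] = 0.5 * (u[i, j + 1, k] - u[i, j - 1, k])
                dz[i, j, k] = 0.5 * (u[i, j, k + 1] - u[i, j, k - 1])
    return dx, dy, dz

On the GPU, the fused version corresponds to keeping the stencil data in registers and shared memory instead of making extra round trips to global memory.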

For seismic processing in the cloud, T4 provides a particularly good price/performance solution. On-premises servers for seismic processing typically have four to eight V100 or A100 GPUs per node. For these configurations, reducing the number of passes from three to one improves RTM kernel performance by 78-98%!

Performance benchmarks for TTI RTM on A100s. The graph shows benchmarks for three, two, and one pass TTI RTM, and near linear scaling from one to eight GPUs in a node.
Figure 1. A100 performance on multi and single-pass TTI RTM with linear scaling.
Performance benchmarks for TTI RTM on T4s. The graph shows benchmarks for three, two, and one pass TTI RTM, and near linear scaling from one to eight GPUs in a node.
Figure 2. T4 performance on multi and single-pass TTI RTM with linear scaling.
Performance benchmarks for TTI RTM on V100s. The graph shows benchmarks for three, two, and one pass TTI RTM, and near linear scaling from one to eight GPUs in a node.
Figure 3. V100 performance on multi and single-pass RTM with linear scaling.

Conclusion

Reducing the number of passes in your RTM kernel can dramatically improve code performance and decrease costs. To make the development easier, NVIDIA has developed a collection of code examples showing how to implement a GPU-accelerated RTM using best practices. If you have an NDA in place with NVIDIA, you can get free access to this code.

Of course, the number of passes in an RTM kernel is only one piece of the puzzle. There are several other tricks shown in the example code to further increase performance, such as compression.

If you’re interested in accessing the NVIDIA RTM implementation or want assistance in optimizing your code, please comment below.


Beginner’s Guide to GPU-Accelerated Event Stream Processing in Python


This tutorial is the seventh installment of introductions to the RAPIDS ecosystem. The series explores and discusses various aspects of RAPIDS that allow its users to solve ETL (Extract, Transform, Load) problems, build ML (Machine Learning) and DL (Deep Learning) models, explore expansive graphs, process geospatial, signal, and system log data, or use SQL language via BlazingSQL to process data.

In the age of the Internet, abundant IoT devices, social media, web servers, and more, data flows at incredible speeds. In 2019, Forbes reported that every minute, Americans use approximately 4.4 PB of internet data, which works out to roughly 1 MB of data per Internet user per minute.

Not only is the volume of data increasing over time, but so are the speeds at which data arrives. Over the years, we went from dial-up modem connections with speeds up to 56 kbit/s in the early 1990s to contemporary 10 Gbit networks that are starting to gain popularity. 1 Gbit networks are still the most widely used way of interconnecting devices at home and in the office, unless you are on a WiFi network.

Many of the Internet services offered these days rely on prompt and fast processing of this constant waterfall of data. cuStreamz is one of the newer additions to the RAPIDS stack. It aims to take the streaming data processing historically done on the CPU and accelerate it on the GPU. Thanks to GPUs’ immense parallelism, processing streaming data has now become much faster, with a friendly Python interface.

In the previous posts, we showcased other areas of RAPIDS.

Today, we talk about cuStreamz – a library that uses GPUs to process streaming data. To help you get familiar with cuStreamz, we also published a cuStreamz cheat sheet that you can download, and an interactive notebook that showcases all of its current functionality.

Streaming frameworks

First released in 2011, Apache Kafka has quickly become a standard for managing vast quantities of fast-moving data with low latency and high-level APIs. Kafka is a distributed platform that maintains a list of topics that systems can subscribe to (the so-called consumers) and publish their data onto (the producers). Data in Kafka, like in many other distributed systems, is replicated among multiple workers (or brokers): if any of the brokers disconnects from the cluster, or otherwise dies, the data is not lost and is still available from other brokers. This improves the resiliency and availability of the system that is required by today’s Internet service companies.

Streamz is a Python framework that focuses on processing high-velocity data and allows for branching, joining, controlling the flow of messages, and sinking the data to disk or other streams. Here’s what a set of distinct pipelines might look like:

Figure 1: Source: https://streamz.readthedocs.io/en/latest/_images/complex.svg

The pipeline can branch into multiple branches. A popular Lambda architecture also implements two branches: one to process fast-moving, near real-time data, and another one to provide batch processing.

RAPIDS cuStreamz builds on top of the streamz framework and allows the messages to be batched into cuDF DataFrames instead of text messages. This, on its own, enables significant speed-ups when processing messages that conform to the same schema by tapping into the power of GPUs. Also, if the data volume cannot fit on a single machine, cuStreamz supports distributing the data processing using Dask-cuDF DataFrames.

Setting up locally

It is easy to get started. In this section, we will show you how to set up your own mini-Kafka cluster using Docker. To use cuStreamz, you will, of course, need an NVIDIA GPU with Pascal architecture (GTX 1000-series) or newer as required by RAPIDS.

To get started with Kafka, you need to install Docker and Docker-compose: the installation instructions for Docker can be found here https://docs.docker.com/engine/install/ while the Docker-compose installation manual is here https://docs.docker.com/compose/install/. Please note that you will need a Linux machine to run this as neither Windows nor MacOSX is officially supported: https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#is-microsoft-windows-supported.

To use cuStreamz, your machine will need NVIDIA drivers and CUDA environment present (instructions to follow can be found here https://developer.nvidia.com/cuda-downloads) and NVIDIA-docker addition so Docker can connect to your GPU: find it here https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker.

Kafka cluster

Next, let’s set up our Kafka cluster. If you clone the GitHub repository, navigate to the Kafka folder inside cheatsheets/cuStreamz and open the docker-compose.yaml file.

Docker-compose uses the YAML configuration files to set up the whole cluster. The first service we start is the zookeeper. Zookeeper is a service used to track naming and configuration data for Kafka; it maintains information about the cluster nodes’ status and their topics, partitions, replication, etc. In addition, the Zookeeper service allows multiple clients to carry out concurrent reads and writes to the service to keep up with the volume and velocity of the incoming and outgoing data calls.

services:
 zookeeper:
   image: 'confluentinc/cp-zookeeper:5.4.3'
   hostname: zookeeper
   networks:
     - kafka
   environment:
     - ZOO_MY_ID=1
     - ZOOKEEPER_CLIENT_PORT=2181
     - ZOO_SERVERS=zookeeper:2888:3888
   ports:
     - 2181:2181
   volumes:
     - ./data/zookeeper/data:/data
     - ./data/zookeeper/datalog:/datalog

In this example, we use the cp-zookeeper:5.4.3 image from Confluent to start our Zookeeper service; the server started will be named zookeeper. The Zookeeper service can be replicated among multiple servers, so it can become resilient; the Zookeeper servers talk to each other on port 2888, and leader election uses port 3888. Clients that want to use Zookeeper connect to the service on port 2181, and that port gets forwarded to the host via the ports config. We also map some host folders to the container so the data that Zookeeper stores is persisted.

Next, we start two Kafka worker nodes (one shown here for brevity).

kafka0:
 image: confluentinc/cp-kafka:5.4.3
 hostname: kafka0
 networks:
   - kafka
 ports:
   - "9092:9092"
 environment:
   KAFKA_LISTENERS: LISTENER_DOCKER_INTERNAL://kafka0:19092,LISTENER_DOCKER_EXTERNAL://${DOCKER_HOST_IP:-127.0.0.1}:9092
   KAFKA_ADVERTISED_LISTENERS: LISTENER_DOCKER_INTERNAL://kafka0:19092,LISTENER_DOCKER_EXTERNAL://${DOCKER_HOST_IP:-127.0.0.1}:9092
   KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: LISTENER_DOCKER_INTERNAL:PLAINTEXT,LISTENER_DOCKER_EXTERNAL:PLAINTEXT
   KAFKA_INTER_BROKER_LISTENER_NAME: LISTENER_DOCKER_INTERNAL
   KAFKA_ZOOKEEPER_CONNECT: "zookeeper:2181"
   KAFKA_BROKER_ID: 0
   KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
 volumes:
   - ./data/kafka0/data:/var/lib/kafka/data
 depends_on:
   - zookeeper

The cp-kafka image comes from Confluent’s Docker Hub; here, we also use version 5.4.3. There are plenty of environment variables, but let’s review just the most important ones from our point of view:

  • KAFKA_LISTENERS identifies the list of server names and ports the server will be listening on. Note that the external and internal ports are different: to facilitate communication between multiple Docker containers, the server is placed on the Docker internal network (in our case, kafka_kafka) and the kafka0 server listens on port 19092. If you would like to connect to this service from the host, you can use localhost and port 9092. The same list is provided in the KAFKA_ADVERTISED_LISTENERS environment variable.
  • KAFKA_INTER_BROKER_LISTENER_NAME tells Docker which server name to use for internal communication between containers: in our case, this is LISTENER_DOCKER_INTERNAL, but any recognizable name should work. If you change this name, however, you will also have to change KAFKA_LISTENERS and KAFKA_ADVERTISED_LISTENERS.
  • KAFKA_ZOOKEEPER_CONNECT specifies the address of the Zookeeper to connect to; in our case, that is zookeeper:2181.
  • KAFKA_BROKER_ID is a unique identifier of the Kafka node and, by convention, should be included in the name of the service and the server name.

We also identify the zookeeper as a service this container depends on.

To start all these services, simply navigate to the folder where the docker-compose.yaml file is saved and run docker-compose up in the terminal (if you want to stop the services, press Ctrl-C or, from another terminal window, type docker-compose down). Once the services are running, you can check the list of all containers by running the docker ps command.

With all the services running, let’s create a sample topic. Run the following command in the terminal.

docker exec -ti <kafka_container_name> bash

Once inside, run the following command.

kafka-topics.sh --create --zookeeper zookeeper:2181 --replication-factor <replication_factor> --partitions <number_of_partitions> --topic test

Now, you should be able to subscribe to the topic test to either sink or consume the messages. Your Kafka service is running!!!

Let’s get streaming!

In this example, we will be using the official RAPIDS container. Go ahead and pull the latest one following the examples here https://rapids.ai/start.html. Start the container using the command listed on the RAPIDS website. You should now be able to navigate to http://localhost:8888 and access JupyterLab.

Before we move forward, we need to connect this container to the kafka_kafka network: do so with the following command from the terminal.

docker network connect kafka_kafka <rapids_container_name>

From now on, we should be able to access the kafka0:19092 server from the RAPIDS container.

Note that if you do not have custreamz available in your container, you can install it using the following command.

conda install -c rapidsai -c rapidsai-nightly -c nvidia -c conda-forge -c defaults custreamz python=3.7 cudatoolkit=11

Next, let’s subscribe to our test topic with cuStreamz!

import streamz

consumer_conf = {'bootstrap.servers': 'kafka0:19092',
                 'group.id': 'custreamz',
                 'session.timeout.ms': '60000'
                }

source = streamz.Stream.from_kafka_batched(
    'test'
    , consumer_conf
    , poll_interval='2s'
    , asynchronous=True
    , dask=False
    , engine="cudf"
    , start=False
)

We will be using the .from_kafka_batched(...) method to subscribe, as this allows us to use the CUDA Kafka connector and return the messages in the form of a cudf DataFrame. The first parameter specifies the topic name and is followed by the dictionary with the configuration. Next, we set the interval at which the stream object will check the Kafka topic for new messages; 2 seconds in this example. Setting engine to cudf specifies that the messages should be returned as DataFrames. We can now provide the rest of the pipeline and start the listener.

import cudf
from streamz.dataframe import DataFrame

def process_batch(messages):
    batch_df = cudf.DataFrame()
    
    for message in messages:
        df_split = messages[message].str.tokenize()
        df_split = (
            df_split
            .to_frame('word')
            .reset_index()
            .groupby(by='word')
            .agg({'index': 'count'})
            .rename(columns={'index': 'count'})
            .reset_index()
        )
        print("nWord Count for this batch:")
        
        batch_df = cudf.concat([batch_df, df_split])
    
    return batch_df

stream_df = source.map(process_batch)

# Create a streamz dataframe to get stateful word count
sdf = DataFrame(stream_df, example=cudf.DataFrame({'word':[], 'count':[]}))

# Formatting the print statements
def print_format(sdf):
    print("nGlobal Word Count:")
    return sdf

# Print cumulative word count from the start of the stream, after every batch. 
# One can also sink the output to a list.
sdf.groupby('word').sum().stream.gather().map(print_format)

After this, run:

source.start()

Et voila! We now have a running listener to the test topic!

The code here is pretty self-explanatory, but at a high level, we expect the messages to arrive as a DataFrame. We tokenize the words occurring in the message by using the .tokenize() functionality of RAPIDS cudf and then count the number of individual words. Finally, we create a Streamz DataFrame that we use to produce the final tally of words by summing the occurrences of each word.

With the consumer running now, let’s produce some messages! Open a new notebook and install the kafka-python package by running the following in a cell.

 !pip install kafka-python  

Next, we start a producer.

from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers='kafka0:19092'
    , value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

The bootstrap_servers parameter is the address of our kafka0 server. Every message we emit will be a UTF-8-encoded JSON string. Now we can start pushing messages onto the topic:

producer.send('test',{'text': 'RAPIDS rocks!'})

The notebook with the cuStreamz consumer running should now produce a DataFrame with RAPIDS and rocks! as its index rows, and a count of 1 against each of these words. You can now play more with it!

With the introduction of cuStreamz, the RAPIDS ecosystem can speed up the processing of fast-moving data. You can try the above examples and more for yourself at app.blazingsql.com and download the cuStreamz cheatsheet here.


Best Practices for Using AI to Develop the Most Accurate Retail Forecasting Solution


Jupyter Notebook Best Practices for Using RAPIDS 

A leading global retailer has invested heavily in becoming one of the most competitive technology companies around. 

Accurate and timely demand forecasting for millions of item-by-store combinations is critical to serving their millions of weekly customers. Key to their success in forecasting is RAPIDS, an open-source suite of GPU-accelerated libraries. RAPIDS helps them tear through their massive-scale data and has improved forecasting accuracy by several percentage points – it now runs orders of magnitude faster on a reduced infrastructure GPU footprint.  This enables them to respond in real-time to shopper trends and have more of the right products on the shelves, fewer out-of-stock situations, and increased sales.

With RAPIDS, data practitioners can accelerate pipelines on NVIDIA GPUs, reducing data operations including data loading, processing, and training from days to minutes.  RAPIDS abstracts the complexities of accelerated data science by building on and integrating with popular analytics ecosystems like PyData and Apache Spark, enabling users to see benefits immediately. Compared to similar CPU-based implementations, RAPIDS delivers 50x performance improvements for classical data analytics and machine learning (ML) processes at scale which drastically reduces the total cost of ownership (TCO) for large data science operations. 

End-to-end accelerated data science. Data preparation to model training to visualization.
Figure 1. Data science pipeline with GPUs and RAPIDS

To learn and solve complex data science and AI challenges, leaders in retail often leverage what are called ‘Kaggle competitions’. Kaggle is a platform that brings together data scientists and other developers to solve challenging and interesting problems posted by companies. In fact, there have been over 20 competitions for solving retail challenges within the past year. 

Leveraging RAPIDS and best practices for a forecasting competition, NVIDIA Kaggle Grandmaster Kazuki Onodera won 2nd place in the Instacart Market Basket Analysis Kaggle competition using complex feature engineering, gradient boosted tree models, and special modeling of the competition’s F1 evaluation metric. Along the way, we documented the best practices for ETL, feature engineering, and building and customizing models for an AI-based retail forecasting solution.

This blog post will walk readers through the components of a Kaggle competition to explain data science best practices for improving forecasting in retail.  Specifically, the blog post explains the Instacart Market Basket Analysis Kaggle competition goals, introduces RAPIDS, then offers a workflow to show how to explore the data visually, develop features, train the model, and run a forecasting prediction. Then, the post will dive into some advanced techniques for feature engineering with model explainability and hyperparameter optimization (HPO). 

For an even more detailed look into the methodology, see Kazuki Onodera’s fantastic interview with Medium.com.

Join Paul Hendricks at NVIDIA GTC 2021, where he hosts a session on Best Practices for ETL, Feature Engineering, and Model Development for Retail Forecasting Using NVIDIA RAPIDS Data Science Libraries.

Access this Jupyter notebook, where we share these best practices for GPU-accelerated forecasting within the context of the Instacart Market Basket Analysis Kaggle competition.

The Forecasting Challenge 

The Instacart Market Basket Analysis competition challenged Kagglers to predict which grocery products a consumer will purchase again and when. Imagine, for example, having milk ready to be added to your cart right when you run out, or knowing that it’s time to stock up again on your favorite ice cream. 

This focus on understanding temporal behavior patterns makes the problem fairly different from standard item recommendation, where user needs and preferences are often assumed to be relatively constant across short windows of time. Whereas Netflix might be fine assuming you want to watch another movie like the one you just watched, it’s less clear that you’ll want to reorder a fresh batch of almond butter or toilet paper if you bought them yesterday. 

Problem Overview 

The goal of this competition was to predict grocery reorders: given a user’s purchase history (a set of orders, and the products purchased within each order), which of their previously purchased products will they repurchase in their next order? 

The problem is a little different from the general recommendation problem, where we often face a cold start issue of making predictions for new users and new items that we’ve never seen before. For example, a movie site may need to recommend new movies and make recommendations for new users. 

The sequential and time-based nature of the problem also makes it interesting: how do we take the time since a user last purchased an item into account? Do users have specific purchase patterns and do they buy different kinds of items at different times of the day? 

To get started, we’ll first load some of the modules we’ll be using in this notebook and set the random seed for any random number generator we’ll be using. 

Modules to predict grocery reorders.
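A minimal sketch of what that setup might look like (the exact modules in the notebook may differ; this is an assumption for illustration):

import random

import numpy as np
import cudf
import xgboost as xgb

# Fix the random seeds so results are reproducible across runs.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)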

RAPIDS Overview 

Data scientists typically work with two types of data: unstructured and structured. Unstructured data often comes in the form of text, images, or videos. Structured data – as the name suggests – comes in a structured form, often represented by a table or CSV. We’ll focus the majority of the tutorials on working with these types of data.

There are many tools in the Python ecosystem for structured, tabular data but few are as widely used as pandas. pandas represents data in a table and allows a data scientist to manipulate the data to perform a number of useful operations such as filtering, transforming, aggregating, merging, visualizing, and many more.

For more information on pandas, check out the excellent documentation here: http://pandas.pydata.org/pandas-docs/stable/

pandas is fantastic for working with small datasets that fit into your system’s memory. However, datasets are growing larger and data scientists are working with increasingly complex workloads – the need for accelerated compute arises.

cuDF is a package within the RAPIDS ecosystem that allows data scientists to easily migrate their existing pandas workflows from CPU to GPU, where computations can leverage the immense parallelization that GPUs provide. 

Getting familiar with the data

The dataset for this competition contains several files capturing orders from Instacart users over time, with the goal of the competition being to predict whether a user will re-order a product and, specifically, which products those customers will re-order. From the Kaggle data description (https://www.kaggle.com/c/instacart-market-basket-analysis/data), we see that we have over three million grocery orders from a customer base of over 200,000 Instacart users. For each user, we are provided between 4 and 100 of their orders, with the sequence of products purchased in each order, the day of week and hour of day each order was placed, and a relative measure of time between orders.

Our products, aisles, and departments datasets are composed of metadata about our products, aisles, and departments respectively. Each dataset (products, aisles, departments, and orders, etc.) has a unique identifier mapping for each entity in that dataset e.g. order_id represents a unique order within the orders dataset, product_id represents a unique product within the products dataset, etc. We’ll use these unique identifiers later to combine all of these separate datasets into one coherent view for exploratory data analysis, feature engineering, and modeling.

Below, we will read in our data and inspect our different tables using cuDF. 

Products, aisles and departments datasets using cuDF.
Datasets for days since prior order and order hour of the day.

Additionally, we’ll read in our orders datasets. The first indicates to which set (prior, train, test) an order belongs, and additional files specify which products were purchased in each order. Again, from the Kaggle description of the data, we see that order_products__prior.csv contains previous order contents for all customers, and that the column ‘reordered’ indicates that the customer has a previous order that contains the product. We are informed that some orders will have no reordered items.

Datasets for orders including days since prior order and order hour of the day.
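A minimal sketch of those reads with cuDF might look like the following, assuming the competition CSV files have been downloaded into a data/ folder:

import cudf

# Metadata tables
products = cudf.read_csv('data/products.csv')
aisles = cudf.read_csv('data/aisles.csv')
departments = cudf.read_csv('data/departments.csv')

# Orders, and the products contained in each order
orders = cudf.read_csv('data/orders.csv')
order_products_prior = cudf.read_csv('data/order_products__prior.csv')
order_products_train = cudf.read_csv('data/order_products__train.csv')

print(orders.head())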

Exploring the data

When we think about our data science workflow, one of the most important steps is Exploratory Data Analysis. This is where we examine our data and look for clues and insights into which features we can use (or need to create) to feed our model. There are many ways to explore the data and each Exploratory Data Analysis is different for each problem – however, it still remains incredibly important as it informs our feature engineering process, ultimately determining how accurate our model will be. 

In the notebook, we look at a couple of different cross sections of the data. Specifically, we examine the distribution of the order counts, the days of week and times customers typically place orders, the distribution of the number of days since the last order, and the most popular items across all orders and unique customers (de-duplicating so as to ignore customers who have a “favorite” item that they place repeated orders for). 

Distribution of the order counts, the days of week and times customers typically place orders.
Figure 2. Exploring the data

From this we see respectively that: 

  • No user has fewer than 4 orders, and the maximum is capped at 100. 
  • Orders are high on Saturday and Sunday (days 0 and 1) and low on Wednesday. 
  • The majority of orders are made during the daytime, and customers primarily order once a week or once a month (see the peaks at days 7 and 30). 

Similar exploratory analysis for product popularity is provided in the notebook.
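Continuing the read-in sketch from earlier, the code behind these cross sections might look roughly like this (column names follow the Kaggle schema; the notebook’s plots are richer than these raw counts):

# Orders per user: the distribution of how many orders each user has placed.
orders_per_user = orders.groupby('user_id')['order_id'].count()
print(orders_per_user.describe())

# When do customers order? Day of week (0-6) and hour of day (0-23).
print(orders['order_dow'].value_counts().sort_index())
print(orders['order_hour_of_day'].value_counts().sort_index())

# How long since the previous order? Note the peaks at 7 and 30 days.
print(orders['days_since_prior_order'].value_counts().sort_index())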

Feature Engineering

If Exploratory Data Analysis is the most important part of our data science workflow, Feature Engineering is a close second. This is where we identify which features should be fed into the model and create features where we believe they might be able to help the model do a better job of predicting. 

Figure 3: Machine Learning is an iterative process.

We start by just identifying our unique User X Item combinations and sorting them. We’ll create a dataset where each user maps to their most recent order number, day of week and hour, and how many days it’s been since that order. And we’ll extend our dataset, creating labels and features to be used later in our machine learning model (a small cuDF sketch follows the list), such as: 

  • How many kinds of products has the user ordered? 
  • How many products has the user ordered within one cart? 
  • From which departments has the user ordered products? 
  • When has the user ordered products (day of the week)?  
  • Has this user ordered this product at least once before? 
  • How many orders has the user placed that included this item? 
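Continuing the earlier sketch, a hedged example of this kind of cuDF feature engineering might look as follows (the notebook builds many more features; the column and helper names here are illustrative):

# Unique user x product combinations seen in the prior orders.
prior = order_products_prior.merge(orders, on='order_id')
user_product = (
    prior.groupby(['user_id', 'product_id'])
         .agg({'order_id': 'count'})
         .rename(columns={'order_id': 'user_product_size'})
         .reset_index()
)

# Per-user summary features.
user_features = (
    orders.groupby('user_id')
          .agg({'order_number': 'max', 'days_since_prior_order': 'mean'})
          .rename(columns={'order_number': 'n_orders',
                           'days_since_prior_order': 'avg_days_between_orders'})
          .reset_index()
)

features = user_product.merge(user_features, on='user_id')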

Solving for the business problem (train and predict)

The mathematical operations underlying many machine learning algorithms are often matrix multiplications. These types of operations are highly parallelizable and can be greatly accelerated using a GPU. RAPIDS makes it easy to build machine learning models in an accelerated fashion while still using a nearly identical interface to Scikit-Learn and XGBoost. 

There are many ways to create a model – one can use Linear Regression models, SVMs, tree-based models like Random Forest and XGBoost, or even Neural Networks. In general, tree-based models tend to work better with tabular data for forecasting than Neural Networks. Neural Networks work by mapping the input (feature space) to another complex boundary space and determining what values should belong to those points within that boundary space (regression, classification). Tree-based models on the other hand work by taking the data, identifying a column, and then finding a split point in that column to map a value to, all the while optimizing the accuracy. We can create multiple trees using different columns, and even different columns within each tree.

For a more detailed description of tree-based models XGBoost, see this fantastic documentation: https://xgboost.readthedocs.io/en/latest/tutorials/model.html 

In addition to their better accuracy performance, tree-based models are very easy to interpret (important for when predictions or decisions resulting from the predictions must be explained and justified, maybe for compliance and legal reasons e.g. finance, insurance, healthcare). Tree-based models are very robust and work well even when there is a small set of data points.  

In the section below, we’ll set the different parameters for our XGBoost model and train five different models – each on a different subset of users to avoid overfitting to a particular set of users. 

import xgboost as xgb

NFOLD = 5
PARAMS = {
    'max_depth':8, 
    'eta':0.1,
    'colsample_bytree':0.4,
    'subsample':0.75,
    'silent':1,
    'nthread':40,
    'eval_metric':'logloss',
    'objective':'binary:logistic',
    'tree_method':'gpu_hist'
         }

models = []
for i in range(NFOLD):
    train_ = train[train.user_id % NFOLD != i]
    valid_ = train[train.user_id % NFOLD == i]
    dtrain = xgb.DMatrix(train_.drop(['user_id', 'product_id', 'label'], axis=1), train_['label'])
    dvalid = xgb.DMatrix(valid_.drop(['user_id', 'product_id', 'label'], axis=1), valid_['label'])
    model = xgb.train(PARAMS, dtrain, 9999, [(dtrain, 'train'), (dvalid, 'valid')],
                      early_stopping_rounds=50, verbose_eval=5)
    models.append(model)

There are several parameters that should be set before XGBoost can be run. 

  • General parameters relate to which booster we are using to do boosting, commonly tree or linear model.
  • Booster parameters depend on which booster you have chosen. 
  • Learning task parameters decide on the learning scenario. For example, regression tasks may use different parameters than ranking tasks. 

For more information on the configurable parameters within the XGBoost module, see the documentation here: https://xgboost.readthedocs.io/en/latest/parameter.html 

Feature Importance

Once we’ve trained our models, we might want to look at the internal workings and understand which of the features we’ve crafted are contributing the most to the predictions. This is called Feature Importance. One of the advantages of tree-based models for forecasting is that it is very easy to understand the differing importance of our features. 

By understanding how our features contribute to the model accuracy, we can choose to remove features that aren’t important, or iterate: create new features, re-train, and re-assess whether those new features are more important. Ultimately, being able to iterate quickly and try new things in this workflow will lead to the most accurate model and the greatest ROI (for forecasting, oftentimes cost-savings from reduced out-of-stock and poorly placed inventory). Iteration traditionally can take a significant amount of time due to computational intensity. RAPIDS allows users to churn through model iteration with NVIDIA accelerated computing so users can iterate quickly and determine the best performing model. 

In the Feature Importance section of the notebook, we define convenience code to access the importance of the features in each model. We then pass in our list of models that we trained, iterate over them one by one, and average the importance of each variable across all the models. Lastly, we visualize feature importance using a horizontal bar chart. 
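A minimal sketch of that convenience code (using XGBoost’s default gain-based importance; the notebook’s version may differ in detail):

import pandas as pd
import matplotlib.pyplot as plt

def average_importance(models):
    """Average each feature's importance score across a list of trained boosters."""
    scores = pd.DataFrame([m.get_score(importance_type='gain') for m in models])
    return scores.mean().sort_values()

importance = average_importance(models)
importance.plot.barh()
plt.title('Average feature importance (gain)')
plt.show()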

We see specifically that three of our features are contributing the most to our predictions: 

  • user_product_size – How many orders has a user placed that included this item?
  • user_product_t-1 – Has this user ordered this product at least once before?
  • order_number – The number of orders that user has created.
Figure 4: Determining top features.

All of this makes sense and aligns with our understanding of the problem. Customers who have placed an order for an item before are more likely to repeat an order for that product, and users who place multiple orders of that product are even more likely to re-order. Additionally, the number of orders a customer has created correlates with their likelihood of re-ordering. 

The code uses the default XGBoost implementation of feature importance – but we are free to choose any implementation or technique. A wonderful technique (also developed by an NVIDIA Kaggle Grandmaster, Ahmet Erdem) is called LOFO.

From the description on the LOFO GitHub page, we see that LOFO (Leave One Feature Out) Importance calculates the importance of a set of features based on a metric of choice, for a model of choice, by iteratively removing each feature from the set and evaluating the performance of the model with a validation scheme of choice, based on the chosen metric. LOFO first evaluates the performance of the model with all the input features included, then iteratively removes one feature at a time, retrains the model, and evaluates its performance on a validation set.

This methodology allows us to effectively determine which features are important for the model. LOFO has several advantages compared to other importance types:

  • It does not favor granular features.
  • It generalizes well to unseen test sets. 
  • It is model agnostic. 
  • It gives negative importance to features that hurt performance upon inclusion. 

For more information on LOFO, see here: https://github.com/aerdem4/lofo-importance 

Hyperparamater Optimization (HPO) 

When we trained our XGBoost models, we used the following parameters: 

PARAMS = {
    'max_depth':8,
    'eta':0.1,
    'colsample_bytree':0.4,
    'subsample':0.75,
    'silent':1,
    'nthread':40,
    'eval_metric':'logloss',
    'objective':'binary:logistic',
    'tree_method':'gpu_hist'
}

Of these, only a few may be changed to affect the accuracy of our model: max_depth, eta, colsample_bytree, and subsample. However, the values we used may not be the most optimal. The art and science of identifying and training models with the most optimal hyperparameters is called hyperparameter optimization. 

While there is no magic button one can press to automatically identify the most optimal hyperparameters, there are techniques that allow you to explore the range of all possible hyperparameter values, quickly test them, and find values that are close to the optimum. 
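As a minimal illustration of one such technique, a random search over the four parameters named above might look like this (a sketch only, assuming a training table named train as in the earlier code, not the tooling that the cloud ML frameworks provide):

import random
import xgboost as xgb

random.seed(42)

def sample_params():
    """Draw one random hyperparameter configuration for XGBoost."""
    return {
        'max_depth': random.randint(4, 12),
        'eta': random.uniform(0.01, 0.3),
        'colsample_bytree': random.uniform(0.3, 1.0),
        'subsample': random.uniform(0.5, 1.0),
        'eval_metric': 'logloss',
        'objective': 'binary:logistic',
        'tree_method': 'gpu_hist',
    }

# Assumed: train is the feature table built earlier, with a 'label' column.
dtrain_full = xgb.DMatrix(train.drop(['user_id', 'product_id', 'label'], axis=1), train['label'])

best_score, best_params = float('inf'), None
for trial in range(20):
    params = sample_params()
    result = xgb.cv(params, dtrain_full, num_boost_round=500,
                    nfold=3, early_stopping_rounds=50)
    score = result['test-logloss-mean'].min()
    if score < best_score:
        best_score, best_params = score, params

print(best_score, best_params)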

A full exploration of these techniques is beyond the scope of this notebook. However, RAPIDS is integrated into many Cloud ML Frameworks for doing HPO as well as with many of the different open source tools. And being able to use the incredible speedups from RAPIDS allows you to go through your ETL, feature engineering, and model training workflow very quickly for each possible experiment – ultimately resulting in fast HPO explorations through large hyperparameter spaces and a significant reduction in Total Cost of Ownership (TCO). 

Conclusion 

In this blog, we walked through the components of a Kaggle competition to explain data science best practices for improving forecasting in retail. Specifically, the blog post explained the Instacart Market Basket Analysis Kaggle competition goals, introduced RAPIDS, then offered a workflow to show how to explore the data visually, develop features, train the model, and run a forecasting prediction. Then, the post reviewed techniques for feature engineering with model explainability and hyperparameter optimization (HPO).  

To learn more, be sure to: 

  • See this Jupyter notebook on forecasting
    where we show best practices for GPU accelerated forecasting within the context of the Instacart Market Basket Analysis Kaggle competition in which NVIDIA Kaggle Grandmaster Kazuki Onodera won 2nd place, using complex feature engineering, gradient boosted tree models and special modeling of the competition’s F1 evaluation metric.   
  • Read Kazuki Onodera’s detailed interview at Medium.com

Detecting Financial Fraud Using GANs at Swedbank with Hopsworks and NVIDIA GPUs


Recently, one of Sweden’s largest banks trained generative adversarial neural networks (GANs) using NVIDIA GPUs as part of its fraud and money-laundering prevention strategy. Financial fraud and money laundering pose immense challenges to financial institutions and society. Financial institutions invest huge amounts of resources in both identifying and preventing suspicious and illicit activities. Some large institutions reportedly save $150 million in a single year through the use of AI fraud detection.

Existing approaches to identifying financial fraud and money laundering rely on databases of human-engineered rules that match suspicious patterns in financial transactions. As new schemes are identified, new rules are added to the rules base.

Swedbank has developed new solutions to these problems using combinations of deep learning techniques on GPUs, producing new state-of-the-art solutions for identifying suspicious activities. The approach is to model problems in a semi-supervised fashion using anomaly detection via GANs. The solution requires software and hardware that can scale to process and train models on large volumes of data. Hopsworks has trained models on a dataset as large as 40 terabytes (TBs) in size. To this end, the Hopsworks software platform was used with NVIDIA V100 GPUs to engineer features at scale and efficiently train GANs using many GPUs in parallel.

Rules-based vs. model-based fraud detection

Existing approaches to identifying fraud and money laundering rely on databases of human-engineered rules that attempt to match patterns that are indicative of fraud. As new fraud schemes are identified, new rules are added to the rule engines. For example, in money laundering, there are well-known patterns of smurfing money across many accounts and then aggregating it at hub accounts using small, under-the-radar transactions for later spending.

In the Rules-Based Fraud Detection code example, you can see the rule-based approach to identifying suspicious financial transactions. Here, you define a large set of rules that are applied to all financial transactions. If a transaction matches any of the rules, an alert is triggered. If the alert was incorrectly triggered (false positive), it induces costs. If no alert was triggered, but one should have been (false negative), you must design a new rule to identify the fraud scheme. Companies maintain these rule databases and routinely ship updates to customers.

Rules-Based Fraud Detection

# Rule 1
IF transfersLastDay > 10 && amount > $5k
THEN
 alert
END

# Rule 2
IF country is LISTED && amount > $1k
THEN
 alert
END
… 
# Rule N
… 
Train Fraud Detection Model

dataset = tf.data("financial_transactions")
model = …
model.compile(…)
model.fit(dataset, …)

Detect Fraud with Model 

IF model.predict(amount,transfersLastDay,
           country, ….) == TRUE
THEN
 alert
END

Given enough historical financial transaction data, model-based approaches are better at pattern matching than rule-based approaches, as they can generalize to learn fraud schemes that are similar to existing ones. In the Train Fraud Detection Model code example, you can see that you must first curate a labeled training dataset: financial_transactions. With that dataset, you can train the model, and the trained model can then be used on new financial transactions to predict whether they are fraud or not-fraud. An alert is sent if a financial transaction is suspected of fraud.

GANs are a natural choice for financial fraud prediction as they can learn the patterns of lawful transactions from historical data. For every new financial transaction, the model computes an anomaly score; financial transactions with high scores are labeled as suspicious transactions.
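As a rough sketch of that scoring step (not Swedbank's actual implementation), a trained anomaly detector can be wrapped as a scoring function whose output is compared against a threshold; score_transaction, send_alert, and the threshold value below are hypothetical placeholders.

import numpy as np

ANOMALY_THRESHOLD = 0.95   # hypothetical cut-off, chosen on validation data

def send_alert(index, score):
    # Placeholder for the bank's alerting pipeline.
    print(f"transaction {index} flagged, anomaly score {score:.3f}")

def flag_suspicious(transactions, score_transaction):
    # transactions: array of shape [n, num_features]
    # score_transaction: the trained model's scoring function, assumed to return
    # a higher value for transactions that look less like lawful historical data.
    scores = np.array([score_transaction(t) for t in transactions])
    suspicious = np.where(scores > ANOMALY_THRESHOLD)[0]
    for i in suspicious:
        send_alert(i, scores[i])
    return suspicious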

GANs are difficult to both train and deploy in production, needing lots of GPUs, parallel hyperparameter search, and distributed training support. Great care must be exercised and advanced machine learning experience is certainly desired. One of the GAN implementations is based on the Unsupervised Learning of Anomaly Detection from Contaminated Image Data using Simultaneous Encoder Training paper, which describes an architecture for anomaly detection that tolerates small amounts of incorrectly labeled samples and supports parallel encoder training.

Understanding fraud using a graph representation of entities and transactions

To detect fraudulent patterns and trigger alerts, you can use graph and tabular features and the DL-based GAN techniques described earlier. Graphs consist of nodes, also known as vertices, and edges, also known as arcs. In financial applications, graphs can model transactional interactions of businesses and individuals.

To show the utility of graphs, here’s an example. Mark the businesses and individuals with different titles: businesses are marked as “Corp” and individuals are marked as “Indiv”. The edges are used to represent transactions with associated dates and amounts and the arrows represent the direction of transactions.

There are various expected graph patterns, such as a normal scatter pattern, also known as a dandelion, that occurs when an organization pays salaries. Such a pattern occurs on certain dates, salaries are relatively fixed, and the money flow is outbound from a single payer. An anomalous scatter pattern has a sudden burst of transactions never seen previously for the nodes involved, or bidirectional money flows.

Figure 1 shows a gather-scatter pattern, where money flows initially inbound to the central node in the month of January. These flows are subsequently outbound to other nodes in the month of February. In the world of money-laundering, this gather-scatter pattern is used to hide the distribution of funds from financial institutions. Similarly, Figure 2 shows a scatter-gather pattern that again has a bidirectional flow of money on different dates. In this case, the source and destination of the money are two different central entities.

Figure 1. Gather-scatter pattern through a central entity: four nearby nodes (three corporations, one individual) send money to a central corporate node in January through early February, and the central node later sends money out to four other nodes (two corporations, two individuals) in February.
Figure 2. Scatter-gather pattern through two central entities: an originating individual sends money out to intermediate nodes before August 17, and on and after August 17 those nodes forward the money to a second central individual.
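To make the graph representation concrete, here is a toy sketch (illustrative only, not the production feature pipeline; node names and amounts are made up) that builds a directed transaction graph with networkx and aggregates per-node inflow and outflow by month, one simple way to surface gather-scatter candidates like the pattern in Figure 1.

from collections import defaultdict
import networkx as nx

# Toy transactions: (sender, receiver, amount, month); all values are hypothetical.
transactions = [
    ("Corp_A", "Corp_X", 9_000, "Jan"), ("Indiv_B", "Corp_X", 8_500, "Jan"),
    ("Corp_C", "Corp_X", 9_200, "Jan"), ("Corp_D", "Corp_X", 8_800, "Jan"),
    ("Corp_X", "Indiv_E", 8_000, "Feb"), ("Corp_X", "Corp_F", 9_000, "Feb"),
    ("Corp_X", "Indiv_G", 9_500, "Feb"), ("Corp_X", "Corp_H", 9_000, "Feb"),
]

G = nx.MultiDiGraph()
for sender, receiver, amount, month in transactions:
    G.add_edge(sender, receiver, amount=amount, month=month)

def monthly_flows(graph, node):
    # Sum incoming and outgoing amounts per month for one node.
    inflow, outflow = defaultdict(float), defaultdict(float)
    for _, _, data in graph.in_edges(node, data=True):
        inflow[data["month"]] += data["amount"]
    for _, _, data in graph.out_edges(node, data=True):
        outflow[data["month"]] += data["amount"]
    return dict(inflow), dict(outflow)

# A node that mostly gathers money in one month and scatters it in the next
# is a gather-scatter candidate worth turning into a model feature.
print(monthly_flows(G, "Corp_X"))

In production, per-node aggregates like these (or learned graph embeddings) are computed at scale and stored alongside tabular features for the model to consume.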

Based on tabular features as well as graph features, DL-based GAN methods can detect such fraud patterns; the example here uses Hopsworks on NVIDIA GPUs. Such methods can coexist with rule-based techniques, leading to better results and accuracy, as reflected in the confusion matrix discussed next.

Challenges in modelling fraud as a binary classification problem

Figure 3 shows the confusion matrix of a financial fraud binary classifier. For problems such as money laundering, false negatives should be weighted significantly higher. Use a variant of the F1 score to evaluate models: precision, recall, and fallout should not be weighted equally.

Figure 3. Confusion matrix of a financial fraud binary classifier: true positives (fraud correctly flagged) and true negatives (not-fraud correctly cleared) lie on the diagonal, while false positives (not-fraud predicted as fraud) and false negatives (fraud predicted as not-fraud) lie off the diagonal.
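One way to encode that weighting, as a sketch: the F-beta score generalizes F1 so that recall counts beta times as much as precision, penalizing false negatives more heavily; beta = 2 and the confusion-matrix counts below are illustrative choices, not values from this post.

def f_beta(tp, fp, fn, beta=2.0):
    # beta > 1 weights recall (catching fraud) more than precision,
    # so missed fraud hurts the score more than a false alarm.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Illustrative counts for a rare-fraud problem
print(f_beta(tp=40, fp=400, fn=10))           # recall-weighted score, about 0.31
print(f_beta(tp=40, fp=400, fn=10, beta=1))   # plain F1 for comparison, about 0.16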

There are other challenges in detecting money laundering patterns:

  • Massive class imbalance—Transactions labeled as suspicious may be less than 0.0001% of total historical transactions.
  • Non-stationarity—New money-laundering schemes are constantly being invented. To identify new patterns as they appear, techniques must be able to adapt themselves or be easily adapted.

Logical Clocks, developers of the open-source Hopsworks platform, have published as open source a full end-to-end example for detecting fraud:

  • A sample raw dataset of financial transactions
  • Feature engineering programs to compute complex features such as graph embedding and store them in a feature store
  • Notebooks to find good hyperparameters for the GANs
  • Distributed training of a GAN using many GPUs.

The code can be reproduced on any Hopsworks cluster, including managed Hopsworks clusters available on AWS, Microsoft Azure, and on-premises installations of Hopsworks with NVIDIA GPUs. Hopsworks clusters can manage up to thousands of GPUs and allocate them to applications on-demand.

NVIDIA GPUs for accelerating financial data science

Identifying fraud and money laundering among large numbers of customer records is a classic financial machine learning (ML) and deep learning (DL) use case. Because training requires many trillions of floating point operations, applying GPUs accelerates the neural network training process significantly. Many data scientists know that NVIDIA GPUs have been helping in the ML training process for several years.

When the neural network training is complete and the inference phase becomes more important, the recently introduced open-source software NVIDIA Triton Inference Server can help simplify and manage inference acceleration and production model deployment. Triton Server can run as a Docker container, on bare metal, or inside a virtual machine in a virtualized environment. Hopsworks supports serving models on Triton Server using KFServing.

Hopsworks supports ML/DL training using TensorFlow, PyTorch, and Scikit-Learn, with additional support for transparent data-parallel training, hyperparameter tuning, and parallel ablation studies on TensorFlow and PyTorch using the Maggy framework. Hopsworks works with multi-GPU, single-node systems as well as clusters of multi-GPU systems. DGX A100 systems are now the universal system for AI infrastructure, including distributed training on GPUs. Each DGX A100 system provides the following configuration:

  • 8x NVIDIA A100 Tensor Core GPUs
  • 80 GB of GPU memory per GPU, for a total of 640 GB
  • SXM (NVLink) form factor
  • Connected with NVIDIA NVSwitch
  • 5 petaFLOPS of AI performance (10 petaOPS INT8)

Multi-GPU, multi-node DGX A100 systems constitute superpods on the Hopsworks platform, which can accelerate DL training and inferencing workloads considerably. You can achieve similar configurations on NVIDIA GPUs by working with the NVIDIA Partner Network (NPN) of OEM partners, system integrators, and value-added resellers.

Figure 4 shows the architecture of DL systems using Hopsworks that leverage data-parallel distributed GPU training with TensorFlow CollectiveAllReduceStrategy. The Maggy framework, supported in Hopsworks, simplifies development with CollectiveAllReduceStrategy on multi-GPU, multi-node systems by managing the distribution transparently through Spark. Large clusters additionally benefit from GPU interconnects using NVSwitch. In the future, such an architecture will also be possible using the NVIDIA RAPIDS.ai framework and Spark on GPUs.
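As a rough sketch of what that data-parallel setup looks like in plain TensorFlow 2 (outside Hopsworks and Maggy), the snippet below uses tf.distribute.MultiWorkerMirroredStrategy, the public API built on collective all-reduce; it assumes each worker has TF_CONFIG set with the cluster specification, and the model and dataset are placeholders rather than the GAN described in this post.

import tensorflow as tf

# Each worker must export TF_CONFIG with the cluster spec and its own task
# before this script starts; Hopsworks/Maggy handle this bookkeeping for you.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

def make_dataset(global_batch_size=256):
    # Stand-in for the real financial-transactions input pipeline.
    features = tf.random.uniform([10_000, 32])
    labels = tf.random.uniform([10_000], maxval=2, dtype=tf.int32)
    return (tf.data.Dataset.from_tensor_slices((features, labels))
            .shuffle(10_000).batch(global_batch_size).repeat())

with strategy.scope():
    # Variables created here are mirrored, and gradients are all-reduced
    # across every GPU on every worker.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])

model.fit(make_dataset(), steps_per_epoch=100, epochs=3)

The sketch only shows the strategy scope that makes variables mirrored and gradients all-reduced; worker launch, TF_CONFIG generation, and GPU allocation are what the Hopsworks/Maggy layer automates.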

Optimizing distributed training using Hopsworks on NVIDIA-certified multi-GPU, multi-node systems

Figure 4. Multi-GPU, multi-node architecture to train GANs for the anomaly detection task: each CPU node hosts four GPUs, with NVLink providing the high-bandwidth GPU-to-GPU links within a node and the CPUs communicating across nodes.

For inferencing of trained models of various architectures, NVIDIA supports multiple inferencing workloads, concurrent applications, and instances of DL models with the NVIDIA Triton Inference Server framework, increasing GPU utilization. Hopsworks clients have used GANs, vision models, and other DL models requiring extensive distributed GPU training to develop cutting-edge AI systems.

In the following end-to-end money-laundering example from Logical Clocks, a GAN model for anomaly detection was trained on DGX systems using a multi-GPU, multi-node setup. Training with such setups can provide almost linear scaling, also known as strong scaling, to accelerate your DL training. Also, inferencing of such models using the Triton Server framework can use the GPUs efficiently.

Figure 5. Close to linear speedup when applying NVIDIA V100 GPUs to GAN training: throughput rises from about 9,000 samples per second on 1 GPU to about 31,000 samples per second on 4 GPUs.

You can also use other frameworks, including RAPIDS.ai, Spark on GPUs, and the NVIDIA graph analytics framework cuGraph, for accelerating such features on the GPU on the Hopsworks platform.

Get in touch with us

Teamwork is the key to engineering accurate financial fraud and money laundering solutions. Going from rule-based to model-based approaches is a common technology objective. The goal is to reduce the number of falsely classified outcomes that financial institutions may receive when using fraud detection or money laundering models in production. Nowadays, customers expect more accuracy from their financial firms in terms of preventing fraud and limiting false alarms.

For more information and to share your experiences with this important use case and state-of-the-art approach, please leave a comment below.

Categories
Misc

Why am I getting low accuracy and a big loss?

I chose the TensorFlow Estimator API for my implementation because it supports distributed training. Honestly, I found some code that was quite understandable, so I chose it to implement sensor-based signal recognition on multiple GPUs.

The code was originally for training on the MNIST dataset on multiple GPUs. I got an error after executing it, caused by the MNIST dataset download API no longer working. Here is the link to that code: https://github.com/shu-yusa/tensorflow-mirrored-strategy-sample/blob/master/cnn_mnist.py

I could not find any solution on Google. There might be multiple issues behind it; implementing it in TensorFlow 1 could be one of them. I tried to convert the code to TensorFlow 2. Most of the code converted; however, the tf.contrib-related parts could not be restored. So I decided to edit the code for sensor-based signals (time series).

However, when I ran the code, the accuracy was 30% and the loss value was large. On the other hand, when I implemented the CNN on the same dataset with the low-level TensorFlow API, I got 95% accuracy. I do not know why it gives low accuracy with the tf.estimator version. In my opinion, one possible reason is wrong input feeding to the CNN. Here is the code:

import tensorflow as tf

# segment_size and num_input_channels are defined earlier in the original script.
def cnn_model_fn(features, labels, mode):
    """Model function for CNN."""
    # Input layer: reshape X to a 4-D tensor [batch_size, height, width, channels].
    # Each sample is 1 x segment_size with three channels (accelerometer x, y, z).
    input_layer = tf.reshape(features["x"], [-1, 1, segment_size, num_input_channels])

    # Convolutional layer #1: 32 filters of size [1, 12] with ReLU activation.
    # 'same' padding preserves the input width and height.
    conv1 = tf.compat.v1.layers.conv2d(
        inputs=input_layer,
        filters=32,
        kernel_size=[1, 12],
        padding="same",
        activation=tf.nn.relu)

    # Pooling layer #1: [1, 4] window with stride 2.
    pool1 = tf.compat.v1.layers.max_pooling2d(
        inputs=conv1, pool_size=[1, 4], strides=2, padding="same")

    # Convolutional layer #2: 64 filters of size [1, 12].
    conv2 = tf.compat.v1.layers.conv2d(
        inputs=pool1,
        filters=64,
        kernel_size=[1, 12],
        padding="same",
        activation=tf.nn.relu)

    # Pooling layer #2: [1, 4] window with stride 2.
    pool2 = tf.compat.v1.layers.max_pooling2d(
        inputs=conv2, pool_size=[1, 4], strides=2, padding="same")

    # Flatten the tensor into a batch of vectors.
    # The hard-coded 50 assumes the time axis has been reduced to 50 by the two poolings.
    pool2_flat = tf.reshape(pool2, [-1, 1 * 50 * 64])

    # Dense layer with 1024 neurons.
    dense = tf.compat.v1.layers.dense(inputs=pool2_flat, units=1024, activation=tf.nn.relu)

    # Dropout: 0.6 probability that each element is kept (only active during training).
    dropout = tf.compat.v1.layers.dropout(
        inputs=dense, rate=0.4, training=mode == tf.estimator.ModeKeys.TRAIN)

    # Logits layer: 6 units because this dataset has 6 classes.
    logits = tf.compat.v1.layers.dense(inputs=dropout, units=6)

    predictions = {
        # Generate predictions (for PREDICT and EVAL mode).
        "classes": tf.argmax(input=logits, axis=1),
        # Add `softmax_tensor` to the graph; used for PREDICT and by the `logging_hook`.
        "probabilities": tf.nn.softmax(logits, name="softmax_tensor")
    }

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

    # labels = tf.argmax(tf.cast(labels, dtype=tf.int32), 1)

    # Calculate loss (for both TRAIN and EVAL modes).
    loss = tf.compat.v1.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # To monitor training accuracy, these two lines can be enabled:
    # accuracy = tf.compat.v1.metrics.accuracy(labels=labels, predictions=predictions['classes'], name='acc_op')
    # tf.summary.scalar('accuracy', accuracy[1])

    # Configure the training op (for TRAIN mode).
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.compat.v1.train.GradientDescentOptimizer(learning_rate=0.001)
        train_op = optimizer.minimize(
            loss=loss,
            global_step=tf.compat.v1.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

    # Add evaluation metrics (for EVAL mode).
    eval_metric_ops = {
        "accuracy": tf.compat.v1.metrics.accuracy(
            labels, predictions=predictions["classes"])}
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)

After debugging:

In `def cnn_model_fn(features, labels, mode):`, features is {'x': <tf.Tensor 'IteratorGetNext:0' shape=(?, 600) dtype=float64>}, labels is Tensor("IteratorGetNext:1", shape=(?,), dtype=int64), and mode is the string 'train'.
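A quick shape check (my own sketch; segment_size = 200 and num_input_channels = 3 are values inferred from the (?, 600) feature tensor and the hard-coded 1 * 50 * 64 in pool2_flat, not confirmed by the post):

# Shape sanity check for the model function above.
# Assumed values, inferred rather than confirmed:
segment_size = 200
num_input_channels = 3

# The flat feature width must equal 1 * segment_size * num_input_channels.
print(1 * segment_size * num_input_channels)   # 600, matches shape=(?, 600)

# Two max-pool layers with stride 2 and 'same' padding halve the time axis twice.
pooled_width = segment_size // 2 // 2
print(pooled_width, pooled_width * 64)         # 50 and 3200, matches 1 * 50 * 64

If the real segment_size or channel count differs from these values, the initial tf.reshape silently mixes samples and channels, which could by itself cause poor accuracy, so it is worth verifying before looking at the optimizer or learning rate.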

Here is the result on test data:

Saving 'checkpoint_path' summary for global step 1000: /tmp/tmp77ffy2i9/model.ckpt-1000

{'accuracy': 0.3959022, 'loss': 1.698279, 'global_step': 1000}

Can anyone help me understand why my model is giving low accuracy and a huge loss value?

submitted by /u/Nafees060
[visit reddit] [comments]

Categories
Misc

Noob looking for help

Hello.

I’m looking to put on a project for school, a RNN to predict stock prices (not really innovative i know).

All the tutorials I see for learning how to build this kind of model get very strange feedback: 50% of it goes like “you rock, this is great” and the other 50% goes like “this makes no sense”.

Is this project something hard to do with Keras+TensorFlow?

Can anyone help me navigate the immense amount of documentation about TF to get me started?

Where can I find the maths behind these models?

submitted by /u/Impossible-Hunter224
[visit reddit] [comments]

Categories
Misc

Shackling Jitter and Perfecting Ping, How to Reduce Latency in Cloud Gaming

Looking to improve your cloud gaming experience? First, become a master of your network. Twitch-class champions in cloud gaming shred Wi-Fi and broadband waves. They cultivate good ping to defeat two enemies — latency and jitter. What Is Latency? Latency or lag is a delay in getting data from the device in your hands to …

The post Shackling Jitter and Perfecting Ping, How to Reduce Latency in Cloud Gaming appeared first on The Official NVIDIA Blog.