Categories
Misc

Using Network Graphs to Visualize Potential Fraud on Ethereum Blockchain

Beyond the unimaginable prices for monkey pictures, NFT’s underlying technology provides companies with a new avenue to directly monetize their online engagements. Major brands such as Adidas, NBA, and TIME have already begun experimenting with these revenue streams using NFTs–and we are still early in this trend.

As data practitioners, we are positioned to provide valuable insights into these revenue streams, given that all transactions are public on the blockchain. This post provides a guided project to access and analyze blockchain data and identify potential fraud using Python.

In this post and accompanying Jupyter notebook, I discuss the following:

  • The basics of blockchain, NFTs, and network graphs.
  • How to pull NFT data using the open-source package NFT Analyst Starter Pack from a16z.
  • How to interpret Ethereum blockchain data.
  • The fraudulent practice of wash trading NFTs.
  • Constructing network graphs to visualize potential wash trading on the NFT project Bored Ape Yacht Club.

The Jupyter notebook has a more detailed, step-by-step guide for writing Python code to implement this example walkthrough, and the post provides additional context. In addition, this post assumes that you have a basic understanding of pandas, data preparation, and data visualization.

What is blockchain data?

Cutting through the media frenzy of coins named after dogs and pixelated pictures selling for hundreds of thousands of dollars reveals fascinating technology: the blockchain. 

The following excerpt best describes this decentralized data source:

“At a very high level, blockchains are ledgers of transactions utilizing cryptography that can only add information and thus can’t be changed (i.e., immutable). What separates blockchains from ledgers you find at banks is a concept called ‘decentralization’— where every computer connected to a respective blockchain must ‘agree’ on the same state of the blockchain and subsequent data added to it.”

For more information about Ethereum blockchain data, see Making Sense of Ethereum Data for Analytics.

Central to this technology is that all the data (for example, logs, metadata, and so on) must be public and accessible. I highly recommend Stanford professor Dan Boneh’s lecture.

What is an NFT?

NFT stands for non-fungible token, a crypto asset on a blockchain (such as Ethereum) that represents a unique, one-of-one token that can be digitally owned. For example, gold bars are fungible, as multiple bars can exist and represent the same thing, while the original Mona Lisa is non-fungible in that only one exists.

Contrary to popular belief, NFTs are not just art and JPEGs but digital representations of ownership on a blockchain ledger for a unique item such as art, music, or whatever else the NFT creator wants to put into the metadata. For this post, however, we use the NFT project Bored Ape Yacht Club (BAYC), an artwork NFT.

P.S. If you are a visual learner, my favorite intro resource on the topic of NFTs is the What Are NFTs and How Can They Be Used in Decentralized Finance? DEFI Explained video by Finematics.

What is a network graph, and why is it well suited to representing blockchain data?

Networks are a method of organizing relationship data using nodes and edges. Nodes represent an entity, such as an email address or social media account, while the edges show the connection between nodes. 

Furthermore, you can store metadata for nodes and edges to represent different aspects of a relationship. Metadata can range from weights to labels. Figure 1 shows the steps of taking an entire network and zooming into a use case with helpful labels from metadata.

Source: Graphs from notebook tutorial.
Figure 1. The various network graphs created in this post

What makes network graphs ideal for representing blockchain transactions is that there is always a to and a from blockchain address, as well as significant metadata (for example, timestamps and coin amounts) for each transaction. Furthermore, as blockchain data is public by design through decentralization, you can use network graphs to visualize economic behaviors on a respective blockchain.

In this example, I want to demonstrate identifying the fraudulent behavior of wash trading, where an individual intentionally sells an asset to themselves across multiple accounts to artificially inflate the asset’s price.

Chainalysis wrote an excellent report on the phenomenon, identifying over 260 Ethereum crypto wallets potentially engaging in wash trading, with a collective profit of over $8.4 million in 2021 alone.

Pulling data from the Ethereum blockchain

Though all blockchain data is publicly available to anyone, it is still tough to access and prepare for analysis. Following are some options to access blockchain data:

  • Create your own blockchain node (for example, become a miner) to read the rawest data available.
  • Use a third-party tool to create your own blockchain node.
  • Use a third-party API to read raw data from their own blockchain node.
  • Use a third-party API to read cleaned and aggregated blockchain data from their service.
  • Use the open-source package NFT Analyst Starter Pack from a16z.

Although all are viable options, each has a tradeoff between reliability, trust, and convenience.

For example, I worked on an NFT analytics project where we wanted to create a reliable NFT market dashboard. Unfortunately, having our own blockchain node was cost-prohibitive, many third-party data sources had various data-quality issues that we couldn’t control, and it became challenging to track transactions across multiple blockchains. That project ultimately required bringing together high-quality data from numerous third-party APIs.

Thankfully, for this project you want the most convenient option possible so that you can focus on learning, so I recommend the NFT Analyst Starter Pack from a16z. Think of this package as a convenient wrapper around the third-party blockchain API from Alchemy that creates easy-to-use CSV files for your desired NFT contract.

Preparing data and creating network graphs

The NFT Analyst Starter Pack results in three separate CSV files for the BAYC NFT project:

  • BAYC Metadata: Information regarding a specific NFT, where asset_id is the unique identifier within this NFT token.
  • BAYC Sales: Logs and metadata for each sale, identified by its transaction hash, with seller and buyer fields indicating the wallets involved.
  • BAYC Transfers: The same structure as the BAYC Sales data, but for transfers in which no money moves from one wallet to another.

For this project, most of the data preparation (sketched in code after this list) is around:

  • Reorganizing BAYC Sales and BAYC Transfers to enable a clean union of the two datasets.
  • Deleting duplicate logs of transfer transactions already represented in sales.
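
A minimal sketch of that preparation, using hypothetical file and column names (the starter pack's actual CSV headers may differ):

import pandas as pd

# Hypothetical CSV names for the starter pack's output.
sales_df = pd.read_csv("sales.csv")
transfers_df = pd.read_csv("transfers.csv")

# Align the two datasets on a shared schema before unioning them.
sales_df = sales_df.rename(columns={"seller": "from_address", "buyer": "to_address"})
sales_df["transaction_type"] = "sell"
transfers_df["transaction_type"] = "transfer"
transactions = pd.concat([sales_df, transfers_df], ignore_index=True)

# Drop transfer rows whose transaction_hash is already logged as a sale;
# sorting puts 'sell' before 'transfer', so keep='first' keeps the sale.
transactions = transactions.sort_values("transaction_type")
transactions = transactions.drop_duplicates(subset="transaction_hash", keep="first")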

Given that the goal is to learn, don’t worry about whether the blockchain data is accurate, but you can always check for yourself by searching the transaction_hash value on Etherscan.

After preparing the data, use the NetworkX package to generate a network graph data structure of your NFT transactions. There are multiple ways to construct a graph, but in my opinion, the most straightforward approach is to use the function from_pandas_edgelist, where you just provide a pandas DataFrame, the to and from values to represent the nodes, and any metadata for edges and labeling.
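
A hedged sketch of that call, reusing the hypothetical transactions DataFrame and column names from the preparation step. Printing the first two edges produces output like the listing that follows:

import networkx as nx

# Use a directed multigraph, since the same pair of wallets can transact repeatedly.
G = nx.from_pandas_edgelist(
    transactions,
    source="from_address",
    target="to_address",
    edge_attr=True,                  # carry all remaining columns as edge metadata
    create_using=nx.MultiDiGraph(),
)
print(list(G.edges(data=True))[:2])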

[('0x2fdcca65899346af3a93a8daa6128bdbcb1ce3b3',
  '0xcedf17dfafa947cd0e205fe2a3a183cf2fb3a0bc',
  {'transaction_hash': '0xb235f0321b0b50198399ec7f2bb759ef625f85673b4d90d68f711229750181e4',
   'block_number': '14675897',
   'date': '2022-04-28',
   'asset_id': '7438',
   'sale_price_eth': 153.2,
   'sale_price_usd': 442685.5285671361,
   'transaction_type': 'sell',
   'asset_contract': '0xbc4ca0eda7647a8ab7c2061c2e118a18a936f13d'}),
('0x2fdcca65899346af3a93a8daa6128bdbcb1ce3b3',
  '0xd8fdd6031fa27194f93e1a877f8bf5bfc9b47e1e',
  {'transaction_hash': '0x7b4797061eb16d73a28a869e51745e471e2849a55c80459b2aff7f0205925d74',
   'block_number': '14654313',
   'date': '2022-04-25',
   'asset_id': '5954',
   'sale_price_eth': 0.0,
   'sale_price_usd': 0.0,
   'transaction_type': 'transfer',
   'asset_contract': '0xbc4ca0eda7647a8ab7c2061c2e118a18a936f13d'})]

From this prepared data, the NetworkX package makes visualizing network graphs as easy as nx.draw, but with over 40k transactions in the DataFrame, visualizing the whole graph returns only a useless blob. Therefore, you must be specific about exactly what to visualize among your transactions to create a captivating data story.

Visualizing potential wash trading

Rather than scouring through the transactions of 10,000 NFTs, you can instead validate what others are stating within the market. Notably, the NFT Wash Trading – Is it possible to protect against it? post calls out BAYC token 8099 as potentially being subjected to the fraudulent practice of wash trading.

If you follow along in the accompanying notebook, you carry out these steps (condensed in the sketch after this list):

  • Filter prepared NFT data to only rows containing logs for asset_id 8099.
  • Rename the to and from wallet addresses to uppercase letters ordered sequentially following when the wallet address first appeared among the NFT asset’s transactions.
  • Generate network graph data with the prepared asset 8099 data using the NetworkX package.
  • Plot the network graph with desired labels, edge arrows, and node positioning.
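
Here is a condensed sketch of those steps, again under the assumed column names (the notebook is the authoritative version):

import string
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# Keep only the rows for asset 8099.
asset_df = transactions[transactions["asset_id"] == "8099"].copy()

# Relabel wallets A, B, C, ... in order of first appearance.
wallets = pd.unique(asset_df[["from_address", "to_address"]].values.ravel())
letters = dict(zip(wallets, string.ascii_uppercase))
asset_df["from_label"] = asset_df["from_address"].map(letters)
asset_df["to_label"] = asset_df["to_address"].map(letters)

# Build and draw the per-asset graph.
G_asset = nx.from_pandas_edgelist(
    asset_df, source="from_label", target="to_label",
    edge_attr=True, create_using=nx.MultiDiGraph(),
)
pos = nx.spring_layout(G_asset, seed=42)   # deterministic node positions
nx.draw(G_asset, pos, with_labels=True, arrows=True, node_color="lightblue")
plt.show()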

Did BAYC 8099 NFT experience wash trading?

The data plotted in Figure 2 enables you to visualize the transactions corresponding to asset 8099. Starting at node H, you can see that this wallet first increases the price from $95k to $166k between H and I, then adds more transactions through transfers between H and J. Finally, node H sells the potentially artificially inflated NFT to node K.

Figure 2. Following the transactions of NFT BAYC 8099

Though this graph can’t definitively state that node H engaged in wash trading—as you don’t know whether nodes H, I, and J are wallets owned by the same person—seeing loops for a node where the price increases should flag the need to do more due diligence. For example, you could look at etherscan.com to review the transactions between the following wallets:

  • 0xe4bc96b24e0bdf87b4b92ed39c1aef8839b090dd (node H).
  • 0x7e99611cf208cb097497a59b3fb7cb4dfd115ea9 (node I).
  • 0xcbc9f463f83699d20dd5b54be5262be69a0aea9f (node J).

Perhaps node H had sellers’ remorse and wanted their NFT back, as it’s not uncommon for investors to get attached to their beloved NFTs. But numerous transactions between the wallets associated with nodes H, I, and J could indicate further red flags for the NFT asset.
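
To go beyond eyeballing the plot, you can also flag cycles programmatically. A minimal sketch using NetworkX's simple_cycles on the per-asset graph built above (the printed output is illustrative):

# Collapse parallel edges; simple_cycles works on a plain DiGraph.
simple_graph = nx.DiGraph(G_asset)

# An NFT that returns to a previous owner forms a directed cycle --
# a pattern worth extra due diligence, though not proof of wash trading.
cycles = [c for c in nx.simple_cycles(simple_graph) if len(c) > 1]
print(cycles)   # for example: [['H', 'I'], ['H', 'J']]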

Next steps

By following along with this post and accompanying notebook, you have seen how to access Ethereum blockchain data and analyze NFTs through network graphs. If you enjoyed this analysis and are interested in doing more projects like this, chat with me in the CharlieDAO Discord, where I hang out. We are a collective of software engineers, data scientists, and crypto natives exploring web3!

Disclaimer

This content is for educational purposes only and is not financial advice. I do not have any financial ties to the analyzed NFTs at the moment of releasing the notebook, and here is my crypto wallet address on Etherscan for reference. This analysis only highlights potential fraud to investigate further but does not prove fraud has taken place. Finally, if you own crypto, never share your “secret recovery phrase” or “private keys” with anyone.

Categories
Misc

Running Large-Scale Graph Analytics with Memgraph and NVIDIA cuGraph Algorithms

With the latest Memgraph Advanced Graph Extensions (MAGE) release, you can now run GPU-powered graph analytics from Memgraph in seconds, while working in Python. Powered by NVIDIA cuGraph, the following graph algorithms now execute on the GPU:

  • PageRank (graph analysis)
  • Louvain (community detection)
  • Balanced Cut (clustering)
  • Spectral Clustering (clustering)
  • HITS (hubs versus authorities analytics)
  • Leiden (community detection)
  • Katz centrality
  • Betweenness centrality

This tutorial will show you how to use PageRank graph analysis and Louvain community detection to analyze a Facebook dataset containing 1.3 million relationships.

By the end of the tutorial, you will know how to:

  • Import data inside Memgraph using Python
  • Run analytics on large scale graphs and get fast results
  • Run analytics on NVIDIA GPUs from Memgraph

Tutorial prerequisites

To follow this graph analytics tutorial, you will need an NVIDIA GPU, driver, and container toolkit. Once you have successfully installed the NVIDIA GPU driver and container toolkit, you must also install the following four tools:

  • Docker
  • JupyterLab
  • GQLAlchemy
  • Memgraph Lab

The next section walks you through installing and setting up these tools for the tutorial. 

Docker

Docker is used to install and run the mage-cugraph Docker image. There are three steps involved in setting up and running the Docker image: 

  1. Download Docker
  2. Download the tutorial data
  3. Run the Docker image, giving it access to the tutorial data

1. Download Docker

You can install Docker by visiting the Docker webpage and following the instructions for your operating system. 

2. Download the tutorial data

Before running the mage-cugraph Docker image, first download the data that will be used in the tutorial. This allows you to give the Docker image access to the tutorial dataset when run.  

To download the data, use the following commands to clone the jupyter-memgraph-tutorials GitHub repo and move into the jupyter-memgraph-tutorials/cugraph-analytics folder:

git clone https://github.com/memgraph/jupyter-memgraph-tutorials.git
cd jupyter-memgraph-tutorials/cugraph-analytics

3. Run the Docker image

You can now use the following command to run the Docker image and mount the tutorial data to the /samples folder:

docker run -it -p 7687:7687 -p 7444:7444 --volume /data/facebook_clean_data/:/samples mage-cugraph

When you run the Docker container, you should see the following message:

You are running Memgraph vX.X.X
To get started with Memgraph, visit https://memgr.ph/start

With the mount command executed, the CSV files needed for the tutorial will be located inside the /samples folder within the Docker image, where Memgraph will find them when needed.

Jupyter notebook

Now that Memgraph is running, install Jupyter. This tutorial uses JupyterLab, and you can install it with the following command:

pip install jupyterlab

Once JupyterLab is installed, launch it with the following command:

jupyter lab

GQLAlchemy 

Use GQLAlchemy, an Object Graph Mapper (OGM), to connect to Memgraph and execute Cypher queries from Python. You can think of Cypher as SQL for graph databases: it contains many similar constructs for creating, updating, and deleting data.

Install CMake on your system, and then install GQLAlchemy with pip:

pip install gqlalchemy

Memgraph Lab 

The last prerequisite you need to install is Memgraph Lab. You will use it to create data visualizations upon connecting to Memgraph. Learn how to install Memgraph Lab as a desktop application for your operating system.

With Memgraph Lab installed, you should now connect to your Memgraph database.

At this point, you are finally ready to:

  • Connect to Memgraph with GQLAlchemy
  • Import the dataset
  • Run graph analytics in Python

Connect to Memgraph with GQLAlchemy

First, position yourself in the Jupyter notebook. The first three lines of code import gqlalchemy, connect to the Memgraph database instance at host 127.0.0.1 and port 7687, and clear the database so that you start with a clean slate.

from gqlalchemy import Memgraph
memgraph = Memgraph("127.0.0.1", 7687)
memgraph.drop_database()

Next, you will import the dataset from CSV files and then perform PageRank and Louvain community detection using Python.

Import data

The Facebook dataset consists of eight CSV files, each having the following structure:

node_1,node_2
0,1794
0,3102
0,16645

Each record represents an edge connecting two nodes.  Nodes represent the pages, and relationships are mutual likes among them.

There are eight distinct types of pages (Government, Athletes, and TV shows, for example). Pages have been reindexed for anonymity, and all pages have been verified for authenticity by Facebook.

Since Memgraph imports data faster when indexes are present, create an index for all nodes with the label Page on the id property.

memgraph.execute(
    """
    CREATE INDEX ON :Page(id);
    """
)

The Docker container already has access to the data used in this tutorial, so you can list the local files in the ./data/facebook_clean_data/ folder. By concatenating each file name with the /samples/ folder, you can determine the paths inside the container and use them to load data into Memgraph.

import os
from os import listdir
from os.path import isfile, join
csv_dir_path = os.path.abspath("./data/facebook_clean_data/")
csv_files = [f"/samples/{f}" for f in listdir(csv_dir_path) if isfile(join(csv_dir_path, f))]

Load all CSV files using the following query:

for csv_file_path in csv_files:
    memgraph.execute(
        f"""
        LOAD CSV FROM "{csv_file_path}" WITH HEADER AS row
        MERGE (p1:Page {{id: row.node_1}}) 
        MERGE (p2:Page {{id: row.node_2}}) 
        MERGE (p1)-[:LIKES]->(p2);
        """
    )

For more information about importing CSV files with LOAD CSV, see the Memgraph documentation.

Next, use PageRank and Louvain community detection algorithms with Python to determine which pages in the network are most important, and to find all the communities in a network.

PageRank importance analysis

To identify important pages in the Facebook dataset, you will execute PageRank. Learn about the different algorithm settings that can be configured when calling PageRank.

Note that you will also find other algorithms integrated within MAGE, which is designed to simplify running graph analytics on large-scale graphs. Find other Memgraph tutorials on how to run these analytics.

The following query will first execute the cugraph.pagerank algorithm, and then create and set the rank property of each node to the value the algorithm returns. A later query retrieves that property as the variable rank. Note that this test (and all tests presented here) was executed on an NVIDIA GeForce GTX 1650 Ti GPU and an Intel Core i5-10300H CPU at 2.50GHz with 16GB RAM, and returned results in around four seconds.

memgraph.execute(
    """
    CALL cugraph.pagerank.get() YIELD node, rank
    SET node.rank = rank;
    """
)

Next, retrieve ranks using the following Python call:

results = memgraph.execute_and_fetch(
    """
    MATCH (n)
    RETURN n.id as node, n.rank as rank
    ORDER BY rank DESC
    LIMIT 10;
    """
)
for dict_result in results:
    print(f"node id: {dict_result['node']}, rank: {dict_result['rank']}")

node id: 50493, rank: 0.0030278728385218327
node id: 31456, rank: 0.0027350282311318468
node id: 50150, rank: 0.0025153975342989345
node id: 48099, rank: 0.0023413620866201052
node id: 49956, rank: 0.0020696403564964
node id: 23866, rank: 0.001955167533390466
node id: 50442, rank: 0.0019417018181751462
node id: 49609, rank: 0.0018211204462452515
node id: 50272, rank: 0.0018123518843272954
node id: 49676, rank: 0.0014821440895415787

This code returns 10 nodes with the highest rank score. Results are available in a dictionary form.
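
As a small convenience sketch (not part of the original tutorial), the yielded dictionaries load straight into a pandas DataFrame if you prefer tabular analysis. Note that the previous iterator is already consumed, so fetch again:

import pandas as pd

results = memgraph.execute_and_fetch(
    """
    MATCH (n)
    RETURN n.id as node, n.rank as rank
    ORDER BY rank DESC
    LIMIT 10;
    """
)
ranks_df = pd.DataFrame(list(results))   # execute_and_fetch yields dicts
print(ranks_df.head())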

Now, it is time to visualize results with Memgraph Lab. In addition to creating beautiful visualizations powered by D3.js and our Graph Style Script language, you can use Memgraph Lab to:

  • Query the graph database and write your own graph algorithms in Python, C++, or even Rust
  • Check Memgraph Database Logs
  • Visualize graph schema

Memgraph Lab comes with a variety of pre-built datasets to help you get started. Open the Execute Query view in Memgraph Lab and run the following query:

MATCH (n)
WITH n
ORDER BY n.rank DESC
LIMIT 3
MATCH (n)<-[e]-(m)
RETURN *;

The first part of this query will MATCH all the nodes. The second part of the query will ORDER nodes by their rank in descending order.

For the first three nodes, obtain all pages connected to them. We need the WITH clause to connect the two parts of the query. Figure 1 shows the PageRank query results.

Figure 1. PageRank results visualized in Memgraph Lab

The next step is learning how to use Louvain community detection to find communities present in the graph.

Community detection with Louvain

The Louvain algorithm measures the extent to which the nodes within a community are connected, compared to how connected they would be in a random network.

It also recursively merges communities into a single node and executes the modularity clustering on the condensed graphs. This is one of the most popular community detection algorithms.
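
For reference, the modularity score that Louvain greedily maximizes can be written in its standard form (the general definition, not Memgraph-specific notation):

Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)

Here, A_{ij} is the adjacency matrix, k_i is the degree of node i, m is the total number of relationships, and \delta(c_i, c_j) equals 1 when nodes i and j share a community and 0 otherwise.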

Using Louvain, you can find the number of communities within the graph.  First execute Louvain and save the cluster_id as a property for every node:

memgraph.execute(
    """
    CALL cugraph.louvain.get() YIELD cluster_id, node
    SET node.cluster_id = cluster_id;
    """
)

To find the number of communities, run the following code:

results = memgraph.execute_and_fetch(
    """
    MATCH (n)
    WITH DISTINCT n.cluster_id as cluster_id
    RETURN count(cluster_id) as num_of_clusters;
    """
)

# We will get only one result.
result = list(results)[0]

# Remember that each result is a dict.
print(f"Number of clusters: {result['num_of_clusters']}")

Number of clusters: 2664

Next, take a closer look at some of these communities. For example, you may find nodes that belong to one community but are connected to another node that belongs to a different community. Louvain attempts to minimize the number of such nodes, so you should not see many of them. In Memgraph Lab, execute the following query:

MATCH (n2)<-[e1]-(n1)-[e]->(m1)
WHERE n1.cluster_id != m1.cluster_id AND n1.cluster_id = n2.cluster_id
RETURN *
LIMIT 1000;

This query will MATCH node n1 and its relationships to two other nodes, n2 and m1, with the patterns (n2)<-[e1]-(n1) and (n1)-[e]->(m1), respectively. Then, it will filter for only those nodes WHERE the cluster_id of n1 and n2 is not the same as the cluster_id of node m1.

Use LIMIT 1000 to show only 1,000 such relationships, for visualization simplicity.

Using Graph Style Script in Memgraph Lab, you can style your graphs to, for example, represent different communities with different colors. Figure 2 shows the Louvain query results. 

Figure 2. Louvain results visualized in Memgraph Lab

Summary

And there you have it: millions of nodes and relationships imported using Memgraph and analyzed using cuGraph PageRank and Louvain graph analytics algorithms. With GPU-powered graph analytics from Memgraph, powered by NVIDIA cuGraph, you are able to explore massive graph databases and carry out inference without having to wait for results. You can find more tutorials covering a variety of techniques on the Memgraph website.

Categories
Misc

Using Federated Learning to Bridge Data Silos in Financial Services

Unlocking the full potential of artificial intelligence (AI) in financial services is often hindered by the inability to ensure data privacy during machine learning (ML). For instance, traditional ML methods assume all data can be moved to a central repository.

This is an unrealistic assumption when dealing with data sovereignty and security considerations or sensitive data like personally identifiable information. More practically, it ignores data egress challenges and the considerable cost of creating large pooled datasets.

Massive internal datasets that would be valuable for training ML models remain unused. How can companies in the financial services industry leverage their own data while ensuring privacy and security?

This post introduces federated learning and explains its benefits for businesses handling sensitive datasets. We present three ways federated learning can be used in financial services and provide tips on getting started today.

What is federated learning?

Federated learning is an ML technique that enables the extraction of insights from multiple isolated datasets—without needing to share or move that data into a central repository or server.

For example, assume you have multiple datasets you want to use to train an AI model. Today’s standard ML approach requires first gathering all the training data in one place. However, this approach is not feasible for much of the world’s sensitive data, which leaves many datasets and use cases off-limits for applying AI techniques.

On the other hand, federated learning does not assume that one unified dataset can be created. The distributed training datasets are instead left where they are.

The approach involves creating multiple versions of the model and sending one to each server or device where the datasets live. Each site trains the model locally on its subset of the data, and then sends only the model parameters back to a central server. This is the key feature of federated learning: only the model updates or parameters are shared, not the training data itself. This preserves data privacy and sovereignty. 

Finally, the central server collects all the updates from each site and intelligently aggregates the “mini-models” into one global model. This global model can capture the insights from the entire dataset, even when actual data cannot be combined.

Note that these local sites could be servers, edge devices like smartphones, or any machine that can train locally and send back the model updates to the central server.
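
To make the loop concrete, here is a self-contained toy sketch of federated averaging (FedAvg-style aggregation) across two simulated sites; a real deployment would use a framework such as NVFLARE rather than this hand-rolled loop:

import numpy as np

rng = np.random.default_rng(0)

def local_update(w, X, y, lr=0.1, steps=10):
    # One round of local training: plain gradient descent on this site's data.
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def federated_average(weights, sizes):
    # Aggregate local models, weighting each site by its share of the data.
    total = sum(sizes)
    return sum(w * (n / total) for w, n in zip(weights, sizes))

# Two isolated "sites" drawn from the same underlying linear model.
true_w = np.array([2.0, -1.0])
X1, X2 = rng.normal(size=(50, 2)), rng.normal(size=(80, 2))
y1, y2 = X1 @ true_w, X2 @ true_w

global_w = np.zeros(2)
for _ in range(5):
    # Each site trains locally; only model weights leave the site.
    w1 = local_update(global_w.copy(), X1, y1)
    w2 = local_update(global_w.copy(), X2, y2)
    global_w = federated_average([w1, w2], [len(y1), len(y2)])

print(global_w)   # approaches [2.0, -1.0] without pooling the raw data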

Advantages of privacy-preserving technology

Large-scale collaborations in healthcare have demonstrated the real-world viability of using federated learning for multiple independent parties to jointly train an AI model. However, federated learning is not just about collaborating with external partners.

In financial institutions, we see an incredible opportunity for federated learning to bridge internal data silos. Company-wide ROI can increase as businesses gather all viable data for new products, including recommender systems, fraud detection systems, and call center analytics.

Privacy concerns, however, aren’t limited to financial data. The wave of data privacy legislation being enacted worldwide today (starting with GDPR in Europe and CCPA in California, with many similar laws coming soon) will only accelerate the need for privacy-preserving ML techniques in all industries.

Expect federated learning to become an essential part of the AI toolset in the years ahead.

Practical business use cases 

ML algorithms are hungry for data. Furthermore, the real-world performance of your ML model depends not only on the amount of data but also on the relevance of the training data.

Many organizations could improve current AI models by incorporating new datasets that cannot be easily accessed without sacrificing privacy. This is where federated learning comes in. 

Federated learning enables companies to leverage new data resources without requiring data sharing.

Broadly, three types of use cases are enabled by federated learning: 

  • Intra-company: Bridging internal data silos
  • Inter-company: Facilitating collaboration between organizations
  • Edge computing: Learning across thousands of edge devices

Intra-company use case: Leverage siloed internal data

There are many reasons a single company might rely on multiple data storage solutions. For example:

  1. Data governance rules such as GDPR may require that data be kept in specific geolocations and specify retention and privacy policies. 
  2. Mergers and acquisitions come with new data from the partner company. Still, the arduous task of integrating that data into existing storage systems often leaves the data dispersed for long periods.
  3. Both on-premise and hybrid cloud storage solutions are used, and there is a high cost of moving large amounts of data. 

Federated learning enables your company to leverage ML across isolated datasets in different business organizations, geographic regions, or data warehouses while preserving privacy and security.

Figure 1. A workflow of an intra-company federated learning use case. A federated server stores the global model and receives parameters from client nodes.

Inter-company use case: Collaborate with external partners

Gathering enough quantitative data to build powerful AI models is difficult for a single company. Consider an insurance company building an effective fraud detection system. The company can only collect data from observed events such as customers filing a claim. Yet, this data may not be representative of the entire population and can therefore potentially contribute to AI model bias.

To build an effective fraud detection system, the company needs larger datasets with more diverse data points to train robust, generalizable models. Many organizations can benefit from pooling data with others. Practically, most organizations will not share their proprietary datasets on a common supercomputer or cloud server.

Source: Graphic created by author Annika Brundyn.
Figure 2. A workflow of an inter-company federated learning use case. A federated server stores the global model and receives parameters from client nodes.

Enabling this type of collaboration for industry-wide challenges could bring massive benefits. 

For example, in one of the largest real-world federated collaborations, we saw 20 independent hospitals across five continents train an AI model for predicting the oxygen needs of COVID-19 infected patients. On average, the hospitals saw a 38% improvement in generalizability and a 16% improvement in model performance by participating in the federated system.

Likewise, there is a real opportunity to maintain customer privacy while credit card networks reduce fraudulent activity and banks employ anti-money laundering initiatives. Federated learning increases the data available to a single bank, which can help address issues such as money laundering activities in correspondent banking.

Edge computing: Smartphones and IoT

Google originally introduced federated learning in 2017 to train AI models on personal data distributed across billions of mobile devices. In 2022, many more devices are connected to the internet, including smartwatches, home assistants, alarm systems, thermostats, and even cars.

Federated learning is useful for all kinds of edge devices that are continuously collecting valuable data for ML models, but this data is often privacy sensitive, large in quantity, or both, which prevents logging to the data center. 

How does federated learning fit into an existing workflow?

It is important to note that federated learning is a general technique. Federated learning is not just about training neural networks; rather, it applies to data analysis, more traditional ML methods, or any other distributed workflow. 

Very few assumptions are built into federated learning, and perhaps only two are worth mentioning: 1) local sites can connect to a central server, and 2) each site has the minimum computational resources to train locally.

Beyond that, you are free to design your own application with custom local and global aggregation behavior. You can decide how much trust to place in different parties and how much is shared with the central server. The federated system is configurable to your specific business needs.

For example, federated learning can be paired with other privacy-preserving techniques like differential privacy (to add noise) and homomorphic encryption (to encrypt model updates and obscure what the central server sees).

Get started with federated learning 

We have developed a federated_learning code sample that shows you how to train a global fraud prediction model on two different splits of a credit card transactions dataset, corresponding to two different geographic regions. 

Although federated learning by definition enables training across multiple machines, this example is designed to simulate an entire federated system on a single machine for you to get up and running in under an hour. The system is implemented with NVFLARE, an NVIDIA open-source framework to enable federated learning.

Acknowledgments 

We would like to thank Patrick Hogan and Anita Weemaes for their contributions to this post.

Categories
Misc

AI Shows the Way: Seoul Robotics Helps Cars Move, Park on Their Own

Imagine driving a car — one without self-driving capabilities — to a mall, airport or parking garage, and using an app to have the car drive off to park itself. Software company Seoul Robotics is using NVIDIA technology to make this possible — turning non-autonomous cars into self-driving vehicles. Headquartered in Korea, the company’s initial…

Categories
Misc

Smart Devices, Smart Manufacturing: Pegatron Taps AI, Digital Twins

In the fast-paced field of making the world’s tech devices, Pegatron Corp. initially harnessed AI to gain an edge. Now, it’s on the cusp of creating digital twins to further streamline its efficiency. Whether or not they’re familiar with the name, most people have probably used smartphones, tablets, Wi-Fi routers or other products that Taiwan-based…

Categories
Offsites

Towards Helpful Robots: Grounding Language in Robotic Affordances

Over the last several years, we have seen significant progress in applying machine learning to robotics. However, robotic systems today are capable of executing only very short, hard-coded commands, such as “Pick up an apple,” because they tend to perform best with clear tasks and rewards. They struggle with learning to perform long-horizon tasks and reasoning about abstract goals, such as a user prompt like “I just worked out, can you get me a healthy snack?”

Meanwhile, recent progress in training language models (LMs) has led to systems that can perform a wide range of language understanding and generation tasks with impressive results. However, these language models are inherently not grounded in the physical world due to the nature of their training process: a language model generally does not interact with its environment nor observe the outcome of its responses. This can result in it generating instructions that may be illogical, impractical or unsafe for a robot to complete in a physical context. For example, when prompted with “I spilled my drink, can you help?” the language model GPT-3 responds with “You could try using a vacuum cleaner,” a suggestion that may be unsafe or impossible for the robot to execute. When asking the FLAN language model the same question, it apologizes for the spill with “I’m sorry, I didn’t mean to spill it,” which is not a very useful response. Therefore, we asked ourselves, is there an effective way to combine advanced language models with robot learning algorithms to leverage the benefits of both?

In “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances”, we present a novel approach, developed in partnership with Everyday Robots, that leverages advanced language model knowledge to enable a physical agent, such as a robot, to follow high-level textual instructions for physically-grounded tasks, while grounding the language model in tasks that are feasible within a specific real-world context. We evaluate our method, which we call PaLM-SayCan, by placing robots in a real kitchen setting and giving them tasks expressed in natural language. We observe highly interpretable results for temporally-extended complex and abstract tasks, like “I just worked out, please bring me a snack and a drink to recover.” Specifically, we demonstrate that grounding the language model in the real world nearly halves errors over non-grounded baselines. We are also excited to release a robot simulation setup where the research community can test this approach.

With PaLM-SayCan, the robot acts as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task.

A Dialog Between User and Robot, Facilitated by the Language Model
Our approach uses the knowledge contained in language models (Say) to determine and score actions that are useful towards high-level instructions. It also uses an affordance function (Can) that enables real-world grounding and determines which actions are possible to execute in a given environment. Using the PaLM language model, we call this PaLM-SayCan.

Our approach selects skills based on what the language model scores as useful to the high level instruction and what the affordance model scores as possible.

Our system can be seen as a dialog between the user and robot, facilitated by the language model. The user starts by giving an instruction that the language model turns into a sequence of steps for the robot to execute. This sequence is filtered using the robot’s skillset to determine the most feasible plan given its current state and environment. The model determines the probability of a specific skill successfully making progress toward completing the instruction by multiplying two probabilities: (1) task-grounding (i.e., a skill language description) and (2) world-grounding (i.e., skill feasibility in the current state).
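
As a toy illustration of that scoring rule (the numbers and skill names below are invented; the real system uses PaLM outputs and learned value functions):

import numpy as np

skills = ["find a sponge", "pick up the sponge", "go to the counter"]
lm_probs = np.array([0.60, 0.25, 0.15])          # task-grounding: p(skill | instruction)
affordance_probs = np.array([0.90, 0.10, 0.80])  # world-grounding: p(success | state)

# SayCan-style selection: multiply the two probabilities and take the best skill.
combined = lm_probs * affordance_probs
print(skills[int(np.argmax(combined))])   # "find a sponge"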

There are additional benefits of our approach in terms of its safety and interpretability. First, by allowing the LM to score different options rather than generate the most likely output, we effectively constrain the LM to only output one of the pre-selected responses. In addition, the user can easily understand the decision making process by looking at the separate language and affordance scores, rather than a single output.

PaLM-SayCan is also interpretable: at each step, we can see the top options it considers based on their language score (blue), affordance score (red), and combined score (green).

Training Policies and Value Functions
Each skill in the agent’s skillset is defined as a policy with a short language description (e.g., “pick up the can”), represented as embeddings, and an affordance function that indicates the probability of completing the skill from the robot’s current state. To learn the affordance functions, we use sparse reward functions set to 1.0 for a successful execution, and 0.0 otherwise.

We use image-based behavioral cloning (BC) to train the language-conditioned policies and temporal-difference-based (TD) reinforcement learning (RL) to train the value functions. To train the policies, we collected data from 68,000 demos performed by 10 robots over 11 months and added 12,000 successful episodes, filtered from a set of autonomous episodes of learned policies. We then learned the language-conditioned value functions using MT-Opt in the Everyday Robots simulator. The simulator complements our real robot fleet with a simulated version of the skills and environment, which is transformed using RetinaGAN to reduce the simulation-to-real gap. We bootstrapped simulation policies’ performance by using demonstrations to provide initial successes, and then continuously improved RL performance with online data collection in simulation.

Given a high-level instruction, our approach combines the probabilities from the language model with the probabilities from the value function (VF) to select the next skill to perform. This process is repeated until the high-level instruction is successfully completed.

Performance on Temporally-Extended, Complex, and Abstract Instructions
To test our approach, we use robots from Everyday Robots paired with PaLM. We place the robots in a kitchen environment containing common objects and evaluate them on 101 instructions to test their performance across various robot and environment states, instruction language complexity and time horizon. Specifically, these instructions were designed to showcase the ambiguity and complexity of language rather than to provide simple, imperative queries, enabling queries such as “I just worked out, how would you bring me a snack and a drink to recover?” instead of “Can you bring me water and an apple?”

We use two metrics to evaluate the system’s performance: (1) the plan success rate, indicating whether the robot chose the right skills for the instruction, and (2) the execution success rate, indicating whether it performed the instruction successfully. We compare two language models, PaLM and FLAN (a smaller language model fine-tuned on instruction answering) with and without the affordance grounding as well as the underlying policies running directly with natural language (Behavioral Cloning in the table below). The results show that the system using PaLM with affordance grounding (PaLM-SayCan) chooses the correct sequence of skills 84% of the time and executes them successfully 74% of the time, reducing errors by 50% compared to FLAN and compared to PaLM without robotic grounding. This is particularly exciting because it represents the first time we can see how an improvement in language models translates to a similar improvement in robotics. This result indicates a potential future where robotics is able to ride the wave of progress that we have been observing in language models, bringing these subfields of research closer together.

Algorithm             Plan    Execute
PaLM-SayCan           84%     74%
PaLM                  67%     –
FLAN-SayCan           70%     61%
FLAN                  38%     –
Behavioral Cloning    0%      0%
PaLM-SayCan halves errors compared to PaLM without affordances and compared to FLAN over 101 tasks.
SayCan demonstrated successful planning for 84% of the 101 test instructions when combined with PaLM.

If you’re interested in learning more about this project from the researchers themselves, please check out the video below:

Conclusion and Future Work
We’re excited about the progress that we’ve seen with PaLM-SayCan, an interpretable and general approach to leveraging knowledge from language models that enables a robot to follow high-level textual instructions to perform physically-grounded tasks. Our experiments on a number of real-world robotic tasks demonstrate the ability to plan and complete long-horizon, abstract, natural language instructions at a high success rate. We believe that PaLM-SayCan’s interpretability allows for safe real-world user interaction with robots. As we explore future directions for this work, we hope to better understand how information gained via the robot’s real-world experience could be leveraged to improve the language model and to what extent natural language is the right ontology for programming robots. We have open-sourced a robot simulation setup, which we hope will provide researchers with a valuable resource for future research that combines robotic learning with advanced language models. The research community can visit the project’s GitHub page and website to learn more.

Acknowledgements
We’d like to thank our coauthors Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Kelly Fu, Keerthana Gopalakrishnan, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy Zeng. We’d also like to thank Yunfei Bai, Matt Bennice, Maarten Bosma, Justin Boyd, Bill Byrne, Kendra Byrne, Noah Constant, Pete Florence, Laura Graesser, Rico Jonschkowski, Daniel Kappler, Hugo Larochelle, Benjamin Lee, Adrian Li, Suraj Nair, Krista Reymann, Jeff Seto, Dhruv Shah, Ian Storz, Razvan Surdulescu, and Vincent Zhao for their help and support in various aspects of the project. And we’d like to thank Tom Small for creating many of the animations in this post.

Categories
Misc

Digital Art Professor Kate Parsons Inspires Next Generation of Creators This Week ‘In the NVIDIA Studio’

Many artists can edit a video, paint a picture or build a model — but transforming one’s imagination into stunning creations can now involve breakthrough design technologies. Kate Parsons, a digital art professor at Pepperdine University and this week’s featured In the NVIDIA Studio artist, helped bring a music video for How Do I Get to Invincible to life using virtual reality and NVIDIA GeForce RTX GPUs.

Categories
Misc

NVIDIA GTC to Feature CEO Jensen Huang Keynote Announcing New AI and Metaverse Technologies, 200+ Sessions With Top Tech, Business Execs

Deep Learning Pioneers Yoshua Bengio, Geoff Hinton, Yann LeCun Among the Scores of Industry Experts to Present at World’s Premier AI Conference, Sept. 19-22SANTA CLARA, Calif., Aug. 15, 2022 …

Categories
Misc

From Sapling to Forest: Five Sustainability and Employment Initiatives We’re Nurturing in India

For over a decade, NVIDIA has invested in social causes and communities in India as part of our commitment to corporate social responsibility. Bolstering those efforts, we’re unveiling this year’s investments in five projects that have been selected by the NVIDIA Foundation team, focused on the areas of environmental conservation, ecological restoration, social innovation and job…

Categories
Offsites

Rax: Composable Learning-to-Rank Using JAX

Ranking is a core problem across a variety of domains, such as search engines, recommendation systems, or question answering. As such, researchers often utilize learning-to-rank (LTR), a set of supervised machine learning techniques that optimize for the utility of an entire list of items (rather than a single item at a time). A noticeable recent focus is on combining LTR with deep learning. Existing libraries, most notably TF-Ranking, offer researchers and practitioners the necessary tools to use LTR in their work. However, none of the existing LTR libraries work natively with JAX, a new machine learning framework that provides an extensible system of function transformations that compose: automatic differentiation, JIT-compilation to GPU/TPU devices and more.

Today, we are excited to introduce Rax, a library for LTR in the JAX ecosystem. Rax brings decades of LTR research to the JAX ecosystem, making it possible to apply JAX to a variety of ranking problems and combine ranking techniques with recent advances in deep learning built upon JAX (e.g., T5X). Rax provides state-of-the-art ranking losses, a number of standard ranking metrics, and a set of function transformations to enable ranking metric optimization. All this functionality is provided with a well-documented and easy to use API that will look and feel familiar to JAX users. Please check out our paper for more technical details.

Learning-to-Rank Using Rax
Rax is designed to solve LTR problems. To this end, Rax provides loss and metric functions that operate on batches of lists, not batches of individual data points as is common in other machine learning problems. An example of such a list is the multiple potential results from a search engine query. The figure below illustrates how tools from Rax can be used to train neural networks on ranking tasks. In this example, the green items (B, F) are very relevant, the yellow items (C, E) are somewhat relevant and the red items (A, D) are not relevant. A neural network is used to predict a relevancy score for each item, then these items are sorted by these scores to produce a ranking. A Rax ranking loss incorporates the entire list of scores to optimize the neural network, improving the overall ranking of the items. After several iterations of stochastic gradient descent, the neural network learns to score the items such that the resulting ranking is optimal: relevant items are placed at the top of the list and non-relevant items at the bottom.

Using Rax to optimize a neural network for a ranking task. The green items (B, F) are very relevant, the yellow items (C, E) are somewhat relevant and the red items (A, D) are not relevant.
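
To make this concrete, here is a hedged sketch of computing a listwise loss and a ranking metric for the list above (function names follow the Rax documentation; check the current release for exact signatures):

import jax.numpy as jnp
import rax

# Predicted scores for items A..F and graded relevance labels:
# 2.0 = very relevant, 1.0 = somewhat relevant, 0.0 = not relevant.
scores = jnp.asarray([0.1, 2.3, 1.1, -0.2, 0.9, 2.0])
labels = jnp.asarray([0.0, 2.0, 1.0, 0.0, 1.0, 2.0])

loss = rax.softmax_loss(scores, labels)   # listwise ranking loss
ndcg = rax.ndcg_metric(scores, labels)    # standard ranking metric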

Approximate Metric Optimization
The quality of a ranking is commonly evaluated using ranking metrics, e.g., the normalized discounted cumulative gain (NDCG). An important objective of LTR is to optimize a neural network so that it scores highly on ranking metrics. However, ranking metrics like NDCG can present challenges because they are often discontinuous and flat, so stochastic gradient descent cannot directly be applied to these metrics. Rax provides state-of-the-art approximation techniques that make it possible to produce differentiable surrogates to ranking metrics that permit optimization via gradient descent. The figure below illustrates the use of rax.approx_t12n, a function transformation unique to Rax, which allows for the NDCG metric to be transformed into an approximate and differentiable form.

Using an approximation technique from Rax to transform the NDCG ranking metric into a differentiable and optimizable ranking loss (approx_t12n and gumbel_t12n).

First, notice how the NDCG metric (in green) is flat and discontinuous, making it hard to optimize using stochastic gradient descent. By applying the rax.approx_t12n transformation to the metric, we obtain ApproxNDCG, an approximate metric that is now differentiable with well-defined gradients (in red). However, it potentially has many local optima — points where the loss is locally optimal, but not globally optimal — in which the training process can get stuck. When the loss encounters such a local optimum, training procedures like stochastic gradient descent will have difficulty improving the neural network further.

To overcome this, we can obtain the gumbel-version of ApproxNDCG by using the rax.gumbel_t12n transformation. This gumbel version introduces noise in the ranking scores which causes the loss to sample many different rankings that may incur a non-zero cost (in blue). This stochastic treatment may help the loss escape local optima and often is a better choice when training a neural network on a ranking metric. Rax, by design, allows the approximate and gumbel transformations to be freely used with all metrics that are offered by the library, including metrics with a top-k cutoff value, like recall or precision. In fact, it is even possible to implement your own metrics and transform them to obtain gumbel-approximate versions that permit optimization without any extra effort.
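
Putting the two transformations together, a short sketch (the key keyword for the gumbel-transformed loss is our assumption based on the Rax documentation):

import jax
import jax.numpy as jnp
import rax

scores = jnp.asarray([0.3, 2.1, -0.4])
labels = jnp.asarray([0.0, 1.0, 2.0])

# ApproxNDCG: a differentiable surrogate of the NDCG metric.
approx_ndcg = rax.approx_t12n(rax.ndcg_metric)
loss = approx_ndcg(scores, labels)
grads = jax.grad(approx_ndcg)(scores, labels)

# Gumbel variant: noise makes the loss sample many rankings per update.
gumbel_ndcg = rax.gumbel_t12n(rax.approx_t12n(rax.ndcg_metric))
loss = gumbel_ndcg(scores, labels, key=jax.random.PRNGKey(42))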

Ranking in the JAX Ecosystem
Rax is designed to integrate well in the JAX ecosystem and we prioritize interoperability with other JAX-based libraries. For example, a common workflow for researchers that use JAX is to use TensorFlow Datasets to load a dataset, Flax to build a neural network, and Optax to optimize the parameters of the network. Each of these libraries composes well with the others and the composition of these tools is what makes working with JAX both flexible and powerful. For researchers and practitioners of ranking systems, the JAX ecosystem was previously missing LTR functionality, and Rax fills this gap by providing a collection of ranking losses and metrics. We have carefully constructed Rax to function natively with standard JAX transformations such as jax.jit and jax.grad and various libraries like Flax and Optax. This means that users can freely use their favorite JAX and Rax tools together.

Ranking with T5
While giant language models such as T5 have shown great performance on natural language tasks, how to leverage ranking losses to improve their performance on ranking tasks, such as search or question answering, is under-explored. With Rax, it is possible to fully tap this potential. Because Rax is written as a JAX-first library, it is easy to integrate with other JAX libraries. Since T5X is an implementation of T5 in the JAX ecosystem, Rax can work with it seamlessly.

To this end, we have an example that demonstrates how Rax can be used in T5X. By incorporating ranking losses and metrics, it is now possible to fine-tune T5 for ranking problems, and our results indicate that enhancing T5 with ranking losses can offer significant performance improvements. For example, on the MS-MARCO QNA v2.1 benchmark we are able to achieve a +1.2% NDCG and +1.7% MRR by fine-tuning a T5-Base model using the Rax listwise softmax cross-entropy loss instead of a pointwise sigmoid cross-entropy loss.

Fine-tuning a T5-Base model on MS-MARCO QNA v2.1 with a ranking loss (softmax, in blue) versus a non-ranking loss (pointwise sigmoid, in red).

Conclusion
Overall, Rax is a new addition to the growing ecosystem of JAX libraries. Rax is entirely open source and available to everyone at github.com/google/rax. More technical details can also be found in our paper. We encourage everyone to explore the examples included in the github repository: (1) optimizing a neural network with Flax and Optax, (2) comparing different approximate metric optimization techniques, and (3) how to integrate Rax with T5X.

Acknowledgements
Many collaborators within Google made this project possible: Xuanhui Wang, Zhen Qin, Le Yan, Rama Kumar Pasumarthi, Michael Bendersky, Marc Najork, Fernando Diaz, Ryan Doherty, Afroz Mohiuddin, and Samer Hassan.