Categories
Misc

Startup’s Vision AI Software Trains Itself — in One Hour — to Detect Manufacturing Defects in Real Time

Cameras have been deployed in factories for over a decade — so why, Franz Tschimben wondered, hasn’t automated visual inspection yet become the worldwide standard? This question motivated Tschimben and his colleagues to found Covision Quality, an AI-based visual-inspection software startup that uses NVIDIA technology to transform end-of-line defect detection for the manufacturing industry.

The post Startup’s Vision AI Software Trains Itself — in One Hour — to Detect Manufacturing Defects in Real Time appeared first on NVIDIA Blog.

Categories
Misc

Boldly Go: Discover New Frontiers in AI-Powered Transportation at GTC

AI and the metaverse are revolutionizing every aspect of the way we live, work and play — including how we move. Leaders in the automotive and technology industries will come together at NVIDIA GTC to discuss the newest breakthroughs driving intelligent vehicles, whether in the real world or in simulation.

The post Boldly Go: Discover New Frontiers in AI-Powered Transportation at GTC appeared first on NVIDIA Blog.

Categories
Misc

Easy A: GeForce NOW Brings Higher Resolution and Frame Rates for Browser Streaming on PC

Class is in session this GFN Thursday as GeForce NOW makes the up-grade with support for higher resolutions and frame rates in Chrome browser on PC. It’s the easiest way to spice up a boring study session. When the lecture is over, dive into the six games joining the GeForce NOW library this week.

The post Easy A: GeForce NOW Brings Higher Resolution and Frame Rates for Browser Streaming on PC appeared first on NVIDIA Blog.

Categories
Misc

Attend Expert-Led Developer Sessions at GTC 2022

Register now and get ready to explore cutting-edge technology and the latest developer tools at GTC.


Categories
Misc

Immunai Co-Founder Luis Voloch on Using Deep Learning to Develop New Drugs

Mapping the immune system could lead to the creation of drugs that help our bodies win the fight against cancer and other diseases. That’s the big idea behind immunotherapy. The problem: the immune system is incredibly complex. Enter Immunai, a biotech company that’s using cutting-edge genomics and machine learning technology to map the human immune system.

The post Immunai Co-Founder Luis Voloch on Using Deep Learning to Develop New Drugs appeared first on NVIDIA Blog.

Categories
Misc

Using Network Graphs to Visualize Potential Fraud on Ethereum Blockchain


Beyond the unimaginable prices for monkey pictures, NFT’s underlying technology provides companies with a new avenue to directly monetize their online engagements. Major brands such as Adidas, NBA, and TIME have already begun experimenting with these revenue streams using NFTs, and we are still early in this trend.

As data practitioners, we are positioned to provide valuable insights into these revenue streams, given that all transactions are public on the blockchain. This post provides a guided project to access, analyze, and identify potential fraud in blockchain data using Python.

In this post and accompanying Jupyter notebook, I discuss the following:

  • The basics of blockchain, NFTs, and network graphs.
  • How to pull NFT data using the open-source package NFT Analyst Starter Pack from a16z.
  • How to interpret Ethereum blockchain data.
  • The fraudulent practice of wash trading NFTs.
  • Constructing network graphs to visualize potential wash trading on the NFT project Bored Ape Yacht Club.

The Jupyter notebook has a more detailed, step-by-step guide for writing Python code to implement this example walkthrough, and the post provides additional context. In addition, this post assumes that you have a basic understanding of pandas, data preparation, and data visualization.

What is blockchain data?

Cutting through the media frenzy of coins named after dogs and pixelated pictures selling for hundreds of thousands of dollars reveals fascinating technology: the blockchain. 

The following excerpt best describes this decentralized data source:

“At a very high level, blockchains are ledgers of transactions utilizing cryptography that can only add information and thus can’t be changed (i.e., immutable). What separates blockchains from ledgers you find at banks is a concept called ‘decentralization’— where every computer connected to a respective blockchain must ‘agree’ on the same state of the blockchain and subsequent data added to it.”

For more information about Ethereum blockchain data, see Making Sense of Ethereum Data for Analytics.

Central to this technology is that all the data (for example, logs, metadata, and so on) must be public and accessible. I highly recommend Stanford professor Dan Boneh’s lecture.

What is an NFT?

NFT stands for non-fungible token: a crypto asset on a blockchain (such as Ethereum) that represents a unique, one-of-one token that can be digitally owned. For example, gold bars are fungible, as many interchangeable bars can exist and represent the same thing, while the original Mona Lisa is non-fungible in that only one exists.

Contrary to popular belief, NFTs are not just art and JPEGs but digital representations of ownership on a blockchain ledger for a unique item such as art, music, or whatever the NFT creator wants to put into the metadata. For this post, however, we use the NFT project Bored Ape Yacht Club (BAYC), an artwork NFT.

P.S. If you are a visual learner, my favorite intro resource on the topic of NFTs is the What Are NFTs and How Can They Be Used in Decentralized Finance? DEFI Explained video by Finematics.

What is a network graph, and why is it well suited to representing blockchain data?

Networks are a method of organizing relationship data using nodes and edges. Nodes represent an entity, such as an email address or social media account, while the edges show the connection between nodes. 

Furthermore, you can store metadata for nodes and edges to represent different aspects of a relationship. Metadata can range from weights to labels. Figure 1 shows the steps of taking an entire network and zooming into a use case with helpful labels from metadata.
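For instance, with NetworkX (the Python package used later in this post), node and edge metadata are just attribute dictionaries. A minimal sketch, using made-up wallet addresses:

```python
import networkx as nx

G = nx.DiGraph()
# Nodes are wallet addresses; the edge carries transaction metadata.
G.add_edge("0xaaa", "0xbbb", weight=1.5, label="sale", date="2022-04-28")
print(G["0xaaa"]["0xbbb"]["label"])  # sale
```

Any keyword arguments passed to add_edge become edge metadata that later queries and visualizations can read.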

Source: Graphs from notebook tutorial.
Figure 1. The various network graphs created in this post

What makes network graphs ideal for representing blockchain transactions is that there is always a to and from blockchain address, as well as significant metadata (for example, timestamps, coin amounts, and so on.) for each transaction. Furthermore, as blockchain data is public by design through decentralization, you can use network graphs to visualize economic behaviors on a respective blockchain.

In this example, I want to demonstrate identifying the fraudulent behavior of wash trading, where an individual intentionally sells an asset to themselves across multiple accounts to artificially inflate the asset’s price.

Chainalysis wrote an excellent report on the phenomenon, identifying over 260 Ethereum crypto wallets potentially engaging in wash trading, with a collective profit of over $8.4 million in 2021 alone.

Pulling data from the Ethereum blockchain

Though all blockchain data is publicly available to anyone, it is still tough to access and prepare for analysis. Following are some options to access blockchain data:

  • Create your own blockchain node (for example, become a miner) to read the rawest data available.
  • Use a third-party tool to create your own blockchain node.
  • Use a third-party API to read raw data from their own blockchain node.
  • Use a third-party API to read cleaned and aggregated blockchain data from their service.
  • Use the open-source package NFT Analyst Starter Pack from a16z.

Although all are viable options, each has a tradeoff between reliability, trust, and convenience.

For example, I worked on an NFT analytics project where we wanted to create a reliable NFT market dashboard. Unfortunately, having our own blockchain node was cost-prohibitive, many third-party data sources had various data-quality issues that we couldn’t control, and it became challenging to track transactions across multiple blockchains. That project ultimately required bringing together high-quality data from numerous third-party APIs.

Thankfully, for this project convenience matters most so that you can focus on learning, so I recommend the NFT Analyst Starter Pack from a16z. Think of this package as a convenient wrapper around Alchemy’s third-party blockchain API that creates easy-to-use CSVs for your desired NFT contract.

Preparing data and creating network graphs

The NFT Analyst Starter Pack results in three separate CSV files for the BAYC NFT project:

  • BAYC Metadata: Information regarding a specific NFT, where asset_id is the unique identifier within this NFT token.
  • BAYC Sales: Logs and metadata related to a specific transaction, represented by its transaction hash, where seller and buyer fields identify the wallets involved.
  • BAYC Transfers: The same structure as BAYC Sales, but for transactions in which no money moves from one wallet to another.

For this project, most of the data preparation is around:

  • Reorganizing BAYC Sales and BAYC Transfers to enable a clean union of the two datasets.
  • Deleting duplicate logs of transfer transactions already represented in sales.
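A minimal sketch of this preparation with pandas, using made-up transaction hashes (the notebook's actual column set is richer):

```python
import pandas as pd

# Toy stand-ins for the real sales and transfers CSVs.
sales = pd.DataFrame({
    "transaction_hash": ["0x1", "0x2"],
    "transaction_type": ["sell", "sell"],
})
transfers = pd.DataFrame({
    "transaction_hash": ["0x2", "0x3"],
    "transaction_type": ["transfer", "transfer"],
})

# Drop transfer rows already represented in sales, then union the two.
transfers = transfers[~transfers["transaction_hash"].isin(sales["transaction_hash"])]
txns = pd.concat([sales, transfers], ignore_index=True)
print(len(txns))  # 3
```

Here the duplicate log for 0x2 keeps its sale record, since that row carries the price information.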

Given that the goal is to learn, don’t worry about whether the blockchain data is accurate, but you can always check for yourself by searching the transaction_hash value on Etherscan.

After preparing the data, use the NetworkX package to generate a network graph data structure of your NFT transactions. There are multiple ways to construct a graph, but in my opinion, the most straightforward approach is to use the function from_pandas_edgelist, where you just provide a pandas DataFrame, the to and from values to represent the nodes, and any metadata for edges and labeling.
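As a minimal sketch (the column names here are hypothetical; the edge listing below shows the richer metadata the notebook actually attaches):

```python
import pandas as pd
import networkx as nx

# Toy transactions; from/to columns hold seller and buyer wallet addresses.
df = pd.DataFrame({
    "from_account": ["0xaaa", "0xaaa"],
    "to_account": ["0xbbb", "0xccc"],
    "sale_price_eth": [153.2, 0.0],
    "transaction_type": ["sell", "transfer"],
})

# Each row becomes a directed edge; the listed columns become edge metadata.
G = nx.from_pandas_edgelist(
    df,
    source="from_account",
    target="to_account",
    edge_attr=["sale_price_eth", "transaction_type"],
    create_using=nx.MultiDiGraph,
)
print(G.number_of_edges())  # 2
```

A MultiDiGraph is used so that repeated transactions between the same two wallets each keep their own edge.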

[('0x2fdcca65899346af3a93a8daa6128bdbcb1ce3b3',
  '0xcedf17dfafa947cd0e205fe2a3a183cf2fb3a0bc',
  {'transaction_hash': '0xb235f0321b0b50198399ec7f2bb759ef625f85673b4d90d68f711229750181e4',
   'block_number': '14675897',
   'date': '2022-04-28',
   'asset_id': '7438',
   'sale_price_eth': 153.2,
   'sale_price_usd': 442685.5285671361,
   'transaction_type': 'sell',
   'asset_contract': '0xbc4ca0eda7647a8ab7c2061c2e118a18a936f13d'}),
('0x2fdcca65899346af3a93a8daa6128bdbcb1ce3b3',
  '0xd8fdd6031fa27194f93e1a877f8bf5bfc9b47e1e',
  {'transaction_hash': '0x7b4797061eb16d73a28a869e51745e471e2849a55c80459b2aff7f0205925d74',
   'block_number': '14654313',
   'date': '2022-04-25',
   'asset_id': '5954',
   'sale_price_eth': 0.0,
   'sale_price_usd': 0.0,
   'transaction_type': 'transfer',
   'asset_contract': '0xbc4ca0eda7647a8ab7c2061c2e118a18a936f13d'})]

From this prepared data, the NetworkX package makes visualizing network graphs as easy as nx.draw, but with over 40k transactions in the DataFrame, visualizing the whole graph returns only a useless blob. Therefore, you must be specific about exactly what to visualize among your transactions to create a captivating data story.

Visualizing potential wash trading

Rather than scouring through the transactions of 10,000 NFTs, you can instead validate what others are stating within the market. Notably, the NFT Wash Trading – Is it possible to protect against it? post calls out BAYC token 8099 as potentially being subjected to the fraudulent practice of wash trading.

If you follow along in the accompanying notebook, you carry out these steps:

  • Filter prepared NFT data to only rows containing logs for asset_id 8099.
  • Rename the to and from wallet addresses to uppercase letters, assigned in the order each address first appears among the asset’s transactions.
  • Generate network graph data with the prepared asset 8099 data using the NetworkX package.
  • Plot the network graph with desired labels, edge arrows, and node positioning.
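The steps above can be sketched as follows (addresses and column names are made up; the notebook works on the full prepared DataFrame and then plots the result with nx.draw):

```python
import string
import pandas as pd
import networkx as nx

# Toy transactions for a single asset.
asset = pd.DataFrame({
    "from_wallet": ["0xaaa", "0xbbb", "0xbbb", "0xccc"],
    "to_wallet":   ["0xbbb", "0xccc", "0xaaa", "0xaaa"],
    "asset_id":    ["8099"] * 4,
    "sale_price_usd": [95_000, 166_000, 0, 150_000],
})

# Step 1: keep only rows for the asset under investigation.
asset = asset[asset["asset_id"] == "8099"]

# Step 2: label wallets A, B, C, ... in order of first appearance.
order = pd.unique(asset[["from_wallet", "to_wallet"]].values.ravel())
labels = {addr: string.ascii_uppercase[i] for i, addr in enumerate(order)}

# Step 3: build a directed multigraph with sale price as edge metadata.
G = nx.from_pandas_edgelist(
    asset, source="from_wallet", target="to_wallet",
    edge_attr="sale_price_usd", create_using=nx.MultiDiGraph,
)
G = nx.relabel_nodes(G, labels)
print(sorted(G.nodes()))  # ['A', 'B', 'C']
```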

Did BAYC 8099 NFT experience wash trading?

The data plotted in Figure 2 enables you to visualize the transactions for asset 8099. Starting at node H, you can see that this wallet first increases the price from $95k to $166k between nodes H and I, then adds more transactions through transfers between H and J. Finally, node H sells the NFT, at its potentially artificially inflated price, to node K.

Figure 2. Following the transactions of NFT BAYC 8099

Though this graph can’t definitively state that node H engaged in wash trading—as you don’t know whether nodes H, I, and J are wallets owned by the same person—seeing loops for a node where the price increases should flag the need to do more due diligence. For example, you could look at etherscan.com to review the transactions between the following wallets:

  • 0xe4bc96b24e0bdf87b4b92ed39c1aef8839b090dd (node H).
  • 0x7e99611cf208cb097497a59b3fb7cb4dfd115ea9 (node I).
  • 0xcbc9f463f83699d20dd5b54be5262be69a0aea9f (node J).

Perhaps node H had sellers’ remorse and wanted their NFT back, as it’s not uncommon for investors to get attached to their beloved NFTs. But numerous transactions between the wallets associated with nodes H, I, and J could indicate further red flags for the NFT asset.
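This kind of due diligence can also be partly automated. As an illustrative heuristic (not the notebook's code), NetworkX can enumerate round-trip loops in a directed sale graph:

```python
import networkx as nx

# Toy sale graph mirroring Figure 2: H trades back and forth with I and J,
# then sells to K.
G = nx.DiGraph([("H", "I"), ("I", "H"), ("H", "J"), ("J", "H"), ("H", "K")])

# Cycles involving more than one node are round-trips worth a closer look.
loops = [c for c in nx.simple_cycles(G) if len(c) > 1]
print(len(loops))  # 2
```

Flagged loops are only a starting point; as noted above, you still need to check on Etherscan whether the wallets involved appear to share an owner.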

Next steps

By following along with this post and accompanying notebook, you have seen how to access Ethereum blockchain data and analyze NFTs through network graphs. If you enjoyed this analysis and are interested in doing more projects like this, chat with me in the CharlieDAO Discord, where I hang out. We are a collective of software engineers, data scientists, and crypto natives exploring web3!

Disclaimer

This content is for educational purposes only and is not financial advice. I do not have any financial ties to the analyzed NFTs at the moment of releasing the notebook, and here is my crypto wallet address on Etherscan for reference. This analysis only highlights potential fraud to investigate further but does not prove fraud has taken place. Finally, if you own crypto, never share your “secret recovery phrase” or “private keys” with anyone.

Categories
Misc

Running Large-Scale Graph Analytics with Memgraph and NVIDIA cuGraph Algorithms


With the latest Memgraph Advanced Graph Extensions (MAGE) release, you can now run GPU-powered graph analytics from Memgraph in seconds, while working in Python.  Powered by NVIDIA cuGraph, the following graph algorithms will now execute on GPU: 

  • PageRank (graph analysis)
  • Louvain (community detection)
  • Balanced Cut (clustering)
  • Spectral Clustering (clustering)
  • HITS (hubs versus authorities analytics)
  • Leiden (community detection)
  • Katz centrality
  • Betweenness centrality

This tutorial will show you how to use PageRank graph analysis and Louvain community detection to analyze a Facebook dataset containing 1.3 million relationships.

By the end of the tutorial, you will know how to:

  • Import data inside Memgraph using Python
  • Run analytics on large scale graphs and get fast results
  • Run analytics on NVIDIA GPUs from Memgraph

Tutorial prerequisites

To follow this graph analytics tutorial, you will need an NVIDIA GPU, driver, and container toolkit. Once you have successfully installed the NVIDIA GPU driver and container toolkit, you must also install the following four tools:

  • Docker
  • JupyterLab
  • GQLAlchemy
  • Memgraph Lab

The next section walks you through installing and setting up these tools for the tutorial. 

Docker

Docker is used to install and run the mage-cugraph Docker image. There are three steps involved in setting up and running the Docker image: 

  1. Download Docker
  2. Download the tutorial data
  3. Run the Docker image, giving it access to the tutorial data

1. Download Docker

You can install Docker by visiting the Docker webpage and following the instructions for your operating system. 

2. Download the tutorial data

Before running the mage-cugraph Docker image, first download the data that will be used in the tutorial. This allows you to give the Docker image access to the tutorial dataset when run.  

To download the data, use the following commands to clone the jupyter-memgraph-tutorials GitHub repo and move into the jupyter-memgraph-tutorials/cugraph-analytics folder:

git clone https://github.com/memgraph/jupyter-memgraph-tutorials.git
cd jupyter-memgraph-tutorials/cugraph-analytics

3. Run the Docker image

You can now use the following command to run the Docker image and mount the workshop data to the /samples folder:

docker run -it -p 7687:7687 -p 7444:7444 --volume $PWD/data/facebook_clean_data/:/samples mage-cugraph

When you run the Docker container, you should see the following message:

You are running Memgraph vX.X.X
To get started with Memgraph, visit https://memgr.ph/start

With the volume mounted, the CSV files needed for the tutorial are located inside the /samples folder of the running container, where Memgraph will find them when needed.

Jupyter notebook

Now that Memgraph is running, install Jupyter. This tutorial uses JupyterLab, and you can install it with the following command:

pip install jupyterlab

Once JupyterLab is installed, launch it with the following command:

jupyter lab

GQLAlchemy 

Use GQLAlchemy, an Object Graph Mapper (OGM), to connect to Memgraph and execute Cypher queries from Python. You can think of Cypher as SQL for graph databases: it contains many of the same kinds of language constructs, such as those to create, update, and delete data.

Download CMake on your system, and then you can install GQLAlchemy with pip:

pip install gqlalchemy

Memgraph Lab 

The last prerequisite you need to install is Memgraph Lab. You will use it to create data visualizations upon connecting to Memgraph. Learn how to install Memgraph Lab as a desktop application for your operating system.

With Memgraph Lab installed, you should now connect to your Memgraph database.

At this point, you are finally ready to:

  • Connect to Memgraph with GQLAlchemy
  • Import the dataset
  • Run graph analytics in Python

Connect to Memgraph with GQLAlchemy

First, position yourself in the Jupyter notebook. The first three lines of code import gqlalchemy, connect to the Memgraph database instance at host 127.0.0.1 and port 7687, and clear the database so that you start with a clean slate.

from gqlalchemy import Memgraph
memgraph = Memgraph("127.0.0.1", 7687)
memgraph.drop_database()

Next, you will import the dataset from CSV files and then perform PageRank and Louvain community detection using Python.

Import data

The Facebook dataset consists of eight CSV files, each having the following structure:

node_1,node_2
0,1794
0,3102
0,16645

Each record represents an edge connecting two nodes. Nodes represent pages, and relationships are mutual likes between them.

There are eight distinct types of pages (Government, Athletes, and TV shows, for example). Pages have been reindexed for anonymity, and all pages have been verified for authenticity by Facebook.

Since Memgraph imports data faster when indices exist, create an index for all nodes with the label Page on the id property.

memgraph.execute(
    """
    CREATE INDEX ON :Page(id);
    """
)

The Docker container already has access to the data used in this tutorial, so you can list the local files in the ./data/facebook_clean_data/ folder and concatenate each file name with the /samples/ folder to determine its path inside the container. Use these file paths to load data into Memgraph.

import os
from os import listdir
from os.path import isfile, join
csv_dir_path = os.path.abspath("./data/facebook_clean_data/")
csv_files = [f"/samples/{f}" for f in listdir(csv_dir_path) if isfile(join(csv_dir_path, f))]

Load all CSV files using the following query:

for csv_file_path in csv_files:
    memgraph.execute(
        f"""
        LOAD CSV FROM "{csv_file_path}" WITH HEADER AS row
        MERGE (p1:Page {{id: row.node_1}}) 
        MERGE (p2:Page {{id: row.node_2}}) 
        MERGE (p1)-[:LIKES]->(p2);
        """
    )

For more information about importing CSV files with LOAD CSV see the Memgraph documentation.

Next, use PageRank and Louvain community detection algorithms with Python to determine which pages in the network are most important, and to find all the communities in a network.

PageRank importance analysis

To identify important pages in a Facebook dataset, you will execute PageRank. Learn about different algorithm settings that can be set when calling PageRank.
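The semantics of PageRank can be illustrated CPU-side with NetworkX before running the GPU version; cuGraph computes the same kind of scores, just much faster on large graphs. This toy graph is made up for illustration:

```python
import networkx as nx

# A toy directed graph where three pages all link to one "hub" page.
G = nx.DiGraph([("a", "hub"), ("b", "hub"), ("c", "hub")])

# CPU-side PageRank with the usual damping factor.
ranks = nx.pagerank(G, alpha=0.85)
top_page = max(ranks, key=ranks.get)
print(top_page)  # hub
```

As expected, the page that everyone links to receives the highest importance score.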

You will also find other algorithms integrated within MAGE; Memgraph aims to simplify running graph analytics on large-scale graphs. Other Memgraph tutorials show how to run these analytics.

MAGE simplifies executing PageRank. The following query first executes the algorithm, and then creates and sets the rank property of each node to the value that the cugraph.pagerank algorithm returns.

The value of that property is then available as the variable rank. Note that this query (and all tests presented here) was executed on an NVIDIA GeForce GTX 1650 Ti GPU and an Intel Core i5-10300H CPU at 2.50GHz with 16GB RAM, and returned results in around four seconds.

memgraph.execute(
    """
    CALL cugraph.pagerank.get() YIELD node, rank
    SET node.rank = rank;
    """
)

Next, retrieve ranks using the following Python call:

results = memgraph.execute_and_fetch(
    """
    MATCH (n)
    RETURN n.id as node, n.rank as rank
    ORDER BY rank DESC
    LIMIT 10;
    """
)
for dict_result in results:
    print(f"node id: {dict_result['node']}, rank: {dict_result['rank']}")

node id: 50493, rank: 0.0030278728385218327
node id: 31456, rank: 0.0027350282311318468
node id: 50150, rank: 0.0025153975342989345
node id: 48099, rank: 0.0023413620866201052
node id: 49956, rank: 0.0020696403564964
node id: 23866, rank: 0.001955167533390466
node id: 50442, rank: 0.0019417018181751462
node id: 49609, rank: 0.0018211204462452515
node id: 50272, rank: 0.0018123518843272954
node id: 49676, rank: 0.0014821440895415787

This code returns 10 nodes with the highest rank score. Results are available in a dictionary form.

Now, it is time to visualize results with Memgraph Lab. In addition to creating beautiful visualizations powered by D3.js and our Graph Style Script language, you can use Memgraph Lab to:

  • Query the graph database and write your own graph algorithms in Python, C++, or even Rust
  • Check Memgraph Database Logs
  • Visualize graph schema

Memgraph Lab comes with a variety of pre-built datasets to help you get started. Open Execute Query view in Memgraph Lab and run the following query:

MATCH (n)
WITH n
ORDER BY n.rank DESC
LIMIT 3
MATCH (n)<-[e]-(m)
RETURN *;

The first part of this query MATCHes all the nodes and ORDERs them by rank in descending order, keeping the top three. For those three nodes, the second MATCH obtains all pages connected to them; the WITH clause connects the two parts of the query. Figure 1 shows the PageRank query results.

Figure 1. PageRank results visualized in Memgraph Lab

The next step is learning how to use Louvain community detection to find communities present in the graph.

Community detection with Louvain

The Louvain algorithm measures the extent to which the nodes within a community are connected, compared to how connected they would be in a random network.

It also recursively merges communities into a single node and executes the modularity clustering on the condensed graphs. This is one of the most popular community detection algorithms.
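The intuition can be shown on a toy graph with NetworkX's CPU implementation of Louvain (cuGraph provides the GPU-accelerated equivalent used in this tutorial); the graph below is made up for illustration:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Two five-node cliques joined by a single edge form two obvious communities.
G = nx.union(nx.complete_graph(range(0, 5)), nx.complete_graph(range(5, 10)))
G.add_edge(4, 5)

communities = louvain_communities(G, seed=42)
print(len(communities))  # 2
```

The single bridging edge is not enough to merge the two cliques, so Louvain recovers them as separate communities.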

Using Louvain, you can find the number of communities within the graph.  First execute Louvain and save the cluster_id as a property for every node:

memgraph.execute(
    """
    CALL cugraph.louvain.get() YIELD cluster_id, node
    SET node.cluster_id = cluster_id;
    """
)

To find the number of communities, run the following code:

results = memgraph.execute_and_fetch(
    """
    MATCH (n)
    WITH DISTINCT n.cluster_id as cluster_id
    RETURN count(cluster_id) as num_of_clusters;
    """
)

# We will get only one result.
result = list(results)[0]

# Don't forget that results are saved in a dict.
print(f"Number of clusters: {result['num_of_clusters']}")

Number of clusters: 2664

Next, take a closer look at some of these communities. You may find nodes that belong to one community but are connected to a node in a different community. Louvain attempts to minimize the number of such nodes, so you should not see many of them. In Memgraph Lab, execute the following query:

MATCH (n2)<-[e1]-(n1)-[e]->(m1)
WHERE n1.cluster_id != m1.cluster_id AND n1.cluster_id = n2.cluster_id
RETURN *
LIMIT 1000;

This query will MATCH node n1 and its relationships to two other nodes, n2 and m1, with the patterns (n2)<-[e1]-(n1) and (n1)-[e]->(m1), respectively. Then, it will filter out only those nodes WHERE the cluster_id of n1 and n2 is not the same as the cluster_id of node m1.

Use LIMIT 1000 to show only 1,000 of such relationships, for visualization simplicity.

Using Graph Style Script in Memgraph Lab, you can style your graphs to, for example, represent different communities with different colors. Figure 2 shows the Louvain query results. 

Figure 2. Louvain results visualized in Memgraph Lab

Summary

And there you have it: millions of nodes and relationships imported using Memgraph and analyzed using cuGraph PageRank and Louvain graph analytics algorithms. With GPU-powered graph analytics from Memgraph, powered by NVIDIA cuGraph, you are able to explore massive graph databases and carry out inference without having to wait for results. You can find more tutorials covering a variety of techniques on the Memgraph website.

Categories
Misc

Using Federated Learning to Bridge Data Silos in Financial Services


Unlocking the full potential of artificial intelligence (AI) in financial services is often hindered by the inability to ensure data privacy during machine learning (ML). For instance, traditional ML methods assume all data can be moved to a central repository.

This is an unrealistic assumption when dealing with data sovereignty and security considerations or sensitive data like personally identifiable information. More practically, it ignores data egress challenges and the considerable cost of creating large pooled datasets.

Massive internal datasets that would be valuable for training ML models remain unused. How can companies in the financial services industry leverage their own data while ensuring privacy and security?

This post introduces federated learning and explains its benefits for businesses handling sensitive datasets. We present three ways federated learning can be used in financial services and provide tips on getting started today.

What is federated learning?

Federated learning is an ML technique that enables the extraction of insights from multiple isolated datasets—without needing to share or move that data into a central repository or server.

For example, assume you have multiple datasets you want to use to train an AI model. Today’s standard ML approach requires first gathering all the training data in one place. However, this approach is not feasible for much of the world’s data, which is sensitive. This leaves many datasets and use cases off-limits for applying AI techniques.

On the other hand, federated learning does not assume that one unified dataset can be created. The distributed training datasets are instead left where they are.

The approach involves creating multiple versions of the model and sending one to each server or device where the datasets live. Each site trains the model locally on its subset of the data, and then sends only the model parameters back to a central server. This is the key feature of federated learning: only the model updates or parameters are shared, not the training data itself. This preserves data privacy and sovereignty. 

Finally, the central server collects all the updates from each site and intelligently aggregates the “mini-models” into one global model. This global model can capture the insights from the entire dataset, even when actual data cannot be combined.
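Conceptually, the aggregation step can be sketched in a few lines of Python, in the spirit of federated averaging (FedAvg). The function and data below are illustrative stand-ins, not any specific framework's API:

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Aggregate per-client model parameters into one global model.

    client_weights: one list of np.ndarray layers per client.
    client_sizes: number of local training samples per client.
    """
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    # Each client's contribution is weighted by its share of the data.
    return [
        sum(w[i] * (n / total) for w, n in zip(client_weights, client_sizes))
        for i in range(n_layers)
    ]

# Two hypothetical clients, each holding a one-layer "model".
client_a = [np.array([1.0, 3.0])]
client_b = [np.array([3.0, 5.0])]

global_model = fed_avg([client_a, client_b], client_sizes=[100, 100])
print(global_model[0])  # [2. 4.]
```

Only these parameter arrays ever leave each site; the training data itself stays put.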

Note that these local sites could be servers, edge devices like smartphones, or any machine that can train locally and send back the model updates to the central server.

Advantages of privacy-preserving technology

Large-scale collaborations in healthcare have demonstrated the real-world viability of using federated learning for multiple independent parties to jointly train an AI model. However, federated learning is not just about collaborating with external partners.

In financial institutions, we see an incredible opportunity for federated learning to bridge internal data silos. Company-wide ROI can increase as businesses gather all viable data for new products, including recommender systems, fraud detection systems, and call center analytics.

Privacy concerns, however, aren’t limited to financial data. The wave of data privacy legislation being enacted worldwide today (starting with GDPR in Europe and CCPA in California, with many similar laws coming soon) will only accelerate the need for privacy-preserving ML techniques in all industries.

Expect federated learning to become an essential part of the AI toolset in the years ahead.

Practical business use cases 

ML algorithms are hungry for data. Furthermore, the real-world performance of your ML model depends not only on the amount of training data but also on its relevance.

Many organizations could improve current AI models by incorporating new datasets that cannot be easily accessed without sacrificing privacy. This is where federated learning comes in. 

Federated learning enables companies to leverage new data resources without requiring data sharing.

Broadly, three types of use cases are enabled by federated learning: 

  • Intra-company: Bridging internal data silos
  • Inter-company: Facilitating collaboration between organizations
  • Edge computing: Learning across thousands of edge devices

Intra-company use case: Leverage siloed internal data

There are many reasons a single company might rely on multiple data storage solutions. For example:

  1. Data governance rules such as GDPR may require that data be kept in specific geolocations and specify retention and privacy policies. 
  2. Mergers and acquisitions come with new data from the partner company. Still, the arduous task of integrating that data into existing storage systems often leaves the data dispersed for long periods.
  3. Both on-premises and hybrid cloud storage solutions are used, and moving large amounts of data between them is costly. 

Federated learning enables your company to leverage ML across isolated datasets in different business organizations, geographic regions, or data warehouses while preserving privacy and security.

Figure 1. A workflow of an intra-company federated learning use case. A federated server (bottom) stores the global model and receives parameters or model weights from client nodes (US data center, EU data center, public/private cloud).
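The aggregation step at the heart of this workflow can be illustrated with a minimal, framework-free sketch of federated averaging (FedAvg): each client trains locally and sends back its weights, and the server averages them, weighted by local dataset size. This is only an illustration of the pattern; a real deployment would use a framework such as NVFLARE, and the client names and numbers here are made up.

```python
# Minimal sketch of one federated averaging (FedAvg) round.
# Each client trains locally, then the server computes a weighted
# average of the client weights, proportional to local dataset size.

def fed_avg(client_weights, client_sizes):
    """Aggregate client weight vectors into a global model.

    client_weights: list of per-client weight vectors (lists of floats)
    client_sizes:   number of local training samples per client
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    global_weights = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            global_weights[i] += w * (size / total)
    return global_weights

# Hypothetical example: a US data center (4,000 samples) and an EU
# data center (1,000 samples) each send locally trained weights.
us_weights = [0.2, 0.8]
eu_weights = [0.6, 0.4]
global_model = fed_avg([us_weights, eu_weights], [4000, 1000])
print(global_model)  # [0.28, 0.72]
```

Note that only model weights cross the network; the raw records in each data center never leave their region.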

Inter-company use case: Collaborate with external partners

Gathering enough data to build powerful AI models is difficult for a single company. Consider an insurance company building an effective fraud detection system. The company can only collect data from observed events, such as customers filing a claim. Yet this data may not be representative of the entire population and can therefore contribute to AI model bias.

To build an effective fraud detection system, the company needs larger datasets with more diverse data points to train robust, generalizable models. Many organizations could benefit from pooling data with others. In practice, however, most organizations will not share their proprietary datasets on a common supercomputer or cloud server.

Figure 2. A workflow of an inter-company federated learning use case. A federated server (bottom) stores the global model and receives parameters or model weights from client nodes (bank X, bank Y, credit card network Z). Graphic created by author Annika Brundyn.

Enabling this type of collaboration for industry-wide challenges could bring massive benefits. 

For example, in one of the largest real-world federated collaborations, we saw 20 independent hospitals across five continents train an AI model for predicting the oxygen needs of COVID-19 infected patients. On average, the hospitals saw a 38% improvement in generalizability and a 16% improvement in the model performance by participating in the federated system.

Likewise, there is a real opportunity to maintain customer privacy while credit card networks reduce fraudulent activity and banks employ anti-money laundering initiatives. Federated learning increases the data available to a single bank, which can help address issues such as money laundering activities in correspondent banking.

Edge computing: Smartphones and IoT

Google originally introduced federated learning in 2017 to train AI models on personal data distributed across billions of mobile devices. In 2022, many more devices are connected to the internet, including smartwatches, home assistants, alarm systems, thermostats, and even cars.

Federated learning is useful for all kinds of edge devices that continuously collect valuable data for ML models. This data is often privacy sensitive, large in quantity, or both, which makes it impractical to log centrally to a data center. 

How does federated learning fit into an existing workflow?

It is important to note that federated learning is a general technique: it is not limited to training neural networks, and applies equally to data analysis, more traditional ML methods, or any other distributed workflow. 

Very few assumptions are built into federated learning; perhaps only two are worth mentioning: 1) local sites can connect to a central server, and 2) each site has the minimum computational resources required to train locally.

Beyond that, you are free to design your own application with custom local and global aggregation behavior. You can decide how much trust to place in different parties and how much is shared with the central server. The federated system is configurable to your specific business needs.
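As one illustration of custom aggregation behavior, the server need not use a plain mean at all. The sketch below, a made-up example rather than any framework's built-in rule, swaps in a coordinate-wise median, which is more robust when one client submits an anomalous or corrupted update:

```python
# Sketch of a custom aggregation rule: a coordinate-wise median
# instead of a mean. The median is more robust to a single client
# submitting an outlier (or corrupted) model update.
import statistics

def median_aggregate(client_weights):
    """Combine client weight vectors by taking the per-parameter median."""
    return [statistics.median(params) for params in zip(*client_weights)]

updates = [
    [0.10, 0.50],
    [0.12, 0.48],
    [9.00, -3.0],  # an anomalous update from one client
]
print(median_aggregate(updates))  # [0.12, 0.48]
```

A mean over these three updates would be dragged far off by the third client; the median ignores it entirely.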

For example, federated learning can be paired with other privacy-preserving techniques such as differential privacy (which adds calibrated noise to model updates) and homomorphic encryption (which encrypts model updates so the central server never sees them in the clear).
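To make the differential-privacy idea concrete, here is a simplified sketch of what a client might do to an update before sending it: clip its L2 norm, then add Gaussian noise. This shows only the per-update mechanism; a real DP guarantee also requires privacy accounting across training rounds, and the parameter values here are illustrative, not recommendations.

```python
# Sketch of a differential-privacy-style treatment of a model update
# before it leaves the client: clip the update's L2 norm, then add
# Gaussian noise. (A full DP guarantee also needs accounting across
# rounds; this only illustrates the per-update mechanism.)
import math
import random

def privatize_update(update, clip_norm=1.0, noise_std=0.1, rng=None):
    rng = rng or random.Random()
    norm = math.sqrt(sum(w * w for w in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [w * scale for w in update]          # bound each client's influence
    return [w + rng.gauss(0.0, noise_std) for w in clipped]

raw_update = [3.0, 4.0]                            # L2 norm = 5.0
private = privatize_update(raw_update, clip_norm=1.0, noise_std=0.1)
# Before noise is added, the clipped update is [0.6, 0.8].
```

Clipping bounds how much any one client can move the global model; the noise then masks each individual contribution.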

Get started with federated learning 

We have developed a federated_learning code sample that shows you how to train a global fraud prediction model on two different splits of a credit card transactions dataset, corresponding to two different geographic regions. 

Although federated learning by definition spans multiple machines, this example is designed to simulate an entire federated system on a single machine so you can get up and running in under an hour. The system is implemented with NVFLARE, NVIDIA's open-source federated learning framework.

Acknowledgments 

We would like to thank Patrick Hogan and Anita Weemaes for their contributions to this post.

Categories
Misc

AI Shows the Way: Seoul Robotics Helps Cars Move, Park on Their Own

Imagine driving a car — one without self-driving capabilities — to a mall, airport or parking garage, and using an app to have the car drive off to park itself. Software company Seoul Robotics is using NVIDIA technology to make this possible — turning non-autonomous cars into self-driving vehicles. Headquartered in Korea, the company’s initial Read article >

The post AI Shows the Way: Seoul Robotics Helps Cars Move, Park on Their Own appeared first on NVIDIA Blog.

Categories
Misc

Smart Devices, Smart Manufacturing: Pegatron Taps AI, Digital Twins

In the fast-paced field of making the world’s tech devices, Pegatron Corp. initially harnessed AI to gain an edge. Now, it’s on the cusp of creating digital twins to further streamline its efficiency. Whether or not they’re familiar with the name, most people have probably used smartphones, tablets, Wi-Fi routers or other products that Taiwan-based Read article >

The post Smart Devices, Smart Manufacturing: Pegatron Taps AI, Digital Twins appeared first on NVIDIA Blog.