Running Large-Scale Graph Analytics with Memgraph and NVIDIA cuGraph Algorithms

With the latest Memgraph Advanced Graph Extensions (MAGE) release, you can now run GPU-powered graph analytics from Memgraph in seconds, while working in Python. Powered by NVIDIA cuGraph, the following graph algorithms now run on the GPU:

  • PageRank (graph analysis)
  • Louvain (community detection)
  • Balanced Cut (clustering)
  • Spectral Clustering (clustering)
  • HITS (hubs versus authorities analytics)
  • Leiden (community detection)
  • Katz centrality
  • Betweenness centrality

This tutorial will show you how to use PageRank graph analysis and Louvain community detection to analyze a Facebook dataset containing 1.3 million relationships.

By the end of the tutorial, you will know how to:

  • Import data inside Memgraph using Python
  • Run analytics on large-scale graphs and get fast results
  • Run analytics on NVIDIA GPUs from Memgraph

Tutorial prerequisites

To follow this graph analytics tutorial, you will need an NVIDIA GPU, driver, and container toolkit. Once you have successfully installed the NVIDIA GPU driver and container toolkit, you must also install the following four tools:

  • Docker
  • Jupyter notebook
  • GQLAlchemy
  • Memgraph Lab

The next sections walk you through installing and setting up these tools for the tutorial.

Docker

Docker is used to install and run the mage-cugraph Docker image. There are three steps involved in setting up and running the Docker image: 

  1. Download Docker
  2. Download the tutorial data
  3. Run the Docker image, giving it access to the tutorial data

1. Download Docker

You can install Docker by visiting the Docker webpage and following the instructions for your operating system. 

2. Download the tutorial data

Before running the mage-cugraph Docker image, first download the data that will be used in the tutorial. This allows you to give the Docker image access to the tutorial dataset when run.  

To download the data, use the following commands to clone the jupyter-memgraph-tutorials GitHub repo and change into the jupyter-memgraph-tutorials/cugraph-analytics folder:

git clone https://github.com/memgraph/jupyter-memgraph-tutorials.git
cd jupyter-memgraph-tutorials/cugraph-analytics

3. Run the Docker image

You can now use the following command to run the Docker image and mount the tutorial data to the /samples folder:

docker run -it -p 7687:7687 -p 7444:7444 --volume /data/facebook_clean_data/:/samples mage-cugraph

When you run the Docker container, you should see the following message:

You are running Memgraph vX.X.X
To get started with Memgraph, visit https://memgr.ph/start

Because of the --volume mount, the CSV files needed for the tutorial are located inside the /samples folder within the Docker container, where Memgraph will find them when needed.

Jupyter notebook

Now that Memgraph is running, install Jupyter. This tutorial uses JupyterLab, and you can install it with the following command:

pip install jupyterlab

Once JupyterLab is installed, launch it with the following command:

jupyter lab

GQLAlchemy 

Use GQLAlchemy, an Object Graph Mapper (OGM), to connect to Memgraph and execute Cypher queries in Python. You can think of Cypher as SQL for graph databases. It contains many of the same language constructs, such as Create, Update, and Delete.

Install CMake on your system first, and then install GQLAlchemy with pip:

pip install gqlalchemy

Memgraph Lab 

The last prerequisite you need to install is Memgraph Lab. You will use it to create data visualizations upon connecting to Memgraph. Learn how to install Memgraph Lab as a desktop application for your operating system.

With Memgraph Lab installed, connect it to your running Memgraph database.

At this point, you are finally ready to:

  • Connect to Memgraph with GQLAlchemy
  • Import the dataset
  • Run graph analytics in Python

Connect to Memgraph with GQLAlchemy

First, open a new Jupyter notebook. The first three lines of code import gqlalchemy, connect to the Memgraph database instance at host 127.0.0.1 and port 7687, and clear the database so that you start with a clean slate.

from gqlalchemy import Memgraph
memgraph = Memgraph("127.0.0.1", 7687)
memgraph.drop_database()

With the connection in place, the next steps are to import the dataset from CSV files and then run PageRank and Louvain community detection using Python.

Import data

The Facebook dataset consists of eight CSV files, each having the following structure:

node_1,node_2
0,1794
0,3102
0,16645

Each record represents an edge connecting two nodes.  Nodes represent the pages, and relationships are mutual likes among them.
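To make the file layout concrete, the following is a minimal sketch that parses rows in this format with Python's csv module. The sample rows are the ones shown above; the helper name read_edges is hypothetical and is not part of the tutorial notebook.

```python
import csv
import io

# The header and rows shown above, held in memory instead of a file.
SAMPLE = "node_1,node_2\n0,1794\n0,3102\n0,16645\n"


def read_edges(text):
    """Return a list of (node_1, node_2) tuples parsed from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["node_1"], row["node_2"]) for row in reader]


edges = read_edges(SAMPLE)
# Every distinct value in either column is one page (node).
nodes = {node for edge in edges for node in edge}
print(len(edges), "edges,", len(nodes), "distinct pages")
```

Note that the csv module returns every value as a string, so the ids here are "0" and "1794" rather than integers.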

There are eight distinct types of pages (Government, Athletes, and TV shows, for example). Pages have been reindexed for anonymity, and all pages have been verified for authenticity by Facebook.

Since Memgraph imports data faster when the matched property is indexed, create an index on the id property for all nodes with the label Page.

memgraph.execute(
    """
    CREATE INDEX ON :Page(id);
    """
)

The Docker container already has access to the data used in this tutorial, so you can list the local files in the ./data/facebook_clean_data/ folder. By prefixing each file name with the /samples/ folder, you get the path of that file as seen from inside the container. Use these concatenated file paths to load data into Memgraph.

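import os
from os import listdir
from os.path import isfile, join

# Local folder that holds the CSV files (mounted into the container as /samples)
csv_dir_path = os.path.abspath("./data/facebook_clean_data/")
# Build the path of each file as Memgraph sees it inside the container
csv_files = [f"/samples/{f}" for f in listdir(csv_dir_path) if isfile(join(csv_dir_path, f))]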

Load all CSV files using the following query:

for csv_file_path in csv_files:
    memgraph.execute(
        f"""
        LOAD CSV FROM "{csv_file_path}" WITH HEADER AS row
        MERGE (p1:Page {{id: row.node_1}}) 
        MERGE (p2:Page {{id: row.node_2}}) 
        MERGE (p1)-[:LIKES]->(p2);
        """
    )

For more information about importing CSV files with LOAD CSV see the Memgraph documentation.
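One detail of the loop above that is easy to miss is the doubled braces. Inside a Python f-string, {{ and }} produce literal braces, so the Cypher property map survives while {csv_file_path} is substituted. The sketch below wraps the same query text in a hypothetical helper, build_load_query, and uses a made-up file name purely to show the rendered result.

```python
def build_load_query(csv_file_path):
    """Return the LOAD CSV query from the tutorial for one file path."""
    return f"""
    LOAD CSV FROM "{csv_file_path}" WITH HEADER AS row
    MERGE (p1:Page {{id: row.node_1}})
    MERGE (p2:Page {{id: row.node_2}})
    MERGE (p1)-[:LIKES]->(p2);
    """


# Hypothetical file name, used only for illustration.
query = build_load_query("/samples/example.csv")
print(query)
```

Printing the query before sending it to Memgraph is a cheap way to check that the path and the property maps look as intended.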

Next, use PageRank and Louvain community detection algorithms with Python to determine which pages in the network are most important, and to find all the communities in a network.

PageRank importance analysis

To identify important pages in a Facebook dataset, you will execute PageRank. Learn about different algorithm settings that can be set when calling PageRank.

Note that you will also find other algorithms integrated within MAGE. Memgraph should help with the process of running graph analytics on large-scale graphs. Find other Memgraph tutorials on how to run these analytics.

MAGE is integrated to simplify executing PageRank. The following query first executes the algorithm, which yields each node together with its score in the variable rank. The SET clause then creates the rank property on each node and sets it to the value that the cugraph.pagerank algorithm returned.

Note that this query (and all tests presented here) was executed on an NVIDIA GeForce GTX 1650 Ti and an Intel Core i5-10300H CPU at 2.50 GHz with 16 GB RAM, and returned results in around four seconds.

memgraph.execute(
    """
    CALL cugraph.pagerank.get() YIELD node, rank
    SET node.rank = rank;
    """
)

Next, retrieve ranks using the following Python call:

results = memgraph.execute_and_fetch(
    """
    MATCH (n)
    RETURN n.id AS node, n.rank AS rank
    ORDER BY rank DESC
    LIMIT 10;
    """
)
# Each result is a dictionary keyed by the returned column names
for dict_result in results:
    print(f"node id: {dict_result['node']}, rank: {dict_result['rank']}")

node id: 50493, rank: 0.0030278728385218327
node id: 31456, rank: 0.0027350282311318468
node id: 50150, rank: 0.0025153975342989345
node id: 48099, rank: 0.0023413620866201052
node id: 49956, rank: 0.0020696403564964
node id: 23866, rank: 0.001955167533390466
node id: 50442, rank: 0.0019417018181751462
node id: 49609, rank: 0.0018211204462452515
node id: 50272, rank: 0.0018123518843272954
node id: 49676, rank: 0.0014821440895415787

This code returns 10 nodes with the highest rank score. Results are available in a dictionary form.
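To build some intuition for what these scores mean, here is a small pure-Python sketch of PageRank by power iteration on a toy graph. It is an illustration only and makes no claim about how cuGraph implements the algorithm on the GPU; the function name, the damping factor of 0.85, and the toy edges are all assumptions chosen for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a list of directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    # Start from a uniform distribution over all nodes.
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source in nodes:
            # A node without outgoing edges spreads its rank over all nodes.
            targets = out_links[source] or nodes
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: "c" is liked by three other pages, so it should rank highest.
toy_edges = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "c")]
ranks = pagerank(toy_edges)
for node, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(node, round(score, 4))
```

The scores always sum to 1, which is why the values in a graph with tens of thousands of pages are as small as the ones printed above.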

Now, it is time to visualize results with Memgraph Lab. In addition to creating beautiful visualizations powered by D3.js and our Graph Style Script language, you can use Memgraph Lab to:

  • Query the graph database and write your own graph algorithms in Python, C++, or even Rust
  • Check Memgraph Database Logs
  • Visualize graph schema

Memgraph Lab comes with a variety of pre-built datasets to help you get started. Open Execute Query view in Memgraph Lab and run the following query:

MATCH (n)
WITH n
ORDER BY n.rank DESC
LIMIT 3
MATCH (n)<-[e]-(m)
RETURN *;

The first part of this query will MATCH all the nodes, ORDER them by their rank in descending order, and keep only the top three with LIMIT 3.

For those three nodes, the second MATCH obtains all pages connected to them. The WITH clause is needed to connect the two parts of the query. Figure 1 shows the PageRank query results.

Generated graph for visualization of grouped PageRank results
Figure 1. PageRank results visualized in Memgraph Lab

The next step is learning how to use Louvain community detection to find communities present in the graph.

Community detection with Louvain

The Louvain algorithm measures the extent to which the nodes within a community are connected, compared to how connected they would be in a random network.

It also recursively merges communities into a single node and executes the modularity clustering on the condensed graphs. This is one of the most popular community detection algorithms.
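The quantity described above is usually called modularity. As a rough illustration, the sketch below computes it for an undirected, unweighted toy graph and a given assignment of nodes to communities. The function name and the toy graph are assumptions for this example, and the code only scores a partition; it does not perform the Louvain optimization itself.

```python
def modularity(edges, community):
    """Modularity of an undirected, unweighted graph for a given partition."""
    m = len(edges)
    internal = {}  # number of edges with both ends inside each community
    degree = {}    # sum of node degrees inside each community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Observed fraction of internal edges minus the fraction expected at random.
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
        for c in degree
    )


# Two triangles joined by the single edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
merged = {node: "A" for node in range(6)}

print(modularity(toy_edges, split))   # one community per triangle
print(modularity(toy_edges, merged))  # everything in one community
```

Splitting the graph into its two triangles scores higher than lumping all nodes together, which is the kind of improvement Louvain searches for.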

Using Louvain, you can find the number of communities within the graph.  First execute Louvain and save the cluster_id as a property for every node:

memgraph.execute(
    """
    CALL cugraph.louvain.get() YIELD cluster_id, node
    SET node.cluster_id = cluster_id;
    """
)

To find the number of communities, run the following code:

results = memgraph.execute_and_fetch(
    """
    MATCH (n)
    WITH DISTINCT n.cluster_id AS cluster_id
    RETURN count(cluster_id) AS num_of_clusters;
    """
)
# The query returns a single row
result = list(results)[0]

# Each row is a dictionary keyed by the returned column name
print(f"Number of clusters: {result['num_of_clusters']}")

Number of clusters: 2664
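A natural follow-up is to look at how large the communities are. The sketch below shows one way to tally community sizes in Python with collections.Counter. The rows are hypothetical stand-ins, shaped like the dictionaries that execute_and_fetch yields for a query returning n.id AS node and n.cluster_id AS cluster_id; they are not results from the Facebook dataset.

```python
from collections import Counter

# Hypothetical rows for illustration only, not real tutorial output.
rows = [
    {"node": "1", "cluster_id": 0},
    {"node": "2", "cluster_id": 0},
    {"node": "3", "cluster_id": 0},
    {"node": "4", "cluster_id": 1},
    {"node": "5", "cluster_id": 1},
    {"node": "6", "cluster_id": 2},
]

# Count how many nodes fall into each community.
sizes = Counter(row["cluster_id"] for row in rows)
print(f"Number of clusters: {len(sizes)}")
for cluster_id, size in sizes.most_common(2):
    print(f"cluster {cluster_id}: {size} nodes")
```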

Next, take a closer look at some of these communities. For example, you may find nodes that belong to one community but are connected to a node that belongs to a different community. Louvain attempts to minimize the number of such nodes, so you should not see many of them. In Memgraph Lab, execute the following query:

MATCH (n2)<-[e1]-(n1)-[e]->(m1)
WHERE n1.cluster_id != m1.cluster_id AND n1.cluster_id = n2.cluster_id
RETURN *
LIMIT 1000;

This query will MATCH node n1 and its relationships to two other nodes, n2 and m1, with the following parts, respectively: (n2)<-[e1]-(n1) and (n1)-[e]->(m1). Then, it will keep only those matches WHERE n1 and n2 share the same cluster_id, and that cluster_id differs from the cluster_id of node m1.

Use LIMIT 1000 to show only 1,000 of such relationships, for visualization simplicity.
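The same filter can be sketched in plain Python to make the pattern explicit. The example assumes directed edges that leave n1 toward both n2 and m1, and it uses a made-up toy graph and a hypothetical helper name, boundary_patterns. It mirrors the idea of the query rather than reproducing Memgraph's matching semantics exactly.

```python
def boundary_patterns(edges, cluster_id):
    """Return (n2, n1, m1) triples where n1 -> n2 stays in n1's cluster
    and n1 -> m1 leaves it."""
    out_links = {}
    for source, target in edges:
        out_links.setdefault(source, []).append(target)

    patterns = []
    for n1, targets in out_links.items():
        for n2 in targets:
            for m1 in targets:
                same = cluster_id[n1] == cluster_id[n2]
                different = cluster_id[n1] != cluster_id[m1]
                if same and different:
                    patterns.append((n2, n1, m1))
    return patterns


# Toy graph: node 2 sits in cluster "A" but also likes node 3 in cluster "B".
toy_edges = [(2, 0), (2, 1), (2, 3), (3, 4)]
toy_clusters = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}
patterns = boundary_patterns(toy_edges, toy_clusters)
print(patterns)
```

Only node 2 produces matches here, because it is the only node with neighbors both inside and outside its own community.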

Using Graph Style Script in Memgraph Lab, you can style your graphs to, for example, represent different communities with different colors. Figure 2 shows the Louvain query results. 

Generated graph visualization of the Louvain query results
Figure 2. Louvain results visualized in Memgraph Lab

Summary

And there you have it: a graph with 1.3 million relationships imported into Memgraph and analyzed using the cuGraph PageRank and Louvain graph analytics algorithms. With GPU-powered graph analytics from Memgraph, powered by NVIDIA cuGraph, you can explore massive graph databases and get results in seconds instead of waiting for them. You can find more tutorials covering a variety of techniques on the Memgraph website.
