
Startup Digs Into Public Filings With GPU-Driven Machine Learning to Serve Up Alternative Financial Data Services

When Rachel Carpenter and Joseph French founded Intrinio a decade ago, the fintech revolution had only just begun. But they saw an opportunity to apply machine learning to vast amounts of financial filings to create an alternative data provider among the giants. The startup, based in St. Petersburg, Fla., delivers financial data to hedge funds…



Addressing Cybersecurity in the Enterprise with AI

Cybersecurity-related risk remains one of the top sources of risk in the enterprise. This has been exacerbated by the global pandemic, which has forced companies to accelerate digitization initiatives to better support a remote workforce.

This includes not only the infrastructure to support a distributed workforce but also automation through robotics, data analytics, and new applications. Unfortunately, this expansive digital footprint has led to an increase in cybercriminal attacks.

If you are considering a new cybersecurity solution for your business, it is important to understand how traditional prevention methods differ from modern AI solutions.

Are traditional cybersecurity methods still feasible for enterprises?

The proliferation of endpoints in today’s more distributed environments makes traditional cybersecurity methods, which create perimeters to secure the infrastructure, much less effective. In fact, it’s estimated that for at least half of all attacks, the intruder is already inside.

Manual data collection and analysis process

Implementing rules-based tools or supervised machine-learning systems to combat cyberattacks is ineffective. The number of logs collected on devices and added to networks continues to increase and can overwhelm traditional collection mechanisms. Petabytes of data are easily amassed and must be sent back to a central data lake for processing.

Due to bandwidth limitations, only a small sample is typically analyzed, often as little as five percent of the data, leaving the vast majority of packets unexamined. This is a suboptimal way of analyzing data for cybersecurity threats.

Most enterprises have the means to look at only a small percentage of their data. This means they are likely missing valuable data points that could help identify vulnerabilities and prevent threats. Analysts may look to enrich their view of what is happening in and around the network by integrating tools and data, but this is often a manual process. 

Lack of AI capabilities leads to longer threat detection times

It is estimated that it can take up to 277 days to identify and contain a security breach. Being able to quickly triage and iterate on a perceived threat is crucial, but also typically requires human intervention. These problems are magnified by the global shortage of cybersecurity professionals. 

Supervised ML systems also can't detect zero-day threats because they rely on a "look back" approach, learning only from attacks that have already been seen. Traditional software-driven approaches like these can impede security teams from responding quickly to cybercriminals.

A better way to address threat detection challenges is with AI technology. For example, a banking institution may implement an AI cybersecurity solution to automatically identify which customer transactions are typical and which are potential threats.

How is AI changing modern cybersecurity solutions?

It’s no secret that cybersecurity professionals face an uphill battle to keep their organizations secure. Traditional threat detection methods are costly, reactive, and leave large gaps in security coverage, particularly in operations and globally distributed organizations.

To meet today’s cyberthreats, organizations need solutions that can provide visibility into 100% of the available data to identify malicious activity, along with insights to assist cybersecurity analysts in responding to threats.

AI cybersecurity use cases include:

  • Analyst augmentation technology using predictive analytics to assist with querying large datasets.
  • User behavior risk scoring using AI algorithms to mine network data to identify and stop potential threats.
  • Reducing the time required to detect threats through faster, automated AI model updates.

Adopt an enterprise AI cybersecurity framework

NVIDIA Morpheus enables enterprises to observe all their data and apply AI inferencing and real-time monitoring of every server and packet across the entire network, at a scale previously impossible to achieve. 

The Morpheus pipeline, combined with the NVIDIA accelerated computing platform, enables the analysis of cybersecurity data orders of magnitude faster than traditional solutions that use CPU-only servers. 

Additionally, the Morpheus prebuilt use cases enable simplified augmentation of existing security infrastructure:

  • Digital fingerprinting uses unsupervised AI and time series modeling to create micro-targeted models for every user account and machine account combination running on the network, detecting humans posing as machines and machines posing as humans (a toy sketch of the idea follows this list).
  • Phishing detection analyzes the entire raw email to classify it into ham, spam, or phishing.
  • Sensitive information detection finds and classifies leaked credentials, keys, passwords, credit card numbers, financial account information, and more.
  • Crypto-mining detection addresses the issue, reported by more than 69% of enterprises, of crypto-mining malware resulting in malicious DNS traffic and over-utilization of compute resources. This model distinguishes crypto-mining malware from legitimate machine learning and deep learning workloads, among other compute activity.
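Here is that toy sketch: a minimal, purely illustrative example of the per-account idea behind digital fingerprinting, in which new activity is scored against each account's own historical baseline. It is not the Morpheus implementation or API.

# Toy sketch only: one tiny baseline per account, with new activity scored
# against that account's own history. Not the Morpheus implementation or API.
import numpy as np


class AccountBaseline:
    def __init__(self, history: np.ndarray) -> None:
        # history: rows of per-event features (e.g., bytes sent, login hour)
        self.mean = history.mean(axis=0)
        self.std = history.std(axis=0) + 1e-6  # avoid division by zero

    def anomaly_score(self, event: np.ndarray) -> float:
        # largest z-score distance from this account's typical behavior
        return float(np.abs((event - self.mean) / self.std).max())


# one micro-targeted model per user or machine account
baselines = {
    "svc-build-01": AccountBaseline(np.array([[200.0, 9.0], [220.0, 10.0], [210.0, 9.0]])),
}

score = baselines["svc-build-01"].anomaly_score(np.array([5000.0, 3.0]))
print(f"anomaly score: {score:.1f}")  # a large score flags behavior worth investigating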

For more information, see the full list of NVIDIA Morpheus use cases.

Next steps

To get started with Morpheus, see the nvidia/morpheus GitHub repo.

To learn about how Morpheus can help companies leverage AI to improve their cybersecurity posture, register for the free online Morpheus DLI course or check out the following on-demand GTC sessions:

For live sessions, join us at GTC, Sept 19 – 22, to explore the latest technology and research across AI, data science, cybersecurity, and more.

  • Learn About the Latest Developments with AI-Powered Cybersecurity [A41142]: Learn about the latest innovations available with NVIDIA Morpheus, being introduced in the Fall 2022 release, and find out how today’s security analysts are using Morpheus in their everyday investigations and workflows. – Bartley Richardson, Director of Cybersecurity Engineering, NVIDIA.
  • Deriving Cyber Resilience from the Data Supply Chain [A41145]: Hear how NVIDIA tackles these challenges through the application of zero-trust architectures in combination with AI and data analytics, combating our joint adversaries with a data-first response using DPU, GPU, and AI SDKs and tools. Learn where the promise of cyber-AI is working in application. – Daniel Rohrer, Vice President of Software Product Security, NVIDIA.
  • Accelerating the Next Generation of Cybersecurity Research [A41120]: Discover how to apply prebuilt models for digital fingerprinting to analyze behavior of every user and machine, analyze raw emails to automatically detect phishing, find and classify leaked credentials and sensitive information, profile behaviors to detect malicious code and behavior, and leverage graph neural networks to identify fraud. – Killian Sexsmith, Senior Developer Relations Manager, NVIDIA.

Building a Machine Learning Microservice with FastAPI

Deploying an application using a microservice architecture has several advantages: easier main system integration, simpler testing, and reusable code components. FastAPI has recently become one of the most popular web frameworks used to develop microservices in Python. FastAPI is much faster than Flask (a commonly used web framework in Python) because it is built over an Asynchronous Server Gateway Interface (ASGI) instead of a Web Server Gateway Interface (WSGI).
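As a minimal illustration of what ASGI enables, the following self-contained FastAPI sketch defines an async endpoint that can await non-blocking I/O. The route and the simulated delay are placeholders, not part of the project built later in this post.

import asyncio

from fastapi import FastAPI

app = FastAPI()


@app.get("/ping")
async def ping() -> dict:
    # Simulate non-blocking I/O, such as a database or model-service call;
    # while this coroutine awaits, the server can handle other requests.
    await asyncio.sleep(0.1)
    return {"message": "pong"}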

What are microservices?

Microservices define an architectural and organizational approach to building software applications. One key aspect of microservices is that they are distributed and loosely coupled, so implementing changes in one service is unlikely to break the entire application.

You can also think of an application built with a microservice architecture as being composed of several small, independent services that communicate through application programming interfaces (APIs). Typically, each service is owned by a smaller, self-contained team responsible for implementing changes and updates when necessary. 

One of the major benefits of using microservices is that they enable teams to build new components for their applications rapidly. This is vital to remain aligned with ever-changing business needs.

Another benefit is how simple they make it to scale applications on demand. Businesses can accelerate the time-to-market to ensure that they are meeting customer needs constantly.

The difference between microservices and monoliths

Monoliths are another type of software architecture that proposes a more traditional, unified structure for designing software applications. Here are some of the differences. 

Microservices are decoupled

Think about how a microservice breaks down an application into its core functions. Each function is referred to as a service and performs a single task. 

In other words, a service can be independently built and deployed. The advantage of this is that individual services work without impacting the other services. For example, if one service is in more demand than the others, it can be independently scaled. 

Monoliths are tightly coupled

On the other hand, a monolith architecture is tightly coupled and runs as a single service. The downside is that when one process experiences a demand spike, the entire application must be scaled to prevent this process from becoming a bottleneck. There’s also the increased risk of application downtime, as a single process failure affects the whole application.

With a monolithic architecture, it is much more complex to update or add new features to an application as the codebase grows. This limits the room for experimentation. 

When to use microservices or monoliths?

These differences do not necessarily mean microservices are better than monoliths. In some instances, it still makes more sense to use a monolith, such as building a small application that will not demand much business logic, superior scalability, or flexibility.

However, machine learning (ML) applications are often complex systems with many moving parts and must be able to scale to meet business demands. Using a microservice architecture for ML applications is usually desirable.

Packaging a machine learning model

Before I can get into the specifics of the architecture to use for this microservice, there is an important step to go through: model packaging. You can only truly realize the value of an ML model when its predictions can be served to end users. In most scenarios, that means going from notebooks to scripts so that you can put your models into production.

In this instance, you convert the scripts that train the model and make predictions on new instances into a Python package. Packages are an essential part of programming. Without them, much of your development time is wasted rewriting existing code.

To better understand what packages are, it is much easier to start with what scripts are and then introduce modules.

  • Script: A file expected to be run directly. Each script execution performs a specific behavior defined by the developer. Creating a script is as simple as saving a file with the .py extension to denote a Python file. 
  • Module: A program created to be imported into other scripts or modules. A module typically consists of several classes and functions intended to be used by other files. Another way to think of modules is as code to be reused over and over again.

A package may be defined as a collection of related modules. These modules interact with one another in a specific way that enables you to accomplish a task. In Python, packages are typically bundled and distributed through PyPI, and they can be installed using pip, the Python package installer.
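As a toy illustration (the file names and function below are hypothetical), a module exposes reusable code and a script imports and runs it:

# utils.py (module): code meant to be imported and reused
def add(a: int, b: int) -> int:
    return a + b


# run.py (script): code meant to be executed directly
from utils import add

if __name__ == "__main__":
    print(add(2, 3))  # prints 5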

For this post, you bundle the ML model as a Python package. To follow along with the code, see the kurtispykes/car-evaluation-project GitHub repo.

Figure 1 shows the directory structure for this model. 

Figure 1. The directory structure for the model

The package modules include the following:

  • config.yml: YAML file to define constant variables.
  • pipeline.py: Pipeline to perform all feature transformations and modeling.
  • predict.py: To make predictions on new instances with the trained model (a sketch of its general shape follows this list).
  • train_pipeline.py: To conduct model training.
  • VERSION: The current release.
  • config/core.py: Module used to parse YAML file such that constant variables may be accessed in Python. 
  • data/: All data used for the project.
  • models/: The trained serialized model.
  • processing/data_manager.py: Utility functions for data management.
  • processing/features.py: Feature transformations to be used in the pipeline.
  • processing/validation.py: A data validation schema.
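As referenced in the list, here is a rough sketch of the shape that predict.py can take. The helper names, import paths, and file name below are assumptions for illustration only; the actual implementation lives in the kurtispykes/car-evaluation-project repo.

# Illustrative only: load_pipeline and validate_inputs stand in for the helpers
# in processing/data_manager.py and processing/validation.py; names may differ.
import typing as t

import pandas as pd

from car_evaluation_model import __version__ as _version
from car_evaluation_model.processing.data_manager import load_pipeline
from car_evaluation_model.processing.validation import validate_inputs

_pipe = load_pipeline(file_name=f"model_output_v{_version}.pkl")  # assumed file name


def make_prediction(*, inputs: t.Union[dict, pd.DataFrame]) -> dict:
    """Validate raw inputs, then return predictions plus any validation errors."""
    data = pd.DataFrame(inputs)
    validated_data, errors = validate_inputs(input_data=data)

    results = {"predictions": None, "version": _version, "errors": errors}
    if errors:
        return results

    results["predictions"] = list(_pipe.predict(validated_data))
    return results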

The model is not optimized for this problem, as the main focus of this post is to show how to build an ML application with a microservice architecture.

Now the model is ready to be distributed, but there is an issue. Distributing the package through the public PyPI index would mean that it is accessible worldwide. This may be okay for a scenario where there's no business value in the model. However, this would be a complete disaster in a real business scenario.

Instead of using PyPI to host private packages, you can use a third-party tool like Gemfury. The steps to do this are beyond the scope of this post. For more information, see Installing private Python packages.

Figure 2 shows the privately packaged model in my Gemfury repository. I've made this package public for demonstration purposes.

Figure 2. The packaged model in a Gemfury repository

Microservice system design

After you have trained and saved your model, you need a way of serving predictions to the end user. REST APIs are a great way to achieve this goal. There are several application architectures you could use to integrate the REST API. Figure 3 shows the embedded architecture that I use in this post. 

Figure 3. A visual representation of the embedded approach

An embedded architecture refers to a system in which the trained model is embedded into the API and installed as a dependency. 

There is a natural trade-off between simplicity and flexibility. The embedded approach is much simpler than other approaches but is less flexible. For example, whenever a model update is made, the entire application would have to be redeployed. If your service were being offered on mobile, then you’d have to release a new version of the software. 

Building the API with FastAPI

The first consideration when building the API is dependencies. You won't be creating a virtual environment, because you are running the application with tox, a command-line-driven, automated testing tool that is also used for generic virtualenv management. Thus, calling tox creates a virtual environment and runs the application.

Nonetheless, here are the dependencies. 

--extra-index-url="https://repo.fury.io/kurtispykes/"
car-evaluation-model==1.0.0

# Package names after uvicorn and the upper version bounds are inferred from
# the imports in this post; check the repo for the exact pins.
uvicorn>=0.18.2,<0.19.0
fastapi>=0.79.0,<1.0.0
python-multipart>=0.0.5,<0.1.0
pydantic>=1.9.1,<1.10.0
typing_extensions>=3.10.0,<5.0.0
loguru>=0.6.0,<0.7.0

There’s an extra index, another index for pip to search for packages if it cannot be found in PyPI. This is a public link to the Gemfury account hosting the packaged model, thus, enabling you to install the trained model from Gemfury. This would be a private package in a professional setting, meaning that the link would be extracted and hidden in an environment variable. 

Another thing to take note of is uvicorn. Uvicorn is a web server that implements the ASGI interface. In other words, it is a dedicated web server responsible for dealing with inbound and outbound requests. It's defined in the Procfile.

web: uvicorn app.main:app --host 0.0.0.0 --port $PORT

Now that the dependencies are specified, you can move on to look at the actual application. The main part of the API application is the main.py script:

from typing import Any

from fastapi import APIRouter, FastAPI, Request
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import HTMLResponse
from loguru import logger

from app.api import api_router
from app.config import settings, setup_app_logging

# setup logging as early as possible
setup_app_logging(config=settings)


app = FastAPI(
    title=settings.PROJECT_NAME, openapi_url=f"{settings.API_V1_STR}/openapi.json"
)

root_router = APIRouter()


@root_router.get("/")
def index(request: Request) -> Any:
    """Basic HTML response."""
    body = (
        ""
        ""
        "

Welcome to the API

" "
" "Check the docs: here" "
" "" "" ) return HTMLResponse(content=body) app.include_router(api_router, prefix=settings.API_V1_STR) app.include_router(root_router) # Set all CORS enabled origins if settings.BACKEND_CORS_ORIGINS: app.add_middleware( CORSMiddleware, allow_origins=[str(origin) for origin in settings.BACKEND_CORS_ORIGINS], allow_credentials=True, allow_methods=["*"], allow_headers=["*"], ) if __name__ == "__main__": # Use this for debugging purposes only logger.warning("Running in development mode. Do not run like this in production.") import uvicorn # type: ignore uvicorn.run(app, host="localhost", port=8001, log_level="debug")

If you are unable to follow along, do not worry about it. The key thing to note is that there are two routers in the main application:

  • root_router: This endpoint defines a body that returns an HTML response. You could almost think of it as a home endpoint that is called index. 
  • api_router: This endpoint is used to specify the more complex endpoints that permit other applications to interact with the ML model.

Dive deeper into the api.py module to understand api_router better. There are two endpoints defined in this module: health and predict.

Take a look at the code example: 

@api_router.get("/health", response_model=schemas.Health, status_code=200)
def health() -> dict:
    """
    Root Get
    """
    health = schemas.Health(
        name=settings.PROJECT_NAME, api_version=__version__, model_version=model_version
    )

    return health.dict()

@api_router.post("/predict", response_model=schemas.PredictionResults, status_code=200)
async def predict(input_data: schemas.MultipleCarTransactionInputData) -> Any:
    """
    Make predictions with the Fraud detection model
    """

    input_df = pd.DataFrame(jsonable_encoder(input_data.inputs))

    # Advanced: You can improve performance of your API by rewriting the
    # `make_prediction` function to be async and using await here.
    logger.info(f"Making prediction on inputs: {input_data.inputs}")
    results = make_prediction(inputs=input_df.replace({np.nan: None}))

    if results["errors"] is not None:
        logger.warning(f"Prediction validation error: {results.get('errors')}")
        raise HTTPException(status_code=400, detail=json.loads(results["errors"]))

    logger.info(f"Prediction results: {results.get('predictions')}")

    return results

The health endpoint is quite straightforward. It returns the health response schema of the model when you access the web server (Figure 4). You defined this schema in the health.py module in the schemas directory. 
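That schema is a small pydantic model along these lines. The sketch below is illustrative; the field names are inferred from the health() response shown earlier, and the exact definition lives in health.py.

from pydantic import BaseModel


class Health(BaseModel):
    name: str
    api_version: str
    model_version: str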

Figure 4. Server response from the health endpoint

The predict endpoint is slightly more complex. Here are the steps involved (an example request follows the list):

  1. Take the input and convert it into a pandas DataFrame: the jsonable_encoder returns a JSON-compatible version of the pydantic model.
  2. Log the input data for audit purposes.
  3. Make a prediction using the ML model's make_prediction function.
  4. Catch any errors made by the model.
  5. Return the results if the model has no errors.
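As mentioned above, here is a hypothetical request against the running server. The /api/v1 prefix and the field names (taken from the car evaluation dataset) are assumptions; check the example request body in the interactive docs for the exact schema.

import requests

payload = {
    "inputs": [
        {
            "buying": "low",
            "maint": "low",
            "doors": "4",
            "persons": "4",
            "lug_boot": "big",
            "safety": "high",
        }
    ]
}

response = requests.post("http://localhost:8001/api/v1/predict", json=payload)
print(response.status_code)  # 200 when the inputs pass validation
print(response.json())       # for example: {"predictions": [...], "errors": None, ...}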

Check that all is functioning well by spinning up a server with the following command from a terminal window: 

py -m tox -e run

This should display several logs if the server is running, as shown in Figure 5. 

Figure 5. Logs to show the server is running

Now, you can navigate to http://localhost:8001 to see the interactive endpoints of the API. 

Testing the microservice API

Navigating to the local server takes you to the index endpoint defined in root_router from the main.py script. You can get more information about the API by adding /docs to the end of the localhost server URL.

For example, Figure 6 shows that you’ve created the predict endpoint as a POST request, and the health endpoint is a GET request.

Figure 6. API endpoints

First, expand the predict heading to see more information about the endpoint. In this heading, you see an example in the request body. I defined this example in one of the schemas so that you can test the API. This is beyond the scope of this post, but you can browse the schema code.

To try out the model on the request body example, choose Try it out.

Figure 7. An example prediction with the microservice

Figure 7 shows that the model returns a predicted output class of 1. Internally, you know that 1 refers to the acc class value, but you may want to surface that label to users when the result is displayed in a user interface.
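One lightweight way to do that is to map the integer output back to the dataset's class labels before rendering the result. The mapping below is hypothetical and depends on how the target was encoded during training.

# Hypothetical mapping: confirm the encoding used when the target was fitted.
CLASS_LABELS = {0: "unacc", 1: "acc", 2: "good", 3: "vgood"}


def decode_prediction(prediction: int) -> str:
    return CLASS_LABELS.get(prediction, "unknown")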

What’s next?

Congratulations, you have now built your own ML model microservice. The next steps involve deploying it so that it can run in production. 

To recap: A microservice is an architectural and organizational design approach that arranges loosely coupled services. One of the main benefits of using the microservice approach for ML applications is independence from the main software product. Having a feature service (the ML application) that is separate from the main software product has two key benefits:

  • It enables cross-functional teams to engage in distributed development, which results in faster deployments.
  • The scalability of the software is significantly improved.  

Did you find this tutorial helpful? Leave your feedback in the comments or connect with me at kurtispykes (LinkedIn).


Deep Learning and Data Science Workshops at GTC 2022

Hands-on, expert-led workshops in data science and deep learning at GTC 2022 are just $99 when you register by August 29 (standard price $500).


Startup’s Vision AI Software Trains Itself — in One Hour — to Detect Manufacturing Defects in Real Time

Cameras have been deployed in factories for over a decade — so why, Franz Tschimben wondered, hasn’t automated visual inspection yet become the worldwide standard? This question motivated Tschimben and his colleagues to found Covision Quality, an AI-based visual-inspection software startup that uses NVIDIA technology to transform end-of-line defect detection for the manufacturing industry.



Boldly Go: Discover New Frontiers in AI-Powered Transportation at GTC

AI and the metaverse are revolutionizing every aspect of the way we live, work and play — including how we move. Leaders in the automotive and technology industries will come together at NVIDIA GTC to discuss the newest breakthroughs driving intelligent vehicles, whether in the real world or in simulation. The virtual conference, which runs…



Easy A: GeForce NOW Brings Higher Resolution and Frame Rates for Browser Streaming on PC

Class is in session this GFN Thursday as GeForce NOW makes the up-grade with support for higher resolutions and frame rates in Chrome browser on PC. It’s the easiest way to spice up a boring study session. When the lecture is over, dive into the six games joining the GeForce NOW library this week…



Attend Expert-Led Developer Sessions at GTC 2022

Register now and get ready to explore cutting-edge technology and the latest developer tools at GTC.


Immunai Co-Founder Luis Voloch on Using Deep Learning to Develop New Drugs

Mapping the immune system could lead to the creation of drugs that help our bodies win the fight against cancer and other diseases. That’s the big idea behind immunotherapy. The problem: the immune system is incredibly complex. Enter Immunai, a biotech company that’s using cutting-edge genomics & ML technology to map the human immune system…



Using Network Graphs to Visualize Potential Fraud on Ethereum Blockchain

Beyond the unimaginable prices for monkey pictures, NFTs’ underlying technology provides companies with a new avenue to directly monetize their online engagements. Major brands such as Adidas, NBA, and TIME have already begun experimenting with these revenue streams using NFTs, and we are still early in this trend.

As data practitioners, we are positioned to provide valuable insights into these revenue streams, given that all transactions are public on the blockchain. This post provides a guided project to access, analyze, and identify potential fraud in blockchain data using Python.

In this post and accompanying Jupyter notebook, I discuss the following:

  • The basics of blockchain, NFTs, and network graphs.
  • How to pull NFT data using the open-source package NFT Analyst Starter Pack from a16z.
  • How to interpret Ethereum blockchain data.
  • The fraudulent practice of wash trading NFTs.
  • Constructing network graphs to visualize potential wash trading on the NFT project Bored Ape Yacht Club.

The Jupyter notebook has a more detailed, step-by-step guide for writing Python code to implement this example walkthrough, and the post provides additional context. In addition, this post assumes that you have a basic understanding of pandas, data preparation, and data visualization.

What is blockchain data?

Cutting through the media frenzy of coins named after dogs and pixelated pictures selling for hundreds of thousands of dollars reveals fascinating technology: the blockchain. 

The following excerpt best describes this decentralized data source:

“At a very high level, blockchains are ledgers of transactions utilizing cryptography that can only add information and thus can’t be changed (i.e., immutable). What separates blockchains from ledgers you find at banks is a concept called ‘decentralization’— where every computer connected to a respective blockchain must ‘agree’ on the same state of the blockchain and subsequent data added to it.”

For more information about Ethereum blockchain data, see Making Sense of Ethereum Data for Analytics.

Central to this technology is that all the data (for example, logs, metadata, and so on) must be public and accessible. I highly recommend Stanford professor Dan Boneh’s lecture.

What is an NFT?

NFT stands for non-fungible token, a crypto asset on a blockchain (such as Ethereum) that represents a unique, one-of-one token that can be digitally owned. For example, gold bars are fungible, as multiple bars can exist and represent the same thing, while the original Mona Lisa is non-fungible in that only one exists.

Contrary to popular belief, NFTs are not just art and JPEGs but digital representations of ownership on a blockchain ledger for a unique item such as art, music, or whatever the NFT creator wants to put onto the metadata. For this post, however, we use the NFT project Bored Ape Yacht Club (BAYC), an artwork NFT.

P.S. If you are a visual learner, my favorite intro resource on the topic of NFTs is the What Are NFTs and How Can They Be Used in Decentralized Finance? DEFI Explained video by Finematics.

What is a network graph, and why is it well suited to blockchain data?

Networks are a method of organizing relationship data using nodes and edges. Nodes represent an entity, such as an email address or social media account, while the edges show the connection between nodes. 

Furthermore, you can store metadata for nodes and edges to represent different aspects of a relationship. Metadata can range from weights to labels. Figure 1 shows the steps of taking an entire network and zooming into a use case with helpful labels from metadata.
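For a concrete, minimal example (the addresses and values below are made up), two wallet nodes can be joined by a directed edge that carries transaction metadata:

import networkx as nx

G = nx.DiGraph()
G.add_edge("wallet_a", "wallet_b", sale_price_usd=95_000, date="2022-04-25")

print(G.edges(data=True))
# [('wallet_a', 'wallet_b', {'sale_price_usd': 95000, 'date': '2022-04-25'})]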

Source: Graphs from notebook tutorial.
Figure 1. The various network graphs created in this post

What makes network graphs ideal for representing blockchain transactions is that there is always a to and a from blockchain address, as well as significant metadata (for example, timestamps and coin amounts) for each transaction. Furthermore, as blockchain data is public by design through decentralization, you can use network graphs to visualize economic behaviors on a respective blockchain.

In this example, I want to demonstrate identifying the fraudulent behavior of wash trading, where an individual intentionally sells an asset to themselves across multiple accounts to artificially inflate the asset’s price.

Chainalysis wrote an excellent report on the phenomena, where they identified over 260 Ethereum crypto wallets potentially engaging in wash trading with a collective profit of over $8.4 million in 2021 alone.

Pulling data from the Ethereum blockchain

Though all blockchain data is publicly available to anyone, it is still tough to access and prepare for analysis. Following are some options to access blockchain data:

  • Create your own blockchain node (for example, become a miner) to read the rawest data available.
  • Use a third-party tool to create your own blockchain node.
  • Use a third-party API to read raw data from their own blockchain node.
  • Use a third-party API to read cleaned and aggregated blockchain data from their service.
  • Use the open-source package NFT Analyst Starter Pack from a16z.

Although all are viable options, each has a tradeoff between reliability, trust, and convenience.

For example, I worked on an NFT analytics project where we wanted to create a reliable NFT market dashboard. Unfortunately, having our own blockchain node was cost-prohibitive, many third-party data sources had various data-quality issues that we couldn’t control, and it became challenging to track transactions across multiple blockchains. That project ultimately required bringing together high-quality data from numerous third-party APIs.

Thankfully, for this project you want the most convenience possible so you can focus on learning, so I recommend the NFT Analyst Starter Pack from a16z. Think of this package as a convenient wrapper around the third-party blockchain API from Alchemy that creates easy-to-use CSVs for your desired NFT contract.

Preparing data and creating network graphs

The NFT Analyst Starter Pack results in three separate CSV files for the BAYC NFT project:

  • BAYC Metadata: Information regarding a specific NFT, where asset_id is the unique identifier within this NFT token.
  • BAYC Sales: Logs and metadata for each sale, identified by its transaction hash, with seller and buyer fields identifying the wallets involved.
  • BAYC Transfers: The same as the BAYC Sales data, but for transfers in which no money moves from one wallet to another.

For this project, most of the data preparation involves the following (a condensed pandas sketch follows the list):

  • Reorganizing BAYC Sales and BAYC Transfers to enable a clean union of the two datasets.
  • Deleting duplicate logs of transfer transactions already represented in sales.
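As noted above, here is a condensed sketch of that preparation. The file and column names are assumptions and may differ slightly from the CSVs generated by the NFT Analyst Starter Pack; the notebook has the exact code.

import pandas as pd

sales = pd.read_csv("bayc_sales.csv")
transfers = pd.read_csv("bayc_transfers.csv")

sales["transaction_type"] = "sell"
transfers["transaction_type"] = "transfer"

# drop transfer rows that duplicate a transaction already captured as a sale
transfers = transfers[~transfers["transaction_hash"].isin(sales["transaction_hash"])]

# union the two datasets into a single transactions DataFrame
transactions = pd.concat([sales, transfers], ignore_index=True, sort=False)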

Given that the goal is to learn, don’t worry about whether the blockchain data is accurate, but you can always check for yourself by searching the transaction_hash value on Etherscan.

After preparing the data, use the NetworkX package to generate a network graph data structure of your NFT transactions. There are multiple ways to construct a graph, but in my opinion, the most straightforward approach is to use the function from_pandas_edgelist, where you just provide a pandas DataFrame, the to and from values to represent the nodes, and any metadata for edges and labeling.
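A sketch of that call looks like the following, where transactions is the prepared DataFrame from the previous step and the column names are assumptions:

import networkx as nx

G = nx.from_pandas_edgelist(
    transactions,
    source="from_address",           # the sending wallet
    target="to_address",             # the receiving wallet
    edge_attr=True,                  # keep every remaining column as edge metadata
    create_using=nx.MultiDiGraph(),  # directed, and wallets can transact repeatedly
)

The resulting edges, with their metadata, look like the following excerpt from the notebook: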

[('0x2fdcca65899346af3a93a8daa6128bdbcb1ce3b3',
  '0xcedf17dfafa947cd0e205fe2a3a183cf2fb3a0bc',
  {'transaction_hash': '0xb235f0321b0b50198399ec7f2bb759ef625f85673b4d90d68f711229750181e4',
   'block_number': '14675897',
   'date': '2022-04-28',
   'asset_id': '7438',
   'sale_price_eth': 153.2,
   'sale_price_usd': 442685.5285671361,
   'transaction_type': 'sell',
   'asset_contract': '0xbc4ca0eda7647a8ab7c2061c2e118a18a936f13d'}),
('0x2fdcca65899346af3a93a8daa6128bdbcb1ce3b3',
  '0xd8fdd6031fa27194f93e1a877f8bf5bfc9b47e1e',
  {'transaction_hash': '0x7b4797061eb16d73a28a869e51745e471e2849a55c80459b2aff7f0205925d74',
   'block_number': '14654313',
   'date': '2022-04-25',
   'asset_id': '5954',
   'sale_price_eth': 0.0,
   'sale_price_usd': 0.0,
   'transaction_type': 'transfer',
   'asset_contract': '0xbc4ca0eda7647a8ab7c2061c2e118a18a936f13d'})]

From this prepared data, the NetworkX package makes visualizing network graphs as easy as nx.draw, but with over 40k transactions in the DataFrame, visualizing the whole graph returns only a useless blob. Therefore, you must be specific about exactly what to visualize among your transactions to create a captivating data story.

Visualizing potential wash trading

Rather than scouring through the transactions of 10,000 NFTs, you can instead validate what others are stating within the market. Notably, the NFT Wash Trading – Is it possible to protect against it? post calls out BAYC token 8099 as potentially being subjected to the fraudulent practice of wash trading.

If you follow along in the accompanying notebook, you carry out these steps (a condensed sketch follows the list):

  • Filter the prepared NFT data to only the rows containing logs for asset_id 8099.
  • Rename the to and from wallet addresses to sequential uppercase letters, ordered by when each wallet address first appears among the NFT asset's transactions.
  • Generate network graph data from the prepared asset 8099 data using the NetworkX package.
  • Plot the network graph with the desired labels, edge arrows, and node positioning.
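And here is that condensed sketch of the steps. Column names, types, and styling are assumptions; the accompanying notebook has the full version.

import string

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

# filter to the transactions for asset 8099 (the id may be stored as int or str)
asset_df = transactions[transactions["asset_id"] == "8099"].copy()

# relabel wallet addresses as A, B, C, ... in order of first appearance
wallets = pd.unique(asset_df[["from_address", "to_address"]].values.ravel())
letters = dict(zip(wallets, string.ascii_uppercase))
asset_df = asset_df.replace(letters)

G = nx.from_pandas_edgelist(
    asset_df,
    source="from_address",
    target="to_address",
    edge_attr=True,
    create_using=nx.MultiDiGraph(),
)

pos = nx.spring_layout(G, seed=42)  # fixed seed keeps node positions reproducible
nx.draw(G, pos, with_labels=True, node_color="lightblue", arrows=True)
plt.show()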

Did the BAYC 8099 NFT experience wash trading?

The data plotted in Figure 2 enables you to visualize the transactions corresponding to asset 8099. Starting at node H, you can see that this wallet first increases the price from $95k to $166k between nodes H and I, then adds more transactions through transfers between nodes H and J. Finally, node H sells the potentially artificially inflated NFT to node K.

Figure 2. Following the transactions of NFT BAYC 8099

Though this graph can’t definitively state that node H engaged in wash trading—as you don’t know whether nodes H, I, and J are wallets owned by the same person—seeing loops for a node where the price increases should flag the need to do more due diligence. For example, you could look at etherscan.com to review the transactions between the following wallets:

  • 0xe4bc96b24e0bdf87b4b92ed39c1aef8839b090dd (node H).
  • 0x7e99611cf208cb097497a59b3fb7cb4dfd115ea9 (node I).
  • 0xcbc9f463f83699d20dd5b54be5262be69a0aea9f (node J).

Perhaps node H had sellers’ remorse and wanted their NFT back, as it’s not uncommon for investors to get attached to their beloved NFTs. But numerous transactions between the wallets associated with nodes H, I, and J could indicate further red flags for the NFT asset.

Next steps

By following along with this post and accompanying notebook, you have seen how to access Ethereum blockchain data and analyze NFTs through network graphs. If you enjoyed this analysis and are interested in doing more projects like this, chat with me in the CharlieDAO Discord, where I hang out. We are a collective of software engineers, data scientists, and crypto natives exploring web3!

Disclaimer

This content is for educational purposes only and is not financial advice. I do not have any financial ties to the analyzed NFTs at the moment of releasing the notebook, and here is my crypto wallet address on Etherscan for reference. This analysis only highlights potential fraud to investigate further but does not prove fraud has taken place. Finally, if you own crypto, never share your “secret recovery phrase” or “private keys” with anyone.