Categories
Misc

Just Released: nvCOMP v2.3

The CUDA library nvCOMP now offers support for the Zstandard and Deflate compression standards, as well as modified-CRC32 checksum support and improved ANS performance.

Categories
Misc

How to Evaluate AI in Your Vendor’s Cybersecurity Solution

Considering new security software? AI and security experts Bartley Richardson and Daniel Rohrer from NVIDIA have advice: Ask a lot of questions.

Cybersecurity software is getting more sophisticated these days, thanks to AI and ML capabilities. It’s now possible to automate security measures without direct human intervention. The value in these powerful solutions is real—in stopping breaches, providing highly detailed alerts, and protecting attack surfaces. Still, it pays to be a skeptic.

This interview with NVIDIA experts Bartley Richardson and Daniel Rohrer covers key issues such as AI claims, the hidden costs of deployment, and how to best use the demo. They recommend a series of questions to ask vendors when considering investing in AI-enabled cybersecurity software. 

Richardson leads a cross-discipline engineering team for AI infrastructure and cybersecurity, working on ML and Deep Learning techniques and new frameworks for cybersecurity. Rohrer, vice president of software security, has held a variety of technical and leadership roles during his 22 years at NVIDIA. 

1. What type of AI is running in your software?

Richardson: A lot of vendors claim they have AI and ML solutions for cybersecurity, but only a small set of them are investing in thoughtful approaches using AI that address the core issue of cybersecurity as a data problem. Vendors can make claims that their solutions can do all kinds of things. But those claims are really only valid if you also have X, Y, and Z in your ecosystem. The conditions on their claims get buried most of the time. 

Rohrer: It’s like saying that you can get these great features if you have full ptrace logs that go across your network all day long. Who has that? No one has that. It is cost-prohibitive. 

Richardson: Oh sure, we can do amazing things for you. You just need to capture, store, and catalog all of the packets that go across your network. Then devote a lot of very powerful and expensive servers to analyzing those packets, only to then require 100 new cybersecurity experts to interpret those results. Everyone has those resources, right?

But seriously, AI is not magic. It’s math. AI is just another technique. You have to look critically at AI. Ask your vendor, what type of AI is running in their software and where are they running it? Because AI is used for everything, but not everything is AI. Overselling and underdelivering are causing “AI fatigue” in the market. People are getting bombarded with this all the time.

2. What deployment options can you offer?

Rohrer: Many people have hybrid cloud environments, and a solution that works in just part of your environment is often deficient, certainly for cyber. How flexible is their deployment? Can they run in the cloud? On prem? In multiple environments like Linux, Windows, or whatever is needed to protect your data and achieve your goals? 

You need the right multicloud. For example, we use Google and Alibaba Cloud and AWS and Azure. Do they have a deployable solution in all those environments, or just one of those environments? And do we need that? Sometimes we don’t need that, sometimes we do. Cybersecurity is one of those use cases where we need logs from everywhere. So understand your solution space and know how flexible your deployment model needs to be to solve your problem. And bake that in.

Cyber is often one of the harder ones to pull off, because we have lots of dynamic ephemeral data. We’re often in many, many complex heterogeneous environments that are data-heavy and IO-heavy. So if you want a worst case scenario, cyber is often it.

Richardson: There could be hidden costs in deployment, too. A vendor may say you can get all this AI, but, by the way, you have to use their cloud. If you're not already cloud-native and pushing your data to the cloud, there's an associated cost in time, engineering, and money.

Rohrer: Even if you are cloud-native, there's I/O overhead to push whatever data you have over to them. And all of a sudden, you have a million-dollar project on your hands.

3. What new infrastructure will I need to buy to run your software?

Rohrer: What infrastructure will you need to deploy your model? Do you have what you need, or can you readily purchase it without exorbitant costs? Can you afford it? If it’s an on-prem solution, does the proposal include the additional infrastructure you’re going to need? If it’s cloud-based, does it include all the cloud instances and data ingress/egress fees, or are those all extra? 

Richardson: If you’re telling me something is additive, great. If you’re telling me something’s rip and replace, that’s a different proposition.

4. How will you protect my model?

Richardson: People usually ask about data. How are you protecting my data? Is it isolated? Is it secure? Those are good questions to ask. But what happens when a service provider customizes an AI model for me? What are their policies around protecting that fine-tuned model for my environment? How are you protecting my model?

Because if they’re doing anything that’s real with ML or deep learning, the model is just as valuable as the data it’s trained on. It’s possible to back out training and fine-tuning data from a trained model, if you are sufficiently experienced with the techniques. 

That means it’s possible for people to access my sensitive information. My data didn’t leak, but this massive embedding space of my neural net leaked, and now it’s possible for someone, with a lot of work, to back out my training data. And not a lot of people are encrypting models. The constant encrypt/decrypt would totally thrash your throughput. There should be policies and procedures in place, ideally ones that can be automatically enforced, that protect these models. Ensure that your vendor is following best practices around implementing the least privileges possible when those models contain embeddings of your data.

5. What can you do with my data?

Rohrer: There are many service providers who are aggregating data and events across customers to improve a model for everyone, which is fine as long as you’re up front about it. But you know, one question to ask is whether your data is being used to improve a model for competitors or everyone else in the market? And make sure that you’re comfortable with that. In some cases that’s fine, if it’s weather data or whatever. Sometimes not so much. Because some of that data and the models you build from it have a real competitive advantage for you.

Richardson: I always come back to companies like Facebook and Twitter. The real value for them is your data. They can use everyone's data for training, and that gives them a superior ability and extra value. They're selling you a service or a product and using your data to improve it.

6. Can I bring my data to the demo? 

Rohrer: Preparing for the demo is important, because that’s really where the rubber hits the road for most folks.

Richardson: Yeah, ideally you should have a set of criteria going in. Know what your requirements are. Maybe you have some incident or misconfiguration or problem. Can AI address that? Can we see it run in a customer environment, not just in your sandbox environment? 

Rohrer: Bring your problems to the demo.

7. Does your solution require tuning, and if so, how often? 

Rohrer: One recommendation is to bring some of your own data to the table. How do I ingest my data? What is the efficacy on the problems that I actually have, not the ones that the demo team is telling me I should have? And see how it performs with your data. If it doesn't work with your data unless they tune it, then you know it's not just buy and deploy. Now it's a deploy after 3 months, 6 months, maybe 9 months of tuning. Now it's not just a product purchase. It's a purchase and integration contract and a support contract, and the costs add up before you realize it.

8. How easy is your solution for our engineers to learn and use? 

Richardson: A lot of people don't evaluate the person load. I know that's hard to do in a trial. But whether it's cybersecurity or IT or whatever, get your people evaluating that. Get your engineers involved in the process. Ask how your engineers will interact with the new software on a daily basis. We see this a lot, especially in cybersecurity, where you've added something to do function X. And in the end, it just creates more cognitive load on the humans working with it. It's generating more noise than they can handle, even if it's doing it at a pretty low false-positive rate. It's additive.

Rohrer: Yeah, it’s 99% accurate, but it doubled the number of events your people have to deal with. That didn’t help them.

Richardson: AI is not magic. It’s just math. But it’s framed in the context of magic. Just be willing to look at AI critically. It’s just another technique. It’s not a magic bullet that is going to solve all your problems. We’re not living in the future yet.

About Bartley Richardson

Bartley Richardson is Director of Cybersecurity Engineering at NVIDIA, where he leads a cross-discipline team researching GPU-accelerated ML and deep learning techniques and creating new frameworks for cybersecurity. His interests include NLP and sequence-based methods applied to cyber network datasets and threat detection. Bartley holds a PhD in Computer Science and Engineering, with research on loosely structured and unstructured logical query optimization, and a BS in Computer Engineering with a focus on software design and AI.

About Daniel Rohrer

Daniel Rohrer is VP of Software Product Security at NVIDIA. In his 23 years at NVIDIA, he has held a variety of technical and leadership roles. Daniel has applied his integrated knowledge of 'everything NVIDIA' to hone security practices through the delivery of advanced technical solutions, reliable processes, and strategic investments that build trustworthy security solutions. He has an MS in Computer Science from the University of North Carolina, Chapel Hill.

More from Daniel Rohrer:

Edge Computing: Considerations for Security Architects | NVIDIA Technical Blog

Categories
Misc

AI Research Holds the Key to Affordable and Accessible Drug Development

Published in Nature Machine Intelligence, a panel of experts shares a vision for the future of biopharma featuring collaboration between ML and drug discovery powered by GPUs.

The field of drug discovery is at a fascinating inflection point. The physics of the problem is understood and calculable, yet quantum mechanical calculations are far too expensive and time-consuming. Eroom's Law observes that drug discovery is becoming slower and more expensive over time, despite improvements in technology.

A recent article examining the transformational role of GPU computing and deep learning in drug discovery offers hope that this trend may soon reverse.

Published in Nature Machine Intelligence, the review details numerous advances, from molecular simulation and protein structure determination to generative drug design, that are accelerating the computer-aided drug discovery workflow. These advances, driven by developments in highly parallelizable GPUs and GPU-enabled algorithms, are bringing new possibilities to computational chemistry and structural biology for the development of novel medicines.

The collaboration between researchers in drug discovery and machine learning to identify GPU-accelerated deep learning tools is creating new possibilities for these challenges, which, if solved, hold the key to faster, less expensive drug development.

“We expect that the growing availability of increasingly powerful GPU architectures, together with the development of advanced DL strategies, and GPU-accelerated algorithms, will help to make drug discovery affordable and accessible to the broader scientific community worldwide,” the study authors write.

Molecular simulation and free energy calculations

Molecular simulation powers many calculations important in drug discovery and is the computational microscope that can be used to perform virtual experiments using the laws of physics. GPU-powered molecular dynamics frameworks can simulate the cell's machinery, lending insight into fundamental mechanisms, and can calculate how strongly a candidate drug will bind to its intended protein target using techniques like free energy perturbation. Of central importance to molecular simulation is the calculation of potential energy surfaces.

In the highlighted review, the authors cover how machine-learned potentials are fundamentally changing molecular simulation. Machine-learned, or neural network, potentials are models that learn energies and forces for molecular simulation with the accuracy of quantum mechanics.

The authors report that free energy simulations benefit greatly from GPUs. Neural network-based force fields such as ANI and AIMNet reduce absolute binding free-energy errors and human effort for force field development. Other deep learning frameworks like reweighted autoencoder variational Bayes (RAVE) are pushing the boundaries of molecular simulation, employing an enhanced sampling scheme for estimating protein-ligand binding free energies. Methods like Deep Docking are now employing DL models to estimate molecular docking scores and accelerate virtual screening.

Advances in protein structure determination

Over the last 10 years, there has been a 2.13x increase in the number of protein structures publicly available. An increasing rate of CryoEM structure deposition and the proliferation of proteomics has further contributed to an abundance of structure and sequence data. 

CryoEM is projected to dominate high-resolution macromolecular structural determination in the coming years with its simplicity, robustness, and ability to image large macromolecules. It is also less destructive to samples as it does not require crystallization. 

However, the data storage demands and computational requirements are sizable. The study's authors detail how deep learning-based approaches like DEFMap and DeepPicker are powering high-throughput automation of CryoEM for protein structure determination with the help of GPUs. With DEFMap, molecular dynamics simulations that capture relationships in local density data are combined with deep learning algorithms to extract dynamics associated with hidden atomic fluctuations.

The groundbreaking development of the AlphaFold-2 and RoseTTAFold models, which predict protein structure with atomic accuracy, is ushering in a new era of structure determination. A recent study by Mosalaganti et al. highlights the predictive power of these models. It also demonstrates how protein structure prediction models can be combined with cryoelectron tomography (CryoET) to determine the structure of the nuclear pore complex, a massive cellular structure composed of more than 1,000 proteins. Mosalaganti et al. go on to perform coarse-grained molecular dynamics simulations of the nuclear pore complex. This gives a glimpse into the future of the kinds of simulations made possible by the combination of AI-based protein structure prediction models, CryoEM, and CryoET.

Generative models and deep learning architectures

One of the central challenges of drug discovery is the overwhelming size of the chemical space. There are an estimated 10^60 drug-like molecules to consider, so researchers need a representation of the chemical space that is organized and searchable. By training on a large base of existing molecules, generative models learn the rules of chemistry and learn to represent chemical space in the latent space of the model.

Generative models, by implicitly learning the rules of chemistry, produce molecules that they’ve never seen before. This results in exponentially more unique, valid molecules than in the original training database. Researchers can also construct numerical optimization algorithms that operate in the latent space of the model to search for optimal molecules. These function as gradients in the latent space that computational chemists can use to steer molecule generation toward desirable properties.

The authors report that numerous state-of-the-art deep learning architectures are driving more robust generative models. Graph neural networks, generative adversarial networks, variational autoencoders, and transformers are powering generative models that are transforming molecular representation and de novo drug design. 

Convolutional neural networks, like Chemception, have been trained to predict chemical properties such as toxicity, activity, and solvation. Recurrent neural networks have the capacity to learn latent representations of chemical spaces to make predictions for several datasets and tasks.

MegaMolBART is a transformer-based generative model that achieves 98.7% unique molecule generation at AI-supercomputing scale. With support for model-parallel training, MegaMolBART can train models with more than 1 billion parameters on large chemical databases and is tunable for a wide range of tasks. 

The Million-X leap in scientific computing

Today, GPUs are accelerating every step of the computer-aided drug discovery workflow, showing effectiveness in everything from target elucidation to FDA approval. With accelerated computing, scientific calculations are being massively parallelized on GPUs. 

Supercomputers help these calculations to be scaled up and out to multiple nodes and GPUs, leveraging fast communication fabrics to tie GPUs and nodes together.

At GTC, NVIDIA CEO Jensen Huang shared how NVIDIA has accelerated computing by a million-x over the past decade. The future is bright for digital biology, where these speedups are being applied to accelerate drug discovery and deliver therapeutics to market faster.

Categories
Misc

Finding NeMo: Sensory Taps NVIDIA AI for Voice and Vision Applications

You may not know of Todd Mozer, but it’s likely you have experienced his company: It has enabled voice and vision AI for billions of consumer electronics devices worldwide. Sensory, founded in 1994 in Silicon Valley, is a pioneer of compact models used in mobile devices from the industry’s giants. Today Sensory brings interactivity to Read article >

The post Finding NeMo: Sensory Taps NVIDIA AI for Voice and Vision Applications appeared first on NVIDIA Blog.

Categories
Misc

UN Satellite Centre Works With NVIDIA to Boost Sustainable Development Goals

To foster climate action for a healthy global environment, NVIDIA is working with the United Nations Satellite Centre (UNOSAT) to apply the powers of deep learning and AI. The effort supports the UN’s 2030 Agenda for Sustainable Development, which has at its core 17 interrelated Sustainable Development Goals. These SDGs — which include “climate action” Read article >

The post UN Satellite Centre Works With NVIDIA to Boost Sustainable Development Goals appeared first on NVIDIA Blog.

Categories
Misc

How to learn Keras properly?

Hi. I’ve been messing around with AI and machine learning from time to time over the last few years. I’ve watched nearly every Keras YouTube project, but I always wondered if there isn’t a better way to learn it (without paying hundreds of bucks). So do you guys have any recommendations on how to learn it?

submitted by /u/fynnix27

Categories
Misc

Is Edward2 still a part of Tensorflow/Tensorflow Probability or is it discontinued?

https://github.com/google/edward2

submitted by /u/o-rka

Categories
Misc

Experimenting with Novel Distributed Applications Using NVIDIA Flare 2.1

In this post, I introduce new features of NVIDIA FLARE v2.1 and walk through proof-of-concept and production deployments of the NVIDIA FLARE platform.

NVIDIA FLARE (NVIDIA Federated Learning Application Runtime Environment) is an open-source Python SDK for collaborative computation. FLARE is designed with a componentized architecture that allows researchers and data scientists to adapt machine learning, deep learning, or general compute workflows to a federated paradigm to enable secure, privacy-preserving multi-party collaboration.

This architecture provides components for securely provisioning a federation, establishing secure communication, and defining and orchestrating distributed computational workflows. FLARE provides these components within an extensible API that allows customization to adapt existing workflows or easily experiment with novel distributed applications.

Diagram shows FL Simulator in POC mode, learning algorithms such as FedAvg or SCAFFOLD, federation workflows such as Scatter-Gather or Cyclic, the FLARE programming API, privacy preservation and secure management, as well as tools for provisioning, deployment, orchestration, and so on.
Figure 1. The high-level NVIDIA FLARE architecture

Figure 1 shows the high-level FLARE architecture with the foundational API components, including tools for privacy preservation and secure management of the platform. On top of this foundation are the building blocks for federated learning applications, with a set of federation workflows and learning algorithms.

Alongside the core FLARE stack are tools that allow experimentation and proof-of-concept (POC) development with the FL Simulator, coupled with a set of tools used to deploy and manage production workflows.

In this post, I focus on getting started with a simple POC and outline the process of moving from POC to a secure, production deployment. I also highlight some of the considerations when moving from a local POC to a distributed deployment.

Getting started with NVIDIA FLARE

To help you get started with NVIDIA FLARE, I walk through the basics of the platform and highlight some of the features in version 2.1 that help you bring a proof-of-concept into a production federated learning workflow.

Installation

The simplest way to get started with NVIDIA FLARE is in a Python virtual environment as described in Quickstart.

With just a few simple commands, you can prepare a FLARE workspace that supports a local deployment of independent server and clients. This local deployment can be used to run FLARE applications just as they would run on a secure, distributed deployment, without the overhead of configuration and deployment.

$ sudo apt update
$ sudo apt install python3-venv
$ python3 -m venv nvflare-env
$ source nvflare-env/bin/activate
(nvflare-env) $ python3 -m pip install -U pip setuptools
(nvflare-env) $ python3 -m pip install nvflare

Preparing the POC workspace

With the nvflare pip package installed, you now have access to the poc command. The only argument required when executing this command is the desired number of clients.

(nvflare-env) $ poc -h
usage: poc [-h] [-n NUM_CLIENTS]

optional arguments:
  -h, --help            show this help message and exit
  -n NUM_CLIENTS, --num_clients NUM_CLIENTS
                        number of client folders to create

After executing this command, for example poc -n 2 for two clients, you will have a POC workspace with folders for each of the participants: admin client, server, and site clients.

(nvflare-env) $ tree -d poc
poc
├── admin
│   └── startup
├── server
│   └── startup
├── site-1
│   └── startup
└── site-2
    └── startup

Each of these folders contains the configuration and scripts required to launch and connect the federation. By default, the server is configured to run on localhost, with site clients and the admin client connecting on ports 8002 and 8003, respectively. You can launch the server and clients in the background by running, for example:

(nvflare-env) $ for i in poc/{server,site-1,site-2}; do 
    ./$i/startup/start.sh; 
done

The server and client processes emit status messages to standard output and also log to their own poc/{server,site-?}/log.txt file. When launched as shown earlier, the standard output of all processes is interleaved; you can launch each in a separate terminal to avoid this, or follow the individual log files as shown below.
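
For example, to follow the server's log in another terminal:

(nvflare-env) $ tail -f poc/server/log.txt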

Deploying a FLARE application

After connecting the server and site clients, the entire federation can be managed with the admin client. Before you dive into the admin client, set up one of the examples from the NVIDIA FLARE GitHub repository.

(nvflare-env) $ git clone https://github.com/NVIDIA/NVFlare.git
(nvflare-env) $ mkdir -p poc/admin/transfer
(nvflare-env) $ cp -r NVFlare/examples/hello-pt-tb poc/admin/transfer/

This copies the Hello PyTorch with Tensorboard Streaming example into the admin client’s transfer directory, staging it for deployment to the server and site clients. For more information, see Quickstart (PyTorch with TensorBoard).

Before you deploy, you also need to install a few prerequisites.

(nvflare-env) $ python3 -m pip install torch torchvision tensorboard

Now that you’ve staged the application, you can launch the admin client.

(nvflare-env) $ ./poc/admin/startup/fl_admin.sh
Waiting for token from successful login...
Got primary SP localhost:8002:8003 from overseer. Host: localhost Admin_port: 8003 SSID: ebc6125d-0a56-4688-9b08-355fe9e4d61a
login_result: OK token: d50b9006-ec21-11ec-bc73-ad74be5b77a4
Type ? to list commands; type "? cmdName" to show usage of a command.
> 

After connecting, the admin client can be used to check the status of the server and clients, manage applications, and submit jobs. For more information about the capabilities of the admin client, see Operating NVFLARE – Admin Client, Commands, FLAdminAPI.
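
For example, a quick way to confirm that the federation is connected is to query status from the admin prompt (check_status here reflects the 2.1 admin console; type ? to confirm the commands available in your version):

> check_status server
> check_status client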

For this example, submit the hello-pt-tb application for execution.

> submit_job hello-pt-tb
Submitted job: 303ffa9c-54ae-4ed6-bfe3-2712bc5eba40

At this point, you see confirmation of job submission with the job ID, along with status updates on the server and client terminals showing the progress of the server controller and client executors as training is executed.

You can use the list_jobs command to check the status of the job. When the job has finished, use the download_job command to download the results of the job from the server.
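
For example, to confirm the job has completed before downloading:

> list_jobs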

> download_job 303ffa9c-54ae-4ed6-bfe3-2712bc5eba40
Download to dir poc/admin/startup/../transfer

You can then start TensorBoard using the downloaded job directory as the TensorBoard log directory.

(nvflare-env) $ tensorboard --logdir=poc/admin/transfer

This starts a local TensorBoard server using the logs that were streamed from the clients to the server and saved in the server’s run directory. You can open a browser to http://localhost:6006 to visualize the run.

Screenshot shows train_loss graph and validation_accuracy graph.
Figure 2. Example TensorBoard output from the hello-pt-tb application

The example applications provided with NVIDIA FLARE are all designed to use this POC mode and can be used as a starting point for the development of your own custom applications.

Some examples, like the CIFAR10 example, define end-to-end workflows that highlight the different features and algorithms available in NVIDIA FLARE and use the POC mode, as well as secure provisioning discussed in the next section.

Moving from proof-of-concept to production

NVIDIA FLARE v2.1 introduces some new concepts and features aimed at enabling robust production federated learning, two of the most visible being high availability and support for multi-job execution.

  • High availability (HA) supports multiple FL servers and automatically activates a backup server when the currently active server becomes unavailable. This is managed by a new entity in the federation, the overseer, that’s responsible for monitoring the state of all participants and orchestrating the cutover to a backup server when needed.
  • Multi-job execution allows concurrent runs, provided that the resources required by the jobs can be satisfied.

Secure deployment with high availability

The previous section covered FLARE’s POC mode in which security features are disabled to simplify local testing and experimentation.

To demonstrate high availability for a production deployment, start again with a single-system deployment like that used in POC mode and introduce the concept of provisioning with the OpenProvision API.

Similar to the poc command, NVIDIA FLARE provides the provision command to drive the OpenProvision API. The provision command reads a project.yml file that configures the participants and components used in a secure deployment. Running the command without arguments creates a copy of the sample project.yml as a starting point.

For this post, continue to use the same nvflare-env Python virtual environment as configured in the previous section.

(nvflare-env) $ provision
No project.yml found in current folder.  Is it OK to generate one for you? (y/N) y
project.yml was created.  Please edit it to fit your FL configuration.

For a secure deployment, you must first configure the participants in the federation. You can modify the participants section of the sample project.yml file to create a simple local deployment as follows; the participant names and the enable_byoc setting are changed from the defaults in the sample file.

participants:
  # change overseer.example.com to the FQDN of the overseer
  - name: overseer
    type: overseer
    org: nvidia
    protocol: https
    api_root: /api/v1
    port: 8443
  # change example.com to the FQDN of the server
  - name: server1
    type: server
    org: nvidia
    fed_learn_port: 8002
    admin_port: 8003
    # enable_byoc loads python code in the app. Default is false.
    enable_byoc: true
    components:
      



There are a few important points in defining the participants:

  • The name for each participant must be unique. In the case of the overseer and servers, these names must be resolvable by all servers and clients, either as fully qualified domain names or as hostnames using /etc/hosts (more on that to come).
  • For a local deployment, servers must use unique ports for FL and admin. This is not required for a distributed deployment where the servers run on separate systems.
  • Participants should set enable_byoc: true to allow the deployment of apps with code in a /custom folder, as in the example applications.

The remaining sections of the project.yml file configure the builder modules that define the FLARE workspace. These can be left in the default configuration for now but require some consideration when moving from a secure local deployment to a true distributed deployment.

With the modified project.yml, you can now provision the secure startup kits for the participants.

(nvflare-env) $ provision -p project.yml
Project yaml file: project.yml.
Generated results can be found under workspace/example_project/prod_00.  Builder's wip folder removed.
$ tree -d workspace/
workspace/
└── example_project
    ├── prod_00
    │   ├── admin@nvidia.com
    │   │   └── startup
    │   ├── overseer
    │   │   └── startup
    │   ├── server1
    │   │   └── startup
    │   ├── server2
    │   │   └── startup
    │   ├── site-1
    │   │   └── startup
    │   └── site-2
    │       └── startup
    ├── resources
    └── state

As in POC mode, provisioning generates a workspace with a folder containing the startup kit for each participant, in addition to a zip file for each. The zip files can be used to easily distribute the startup kits in a distributed deployment. Each kit contains the configuration and startup scripts as in POC mode, with the addition of a set of shared certificates used to establish the identity and secure communication among participants.
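
As a minimal sketch, distribution can be as simple as copying each zip to its host and unpacking it there; the zip name, user, host, and destination path below are placeholders, so substitute the file names that provision actually produced:

$ scp workspace/example_project/prod_00/site-1.zip user@site-1-host:/opt/nvflare/
$ ssh user@site-1-host "cd /opt/nvflare && unzip site-1.zip"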

In secure provisioning, these startup kits are signed to ensure that they have not been modified. Looking at the startup kit for server1, you can see these additional components.

(nvflare-env) $ tree workspace/example_project/prod_00/server1
workspace/example_project/prod_00/server1
└── startup
    ├── authorization.json
    ├── fed_server.json
    ├── log.config
    ├── readme.txt
    ├── rootCA.pem
    ├── server.crt
    ├── server.key
    ├── server.pfx
    ├── signature.json
    ├── start.sh
    ├── stop_fl.sh
    └── sub_start.sh

To connect the participants, all servers and clients must be able to resolve the servers and overseer at the name defined in project.yml. For a distributed deployment, this could be a fully qualified domain name.

You can also use /etc/hosts on each of the server and client systems to map the server and overseer name to its IP address. For this local deployment, use /etc/hosts to overload the loop-back interface. For example, the following code example adds entries for the overseer and both servers:

(nvflare-env) $ cat /etc/hosts
127.0.0.1 localhost
127.0.0.1 overseer
127.0.0.1 server1
127.0.0.1 server2

Because the overseer and servers all use unique ports, you can safely run all on the local 127.0.0.1 interface.

As in the previous section, you can loop over the set of participants to execute the start.sh scripts included in the startup kits to connect the overseer, servers, and site clients.

(nvflare-env) $ export WORKSPACE=workspace/example_project/prod_00/
(nvflare-env) $ for i in $WORKSPACE/{overseer,server1,server2,site-1,site-2}; do 
    ./$i/startup/start.sh & 
done

From here, the process of deploying an app using the admin client is the same as in POC mode with one important change. In secure provisioning, the admin client prompts for a username. In this example, the username is admin@nvidia.com, as configured in project.yml.
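
For example, assuming the provisioned admin kit includes the same fl_admin.sh launcher used in POC mode, you can start it directly from the workspace generated above:

(nvflare-env) $ ./workspace/example_project/prod_00/admin@nvidia.com/startup/fl_admin.sh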

Considerations for secure, distributed deployment

In the previous sections, I discussed POC mode and secure deployment on a single system.  This single-system deployment factors out a lot of the complexity of a true secure, distributed deployment. On a single system, you have the benefit of a shared environment, shared filesystem, and local networking. A production FLARE workflow on distributed systems must address these issues.

Consistent environment

Every participant in the federation requires the NVIDIA FLARE runtime, along with any dependencies implemented in the server and client workflow. This is easily accommodated in a local deployment with a Python virtual environment.

When running distributed, the environment is not as easy to constrain. One way to address this is running in a container. For the examples earlier, you could create a simple Dockerfile to capture dependencies.

ARG PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:22.04-py3
FROM ${PYTORCH_IMAGE}

RUN python3 -m pip install -U pip
RUN python3 -m pip install -U setuptools
RUN python3 -m pip install torch torchvision tensorboard nvflare

WORKDIR /workspace/
RUN git clone https://github.com/NVIDIA/NVFlare.git

The WorkspaceBuilder referenced in the sample project.yml file includes a variable to define a Docker image:

# when docker_image is set to a Docker image name,
# docker.sh is generated on server/client/admin
docker_image: nvflare-pyt:latest

When docker_image is defined in the WorkspaceBuilder config, provisioning generates a docker.sh script in each startup kit.

Assuming this example Dockerfile has been built on each of the server, client, and admin systems with tag nvflare-pyt:latest, the container can be launched using the docker.sh script. This launches the container with startup kits mapped in and ready to run. This of course requires Docker and the appropriate permissions and network configuration on the server and client host systems.
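
A minimal sketch of launching one site this way, assuming docker.sh is generated next to start.sh in the site's startup folder:

$ cd workspace/example_project/prod_00/site-1/startup
$ ./docker.sh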

An alternative would be to provide a requirements.txt file, as shown in many of the online examples, that can be installed with pip into your nvflare-env virtual environment before running the distributed startup kit.
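
For the examples in this post, a minimal requirements.txt might look like the following (the package list is an assumption based on the dependencies installed earlier), installed with python3 -m pip install -r requirements.txt inside the nvflare-env environment:

nvflare
torch
torchvision
tensorboard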

Distributed systems

In the POC and secure deployment environments discussed so far, we’ve assumed a single system where you could leverage a local, shared filesystem, and where communication was confined to local network interfaces.

When running on distributed systems, you must address these simplifications to establish the federated system. Figure 3 shows the components required for a distributed deployment with high availability, including the relationship between the admin client, overseer, servers, and client systems.

Diagram shows flow from NVIDIA FLARE to service providers and clients, with input/output from admin and overseer, to output as GRPC, HTTP, and TCP.
Figure 3. The NVIDIA FLARE deployment for high availability (HA)

In this model, you must consider the following:

  • Network: Client systems must be able to resolve the overseer and service providers at their fully qualified domain names or by mapping IP addresses to hostnames.
  • Storage: Server systems must be able to access shared storage to facilitate cutover from the active (hot) service provider, as defined in the project.yml file snapshot_persistor.
  • Distribution of configurations or startup kits to each of the participants
  • Application configuration and the location of client datasets

Some of these considerations can be addressed by running in a containerized environment as discussed in the previous section, where startup kits and datasets can be mounted on a consistent path on each system.

Other aspects of a distributed deployment depend on the local environment of the host systems and networks and must be addressed individually.

Summary

NVIDIA FLARE v2.1 provides a robust set of tools that enable a researcher or developer to bring a federated learning concept to a real-world production workflow.

The deployment scenarios discussed here are based on our own experience building the FLARE platform, and on the experience of our early adopters bringing federated learning workflows to production. Hopefully, these can serve as a starting point for the development of your own federated applications.

We are actively developing the FLARE platform to meet the needs of researchers, data scientists, and platform developers, and welcome any suggestions and feedback in the NVIDIA FLARE GitHub community!

Categories
Misc

Bridging the Divide Between CLI and Automation IT Teams with NVIDIA NVUE

Learn more about the NVIDIA NVUE object-oriented, schema-driven model of a complete Cumulus Linux system. The API enables you to configure any system element.

When network engineers engage with networking gear for the first time, they do it through a command-line interface (CLI). While the CLI is still widely used, network scale has reached new highs, making the CLI inefficient for managing and configuring an entire data center. Naturally, networking is no exception as the software industry has progressed toward automation.

Network vendors have all provided different approaches to automate the network as they have branched out from the traditional CLI syntax. Unfortunately, this new branch in the industry has divided the network engineers and IT organizations into two groups: CLI-savvy teams and automation-savvy teams.

This segmentation creates two sets of problems. First, the CLI-savvy teams have difficulty closing the automation gap, limiting their growth pace. Second, finding network automation talent is a challenge, as most developers do not possess networking skills, and most network engineers do not possess automation skills.

To merge the two groups and solve these two problems, NVIDIA introduced a paradigm shift in the CLI approach called NVIDIA User Experience (NVUE).

NVUE is an object-oriented, schema-driven model of a complete Cumulus Linux system (hardware and software). NVUE provides a robust API that allows multiple interfaces to show and configure any element within the system. The NVUE CLI and REST API use the same API to interface with Cumulus Linux.

Block diagram of NVUE architecture: the NVUE object model is the core, one level above is the NVUE API, and the CLI and REST interact with the NVUE API in the same way.
Figure 1. NVUE architecture

Having all the interfaces use the same object model guarantees consistent results regardless of how an engineer interfaces with the system. For example, the CLI and REST API use the same methods to configure a BGP peer.

REST and CLI are expected for any network device today. An object model can be directly imported into a programming language like Python or Java. This enables you to build configurations for one device or an entire fabric of devices. The following code example shows what an NVUE Python interface might look like in the future:

from nvue import Switch

spine01 = Switch()

# Illustrative only: bring up ports swp1-swp32 (hypothetical attribute names)
x = 1
while x <= 32:
    spine01.interface[f"swp{x}"].link.state = "up"
    x += 1



The benefits of this revolutionary approach are two-fold:

  • For the CLI-savvy, going from CLI to building full automation is an evolution, not an entirely new process.
  • Because REST is more common among developers than other networking-oriented models such as YANG, a developer with no networking skills can collaborate with a CLI-savvy network engineer and take the team a considerable step toward automating the network.

The more an organization automates its ongoing operations, the more it can focus on innovation rather than operations and serve its ever-growing business needs.

Try it out

One of the most valuable aspects of Cumulus Linux is the ability to try all our features and functions virtually. You can use NVIDIA Air to start using NVUE today and see what you think of the future of network CLIs and programmability.