How to interpret the output of get_weights for Keras LSTM?

Can someone please help with this – It’ll be much appreciated.

submitted by /u/Academic-Rent7800


Speed Up Machine Learning Models with Accelerated WEKA

Accelerated WEKA integrates the WEKA workbench with Python and Java libraries that support GPUs to shorten the training and prediction time of machine learning models.

In recent years, there has been a surge in building and adopting machine learning (ML) tools. The use of GPUs to accelerate increasingly compute-intensive models has been a prominent trend.

To increase user access, the Accelerated WEKA project provides an accessible entry point for using GPUs in well-known WEKA algorithms by integrating open-source RAPIDS libraries.

In this post, you will be introduced to Accelerated WEKA and learn how to leverage GPU-accelerated algorithms with a graphical user interface (GUI) using WEKA software. This Java open-source alternative is suitable for beginners looking for a variety of ML algorithms from different environments or packages.

What is Accelerated WEKA?

Accelerated WEKA unifies WEKA, a well-known open-source Java package, with new technologies that leverage the GPU to shorten the execution time of ML algorithms. It offers two benefits aimed at users without expertise in system configuration and coding: an easy installation and a GUI that guides the configuration and execution of ML tasks.

Accelerated WEKA is a collection of packages available for WEKA, and it can be extended to support new tools and algorithms.

What is RAPIDS?

RAPIDS is a collection of open-source Python libraries for users to develop and deploy data science workloads on NVIDIA GPUs. Popular libraries include cuDF for GPU-accelerated DataFrame processing and cuML for GPU-accelerated machine learning algorithms. RAPIDS APIs conform as much as possible to the CPU counterparts, such as pandas and scikit-learn.

Accelerated WEKA architecture

The building blocks of Accelerated WEKA are packages like WekaDeeplearning4j and wekaRAPIDS (inspired by wekaPython). WekaDeeplearning4j (WDL4J) already supports GPU processing but has very specific needs in terms of libraries and environment configuration. WDL4J provides WEKA wrappers for the Deeplearning4j library.

For Python users, wekaPython initially provided Python integration by creating a server and communicating with it through sockets. With this, the user can execute scikit-learn ML algorithms (or even XGBoost) inside the WEKA workbench. Furthermore, wekaRAPIDS provides integration with the RAPIDS cuML library by using the same technique as wekaPython.
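
The socket-based approach can be pictured with a small sketch. This is not wekaPython's actual protocol: the length-prefixed framing, the JSON reply, and all function names here are assumptions made for illustration. It only shows the general idea of one side sending a dataset as CSV text over a socket and the other side parsing it and replying.

```python
import csv
import io
import json
import socket


def recv_exact(conn, n):
    """Read exactly n bytes from the socket or raise if it closes early."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        buf += chunk
    return buf


def send_message(conn, payload):
    """Send one message as a 4-byte big-endian length followed by the bytes."""
    conn.sendall(len(payload).to_bytes(4, "big") + payload)


def recv_message(conn):
    """Receive one length-prefixed message."""
    size = int.from_bytes(recv_exact(conn, 4), "big")
    return recv_exact(conn, size)


def serve_once(conn):
    """Server side (hypothetical): parse a CSV payload, reply with a JSON summary."""
    text = recv_message(conn).decode("utf-8")
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    summary = {"attributes": header, "instances": len(data)}
    send_message(conn, json.dumps(summary).encode("utf-8"))


def round_trip(csv_text):
    """Client side: send CSV text over a socket pair and read the reply."""
    client, server = socket.socketpair()
    try:
        send_message(client, csv_text.encode("utf-8"))
        # The server runs in-process here to keep the sketch self-contained;
        # in a real setup it would be a separate Python process.
        serve_once(server)
        return json.loads(recv_message(client).decode("utf-8"))
    finally:
        client.close()
        server.close()


if __name__ == "__main__":
    print(round_trip("a,b,class\n1,2,x\n3,4,y\n"))
```

Serializing every dataset to text and copying it through a socket is the kind of overhead that the Apache Arrow and GPU memory sharing alternatives mentioned in this post aim to reduce.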

Together, both packages provide enhanced functionality and performance inside the user-friendly WEKA workbench. Accelerated WEKA goes a step further in the direction of performance by improving the communication between the JVM and Python interpreter. It does so by using alternatives like Apache Arrow and GPU memory sharing for efficient data transfer between the two languages.

Accelerated WEKA also provides integration with the RAPIDS cuML library, which implements machine learning algorithms that are accelerated on NVIDIA GPUs. Some cuML algorithms can even support multi-GPU solutions.

Supported algorithms

The algorithms currently supported by Accelerated WEKA are:

  • LinearRegression
  • LogisticRegression
  • Ridge
  • Lasso
  • ElasticNet
  • MBSGDClassifier
  • MBSGDRegressor
  • MultinomialNB
  • BernoulliNB
  • GaussianNB
  • RandomForestClassifier
  • RandomForestRegressor
  • SVC
  • SVR
  • LinearSVC
  • KNeighborsRegressor
  • KNeighborsClassifier

The algorithms supported by Accelerated WEKA in multi-GPU mode are:

  • KNeighborsRegressor
  • KNeighborsClassifier
  • LinearRegression
  • Ridge
  • Lasso
  • ElasticNet
  • MultinomialNB
  • CD

Using Accelerated WEKA GUI 

During the Accelerated WEKA design stage, one main goal was for it to be easy to use. The following steps outline how to set it up on a system along with a brief example. 

Refer to the documentation for more information and a comprehensive getting-started guide. The only prerequisite for Accelerated WEKA is having Conda installed on your system.

  • Accelerated WEKA is installed through Conda, a package and environment management system, so a single command can install all dependencies for the project. For example, on a Linux machine, issue the following command in a terminal to install Accelerated WEKA and all dependencies:

conda create -n accelweka -c rapidsai -c nvidia -c conda-forge -c waikato weka

  • After Conda has created the environment, activate it with the following command:

conda activate accelweka

  • This terminal instance just loaded all dependencies for Accelerated WEKA. Launch WEKA GUI Chooser with the command:


  • Figure 1 shows the WEKA GUI Chooser window. From there, click the Explorer button to access the functionalities of Accelerated WEKA.
Figure 1. WEKA GUI Chooser window. This is the first window that appears when you start WEKA
  • In the WEKA Explorer window (Figure 2), click the Open file button to select a dataset file. WEKA works with ARFF files but can read from CSVs. Converting from CSVs can be pretty straightforward or require some configuration by the user, depending on the types of the attributes.
Figure 2. In the WEKA Explorer window users can import datasets, check statistics about the attributes, and apply filters to the dataset as preprocessing
  • The WEKA Explorer window with a dataset loaded is shown in Figure 3. Assuming one does not want to preprocess the data, clicking the Classify tab will present the classification options to the user. 
Figure 3. WEKA Explorer window with a dataset loaded. After loading the dataset (either from an ARFF file or a CSV file) the attribute names appear on the left. Information regarding the selected attribute appears in the upper right. A chart containing the distribution of the class according to the selected attribute is viewable in the lower right
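
For readers who prefer to prepare an ARFF file ahead of time rather than rely on the CSV import, the following is a minimal sketch of a CSV-to-ARFF conversion. It assumes a header row, no missing values, and attribute names without spaces or special characters; the function names are hypothetical, and the type inference (numeric if every value parses as a number, nominal otherwise) is a simplification of what WEKA's own converter does.

```python
import csv
import io


def is_number(value):
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation="dataset"):
    """Convert CSV text with a header row into ARFF text.

    Columns whose values all parse as numbers become numeric attributes;
    all other columns become nominal attributes listing the observed labels.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            lines.append("@attribute " + name + " numeric")
        else:
            labels = ",".join(sorted(set(values)))
            lines.append("@attribute " + name + " {" + labels + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(csv_to_arff("a,b,class\n1,2.5,x\n3,4,y\n", relation="demo"))
```

The attribute-type decision is where the "some configuration by the user" usually comes in: a column of integer codes, for example, may need to be declared nominal by hand.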

The Classify tab is presented in Figure 4. Clicking the Choose button shows the implemented classifiers. Some might be disabled because of the dataset characteristics. To use Accelerated WEKA, the user must select rapids.CuMLClassifier. After that, clicking the bold CuMLClassifier takes the user to the options window for the classifier.

Figure 4. In the WEKA Classify tab, the user can configure the classification algorithm and the test options that are going to be used in the experiment using the previously selected dataset
  • Figure 5 shows the Options window for CuMLClassifier. With the RAPIDS learner field, the user can choose the desired classifier among the ones supported by the package. The Learner parameters field is for modifying the cuML parameters, details of which can be found in the cuML documentation.

The other options let the user fine-tune the attribute conversion, configure which Python environment is to be used, and set the number of decimal places the algorithm should operate with. For the sake of this tutorial, select RandomForestClassifier and keep everything with the default configuration. Clicking OK will close the window and return to the previous tab.

Figure 5. With the WEKA Classifier configuration window, the user can configure the parameters of the selected classifier. In this case, it is showing the newly integrated CuMLClassifier options with the RandomForestClassifier learner selected
  • After configuring the Classifier according to the previous step, the parameters will be shown in the text field beside the Choose button. After clicking Start, WEKA will start executing the chosen classifier with the dataset. 

Figure 6 shows the classifier in action. The Classifier output is showing debug and general information regarding the experiment, such as parameters, classifiers, dataset, and test options. The status shows the current state of the execution and the Weka bird on the bottom animates and flips from one side to the other while the experiment is running.

Figure 6. WEKA Classify tab with the chosen classification algorithm in progress
  • After the algorithm finishes the task, it will output the summary of the execution with information regarding predictive performance and the time taken. In Figure 7, the output shows the results for 10-fold cross-validation using the RandomForestClassifier from cuML through CuMLClassifier.
Figure 7. WEKA Classify tab after the experiment has been completed
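
To make the 10-fold cross-validation in the last step concrete, here is a minimal sketch of how a dataset can be split into k folds, each used once for testing while the rest is used for training. The function name is hypothetical, and the sketch uses plain contiguous folds; WEKA's own procedure also randomizes the data and, for nominal classes, stratifies the folds, which is omitted here.

```python
def k_fold_indices(n_instances, k=10):
    """Return a list of (train_indices, test_indices) pairs for k-fold CV.

    Fold sizes differ by at most one instance; the first n_instances % k
    folds receive the extra instance.
    """
    if k < 2 or k > n_instances:
        raise ValueError("k must be between 2 and the number of instances")
    base, extra = divmod(n_instances, k)
    folds = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_instances))
        folds.append((train, test))
        start += size
    return folds


if __name__ == "__main__":
    for number, (train, test) in enumerate(k_fold_indices(25, k=10), start=1):
        print("fold", number, "train:", len(train), "test:", len(test))
```

Because the model is trained k times, the reported time for a cross-validated run covers roughly k training passes, which is worth keeping in mind when reading the CV and no-CV benchmark tables.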

Benchmarking Accelerated WEKA

We evaluated the performance of Accelerated WEKA by comparing the execution time of the algorithms on the CPU with the execution time using Accelerated WEKA. The hardware used in the experiments was an i7-6700K, a GTX 1080Ti, and a DGX Station with four A100 GPUs. Unless stated otherwise, the benchmarks use a single GPU.

We used datasets with different characteristics for the benchmarks. Some of them were synthetic for better control of the attributes and instances, like the RDG and RBF generators. The RDG generator builds instances based on decision lists. The default configuration has 10 attributes, 2 classes, a minimum rule size of 1, and a maximum rule size of 10. We changed the minimum and maximum limits to 5 and 20, respectively. With this generator, we created datasets with 1, 2, 5, and 10 million instances, as well as 5 million instances with 20 attributes.

The RBF generator creates a random set of centers for each class and then generates instances by taking random offsets from the centers for the attribute values. The number of attributes is indicated with the suffix a__ (for example, a5k means 5 thousand attributes), and the number of instances is indicated by the suffix n__ (for example, n10k means 10 thousand instances).
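
The naming convention can be decoded mechanically. The following sketch parses the RBF dataset names used in the benchmark tables; the function name is hypothetical, and treating k as thousand and m as million follows the examples in this post (RBFa50n1m is described as having 1 million instances). Names without an n suffix, such as RBFa5k, return None for the instance count.

```python
import re

# Multipliers for the optional unit letter after each number.
_SCALE = {"": 1, "k": 1_000, "m": 1_000_000}
_PATTERN = re.compile(r"RBFa(\d+)([km]?)(?:n(\d+)([km]?))?")


def parse_rbf_name(name):
    """Return (attributes, instances) encoded in an RBF dataset name.

    instances is None when the name carries no n__ suffix.
    """
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError("not an RBF dataset name: " + name)
    a_num, a_unit, n_num, n_unit = match.groups()
    attributes = int(a_num) * _SCALE[a_unit]
    instances = int(n_num) * _SCALE[n_unit] if n_num is not None else None
    return attributes, instances


if __name__ == "__main__":
    for name in ["RBFa5k", "RBFa5kn10k", "RBFa500n100k", "RBFa50n1m"]:
        print(name, parse_rbf_name(name))
```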

Lastly, we used the HIGGS dataset, which contains kinematic properties measured by particle detectors in an accelerator. The first 5 million instances of the HIGGS dataset were used to create HIGGS_5m.

The results for the wekaRAPIDS integration are shown in the following tables, which directly compare the baseline CPU execution with the Accelerated WEKA execution. The results for WDL4J are shown in Table 5.

XGBoost (CV): i7-6700K baseline vs. AWEKA SGM on GTX 1080Ti

Dataset       Baseline (s)   AWEKA SGM (s)   Speedup
RDG1_1m       266.59         65.77           4.05
RDG1_2m       554.34         122.75          4.52
RDG1_5m       1423.34        294.40          4.83
RDG1_10m      2795.28        596.74          4.68
RDG1_5m_20a   2664.39        403.39          6.60
RBFa5k        17.16          15.75           1.09
RBFa5kn1k     110.14         25.43           4.33
RBFa5kn5k     397.83         49.38           8.06

Table 1. Execution time of experiments with XGBoost using cross-validation comparing the baseline CPU execution time with the Accelerated WEKA execution time while sharing GPU memory on a GTX 1080Ti GPU.

XGBoost (no-CV): i7-6700K baseline vs. AWEKA CSV on GTX 1080Ti and A100

Dataset       Baseline (s)   GTX 1080Ti (s)   Speedup   A100 (s)   Speedup
RDG1_1m       46.40          21.19            2.19      22.69      2.04
RDG1_2m       92.87          34.76            2.67      35.42      2.62
RDG1_5m       229.38         73.49            3.12      65.16      3.52
RDG1_10m      461.83         143.08           3.23      106.00     4.36
RDG1_5m_20a   268.98         73.31            3.67      -          -
RBFa5k        5.76           7.73             0.75      8.68       0.66
RBFa5kn1k     23.59          13.38            1.76      19.84      1.19
RBFa5kn5k     78.68          34.61            2.27      29.84      2.64
HIGGS_5m      214.77         169.48           1.27      76.82      2.80

Table 2. Execution time of experiments with XGBoost without using cross-validation. A comparison of the baseline CPU execution time with the Accelerated WEKA execution time while sending a CSV file through sockets on a GTX 1080Ti GPU. Loading times of the dataset were taken out.

RandomForest (CV): i7-6700K baseline vs. AWEKA SGM on GTX 1080Ti

Dataset       Baseline (s)   AWEKA SGM (s)   Speedup
RDG1_1m       494.27         97.55           5.07
RDG1_2m       1139.86        200.93          5.67
RDG1_5m       3216.40        511.08          6.29
RDG1_10m      6990.00        1049.13         6.66
RDG1_5m_20a   5375.00        825.89          6.51
RBFa5k        13.09          29.61           0.44
RBFa5kn1k     42.33          49.57           0.85
RBFa5kn5k     189.46         137.16          1.38

Table 3. Execution time of experiments with Random Forest using cross-validation comparing the baseline CPU execution time with the Accelerated WEKA execution time while sharing GPU memory on a GTX 1080Ti GPU.

KNN (no-CV): AMD EPYC 7742 (4 cores) baseline vs. wekaRAPIDS on one and four NVIDIA A100 GPUs

Dataset        Baseline (s)   1x A100 (s)   Speedup   4x A100 (s)   Speedup
covertype      3755.80        67.05         56.01     42.42         88.54
RBFa5kn5k      6.58           59.94         0.11      56.21         0.12
RBFa5kn10k     11.54          62.98         0.18      59.82         0.19
RBFa500n10k    2.40           44.43         0.05      39.80         0.06
RBFa500n100k   182.97         65.36         2.80      45.97         3.98
RBFa50n10k     2.31           42.24         0.05      37.33         0.06
RBFa50n100k    177.34         43.37         4.09      37.94         4.67
RBFa50n1m      21021.74       77.33         271.84    46.00         456.99

Table 4. Execution time of experiments with KNN without using cross-validation comparing the baseline CPU execution time with the Accelerated WEKA execution on an NVIDIA A100 GPU.

Neural network (3,230,621 parameters): i7-6700K baseline vs. WDL4J on GTX 1080Ti

Epochs   Baseline (s)   WDL4J (s)   Speedup
50       1235.50        72.56       17.03
100      2775.15        139.86      19.84
250      7224.00        343.14      21.64
500      15375.00       673.48      22.83

Table 5. Execution time of experiments with a 3,230,621 parameter neural network comparing the baseline CPU execution time with the Accelerated WEKA execution on a GTX 1080Ti GPU. The experiments used a small subset of the MNIST dataset while increasing the number of epochs.

This benchmarking shows that Accelerated WEKA provides the most benefit to compute-intensive tasks with larger datasets. Small datasets such as RBFa5k and RBFa5kn1k (100 and 1,000 instances, respectively) show poor speedup because they are too small to offset the overhead of moving data to GPU memory.

The same behavior is noticeable in the A100 experiments (Table 4), where the architecture is more complex. The benefits start to appear with datasets of 100,000 instances or more. For instance, the RBF datasets with 100,000 instances show speedups of roughly 3x and 4x, which is still modest but an improvement. Larger datasets such as covertype (~700,000 instances) and RBFa50n1m (1 million instances) show speedups of 56x and 271x, respectively. Note that for deep learning tasks, the speedup can exceed 20x even with the GTX 1080Ti.
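
The speedup column in each table is simply the baseline time divided by the accelerated time, so values below 1 mean the GPU run was slower than the CPU baseline. As a small sketch, the following recomputes a few single-A100 entries from Table 4; the function and variable names are illustrative only.

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Speedup factor: how many times faster the accelerated run is."""
    return baseline_seconds / accelerated_seconds


# (baseline seconds, single A100 seconds), copied from Table 4.
TABLE_4_SINGLE_A100 = {
    "covertype": (3755.80, 67.05),
    "RBFa5kn5k": (6.58, 59.94),
    "RBFa50n100k": (177.34, 43.37),
    "RBFa50n1m": (21021.74, 77.33),
}

if __name__ == "__main__":
    for dataset, (baseline, accelerated) in TABLE_4_SINGLE_A100.items():
        factor = speedup(baseline, accelerated)
        verdict = "faster on GPU" if factor > 1 else "slower on GPU"
        print(dataset, round(factor, 2), verdict)
```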

Key takeaways

Accelerated WEKA helps you supercharge WEKA using RAPIDS. It gives access to the efficient algorithm implementations in RAPIDS through an easy-to-use GUI. Installation is simplified by the Conda environment, making it straightforward to use Accelerated WEKA from the beginning.

If you use Accelerated WEKA, please use the hashtag #AcceleratedWEKA on social media. Also, please refer to the documentation for the correct publication to cite Accelerated WEKA in academic work and find out more details about the project. 

Contributing to Accelerated WEKA

WEKA is freely available under the GPL open-source license, and so is Accelerated WEKA. Accelerated WEKA is provided through Conda to automate the installation of the required tools for the environment, and the additions to the source code are published to the main packages for WEKA. Contributions and bug fixes can be submitted as patch files and posted to the WEKA mailing list.


We would like to thank Ettikan Karuppiah, Nick Becker, Brian Kloboucher, and Dr. Johan Barthelemy from NVIDIA for the technical support they provided during the execution of this project. Their insights were essential in helping us reach the goal of efficient integration with the RAPIDS library. In addition, we would like to thank Johan Barthelemy for running benchmarks on additional graphics cards.


Convert to numpy

Hello members,

So my question is: I have a variable of type ops.Tensor, and I need to convert it to a NumPy array. I tried different solutions I found online, but in those cases the variable needs to be an ops.EagerTensor to be converted into a NumPy object (.numpy(), tf.make_ndarray, etc.). So how can I convert my Tensor object to an EagerTensor, or directly to NumPy?

submitted by /u/Anonymous_Guy_12


Building a Four-Node Cluster with NVIDIA Jetson Xavier NX

Create a compact desktop cluster with four NVIDIA Jetson Xavier NX modules to accelerate training and inference of AI and deep learning workflows.

Following in the footsteps of large-scale supercomputers like the NVIDIA DGX SuperPOD, this post guides you through the process of creating a small-scale cluster that fits on your desk. Below is the recommended hardware and software to complete this project. This small-scale cluster can be utilized to accelerate training and inference of artificial intelligence (AI) and deep learning (DL) workflows, including the use of containerized environments from sources such as the NVIDIA NGC Catalog.


Picture of hardware components used in this post

While the Seeed Studio Jetson Mate, USB-C PD power supply, and USB-C cable are not required, they were used in this post and are highly recommended for a neat and compact desktop cluster solution.


For more information, see the NVIDIA Jetson Xavier NX development kit.


Write the JetPack image to a microSD card and perform initial JetPack configuration steps:

The first iteration through this post is targeted toward the Slurm control node (slurm-control). After you have the first node configured, you can either choose to repeat each step for each module, or you can clone this first microSD card for the other modules; more detail on this later.

For more information about the flashing and initial setup of JetPack, see Getting Started with Jetson Xavier NX Developer Kit.

While following the getting started guide above:

  • Skip the wireless network setup portion as a wired connection will be used.
  • When selecting a username and password, choose what you like and keep it consistent across all nodes.
  • Set the computer’s name to be the target node you’re currently working with, the first being slurm-control.
  • When prompted to select a value for Nvpmodel Mode, choose MODE_20W_6CORE for maximum performance.

After flashing and completing the getting started guide, run the following commands:

echo "`id -nu` ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/`id -nu`
sudo systemctl mask snapd.service apt-daily.service apt-daily-upgrade.service
sudo systemctl mask apt-daily.timer apt-daily-upgrade.timer
sudo apt update
sudo apt upgrade -y
sudo apt autoremove -y

Disable NetworkManager, enable systemd-networkd, and configure network [DHCP]:

sudo systemctl disable NetworkManager.service NetworkManager-wait-online.service NetworkManager-dispatcher.service network-manager.service
sudo systemctl mask avahi-daemon
sudo systemctl enable systemd-networkd
sudo ln -sf /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
cat  /dev/null




sudo sed -i "/#MulticastDNS=/cMulticastDNS=yes" /etc/systemd/resolved.conf
sudo sed -i "/#Domains=/cDomains=local" /etc/systemd/resolved.conf
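
As an illustration of what a DHCP configuration for systemd-networkd can look like, the sketch below renders a minimal .network unit as text. This is not the file from the original post: the interface name eth0 is taken from the troubleshooting section, while the choice of options (DHCP=ipv4 and MulticastDNS=yes) and the function name are assumptions. Such a file would typically live under /etc/systemd/network/ with a .network extension.

```python
def render_network_unit(interface="eth0"):
    """Render a minimal systemd-networkd .network unit as text.

    Assumed options: match one interface by name, request an IPv4
    address over DHCP, and enable multicast DNS on the link.
    """
    sections = [
        ("Match", [("Name", interface)]),
        ("Network", [("DHCP", "ipv4"), ("MulticastDNS", "yes")]),
    ]
    lines = []
    for title, options in sections:
        lines.append("[" + title + "]")
        lines.extend(key + "=" + value for key, value in options)
        lines.append("")  # blank line between sections
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_network_unit("eth0"), end="")
```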

Configure the node hostname:

If you have already set the hostname in the initial JetPack setup, this step can be skipped.


[slurm-control]

sudo hostnamectl set-hostname slurm-control
sudo sed -i "s/*/`hostname`/" /etc/hosts


Compute nodes should follow a particular naming convention to be easily addressable by Slurm. Use a consistent identifier followed by a sequentially incrementing number (for example, node1, node2, and so on). In this post, I suggest using nx1, nx2, and nx3 for the compute nodes. However, you can choose anything that follows a similar convention.

[compute nodes]

# Substitute this node's own name (nx1, nx2, or nx3) for nx[1-3]
sudo hostnamectl set-hostname nx[1-3]
sudo sed -i "s/*/`hostname`/" /etc/hosts
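
The reason for the naming convention is that Slurm can address a whole group of nodes with a bracketed range such as nx[1-3]. The following sketch shows how such an expression expands into individual hostnames. It handles only a single prefix[first-last] range; Slurm's real hostlist syntax is richer (comma-separated lists, zero padding, and so on), and the function name is hypothetical.

```python
import re

_RANGE = re.compile(r"([A-Za-z][\w-]*?)\[(\d+)-(\d+)\]")


def expand_hostlist(expression):
    """Expand a simple 'prefix[first-last]' expression into hostnames.

    Expressions without a bracketed range are returned unchanged as a
    single-element list.
    """
    match = _RANGE.fullmatch(expression)
    if match is None:
        return [expression]
    prefix = match.group(1)
    first, last = int(match.group(2)), int(match.group(3))
    return [prefix + str(number) for number in range(first, last + 1)]


if __name__ == "__main__":
    print(expand_hostlist("nx[1-3]"))
    print(expand_hostlist("slurm-control"))
```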

Create users and groups for Munge and Slurm:

sudo groupadd -g 1001 munge
sudo useradd -m -c "MUNGE" -d /var/lib/munge -u 1001 -g munge -s /sbin/nologin munge
sudo groupadd -g 1002 slurm
sudo useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u 1002 -g slurm -s /bin/bash slurm

Install Munge:

sudo apt install libssl-dev -y
git clone
cd munge 
sudo make install -j6
sudo ldconfig
sudo mkdir -m0755 -pv /usr/local/var/run/munge
sudo chown -R munge: /usr/local/etc/munge /usr/local/var/run/munge /usr/local/var/log/munge

Create or copy the Munge encryption keys:


[slurm-control]

sudo -u munge mungekey --verbose


[compute nodes]

sudo sftp -s 'sudo /usr/lib/openssh/sftp-server' `id -nu`@slurm-control

Start Munge and test the local installation:

sudo systemctl enable munge
sudo systemctl start munge
munge -n | unmunge

Expected result: STATUS: Success (0)

Verify that the Munge encryption keys match from a compute node to slurm-control:


munge -n | ssh slurm-control unmunge

Expected result: STATUS: Success (0)

Install Slurm (20.11.9):

cd ~
tar -xjvf slurm-20.11-latest.tar.bz2
cd slurm-20.11.9
./configure --prefix=/usr/local
sudo make install -j6

Index the Slurm shared objects and copy the systemd service files:

sudo ldconfig -n /usr/local/lib/slurm
sudo cp etc/*.service /lib/systemd/system

Create directories for Slurm and apply permissions:

sudo mkdir -pv /usr/local/var/{log,run,spool} /usr/local/var/spool/{slurmctld,slurmd}
sudo chown slurm:root /usr/local/var/spool/slurm*
sudo chmod 0744 /usr/local/var/spool/slurm*

Create a Slurm configuration file for all nodes:

For this step, you can follow the included commands and use the following configuration file for the cluster (recommended). To customize variables related to Slurm, use the configuration tool.

cat  /dev/null
#slurm.conf for all nodes#

NodeName=nx[1-3] RealMemory=7000 Sockets=1 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
PartitionName=compute Nodes=ALL Default=YES MaxTime=INFINITE State=UP

sudo chmod 0744 /usr/local/etc/slurm.conf
sudo chown slurm: /usr/local/etc/slurm.conf
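
To see what the NodeName line above tells Slurm, the sketch below parses it and derives the resources per node and for the whole partition. The helper names are hypothetical, and the node count of three is taken from the nx[1-3] range used in this post. RealMemory is expressed in megabytes.

```python
NODE_LINE = (
    "NodeName=nx[1-3] RealMemory=7000 Sockets=1 "
    "CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN"
)


def parse_node_line(line):
    """Split a slurm.conf node definition into a key/value dictionary."""
    return dict(field.split("=", 1) for field in line.split())


def cpus_per_node(node):
    """Logical CPUs per node: sockets x cores per socket x threads per core."""
    return (
        int(node["Sockets"])
        * int(node["CoresPerSocket"])
        * int(node["ThreadsPerCore"])
    )


node = parse_node_line(NODE_LINE)
NODE_COUNT = 3  # nx1, nx2, and nx3

if __name__ == "__main__":
    print("CPUs per node:", cpus_per_node(node))
    print("CPUs in partition:", cpus_per_node(node) * NODE_COUNT)
    print("Memory in partition (MB):", int(node["RealMemory"]) * NODE_COUNT)
```

The six cores per node match the MODE_20W_6CORE power mode selected during the initial JetPack setup.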

Install Enroot 3.3.1:

cd ~
sudo apt install curl jq parallel zstd -y
arch=$(dpkg --print-architecture)
curl -fSsL -O${arch}.deb
sudo dpkg -i enroot_3.3.1-1_${arch}.deb

Install Pyxis (0.13):

git clone
cd pyxis
sudo make install -j6

Create the Pyxis plug-in directory and config file:

sudo mkdir /usr/local/etc/plugstack.conf.d
echo "include /usr/local/etc/plugstack.conf.d/*" | sudo tee /usr/local/etc/plugstack.conf > /dev/null

Link the Pyxis default config file to the plug-in directory:

sudo ln -s /usr/local/share/pyxis/pyxis.conf /usr/local/etc/plugstack.conf.d/pyxis.conf

Verify Enroot/Pyxis installation success:

srun --help | grep container-image

Expected result: --container-image=[USER@][REGISTRY#]IMAGE[:TAG]|PATH


When replicating the configuration across the remaining nodes, label the Jetson Xavier NX modules and/or the microSD cards with the assigned node name. This helps prevent confusion later on when moving modules or cards around.

There are two ways to replicate your installation to the remaining modules: manual configuration or cloning slurm-control. Read over both methods and choose the one you prefer.

Manually configure the remaining nodes

Follow the “Enable and start the Slurm service daemon” section below for your current module, then repeat the entire process for the remaining modules, skipping any steps tagged under [slurm-control]. When all modules are fully configured, install them into the Jetson Mate in their respective slots, as outlined in the “Install all Jetson Xavier NX modules into the enclosure” section.

Clone slurm-control installation for remaining nodes

To avoid repeating all installation steps for each node, clone the slurm-control node’s card as a base image and flash it onto all remaining cards. This requires a microSD-to-SD card adapter if you have only one multi-port card reader and want to do card-to-card cloning. Alternatively, creating an image file from the source slurm-control card onto the local machine and then flashing target cards is also an option.

  1. Shut down the Jetson that you’ve been working with, remove the microSD card from the module, and insert it into the card reader.
  2. If you’re performing a physical card to card clone (using Balena Etcher, dd, or any other utility that will do sector by sector writes), insert the blank target microSD into the SD card adapter, then insert it into the card reader.
  3. Identify which card is which for the source (microSD) and destination (SD card) in the application that you’re using and start the cloning process.
  4. If you are creating an image file, using a utility of your choice, create an image file from the slurm-control microSD card on the local machine, then remove that card and flash the remaining blank cards using that image.
  5. After cloning is completed, insert a cloned card into a Jetson module and power on. Configure the node hostname for a compute node, then proceed to enable and start the Slurm service daemon. Repeat this process for all remaining card/module pairs.

Enable and start the Slurm service daemon:


[slurm-control]

sudo systemctl enable slurmctld
sudo systemctl start slurmctld


[compute nodes]

sudo systemctl enable slurmd
sudo systemctl start slurmd

Install all Jetson Xavier NX modules into the enclosure

First, power down any running modules, then remove them from their carriers. Install all Jetson modules into the Seeed Studio Jetson Mate, ensuring that the control node is placed in the primary slot labeled “MASTER”, and compute nodes 1-3 are placed in the secondary slots labeled “WORKER 1, 2, and 3” respectively. Optional fan extension cables are available from the Jetson Mate kit for each module.

The video output on the enclosure is connected to the primary module slot, as is the vertical USB2 port, and USB3 port 1. All other USB ports are wired to the other modules according to their respective port numbers.

Figure 1. Fully assembled cluster inside of the Seeed Studio Jetson Mate


This section contains some helpful commands to assist in troubleshooting common networking and Slurm-related issues.

Test network configuration and connectivity

The following command should show eth0 in the routable state, with IP address information obtained from the DHCP server:

networkctl status

The command should respond with the local node’s hostname and .local as the domain (for example, slurm-control.local), along with DHCP assigned IP addresses:

host `hostname`

Choose a compute node hostname that is configured and online. It should respond similarly to the previous command. For example, host nx1 should return a line beginning with nx1.local has address, followed by the node's IP address. This should also work for any other host that has an mDNS resolver daemon running on your LAN.

host [compute-node-hostname]

All cluster nodes should be pingable by all other nodes, and all local LAN IP addresses should be pingable as well, such as your router.

ping [compute-node-hostname/local-network-host/ip]

Test the external DNS name resolution and confirm that routing to the internet is functional:


Check Slurm cluster status and node communication

The following command shows the current status of the cluster, including node states:

sinfo -lNe

If any nodes in the sinfo output show UNKNOWN or DOWN for their state, the following command signals to the specified nodes to change their state and become available for job scheduling ([ ] specifies a range of numbers following the hostname ‘nx’):

scontrol update NodeName=hostname[1-3] State=RESUME

The following command runs hostname on all available compute nodes. Nodes should respond back with their corresponding hostname in your console.

srun -N3 hostname


You’ve now successfully built a multi-node Slurm cluster that fits on your desk. There is a vast number of benchmarks, projects, workloads, and containers that you can now run on your mini-cluster. Feel free to share your feedback on this post and, of course, anything that your new cluster is being used for.

Power on and enjoy Slurm!

For more information, see the following resources:


Special thanks to Robert Sohigian, a technical marketing engineer on our team, for all the guidance in creating this post, providing feedback on the clarity of instructions, and for being the lab rat in multiple runs of building this cluster. Your feedback was invaluable and made this post what it is!


Driving Data Center Innovation Through Ecosystem Partners

Leading security, storage, and networking vendors are joining the DOCA and DPU community.

The DPU, or data processing unit, is a new class of programmable processor that specializes in moving data around the data center and now joins CPUs and GPUs as the third pillar of modern computing. NVIDIA DOCA is core to the NVIDIA BlueField DPU offering because it provides ecosystem partners with an open platform to deliver the advanced networking, storage, and security services needed today.

DOCA unlocks data center innovation by enabling an open ecosystem and developer community to rapidly create applications and services on top of BlueField DPUs, using industry-standard open APIs and frameworks.

Integral to our customers’ success, and our own, is the collaboration with our ecosystem partners. For more than 15 years, our ecosystem partners have harnessed the power of CUDA to develop the world’s most effective accelerated applications for a multitude of use cases. 

The NVIDIA CUDA Toolkit provides everything that is needed to develop GPU-accelerated applications. Similarly, the NVIDIA DOCA Software Framework is an open SDK that enables you to rapidly create applications and services on top of BlueField DPUs.

Where partners have achieved such success with NVIDIA GPUs and CUDA, we are emulating that formula with our DPU portfolio and DOCA. Moreover, we recognize that to deliver best-in-class solutions for customers, we need to partner with the world’s leading technology vendors. Proprietary applications have their place, but who better to provide world-class security, storage, and networking solutions, than the world’s leading vendors in those fields?

A meeting of the minds

During the last two years, our ecosystem partners have been delivering innovative solutions and services essential for digital transformation. The most turbulent period in recent history has forced us all to find new ways to collaborate and embrace technology at a rate never expected. Not only have we had to adapt as individuals, but organizations across the globe have been forced to re-think their day-to-day activities.

We work closely with our partners to define and create more DOCA libraries and services to address innovative use cases.  More than ever, we’re witnessing a realignment between technology requirements in the data center and ever-changing business priorities. In turn, matching customers to ecosystem partners provides an opportunity to create customized technology solutions tuned to meet specific business objectives.

Today, NVIDIA is working with leading platform vendors and partners to integrate and expand DOCA support for commercial distributions on BlueField DPUs. Dozens of industry leaders, including VMware, Red Hat, DDN, Aria Cybersecurity, and Juniper Networks, have started to integrate their solutions using the DPU/DOCA architecture. You’ll start to see more new applications in the coming year.

Earlier this year, Palo Alto Networks, a global cybersecurity leader, developed the first next-generation firewall (NGFW) specifically designed to be accelerated by the BlueField DPU. This first-to-market, hardware-accelerated software NGFW is a prime example of how the BlueField DPU boosts performance and optimizes data center security coverage and efficiency.

Third-party developers can create and distribute DPU-accelerated applications with the DOCA SDK, which is fully integrated into the NGC catalog of containerized software. Such accelerated solutions will be wide-ranging, including advanced applications for infrastructure, storage, and security. They will be key to unlocking data center innovation.

Try DOCA today

NVIDIA DOCA is the key to unlocking the potential of the NVIDIA BlueField DPU to offload, accelerate, and isolate data center workloads. With DOCA, you can program the data center infrastructure of tomorrow by creating software-defined, cloud-native, DPU-accelerated services with zero-trust protection to address the increasing performance and security demands of modern data centers.

To start developing on DOCA:


Upcoming Webinar: Using GPUs to Accelerate HD Mapping and Location-Based Services

Join us on July 20 for a webinar highlighting how using NVIDIA A100 GPUs can help map and location-based service providers speed up map creation and workflows, while reducing costs.


Advanced API Performance: Vulkan Clearing and Presenting

This post covers best practices for Vulkan clearing and presentation on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips.

With the recent Vulkan 1.3 release, it’s timely to add some Vulkan-specific tips that are not necessarily explicitly covered by the other Advanced API Performance posts. In addition to introducing new Vulkan 1.3 core features, this post shares a set of good practices for clearing and presenting surfaces.

Vulkan 1.3 Core

Vulkan 1.3 brings improvements through extensions to key parts of the API. This section summarizes our recommendations for obtaining the best performance when working with a number of these new features.



This section provides guidelines for achieving good performance when invoking clear commands. This type of command clears a region within a color image or within the bound framebuffer attachments.

  • Use VK_ATTACHMENT_LOAD_OP_CLEAR to clear attachments at the beginning of a subpass instead of clear commands. This can allow the driver to skip loading unnecessary data.
  • Outside of a render pass instance, prefer vkCmdClearColorImage over a compute shader (CS) invocation to clear images. This path enables bandwidth optimizations.
  • If possible, batch clears to avoid interleaving single clears between dispatches.
  • Coordinate VkClearDepthStencilValue with the test function to achieve better depth testing performance:
    • 0.5 ≤ depth value with VK_COMPARE_OP_LESS_OR_EQUAL
    • 0.0 ≤ depth value with VK_COMPARE_OP_GREATER_OR_EQUAL

Not recommended

  • Specifying more than 30 unique clear values per application (or more than 15 on Turing) does not make the most of clear bandwidth optimizations.
  • “Clear shaders” should be avoided unless there is overlap of a compute clear with a neighboring dispatch.
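
The unique clear value budget above can be checked with a small helper. This is only a sketch: the function and constant names are hypothetical and not part of Vulkan, clear values are assumed to be RGBA tuples collected by your own engine code, and the limits of 30 and 15 are the ones quoted in this post.

```python
from typing import Iterable, Set, Tuple

# An RGBA clear color as an engine might record it (assumed representation).
ClearColor = Tuple[float, float, float, float]

# Budgets quoted in the post; the constant names are illustrative only.
MAX_UNIQUE_CLEARS = 30
MAX_UNIQUE_CLEARS_TURING = 15


def unique_clear_values(clears: Iterable[ClearColor]) -> Set[ClearColor]:
    """Return the distinct clear values used by the application."""
    return set(clears)


def within_clear_budget(clears: Iterable[ClearColor], turing: bool = False) -> bool:
    """True when the number of distinct clear values stays within the budget."""
    budget = MAX_UNIQUE_CLEARS_TURING if turing else MAX_UNIQUE_CLEARS
    return len(unique_clear_values(clears)) <= budget
```

Reusing a handful of clear colors, for example one opaque black shared by every render target, keeps the count well inside either budget.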


The following section offers insight into the preferred way of using the presentation modes supported by a surface in order to achieve good performance.


  • Rely on VK_PRESENT_MODE_FIFO_KHR or VK_PRESENT_MODE_MAILBOX_KHR (for VSync on). Noteworthy aspects:
    • VK_PRESENT_MODE_FIFO_KHR is preferred as it does not drop frames and lacks tearing.
    • VK_PRESENT_MODE_MAILBOX_KHR may offer lower latency, but frames might be dropped.
    • VK_PRESENT_MODE_FIFO_RELAXED_KHR is compelling when your application only occasionally lags behind the refresh rate, allowing tearing so that it can “catch back up”.
  • Rely on VK_PRESENT_MODE_IMMEDIATE_KHR for VSync off.
  • On Windows systems, use the VK_EXT_full_screen_exclusive extension to bypass compositing.
  • Handle both out-of-date and suboptimal swapchains to re-create stale swapchains when windows resize, for example.
  • For latency-sensitive applications, use the Vulkan Reflex SDK to minimize latency by completing game engine work just-in-time for rendering.
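
The present mode recommendations above amount to a small decision procedure. The sketch below models it in plain Python over the enum names as strings. The helper choose_present_mode and its flags are hypothetical, and in a real application the supported set would come from vkGetPhysicalDeviceSurfacePresentModesKHR. The fallback to FIFO relies on the Vulkan specification requiring every surface to support that mode.

```python
from typing import Collection

FIFO = "VK_PRESENT_MODE_FIFO_KHR"
FIFO_RELAXED = "VK_PRESENT_MODE_FIFO_RELAXED_KHR"
MAILBOX = "VK_PRESENT_MODE_MAILBOX_KHR"
IMMEDIATE = "VK_PRESENT_MODE_IMMEDIATE_KHR"


def choose_present_mode(
    supported: Collection[str],
    vsync: bool,
    low_latency: bool = False,
    occasionally_late: bool = False,
) -> str:
    """Pick a present mode following the guidance in the list above.

    The flags are illustrative knobs, not Vulkan concepts.
    """
    if not vsync:
        # VSync off: IMMEDIATE if the surface offers it, else the mandatory FIFO.
        return IMMEDIATE if IMMEDIATE in supported else FIFO
    if low_latency and MAILBOX in supported:
        # Lower latency, at the cost of possibly dropped frames.
        return MAILBOX
    if occasionally_late and FIFO_RELAXED in supported:
        # Allows tearing so a briefly late application can catch back up.
        return FIFO_RELAXED
    # Preferred default: no dropped frames and no tearing.
    return FIFO
```

For example, choose_present_mode({FIFO, MAILBOX}, vsync=True) returns the FIFO mode, because MAILBOX is only selected when lower latency is explicitly requested.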

More information

For more information about using Vulkan with NVIDIA GPUs, see Vulkan Do’s and Don’ts.

To view the Vulkan API state, use the API Inspector in Nsight Graphics (free download).

With Nsight Systems, you can view Vulkan usage on a unified CPU-GPU timeline, investigate stutter, and track GPU cold spots to their CPU origins. Download Nsight Systems for free.


Thanks to Piers Daniell, Ivan Fedorov, Adam Moss, Ryan Prescott, Joshua Schnarr, Juha Sjöholm, and Márton Tamás for their feedback and contributions.


Building Generally Capable AI Agents with MineDojo

NVIDIA is helping push the limits of training AI generalist agents with a new open-sourced framework called MineDojo.

Using video games as a medium for training AI has become a popular method within the AI research community. These autonomous agents have had great success in Atari games, StarCraft, Dota, and Go. But while these advancements have been popular for AI research, the agents do not generalize beyond a very specific set of tasks, unlike humans, who continuously learn from open-ended tasks.

Building an embodied agent that can attain high-level performance across a wide spectrum of tasks has been one of the greatest challenges facing the AI research community. In order to build a successful generalist agent, users need an environment that supports a multitude of tasks and goals, a large-scale database of multimodal knowledge, and a flexible and scalable agent architecture.

Enter Minecraft, the most played game in the world. With its flexible gameplay, players can take a wide variety of actions, ranging from building a medieval castle to exploring dangerous environments to gathering resources for building a Nether Portal to battle the Ender Dragon. This creative atmosphere is the perfect environment for training an embodied agent.

A table of images showing the NVIDIA AI agent completing different tasks.
Figure 1. The NVIDIA AI agent follows the prompts within the MineDojo framework

To take advantage of such an optimal training ground, NVIDIA researchers created MineDojo, a massive framework that features a simulation suite with thousands of diverse open-ended tasks and an internet-scale knowledge base. Building an AI powerful enough to complete these tasks would not be possible without an expansive data library.

The mission of MineDojo is to promote research towards the goal of generally capable embodied agents. For the embodied agent to be successful, the environment needs to provide an almost infinite number of open-ended tasks and actions. The agent also needs access to a large database of information from which it can draw knowledge and then apply what it learns. Finally, the agent's training needs to be scalable so that this large-scale knowledge can be converted into actionable insights later on.

A few screenshots and word maps detailing the vast annotated database from YouTube, the Minecraft Wiki and Reddit for an AI agent to train on.
Figure 2. The MineDojo framework takes advantage of an Internet-scale database to train an AI agent

In MineDojo, the embodied agent has access to three internet-scale datasets. The first comprises 750,000 Minecraft YouTube videos, amounting to over 33 years of footage, from which over 2 million words were transcribed.

MineDojo also scraped over 6,000 web pages from the Minecraft Wiki, with over 2.2 million bounding boxes created for the visual elements of those pages. In addition, millions of Reddit threads related to Minecraft and the variety of activities possible within the game were captured. These threads include questions about how to solve certain tasks, showcases of achievements and creations in image and video formats, and general tips and tricks.

Screenshots of webpages being annotated for the AI agent's training from the Minecraft Wiki.
Screenshots of Reddit questions being scraped for the AI to train on.
Figure 3. Examples of content annotated and scraped from the internet for the MineDojo framework

MineDojo offers a set of simulator APIs that users can use to train their AI agents. It provides unified observation and action spaces to help the agent adapt to new scenarios and multitask. Additionally, using the APIs, users can take advantage of all three worlds within the Minecraft universe to expand the number of tasks and actions available to the agent.

Within the simulator, MineDojo splits the benchmarking tasks into two categories: programmatic tasks and creative tasks.

Programmatic tasks are well defined and can be easily evaluated, such as “surviving 3 days” or “obtain one unit of pumpkin in the forest.” 
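
To illustrate why a programmatic task is easy to evaluate, its success condition can be written as a simple predicate. The sketch below is not the MineDojo API: the inventory dictionary and the make_obtain_check helper are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Dict

# Assumed representation: item name mapped to the quantity the agent holds.
Inventory = Dict[str, int]


def make_obtain_check(item: str, quantity: int) -> Callable[[Inventory], bool]:
    """Build a success check for an 'obtain N units of item' style task."""

    def check(inventory: Inventory) -> bool:
        # Items missing from the inventory count as zero.
        return inventory.get(item, 0) >= quantity

    return check


# Success check for the "obtain one unit of pumpkin" example task.
obtained_pumpkin = make_obtain_check("pumpkin", 1)
```

A creative task such as building a beach house has no comparable predicate, which is what makes it hard to score with explicit rules.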

Creative tasks are much more open-ended, such as “build a beautiful beach house.” It is very difficult to define what qualifies as a beach house by an explicit set of rules. These tasks are meant to encourage the research community to develop more human-like and imaginative AI agents.

Video clips of the variety of tasks that are benchmarked through MineDojo.
Figure 4. MineDojo currently provides benchmarks for thousands of creative and programmatic tasks

Natural language is a cornerstone of the MineDojo framework. It aids open-vocabulary understanding, provides grounding for image and video modalities, and serves as an intuitive interface to specify instructions. Combined with the latest speech recognition technology, it may soon be possible to talk to an AI agent as you would to a friend in multiplayer co-op mode.

For example, instructions such as “Plant a row of blue flowers in front of our house. Add some gold decorations to the door frame. Let’s go explore the cave next to the river,” could all be possible.

Proof of concept using MineCLIP

To help promote the project and provide a proof of concept, the MineDojo researchers have implemented MineCLIP, a single language-prompted agent that completes several complex tasks within Minecraft. This novel agent learning algorithm takes advantage of the 33 years’ worth of Minecraft YouTube videos. However, any agent can use any or all of the three sections of the internet-scale database at the user’s discretion.

A flowchart of the MineCLIP agent showing the reward signal to train the embodied agent.
Figure 5. MineCLIP learns to associate video and text from the large number of YouTube videos. The association score provides a reward signal to guide the agent to learn multiple tasks in parallel.

MineCLIP learns the concepts and actions of Minecraft from the YouTube videos without human hand labeling. YouTubers typically narrate what they are doing as they stream the gameplay video. MineCLIP is a large Transformer model that learns to associate a video clip with its corresponding English transcript.

This association score can be provided as a reward signal to guide a reinforcement learning agent towards completing the task. For the example task, “shear a sheep to obtain wool,” MineCLIP gives a high reward to the agent if it approaches the sheep, but a low reward if the agent wanders aimlessly. It is even capable of multitasking within the game to complete a wide range of simple tasks.
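
The idea of using an association score as a reward can be sketched as follows. This is a toy illustration rather than the MineCLIP implementation: it assumes the video clip and the text prompt have already been embedded as plain vectors, and it uses cosine similarity as a stand-in for the learned association score. The embedding values in the usage note are made up.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector carries no direction, so treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def association_reward(video_embedding: Sequence[float],
                       text_embedding: Sequence[float]) -> float:
    """Reward is higher when recent gameplay matches the task prompt."""
    return cosine_similarity(video_embedding, text_embedding)
```

With a made-up prompt embedding of [1.0, 0.0] for “shear a sheep to obtain wool,” a clip embedded as [0.9, 0.1] (approaching the sheep) earns a higher reward than one embedded as [0.1, 0.9] (wandering), mirroring the behavior described above.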

Building generally capable embodied agents is a holy grail of AI research. MineDojo provides a benchmark of thousands of tasks, a rich internet-scale knowledge base, and an innovative algorithm as a first step towards solving this grand challenge.

Stay posted to see what new models and techniques the research community comes up with next! Start using MineDojo today.

Read more about the framework and its findings. Explore other NVIDIA research.