Can someone please help with this – https://stackoverflow.com/questions/72809662/how-to-interpret-the-output-of-get-weights-for-keras-lstm? It’ll be much appreciated.
submitted by /u/Academic-Rent7800
Accelerated WEKA integrates the WEKA workbench with Python and Java libraries that support GPUs to speed up the training and prediction time of machine learning models.
In recent years, there has been a surge in building and adopting machine learning (ML) tools. The use of GPUs to accelerate increasingly compute-intensive models has been a prominent trend.
To broaden access, the Accelerated WEKA project provides an easy entry point for using GPUs in well-known WEKA algorithms by integrating the open-source RAPIDS libraries.
In this post, you will be introduced to Accelerated WEKA and learn how to leverage GPU-accelerated algorithms with a graphical user interface (GUI) using WEKA software. This Java open-source alternative is suitable for beginners looking for a variety of ML algorithms from different environments or packages.
What is Accelerated WEKA?
Accelerated WEKA unifies the WEKA software, a well-known and open-source Java software, with new technologies that leverage the GPU to shorten the execution time of ML algorithms. It has two benefits aimed at users without expertise in system configuration and coding: an easy installation and a GUI that guides the configuration and execution of the ML tasks.
Accelerated WEKA is a collection of packages available for WEKA, and it can be extended to support new tools and algorithms.
What is RAPIDS?
RAPIDS is a collection of open-source Python libraries for users to develop and deploy data science workloads on NVIDIA GPUs. Popular libraries include cuDF for GPU-accelerated DataFrame processing and cuML for GPU-accelerated machine learning algorithms. RAPIDS APIs conform as much as possible to the CPU counterparts, such as pandas and scikit-learn.
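As a rough illustration of how closely the RAPIDS APIs track their CPU counterparts, the following minimal sketch swaps pandas and scikit-learn for cuDF and cuML; it assumes a machine with RAPIDS installed and a supported NVIDIA GPU, and the file name and column names are placeholders:

# Minimal sketch: cuDF/cuML mirroring a pandas/scikit-learn workflow.
# Assumes RAPIDS is installed and a supported NVIDIA GPU is available.
import cudf
from cuml.linear_model import LogisticRegression

# Load a CSV directly into GPU memory (cuDF mirrors pandas.read_csv).
df = cudf.read_csv("train.csv")                      # placeholder file name
X = df.drop(columns=["label"]).astype("float32")     # placeholder label column
y = df["label"].astype("int32")

# Fit a GPU-accelerated model (cuML mirrors the scikit-learn estimator API).
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
predictions = model.predict(X)
print(predictions.head())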
Accelerated WEKA architecture
The building blocks of Accelerated WEKA are packages like WekaDeeplearning4j and wekaRAPIDS (inspired by wekaPython). WekaDeeplearning4j (WDL4J) already supports GPU processing but has very specific needs in terms of libraries and environment configuration. WDL4J provides WEKA wrappers for the Deeplearning4j library.
For Python users, wekaPython initially provided Python integration by creating a server and communicating with it through sockets. With this, the user can execute scikit-learn ML algorithms (or even XGBoost) inside the WEKA workbench. wekaRAPIDS provides integration with the RAPIDS cuML library using the same technique as wekaPython.
Together, both packages provide enhanced functionality and performance inside the user-friendly WEKA workbench. Accelerated WEKA goes a step further in the direction of performance by improving the communication between the JVM and Python interpreter. It does so by using alternatives like Apache Arrow and GPU memory sharing for efficient data transfer between the two languages.
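The exact data exchange used by Accelerated WEKA is internal to the packages, but the idea behind Arrow-based transfer can be sketched in a few lines of Python: the dataset is serialized once into Arrow's columnar IPC format and deserialized on the other side without row-by-row conversion. The snippet below is illustrative only and is not the actual wekaRAPIDS protocol:

# Illustrative Arrow IPC round trip standing in for the JVM-to-Python hand-off.
import pyarrow as pa

# A columnar batch of instances (what the Java side would send across).
table = pa.table({"x1": [5.1, 4.9, 6.2], "x2": [3.5, 3.0, 2.9], "class": [0, 0, 1]})

# Sender side: write the table to an in-memory Arrow IPC stream.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buffer = sink.getvalue()

# Receiver side: reconstruct the table without per-row parsing.
received = pa.ipc.open_stream(buffer).read_all()
print(received.num_rows, received.column_names)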
Accelerated WEKA also provides integration with the RAPIDS cuML library, which implements machine learning algorithms that are accelerated on NVIDIA GPUs. Some cuML algorithms can even support multi-GPU solutions.
Supported algorithms
The algorithms currently supported by Accelerated WEKA are:
- LinearRegression
- LogisticRegression
- Ridge
- Lasso
- ElasticNet
- MBSGDClassifier
- MBSGDRegressor
- MultinomialNB
- BernoulliNB
- GaussianNB
- RandomForestClassifier
- RandomForestRegressor
- SVC
- SVR
- LinearSVC
- KNeighborsRegressor
- KNeighborsClassifier
The algorithms supported by Accelerated WEKA in multi-GPU mode are:
- KNeighborsRegressor
- KNeighborsClassifier
- LinearRegression
- Ridge
- Lasso
- ElasticNet
- MultinomialNB
- CD
Using Accelerated WEKA GUI
During the Accelerated WEKA design stage, one main goal was for it to be easy to use. The following steps outline how to set it up on a system along with a brief example.
Please refer to the documentation for more information and a comprehensive getting started guide. The only prerequisite for Accelerated WEKA is having Conda installed on your system.
- The installation of Accelerated WEKA is available through Conda, a system providing package and environment management. This means that a single command can install all dependencies for the project. For example, on a Linux machine, issue the following command in a terminal to install Accelerated WEKA and all of its dependencies.
conda create -n accelweka -c rapidsai -c nvidia -c conda-forge -c waikato weka
- After Conda has created the environment, activate it with the following command:
conda activate accelweka
- This terminal instance just loaded all dependencies for Accelerated WEKA. Launch WEKA GUI Chooser with the command:
weka
- Figure 1 shows the WEKA GUI Chooser window. From there, click the Explorer button to access the functionalities of Accelerated WEKA.
- In the WEKA Explorer window (Figure 2), click the Open file button to select a dataset file. WEKA works with ARFF files but can read from CSVs. Converting from CSVs can be pretty straightforward or require some configuration by the user, depending on the types of the attributes.
- The WEKA Explorer window with a dataset loaded is shown in Figure 3. Assuming one does not want to preprocess the data, clicking the Classify tab will present the classification options to the user.
The Classify tab is presented in Figure 4. Clicking the Choose button will show the implemented classifiers. Some might be disabled because of the dataset characteristics. To use Accelerated WEKA, the user must select rapids.CuMLClassifier. After that, clicking the bold CuMLClassifier text will open the options window for the classifier.
- Figure 5 shows the options window for CuMLClassifier. With the RAPIDS learner field, the user can choose the desired classifier among the ones supported by the package. The Learner parameters field is for modifying the cuML parameters, details of which can be found in the cuML documentation.
The other options let the user fine-tune the attribute conversion, configure which Python environment is to be used, and determine the number of decimal places the algorithm should operate with. For this tutorial, select Random Forest Classifier and keep everything at the default configuration. Clicking OK will close the window and return to the previous tab.
- After configuring the Classifier according to the previous step, the parameters will be shown in the text field beside the Choose button. After clicking Start, WEKA will start executing the chosen classifier with the dataset.
Figure 6 shows the classifier in action. The Classifier output shows debug and general information regarding the experiment, such as parameters, the classifier, the dataset, and test options. The status bar shows the current state of the execution, and the Weka bird at the bottom animates, flipping from one side to the other while the experiment is running.
- After the algorithm finishes the task, it will output the summary of the execution with information regarding predictive performance and the time taken. In Figure 7, the output shows the results for 10-fold cross-validation using the RandomForestClassifier from cuML through CuMLClassifier.
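For readers who prefer scripting to the GUI, the following sketch approximates the same experiment in plain Python: a cuML random forest evaluated with 10-fold cross-validation. The dataset path, column name, and parameters are placeholders and not what CuMLClassifier uses internally:

# Approximation of the GUI run: cuML RandomForestClassifier with 10-fold CV.
import cudf
from cuml.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

df = cudf.read_csv("dataset.csv")                    # placeholder file name
X = df.drop(columns=["class"]).astype("float32")     # placeholder class column
y = df["class"].astype("int32")

scores = []
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
for train_idx, test_idx in folds.split(X.to_pandas(), y.to_pandas()):
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    preds = model.predict(X.iloc[test_idx])
    scores.append(accuracy_score(y.iloc[test_idx].to_pandas(), preds.to_pandas()))

print(f"Mean accuracy over 10 folds: {sum(scores) / len(scores):.3f}")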
Benchmarking Accelerated WEKA
We evaluated the performance of Accelerated WEKA by comparing the execution time of the algorithms on the CPU with the execution time using Accelerated WEKA. The hardware used in the experiments was an i7-6700K, a GTX 1080Ti, and a DGX Station with four A100 GPUs. Unless stated otherwise, the benchmarks use a single GPU.
We used datasets with different characteristics for the benchmarks. Some of them were synthetic for better control of the attributes and instances, like the RDG and RBF generators. The RDG generator builds instances based on decision lists. The default configuration has 10 attributes, 2 classes, a minimum rule size of 1, and a maximum rule size of 10. We changed the minimum and maximum limits to 5 and 20, respectively. With this generator, we created datasets with 1, 2, 5, and 10 million instances, as well as 5 million instances with 20 attributes.
The RBF generator creates a random set of centers for each class and then generates instances by taking random offsets from the centers for the attribute values. The number of attributes is indicated by the suffix a__ (for example, a5k means 5 thousand attributes), and the number of instances is indicated by the suffix n__ (for example, n10k means 10 thousand instances).
Lastly, we used the HIGGS dataset, which contains kinematic properties from particle accelerator events. The first 5 million instances of the HIGGS dataset were used to create HIGGS_5m.
The results for the wekaRAPIDS integration are shown in Tables 1 through 4, which directly compare the baseline CPU execution with the Accelerated WEKA execution. The results for WDL4J are shown in Table 5.
Table 1. XGBoost (CV)

| Dataset | i7-6700K baseline (seconds) | GTX 1080Ti AWEKA SGM (seconds) | Speedup |
| --- | --- | --- | --- |
| RDG1_1m | 266.59 | 65.77 | 4.05 |
| RDG1_2m | 554.34 | 122.75 | 4.52 |
| RDG1_5m | 1423.34 | 294.40 | 4.83 |
| RDG1_10m | 2795.28 | 596.74 | 4.68 |
| RDG1_5m_20a | 2664.39 | 403.39 | 6.60 |
| RBFa5k | 17.16 | 15.75 | 1.09 |
| RBFa5kn1k | 110.14 | 25.43 | 4.33 |
| RBFa5kn5k | 397.83 | 49.38 | 8.06 |
Table 2. XGBoost (no-CV)

| Dataset | i7-6700K baseline (seconds) | GTX 1080Ti AWEKA CSV (seconds) | Speedup | A100 AWEKA CSV (seconds) | Speedup |
| --- | --- | --- | --- | --- | --- |
| RDG1_1m | 46.40 | 21.19 | 2.19 | 22.69 | 2.04 |
| RDG1_2m | 92.87 | 34.76 | 2.67 | 35.42 | 2.62 |
| RDG1_5m | 229.38 | 73.49 | 3.12 | 65.16 | 3.52 |
| RDG1_10m | 461.83 | 143.08 | 3.23 | 106.00 | 4.36 |
| RDG1_5m_20a | 268.98 | 73.31 | 3.67 | – | – |
| RBFa5k | 5.76 | 7.73 | 0.75 | 8.68 | 0.66 |
| RBFa5kn1k | 23.59 | 13.38 | 1.76 | 19.84 | 1.19 |
| RBFa5kn5k | 78.68 | 34.61 | 2.27 | 29.84 | 2.64 |
| HIGGS_5m | 214.77 | 169.48 | 1.27 | 76.82 | 2.80 |
Table 3. RandomForest (CV)

| Dataset | i7-6700K baseline (seconds) | GTX 1080Ti AWEKA SGM (seconds) | Speedup |
| --- | --- | --- | --- |
| RDG1_1m | 494.27 | 97.55 | 5.07 |
| RDG1_2m | 1139.86 | 200.93 | 5.67 |
| RDG1_5m | 3216.40 | 511.08 | 6.29 |
| RDG1_10m | 6990.00 | 1049.13 | 6.66 |
| RDG1_5m_20a | 5375.00 | 825.89 | 6.51 |
| RBFa5k | 13.09 | 29.61 | 0.44 |
| RBFa5kn1k | 42.33 | 49.57 | 0.85 |
| RBFa5kn5k | 189.46 | 137.16 | 1.38 |
Table 4. KNN (no-CV)

| Dataset | AMD EPYC 7742 (4 cores) baseline (seconds) | NVIDIA A100 wekaRAPIDS (seconds) | Speedup | 4x NVIDIA A100 wekaRAPIDS (seconds) | Speedup |
| --- | --- | --- | --- | --- | --- |
| covertype | 3755.80 | 67.05 | 56.01 | 42.42 | 88.54 |
| RBFa5kn5k | 6.58 | 59.94 | 0.11 | 56.21 | 0.12 |
| RBFa5kn10k | 11.54 | 62.98 | 0.18 | 59.82 | 0.19 |
| RBFa500n10k | 2.40 | 44.43 | 0.05 | 39.80 | 0.06 |
| RBFa500n100k | 182.97 | 65.36 | 2.80 | 45.97 | 3.98 |
| RBFa50n10k | 2.31 | 42.24 | 0.05 | 37.33 | 0.06 |
| RBFa50n100k | 177.34 | 43.37 | 4.09 | 37.94 | 4.67 |
| RBFa50n1m | 21021.74 | 77.33 | 271.84 | 46.00 | 456.99 |
Table 5. Neural network with 3,230,621 parameters (WDL4J)

| Epochs | i7-6700K baseline (seconds) | GTX 1080Ti WDL4J (seconds) | Speedup |
| --- | --- | --- | --- |
| 50 | 1235.50 | 72.56 | 17.03 |
| 100 | 2775.15 | 139.86 | 19.84 |
| 250 | 7224.00 | 343.14 | 21.64 |
| 500 | 15375.00 | 673.48 | 22.83 |
This benchmarking shows that Accelerated WEKA provides the most benefit for compute-intensive tasks on larger datasets. Small datasets like RBFa5k and RBFa5kn1k (possessing 100 and 1,000 instances, respectively) show poor speedups because the dataset is too small to make the overhead of moving data to GPU memory worthwhile.
Such behavior is noticeable in the A100 experiments (Table 4), where the architecture is more complex. The benefits start to kick in with datasets of 100,000 instances or more. For instance, the RBF datasets with 100,000 instances show roughly 3x and 4x speedups, which is still modest but shows improvement. Bigger datasets like the covertype dataset (~700,000 instances) or the RBFa50n1m dataset (1 million instances) show speedups of 56x and 271x, respectively. Note that for deep learning tasks, the speedup can exceed 20x even with the GTX 1080Ti.
Key takeaways
Accelerated WEKA helps you supercharge WEKA using RAPIDS, combining the efficient algorithm implementations of RAPIDS with an easy-to-use GUI. The installation process is simplified using a Conda environment, making it straightforward to use Accelerated WEKA from the beginning.
If you use Accelerated WEKA, please use the hashtag #AcceleratedWEKA on social media. Also, please refer to the documentation for the correct publication to cite Accelerated WEKA in academic work and find out more details about the project.
Contributing to Accelerated WEKA
WEKA is freely available under the GPL open-source license, and so is Accelerated WEKA. In fact, Accelerated WEKA is provided through Conda to automate the installation of the required tools for the environment, and the additions to the source code are published to the main packages for WEKA. Contributions and bug fixes can be submitted as patch files posted to the WEKA mailing list.
Acknowledgments
We would like to thank Ettikan Karuppiah, Nick Becker, Brian Kloboucher, and Dr. Johan Barthelemy from NVIDIA for the technical support they provided during the execution of this project. Their insights were essential in helping us reach the goal of efficient integration with the RAPIDS library. In addition, we would like to thank Johan Barthelemy for running benchmarks in extra graphic cards.
Convert to numpy
Hello members,
So my question is: I have a variable of type ops.Tensor, and I need to convert this variable to a NumPy array. I tried different solutions online, but in those cases the variable needs to be an ops.EagerTensor to convert it into a NumPy object (.numpy(), tf.make_ndarray, etc.). So how can I convert my tensor object to an eager tensor object, or directly to NumPy?
submitted by /u/Anonymous_Guy_12
Create a compact desktop cluster with four NVIDIA Jetson Xavier NX modules to accelerate training and inference of AI and deep learning workflows.
Following in the footsteps of large-scale supercomputers like the NVIDIA DGX SuperPOD, this post guides you through the process of creating a small-scale cluster that fits on your desk. Below is the recommended hardware and software to complete this project. This small-scale cluster can be utilized to accelerate training and inference of artificial intelligence (AI) and deep learning (DL) workflows, including the use of containerized environments from sources such as the NVIDIA NGC Catalog.
Hardware:
- 4x NVIDIA Jetson Xavier NX Dev Kits
- 4x MicroSD Cards (128GB+)
- 1x SD+microSD Card Reader
- 1x (Optional) Seeed Studio Jetson Mate Cluster Mini
- 1x (Optional) USB-C PD Power Supply (90w+)
- 1x (Optional) USB-C PD 100w Power Cable
While the Seeed Studio Jetson Mate, USB-C PD power supply, and USB-C cable are not required, they were used in this post and are highly recommended for a neat and compact desktop cluster solution.
Software:
For more information, see the NVIDIA Jetson Xavier NX development kit.
Installation
Write the JetPack image to a microSD card and perform initial JetPack configuration steps:
The first iteration through this post is targeted toward the Slurm control node (slurm-control). After you have the first node configured, you can either choose to repeat each step for each module, or you can clone this first microSD card for the other modules; more detail on this later.
For more information about the flashing and initial setup of JetPack, see Getting Started with Jetson Xavier NX Developer Kit.
While following the getting started guide above:
- Skip the wireless network setup portion as a wired connection will be used.
- When selecting a username and password, choose what you like and keep it consistent across all nodes.
- Set the computer’s name to be the target node you’re currently working with, the first being slurm-control.
- When prompted to select a value for Nvpmodel Mode, choose MODE_20W_6CORE for maximum performance.
After flashing and completing the getting started guide, run the following commands:
echo "`id -nu` ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/`id -nu`
sudo systemctl mask snapd.service apt-daily.service apt-daily-upgrade.service
sudo systemctl mask apt-daily.timer apt-daily-upgrade.timer
sudo apt update
sudo apt upgrade -y
sudo apt autoremove -y
Disable NetworkManager, enable systemd-networkd, and configure network [DHCP]:
sudo systemctl disable NetworkManager.service NetworkManager-wait-online.service NetworkManager-dispatcher.service network-manager.service
sudo systemctl mask avahi-daemon
sudo systemctl enable systemd-networkd
sudo ln -sf /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
# Write the systemd-networkd profile for eth0 (adjust the file name if your interface differs)
cat << EOF | sudo tee /etc/systemd/network/eth0.network > /dev/null
[Match]
Name=eth0

[Network]
DHCP=ipv4
MulticastDNS=yes

[DHCP]
UseHostname=false
UseDomains=false
EOF
sudo sed -i "/#MulticastDNS=/cMulticastDNS=yes" /etc/systemd/resolved.conf
sudo sed -i "/#Domains=/cDomains=local" /etc/systemd/resolved.conf
Configure the node hostname:
If you have already set the hostname in the initial JetPack setup, this step can be skipped.
[slurm-control]
sudo hostnamectl set-hostname slurm-control
sudo sed -i "s/127.0.1.1.*/127.0.1.1\t`hostname`/" /etc/hosts
[compute-node]
Compute nodes should follow a particular naming convention to be easily addressable by Slurm. Use a consistent identifier followed by a sequentially incrementing number (for example, node1, node2, and so on). In this post, I suggest using nx1, nx2, and nx3 for the compute nodes. However, you can choose anything that follows a similar convention.
sudo hostnamectl set-hostname nx[1-3]
sudo sed -i "s/127.0.1.1.*/127.0.1.1\t`hostname`/" /etc/hosts
Create users and groups for Munge and Slurm:
sudo groupadd -g 1001 munge
sudo useradd -m -c "MUNGE" -d /var/lib/munge -u 1001 -g munge -s /sbin/nologin munge
sudo groupadd -g 1002 slurm
sudo useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u 1002 -g slurm -s /bin/bash slurm
Install Munge:
sudo apt install libssl-dev -y
git clone https://github.com/dun/munge
cd munge
./bootstrap
./configure
sudo make install -j6
sudo ldconfig
sudo mkdir -m0755 -pv /usr/local/var/run/munge
sudo chown -R munge: /usr/local/etc/munge /usr/local/var/run/munge /usr/local/var/log/munge
Create or copy the Munge encryption keys:
[slurm-control]
sudo -u munge mungekey --verbose
[compute-node]
sudo sftp -s 'sudo /usr/lib/openssh/sftp-server' `id -nu`@slurm-control
Start Munge and test the local installation:
sudo systemctl enable munge
sudo systemctl start munge
munge -n | unmunge
Expected result:
STATUS: Success (0)
Verify that the Munge encryption keys match from a compute node to slurm-control:
[compute-node]
munge -n | ssh slurm-control unmunge
Expected result:
STATUS: Success (0)
Install Slurm (20.11.9):
cd ~
wget https://download.schedmd.com/slurm/slurm-20.11-latest.tar.bz2
tar -xjvf slurm-20.11-latest.tar.bz2
cd slurm-20.11.9
./configure --prefix=/usr/local
sudo make install -j6
Index the Slurm shared objects and copy the systemd service files:
sudo ldconfig -n /usr/local/lib/slurm
sudo cp etc/*.service /lib/systemd/system
Create directories for Slurm and apply permissions:
sudo mkdir -pv /usr/local/var/{log,run,spool} /usr/local/var/spool/{slurmctld,slurmd}
sudo chown slurm:root /usr/local/var/spool/slurm*
sudo chmod 0744 /usr/local/var/spool/slurm*
Create a Slurm configuration file for all nodes:
For this step, you can follow the included commands and use the following configuration file for the cluster (recommended). To customize variables related to Slurm, use the configuration tool.
cat << EOF | sudo tee /usr/local/etc/slurm.conf > /dev/null
#slurm.conf for all nodes#
ClusterName=SlurmNX
SlurmctldHost=slurm-control
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=2
SlurmctldPidFile=/usr/local/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/usr/local/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/usr/local/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/usr/local/var/spool/slurmctld
SwitchType=switch/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
JobCompType=jobcomp/none
SlurmctldDebug=info
SlurmctldLogFile=/usr/local/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/usr/local/var/log/slurmd.log
NodeName=nx[1-3] RealMemory=7000 Sockets=1 CoresPerSocket=6 ThreadsPerCore=1 State=UNKNOWN
PartitionName=compute Nodes=ALL Default=YES MaxTime=INFINITE State=UP
EOF
sudo chmod 0744 /usr/local/etc/slurm.conf
sudo chown slurm: /usr/local/etc/slurm.conf
Install Enroot 3.3.1:
cd ~
sudo apt install curl jq parallel zstd -y
arch=$(dpkg --print-architecture)
curl -fSsL -O https://github.com/NVIDIA/enroot/releases/download/v3.3.1/enroot_3.3.1-1_${arch}.deb
sudo dpkg -i enroot_3.3.1-1_${arch}.deb
Install Pyxis (0.13):
git clone https://github.com/NVIDIA/pyxis
cd pyxis
sudo make install -j6
Create the Pyxis plug-in directory and config file:
sudo mkdir /usr/local/etc/plugstack.conf.d
echo "include /usr/local/etc/plugstack.conf.d/*" | sudo tee /usr/local/etc/plugstack.conf > /dev/null
Link the Pyxis default config file to the plug-in directory:
sudo ln -s /usr/local/share/pyxis/pyxis.conf /usr/local/etc/plugstack.conf.d/pyxis.conf
Verify Enroot/Pyxis installation success:
srun --help | grep container-image
Expected result:
--container-image=[USER@][REGISTRY#]IMAGE[:TAG]|PATH
Finalization
When replicating the configuration across the remaining nodes, label the Jetson Xavier NX modules and/or the microSD cards with the assigned node name. This helps prevent confusion later on when moving modules or cards around.
There are two different methods in which you can replicate your installation to the remaining modules: manual configuration or cloning slurm-control. Read over both methods and choose which method you prefer.
Manually configure the remaining nodes
Follow the “Enable and start the Slurm service daemon” section below for your current module, then repeat the entire process for the remaining modules, skipping any steps tagged under [slurm-control]. When all modules are fully configured, install them into the Jetson Mate in their respective slots, as outlined in the “Install all Jetson Xavier NX modules into the enclosure” section.
Clone slurm-control installation for remaining nodes
To avoid repeating all installation steps for each node, clone the slurm-control node’s card as a base image and flash it onto all remaining cards. This requires a microSD-to-SD card adapter if you have only one multi-port card reader and want to do card-to-card cloning. Alternatively, creating an image file from the source slurm-control card onto the local machine and then flashing the target cards is also an option.
- Shut down the Jetson that you’ve been working with, remove the microSD card from the module, and insert it into the card reader.
- If you’re performing a physical card-to-card clone (using Balena Etcher, dd, or any other utility that does sector-by-sector writes), insert the blank target microSD into the SD card adapter, then insert it into the card reader.
- Identify which card is which for the source (microSD) and destination (SD card) in the application that you’re using and start the cloning process.
- If you are creating an image file, use a utility of your choice to create an image file from the slurm-control microSD card on the local machine, then remove that card and flash the remaining blank cards using that image.
- After cloning is completed, insert a cloned card into a Jetson module and power on. Configure the node hostname for a compute node, then proceed to enable and start the Slurm service daemon. Repeat this process for all remaining card/module pairs.
Enable and start the Slurm service daemon:
[slurm-control]
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
[compute-node]
sudo systemctl enable slurmd
sudo systemctl start slurmd
Install all Jetson Xavier NX modules into the enclosure
First power down any running modules, then remove them from their carriers. Install all Jetson modules into the Seeed Studio Jetson Mate, ensuring that the control node is placed in the primary slot labeled “MASTER”, and compute nodes 1-3 are placed in secondary slots labeled “WORKE 1, 2, and 3” respectively. Optional fan extension cables are available from the Jetson Mate kit for each module.
The video output on the enclosure is connected to the primary module slot, as is the vertical USB2 port, and USB3 port 1. All other USB ports are wired to the other modules according to their respective port numbers.
Troubleshooting
This section contains some helpful commands to assist in troubleshooting common networking and Slurm-related issues.
Test network configuration and connectivity
The following command should show eth0 in the routable state, with IP address information obtained from the DHCP server:
networkctl status
The command should respond with the local node’s hostname and .local as the domain (for example, slurm-control.local), along with DHCP-assigned IP addresses:
host `hostname`
Choose a compute node hostname that is configured and online. It should respond similarly to the previous command (for example, host nx1 returns nx1.local has address 192.168.0.1). This should also work for any other host that has an mDNS resolver daemon running on your LAN.
host [compute-node-hostname]
All cluster nodes should be pingable by all other nodes, and all local LAN IP addresses should be pingable as well, such as your router.
ping [compute-node-hostname/local-network-host/ip]
Test the external DNS name resolution and confirm that routing to the internet is functional:
ping www.nvidia.com
Check Slurm cluster status and node communication
The following command shows the current status of the cluster, including node states:
sinfo -lNe
If any nodes in the sinfo output show UNKNOWN or DOWN for their state, the following command signals the specified nodes to change their state and become available for job scheduling ([ ] specifies a range of numbers following the hostname ‘nx’):
scontrol update NodeName=hostname[1-3] State=RESUME
The following command runs hostname on all available compute nodes. Nodes should respond with their corresponding hostname in your console.
srun -N3 hostname
Summary
You’ve now successfully built a multi-node Slurm cluster that fits on your desk. There is a vast array of benchmarks, projects, workloads, and containers that you can now run on your mini-cluster. Feel free to share your feedback on this post and, of course, anything that your new cluster is being used for.
Power on and enjoy Slurm!
For more information, see the following resources:
Acknowledgments
Special thanks to Robert Sohigian, a technical marketing engineer on our team, for all the guidance in creating this post, providing feedback on the clarity of instructions, and for being the lab rat in multiple runs of building this cluster. Your feedback was invaluable and made this post what it is!
Leading security, storage, and networking vendors are joining the DOCA and DPU community.
The DPU, or data processing unit, is a new class of programmable processor that specializes in moving data around the data center and now joins CPUs and GPUs as the third pillar of modern computing. NVIDIA DOCA is core to the NVIDIA BlueField DPU offering because it provides ecosystem partners with an open platform to deliver the advanced networking, storage, and security services needed today.
DOCA unlocks data center innovation by enabling an open ecosystem and developer community to rapidly create applications and services on top of BlueField DPUs, using industry-standard open APIs and frameworks.
Integral to our customers’ success, and our own, is the collaboration with our ecosystem partners. For more than 15 years, our ecosystem partners have harnessed the power of CUDA to develop the world’s most effective accelerated applications for a multitude of use cases.
The NVIDIA CUDA Toolkit provides everything that is needed to develop GPU-accelerated applications. Similarly, the NVIDIA DOCA Software Framework is an open SDK that enables you to rapidly create applications and services on top of BlueField DPUs.
Where partners have achieved such success with NVIDIA GPUs and CUDA, we are emulating that formula with our DPU portfolio and DOCA. Moreover, we recognize that to deliver best-in-class solutions for customers, we need to partner with the world’s leading technology vendors. Proprietary applications have their place, but who better to provide world-class security, storage, and networking solutions, than the world’s leading vendors in those fields?
A meeting of the minds
During the last two years, our ecosystem partners have been delivering innovative solutions and services essential for digital transformation. The most turbulent period in recent history has forced us all to find new ways to collaborate and embrace technology at a rate never expected. Not only have we had to adapt as individuals, but organizations across the globe have been forced to re-think their day-to-day activities.
We work closely with our partners to define and create more DOCA libraries and services to address innovative use cases. More than ever, we’re witnessing a realignment between technology requirements in the data center and ever-changing business priorities. In turn, matching customers to ecosystem partners provides an opportunity to create customized technology solutions tuned to meet specific business objectives.
Today, NVIDIA is working with leading platform vendors and partners to integrate and expand DOCA support for commercial distributions on BlueField DPUs. Dozens of industry leaders, including VMware, Red Hat, DDN, Aria Cybersecurity, and Juniper Networks, have started to integrate their solutions using the DPU/DOCA architecture. You’ll start to see more new applications in the coming year.
Earlier this year, Palo Alto Networks, a global cybersecurity leader, developed the first next-generation firewall (NGFW) specifically designed to be accelerated by the BlueField DPU. This first-to-market, hardware-accelerated software NGFW is a prime example of how the BlueField DPU boosts performance and optimizes data center security coverage and efficiency.
Third-party developers can create and distribute DPU-accelerated applications with the DOCA SDK, which is fully integrated into the NGC catalog of containerized software. Such accelerated solutions will be wide-ranging, including advanced applications for infrastructure, storage, and security. It will be the key to unlocking data center innovation.
Try DOCA today
NVIDIA DOCA is the key to unlocking the potential of the NVIDIA BlueField DPU to offload, accelerate, and isolate data center workloads. With DOCA, you can program the data center infrastructure of tomorrow by creating software-defined, cloud-native, DPU-accelerated services with zero-trust protection to address the increasing performance and security demands of modern data centers.
To start developing on DOCA:
- Apply for the DOCA Early Access program.
- Take the free Introduction to DOCA for DPUs course.
- Find us on GitLab.
Join us on July 20 for a webinar highlighting how using NVIDIA A100 GPUs can help map and location-based service providers speed up map creation and workflows, while reducing costs.
This post covers best practices for Vulkan clearing and presenting on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips.
With the recent Vulkan 1.3 release, it’s timely to add some Vulkan-specific tips that are not necessarily explicitly covered by the other Advanced API Performance posts. In addition to introducing new Vulkan 1.3 core features, this post shares a set of good practices for clearing and presenting surfaces.
Vulkan 1.3 Core
Vulkan 1.3 brings improvements through extensions to key parts in the API. This section summarizes our recommendations for obtaining the best performance when working with a number of these new features.
Recommended
- Skip framebuffer and render pass object setup by taking advantage of dynamic rendering.
- Reduce the number of pipeline state objects with core support for dynamic states.
- Simplify synchronization and avoid unnecessary image layout transitions by using the improved synchronization API.
Clears
This section provides a guideline for achieving performance when invoking clear commands. This type of command clears a region within a color image or within the bound framebuffer attachments.
- Use VK_ATTACHMENT_LOAD_OP_CLEAR to clear attachments at the beginning of a subpass instead of clear commands. This can allow the driver to skip loading unnecessary data.
- Outside of a render pass instance, prefer the usage of vkCmdClearColorImage instead of a CS invocation to clear images. This path enables bandwidth optimizations.
- If possible, batch clears to avoid interleaving single clears between dispatches.
- Coordinate VkClearDepthStencilValue with the test function to achieve better depth testing performance:
  - 0.5 ≤ depth value: VK_COMPARE_OP_LESS_OR_EQUAL
  - 0.0 ≤ depth value: VK_COMPARE_OP_GREATER_OR_EQUAL
Not recommended
- Specifying more than 30 unique clear values per application (or more than 15 on Turing) does not make the most of clear bandwidth optimizations.
- “Clear shaders” should be avoided unless there is overlap of a compute clear with a neighboring dispatch.
Present
The following section offers insight into the preferred way of using the presentation modes supported by a surface in order to achieve good performance.
Recommended
- Rely on VK_PRESENT_MODE_FIFO_KHR or VK_PRESENT_MODE_MAILBOX_KHR for VSync on. Noteworthy aspects:
  - VK_PRESENT_MODE_FIFO_KHR is preferred as it does not drop frames and lacks tearing.
  - VK_PRESENT_MODE_MAILBOX_KHR may offer lower latency, but frames might be dropped.
  - VK_PRESENT_MODE_FIFO_RELAXED_KHR is compelling when your application only occasionally lags behind the refresh rate, allowing tearing so that it can “catch back up”.
- Rely on VK_PRESENT_MODE_IMMEDIATE_KHR for VSync off.
- On Windows systems, use the VK_EXT_full_screen_exclusive extension to bypass compositing.
- Handle both out-of-date and suboptimal swapchains to re-create stale swapchains when windows resize, for example.
- For latency-sensitive applications, use the Vulkan Reflex SDK to minimize latency by completing game engine work just-in-time for rendering.
More information
For more information about using Vulkan with NVIDIA GPUs, see Vulkan Do’s and Don’ts.
To view the Vulkan API state, use the API Inspector in Nsight Graphics (free download).
With Nsight Systems, you can view Vulkan usage on a unified CPU-GPU timeline, investigate stutter, and track GPU cold spots to their CPU origins. Download Nsight Systems for free.
Acknowledgments
Thanks to Piers Daniell, Ivan Fedorov, Adam Moss, Ryan Prescott, Joshua Schnarr, Juha Sjöholm, and Márton Tamás for their feedback and contributions.
NVIDIA is helping push the limits of training AI generalist agents with a new open-sourced framework called MineDojo.
Using video games as a medium for training AI has become a popular method within the AI research community. These autonomous agents have had great success in Atari games, StarCraft, Dota, and Go. But while these advancements have been popular for AI research, the agents do not generalize beyond a very specific set of tasks, unlike humans, who continuously learn from open-ended tasks.
Building an embodied agent that can attain high-level performance across a wide spectrum of tasks has been one of the greatest challenges facing the AI research community. In order to build a successful generalist agent, users need an environment that supports a multitude of tasks and goals, a large-scale database of multimodal knowledge, and a flexible and scalable agent architecture.
Enter Minecraft, the most played game in the world. With its flexible gameplay, players can perform a wide variety of actions, ranging from building a medieval castle to exploring dangerous environments to gathering resources for building a Nether Portal on the way to battling the Ender Dragon. This creative atmosphere is the perfect environment for training an embodied agent.
To take advantage of such an optimal training ground, NVIDIA researchers created MineDojo. MineDojo has built a massive framework that features a simulation suite with thousands of diverse open-ended tasks and an internet-scale knowledge base. Building an AI powerful enough to complete these tasks would not be possible without an expansive data library.
The mission of MineDojo is to promote research toward the goal of generally capable embodied agents. For an embodied agent to be successful, the environment needs to provide an almost infinite number of open-ended tasks and actions. This is done by giving the agent access to a large database of information to pull knowledge from and then apply what it learns. The agent’s training also needs to be scalable, so that the large-scale knowledge can be converted into actionable insights later on.
In MineDojo, the embodied agent has access to three internet-scale datasets. With 750,000 Minecraft YouTube videos—amounting to over 33 years of Minecraft videos—pulled into the database, over 2 million words were transcribed.
MineDojo also scraped over 6,000 web pages from the Minecraft Wiki, with over 2.2 million bounding boxes created for the visual elements of those pages. Also captured were millions of Reddit threads related to Minecraft and the variety of activities one can do within the game. These include questions about how to solve certain tasks, showcases of achievements and creations in image and video formats, and general tips and tricks.
MineDojo offers a set of simulator APIs that users can use to train their AI agents. It provides unified observation and action spaces to help facilitate the agent to adapt to new scenarios and multitask. Additionally, using the APIs users can take advantage of all three worlds within the Minecraft universe to expand on the number of tasks and actions the agent can do.
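A minimal rollout sketch using the simulator API is shown below; the task ID and parameters follow the public MineDojo examples, and exact names may vary between releases:

# Minimal MineDojo rollout sketch; task_id and image_size follow public examples
# and may differ in your installed release.
import minedojo

env = minedojo.make(
    task_id="harvest_wool_with_shears_and_sheep",
    image_size=(160, 256),
)

obs = env.reset()
for step in range(100):
    action = env.action_space.no_op()
    action[0] = 1  # move forward
    obs, reward, done, info = env.step(action)
    if done:
        break
env.close()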
Within the simulator, MineDojo splits the benchmarking tasks into two categories: programmatic tasks and creative tasks.
Programmatic tasks are well defined and can be easily evaluated, such as “surviving 3 days” or “obtain one unit of pumpkin in the forest.”
Creative tasks are much more open-ended, such as “build a beautiful beach house.” It is very difficult to define what qualifies as a beach house by an explicit set of rules. These tasks are meant to encourage the research community to develop more human-like and imaginative AI agents.
Natural language is a cornerstone of the MineDojo framework. It aids open-vocabulary understanding, provides grounding for image and video modalities, and serves as an intuitive interface to specify instructions. Combined with the latest speech recognition technology, it is possible in the near future to talk to an AI Agent as you would to a friend in multiplayer co-op mode.
For example: “plant a row of blue flowers in front of our house. Add some gold decorations to the door frame. Let’s go explore the cave next to the river,” could all be possible.
Proof of concept using MineCLIP
To help promote the project and provide a proof of concept, the MineDojo researchers have implemented a single language-prompted agent to complete several complex tasks within Minecraft, called MineCLIP. This novel agent learning algorithm takes advantage of the 33 years worth of Minecraft YouTube videos. However, it is good to point out that any agent can use any or all three sections of the Internet-scale database at the user’s discretion.
MineCLIP as an embodied agent learns from the YouTube videos the concepts and actions of Minecraft without human hand labeling. YouTubers typically narrate what they are doing as they stream the gameplay video. MineCLIP is a large Transformer model that learns to associate a video clip and its corresponding English transcripts.
This association score can be provided as a reward signal to guide a reinforcement learning agent towards completing the task. For the example task, “shear a sheep to obtain wool,” MineCLIP gives a high reward to the agent if it approaches the sheep, but a low reward if the agent wanders aimlessly. It is even capable of multitasking within the game to complete a wide range of simple tasks.
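Conceptually, the reward shaping is simple: the association score between the agent’s recent frames and the language goal is added to the sparse environment reward. The sketch below is a generic stand-in rather than the MineCLIP API, with random embeddings in place of the learned video and text encoders:

# Generic sketch of MineCLIP-style reward shaping. The embeddings here are
# random stand-ins for the outputs of the learned video and text encoders.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def shaped_reward(env_reward: float,
                  clip_embedding: np.ndarray,
                  goal_embedding: np.ndarray,
                  weight: float = 1.0) -> float:
    # Add the video-text association score to the sparse environment reward.
    return env_reward + weight * cosine_similarity(clip_embedding, goal_embedding)

# Example with stand-in embeddings for the goal "shear a sheep to obtain wool".
rng = np.random.default_rng(0)
clip_emb, goal_emb = rng.standard_normal(512), rng.standard_normal(512)
print(shaped_reward(0.0, clip_emb, goal_emb))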
Building generally capable embodied agents is a holy grail goal of AI research. MineDojo provides a benchmark of thousands of tasks, an internet-scale knowledge base, and an innovative algorithm as a first step toward solving this grand challenge.
Stay posted to see what new models and techniques the research community comes up with next! Start using MineDojo today.
Read more about the framework and its findings. Explore other NVIDIA research.
Resources for hands on TF
Hi all,
Hopefully this is not against any group rules, but I’m a DS master’s degree student coming from a CS bachelor, and I really love deep learning and all the magic that we can do solving optimization problems, even without NNs involved.
I have a good preparation from the theoretical POV thanks to the university, and I’ve manually coded many optimization problems, calculating gradients by hand. However, I love the idea of autodiff that TF and PyTorch give out of the box, and I’m really looking forward to learning TF from the ground up, but I really struggle to find material that does not lead to just stacking layers on a sequential model from Keras…
My aim is to be able to take an idea of (example) a layer, and code it using tensors and autodiff from TF, and not looking for online code that already solves that (or even maybe optimizers, since I’m pretty familiar to many other not already implemented in TF)
Do you have any online resource or book that you feel is a good starting point? I usually learn hands-on and by reading docs, however I feel like TF is better learned the way it’s supposed to be, to fully grasp everything that it can offer.
In other words, I have a good theoretical preparation in ML/DL but I feel I’m lacking on the more practical side… so… how/where can I learn to use GradientTape and those magic things (everything is accepted: online or offline, paper or digital, paid or not paid)?
submitted by /u/bertosini