
Speed Up Machine Learning Models with Accelerated WEKA

Accelerated WEKA integrates the WEKA workbench with GPU-enabled Python and Java libraries to speed up the training and prediction times of machine learning models.

In recent years, there has been a surge in building and adopting machine learning (ML) tools. The use of GPUs to accelerate increasingly compute-intensive models has been a prominent trend.

The Accelerated WEKA project broadens access to this trend by providing an easy entry point for using GPUs with well-known WEKA algorithms through the integration of the open-source RAPIDS libraries.

In this post, you will be introduced to Accelerated WEKA and learn how to leverage GPU-accelerated algorithms through the WEKA graphical user interface (GUI). This open-source Java software is well suited to beginners who want access to a wide variety of ML algorithms from different environments and packages.

What is Accelerated WEKA?

Accelerated WEKA unifies WEKA, a well-known open-source Java workbench, with new technologies that leverage the GPU to shorten the execution time of ML algorithms. Aimed at users without expertise in system configuration and coding, it has two benefits: an easy installation and a GUI that guides the configuration and execution of ML tasks.

Accelerated WEKA is a collection of packages available for WEKA, and it can be extended to support new tools and algorithms.

What is RAPIDS?

RAPIDS is a collection of open-source Python libraries for developing and deploying data science workloads on NVIDIA GPUs. Popular libraries include cuDF for GPU-accelerated DataFrame processing and cuML for GPU-accelerated machine learning algorithms. RAPIDS APIs conform as closely as possible to their CPU counterparts, such as pandas and scikit-learn.
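
As a minimal illustration of this API parity (a sketch, assuming a working RAPIDS installation with cuDF and cuML), the following Python snippet mirrors a typical pandas plus scikit-learn workflow, with both the data and the model living in GPU memory:

import cudf
from cuml.linear_model import LinearRegression

# cuDF mirrors the pandas DataFrame API, but the data lives in GPU memory
df = cudf.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0],
                     "x2": [0.5, 0.9, 1.7, 2.1],
                     "y":  [2.0, 4.1, 5.9, 8.2]})

# cuML follows the scikit-learn fit/predict convention
model = LinearRegression()
model.fit(df[["x1", "x2"]], df["y"])
print(model.predict(df[["x1", "x2"]]))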

Accelerated WEKA architecture

The building blocks of Accelerated WEKA are packages like WekaDeeplearning4j and wekaRAPIDS (inspired by wekaPython). WekaDeeplearning4j (WDL4J) provides WEKA wrappers for the Deeplearning4j library and already supports GPU processing, but it has very specific requirements in terms of libraries and environment configuration.

For Python users, wekaPython initially provided Python integration by creating a server and communicating with it through sockets. With it, users can execute scikit-learn ML algorithms (or even XGBoost) inside the WEKA workbench. wekaRAPIDS provides integration with the RAPIDS cuML library using the same technique as wekaPython.

Together, both packages provide enhanced functionality and performance inside the user-friendly WEKA workbench. Accelerated WEKA goes a step further in the direction of performance by improving the communication between the JVM and the Python interpreter, using alternatives like Apache Arrow and GPU memory sharing for efficient data transfer between the two languages.
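
To illustrate the kind of mechanism involved (a simplified sketch using pyarrow; the actual wekaRAPIDS transport is more involved), Apache Arrow serializes a table once into a language-agnostic columnar buffer that the consumer can read without per-value parsing:

import pyarrow as pa

# Producer side (in Accelerated WEKA, the JVM): write the dataset once
# into Arrow's columnar IPC stream format
table = pa.table({"attr1": [1.0, 2.0, 3.0], "class": [0, 1, 0]})
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# Consumer side (the Python interpreter): read the buffer back, avoiding
# value-by-value socket parsing
received = pa.ipc.open_stream(buf).read_all()
print(received.to_pandas())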

Accelerated WEKA also provides integration with the RAPIDS cuML library, which implements machine learning algorithms accelerated on NVIDIA GPUs. Some cuML algorithms even support multi-GPU execution.
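
For the multi-GPU case, cuML provides Dask-based estimators that spread work across the GPUs of a machine. The following is a minimal sketch (assuming the dask-cuda package and the cuml.dask estimators are installed; Accelerated WEKA drives this machinery for you):

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from cuml.dask.datasets import make_regression
from cuml.dask.linear_model import LinearRegression

if __name__ == "__main__":
    # One Dask worker is started per visible GPU on this machine
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # Synthetic regression data, partitioned across the GPU workers
    X, y = make_regression(n_samples=1_000_000, n_features=20,
                           n_parts=4, client=client)

    # The Dask variant of the estimator trains on all GPUs in the cluster
    model = LinearRegression(client=client)
    model.fit(X, y)
    print(model.predict(X).compute()[:5])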

Supported algorithms

The algorithms currently supported by Accelerated WEKA are:

  • LinearRegression
  • LogisticRegression
  • Ridge
  • Lasso
  • ElasticNet
  • MBSGDClassifier
  • MBSGDRegressor
  • MultinomialNB
  • BernoulliNB
  • GaussianNB
  • RandomForestClassifier
  • RandomForestRegressor
  • SVC
  • SVR
  • LinearSVC
  • KNeighborsRegressor
  • KNeighborsClassifier

The algorithms supported by Accelerated WEKA in multi-GPU mode are:

  • KNeighborsRegressor
  • KNeighborsClassifier
  • LinearRegression
  • Ridge
  • Lasso
  • ElasticNet
  • MultinomialNB
  • CD

Using the Accelerated WEKA GUI

Ease of use was a main goal during the design of Accelerated WEKA. The following steps outline how to set it up on a system, along with a brief example.

Please refer to the documentation for more information and a comprehensive getting-started guide. The only prerequisite for Accelerated WEKA is having Conda installed on your system.

  • Accelerated WEKA is installed through Conda, a package and environment management system. This means a single command can install the project and all of its dependencies. For example, on a Linux machine, issue the following command in a terminal to install Accelerated WEKA and all dependencies:

conda create -n accelweka -c rapidsai -c nvidia -c conda-forge -c waikato weka

  • After Conda has created the environment, activate it with the following command:

conda activate accelweka
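
  • Optionally, verify that the GPU-accelerated libraries are importable. This quick sanity check assumes the environment includes cuML, which the RAPIDS channels above provide:

python -c "import cuml; print(cuml.__version__)"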

  • This terminal instance now has all the dependencies for Accelerated WEKA loaded. Launch the WEKA GUI Chooser with the command:

weka

  • Figure 1 shows the WEKA GUI Chooser window. From there, click the Explorer button to access the functionalities of Accelerated WEKA.
Figure 1. WEKA GUI Chooser window. This is the first window that appears when you start WEKA
  • In the WEKA Explorer window (Figure 2), click the Open file button to select a dataset file. WEKA works natively with ARFF files but can also read CSVs. Converting from CSV can be straightforward or may require some configuration by the user, depending on the attribute types.
Figure 2. In the WEKA Explorer window, users can import datasets, check statistics about the attributes, and apply filters to the dataset as preprocessing
  • The WEKA Explorer window with a dataset loaded is shown in Figure 3. Assuming one does not want to preprocess the data, clicking the Classify tab will present the classification options to the user. 
Figure 3. WEKA Explorer window with a dataset loaded. After loading the dataset (either from an ARFF file or a CSV file) the attribute names appear on the left. Information regarding the selected attribute appears in the upper right. A chart containing the distribution of the class according to the selected attribute is viewable in the lower right

The Classify tab is presented in Figure 4. Clicking the Choose button shows the implemented classifiers. Some might be disabled because of the dataset characteristics. To use Accelerated WEKA, select rapids.CuMLClassifier. After that, clicking the bold CuMLClassifier text opens the options window for the classifier.

Figure 4. In the WEKA Classify tab, the user can configure the classification algorithm and the test options that are going to be used in the experiment using the previously selected dataset
  • Figure 5 shows the options window for CuMLClassifier. With the RAPIDS learner field, the user can choose the desired classifier among the ones supported by the package. The Learner parameters field is for modifying the cuML parameters, details of which can be found in the cuML documentation.

The other options let the user fine-tune the attribute conversion, configure which Python environment is used, and set the number of decimal places the algorithm should operate with. For this tutorial, select Random Forest Classifier and keep the default configuration for everything else. Clicking OK closes the window and returns to the previous tab.

Figure 5. With the WEKA Classifier configuration window, the user can configure the parameters of the selected classifier. In this case, it is showing the newly integrated CuMLClassifier options with the RandomForestClassifier learner selected
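
Conceptually, CuMLClassifier hands the converted dataset to the Python side and drives cuML there. What it runs is roughly equivalent to the following sketch (hedged; the real wrapper also handles WEKA's attribute conversion and data interchange, and X and y here are synthetic stand-ins):

import numpy as np
from cuml.ensemble import RandomForestClassifier

# Synthetic stand-in for the dataset WEKA converts and ships to Python
rng = np.random.default_rng(0)
X = rng.random((1000, 10), dtype=np.float32)
y = (X[:, 0] > 0.5).astype(np.int32)

# Default cuML parameters, matching the tutorial's configuration
model = RandomForestClassifier()
model.fit(X, y)
print(model.predict(X[:5]))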
  • After configuring the classifier according to the previous step, its parameters are shown in the text field beside the Choose button. After you click Start, WEKA begins executing the chosen classifier on the dataset.

Figure 6 shows the classifier in action. The Classifier output area shows debug and general information about the experiment, such as parameters, classifier, dataset, and test options. The status bar shows the current state of the execution, and the WEKA bird at the bottom animates, flipping from side to side while the experiment is running.

Figure 6. WEKA Classify tab with the chosen classification algorithm in progress
  • After the algorithm finishes the task, it outputs a summary of the execution with information about predictive performance and the time taken. In Figure 7, the output shows the results of 10-fold cross-validation using the RandomForestClassifier from cuML through CuMLClassifier.
Figure 7. WEKA Classify tab after the experiment has been completed
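
Because cuML estimators follow the scikit-learn API, a 10-fold cross-validation like the one WEKA just ran can also be reproduced directly in Python (a sketch assuming scikit-learn is installed alongside cuML and the two remain interface-compatible; WEKA's own cross-validation plumbing differs):

import numpy as np
from sklearn.model_selection import cross_val_score
from cuml.ensemble import RandomForestClassifier

# Synthetic binary classification data standing in for a real dataset
rng = np.random.default_rng(0)
X = rng.random((5000, 10), dtype=np.float32)
y = ((X[:, 0] + X[:, 1]) > 1.0).astype(np.int32)

# scikit-learn drives the folds; cuML trains each fold on the GPU
scores = cross_val_score(RandomForestClassifier(), X, y, cv=10)
print(scores.mean())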

Benchmarking Accelerated WEKA

We evaluated the performance of Accelerated WEKA by comparing the execution time of the algorithms on the CPU with the execution time using Accelerated WEKA. The hardware used in the experiments was an i7-6700K, a GTX 1080Ti, and a DGX Station with four A100 GPUs. Unless stated otherwise, the benchmarks use a single GPU.

We used datasets with different characteristics for the benchmarks. Some were synthetic, like those from the RDG and RBF generators, for better control over the numbers of attributes and instances. The RDG generator builds instances based on decision lists. The default configuration has 10 attributes, 2 classes, a minimum rule size of 1, and a maximum rule size of 10. We changed the minimum and maximum rule sizes to 5 and 20, respectively. With this generator, we created datasets with 1, 2, 5, and 10 million instances, as well as one with 5 million instances and 20 attributes.
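
For reference, such a dataset can be generated from the command line with WEKA's built-in RDG1 generator. The command below is a sketch; the flags (-n instances, -a attributes, -c classes, -M/-R minimum/maximum rule size, -o output file) reflect WEKA's datagenerator options as we recall them, and running the generator with -h lists the authoritative set:

java weka.datagenerators.classifiers.classification.RDG1 -n 1000000 -a 10 -c 2 -M 5 -R 20 -o RDG1_1m.arff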

The RBF generator creates a random set of centers for each class and then generates instances by taking random offsets from the centers for the attribute values. The number of attributes is indicated by the suffix a__ (for example, a5k means 5,000 attributes), and the number of instances is indicated by the suffix n__ (for example, n10k means 10,000 instances).

Lastly, we used the HIGGS dataset, which contains kinematic properties measured by particle detectors in a particle accelerator. The first 5 million instances of the HIGGS dataset were used to create HIGGS_5m.
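
Such a slice is easy to reproduce, for example with cuDF (a sketch, assuming the UCI HIGGS.csv file is present locally):

import cudf

# The UCI HIGGS file has no header row; the first column is the class label
higgs = cudf.read_csv("HIGGS.csv", header=None, nrows=5_000_000)
higgs.to_csv("HIGGS_5m.csv", index=False)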

Tables 1 through 4 show the results for the wekaRAPIDS integration, directly comparing the baseline CPU execution with the Accelerated WEKA execution. The results for WDL4J are shown in Table 5.

| XGBoost (CV) dataset | Baseline, i7-6700K (seconds) | AWEKA SGM, GTX 1080Ti (seconds) | Speedup |
|---|---|---|---|
| RDG1_1m | 266.59 | 65.77 | 4.05 |
| RDG1_2m | 554.34 | 122.75 | 4.52 |
| RDG1_5m | 1423.34 | 294.40 | 4.83 |
| RDG1_10m | 2795.28 | 596.74 | 4.68 |
| RDG1_5m_20a | 2664.39 | 403.39 | 6.60 |
| RBFa5k | 17.16 | 15.75 | 1.09 |
| RBFa5kn1k | 110.14 | 25.43 | 4.33 |
| RBFa5kn5k | 397.83 | 49.38 | 8.06 |
Table 1. Execution time of experiments with XGBoost using cross-validation comparing the baseline CPU execution time with the Accelerated WEKA execution time while sharing GPU memory on a GTX 1080Ti GPU
| XGBoost (no-CV) dataset | Baseline, i7-6700K (seconds) | AWEKA CSV, GTX 1080Ti (seconds) | Speedup | AWEKA CSV, A100 (seconds) | Speedup |
|---|---|---|---|---|---|
| RDG1_1m | 46.40 | 21.19 | 2.19 | 22.69 | 2.04 |
| RDG1_2m | 92.87 | 34.76 | 2.67 | 35.42 | 2.62 |
| RDG1_5m | 229.38 | 73.49 | 3.12 | 65.16 | 3.52 |
| RDG1_10m | 461.83 | 143.08 | 3.23 | 106.00 | 4.36 |
| RDG1_5m_20a | 268.98 | 73.31 | 3.67 | — | — |
| RBFa5k | 5.76 | 7.73 | 0.75 | 8.68 | 0.66 |
| RBFa5kn1k | 23.59 | 13.38 | 1.76 | 19.84 | 1.19 |
| RBFa5kn5k | 78.68 | 34.61 | 2.27 | 29.84 | 2.64 |
| HIGGS_5m | 214.77 | 169.48 | 1.27 | 76.82 | 2.80 |
Table 2. Execution time of experiments with XGBoost without cross-validation, comparing the baseline CPU execution time with the Accelerated WEKA execution time while sending a CSV file through sockets, on a GTX 1080Ti and an A100 GPU. Dataset loading times were excluded
| RandomForest (CV) dataset | Baseline, i7-6700K (seconds) | AWEKA SGM, GTX 1080Ti (seconds) | Speedup |
|---|---|---|---|
| RDG1_1m | 494.27 | 97.55 | 5.07 |
| RDG1_2m | 1139.86 | 200.93 | 5.67 |
| RDG1_5m | 3216.40 | 511.08 | 6.29 |
| RDG1_10m | 6990.00 | 1049.13 | 6.66 |
| RDG1_5m_20a | 5375.00 | 825.89 | 6.51 |
| RBFa5k | 13.09 | 29.61 | 0.44 |
| RBFa5kn1k | 42.33 | 49.57 | 0.85 |
| RBFa5kn5k | 189.46 | 137.16 | 1.38 |
Table 3. Execution time of experiments with Random Forest using cross-validation comparing the baseline CPU execution time with the Accelerated WEKA execution time while sharing GPU memory on a GTX 1080Ti GPU
| KNN (no-CV) dataset | Baseline, AMD EPYC 7742, 4 cores (seconds) | wekaRAPIDS, A100 (seconds) | Speedup | wekaRAPIDS, 4x A100 (seconds) | Speedup |
|---|---|---|---|---|---|
| covertype | 3755.80 | 67.05 | 56.01 | 42.42 | 88.54 |
| RBFa5kn5k | 6.58 | 59.94 | 0.11 | 56.21 | 0.12 |
| RBFa5kn10k | 11.54 | 62.98 | 0.18 | 59.82 | 0.19 |
| RBFa500n10k | 2.40 | 44.43 | 0.05 | 39.80 | 0.06 |
| RBFa500n100k | 182.97 | 65.36 | 2.80 | 45.97 | 3.98 |
| RBFa50n10k | 2.31 | 42.24 | 0.05 | 37.33 | 0.06 |
| RBFa50n100k | 177.34 | 43.37 | 4.09 | 37.94 | 4.67 |
| RBFa50n1m | 21021.74 | 77.33 | 271.84 | 46.00 | 456.99 |
Table 4. Execution time of experiments with KNN without cross-validation, comparing the baseline CPU execution time with the Accelerated WEKA execution on one and four NVIDIA A100 GPUs
| Epochs | Baseline, i7-6700K (seconds) | WDL4J, GTX 1080Ti (seconds) | Speedup |
|---|---|---|---|
| 50 | 1235.50 | 72.56 | 17.03 |
| 100 | 2775.15 | 139.86 | 19.84 |
| 250 | 7224.00 | 343.14 | 21.64 |
| 500 | 15375.00 | 673.48 | 22.83 |
Table 5. Execution time of experiments with a 3,230,621 parameter neural network comparing the baseline CPU execution time with the Accelerated WEKA execution on a GTX 1080Ti GPU. The experiments used a small subset of the MNIST dataset while increasing the number of epochs

This benchmarking shows that Accelerated WEKA provides the most benefit for compute-intensive tasks with larger datasets. Small datasets like RBFa5k and RBFa5kn1k (with 100 and 1,000 instances, respectively) show poor speedups because the datasets are too small to make the overhead of moving data to GPU memory worthwhile.

Such behavior is even more noticeable in the A100 experiments (Table 4), where the architecture is more complex. The benefits start to kick in with datasets of 100,000 instances or more. For instance, the RBF datasets with 100,000 instances show speedups of roughly 3x and 4x, which is still modest but shows improvement. Bigger datasets like covertype (~580,000 instances) or RBFa50n1m (1 million instances) show speedups of 56x and 271x, respectively. Note that for deep learning tasks, the speedup can reach over 20x even on the GTX 1080Ti.

Key takeaways

Accelerated WEKA supercharges WEKA with RAPIDS, combining the efficient algorithm implementations of RAPIDS with WEKA's easy-to-use GUI. The installation process is simplified through a Conda environment, making Accelerated WEKA straightforward to use from the start.

If you use Accelerated WEKA, please use the hashtag #AcceleratedWEKA on social media. Also, please refer to the documentation for the correct publication to cite for Accelerated WEKA in academic work, and to find more details about the project.

Contributing to Accelerated WEKA

WEKA is freely available under the GPL open-source license, and so is Accelerated WEKA. Accelerated WEKA is provided through Conda to automate the installation of the required tools, and the additions to the source code are published in the main WEKA packages. Contributions and bug fixes can be submitted as patch files posted to the WEKA mailing list.

Acknowledgments

We would like to thank Ettikan Karuppiah, Nick Becker, Brian Kloboucher, and Dr. Johan Barthelemy from NVIDIA for the technical support they provided during the execution of this project. Their insights were essential in helping us reach the goal of efficient integration with the RAPIDS library. In addition, we would like to thank Johan Barthelemy for running benchmarks on additional graphics cards.
