Statistics and machine learning with Weka
One for All
Everyone has probably heard of machine learning, but how exactly does it work? Does it mean that an intelligent machine makes decisions on behalf of humans? In a way, yes, but strictly speaking, no. You might want to replace the term "intelligent machine" with "efficient algorithm" and add that this algorithm works with data. In doing so, it delivers a view that captures the essence of the data. Simply put, machine learning focuses on building models that learn from existing data and then uses those models to make logical decisions without requiring human intervention. The methods used to learn these models are the algorithms.
A variety of algorithms exist, but none of them is suitable for every case. An algorithm that performs well on one data collection can fail on another, which is why researchers apply different algorithms to a given dataset to see which ones work. Programming all of these methods yourself would be a daunting task, yet it is also tricky to find a platform that provides ready-made algorithms. Weka not only offers researchers a large number of ready-made machine learning algorithms, it also has features such as visualization and preprocessing.
Weka Basics
The open source Weka is licensed under the GNU General Public License (GPL) and was created at the University of Waikato in Hamilton, New Zealand. Interestingly, Weka is not an abbreviation or engineering jargon. It is the name of a flightless bird that lives on the islands of New Zealand.
Written in Java, Weka runs on any operating system and hardware platform and is under constant development. Each update includes new features and ditches less popular ones. Version 3.8.6, released February 21, 2022, was current at the time this article was written. Version 3.9 was also available, but it might have contained some bugs, so I avoided it at the time. What makes the tool unique is the variety of methods it offers, covering almost all aspects of machine learning through a common interface.
The Weka wiki [1] holds a wealth of information and is where you will find installation instructions for Linux, macOS, and Windows. After the install, Weka can be used either directly in the Weka graphical user interface (GUI) or by application programming interface (API) calls in Java code. In this article, I only look at the GUI.
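Although the rest of this article sticks to the GUI, a quick look at the API route may be useful. The following minimal sketch loads a dataset in Java; it assumes weka.jar is on the classpath, and the file path is just an example:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadData {
  public static void main(String[] args) throws Exception {
    // Load an ARFF file; DataSource also understands CSV and other formats
    DataSource source = new DataSource("/usr/share/doc/weka/examples/diabetes.arff");
    Instances data = source.getDataSet();
    // By convention, the last attribute holds the class label
    data.setClassIndex(data.numAttributes() - 1);
    System.out.println("Loaded " + data.numInstances() + " instances with " + data.numAttributes() + " attributes.");
  }
}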
The first window you encounter is the Weka GUI Chooser (Figure 1), which you use to access one of five views:
- The Explorer tab lets you load the data you want to use and provides options for preprocessing and initial data analysis (Figure 2).
- Under Experimenter, you set up and run targeted experiments, analyze the results, and compare the suitability of different methods for given data collections (Figure 3).
- KnowledgeFlow provides a graphical representation of the flow of information and a process-oriented view of the methods used. You can use this tool to design a machine learning pipeline from data input through results output, which you then run and evaluate in Weka.
- Workbench combines all the views already listed in one window (Figure 4).
- SimpleCLI is an alternative to the GUI that allows you to enter commands to control Weka.
Weka Explorer, the most widely used interface, provides tabs for a range of functions, including Preprocess (data preprocessing), Classify (classification algorithms), Cluster (clustering algorithms), Associate (association rules), Select attributes (feature selection), and Visualize (data visualization).
In the Preprocess tab you can load the data from a file, a database, or a URL. Files are usually tabular in structure, consisting of columns and rows. Each row represents an object (known as an instance in Weka), and each column represents a property of the object under investigation (an attribute).
Weka supports real (floating-point), integer, string, and nominal (i.e., categorical values such as yes and no) data types. Moreover, Weka supports a special extended CSV format known as the attribute-relation file format (ARFF). ARFF files have a header with information about the names and data types of the attributes.
After installing the Weka package, you will find example datasets in the /usr/share/doc/weka/examples/ subfolder in the form of ARFF files that you can use to experiment with the tool. Listing 1 shows an excerpt from diabetes.arff. To load this file in Explorer, press the Open file button.
Listing 1
diabetes.arff

@relation pima_diabetes

@attribute 'preg' numeric
@attribute 'plas' numeric
@attribute 'pres' numeric
@attribute 'skin' numeric
@attribute 'insu' numeric
@attribute 'mass' numeric
@attribute 'pedi' numeric
@attribute 'age' numeric
@attribute 'class' {tested_negative, tested_positive}

@data
6,148,72,35,0,33.6,0.627,50,tested_positive
1,85,66,29,0,26.6,0.351,31,tested_negative
8,183,64,0,0,23.3,0.672,32,tested_positive
1,89,66,23,94,28.1,0.167,21,tested_negative
[...]
You could load a CSV file in the same way after selecting that data type in the file dialog. However, this often results in problems that trigger error messages such as Data values neither numeric nor nominal, which is why ARFF is the preferred file format for Weka.
CSV files can be converted to ARFF format by adding an appropriate header. To do this, first name the relation that the data file reflects (in this example, @relation pima_diabetes). Then, add information about all fields and their respective data types (e.g., @attribute 'age' numeric). Finally, add the class attribute – at least in this case. For applications that classify something, this attribute must be of the nominal type. To calculate a regression, you need the real data type. The data starts after the @data directive as comma-separated attribute values in rows.
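If you prefer not to edit the header by hand, Weka's converter classes can do the conversion for you. Here is a minimal sketch using CSVLoader and ArffSaver; the file names are placeholders:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
  public static void main(String[] args) throws Exception {
    // Read the CSV file; attribute names are taken from the first row
    CSVLoader loader = new CSVLoader();
    loader.setSource(new File("diabetes.csv"));
    Instances data = loader.getDataSet();
    // Write the same data back out in ARFF format, header included
    ArffSaver saver = new ArffSaver();
    saver.setInstances(data);
    saver.setFile(new File("diabetes.arff"));
    saver.writeBatch();
  }
}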
After loading the file, Explorer displays some information about the data. In Figure 5 you can see the number of instances and the names of the attributes on the left side. On the right side, Weka provides statistical key figures for the data (e.g., minimum, maximum, and mean). Additionally, the tool shows a histogram for the attribute selected on the left.
Preprocessing
Before applying machine learning algorithms, you first need to clean up the data. The first step is to remove attributes that are not needed to analyze the problem. For example, the person's name would add nothing to the diabetes diagnosis. It is best to remove such attributes to reduce the load on the algorithm and possibly improve performance at the same time. To do this, check the attribute boxes in question and click Remove.
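The same cleanup can be scripted with the Remove filter from the API. A minimal sketch that drops the first attribute (the index is only an example):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class DropAttribute {
  public static void main(String[] args) throws Exception {
    Instances data = new DataSource("diabetes.arff").getDataSet();
    // -R expects a 1-based index range of the attributes to remove
    Remove remove = new Remove();
    remove.setOptions(new String[] {"-R", "1"});
    remove.setInputFormat(data);
    Instances reduced = Filter.useFilter(data, remove);
    System.out.println("Attributes left: " + reduced.numAttributes());
  }
}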
Weka provides a large number of filters that you can apply to the dataset. For example, if you are only interested in certain values of an attribute, you can hide all others with a filter. Another example would be the desire to normalize certain variables such as age or income. In any case, this step requires an understanding of both the data and the algorithm to decide which filters to consider.
You can select filters by clicking the Choose button below the Filter label. Two types of filters are available: supervised and unsupervised filters. Supervised filters take the class values into account and are very rare compared with unsupervised filters. A distinction is also made between attribute and instance filters. The AllFilter or MultiFilter options let you combine filters.
Figure 6 shows how a normalization filter was applied to the preg attribute of the sample dataset. The text box to the right of the Choose button contains the parameters for the filter. By the way, Explorer also has an Undo button, which can be used to undo any change, and a Save button to save the current values to a file.
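The normalization from Figure 6 can be reproduced in code with the unsupervised Normalize filter. Note that, unlike the GUI example, this sketch scales all numeric attributes to the [0,1] range, which is the filter's default behavior:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class NormalizeData {
  public static void main(String[] args) throws Exception {
    Instances data = new DataSource("diabetes.arff").getDataSet();
    // Scale every numeric attribute to the range [0,1]
    Normalize norm = new Normalize();
    norm.setInputFormat(data);
    Instances normalized = Filter.useFilter(data, norm);
    System.out.println(normalized.firstInstance());
  }
}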
Machine Learning
After preprocessing, you can move on to an arbitrary machine learning task, be it classification, regression, clustering, or association rule discovery. The Classify tab provides an interface to many well-known classification algorithms, including decision trees, support vector machines, naive Bayes, multilayer perceptrons, and logistic regression, to name a few. Meta-learning with bagging, boosting, and stacking is also possible with Weka. Cross-validation and the hold-out method are available to evaluate the classifier, and the parameters of all of the algorithms can be adjusted. Metrics such as accuracy, precision, or recall can be used to evaluate the predictive performance of a classifier.
The following example shows how a decision tree classifier can be applied to the sample dataset diabetes.arff. It uses the J48 algorithm, the open source Java implementation of C4.5, which generates a decision tree classifier. In Explorer, first select the Classify tab and then press the Choose button to select J48 from the list of classification algorithms. The algorithm and some default parameters then appear in the text box to the right (Figure 7). You can edit the parameters at any time, provided you have the necessary understanding of the algorithm's function and the meaning of its parameters.
The classifier can use the data in several ways. It can treat the entire dataset as a training set and generate a model that you can save and reload later to predict new test data, or you can use cross-validation or the hold-out method and split the data into a training set and a test set on a percentage basis. The latter approach is usually employed to evaluate the performance of the classifier on the given dataset.
If you click on More options, you can select the evaluation metrics to use, along with some other options. This step is optional and can be skipped at the beginning. Next, press the Start button to run the desired task with the selected classifier. The right part of the window displays the steps taken by the classifier and the results with the values of the selected metrics. You now have the option to run the same classifier over and over again with the same or different parameters or options.
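The same J48 run can also be scripted against the API. The following sketch builds the tree and evaluates it with 10-fold cross-validation; the options shown are simply J48's defaults:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DiabetesTree {
  public static void main(String[] args) throws Exception {
    Instances data = new DataSource("diabetes.arff").getDataSet();
    data.setClassIndex(data.numAttributes() - 1);
    // J48 with the default confidence factor and minimum leaf size
    J48 tree = new J48();
    tree.setOptions(new String[] {"-C", "0.25", "-M", "2"});
    // 10-fold cross-validation with a fixed random seed
    Evaluation eval = new Evaluation(data);
    eval.crossValidateModel(tree, data, 10, new Random(1));
    System.out.println(eval.toSummaryString());
    System.out.println(eval.toClassDetailsString()); // precision, recall, etc.
  }
}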
If the performance of one classifier does not satisfy you, simply switch to another. In this way, you can try out different algorithms with the same tool in a graphical interface to find the one best suited to your use case. If you want to focus on a single classifier, you can automatically tune its parameters by cross-validation to determine the optimal values. Following exactly the same approach, you can use clustering algorithms such as SimpleKMeans, FarthestFirst, or HierarchicalClusterer in the corresponding tab. Association rule mining can be performed from the Associate tab, where the Apriori algorithm helps mine frequent patterns.
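Clustering follows the same pattern in code. Here is a minimal sketch with SimpleKMeans; the class attribute is removed first so it does not influence the clustering, and the choice of two clusters (matching the two diagnosis outcomes) is an assumption of this example:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class DiabetesClusters {
  public static void main(String[] args) throws Exception {
    Instances data = new DataSource("diabetes.arff").getDataSet();
    // Drop the class attribute; clustering is unsupervised
    Remove remove = new Remove();
    remove.setOptions(new String[] {"-R", "last"});
    remove.setInputFormat(data);
    Instances features = Filter.useFilter(data, remove);
    // k-means with two clusters
    SimpleKMeans kmeans = new SimpleKMeans();
    kmeans.setNumClusters(2);
    kmeans.buildClusterer(features);
    System.out.println(kmeans); // prints the cluster centroids
  }
}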
Finally, the Visualize tab helps with data visualization. For example, scatter plots of all attribute combinations can be output automatically (Figure 8). The different colors represent different class memberships, which is very helpful in discovering relationships between attributes and, in turn, assists in filtering and analysis.