Assignment 3: Interfacing with Weka

There are 3 main ways in which we can interface with Weka:

  1. Through the Graphical User Interface (GUI)
  2. Through the Command Line Interface (CLI)
  3. Through the Application Programmer's Interface (API)

The purpose of this exercise is to gain some familiarity with each of these.

GUI

Start Weka and select Workbench.

Select "Open file...". In your Weka installation folder, find the Data folder and open iris.arff.

Select "Classify", then "Choose" and select an algorithm to run on this dataset. Hit "start" and examine the reuslts.

Try a few different tree algorithms until you get one with good accuracy.

The text area to the right of the "Choose" button shows the algorithm and parameters you are using. Try adjusting some of the parameters to see if you can improve the accuracy of the algorithm.

CLI

Next, we will see how to do the same thing from the command prompt. We will use the following command:

java -cp "C:\Program Files\Weka-3-8\weka.jar" ____________ -t "C:\Program Files\Weka-3-8\data\iris.arff"

...except with the algorithm and parameters in place of the ____________ above. You can copy the parameters by right-clicking the algorithm-and-parameters textbox in Weka and selecting "copy configuration to clipboard".

Open a command prompt and execute the command above with your algorithm and parameters of choice. You should see the same results as before, except in the command prompt. Try redirecting the results to a .txt file by appending > results.txt to the end of your command.
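For example, if you picked J48 and kept its default parameters, the filled-in command (with redirection) would look something like this - the classifier and options shown here are just one possible choice, not the one you are required to use:

java -cp "C:\Program Files\Weka-3-8\weka.jar" weka.classifiers.trees.J48 -C 0.25 -M 2 -t "C:\Program Files\Weka-3-8\data\iris.arff" > results.txt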

API

Finally, we will see how to interface with Weka programmatically using Java.

Make sure you have the Java JDK installed (check both Program Files and Program Files (x86) for a Java folder, and inside it a folder whose name starts with the letters jdk).

Make sure the Java compiler runs from the command prompt (type javac at the command prompt; if it says the command is not found, follow these instructions to add it to your PATH, then close and re-open the command prompt).

Save this Java program, along with the compile, run, and run-with-redirect scripts, into your working folder.

Use the scripts to compile and run. You should get some results from cross-validation.
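If you want a picture of what such a program involves before opening the provided one, here is a minimal sketch using the Weka API directly. It assumes the default installation path for iris.arff, and it uses Weka's Evaluation class rather than the provided program's own fold-by-fold bookkeeping, so its output format will differ from the results shown later in this assignment:

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.DecisionStump;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaExample {
    public static void main(String[] args) throws Exception {
        // Load the dataset (path is illustrative; adjust it to your installation)
        DataSource source = new DataSource("C:\\Program Files\\Weka-3-8\\data\\iris.arff");
        Instances data = source.getDataSet();
        // Weka needs to know which attribute is the class; for iris it is the last one
        data.setClassIndex(data.numAttributes() - 1);

        // Instantiate the learning algorithm
        Classifier model = new DecisionStump();

        // Run 10-fold cross-validation and print a summary of the results
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(model, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}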

Try replacing the line of code that instantiates a DecisionStump with your own choice of algorithm, such as the one sketched below.
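For instance, J48 (Weka's implementation of C4.5) is one possible replacement; the variable name below matches the sketch above, so adapt it to whatever name the provided program actually uses:

// Swap the DecisionStump for another tree learner, e.g. J48
// (fully qualified here, so no extra import is needed)
Classifier model = new weka.classifiers.trees.J48();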

Compile and run. Did it do better than the DecisionStump?

Finally, look at the API documentation for Weka (it is also provided with your installation). Try adjusting some of the parameters of the ML algorithm programmatically.
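There are generally two ways to do this: call the setter methods listed in the classifier's Javadoc, or hand it the same option string you copied from the GUI. A minimal fragment, assuming you chose J48 above and are working inside a method that declares throws Exception:

// Option 1: setter methods from the J48 Javadoc
weka.classifiers.trees.J48 tree = new weka.classifiers.trees.J48();
tree.setConfidenceFactor(0.1f);  // prune more aggressively than the default 0.25
tree.setMinNumObj(5);            // require at least 5 instances per leaf

// Option 2: pass the same options as a command-line style string
weka.classifiers.trees.J48 tree2 = new weka.classifiers.trees.J48();
tree2.setOptions(weka.core.Utils.splitOptions("-C 0.1 -M 5"));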

Once you have gotten this far, show me your program. Then study the rest of that program to get a basic understanding of how the API works.

A challenge

It is straightforward to aggregate the results of all 10 folds to get an estimate of accuracy, like this:

15.0 correct out of 15.0 = 100.0 %
15.0 correct out of 15.0 = 100.0 %
15.0 correct out of 15.0 = 100.0 %
15.0 correct out of 15.0 = 100.0 %
14.0 correct out of 15.0 = 93.33333333333333 %
14.0 correct out of 15.0 = 93.33333333333333 %
14.0 correct out of 15.0 = 93.33333333333333 %
13.0 correct out of 15.0 = 86.66666666666667 %
14.0 correct out of 15.0 = 93.33333333333333 %
15.0 correct out of 15.0 = 100.0 %
Overall: 144 correct out of 150 = 96.0 %

Although this gives us an estimate of our algorithm's prediction accuracy on this dataset, it is just that: an estimate. If we tested on 10 million instances, we would surely find that the prediction accuracy settles a little higher or lower than what we see on 150 instances.

To get a sense of how stable our accuracy estimate is, we can repeat the whole cross-validation process many times. That is: repeat the whole process of random splitting and 10-fold cross-validation many times (say, 20 times) to get 20 different estimates of the model's accuracy. Then we can compute the mean and standard deviation of these 20 estimates. With the standard deviation in hand, we have a sense of the model's instability - that is, how much the accuracy varies across different runs. A sketch of this loop appears below.
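One way to code this up, reusing the Evaluation-based sketch from earlier (the file path, the choice of RandomTree, and the seed scheme are assumptions; the provided program's own fold-counting loop works just as well), is:

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RepeatedCrossValidation {
    public static void main(String[] args) throws Exception {
        DataSource source = new DataSource("C:\\Program Files\\Weka-3-8\\data\\iris.arff");
        Instances data = source.getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        int repetitions = 20;
        double[] accuracies = new double[repetitions];

        for (int i = 0; i < repetitions; i++) {
            Classifier model = new RandomTree();
            Evaluation eval = new Evaluation(data);
            // A different seed each repetition gives a different random split
            eval.crossValidateModel(model, data, 10, new Random(i));
            accuracies[i] = eval.pctCorrect();
            System.out.println("Overall: " + eval.correct() + " correct out of "
                    + (eval.correct() + eval.incorrect()) + " = " + accuracies[i] + " %");
        }

        // Mean and (population) standard deviation of the 20 accuracy estimates
        double mean = 0;
        for (double a : accuracies) mean += a;
        mean /= repetitions;
        double var = 0;
        for (double a : accuracies) var += (a - mean) * (a - mean);
        double stdDev = Math.sqrt(var / repetitions);

        System.out.println("Average accuracy: " + mean + " %");
        System.out.println("Std dev of accuracy: " + stdDev);
    }
}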

Running 20 repetitions of 10-fold cross-validation on the RandomTree algorithm gave these results:

14.0 correct out of 15.0 = 93.33333333333333 %
14.0 correct out of 15.0 = 93.33333333333333 %
15.0 correct out of 15.0 = 100.0 %
15.0 correct out of 15.0 = 100.0 %
13.0 correct out of 15.0 = 86.66666666666667 %
14.0 correct out of 15.0 = 93.33333333333333 %
14.0 correct out of 15.0 = 93.33333333333333 %
13.0 correct out of 15.0 = 86.66666666666667 %
13.0 correct out of 15.0 = 86.66666666666667 %
13.0 correct out of 15.0 = 86.66666666666667 %
Overall: 138 correct out of 150 = 92.0 %
14.0 correct out of 15.0 = 93.33333333333333 %
14.0 correct out of 15.0 = 93.33333333333333 %
15.0 correct out of 15.0 = 100.0 %
15.0 correct out of 15.0 = 100.0 %
15.0 correct out of 15.0 = 100.0 %
14.0 correct out of 15.0 = 93.33333333333333 %
14.0 correct out of 15.0 = 93.33333333333333 %
13.0 correct out of 15.0 = 86.66666666666667 %
14.0 correct out of 15.0 = 93.33333333333333 %
12.0 correct out of 15.0 = 80.0 %
Overall: 140 correct out of 150 = 93.33333333333333 %
...
Overall: 143 correct out of 150 = 95.33333333333333 %
...
Overall: 141 correct out of 150 = 94.0 %
...
Overall: 140 correct out of 150 = 93.33333333333333 %
...
Overall: 142 correct out of 150 = 94.66666666666667 %
...
Overall: 138 correct out of 150 = 92.0 %
...
Overall: 142 correct out of 150 = 94.66666666666667 %
...
Overall: 143 correct out of 150 = 95.33333333333333 %
...
Overall: 141 correct out of 150 = 94.0 %
...
Overall: 141 correct out of 150 = 94.0 %
...
Overall: 139 correct out of 150 = 92.66666666666667 %
...
Overall: 141 correct out of 150 = 94.0 %
...
Overall: 139 correct out of 150 = 92.66666666666667 %
...
Overall: 140 correct out of 150 = 93.33333333333333 %
...
Overall: 139 correct out of 150 = 92.66666666666667 %
...
Overall: 142 correct out of 150 = 94.66666666666667 %
...
Overall: 141 correct out of 150 = 94.0 %
...
Overall: 141 correct out of 150 = 94.0 %
...
Overall: 138 correct out of 150 = 92.0 %
Average accuracy: 93.63333333333334 %
Std dev of accuracy: 1.0214368964028402

...whereas running 20 repetitions of 10-fold cross-validation on the RandomForest algorithm gave these results:

14.0 correct out of 15.0 = 93.33333333333333 %
15.0 correct out of 15.0 = 100.0 %
15.0 correct out of 15.0 = 100.0 %
15.0 correct out of 15.0 = 100.0 %
14.0 correct out of 15.0 = 93.33333333333333 %
14.0 correct out of 15.0 = 93.33333333333333 %
14.0 correct out of 15.0 = 93.33333333333333 %
13.0 correct out of 15.0 = 86.66666666666667 %
14.0 correct out of 15.0 = 93.33333333333333 %
15.0 correct out of 15.0 = 100.0 %
Overall: 143 correct out of 150 = 95.33333333333333 %
14.0 correct out of 15.0 = 93.33333333333333 %
14.0 correct out of 15.0 = 93.33333333333333 %
15.0 correct out of 15.0 = 100.0 %
15.0 correct out of 15.0 = 100.0 %
15.0 correct out of 15.0 = 100.0 %
14.0 correct out of 15.0 = 93.33333333333333 %
15.0 correct out of 15.0 = 100.0 %
14.0 correct out of 15.0 = 93.33333333333333 %
14.0 correct out of 15.0 = 93.33333333333333 %
12.0 correct out of 15.0 = 80.0 %
Overall: 142 correct out of 150 = 94.66666666666667 %
...
Overall: 141 correct out of 150 = 94.0 %
...
Overall: 143 correct out of 150 = 95.33333333333333 %
...
Overall: 142 correct out of 150 = 94.66666666666667 %
...
Overall: 143 correct out of 150 = 95.33333333333333 %
...
Overall: 142 correct out of 150 = 94.66666666666667 %
...
Overall: 142 correct out of 150 = 94.66666666666667 %
...
Overall: 141 correct out of 150 = 94.0 %
...
Overall: 140 correct out of 150 = 93.33333333333333 %
...
Overall: 140 correct out of 150 = 93.33333333333333 %
...
Overall: 140 correct out of 150 = 93.33333333333333 %
...
Overall: 141 correct out of 150 = 94.0 %
...
Overall: 141 correct out of 150 = 94.0 %
...
Overall: 140 correct out of 150 = 93.33333333333333 %
...
Overall: 142 correct out of 150 = 94.66666666666667 %
...
Overall: 141 correct out of 150 = 94.0 %
...
Overall: 143 correct out of 150 = 95.33333333333333 %
...
Overall: 142 correct out of 150 = 94.66666666666667 %
...
Overall: 141 correct out of 150 = 94.0 %
Average accuracy: 94.33333333333333 %
Std dev of accuracy: 0.6831300510643282

Not only did RandomForest have a slightly higher accuracy, it also had much less variance in its accuracy. In other words, RandomTree exhibits more instability here than RandomForest. Instability is a harbinger of poor future performance, so RandomForest would be an easy choice over RandomTree in this setting.