Assignment 5: Feature selection

Removing unneeded features can significantly improve the accuracy of some algorithms. In this exercise, you will implement backward feature selection, which starts with all features and removes them one at a time for as long as accuracy (or some other performance metric) improves.

Backward feature selection

Download the lakes dataset. We begin with a method that does 10-fold cross validation and reports the area under the ROC curve for the first value of the target feature:

	public static double aucFor(Instances theData) throws Exception {
		double totalAuc = 0;
		int folds = 10; // the number of folds to generate.  Must be at least 2.
		theData.stratify(folds); //make stratified folds
		for (int n = 0; n < folds; n++) {
			Instances trainingSet = theData.trainCV(folds, n);
			Instances testSet = theData.testCV(folds, n);

			J48 learner = new J48(); // The C4.5 decision tree algorithm
			learner.buildClassifier(trainingSet);
			Evaluation eval = new Evaluation(trainingSet);

			eval.evaluateModel(learner, testSet);
			double auc = eval.areaUnderROC(0);

			totalAuc += auc;
		}
		return (totalAuc / folds);
	}

You will also need the usual imports (and a main method of course).

import weka.core.*;
import weka.core.converters.*;
import weka.core.converters.ConverterUtils.*;
import weka.classifiers.bayes.*;
import weka.classifiers.trees.*;
import weka.classifiers.Evaluation;
import java.util.*;

We are using the area under the ROC curve as a performance metric because it is more robust than accuracy. For example, accuracy can be misleading if 90% of the instances are "Yes" and the learner just guesses "Yes" all the time.

Create an Instances object with the lakes data, randomize it with the randomize() method as we did last time, and set the class attribute to be the last one using data.setClassIndex(data.numAttributes() - 1); (assuming your Instances object is named data). Then call the aucFor method and print the area under the ROC curve. You should see something like this:
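For reference, a minimal main method along those lines might look like the following sketch. The filename "lakes.arff" and the seed 42 are assumptions, not given by the assignment; adjust them to match your setup. The imports listed above cover everything used here.

	public static void main(String[] args) throws Exception {
		// "lakes.arff" is an assumed filename; use whatever your downloaded file is called.
		Instances data = ConverterUtils.DataSource.read("lakes.arff");

		// Shuffle the instances so the stratified folds are not order-dependent.
		data.randomize(new Random(42)); // the seed 42 is an arbitrary choice

		// The target feature is the last attribute.
		data.setClassIndex(data.numAttributes() - 1);

		System.out.println("Area under ROC curve: " + aucFor(data));
	}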

Area under ROC curve: 0.9130290516206483

Now write a loop that goes through each feature index except the target feature (which is last); data.numAttributes() will give you the upper bound. On each iteration of the loop, you should:

- make a fresh copy of the dataset (the Instances copy constructor works for this),
- delete the attribute at the current index from the copy, and
- call aucFor on the copy and print the resulting AUC alongside the feature index.

Note that on each iteration of the loop, we restore the original dataset before removing a different feature, so each time we are doing cross-validation on all but one of the features. The output should now look something like this:

We are trying to beat: 0.9152919007603041 AUC
=====
0.9180521248499401 AUC if we remove feature 0.
0.9239544857943178 AUC if we remove feature 1.
0.8153967827130852 AUC if we remove feature 2.
0.9240548459383753 AUC if we remove feature 3.
0.9267355982392959 AUC if we remove feature 4.
0.9189524209683875 AUC if we remove feature 5.
0.9131308523409363 AUC if we remove feature 6.
0.9189925010004002 AUC if we remove feature 7.
0.9186723729491797 AUC if we remove feature 8.
0.9194526850740296 AUC if we remove feature 9.
0.9150315886354543 AUC if we remove feature 10.
0.9200929411764707 AUC if we remove feature 11.
0.9080128611444579 AUC if we remove feature 12.
0.9186324529811924 AUC if we remove feature 13.
0.9223143177270909 AUC if we remove feature 14.
0.9147107883153263 AUC if we remove feature 15.
0.8944269867947179 AUC if we remove feature 16.
0.9200929411764707 AUC if we remove feature 17.

You probably have several features whose removal would improve performance. Once the loop completes, remove just one feature: the one whose removal gives the biggest improvement. But be sure to remove it from the original dataset, not the copy:

Now trying to beat: 0.9152919007603041 AUC
=====
0.9180521248499401 AUC if we remove feature 0.  Best to remove so far!
0.9239544857943178 AUC if we remove feature 1.  Best to remove so far!
0.8153967827130852 AUC if we remove feature 2.
0.9240548459383753 AUC if we remove feature 3.  Best to remove so far!
0.9267355982392959 AUC if we remove feature 4.  Best to remove so far!
0.9189524209683875 AUC if we remove feature 5.
0.9131308523409363 AUC if we remove feature 6.
0.9189925010004002 AUC if we remove feature 7.
0.9186723729491797 AUC if we remove feature 8.
0.9194526850740296 AUC if we remove feature 9.
0.9150315886354543 AUC if we remove feature 10.
0.9200929411764707 AUC if we remove feature 11.
0.9080128611444579 AUC if we remove feature 12.
0.9186324529811924 AUC if we remove feature 13.
0.9223143177270909 AUC if we remove feature 14.
0.9147107883153263 AUC if we remove feature 15.
0.8944269867947179 AUC if we remove feature 16.
0.9200929411764707 AUC if we remove feature 17.
Removed feature 4
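Put together, one possible shape for this pass is the following sketch (variable names like bestIndex are illustrative, not required):

	double bestAuc = aucFor(data);      // the score to beat
	System.out.println("We are trying to beat: " + bestAuc + " AUC");
	System.out.println("=====");
	int bestIndex = -1;                 // -1 means "removing nothing is best so far"
	for (int i = 0; i < data.numAttributes() - 1; i++) {
		Instances copy = new Instances(data); // work on a fresh copy each time
		copy.deleteAttributeAt(i);            // the class index shifts down automatically
		double auc = aucFor(copy);
		System.out.print(auc + " AUC if we remove feature " + i + ".");
		if (auc > bestAuc) {
			bestAuc = auc;
			bestIndex = i;
			System.out.print("  Best to remove so far!");
		}
		System.out.println();
	}
	if (bestIndex >= 0) {
		data.deleteAttributeAt(bestIndex); // remove from the original, not the copy
		System.out.println("Removed feature " + bestIndex);
	}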

Now wrap that whole process in an outer loop: keep making passes, removing the single best feature each time, and stop as soon as removing a feature no longer improves the AUC. If the search stops after only one pass, try a different random seed to verify that it behaves correctly over multiple passes. The output should now look something like this:

Now trying to beat: 0.9017678111244496 AUC
=====
0.8903236094437774 AUC if we remove feature 0.
0.9017481792717087 AUC if we remove feature 1.
0.8088212404961984 AUC if we remove feature 2.
0.8972678351340537 AUC if we remove feature 3.
0.8963271068427371 AUC if we remove feature 4.
0.8854435614245698 AUC if we remove feature 5.
0.9078490516206481 AUC if we remove feature 6.  Best to remove so far!
0.8978283073229292 AUC if we remove feature 7.
0.8958843857543017 AUC if we remove feature 8.
0.894785394157663 AUC if we remove feature 9.
0.8922430492196879 AUC if we remove feature 10.
0.8940851140456182 AUC if we remove feature 11.
0.8852618487394958 AUC if we remove feature 12.
0.901568107242897 AUC if we remove feature 13.
0.902568507402961 AUC if we remove feature 14.
0.9003476190476191 AUC if we remove feature 15.
0.8994072428971588 AUC if we remove feature 16.
0.89208431372549 AUC if we remove feature 17.
Removed feature 6
Now trying to beat: 0.9078490516206481 AUC
=====
0.9157898599439778 AUC if we remove feature 0.  Best to remove so far!
0.9149295478191277 AUC if we remove feature 1.
0.8583924529811926 AUC if we remove feature 2.
0.9202716366546617 AUC if we remove feature 3.  Best to remove so far!
0.9137096598639456 AUC if we remove feature 4.
0.9288960544217687 AUC if we remove feature 5.  Best to remove so far!
0.9118883473389356 AUC if we remove feature 6.
0.9141292356942777 AUC if we remove feature 7.
0.9281360144057624 AUC if we remove feature 8.
0.9273153341336535 AUC if we remove feature 9.
0.9164501560624251 AUC if we remove feature 10.
0.9233768147258903 AUC if we remove feature 11.
0.9111880512204882 AUC if we remove feature 12.
0.9058064825930371 AUC if we remove feature 13.
0.9183909323729491 AUC if we remove feature 14.
0.9126692196878752 AUC if we remove feature 15.
0.9150696038415367 AUC if we remove feature 16.
Removed feature 5
Now trying to beat: 0.9288960544217687 AUC
=====
0.928779855942377 AUC if we remove feature 0.
0.9269786714685875 AUC if we remove feature 1.
0.8397467787114845 AUC if we remove feature 2.
0.9163952460984393 AUC if we remove feature 3.
0.9160538295318126 AUC if we remove feature 4.
0.9344990076030413 AUC if we remove feature 5.  Best to remove so far!
0.9143948539415765 AUC if we remove feature 6.
0.9292178231292517 AUC if we remove feature 7.
0.9135125970388156 AUC if we remove feature 8.
0.9269786714685875 AUC if we remove feature 9.
0.9132531572629052 AUC if we remove feature 10.
0.9251172629051622 AUC if we remove feature 11.
0.9227773189275712 AUC if we remove feature 12.
0.9253980392156864 AUC if we remove feature 13.
0.9159349659863946 AUC if we remove feature 14.
0.9317987434973991 AUC if we remove feature 15.
Removed feature 5
Now trying to beat: 0.9344990076030413 AUC
=====
0.9270963025210085 AUC if we remove feature 0.
0.9292172148859544 AUC if we remove feature 1.
0.8330236254501802 AUC if we remove feature 2.
0.9169350300120049 AUC if we remove feature 3.
0.9193933093237294 AUC if we remove feature 4.
0.8971277631052421 AUC if we remove feature 5.
0.9295973829531814 AUC if we remove feature 6.
0.9192544457783113 AUC if we remove feature 7.
0.925936350540216 AUC if we remove feature 8.
0.9206964945978392 AUC if we remove feature 9.
0.930757831132453 AUC if we remove feature 10.
0.9254570468187275 AUC if we remove feature 11.
0.9226764785914365 AUC if we remove feature 12.
0.9020898439375751 AUC if we remove feature 13.
0.9268563505402163 AUC if we remove feature 14.
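The greedy control flow itself does not depend on Weka, so it can be exercised in plain Java with a stand-in scoring function. In the sketch below, score and its fixed penalty array are invented purely to demonstrate the logic; in your program that call corresponds to aucFor on a copy of the data with one attribute deleted.

```java
import java.util.ArrayList;
import java.util.List;

public class BackwardSelectionSketch {
    // Hypothetical scorer: the "AUC" of a feature set is a base value minus a
    // fixed penalty for each feature still included. Positive penalties model
    // noisy features that hurt performance.
    static double score(List<Integer> features, double[] penalty) {
        double auc = 0.95;
        for (int f : features) auc -= penalty[f];
        return auc;
    }

    public static void main(String[] args) {
        double[] penalty = {0.00, 0.02, -0.01, 0.01}; // features 1 and 3 hurt
        List<Integer> features = new ArrayList<>();
        for (int f = 0; f < penalty.length; f++) features.add(f);

        double bestAuc = score(features, penalty);
        boolean improved = true;
        while (improved) {                  // outer loop: keep going while removal helps
            improved = false;
            int bestIndex = -1;
            for (int i = 0; i < features.size(); i++) { // inner loop: try each removal
                List<Integer> copy = new ArrayList<>(features);
                copy.remove(i);
                double auc = score(copy, penalty);
                if (auc > bestAuc) {
                    bestAuc = auc;
                    bestIndex = i;
                }
            }
            if (bestIndex >= 0) {           // remove only the single best feature
                System.out.println("Removed feature " + features.remove(bestIndex));
                improved = true;
            }
        }
        System.out.println("Final AUC: " + bestAuc + " with features " + features);
    }
}
```

Running this removes features 1 and 3 (the ones with positive penalties) and then stops, mirroring the stopping behavior your Weka program should show.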

Extra stuff (optional)

It would be nice to see the names of the removed features rather than just their indices. Use the API to display the name of each feature as it is removed.
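For example, assuming the index of the feature about to be removed is in a variable named bestIndex, something like this sketch looks the name up just before the deletion:

	// Look up the attribute's name before it is deleted from the dataset.
	System.out.println("Removing \"" + data.attribute(bestIndex).name()
			+ "\" (index " + bestIndex + ")");
	data.deleteAttributeAt(bestIndex);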

We only saw small improvements for the C4.5 algorithm when backward feature selection was applied. Try it again with the nearest neighbor approach (weka.classifiers.lazy.IBk) and see how dramatic the difference is.
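Only one line in aucFor needs to change; a sketch (IBk's int constructor sets the number of neighbors, and k = 3 here is an arbitrary choice):

	// In aucFor, replace "J48 learner = new J48();" with a nearest-neighbor learner:
	IBk learner = new IBk(3); // weka.classifiers.lazy.IBk with k = 3
	learner.buildClassifier(trainingSet);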