Assignment 4: Data Quality Report

In preparation for implementing your own ML algorithms, you will need to be able to read in an .arff file. In this assignment you will write an .arff reader and print out some statistical information about the file (similar to the Data Quality Report in section 3.1 of the text).

You can use any language you wish (but pick one that you will still want to use when implementing algorithms later on).

Reading the .arff

I don't expect an implementation that fully supports every nuance of the .arff format, but at the very least your program should be able to read lakes.arff. In particular, you should support numeric and categorical feature types. Be sure to strip out comments, which start with the % character. If you are using a weakly-typed language, it may be convenient to store all the data as one big 2-dimensional array. If you are using a strongly-typed language, a few options are (option 1 is sketched after this list):

  1. store everything as a 2D array of doubles, but for categorical types the double represents an index, or
  2. store everything as a 2D array of Strings and convert to doubles on the fly as needed, or
  3. store categorical and numeric data in separate 2D arrays, and keep track of how the indices are reordered.
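
For example, option 1 could look something like the sketch below. This is only an illustration; the class and method names (ArffData, encode, decode) are placeholders, not part of the assignment.

import java.util.ArrayList;
import java.util.List;

class ArffData {
	List<String> featureNames = new ArrayList<>();
	// For a categorical feature, categories.get(f) holds its declared values;
	// for a numeric feature the entry is null.
	List<List<String>> categories = new ArrayList<>();
	double[][] data;  // data[instance][feature]

	// Convert one token from an @data row into the double stored in the array.
	double encode(int feature, String token) {
		List<String> cats = categories.get(feature);
		if (cats == null)
			return Double.parseDouble(token);  // numeric feature: parse directly
		return cats.indexOf(token);            // categorical feature: store the index
	}

	// Convert a stored double back to the original categorical value (or a number).
	String decode(int feature, double value) {
		List<String> cats = categories.get(feature);
		return cats == null ? Double.toString(value) : cats.get((int) value);
	}
}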

Writing the data quality report

Your program should output a data quality report as a comma-delimited .csv file. For each continuous feature it should report at least: cardinality, minimum, 1st quartile, median, 3rd quartile, and maximum. Cardinality is the number of unique values that appear for a given feature, and is readily implemented by inserting values into a hash table / hash set / associative array and then checking how many values are in the data structure. The other 5 stats are all special cases of an order statistic: the 0th percentile is the minimum, the 25th percentile is the 1st quartile, the 50th percentile is the median, and so on. So you can write one function or method that supports all 5. For categorical features, report at least the cardinality and the mode.
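
As a rough illustration, the order statistic and cardinality could be computed like this. This is only a sketch; it assumes the non-missing values of a feature have already been gathered into a double[], and it uses the nearest-rank rule (other interpolation conventions are also reasonable).

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class FeatureStats {
	// fraction = 0 gives the minimum, 0.25 the 1st quartile, 0.5 the median,
	// 0.75 the 3rd quartile, and 1 the maximum (nearest-rank convention).
	static double orderStatistic(double[] values, double fraction) {
		double[] sorted = values.clone();
		Arrays.sort(sorted);
		int index = (int) Math.round(fraction * (sorted.length - 1));
		return sorted[index];
	}

	// Cardinality: the number of distinct values, via a hash set as described above.
	static int cardinality(double[] values) {
		Set<Double> unique = new HashSet<>();
		for (double v : values)
			unique.add(v);
		return unique.size();
	}
}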

Here is an example data quality report for lakes.arff. I put some additional tables in there so that histograms can be built in a spreadsheet (optional feature).

Challenges

Here are some optional (bonus) features:

Hints

You can read all @attribute lines and @data lines into variable-length arrays before allocating the data array, so you know how big to make it.
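
For example (a sketch only, assuming the file fits in memory; the names loadArff, attributeLines, and dataLines are just illustrative):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class ArffReaderSketch {
	static double[][] loadArff(File arffFile) throws IOException {
		List<String> attributeLines = new ArrayList<>();
		List<String> dataLines = new ArrayList<>();
		boolean inData = false;
		try (BufferedReader in = new BufferedReader(new FileReader(arffFile))) {
			String line;
			while ((line = in.readLine()) != null) {
				int comment = line.indexOf('%');  // strip comments
				if (comment >= 0)
					line = line.substring(0, comment);
				line = line.trim();
				if (line.isEmpty())
					continue;
				String lower = line.toLowerCase();
				if (lower.startsWith("@attribute"))
					attributeLines.add(line);
				else if (lower.startsWith("@data"))
					inData = true;
				else if (inData)
					dataLines.add(line);
			}
		}
		// Both dimensions are now known, so the data array can be allocated
		// (parsing each line of dataLines into it is left out of this sketch).
		return new double[dataLines.size()][attributeLines.size()];
	}
}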

If you compute standard deviation, note that we are using the unadjusted (population) formula, which divides by n. Weka uses a slightly different formula and will get slightly different results.
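
In code, the unadjusted formula looks something like the sketch below; I believe the discrepancy with Weka comes from the sample version, which divides by n - 1 instead.

class StdDevSketch {
	// Unadjusted (population) standard deviation: divide the squared deviations by n.
	static double standardDeviation(double[] values) {
		double mean = 0;
		for (double v : values)
			mean += v;
		mean /= values.length;
		double sumSq = 0;
		for (double v : values)
			sumSq += (v - mean) * (v - mean);
		// The sample version would divide by (values.length - 1) instead,
		// giving slightly larger values.
		return Math.sqrt(sumSq / values.length);
	}
}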

The implementation is up to you, but here is the interface I went with in Java:

public Dataset(File arffFile) throws Exception
public int getNumberOfFeatures()
public int getNumberOfInstances()
public String getFeatureName(int index)
public boolean isNumericFeature(int index)
public boolean isCategoricalFeature(int index)
public boolean isTargetFeature(int index)
public void setTargetFeature(int index)
public int getTargetFeatureIndex()
public int indexOfCategory(String category, int featureIndex)
	//Index of the given discrete value (category) within the given feature (which must be categorical).
public ArrayList<Integer> getFeatureIndices(boolean numeric, boolean categorical, boolean considerTarget)
	//Gets a list of indices mapping to numeric features, categorical features, both, or neither.
	//If considerTarget is false then the target feature is automatically skipped.
public ArrayList<Integer> getNumericFeatureIndices()
	//Convenience method that calls getFeatureIndices()
public ArrayList<Integer> getCategoricalFeatureIndices()
	//Convenience method that calls getFeatureIndices()
public void writeDataQualityReport(File csvFile) throws Exception
public double getPercentMissing(int featureIndex)
public int getCardinality(int featureIndex)
public double getMean(int featureIndex)
public double getStandardDeviation(int featureIndex)
public double getOrderStatistic(int featureIndex, double fraction)
	//So getOrderStatistic(0, .75) would be the 3rd quartile of feature 0.  getOrderStatistic(0, 0) would be the minimum value of feature 0.
private HashMap<Double, Integer> getValueCountsForFeature(int featureIndex)
	//Gets all the values that occur for a given feature, and how many counts for each value.
	//This is useful for implementing mode, modeFrequency, cardinality, etc.
public String getMode(int featureIndex)
public int getModeFrequency(int featureIndex)
public double getModePercent(int featureIndex)
public String getSecondMode(int featureIndex)
public int getSecondModeFrequency(int featureIndex)
public double getSecondModePercent(int featureIndex)
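
With an interface along these lines, the top-level program can stay very small. The snippet below is only a hypothetical usage example, assuming a Dataset class that implements the methods above:

import java.io.File;

public class Assignment4 {
	public static void main(String[] args) throws Exception {
		Dataset data = new Dataset(new File("lakes.arff"));   // read the .arff
		System.out.println(data.getNumberOfInstances() + " instances, "
				+ data.getNumberOfFeatures() + " features");
		data.writeDataQualityReport(new File("report.csv"));  // write the .csv report
	}
}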