I mostly use Weka from the command line (instead of through the Explorer and Experimenter GUIs). Figuring out how to get the best out of Weka this way was not easy; this page is loosely organized, but it may be of some help.
Note that some command lines are for version 3-3-4, while others are for previous versions.
1) Getting started - a simple feature selection experiment
Assume one has a file called vowel.arff (the Deterding vowel dataset, available here) with 10 attributes + class (the last attribute) and wants to reduce the number of attributes by feature selection, creating a new file fs_vowel.arff.
The suffix AttributeEval in a class name indicates that a score is evaluated for each attribute in the dataset. After evaluating each attribute individually, the natural way to get N attributes is simply to rank them, using weka.attributeSelection.Ranker as the "search" method. For example, to get N = 5 attributes:
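A command along the following lines does the job (a sketch: InfoGainAttributeEval is used here just as an example, any AttributeEval class can be substituted):
java weka.filters.AttributeSelectionFilter -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -N 5" -i vowel.arff -o fs_vowel.arff -c last
The end of the header and the first data lines of fs_vowel.arff then look like: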
@attribute Class {hid,hId,hEd,hAd,hYd,had,hOd,hod,hUd,hud,hed}
@data
0.418,-3.639,-0.168,1.627,-0.67,hid
0.496,-3.327,-0.265,1.933,-0.694,hId
...
If the option -N is omitted (or set to -1), Ranker returns all attributes. In this case, the output file contains the same information as the input, but with the attributes permuted into ranked order.
Principal component analysis (PCA) can be easily done with Weka. For example:
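A sketch, assuming the version of weka.attributeSelection.PrincipalComponents at hand supports -R (the proportion of variance to retain); attribute transformers are used with the Ranker search method, and pca_vowel.arff is just an example output name:
java weka.filters.AttributeSelectionFilter -E "weka.attributeSelection.PrincipalComponents -R 0.95" -S weka.attributeSelection.Ranker -i vowel.arff -o pca_vowel.arff -c last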
Note that strictly speaking, PCA is not a feature selection but a feature extraction method. The new attributes are obtained by a linear combination of the original attributes. Dimensionality reduction is achieved by keeping the components with highest variance.
2) If a matching test file needs to be generated, Weka provides "batch" processing. Check the options using:
java weka.filters.AttributeSelectionFilter -b -h
The following command uses the dataset vowel_train.arff to rank and keep only 5 attributes, and then generates two output files with only these 5 attributes: fs_vowel_train.arff and fs_vowel_test.arff.
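A sketch, reusing the evaluator from section 1 and assuming the test instances are in a (hypothetical) file vowel_test.arff:
java weka.filters.AttributeSelectionFilter -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -N 5" -i vowel_train.arff -o fs_vowel_train.arff -r vowel_test.arff -s fs_vowel_test.arff -b -c last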
(note the mandatory -b to indicate "batch" processing; -r and -s specify the second input and output files)
3) General guidelines
In the package weka.attributeSelection, the option -S specifies the "search" method and -E the "evaluator". From the help:
-S <"Name of search class [search options]">
	e.g. weka.attributeSelection.BestFirst
	Sets search method for subset evaluators.
-E <"Name of attribute/subset evaluation class [evaluator options]">
	e.g. weka.attributeSelection.CfsSubsetEval
	Sets attribute/subset evaluator.
Quotation marks are important for passing multiple options. Some example command lines are given below.
a) RankSearch ranks the attributes with the evaluator given by its -A option (here SVMAttributeEval) and then uses the -E subset evaluator to decide how many of the top-ranked attributes to keep:
java weka.filters.AttributeSelectionFilter -S "weka.attributeSelection.RankSearch -A weka.attributeSelection.SVMAttributeEval" -E "weka.attributeSelection.CfsSubsetEval" -i binary.arff -o train.arff -r binary.arff -s test.arff -b -c last
b) Don't forget -R
The command below does not select 3 features but 4, i.e., -N is not used!
java weka.filters.AttributeSelectionFilter -S "weka.attributeSelection.ForwardSelection -N 3" -E "weka.attributeSelection.WrapperSubsetEval -B weka.classifiers.NaiveBayes" -i binary.arff -o train.arff -r binary.arff -s test.arff -b -c last
One needs to add -R to the weka.attributeSelection.ForwardSelection options, e.g. for the new Weka 3-3-4:
java -cp g:\programs\weka-3-3-4\weka.jar weka.filters.supervised.attribute.AttributeSelection -S "weka.attributeSelection.ForwardSelection -N 3 -R" -E "weka.attributeSelection.WrapperSubsetEval -B weka.classifiers.bayes.NaiveBayes" -i binary.arff -o train.arff -r binary.arff -s test.arff -b -c last
c) ClassifierSubsetEval uses a classifier to estimate the "merit" of a set of attributes. With -T, it uses the training data for accuracy estimation rather than a hold-out/test set.
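With the old class names, the command presumably looks like this (reconstructed from the pattern in b), untested):
java weka.filters.AttributeSelectionFilter -S "weka.attributeSelection.ForwardSelection -N 3 -R" -E "weka.attributeSelection.ClassifierSubsetEval -T -B weka.classifiers.NaiveBayes" -i binary.arff -o train.arff -r binary.arff -s test.arff -b -c last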
or equivalently (new version):
java -cp g:\programs\weka-3-3-4\weka.jar weka.filters.supervised.attribute.AttributeSelection -S "weka.attributeSelection.ForwardSelection -N 3 -R" -E "weka.attributeSelection.ClassifierSubsetEval -T -B weka.classifiers.bayes.NaiveBayes" -i binary.arff -o train.arff -r binary.arff -s test.arff -b -c last
d) ClassifierSubsetEval uses a classifier to estimate the "merit" of a set of attributes. With -H <file>, it uses the given file of hold-out/test instances for accuracy estimation.
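By analogy with c), a sketch (holdout.arff is a hypothetical file of hold-out instances):
java -cp g:\programs\weka-3-3-4\weka.jar weka.filters.supervised.attribute.AttributeSelection -S "weka.attributeSelection.ForwardSelection -N 3 -R" -E "weka.attributeSelection.ClassifierSubsetEval -H holdout.arff -B weka.classifiers.bayes.NaiveBayes" -i binary.arff -o train.arff -r binary.arff -s test.arff -b -c last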
OrderedWrap is possible in Weka, I think - unless I've forgotten some of the details of the paper. You need to use RankSearch as the search method. This search method generates an ordered list of attributes using either an AttributeEvaluator or a SubsetEvaluator. When a SubsetEvaluator is specified, it generates the ranking by using a forward selection search that continues to the far side of the search space (i.e. it keeps adding attributes until all attributes are included). For OrderedWrap, you would specify ClassifierSubsetEval as the parameter to RankSearch, and set its "use training data" option to true. This combo will then generate a ranking using the specified base classifier according to performance on the training data. The correct number of features is then selected by evaluating the highest ranked feature, the top two highest ranked features, etc. with respect to a further attribute evaluator - again a ClassifierSubsetEval, but this time using a hold-out set.
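Putting this together on the command line might look something like the following (an untested sketch: the options for RankSearch's sub-evaluator are placed after -- as described in the RankSearch help, holdout.arff is a hypothetical hold-out file, and the exact quoting/escaping may depend on the Weka version and the shell):
java -cp g:\programs\weka-3-3-4\weka.jar weka.filters.supervised.attribute.AttributeSelection -S "weka.attributeSelection.RankSearch -A weka.attributeSelection.ClassifierSubsetEval -- -T -B weka.classifiers.bayes.NaiveBayes" -E "weka.attributeSelection.ClassifierSubsetEval -H holdout.arff -B weka.classifiers.bayes.NaiveBayes" -i binary.arff -o train.arff -r binary.arff -s test.arff -b -c last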
g) ORDERED-FS
Note that it is trained with lesstrain and also outputs lesstrain; a way of writing the whole training set still needs to be found. Note: I think this was done with various.DeleteAttributes!