Feature selection with Weka

Original article:
http://www.laps.ufpa.br/aldebaro/weka/feature_selection.html

Some feature selection using Weka through the command line

Prof. Mark Hall wrote most of Weka's code for feature selection, so a good place to find documentation is his web page: http://www.cs.waikato.ac.nz/~mhall/pubs.html

I mostly use Weka from the command line (instead of through the Explorer and Experimenter GUIs). It was not easy to figure out how to get the best out of Weka from the command line; this page is not well organized, but it may be of some help.

Note that some command lines are for version 3-3-4, while others are for previous versions.

1) Getting started - a simple feature selection experiment

Assume one has a file called vowel.arff (the Deterding vowel dataset) with 10 attributes + class (the class being the last attribute) and wants to reduce the number of attributes by feature selection, creating a new file fs_vowel.arff.

The suffix AttributeEval in a class name indicates that a score is evaluated for every attribute in the dataset. After evaluating each attribute individually, the natural method to get N attributes is to simply rank them, using weka.attributeSelection.Ranker as the "search" method. For example, to get N = 5 attributes:

java weka.filters.AttributeSelectionFilter -S "weka.attributeSelection.Ranker -N 5" -E "weka.attributeSelection.InfoGainAttributeEval" -i vowel.arff -o fs_vowel.arff -c last

The first lines of the input file vowel.arff are:

@relation vowel-train
@attribute 'Feature 0' numeric
@attribute 'Feature 1' numeric
@attribute 'Feature 2' numeric
@attribute 'Feature 3' numeric
@attribute 'Feature 4' numeric
@attribute 'Feature 5' numeric
@attribute 'Feature 6' numeric
@attribute 'Feature 7' numeric
@attribute 'Feature 8' numeric
@attribute 'Feature 9' numeric
@attribute Class {hid,hId,hEd,hAd,hYd,had,hOd,hod,hUd,hud,hed}
@data
-3.639,0.418,-0.67,1.779,-0.168,1.627,-0.388,0.529,-0.874,-0.814,hid
-3.327,0.496,-0.694,1.365,-0.265,1.933,-0.363,0.51,-0.621,-0.488,hId
...

The first lines of the output file fs_vowel.arff are:

@relation 'vowel-train-weka.filters.AttributeSelectionFilter-Eweka.attributeS
@attribute 'Feature 1' numeric
@attribute 'Feature 0' numeric
@attribute 'Feature 4' numeric
@attribute 'Feature 5' numeric
@attribute 'Feature 2' numeric
@attribute Class {hid,hId,hEd,hAd,hYd,had,hOd,hod,hUd,hud,hed}
@data
0.418,-3.639,-0.168,1.627,-0.67,hid
0.496,-3.327,-0.265,1.933,-0.694,hId
...

If the option -N is omitted (or set to -1), Ranker returns all attributes. In this case, the output file has the same information as the input, but with the attributes permuted.
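For reference, the same InfoGain + Ranker selection can be done through Weka's Java API. The following is a minimal sketch; it assumes a reasonably recent Weka where ConverterUtils.DataSource and the weka.filters.supervised.attribute.AttributeSelection filter exist (class names differ in the 3-3-x releases):

import java.io.File;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class InfoGainSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("vowel.arff");
        data.setClassIndex(data.numAttributes() - 1);    // the -c last option

        Ranker ranker = new Ranker();
        ranker.setNumToSelect(5);                        // the -N 5 option

        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new InfoGainAttributeEval());
        filter.setSearch(ranker);
        filter.setInputFormat(data);

        Instances reduced = Filter.useFilter(data, filter);

        ArffSaver saver = new ArffSaver();               // the -o option
        saver.setInstances(reduced);
        saver.setFile(new File("fs_vowel.arff"));
        saver.writeBatch();
    }
}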

Principal component analysis (PCA) can be easily done with Weka. For example:

java weka.filters.AttributeSelectionFilter -S "weka.attributeSelection.Ranker" -E "weka.attributeSelection.PrincipalComponents -R 0.5" -i vowel.arff -o fs_vowel.arff

The first lines of the output file fs_vowel.arff are:

@relation 'vowel-train-weka.filters.AttributeSelectionFilter-Eweka.attributeSelection.PrincipalComponents -R 0.5-Sweka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1'
@attribute '0.379Feature 0-0.227Feature 1-0.526Feature 2-0.139Feature 3+0.069Feature 4+0.491Feature 5+0.237Feature 6+0.193Feature 7-0.209Feature 8-0.354Feature 9' numeric
@attribute '0.249Feature 0-0.556Feature 1+0.043Feature 2+0.525Feature 3+0.131Feature 4+0.205Feature 5-0.461Feature 6-0.106Feature 7+0.083Feature 8+0.25 Feature 9' numeric
@attribute '-0.182Feature 0-0.145Feature 1+0.026Feature 2-0.163Feature 3+0.616Feature 4-0.021Feature 5+0.164Feature 6-0.489Feature 7-0.499Feature 8+0.152Feature 9' numeric
@attribute Class {hid,hId,hEd,hAd,hYd,had,hOd,hod,hUd,hud,hed}
@data
0.185966,0.499603,-0.270085,hid
0.221032,0.497447,-0.306291,hId
...

Note that strictly speaking, PCA is not a feature selection but a feature extraction method. The new attributes are obtained by a linear combination of the original attributes. Dimensionality reduction is achieved by keeping the components with highest variance.
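Programmatically, the same transform looks like this (a sketch; setVarianceCovered(0.5) is assumed to correspond to the -R 0.5 option, as in recent releases):

import weka.attributeSelection.PrincipalComponents;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class PcaTransform {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("vowel.arff");
        data.setClassIndex(data.numAttributes() - 1);

        PrincipalComponents pca = new PrincipalComponents();
        pca.setVarianceCovered(0.5);   // keep components until 50% of the variance is covered

        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(pca);
        filter.setSearch(new Ranker());
        filter.setInputFormat(data);

        Instances transformed = Filter.useFilter(data, filter);
        System.out.println(transformed);   // the new attributes are linear combinations
    }
}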

2) If a matching test file needs to be generated as well, Weka provides "batch" processing. Check the options using:

java weka.filters.AttributeSelectionFilter -b -h

The following command uses the dataset vowel_train.arff to rank and keep only 5 attributes, and then generates two output files with only these 5 attributes: fs_vowel_train.arff and fs_vowel_test.arff.

java weka.filters.AttributeSelectionFilter -S "weka.attributeSelection.Ranker -N 5" -E "weka.attributeSelection.InfoGainAttributeEval" -i vowel_train.arff -o fs_vowel_train.arff -r vowel_test.arff -s fs_vowel_test.arff -c last -b

(note the mandatory -b flag indicating batch processing)
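The Java-API counterpart of batch mode is the standard two-batch filter pattern: initialize the filter on the training data, then push both sets through it, so the test set gets exactly the attributes selected on the training set. A sketch:

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class BatchSelection {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("vowel_train.arff");
        Instances test = DataSource.read("vowel_test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        Ranker ranker = new Ranker();
        ranker.setNumToSelect(5);

        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new InfoGainAttributeEval());
        filter.setSearch(ranker);
        filter.setInputFormat(train);   // attributes are chosen from the training data only

        Instances fsTrain = Filter.useFilter(train, filter);
        Instances fsTest = Filter.useFilter(test, filter);   // same 5 attributes applied to the test set
        System.out.println(fsTrain.numAttributes() + " attributes kept");
    }
}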

3) General guidelines

In the package weka.attributeSelection, the option -S specifies the "search" method and -E the "evaluator". From the help:

-S <"Name of search class [search options]">
e.g. weka.attributeSelection.BestFirst
Sets search method for subset evaluators.

-E <"Name of attribute/subset evaluation class [evaluator options]">
e.g. weka.attributeSelection.CfsSubsetEval
Sets attribute/subset evaluator.

Quotation marks are important for passing multiple options. Below, some example command lines are provided.

a) Ranker vs. RankSearch

java weka.filters.AttributeSelectionFilter -S "weka.attributeSelection.Ranker -N 5" -E "weka.attributeSelection.SVMAttributeEval -P 10 -E 15" -i binary.arff -o train.arff -r binary.arff -s test.arff -b -c last

java weka.filters.AttributeSelectionFilter -S "weka.attributeSelection.RankSearch -A weka.attributeSelection.SVMAttributeEval" -E "weka.attributeSelection.CfsSubsetEval" -i binary.arff -o train.arff -r binary.arff -s test.arff -b -c last

b) Don't forget -R

The command below does not select 3 features but 4, i.e., the -N option is not used!

java weka.filters.AttributeSelectionFilter -S "weka.attributeSelection.ForwardSelection -N 3" -E "weka.attributeSelection.WrapperSubsetEval -B weka.classifiers.NaiveBayes" -i binary.arff -o train.arff -r binary.arff -s test.arff -b -c last

One needs to add -R to the weka.attributeSelection.ForwardSelection options, e.g. for the new Weka 3-3-4:

java -cp g:\programs\weka-3-3-4\weka.jar weka.filters.supervised.attribute.AttributeSelection -S "weka.attributeSelection.ForwardSelection -N 3 -R" -E "weka.attributeSelection.WrapperSubsetEval -B weka.classifiers.bayes.NaiveBayes" -i binary.arff -o train.arff -r binary.arff -s test.arff -b -c last

c) ClassifierSubsetEval uses a classifier to estimate the "merit" of a set of attributes. With -T it uses the training data for accuracy estimation rather than a hold-out/test set.

java weka.filters.AttributeSelectionFilter -S "weka.attributeSelection.ForwardSelection -N 3 -R" -E "weka.attributeSelection.ClassifierSubsetEval -T -B weka.classifiers.NaiveBayes" -i binary.arff -o train.arff -r binary.arff -s test.arff -b -c last

or equivalently (new version):

java -cp g:\programs\weka-3-3-4\weka.jar weka.filters.supervised.attribute.AttributeSelection -S "weka.attributeSelection.ForwardSelection -N 3 -R" -E "weka.attributeSelection.ClassifierSubsetEval -T -B weka.classifiers.bayes.NaiveBayes" -i binary.arff -o train.arff -r binary.arff -s test.arff -b -c last

d) ClassifierSubsetEval uses a classifier to estimate the "merit" of a set of attributes. With -H it uses a file containing hold-out/test instances for accuracy estimation:

java weka.filters.AttributeSelectionFilter -S "weka.attributeSelection.ForwardSelection -N 3 -R" -E "weka.attributeSelection.ClassifierSubsetEval -H binary.arff -B weka.classifiers.NaiveBayes" -i binary.arff -o train.arff -r binary.arff -s test.arff -b -c last

Note that WrapperSubsetEval stops if std/mean > threshold after 2 runs of cross-validation.
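In Java, the -B/-T/-H flags of ClassifierSubsetEval map onto setters. A sketch, with two caveats: GreedyStepwise stands in for ForwardSelection (later Weka versions folded forward selection into it), and the setter names are assumed from the documented options:

import java.io.File;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.ClassifierSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class HoldOutSelection {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("binary.arff");
        train.setClassIndex(train.numAttributes() - 1);

        ClassifierSubsetEval eval = new ClassifierSubsetEval();
        eval.setClassifier(new NaiveBayes());           // -B
        eval.setHoldOutFile(new File("binary.arff"));   // -H; use eval.setUseTraining(true) for -T instead

        AttributeSelection sel = new AttributeSelection();
        sel.setEvaluator(eval);
        sel.setSearch(new GreedyStepwise());            // forward selection
        sel.SelectAttributes(train);
        System.out.println(sel.toResultsString());
    }
}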

e) To train a classifier with selected attributes.

The command below works:

java weka.classifiers.FilteredClassifier -t binary.arff -B weka.classifiers.SMO -F "weka.filters.AttributeSelectionFilter -E weka.attributeSelection.PrincipalComponents -S weka.attributeSelection.Ranker"

but the command below does not (the nested quotation marks break the option parsing):

java weka.classifiers.FilteredClassifier -t binary.arff -B weka.classifiers.SMO -F "weka.filters.AttributeSelectionFilter -E "weka.attributeSelection.PrincipalComponents -R 0.95" -S "weka.attributeSelection.Ranker""

nor does this one (new version):

java -cp g:\programs\weka-3-3-4\weka.jar weka.classifiers.meta.FilteredClassifier -t binary.arff -B weka.classifiers.bayes.NaiveBayes -F "weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.PrincipalComponents "-S weka.attributeSelection.Ranker -c last ""
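One quoting-free alternative (a sketch, not necessarily the workaround from the email mentioned below) is to wire FilteredClassifier up through the Java API, where no nested quoting is needed at all:

import java.util.Random;
import weka.attributeSelection.PrincipalComponents;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.supervised.attribute.AttributeSelection;

public class PcaFilteredClassifier {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("binary.arff");
        data.setClassIndex(data.numAttributes() - 1);

        PrincipalComponents pca = new PrincipalComponents();
        pca.setVarianceCovered(0.95);          // the -R 0.95 that broke the quoting above

        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(pca);
        filter.setSearch(new Ranker());

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(filter);
        fc.setClassifier(new NaiveBayes());

        Evaluation result = new Evaluation(data);
        result.crossValidateModel(fc, data, 10, new Random(1));
        System.out.println(result.toSummaryString());
    }
}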

TO DO: append the email explaining the workaround for the problem above. It explains the following:

f) CONVENTIONAL WEKA'S WRAPPER

java -Xmx256M weka.filters.AttributeSelectionFilter -S weka.attributeSelection.ForwardSelection -E "weka.attributeSelection.WrapperSubsetEval -B weka.classifiers.SMO" -i tieuboosting_100_train.arff -r tieuboosting_100_test.arff -o conwrappertieuboosting10_train.arff -s conwrappertieuboosting10_test.arff -b -c last

OrderedWrap is possible in Weka, I think - unless I've forgotten some of the details of the paper. You need to use RankSearch as the search method. This search method generates an ordered list of attributes using either an AttributeEvaluator or a SubsetEvaluator. When a SubsetEvaluator is specified, it generates the ranking by using a forward selection search that continues to the far side of the search space (i.e., it keeps adding attributes until all attributes are included). For OrderedWrap, you would specify ClassifierSubsetEval as the parameter to the RankSearch. Furthermore, you would set the "use training data" option to true for this. This combo will then generate a ranking using the specified base classifier according to performance on the training data. Then the correct number of features is selected by evaluating the highest ranked feature, the top two highest ranked features, etc. with respect to a further attribute evaluator - again a ClassifierSubsetEval, but this time using a hold-out set.
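A Java sketch of that recipe, mirroring the ORDERED-FS command in g) below (assuming current class names such as weka.classifiers.functions.SMO; the setUseTraining/setHoldOutFile setters are assumed to correspond to the -T/-H flags):

import java.io.File;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.ClassifierSubsetEval;
import weka.attributeSelection.RankSearch;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class OrderedWrap {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("tieuboosting_100_lesstrain.arff");
        train.setClassIndex(train.numAttributes() - 1);

        // Ranking evaluator: classifier accuracy on the training data (-T).
        ClassifierSubsetEval rankEval = new ClassifierSubsetEval();
        rankEval.setClassifier(new SMO());
        rankEval.setUseTraining(true);

        RankSearch search = new RankSearch();
        search.setAttributeEvaluator(rankEval);   // generates the ordered list

        // Cut-off evaluator: classifier accuracy on a hold-out file (-H).
        ClassifierSubsetEval cutEval = new ClassifierSubsetEval();
        cutEval.setClassifier(new SMO());
        cutEval.setHoldOutFile(new File("tieuboosting_100_heldout.arff"));

        AttributeSelection sel = new AttributeSelection();
        sel.setSearch(search);
        sel.setEvaluator(cutEval);   // picks how many top-ranked features to keep
        sel.SelectAttributes(train);
        System.out.println(sel.toResultsString());
    }
}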

g) ORDERED-FS

Note that it is trained on the lesstrain file and also outputs the lesstrain file; a way of writing out the whole training set still needs to be found. Note: I think this was done with various.DeleteAttributes!

java -Xmx256M weka.filters.AttributeSelectionFilter -S "weka.attributeSelection.RankSearch -A weka.attributeSelection.ClassifierSubsetEval -- -B weka.classifiers.SMO -T" -E "weka.attributeSelection.ClassifierSubsetEval -B weka.classifiers.SMO -H "tieuboosting_100_heldout.arff" " -i tieuboosting_100_lesstrain.arff -r tieuboosting_100_test.arff -o wrappertieuboosting100_lesstrain.arff -s wrappertieuboosting100_test.arff -b -c last

h) STANDARD-WRAP

java -Xmx256M weka.filters.AttributeSelectionFilter -S weka.attributeSelection.ForwardSelection -E "weka.attributeSelection.ClassifierSubsetEval -B weka.classifiers.SMO -H "tieuboosting_10_heldout.arff" " -i tieuboosting_10_lesstrain.arff -r tieuboosting_10_test.arff -o stdwrappertieuboosting10_lesstrain.arff -s stdwrappertieuboosting10_test.arff -b -c last

i) From Mark Hall's fs.exp, which was generated for Sun's JDK 1.3, with Weka 3-3-2:

AttributeSelectedClassifier -B "weka.classifiers.trees.j48.J48 -C 0.25 -M 2" -E "weka.attributeSelection.WrapperSubsetEval -B weka.classifiers.trees.j48.J48 -F 5 -T 0.01 -S 1 -- -C 0.25 -M 2" -S "weka.attributeSelection.RankSearch -A weka.attributeSelection.CfsSubsetEval --"

with the RankSearch attribute evaluator (-A) varied over:

-A weka.attributeSelection.InfoGainAttributeEval --
-A weka.attributeSelection.ConsistencySubsetEval --
-A weka.attributeSelection.PrincipalComponents -- -R 1.0
-A weka.attributeSelection.WrapperSubsetEval -- -B weka.classifiers.trees.j48.J48 -F 5 -T 0.01 -S 1 -- -C 0.25 -M 2

The same base command with the older (pre-3-3) class names:

weka.classifiers.AttributeSelectedClassifier -B "weka.classifiers.j48.J48 -C 0.25 -M 2" -E "weka.attributeSelection.WrapperSubsetEval -B weka.classifiers.j48.J48 -F 5 -T 0.01 -S 1 -- -C 0.25 -M 2" -S "weka.attributeSelection.RankSearch -A weka.attributeSelection.CfsSubsetEval --"
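A Java sketch of this setup with current class names (weka.classifiers.trees.J48 etc.; the setters map onto the flags shown above, and the training file name is hypothetical):

import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.RankSearch;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WrappedJ48 {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("binary.arff");   // hypothetical training file
        train.setClassIndex(train.numAttributes() - 1);

        J48 base = new J48();
        base.setConfidenceFactor(0.25f);   // -C 0.25
        base.setMinNumObj(2);              // -M 2

        WrapperSubsetEval eval = new WrapperSubsetEval();
        eval.setClassifier(new J48());     // -B ... J48
        eval.setFolds(5);                  // -F 5
        eval.setThreshold(0.01);           // -T 0.01
        eval.setSeed(1);                   // -S 1

        RankSearch search = new RankSearch();
        search.setAttributeEvaluator(new CfsSubsetEval());   // -A CfsSubsetEval

        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setClassifier(base);
        asc.setEvaluator(eval);
        asc.setSearch(search);
        asc.buildClassifier(train);
        System.out.println(asc);
    }
}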
