Martin O'Shea
2014-02-13 11:24:33 UTC
Hello
My question(s) concern the value of confidence when using J48 classifiers in
Weka.
To give some background: I am currently writing Java programs to have J48 classify
my own datasets of keyword frequencies from 50 RSS feeds using between 5 and
7 classes. Each training dataset covers 27 days per month and testing data
covers the remaining 3 (or 4) days of each month. I rotate these periods
ten-fold to allow for 'temporal' cross validation, so that at the end, when
I have ten sets of results from Weka, I average them to give a monthly
classification success / failure rate.
Typically a training set for the above is about 1400 records (from the 50
feeds over 27 days) and a test set 150 records (from the remainder of the
month).
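In case it helps to see the rotation itself, this is roughly how I average the ten folds (loadFold(...) is a placeholder for my own database/date-selection code, not a real Weka call, and the J48 settings shown are only illustrative):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class TemporalCrossValidation {

    // Placeholder for my own code that selects a fold's records by date
    // from the database and returns them as Weka Instances.
    static Instances loadFold(int fold, boolean training) throws Exception {
        throw new UnsupportedOperationException("database/date selection goes here");
    }

    public static void main(String[] args) throws Exception {
        final int numFolds = 10;       // ten rotations of the 27-day / 3-4-day split
        double totalPctCorrect = 0.0;

        for (int fold = 0; fold < numFolds; fold++) {
            Instances trainingInstances = loadFold(fold, true);   // ~1400 records
            Instances testingInstances  = loadFold(fold, false);  // ~150 records

            J48 j48 = new J48();
            j48.setOptions(new String[] {"-C", "0.25", "-M", "2"});  // illustrative options
            j48.buildClassifier(trainingInstances);

            Evaluation eval = new Evaluation(trainingInstances);
            eval.evaluateModel(j48, testingInstances);
            totalPctCorrect += eval.pctCorrect();
        }

        // Monthly classification success rate, averaged over the ten folds.
        System.out.println("Average % correct: " + (totalPctCorrect / numFolds));
    }
}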
My basic Java code per iteration of the above is:
Classifier cls = null;
J48 j48 = new J48();
String options = classificationOutputType.getParameters();
String[] optionsArray = options.split(" ");
j48.setOptions(optionsArray);
cls = j48; // Allow cls to store a reference to other Weka classifiers.

// With thanks to: http://www.ibm.com/developerworks/library/os-weka2/
// and http://weka.wikispaces.com/Use+WEKA+in+your+Java+code#Classification-Train/test set

cls.buildClassifier(trainingInstances);
Evaluation eval = new Evaluation(trainingInstances);
eval.evaluateModel(cls, testingInstances);
Where the various parameters for J48 are retrieved from the database at run
time and used to populate the options array.
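For reference, the kind of options string I store in the database looks like the examples below (this is my understanding of the standard J48 flags, with -C being the pruning confidence factor and -M the minimum number of instances per leaf); I believe the same settings can also be applied with the typed setters:

// Illustrative options strings as they might be stored in the database:
//   "-C 0.25 -M 2"   pruned tree, default confidence factor
//   "-C 0.6 -M 2"    confidence above 0.5 -- this is what produces the warning below
//   "-U -M 2"        unpruned tree
J48 j48 = new J48();
j48.setOptions("-C 0.25 -M 2".split(" "));

// Equivalent, as I understand it, via the typed setters rather than the options string:
j48.setConfidenceFactor(0.25f);
j48.setMinNumObj(2);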
However, I find that if the confidence factor goes above 0.5, the following message
is repeated in the output logs:
‘WARNING: confidence value for pruning too high. Error estimate not
modified.’
This is despite an apparently successful classification (with a confidence factor of
0.6 in this case), as follows:
Correctly Classified Instances          100               66.6667 %
Incorrectly Classified Instances         50               33.3333 %
Kappa statistic                           0.539
Mean absolute error                       0.1575
Root mean squared error                   0.3167
Relative absolute error                  52.2998 %
Root relative squared error              81.6067 %
Coverage of cases (0.95 level)           89.3333 %
Mean rel. region size (0.95 level)       40.9333 %
Total Number of Instances               150
=== Detailed Accuracy By Class ===
                 TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
                 0.667    0.007    0.909      0.667   0.769      0.855     Business and finance and economics
                 0.528    0.07     0.704      0.528   0.603      0.762     News and current affairs
                 0.3      0.067    0.529      0.3     0.383      0.769     Science and nature and technology
                 0.933    0        1          0.933   0.966      0.964     Sport
                 0.889    0.344    0.593      0.889   0.711      0.792     Entertainment and arts
Weighted Avg.    0.667    0.155    0.679      0.667   0.651      0.804
=== Confusion Matrix ===
  a  b  c  d  e   <-- classified as
 10  1  1  0  3 | a = Business and finance and economics
  1 19  4  0 12 | b = News and current affairs
  0  3  9  0 18 | c = Science and nature and technology
  0  1  0 14  0 | d = Sport
  0  3  3  0 48 | e = Entertainment and arts
So to my question(s): can I safely ignore this message, given that it is only a
warning? Or am I right in thinking that confidence values above 0.5 actually increase
over-fitting because less pruning is done? If so, is it better to use smaller
confidence values, or to remove pruning altogether?
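To make the alternatives concrete, these are the three configurations I would compare, assuming I have the standard J48 flags right (-C confidence factor, -U unpruned, -R/-N reduced-error pruning with the given number of folds):

// 1. Pruned tree with a smaller (default) confidence factor:
J48 prunedDefault = new J48();
prunedDefault.setOptions(new String[] {"-C", "0.25", "-M", "2"});

// 2. Unpruned tree, so the confidence factor no longer applies:
J48 unpruned = new J48();
unpruned.setOptions(new String[] {"-U", "-M", "2"});

// 3. Reduced-error pruning, which prunes on held-out folds instead of a confidence factor:
J48 reducedError = new J48();
reducedError.setOptions(new String[] {"-R", "-N", "3", "-M", "2"});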
Thanks
Martin O’Shea.
--
View this message in context: http://weka.8497.n7.nabble.com/Value-of-confidence-and-pruning-in-J48-tp30031.html
Sent from the WEKA mailing list archive at Nabble.com.