Discussion:
Value of confidence and pruning in J48
Martin O'Shea
2014-02-13 11:24:33 UTC
Hello

My question(s) concern the value of confidence when using J48 classifiers in
Weka.

To give some background: I am currently writing Java programs to have J48 classify
my own datasets of keyword frequencies from 50 RSS feeds using between 5 and
7 classes. Each training dataset covers 27 days per month and testing data
covers the remaining 3 (or 4) days of each month. I rotate these periods
ten-fold to allow for 'temporal' cross validation, so that at the end, when
I have ten sets of results from Weka, I average them to give a monthly
classification success / failure rate.

Typically a training set for the above is about 1400 records (from the 50
feeds over 27 days) and a test set 150 records (from the remainder of the
month).

My basic Java code per iteration of the above is:

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;

// With thanks to:
// http://www.ibm.com/developerworks/library/os-weka2/ and
// http://weka.wikispaces.com/Use+WEKA+in+your+Java+code#Classification-Train/test set

// J48 parameters arrive as a single space-separated option string.
J48 j48 = new J48();
String options = classificationOutputType.getParameters();
String[] optionsArray = options.split(" ");
j48.setOptions(optionsArray);
Classifier cls = j48; // Allow cls to store a reference to other Weka classifiers.

// Train on the training set, then evaluate on the held-out test set.
cls.buildClassifier(trainingInstances);
Evaluation eval = new Evaluation(trainingInstances);
eval.evaluateModel(cls, testingInstances);

Where the various parameters for J48 are retrieved from the database at run
time and used to populate the options array.

However I find that if the confidence goes above 0.5, I get the repeating
message in the output logs:

‘WARNING: confidence value for pruning too high. Error estimate not
modified.’

Despite a successful classification (0.6 confidence in this case) as
follows:

Correctly Classified Instances 100 66.6667 %
Incorrectly Classified Instances 50 33.3333 %
Kappa statistic 0.539
Mean absolute error 0.1575
Root mean squared error 0.3167
Relative absolute error 52.2998 %
Root relative squared error 81.6067 %
Coverage of cases (0.95 level) 89.3333 %
Mean rel. region size (0.95 level) 40.9333 %
Total Number of Instances 150

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.667    0.007    0.909      0.667   0.769      0.855     Business and finance and economics
               0.528    0.07     0.704      0.528   0.603      0.762     News and current affairs
               0.3      0.067    0.529      0.3     0.383      0.769     Science and nature and technology
               0.933    0        1          0.933   0.966      0.964     Sport
               0.889    0.344    0.593      0.889   0.711      0.792     Entertainment and arts
Weighted Avg.  0.667    0.155    0.679      0.667   0.651      0.804

=== Confusion Matrix ===

  a  b  c  d  e   <-- classified as
 10  1  1  0  3 | a = Business and finance and economics
  1 19  4  0 12 | b = News and current affairs
  0  3  9  0 18 | c = Science and nature and technology
  0  1  0 14  0 | d = Sport
  0  3  3  0 48 | e = Entertainment and arts

So to my question(s): can I ignore this message because it is only a
warning? Or am I right in thinking that higher levels of confidence above
0.5 actually increase over-fitting because of less pruning? Therefore is it
better to use smaller confidence values or to remove pruning altogether?

Thanks

Martin O’Shea.




--
View this message in context: http://weka.8497.n7.nabble.com/Value-of-confidence-and-pruning-in-J48-tp30031.html
Sent from the WEKA mailing list archive at Nabble.com.
Eibe Frank
2014-02-13 21:19:16 UTC
If you use values > 0.5 for the confidence parameter in J48 then
"pruning" will be done based on unmodified classification error on the
training data. This is effectively equivalent to turning pruning off.
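
To illustrate why values above 0.5 switch the correction off, here is a minimal sketch of the kind of pessimistic correction that C4.5-style pruning adds to a leaf's training error. It covers only the simplest case, a leaf with n training instances and zero observed errors, where the upper bound on the error rate is 1 - CF^(1/n). The class and method names are made up for illustration, and this is not WEKA's actual code; the guard for CF > 0.5 simply mirrors the behaviour described above.

```java
public class PessimisticErrorSketch {

    /**
     * Extra errors added to a leaf that covers n training instances and
     * makes zero errors on them, for confidence factor cf.
     * Illustrative only: a full implementation also handles leaves
     * that do make training errors.
     */
    public static double extraErrorsForPureLeaf(double n, double cf) {
        if (cf > 0.5) {
            // Mirrors the warning in this thread: the estimate is left
            // unmodified, so nothing is added to the training error.
            return 0.0;
        }
        // Solve (1 - p)^n = cf for p, then scale to a number of instances.
        return n * (1.0 - Math.pow(cf, 1.0 / n));
    }

    public static void main(String[] args) {
        double n = 10;
        for (double cf : new double[] {0.05, 0.25, 0.5, 0.6}) {
            System.out.printf("CF=%.2f -> extra errors %.3f%n",
                    cf, extraErrorsForPureLeaf(n, cf));
        }
    }
}
```

In this sketch a smaller confidence value gives a larger penalty and therefore more pruning. Once the value exceeds 0.5 the penalty is zero, so a subtree and its candidate leaf are compared on raw training error only.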

Note that if you want to have a fully expanded tree, you should also
turn collapsing off, because collapsing itself amounts to pruning based on
classification error on the training data.
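
As a rough sketch of how this could be wired into an option string that is built at run time: the helper below maps a confidence value above 0.5 to an explicitly unpruned tree and optionally disables collapsing. The class and method names are hypothetical. It assumes a WEKA version in which J48 accepts -C (confidence factor), -U (unpruned tree) and -O (do not collapse tree); please check the option listing of your J48 version before relying on it.

```java
import java.util.ArrayList;
import java.util.List;

public class J48OptionsSketch {

    /**
     * Builds a J48 option array from a confidence value.
     * Values above 0.5 are mapped to an explicitly unpruned tree.
     */
    public static String[] buildOptions(double confidence, boolean collapse) {
        List<String> opts = new ArrayList<>();
        if (confidence > 0.5) {
            opts.add("-U");                        // unpruned tree; -C is omitted on purpose
        } else {
            opts.add("-C");                        // confidence factor used for pruning
            opts.add(String.valueOf(confidence));
        }
        if (!collapse) {
            opts.add("-O");                        // assumed flag: do not collapse the tree
        }
        return opts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // The resulting array would be passed to j48.setOptions(...).
        System.out.println(String.join(" ", buildOptions(0.25, true)));
        System.out.println(String.join(" ", buildOptions(0.6, false)));
    }
}
```

Stating the intent explicitly avoids the repeated warning in the logs and makes it obvious from the stored parameters which runs were effectively unpruned.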

Cheers,
Eibe
Martin O'Shea
2014-02-14 10:21:58 UTC
Thanks Eibe.

So if confidence is > 0.5, effectively no pruning of the tree is
carried out.

But I don’t quite understand your second paragraph: are you saying that
leaving collapsing on will cause pruning despite a high level of confidence?

Also, as described before, my scheme involves a form of temporal
cross-validation. But for both training and testing my data is provided with
a class label. Can J48 in Weka predict the classification of unlabelled
data?

I have tried this using a ? instead of the class but errors result.

Thanks

Martin O'Shea.




Eibe Frank
2014-02-14 22:02:12 UTC
Yes, leaving collapsing on means that J48 (and the original C4.5) will
still perform a form of simple pruning: parts of the tree that do not
improve classification error on the *training* data are discarded. For
example, if you have a node A with two attached leaf nodes that both
predict the same class value, then those leaf nodes are discarded and A
is made into a leaf node.
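
The collapsing rule described above can be sketched on a toy tree. This is a simplified illustration under my own assumptions (the Node class, its per-class count array and the method names are all hypothetical), not the code J48 uses: a subtree is replaced by a leaf whenever it makes at least as many training errors as the leaf would.

```java
import java.util.ArrayList;
import java.util.List;

public class CollapseSketch {

    /** A tree node holding per-class counts of the training instances that reach it. */
    public static class Node {
        final int[] classCounts;
        final List<Node> children = new ArrayList<>();

        public Node(int... classCounts) { this.classCounts = classCounts; }

        public Node add(Node child) { children.add(child); return this; }

        public boolean isLeaf() { return children.isEmpty(); }

        /** Training errors if this node simply predicted its majority class. */
        int errorsAsLeaf() {
            int total = 0, max = 0;
            for (int c : classCounts) { total += c; max = Math.max(max, c); }
            return total - max;
        }

        /** Training errors summed over all leaves below this node. */
        int errorsOfSubtree() {
            if (isLeaf()) return errorsAsLeaf();
            int sum = 0;
            for (Node child : children) sum += child.errorsOfSubtree();
            return sum;
        }
    }

    /** Replaces any subtree that does not reduce training error by a leaf. */
    public static void collapse(Node node) {
        if (node.isLeaf()) return;
        if (node.errorsOfSubtree() >= node.errorsAsLeaf()) {
            node.children.clear();                 // node becomes a leaf
        } else {
            for (Node child : node.children) collapse(child);
        }
    }

    public static void main(String[] args) {
        // Node A: both leaves predict class 0, so the split gains nothing.
        Node a = new Node(8, 2).add(new Node(5, 1)).add(new Node(3, 1));
        // Node B: the split separates the classes perfectly, so it is kept.
        Node b = new Node(5, 5).add(new Node(5, 0)).add(new Node(0, 5));
        collapse(a);
        collapse(b);
        System.out.println("A is leaf: " + a.isLeaf());
        System.out.println("B is leaf: " + b.isLeaf());
    }
}
```

Node A corresponds to the example in the paragraph above: its two leaves predict the same class, so the subtree makes the same number of training errors as a single leaf and is discarded.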

If you want to use your trees for a task other than classification, e.g.
class probability estimation, then it can sometimes be better to turn
collapsing off.

All classifiers in WEKA can output classifications for unlabelled data.
(Otherwise, they would be pretty useless!)

To get a classification for an unlabeled instance, just represent the
missing label with a missing value in the corresponding WEKA Instance
and feed it to the classifier to classify.

Cheers,
Eibe
Martin O'Shea
2014-02-14 22:22:44 UTC
Thanks Eibe.

It is odd. Several sources have told me that if using J48 in Weka, class
labels were always needed. However, I have labels in both training and
testing data for my supervised use of J48 and other types.

But concerning 'To get a classification for an unlabeled instance, just
represent the missing label with a missing value in the corresponding WEKA
Instance and feed it to the classifier to classify.': if I have the iris
dataset as follows:

@RELATION iris

@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,
4.7,3.2,1.3,0.2,Iris-setosa

And instance two lacks a class, is it alright to leave the class value empty
or does it need a ? or other special character?

Thanks.



Eibe Frank
2014-02-14 22:25:58 UTC
No, you can't leave the class empty. You need a missing value ("?").
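
If the test file is generated from a program, a small helper along these lines could make sure an unlabelled row always ends in "?" rather than an empty field. This is only a sketch: the class and method names are hypothetical, and it does not handle quoting of labels that contain spaces.

```java
public class ArffRowSketch {

    /**
     * Formats one ARFF data row from numeric attribute values.
     * A null class label is written as the missing-value marker "?".
     */
    public static String toDataRow(double[] values, String classLabel) {
        StringBuilder sb = new StringBuilder();
        for (double v : values) {
            sb.append(v).append(',');              // e.g. 3.0 is written as "3.0"
        }
        sb.append(classLabel == null ? "?" : classLabel);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toDataRow(new double[] {5.1, 3.5, 1.4, 0.2}, "Iris-setosa"));
        System.out.println(toDataRow(new double[] {4.9, 3.0, 1.4, 0.2}, null));
    }
}
```

With the iris rows above, the second instance would be written as 4.9,3.0,1.4,0.2,? instead of ending in a bare comma.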

Cheers,
Eibe
Martin O'Shea
2014-02-14 22:35:55 UTC
So in a test data file, the instances with missing class values should be:

@RELATION iris

@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.6,2.8,4.9,2.0,?
7.7,2.8,6.7,2.0,?

But if I run this file against the full iris data as training data, the
instances are not included. Why is this?



Eibe Frank
2014-02-14 22:44:23 UTC
Accuracy, etc., can obviously not be computed when the class values are
missing.

You need to turn on output of predictions under "More options..." in the
Classifier panel of the Explorer (similarly, in the KnowledgeFlow or
from the command-line, you need to tell WEKA to output predictions).

Cheers,
Eibe
Martin O'Shea
2014-02-14 23:03:25 UTC
Thanks Eibe. I will try this next week.



