Calculating AUC from ROC curves for the NaiveBayes classifier in WEKA

Hans van Rijnberk, Assort Vision, Utrecht

2004-01-14 16:09:51 UTC

When Weka is used at the Java level, a method getROCArea(Instances tcurve) of
the weka.classifiers.evaluation.ThresholdCurve class is available. The Java
level itself is a hurdle to overcome, and on top of that this method uses the
trapezoid method to estimate the AUC. A parametric approximation (Metz) is not
generally applicable, but there is a very good and general approximation
which is easier to calculate. Thus far I have done the calculation with the
help of Excel. At the moment I am trying to implement it in Java; first I am
working on a better understanding of Java and the Weka implementation.
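
For comparison, the trapezoid method itself can be sketched in a few lines of
plain Java. This is only an illustration of the rule, not Weka's actual source;
the class and method names are made up, and the ROC points are assumed to be
given as two arrays sorted by increasing FP rate, including (0,0) and (1,1).

```java
/** Illustrative trapezoid-rule area under an ROC curve (hypothetical helper, not Weka code). */
final class TrapezoidAuc {

    /**
     * @param fpRate false positive rates, sorted ascending, first 0.0 and last 1.0
     * @param tpRate true positive rates at the same positions
     */
    static double area(double[] fpRate, double[] tpRate) {
        if (fpRate.length != tpRate.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double area = 0.0;
        for (int i = 1; i < fpRate.length; i++) {
            // Width of the strip times the mean height of its two sides.
            area += (fpRate[i] - fpRate[i - 1]) * (tpRate[i] + tpRate[i - 1]) / 2.0;
        }
        return area;
    }
}
```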

The calculation method is as follows:
I used the threshold curve from Weka for the (cross-)validation results. The
data for this curve can be saved explicitly from the results panel (choose
threshold curve). This provides a csv file which can be imported into Excel or
processed by other software. From there:

1. Extract the TP and FP rates (forming the ROC curve) and keep them in
threshold (= P) order.

2. Use the nonparametric approximation of the AUC (Area Under the Curve)
based on the Mann-Whitney rank sum statistic, as follows. First add the points
(TP,FP) = (0,0) and (TP,FP) = (1,1) to the ROC curve (Weka leaves them out)
and link them to the most extreme threshold values (usually 0 and 1 if the
threshold is scaled between 0 and 1).

a. The threshold order (0 to 1) determines the rank (assign the average rank
to observations with the same threshold to correct for ties).

b. Use the ranks to calculate the Mann-Whitney statistic
U = Npos*Nneg + Nneg*(Nneg+1)/2 - R,
with
R    = the rank sum of the negative sample,
Npos = the number of positives in the sample and
Nneg = the number of negatives in the sample

c. AUC = U/(Npos * Nneg)
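
Steps a to c can be sketched in Java as below. This is a minimal sketch, not
the Weka implementation: it assumes that a score and the actual class are
available for every observation (rather than the rows of the threshold
curve), that higher scores mean "more positive", and the class and method
names are only illustrative.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative rank-sum (Mann-Whitney) AUC; names are hypothetical. */
final class RankSumAuc {

    /**
     * @param scores   classifier output per observation (higher = more positive)
     * @param positive true where the observation is an actual positive
     */
    static double auc(double[] scores, boolean[] positive) {
        int n = scores.length;
        Integer[] order = new Integer[n];
        for (int k = 0; k < n; k++) {
            order[k] = k;
        }
        // Step a: sort observation indices by ascending score (threshold order 0 to 1).
        Arrays.sort(order, Comparator.comparingDouble((Integer idx) -> scores[idx]));

        double rankSumNeg = 0.0; // R, the rank sum of the negatives
        long nPos = 0;
        long nNeg = 0;
        int i = 0;
        while (i < n) {
            // Find the run [i, j) of observations sharing the same score.
            int j = i;
            while (j < n && scores[order[j]] == scores[order[i]]) {
                j++;
            }
            // Ranks are 1-based, so the run holds ranks i+1 .. j; ties get their average.
            double avgRank = (i + 1 + j) / 2.0;
            for (int k = i; k < j; k++) {
                if (positive[order[k]]) {
                    nPos++;
                } else {
                    nNeg++;
                    rankSumNeg += avgRank;
                }
            }
            i = j;
        }
        if (nPos == 0 || nNeg == 0) {
            throw new IllegalArgumentException("need at least one positive and one negative");
        }
        // Step b: U = Npos*Nneg + Nneg*(Nneg+1)/2 - R
        double u = (double) nPos * nNeg + nNeg * (nNeg + 1) / 2.0 - rankSumNeg;
        // Step c: AUC = U / (Npos * Nneg)
        return u / ((double) nPos * nNeg);
    }
}
```

For the four observations with scores 0.1, 0.4, 0.35, 0.8 and classes
negative, negative, positive, positive, the negatives get ranks 1 and 3, so
R = 4, U = 4 + 3 - 4 = 3 and AUC = 3/4.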
---
3. The standard error of the AUC can be approximated using the Hanley-McNeil
method, as follows:
--------------
/* For clarity, the same symbols are used as in the Hanley-McNeil
   paper. */
double n_A = Npos;    // number of positive ("abnormal") cases
double n_N = Nneg;    // number of negative ("normal") cases
double theta = AUC;

double theta2 = theta * theta;
double Q1 = theta / (2 - theta);
double Q2 = 2 * theta2 / (1 + theta);
double se2 = (theta * (1 - theta)
              + (n_A - 1) * (Q1 - theta2)
              + (n_N - 1) * (Q2 - theta2)) / (n_A * n_N);

double SE_auc = Math.sqrt(se2);
--------------
Assumptions:
0.0 < AUC < 1.0, Npos > 0 and Nneg > 0
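
Wrapped up as a self-contained Java method with the assumptions checked, the
calculation could look like the sketch below. The class and method names are
only illustrative and nothing here comes from Weka.

```java
/** Illustrative Hanley-McNeil standard error of the AUC; names are hypothetical. */
final class HanleyMcNeil {

    /**
     * @param auc  estimated area under the ROC curve, strictly between 0 and 1
     * @param nPos number of positives in the sample
     * @param nNeg number of negatives in the sample
     */
    static double standardError(double auc, int nPos, int nNeg) {
        if (!(auc > 0.0 && auc < 1.0) || nPos <= 0 || nNeg <= 0) {
            throw new IllegalArgumentException("need 0 < AUC < 1, Npos > 0 and Nneg > 0");
        }
        double theta2 = auc * auc;
        double q1 = auc / (2.0 - auc);
        double q2 = 2.0 * theta2 / (1.0 + auc);
        double se2 = (auc * (1.0 - auc)
                + (nPos - 1) * (q1 - theta2)
                + (nNeg - 1) * (q2 - theta2)) / ((double) nPos * nNeg);
        return Math.sqrt(se2);
    }
}
```

As a check on the arithmetic: for AUC = 0.8 with 50 positives and 50
negatives, Q1 = 0.6667 and Q2 = 0.7111, which gives a standard error of about
0.0445.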

The threshold values for every observation can also be obtained from the
classification function if it is made external (as for Weka logistic
regression), and the TP and FP fractions then follow from the cumulative
frequencies of positives and negatives when sorted by the classification
function values.
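
The cumulative-frequency idea can be sketched in Java as follows. It is a
sketch under assumptions: the scores and actual classes are passed in as plain
arrays, higher scores mean "more positive", and the class and method names
are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative ROC points from per-observation scores; names are hypothetical. */
final class RocFromScores {

    /** Returns rows {fpRate, tpRate}, starting at (0,0) and ending at (1,1). */
    static double[][] rocPoints(double[] scores, boolean[] positive) {
        int n = scores.length;
        int nPos = 0;
        for (boolean p : positive) {
            if (p) {
                nPos++;
            }
        }
        int nNeg = n - nPos;
        if (nPos == 0 || nNeg == 0) {
            throw new IllegalArgumentException("need at least one positive and one negative");
        }
        Integer[] order = new Integer[n];
        for (int k = 0; k < n; k++) {
            order[k] = k;
        }
        // Descending score: lowering the threshold admits the highest scores first.
        Arrays.sort(order, Comparator.comparingDouble((Integer idx) -> scores[idx]).reversed());

        List<double[]> points = new ArrayList<>();
        points.add(new double[] {0.0, 0.0});
        int tp = 0; // cumulative count of positives at or above the threshold
        int fp = 0; // cumulative count of negatives at or above the threshold
        int i = 0;
        while (i < n) {
            double threshold = scores[order[i]];
            // Observations with equal scores cross the threshold together.
            while (i < n && scores[order[i]] == threshold) {
                if (positive[order[i]]) {
                    tp++;
                } else {
                    fp++;
                }
                i++;
            }
            points.add(new double[] {(double) fp / nNeg, (double) tp / nPos});
        }
        return points.toArray(new double[0][]);
    }
}
```

Each row is one threshold; the last row is always (1,1) because by then every
observation has been counted.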

Hopefully this helps.

Kind regards
Hans van Rijnberk

Hi. Is it possible in WEKA to calculate AUC (Area Under Curve) from the
ROC curves (the threshold curves for NaiveBayes)?
Does anyone know how that can be done, if it is possible?
Sincerely yours, Thora Jonsdottir
Þóra Jónsdóttir, MSc., physicist
Krabbameinsmiðstöð Landspítala-háskólasjúkrahúss
Skógarhlíð 12
105 Reykjavík
tel. 543 6901
www.km.lsh.is
 
_______________________________________________
Wekalist mailing list
https://list.scms.waikato.ac.nz/mailman/listinfo/wekalist

Hans van Rijnberk

Assort Vision (machine vision software & information services)

Tirol 64
3524 KM Utrecht,
the Netherlands

031 (0)30 2148681 / 2889531
***@wanadoo.nl /(***@knoware.nl)