1
Information Geometry on Classification
Logistic, AdaBoost, Area under ROC curve
Shinto Eguchi
ISM seminar on 17/1/2001
This talk is based on joint work with Dr J. Copas.
2
Outline
Problem setting for classification
overview of classification methods
Dw classification
Dw divergence of discriminant functions
definition from the Neyman-Pearson Lemma; expected and observed expressions
examples of Dw: logistic regression, AdaBoost, area under ROC curve, hit rate, credit scoring, medical screening
Structure of Dw risk functions
optimal Dw under near-logistic, implemented by cross-validation
Risk scores of skin cancer
area under ROC curve; comparison and discussion of other methods
[ http://juban.ism.ac.jp/ ]
3
Standard methods
Fisher linear discriminant analysis [4]
Logistic regression [Cornfield, 1962]
Multilayer perceptron
[http://juban.ism.ac.jp/file_ppt/公開講座(ニューラル).ppt]
New approaches
Boosting (combining weak learners)
AdaBoost
[http://juban.ism.ac.jp/file_ppt/公開講座(Boost).ppt]
Support vector machine (VC dimension)
[http://juban.ism.ac.jp/file_ppt/open-svm12-21.ppt]
Kernel method (Mercer theorem)
[http://juban.ism.ac.jp/file_ppt/主成分発表原稿.ppt]
4
Problem setting
input vector x in the input space X
output variable y in {1, ..., K}
Definition: a map h : X → {1, ..., K} is a classifier if h is onto.
X = R_1 ∪ ... ∪ R_K (direct sum),
where R_k = h^{-1}(k) is the k-th decision space.
5
Probabilistic model
Joint distribution of x, y:  P(x, y) = π(y) p(x | y),
where π(y) is the prior distribution and p(x | y) is the conditional distribution of x given y.
Misclassification:
error rate  P(h(x) ≠ y)
hit rate  P(h(x) = 1 | y = 1)  (in the binary case)
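The error rate and hit rate above can be illustrated numerically; the joint distribution, its numbers, and all names below are made up for the sketch and are not from the talk.

```python
# Toy illustration of the probabilistic model: P(x, y) = prior(y) * cond(x | y)
# on a three-point input space, with the Bayes classifier plugged in.
prior = {0: 0.6, 1: 0.4}                       # prior distribution pi(y)
cond = {0: {0: 0.7, 1: 0.2, 2: 0.1},           # p(x | y = 0)
        1: {0: 0.1, 1: 0.4, 2: 0.5}}           # p(x | y = 1)

def classify(x):
    # Bayes rule: pick the class with the larger joint probability
    return max(prior, key=lambda y: prior[y] * cond[y][x])

# error rate P(classify(x) != y)
error_rate = sum(prior[y] * cond[y][x]
                 for y in prior for x in cond[y] if classify(x) != y)
# hit rate P(classify(x) = 1 | y = 1)
hit_rate = sum(cond[1][x] for x in cond[1] if classify(x) == 1)
```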
6
discriminant function
classifier
Bayes rule: given P(x, y), classify x into the class with the largest posterior probability P(y | x).
Training data (examples): (x_i, y_i), i = 1, ..., n,
with x_i the i-th input and y_i the i-th output.
7
Reduction of our problem to binary classification
output variable y in {0, 1}
log-likelihood ratio λ(x) = log{ p(x | y = 1) / p(x | y = 0) }
discriminant function
classifier
error rate
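A minimal sketch of the binary reduction; the Gaussian class-conditional model and all numbers are illustrative assumptions, not the talk's model.

```python
import math

# Discriminant function as the log-likelihood ratio
#   F(x) = log p(x | y=1) - log p(x | y=0) + log(pi1 / pi0),
# with the classifier h(x) = 1 if F(x) > 0 else 0.
def loglik_ratio(x, mu0=0.0, mu1=2.0, sigma=1.0, pi1=0.5):
    def logpdf(t, mu):  # log density of N(mu, sigma^2)
        return (-0.5 * ((t - mu) / sigma) ** 2
                - math.log(sigma * math.sqrt(2.0 * math.pi)))
    return logpdf(x, mu1) - logpdf(x, mu0) + math.log(pi1 / (1.0 - pi1))

def classify(x):
    return 1 if loglik_ratio(x) > 0 else 0
```

With equal priors and equal variances the decision boundary sits at the midpoint of the two means, here x = 1.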
8
Other loss functions for classification
Credit scoring [5]
A cost model: a profit if y = 1, a loss if y = 0.
General setting
Let c(k, y) be the cost of classifying a case with true class y as class k.
The expected cost is E[ c(h(x), y) ].
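A toy version of the general cost setting; the cost matrix, joint distribution, and classifier below are all illustrative assumptions.

```python
# c[k][y] is the cost of classifying a case with true class y as class k;
# the expected cost is the average of c(h(x), y) under P(x, y).
cost = {0: {0: 0.0, 1: 5.0},    # classify as 0: costly if truly y = 1
        1: {0: 1.0, 1: 0.0}}    # classify as 1: small cost if truly y = 0

P = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}  # joint P(x, y)

def h(x):
    return x  # a fixed classifier: predict y = x

expected_cost = sum(p * cost[h(x)][y] for (x, y), p in P.items())
```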
9
                   y = 1             y = 0
classify as 1      hit               false positive
classify as 0      false negative    correct rejection
ROC (Receiver Operating Characteristic) curve
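The ROC curve and the area under it can be sketched as follows; the scores below are made-up data. Each threshold gives one point (false positive rate, hit rate), and the AUC equals the probability that a random positive case scores higher than a random negative case.

```python
scores_pos = [2.0, 1.5, 0.8]    # scores of y = 1 cases
scores_neg = [0.5, 1.0, -0.3]   # scores of y = 0 cases

def roc_point(threshold):
    # one point of the ROC curve: (false positive rate, hit rate)
    hit = sum(s > threshold for s in scores_pos) / len(scores_pos)
    false_alarm = sum(s > threshold for s in scores_neg) / len(scores_neg)
    return false_alarm, hit

# AUC by the rank (Mann-Whitney) formula
auc = (sum(sp > sn for sp in scores_pos for sn in scores_neg)
       / (len(scores_pos) * len(scores_neg)))
```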
10
Main story
linear discriminant function
Given training data
objective function
proposed estimator
What is (U, V)?
Logistic is OK.
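A minimal sketch of the estimation step for the logistic choice of (U, V) ("Logistic is OK"): fit a linear discriminant F(x) = a*x + b by gradient descent on the empirical logistic loss. The data, learning rate, and iteration count are my assumptions.

```python
import math

# minimize sum_i log(1 + exp(-z_i * F(x_i))), with z_i = 2*y_i - 1
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

a, b = 0.0, 0.0
lr = 0.1
for _ in range(2000):
    ga = gb = 0.0
    for x, y in zip(xs, ys):
        z = 2 * y - 1
        p = 1.0 / (1.0 + math.exp(z * (a * x + b)))  # sigma(-z * F(x))
        ga += -z * x * p                             # gradient in a
        gb += -z * p                                 # gradient in b
    a -= lr * ga
    b -= lr * gb

def predict(x):
    return 1 if a * x + b > 0 else 0
```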
11
log-likelihood ratio
discriminant function
A reinterpretation of Neyman-Pearson Lemma
Proposition
Remark
12
Proof of Proposition
13
Divergence Dw of discriminant function
Def.
Expectation expression
14
Proof
15
Sample expression given a set of training data
Minimum Dw method
for a statistical model F
16
Examples of Dw divergence
(1) logistic regression
(2) Hit rate, Credit scoring, medical screening
17
(3) Area under ROC curve
(4) AdaBoost: this Dw is the loss function of AdaBoost, cf. [7], [8].
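Two of the loss functions behind the Dw examples can be compared directly, written as functions of the margin m = z * F(x) (this margin parameterization is my notational assumption): the logistic loss and AdaBoost's exponential loss.

```python
import math

def logistic(m):
    return math.log(1.0 + math.exp(-m))  # logistic regression loss

def adaboost(m):
    return math.exp(-m)                  # AdaBoost's exponential loss

# the exponential loss penalizes badly misclassified (negative-margin)
# cases much more heavily than the logistic loss
penalties = [(m, logistic(m), adaboost(m)) for m in (-2.0, 0.0, 2.0)]
```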
18
Structure of Dw risk functions
optimal Dw under near-logistic, implemented by cross-validation
Logistic(linear)-parametric model
model distribution of x, y:
19
Estimating equation of minimum Dw methods
Remark
20
Cauchy-Schwarz inequality
Parametric assumption
21
Near-parametric assumption
22
Our risk function of an estimator is
But our situation is
Let
Cross-validated risk estimate
the bias term is
where
variance term is
where the estimate is computed from the training data by leaving the i-th example out.
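The leave-one-out idea above can be sketched as follows; the model, data, and tuning constants are my assumptions. Refit the estimator with the i-th example removed, then average the loss of each held-out case to estimate the risk.

```python
import math

xs = [-2.0, -1.0, 0.5, 1.0, 2.0, -0.5]
ys = [0, 0, 1, 1, 1, 0]

def fit(xs, ys, steps=500, lr=0.1):
    # one-parameter logistic fit F(x) = a * x by gradient descent
    a = 0.0
    for _ in range(steps):
        g = sum(-(2 * y - 1) * x / (1.0 + math.exp((2 * y - 1) * a * x))
                for x, y in zip(xs, ys))
        a -= lr * g
    return a

def loo_risk(xs, ys):
    total = 0.0
    for i in range(len(xs)):
        a_i = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])  # leave i out
        margin = (2 * ys[i] - 1) * a_i * xs[i]
        total += math.log(1.0 + math.exp(-margin))           # held-out loss
    return total / len(xs)

risk = loo_risk(xs, ys)
```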
23
24
Outlier
For
25
Note :
where
34
References
[1] Begg, C. B., Satagopan, J. M. and Berwick, M. (1998). A new strategy for evaluating the impact of epidemiologic risk factors for cancer with applications to melanoma. J. Amer. Statist. Assoc. 93, 415-426.
[2] Berwick, M., Begg, C. B., Fine, J. A., Roush, G. C. and Barnhill, R. L. (1996). Screening for cutaneous melanoma by self skin examination. J. National Cancer Inst. 88, 17-23.
[3] Eguchi, S. and Copas, J. (2000). A Class of Logistic-type Discriminant Functions. Technical Report of Department of Statistics, University of Warwick.
[4] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.
[5] Hand, D. J. and Henley, W. E. (1997). Statistical classification methods in consumer credit scoring: a review. J. Roy. Statist. Soc., A, 160, 523-541.
[6] McLachlan, G. J. (1992). Discriminant Analysis and Statistical Pattern Recognition. Wiley: New York.
[7] Schapire, R., Freund, Y., Bartlett, P. and Lee, W. S. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist. 26, 1651-1686.
[8] Vapnik, V. N. (1999). The Nature of Statistical Learning Theory. Springer: New York.