Robust Artificial Intelligence Reading Seminar; Tsinghua University
Thomas G. Dietterich, Oregon State University, [email protected]
Lecture 2: Rejection
• Given:
  • Training data $(x_1, y_1), \dots, (x_N, y_N)$
  • Target accuracy level $1 - \epsilon$
  • Learn a classifier $f$ and a rejection rule $r$
• At run time:
  • Given query $x_q$
  • If $r(x_q) < 0$, REJECT
  • Else classify with $f(x_q)$
Papers for Today
• Cortes, C., DeSalvo, G., & Mohri, M. (2016). Learning with rejection. Lecture Notes in Artificial Intelligence, 9925 LNAI, 67–82. http://doi.org/10.1007/978-3-319-46379-7_5
• Shafer, G., & Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research, 9, 371–421. Retrieved from http://arxiv.org/abs/0706.3188
• Papadopoulos, H. (2008). Inductive Conformal Prediction: Theory and Application to Neural Networks. Book chapter. https://www.researchgate.net/publication/221787122_Inductive_Conformal_Prediction_Theory_and_Application_to_Neural_Networks
Basic Theory
• Suppose $f^*(x, y) = P(y|x)$ is the optimal probabilistic classifier
• The best prediction is $\hat{y} = \arg\max_y f^*(x, y)$
• Then the optimal rejection rule is to REJECT if $f^*(x, \hat{y}) < 1 - \epsilon$ (Chow, 1970)
[Figure: class-posterior probabilities on the interval [0, 1]; REJECT wherever the maximum posterior falls below $1 - \epsilon$]
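Chow's rule is a one-liner once (calibrated) class-probability estimates are available. A minimal sketch (the function name is mine):

```python
import numpy as np

def chow_rule(probs, epsilon):
    """Chow (1970): predict the argmax class, but REJECT (return None)
    whenever its estimated probability falls below 1 - epsilon."""
    probs = np.asarray(probs, dtype=float)
    y_hat = int(np.argmax(probs))
    return y_hat if probs[y_hat] >= 1.0 - epsilon else None
```

For example, with $\epsilon = 0.2$ the rule accepts `[0.9, 0.06, 0.04]` (predicting class 0) but rejects `[0.5, 0.3, 0.2]`.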
Non-Optimal Case
• If $f$ is not optimal, we can still determine a threshold with performance guarantees
• Let $\big(f(x_i, \hat{y}_i),\; I[\hat{y}_i = y_i]\big)$, $i = 1, \dots, N$, be a set of calibration data points
• Sort them by $\hat{p}(\hat{y}_i | x_i) = f(x_i, \hat{y}_i)$
• Choose the smallest threshold $\tau$ such that if $f(x_i, \hat{y}_i) > \tau$, then the fraction of correct predictions is $1 - \epsilon$
[Figure: calibration points sorted by $\hat{p}(\hat{y}|x)$; $\tau$ is placed so that the points above it are correct a fraction $1 - \epsilon$ of the time]
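The sort-and-scan procedure above can be sketched as follows (a naive implementation; the array names are mine, and ties at the threshold are accepted):

```python
import numpy as np

def calibrate_threshold(scores, correct, epsilon):
    """Smallest observed score tau such that calibration points scoring
    at least tau are correct a fraction >= 1 - epsilon of the time.
    Returns None if no threshold achieves the target accuracy."""
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(scores)                 # ascending confidence
    c_sorted = correct[order]
    # suffix_acc[i] = accuracy among the i-th smallest score and above
    suffix_acc = c_sorted[::-1].cumsum()[::-1] / np.arange(len(c_sorted), 0, -1)
    ok = np.where(suffix_acc >= 1.0 - epsilon)[0]
    return scores[order][ok[0]] if len(ok) else None
```

With scores `[0.2, 0.4, 0.6, 0.8]` and correctness `[0, 1, 1, 1]`, the accuracy above 0.4 is 100%, so $\tau = 0.4$ at $\epsilon = 0.1$.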
Finite Sample (PAC) Guarantee
• Massart (1990): $P\big(\sqrt{n}\, \sup_x |\hat{F}_n(x) - F(x)| > \lambda\big) \le 2\exp(-2\lambda^2)$
• Set $x := \tau$ and $\eta := \lambda/\sqrt{n}$, so $P\big(|\hat{F}_n(\tau) - F(\tau)| > \eta\big) \le 2\exp(-2\lambda^2)$
• Set $\delta = 2\exp(-2\lambda^2)$ and solve for $n$:
  • $\lambda = \eta\sqrt{n}$
  • $\delta = 2\exp(-2\eta^2 n)$
  • $\log(\delta/2) = -2\eta^2 n$
  • $n = \frac{1}{2\eta^2}\log\frac{2}{\delta}$
• If $n > \frac{1}{2\eta^2}\log\frac{2}{\delta}$, then with probability $1 - \delta$ the true accuracy of the accepted predictions is at least $1 - \epsilon - \eta$
[Figure: the thresholding picture from the previous slide, with $\tau$ chosen on the empirical CDF]
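Solving the bound for $n$ gives a simple calibration-set-size calculator (a sketch; the function name is mine):

```python
import math

def calibration_sample_size(eta, delta):
    """Minimum n so that, by Massart's (1990) tight DKW constant, the
    empirical CDF at the threshold is within eta of the true CDF with
    probability at least 1 - delta:  n >= log(2/delta) / (2 * eta^2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eta ** 2))
```

For example, guaranteeing $\eta = 0.01$ at $\delta = 0.05$ requires about 18,445 calibration points, while $\eta = 0.1$ needs only 185.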
Related Work
• Geifman & El-Yaniv (2017):
  • Develop confidence scores based on either the softmax ("SR") or Monte Carlo dropout ("MC-dropout")
  • Binary search for the threshold
  • Use an exact binomial confidence interval instead of Massart's bound
  • Union bound over the binary search queries
Cost-Sensitive Rejection
• Cost Matrix:

  Actions:          Predict 1    Predict 2    Reject
  $P(y = 1 | x)$    0            $c_{12}$     $c_{1r}$
  $P(y = 2 | x)$    $c_{21}$     0            $c_{2r}$

• Optimal Classifier:
  • If $\hat{p}(y = 1 | x) \ge \tau_1$, predict 1
  • If $\hat{p}(y = 2 | x) \ge \tau_2$, predict 2
  • Else REJECT
• Search all pairs $(\tau_1, \tau_2)$ to minimize expected cost
• Pietraszek (2005) provides a fast algorithm based on (a) isotonic regression and (b) computing the slopes of the ROC curve corresponding to $\tau_1$ and $\tau_2$
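The threshold-pair search can be sketched as a naive $O(n^2)$ scan over candidate thresholds on a calibration set (Pietraszek's ROC-based algorithm is much faster; this brute-force version, with names of my choosing, only illustrates the objective):

```python
import numpy as np

def best_thresholds(p1, y, c12, c21, c1r, c2r):
    """Brute-force search for (tau1, tau2) minimizing average cost.
    p1[i] = estimated P(y=1 | x_i); y[i] in {1, 2}.  Ties where both
    thresholds fire are broken toward class 1."""
    p1 = np.asarray(p1, dtype=float)
    y = np.asarray(y)
    candidates = np.unique(np.concatenate([p1, 1 - p1, [0.0, 1.0]]))
    best = (np.inf, None, None)
    for t1 in candidates:
        for t2 in candidates:
            pred1 = p1 >= t1
            pred2 = (1 - p1 >= t2) & ~pred1
            reject = ~pred1 & ~pred2
            cost = (c12 * np.sum(pred2 & (y == 1)) +
                    c21 * np.sum(pred1 & (y == 2)) +
                    c1r * np.sum(reject & (y == 1)) +
                    c2r * np.sum(reject & (y == 2))) / len(y)
            if cost < best[0]:
                best = (cost, t1, t2)
    return best
```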
Support Vector Machines
• Key insight: maximize the margin around the decision boundary
• Three strategies:
  • Fit a standard SVM, then calibrate or threshold it
  • Fit a double-hinge-loss (DHL) SVM that maximizes the margin around the rejection thresholds
  • Fit two separate functions (a classifier and a rejection function) that maximize the margins around the rejection thresholds
Reminder: Standard SVM
• Linear classifier that maximizes the margin between positive and negative examples
• $y \in \{+1, -1\}$, so $y_i f(x_i) > 0$ means $x_i$ is classified correctly

$\min_{w,b,\xi}\; C\|w\|^2 + \sum_i \xi_i \quad\text{subject to}\quad y_i(w^\top x_i + b) + \xi_i \ge 1 \;\;\forall i$

• The $\xi_i$ are "slack variables" that measure how badly we misclassify $x_i$
• $C$ is the regularization parameter
Double Hinge Loss (Herbei & Wegkamp, 2006; Bartlett & Wegkamp, 2008)
• Assume the cost of rejection is $c$
• Reject if $|f(x)| < \delta$
• Loss function $\ell_c(yf(x))$:
  • if $yf(x) < -\delta$: $\ell_c = 1$
  • if $yf(x) \in [-\delta, +\delta]$: $\ell_c = c$
  • if $yf(x) > +\delta$: $\ell_c = 0$
• Convex upper bound $\phi_c$:
  • if $yf(x) < 0$: $\phi_c = 1 - a\,yf(x)$
  • if $yf(x) \in [0, 1]$: $\phi_c = 1 - yf(x)$
  • if $yf(x) > 1$: $\phi_c = 0$
[Figure: the step loss $\ell_c$ (values 1, $c$, 0) and its convex upper bound $\phi_c$, plotted against $yf(x)$]
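The piecewise definition of $\phi_c$ translates directly into code; following Bartlett & Wegkamp (2008), the negative-margin slope is $a = (1-c)/c$:

```python
def double_hinge(margin, c):
    """Convex surrogate phi_c for the reject-option loss ell_c.
    `margin` = y * f(x); c in (0, 1/2) is the rejection cost; the
    slope a = (1 - c) / c follows Bartlett & Wegkamp (2008)."""
    a = (1.0 - c) / c
    if margin < 0.0:
        return 1.0 - a * margin   # steep penalty for confident mistakes
    if margin <= 1.0:
        return 1.0 - margin       # ordinary hinge segment
    return 0.0                    # correct with margin: no loss
```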
DHL Optimization Problem
• $\min_{w,b,\xi,\gamma}\; \sum_i \left(\xi_i + \frac{1-2c}{c}\gamma_i\right)$ subject to
  • $y_i(w^\top x_i + b) + \xi_i \ge 1$
  • $y_i(w^\top x_i + b) + \gamma_i \ge 0$
  • $\xi_i \ge 0;\;\; \gamma_i \ge 0$
  • $\sum_{i,j} \alpha_i \alpha_j (x_i \cdot x_j) \le r^2$ (regularization constraint)
• This is a quadratically-constrained quadratic program, so it can be solved, but it is not easy
Non-Optimal Case (2)
• Defining the rejection function in terms of $h$ assumes that the probability of error is monotonically related to $\hat{p}(y|x)$
• We saw last lecture that this is not necessarily true
• We can try to fix $h$, or we can learn a more complex $r$ function
• This is unlikely to be a problem for flexible models, but it could be a problem for linear and SVM methods
Method 3: Learn an $(f, r)$ Pair (Cortes, DeSalvo & Mohri, 2016)
• Two-dimensional loss function:
  • If $r(x) \ge 0$ and $yf(x) \ge 0$: loss = 0
  • If $r(x) \ge 0$ and $yf(x) < 0$: loss = 1
  • If $r(x) < 0$: loss = $c$
[Figure: the loss over the $(yf(x), r(x))$ plane: 0 where both are nonnegative, 1 where $r(x) \ge 0$ but $yf(x) < 0$, and $c$ wherever $r(x) < 0$]
Convex Upper Bound
• $L_{MH}(r, f, x, y) = \max\left(1 + \frac{1}{2}\big(r(x) - yf(x)\big),\;\; c\left(1 - \frac{r(x)}{1-2c}\right),\;\; 0\right)$
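The max-hinge surrogate is easy to evaluate pointwise (a direct transcription of the formula above):

```python
def max_hinge(f_x, r_x, y, c):
    """L_MH(r, f, x, y): convex upper bound on the (f, r) rejection loss
    from Cortes, DeSalvo & Mohri (2016); requires 0 < c < 1/2."""
    return max(1.0 + 0.5 * (r_x - y * f_x),       # misclassification term
               c * (1.0 - r_x / (1.0 - 2.0 * c)), # rejection term
               0.0)
```

Note that the loss reaches zero only when the example is both accepted and classified with sufficient margin (e.g. $f(x) = 4$, $r(x) = 2$, $y = +1$), while a negative $r(x)$ activates the rejection term.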
CHR Optimization Problem
• $f(x) = w^\top x + b$
• $r(x) = u^\top x + b'$
• $\min_{w,u,\xi}\; \frac{\lambda}{2}\|w\|^2 + \frac{\lambda'}{2}\|u\|^2 + \sum_i \xi_i$ subject to
  • $c\left(1 - \frac{u^\top x_i + b'}{1-2c}\right) \le \xi_i$
  • $1 + \frac{1}{2}\big(u^\top x_i + b' - y_i(w^\top x_i + b)\big) \le \xi_i$
  • $\xi_i \ge 0$
• By minimizing the $\xi_i$ we minimize the max of these three terms
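As a rough illustration of what the optimization does, one can minimize the regularized max-hinge objective for linear $f$ and $r$ by stochastic subgradient descent. This is my simplification (the paper solves the problem as a QP), with hyperparameters chosen for the sketch:

```python
import numpy as np

def train_chr(X, y, c=0.25, lam=0.01, lam_r=0.01, lr=0.05, epochs=500, seed=0):
    """Sketch: subgradient descent on the regularized L_MH objective for
    f(x) = w.x + b and r(x) = u.x + b2, with labels y in {+1, -1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d); b = 0.0
    u = np.zeros(d); b2 = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            fx = X[i] @ w + b
            rx = X[i] @ u + b2
            t1 = 1.0 + 0.5 * (rx - y[i] * fx)       # misclassification term
            t2 = c * (1.0 - rx / (1.0 - 2.0 * c))   # rejection term
            gw = lam * w; gb = 0.0
            gu = lam_r * u; gb2 = 0.0
            if t1 >= max(t2, 0.0):                  # t1 active in the max
                gw += -0.5 * y[i] * X[i]; gb += -0.5 * y[i]
                gu += 0.5 * X[i];         gb2 += 0.5
            elif t2 > 0.0:                          # t2 active in the max
                s = -c / (1.0 - 2.0 * c)
                gu += s * X[i]; gb2 += s
            w -= lr * gw; b -= lr * gb
            u -= lr * gu; b2 -= lr * gb2
    return w, b, u, b2
```

On separable 1-D data, the learned classifier weight points in the correct direction, and $r(x)$ settles near the boundary between accepting and rejecting.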
Experimental Tests
[Figure: bar chart of total loss (misclassification + rejection) for DHL vs. CHR on the liver, bank, skin, pima, australian, cod, and haber datasets]
Error on Non-Rejected Points
[Figure: bar chart of misclassification loss on non-rejected points for DHL vs. CHR on the same seven datasets]
Note: DHL modified to reject the same number of points as CHR
Reject Option Conclusions
• Basic thresholding is easy and gives PAC guarantees
• 2-class thresholding with differential costs is easy
• $K$-class thresholding?
• Thresholding SVMs is interesting:
  • Focus the "margin" on the reject boundaries
  • Learning an $(f, r)$ pair is better than optimizing the double hinge loss
• Open question: how to jointly train DNNs and a rejection function
Conformal Prediction (online version)
• Given:
  • Training data $z_1, \dots, z_{n-1}$, where $z_i = (x_i, y_i)$
  • Classifier $f$ trained on the training data
  • Nonconformity measure $A_n : \mathcal{Z}^{n-1} \times \mathcal{Z} \mapsto \mathbb{R}$
  • Query $x_q$
  • Accuracy level $\delta$
• Find:
  • A set $C(x_q) \subseteq \{1, \dots, K\}$ such that $y_q \in C(x_q)$ with probability $1 - \delta$
• Method:
  • For each $k$, let $z_n^k = (x_q, k)$
  • $\forall i:\; \alpha_i^k := A\big(\{z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_n\},\, z_i\big)$ ("how different is $z_i$ from the rest of the $z$ values?")
  • Let $p_k$ = fraction of $\alpha_1^k, \dots, \alpha_n^k$ that are $\ge \alpha_n^k$
  • $C(x_q) = \{k : p_k \ge \delta\}$
  • Output $C(x_q)$
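The full (transductive) procedure can be sketched end-to-end. Here `nonconf(rest_X, rest_y, x, label)` stands for any nonconformity measure $A$; that callback signature is my assumption:

```python
import numpy as np

def conformal_set(X, y, x_q, K, delta, nonconf):
    """Online/transductive conformal prediction: for each candidate label
    k, augment the data with (x_q, k), score every example against the
    rest, and keep k if the p-value of the query's score is >= delta."""
    C = []
    for k in range(K):
        Xa = np.vstack([X, x_q])
        ya = np.append(y, k)
        n = len(ya)
        alpha = np.array([
            nonconf(np.delete(Xa, i, axis=0), np.delete(ya, i), Xa[i], ya[i])
            for i in range(n)])
        p_k = np.mean(alpha >= alpha[-1])  # alpha[-1] is the query's score
        if p_k >= delta:
            C.append(k)
    return C
```

Note the cost: every query requires recomputing all $n$ nonconformity scores for each of the $K$ candidate labels, which motivates the inductive variant later in the lecture.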
Examples of Nonconformity Measures
• Conditional probability method:
  • Train a probabilistic classifier $f$ on $z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_n$
  • Then compute $A\big(\{z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_n\},\, z_i\big) = -\log f(z_i)$
• Nearest-neighbor nonconformity:
  • $A(B, z) = \dfrac{\text{distance to the nearest } z' \in B \text{ in the same class}}{\text{distance to the nearest } z' \in B \text{ in a different class}}$
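The nearest-neighbor measure is a two-liner with numpy (Euclidean distance assumed):

```python
import numpy as np

def nn_nonconformity(B_X, B_y, x, label):
    """Nearest-neighbor nonconformity A(B, z) for z = (x, label):
    (distance to the nearest same-class point in B) divided by
    (distance to the nearest different-class point in B).
    Large values mean (x, label) is unusual."""
    d = np.linalg.norm(B_X - x, axis=1)
    return d[B_y == label].min() / d[B_y != label].min()
```

For example, a point at 0.5 near a class-0 cluster scores about 0.05 under label 0 but 19.0 under label 1.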
Additional Information
• In addition to outputting $C(x_q)$, we can output:
  • $\hat{y}_q = \arg\max_k p_k$ (the best prediction)
  • $p_q = \max_k p_k$ (the p-value of the best prediction)
  • $1 - \max_{k \ne \hat{y}_q} p_k$ (the "confidence"; we have more confidence if the second-best p-value is small)
Batch (“inductive”) Conformal Prediction
• Divide the data into training and calibration sets
• Train $f$ on the training data
• Let $z_1, \dots, z_n$ be the calibration data
• Let $\alpha_1, \dots, \alpha_n$ be the nonconformity scores of the calibration data:
  • $\alpha_i := A\big(\{z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_n\},\, z_i\big)$
• Given query $x_q$:
  • For $k = 1, \dots, K$:
    • Let $z_q^k = (x_q, k)$
    • Let $\alpha_q^k = A\big(\{z_1, \dots, z_n\},\, z_q^k\big)$
    • Let $p_k$ = fraction of $\alpha_1, \dots, \alpha_n, \alpha_q^k$ that are $\ge \alpha_q^k$
  • $C(x_q) = \{k : p_k \ge \delta\}$
• Key difference: $z_q^k$ does not affect the other nonconformity scores
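Because the calibration scores are fixed, the per-query step reduces to counting (a sketch; the function name is mine):

```python
import numpy as np

def icp_predict(alpha_cal, alpha_query, delta):
    """Inductive conformal prediction step: alpha_cal holds the fixed
    calibration nonconformity scores; alpha_query[k] is the query's
    score under candidate label k.  Returns the prediction set."""
    alpha_cal = np.asarray(alpha_cal, dtype=float)
    C = []
    for k, a_q in enumerate(alpha_query):
        # p-value: fraction of the n+1 scores that are >= the query's score
        p_k = (np.sum(alpha_cal >= a_q) + 1) / (len(alpha_cal) + 1)
        if p_k >= delta:
            C.append(k)
    return C
```

A label with a very low nonconformity score gets p-value near 1 and is always included; one whose score exceeds every calibration score gets p-value $1/(n+1)$ and is excluded for any $\delta > 1/(n+1)$.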
Almost Equivalent to Learning a Threshold
• Let $\tau$ = the $(1 - \delta)$ quantile of $\alpha_1, \dots, \alpha_n$
• Given query $x_q$:
  • For $k = 1, \dots, K$:
    • Let $z_q^k = (x_q, k)$
    • Let $\alpha_q^k = A\big(\{z_1, \dots, z_n\},\, z_q^k\big)$
  • $C(x_q) = \{k : \alpha_q^k \le \tau\}$ (include every label whose nonconformity does not exceed $\tau$)
• Additional difference: $\tau$ is computed without considering $\alpha_q^k$
• If $n$ is large enough, this does not matter
Experimental Results
• Nonconformity measures:
  • Resubstitution:
    • Train $f$ on all the data
    • Let $\hat{y}_i = f(x_i)$
    • $A\big(\{z_1, \dots, z_N\},\, (x_i, k)\big) = I[\hat{y}_i \ne k]$
  • Leave One Out:
    • Train $f$ on $z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_N$
    • Let $\hat{y}_i = f(x_i)$
    • $A\big(\{z_1, \dots, z_{i-1}, z_{i+1}, \dots, z_N\},\, (x_i, k)\big) = I[\hat{y}_i \ne k]$
Satellite
[Figure: error rate ($y_i \notin C(x_i)$) for the resubstitution and leave-one-out nonconformity measures]
Shuttle
[Figure: resubstitution vs. leave-one-out calibration curves]
Segmentation
[Figure: resubstitution vs. leave-one-out calibration curves]
Pendigits + Random Forest (Dietterich, unpublished)
• Train a random forest on half of the UCI training set
• Use the predicted class probability $\hat{P}(y = k | x)$ as the (non)conformity score
• Compute $\tau$ values using the other half of the training set
• Compute $C$ on the test set
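This protocol can be sketched with scikit-learn on synthetic data (the dataset, $\epsilon$, and function names here are stand-ins; the slide computes per-class $\tau$ values, while a single global $\tau$ is my simplification; labels are assumed to be $0, \dots, K-1$):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def conformal_rf(X, y, epsilon=0.05, seed=0):
    """Train a random forest on half the data; on the other half, use the
    predicted probability of the true class as a conformity score and set
    tau at its epsilon-quantile.  A label enters the prediction set when
    its predicted probability is >= tau."""
    X_tr, X_cal, y_tr, y_cal = train_test_split(
        X, y, test_size=0.5, random_state=seed)
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X_tr, y_tr)
    # conformity score of each calibration point: prob of its true class
    cal_scores = rf.predict_proba(X_cal)[np.arange(len(y_cal)), y_cal]
    tau = np.quantile(cal_scores, epsilon)

    def predict_set(X_new):
        P = rf.predict_proba(X_new)
        return [list(np.where(p >= tau)[0]) for p in P]

    return predict_set, tau
```

By construction, roughly a $1 - \epsilon$ fraction of calibration points have their true class above $\tau$, so test-set coverage should land near $1 - \epsilon$.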
Cumulative Distribution Function for Class “9”
[Figure: empirical CDF of the p-values for class "9", with the region corresponding to $C(x, \text{"9"})$ marked]
Pendigits Results
• All $\tau$ values were 0 (for $\epsilon = 0.001$)
• Probability $y \in \Gamma(x)$ = 0.9997
• Abstention rate = 0.72
• Sizes of the prediction sets $\Gamma$ (fraction of test points by set size):

  Size:      0     1     2     3     4     5     6     7     8     9     10
  Fraction:  0.00  0.28  0.18  0.12  0.11  0.11  0.07  0.06  0.04  0.02  0.00
Simple Thresholding of $\max_k \hat{p}(y = k | x)$
[Figure: number of validation errors vs. the threshold on $\hat{p}_{\max}$, over thresholds from 0 to 1]
Zoomed In: $\tau = 0.87$ for $\delta = 0.05$
[Figure: the same plot zoomed to thresholds 0.7–1.0, showing 0–8 validation errors]
Test Set Results
• Probability of correct classification: 0.9987
• Rejection rate: 33.4%
  • [The conformal-prediction abstention rate was 72%]
Another Use Case: Lexicon Reduction
• US Postal Service address-reading task (Madhvanath, Kleinberg, & Govindaraju, 1997)
• Two classifiers:
  • Method 1: fast but not always accurate
  • Method 2: slower but more accurate
    • Can only afford to run it on 1/3 of the envelopes
    • Faster if it can be focused on a subset of the classes
• Apply conformal prediction using Method 1
• Eliminate as many classes as possible
• Apply Method 2 if $|C(x)| > 1$
Summary
• Lecture 1: Calibration
• Lecture 2: Rejection
  • Method 1: threshold $f$ with single or multiple thresholds
    • Multiple thresholds require a change in the SVM methodology
  • Method 2: learn a separate rejection function and threshold it
  • Method 3 (conformal): use thresholding to construct a confidence set
    • Reject if $|C(x_q)| \ne 1$
    • Can perform "lexicon reduction"
• In my experience, conformal prediction is not good for rejection, but more experiments are needed
Next Lecture
• All of these methods assume a closed world
• What happens when queries may belong to "alien" classes not observed during training?
• Papers:
  • Bendale, A., & Boult, T. (2016). Towards open set deep networks. In CVPR 2016 (pp. 1563–1572). http://doi.org/10.1109/CVPR.2016.173
  • Liu, S., Garrepalli, R., Dietterich, T. G., Fern, A., & Hendrycks, D. (2018). Open category detection with PAC guarantees. Proceedings of the 35th International Conference on Machine Learning, PMLR, 80, 3169–3178. http://proceedings.mlr.press/v80/liu18e.html
Citations
• Bartlett, P., & Wegkamp, M. (2008). Classification with a reject option using a hinge loss. JMLR.
• Chow, C. K. (1970). On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1), 41–46.
• Cortes, C., DeSalvo, G., & Mohri, M. (2016). Learning with rejection. Lecture Notes in Artificial Intelligence, 9925, 67–82. http://doi.org/10.1007/978-3-319-46379-7_5
• Geifman, Y., & El-Yaniv, R. (2017). Selective classification for deep neural networks. NIPS 2017. arXiv:1705.08500
• Herbei, R., & Wegkamp, M. (2006). Classification with reject option. Canadian Journal of Statistics.
• Madhvanath, S., Kleinberg, E., & Govindaraju, V. (1997). Empirical design of a multi-classifier thresholding/control strategy for recognition of handwritten street names. International Journal of Pattern Recognition and Artificial Intelligence, 11(6), 933–946. https://doi.org/10.1142/S0218001497000421
• Massart, P. (1990). The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality. Annals of Probability, 18(3), 1269–1283.
• Papadopoulos, H. (2008). Inductive conformal prediction: Theory and application to neural networks. Book chapter. https://www.researchgate.net/publication/221787122_Inductive_Conformal_Prediction_Theory_and_Application_to_Neural_Networks
• Pietraszek, T. (2005). Optimizing abstaining classifiers using ROC analysis. In ICML 2005.
• Shafer, G., & Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research, 9, 371–421. http://arxiv.org/abs/0706.3188