
SIGKDD 2016 - SHUAI LI

Date post: 22-Jan-2018
Upload: shuai-li

1. Solving a classification problem first may be wasteful

2. Need to address class distribution drift in test sets

Quantification Performance Measures

1. Capture quantification goals directly, OR
2. Balance quantification and classification goals (hybrid)
3. Challenging to optimize on voluminous, streaming data

1. Receive a data point
2. Fix dual variables, take SGD step to update model
3. Fix model, take SGD steps to update dual variables
4. Updates extremely cheap: closed form for dual variables
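The four steps above can be illustrated with a small Python sketch. It is a simplified, single-level illustration under stated assumptions, not the authors' NEMSIS implementation: the measure is NegKLD written as a concave function of (TPR, TNR), the model is linear with log-sigmoid surrogate rewards, the true prevalence `p` is assumed known, and the dual variables are set in closed form to the gradient of the measure at the running TPR/TNR estimates. The names `train`, `dual_variables`, `make_stream`, and the constant `EPS` are hypothetical.

```python
import math
import random

EPS = 0.01  # clipping keeps logs and dual variables bounded (assumption)

def predicted_prevalence(p, tpr, tnr):
    """Fraction predicted positive, written in terms of TPR and TNR."""
    return min(max(p * tpr + (1 - p) * (1 - tnr), EPS), 1 - EPS)

def neg_kld(p, tpr, tnr):
    """Negative KL divergence between true and predicted prevalence."""
    q = predicted_prevalence(p, tpr, tnr)
    return -(p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q)))

def dual_variables(p, tpr, tnr):
    """Closed-form dual update: the gradient of neg_kld in (TPR, TNR)."""
    q = predicted_prevalence(p, tpr, tnr)
    slope = p / q - (1 - p) / (1 - q)   # d neg_kld / d q
    return slope * p, -slope * (1 - p)  # chain rule through q

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train(stream, p, dim, lr=0.1):
    """Alternate one primal SGD step and one dual update per data point."""
    w = [0.0] * dim
    alpha, beta = 1.0, 1.0  # initial dual variables (assumption)
    pos = pos_hit = neg = neg_hit = 0
    for t, (x, y) in enumerate(stream, start=1):  # 1. receive a data point
        score = sum(wi * xi for wi, xi in zip(w, x))
        # 2. Duals fixed: ascend alpha * TPR + beta * TNR on log-sigmoid surrogates.
        if y == 1:
            g = alpha / p * (1.0 - sigmoid(score))
            pos += 1
            pos_hit += score > 0
        else:
            g = -beta / (1 - p) * sigmoid(score)
            neg += 1
            neg_hit += score <= 0
        step = lr / math.sqrt(t)
        w = [wi + step * g * xi for wi, xi in zip(w, x)]
        # 3. Model fixed: refresh duals from running (actual) TPR/TNR, in closed form.
        tpr = pos_hit / pos if pos else 0.5
        tnr = neg_hit / neg if neg else 0.5
        alpha, beta = dual_variables(p, tpr, tnr)
    return w, (alpha, beta)

def make_stream(n, p, seed=0):
    """Synthetic two-class stream with a bias feature (illustration only)."""
    rng = random.Random(seed)
    stream = []
    for _ in range(n):
        y = 1 if rng.random() < p else 0
        stream.append(([rng.gauss(1.0 if y else -1.0, 1.0), 1.0], y))
    return stream
```

A call such as `train(make_stream(300, 0.3), p=0.3, dim=2)` returns the learned weights and the final dual variables. The dual variables vanish once the predicted prevalence matches `p`, which reflects that this sketch targets a pure quantification measure rather than classification accuracy.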

Goal: Estimate the relative prevalence of classes of interest in large unlabeled populations in online, streaming settings

Applications of Quantification

Sentiment Analysis

KatyCipriano

The best part of the meal is the dessert which they dont make themselves – just sayin. @bouzagloabc

2 hours ago


JuliaChild

Loved the food – worth the 45 minute wait! Can’t wait for my Sunday brunch at ABC. @bouzagloabc

1 hour ago


GordonRamsay

It was RAAAAW. @bouzagloabc

3 days ago


PaulaDeen

@GordonRamsay Samy the owner threw me out just for pointing that out! Disastrous service

2 days ago


Several applications directly require estimates of class ratios

Quantification a.k.a. Counting, Class probability re-estimation, Class prior estimation

Epidemiology

Challenges

Online Optimization Methods for the Quantification Problem

Purushottam Kar¹, Shuai Li², Harikrishna Narasimhan³, Sanjay Chawla⁴, Fabrizio Sebastiani⁴

¹IIT Kanpur, India, ²University of Insubria, Italy, ³Harvard University, USA, ⁴QCRI-HBKU, Qatar

Full Paper: http://tinyurl.com/quantonline

22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining

Quantification Performance Measures

†: Quantification Performance Measure, ‡: Hybrid Performance Measure

Nested Concave Measures Pseudo-concave Measures

NegKLD†

QMeasure‡

BAKLD‡

CQReward‡

BKReward‡

Normalized Square Score†

Nested Concave Measures

1. Dual computation of nested functions difficult, costly updates
2. Solution: apply duality to nested functions in nested manner!

Key Idea

1. Use the level set function as a proxy objective function
2. Exploit the fact that the level set functions are concave
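The level-set idea can be sketched on a toy problem. The Python example below is an illustration under assumptions, not the poster's CAN algorithm itself: it assumes the measure is a ratio `numer / denom` with a concave numerator and a convex, positive denominator, so that the level function `numer - level * denom` is concave for any non-negative level. It then alternates between finding a new level and maximizing the proxy over a finite grid of candidates. The function name `alternate_level_and_proxy` and the two toy functions are hypothetical.

```python
def alternate_level_and_proxy(candidates, numer, denom, max_rounds=100):
    """Maximize numer/denom by alternating level updates and proxy maximization."""
    x = candidates[0]
    level = numer(x) / denom(x)
    for _ in range(max_rounds):
        # Optimize proxy: maximize the level function numer - level * denom.
        x_new = max(candidates, key=lambda c: numer(c) - level * denom(c))
        # Find new level: evaluate the measure at the new point.
        new_level = numer(x_new) / denom(x_new)
        if new_level <= level + 1e-12:
            break  # no further progress in the measure
        x, level = x_new, new_level
    return x, level

# Toy stand-ins (assumptions): a concave numerator and a convex, positive denominator.
def toy_numer(x):
    return 1.0 - (x - 0.7) ** 2

def toy_denom(x):
    return 1.0 + x ** 2

grid = [i / 100 for i in range(101)]
best_x, best_level = alternate_level_and_proxy(grid, toy_numer, toy_denom)
```

Each round can only raise the level, so progress on the proxy translates into progress on the ratio itself. On this grid the loop stops at the candidate with the largest ratio.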

Key Idea

Fenchel Duality

Level Set Structure

Level sets are convex

Fenchel “dual”

Dual variables

Any ccv function

Linear in TPR and TNR for fixed values of dual variables!
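The formula that these labels annotated did not survive extraction. As a hedged reminder of the standard result being invoked (the symbols $\Psi$, $\Psi^{*}$, $\alpha$, $\beta$, and $w$ are assumptions, not recovered from the poster), Fenchel duality for a closed concave function of two arguments can be written as:

```latex
% Concave Fenchel conjugate of a concave function \Psi(u, v)
\Psi^{*}(\alpha, \beta) = \inf_{u, v} \bigl\{ \alpha u + \beta v - \Psi(u, v) \bigr\}

% A closed concave \Psi is recovered from its conjugate
\Psi(u, v) = \inf_{\alpha, \beta} \bigl\{ \alpha u + \beta v - \Psi^{*}(\alpha, \beta) \bigr\}

% Plugging in u = TPR(w), v = TNR(w) turns maximization into a saddle-point problem
\max_{w} \Psi\bigl(\mathrm{TPR}(w), \mathrm{TNR}(w)\bigr)
  = \max_{w} \min_{\alpha, \beta}
    \bigl\{ \alpha \, \mathrm{TPR}(w) + \beta \, \mathrm{TNR}(w) - \Psi^{*}(\alpha, \beta) \bigr\}
```

For fixed $(\alpha, \beta)$ the inner expression is linear in TPR and TNR, which is what makes the per-point model updates cheap.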

NEMSIS (streaming)

Pseudo Concave Measures

CAN (non-streaming)

Guarantee for NEMSIS, SCAN

1. Execute E and M steps approximately in “streaming epochs”
2. E epochs use streaming data to estimate
3. M epochs execute NEMSIS on streaming data - optimize proxy
4. Epochs made progressively longer: more accurate E, M steps

SCAN (streaming)

Find new level

Optimize proxy

Progress in proxy provably linked to progress in perf.

Level function

ccv cvx ccv

E M E M E M E M …

Guarantee for CAN

Experimental Results

ccv: concave, cvx: convex

Superior accuracies and training times across quant and hybrid measures as well as datasets

NS: dual updates made using actual TPR/TNR values not surrogates

[Result plots on the KDD08, PPI, Covertype, Adult, and Cod-RNA datasets]

Attractive trade-off b/w quant/class performance using BAKLD perf.

Robustness to drift in class proportions (smaller is better in PosKLD)

Theoretical Guarantees

Classification accuracy: 50%. But … #False pos. = #False neg. ⇒ Perfect quantification (Perfect classification impossible)
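A tiny worked example makes this observation concrete. The Python sketch below uses four made-up labels (an assumption for illustration) in which half the predictions are wrong, yet the false positives and false negatives cancel, so the estimated class prevalence is exact. The helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    tp = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 1)
    fp = sum(1 for t, q in zip(y_true, y_pred) if t == 0 and q == 1)
    fn = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 0)
    tn = sum(1 for t, q in zip(y_true, y_pred) if t == 0 and q == 0)
    return tp, fp, fn, tn

def prevalence(labels):
    """Fraction of items labeled positive."""
    return sum(labels) / len(labels)

y_true = [1, 1, 0, 0]  # made-up ground truth
y_pred = [1, 0, 1, 0]  # made-up predictions: one false negative, one false positive

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)    # 0.5
true_prev = prevalence(y_true)        # 0.5
estimated_prev = prevalence(y_pred)   # 0.5, identical because fp == fn
```

Counting the predicted positives is the simplest "classify and count" estimate of prevalence. It is exact here only because the two error types balance.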

Balanced Accuracy (BA)

Observation: All quantification measures naturally nested concave or pseudo concave – exploit to optimize scalably?

Psephology

Cause-specific Mortality analysis

Transfer Learning
