Foundations For Learning in the Age of Big Data
Maria-Florina Balcan
Modern Machine Learning
New applications. Explosion of data.
Modern applications: massive amounts of raw data.
Only a tiny fraction can be annotated by human experts.
Billions of webpages Images Protein sequences
Modern ML: New Learning Approaches
• Semi-supervised Learning, (Inter)active Learning.
Techniques that best utilize data, minimizing the need for expert/human intervention.
Modern applications: massive amounts of data
distributed across multiple locations.
Modern ML: New Learning Approaches
E.g., scientific data, video data.
Key new resource: communication.
Outline of the talk
• Interactive Learning
  • Noise tolerant poly time active learning algos.
  • Learning with richer interaction.
  • Implications to passive learning.
• Distributed Learning
  • Model communication as key resource.
  • Communication efficient algos.
Supervised Learning
• E.g., which emails are spam and which are important.
• E.g., classify objects as chairs vs. non-chairs.
Statistical / PAC learning model
• Data Source: distribution D on X; examples xi drawn i.i.d. from D, labeled by an Expert / Oracle according to the target c* : X → {0,1}.
• Algo sees labeled examples (x1,c*(x1)), …, (xm,c*(xm)) and outputs a hypothesis h : X → {0,1}.
• Does optimization over the labeled sample S, finds hypothesis h ∈ C.
• Goal: h has small error, err(h) = Pr_{x~D}[h(x) ≠ c*(x)].
• c* ∈ C: realizable case; else agnostic.
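A minimal sketch of this model (Python/numpy, with a hypothetical halfspace target and hypothesis, not from the talk), estimating err(h) on a fresh i.i.d. sample from D:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
w_star = rng.normal(size=d)                    # hypothetical target: a halfspace
w_hat = w_star + 0.1 * rng.normal(size=d)      # some learned hypothesis h

def c_star(X): return (X @ w_star >= 0).astype(int)   # c* : X -> {0,1}
def h(X):      return (X @ w_hat  >= 0).astype(int)   # h  : X -> {0,1}

# err(h) = Pr_{x ~ D}[h(x) != c*(x)], estimated on a fresh sample from D (here D = N(0, I))
X = rng.normal(size=(100_000, d))
print("estimated err(h):", np.mean(h(X) != c_star(X)))
```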
Two Main Aspects in Classic Machine Learning
Algorithm Design. How to optimize?
Automatically generate rules that do well on observed data.
Generalization Guarantees, Sample Complexity
Confidence for rule effectiveness on future data.
E.g., Boosting, SVM, etc.
Sample complexity: O((1/ε)·(VCdim(C)·log(1/ε) + log(1/δ))) labeled examples suffice.
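As a rough illustration (a minimal sketch, not from the talk), the bound can be evaluated numerically; the constant c is hidden in the O(·) and is a placeholder here:

```python
import math

def pac_sample_bound(vcdim, eps, delta, c=1.0):
    """Classic PAC sample-complexity bound (realizable case), up to the constant c:
    m = O((1/eps) * (vcdim*log(1/eps) + log(1/delta)))."""
    return math.ceil(c / eps * (vcdim * math.log(1.0 / eps) + math.log(1.0 / delta)))

# e.g., linear separators in R^10 (VCdim = 11), eps = 0.01, delta = 0.05
print(pac_sample_bound(vcdim=11, eps=0.01, delta=0.05))
```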
Interactive Machine Learning
• Active Learning
• Learning with more general queries; connections
Active Learning
[Diagram: raw data → the learning algorithm selects examples to query → Expert Labeler → Classifier (e.g., face vs. not face)]
Active Learning in Practice
• Text classification: active SVM (Tong & Koller, ICML 2000).
• e.g., request label of the example closest to the current separator.
• Video segmentation (Fathi-Balcan-Ren-Rehg, BMVC 11).
Active Algorithm: Canonical Theoretical Example [CAL92, Dasgupta04]
Learning a threshold w on the line (labels are - on one side of w and + on the other).
• Passive supervised: Ω(1/ε) labels to find an ε-accurate threshold.
• Active: sample O(1/ε) unlabeled examples; do binary search, requesting only O(log 1/ε) labels.
Exponential improvement in label complexity.
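A minimal sketch of this example (Python/numpy; the target threshold w* and the convention that + lies to the right of it are hypothetical choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)

def active_threshold(xs, label_fn):
    """Binary search for a threshold over a pool of unlabeled points.
    Uses O(log n) label queries instead of labeling all n points."""
    xs = np.sort(xs)
    lo, hi, queries = 0, len(xs) - 1, 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if label_fn(xs[mid]) == 1:   # + : threshold is at or to the left of mid
            hi = mid
        else:                        # - : threshold is to the right of mid
            lo = mid + 1
    return xs[lo], queries

# Hypothetical target threshold w* = 0.62; pool of ~1/eps unlabeled draws from U[0,1]
w_star = 0.62
pool = rng.uniform(0, 1, size=1000)
w_hat, q = active_threshold(pool, lambda x: int(x >= w_star))
print(f"estimated threshold {w_hat:.3f} using {q} label queries")   # ~10 queries for 1000 points
```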
Provable Guarantees, Active Learning
Disagreement Based Active Learning
“Disagreement-based” algos: query points from the current region of disagreement; throw out hypotheses when statistically confident they are suboptimal.
First analyzed in [Balcan, Beygelzimer, Langford’06] for the A2 algo.
Lots of subsequent work: [Hanneke07, DasguptaHsuMonteleoni’07, Wang’09, Fridman’09, Koltchinskii10, BHW’08, BeygelzimerHsuLangfordZhang’10, Hsu’10, Ailon’12, …]
Generic (any class), adversarial label noise.
Suboptimal in label complexity & computationally prohibitive.
Poly Time, Noise Tolerant, Label Optimal AL Algos.
Margin Based Active Learning
• Realizable: exponential improvement, only O(d log 1/ε) labels to find w with error ε when D is log-concave.
• Agnostic & malicious noise: poly-time AL algo outputs w with err(w) = O(η), where η = err(best linear separator).
• First poly-time AL algo in noisy scenarios!
• Improves on the noise tolerance of the previous best passive algos [KKMS’05], [KLS’09] too!
• First for malicious noise [Val85] (features corrupted too).
• Resolves an open question on the sample complexity of ERM.
[Balcan-Long COLT’13] [Awasthi-Balcan-Long STOC’14]
Learning linear separators, when D is log-concave in Rd.
Margin Based Active-Learning, Realizable Case
Draw m1 unlabeled examples, label them, add them to W(1).
iterate k = 2, …, s
• find a hypothesis wk-1 consistent with W(k-1).
• W(k) = W(k-1).
• sample mk unlabeled examples x satisfying |wk-1 · x| ≤ γk-1
• label them and add them to W(k).
[Figure: successive hypotheses w1, w2, w3 and their shrinking margin bands γ1, γ2]
Theorem: If D is log-concave in Rd, then after s = O(log 1/ε) rounds, err(ws) ≤ ε.
• Active learning: O(d log 1/ε) label requests in total, only O(d) labels per round, using Õ(d/ε) unlabeled examples.
• Passive learning: Ω(d/ε) label requests.
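A minimal runnable sketch of the realizable-case algorithm (Python/numpy). The perceptron stands in for "find a hypothesis consistent with W(k)", and the band-width schedule, sample sizes, and round count are illustrative choices rather than the constants from the theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
w_star = rng.normal(size=d); w_star /= np.linalg.norm(w_star)   # hidden target separator
oracle = lambda X: np.sign(X @ w_star)                           # expert/oracle labels

def consistent_separator(X, y, epochs=200):
    """Perceptron run until consistent with the labeled working set (possible in the
    realizable case); a stand-in for 'find a hypothesis w consistent with W(k)'."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            break
    return w / (np.linalg.norm(w) + 1e-12)

# Margin-based active learning, realizable case; D = N(0, I_d) (log-concave).
m_per_round, rounds = 40, 8
X = rng.normal(size=(m_per_round, d)); y = oracle(X)     # W(1): initial labeled draw
labels_used = m_per_round
for k in range(2, rounds + 1):
    w = consistent_separator(X, y)
    band = 2.0 ** (-(k - 1))                             # heuristic gamma_{k-1} schedule
    kept = np.empty((0, d))
    while len(kept) < m_per_round:                       # draw unlabeled x with |w·x| <= gamma
        cand = rng.normal(size=(2000, d))
        kept = np.vstack([kept, cand[np.abs(cand @ w) <= band]])
    Xk = kept[:m_per_round]; yk = oracle(Xk)             # labels requested only inside the band
    X, y = np.vstack([X, Xk]), np.concatenate([y, yk])
    labels_used += m_per_round

w_final = consistent_separator(X, y)
X_test = rng.normal(size=(50_000, d))
print("labels used:", labels_used,
      "| test error:", np.mean(np.sign(X_test @ w_final) != oracle(X_test)))
```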
Margin Based Active-Learning, Realizable Case
Log-concave distributions: the log of the density function is concave.
• Wide class: uniform distribution over any convex set, Gaussian, Logistic, etc.
• Major role in sampling & optimization [LV’07, KKMS’05, KLT’09].
Linear Separators, Log-Concave Distributions
Fact 1: [relates the disagreement between two linear separators u and v to the angle between them]
Proof idea:
• project the region of disagreement onto the 2-dimensional space spanned by u and v;
• use properties of log-concave distributions in 2 dimensions.
Fact 2: [bounds the probability mass of a margin band around a separator v]
Fact 3: [for separators u and v that are close, bounds their disagreement outside a band around v]
Margin Based Active-Learning, Realizable Case: Proof Idea
iterate k = 2, …, s
• find a hypothesis wk-1 consistent with W(k-1).
• W(k) = W(k-1).
• sample mk unlabeled examples x satisfying |wk-1 · x| ≤ γk-1
• label them and add them to W(k).
Induction: all w consistent with W(k) have error ≤ 1/2^k; so, wk has error ≤ 1/2^k.
Proof Idea
Consider any w consistent with W(k); split its error into error outside the band {x : |wk-1 · x| ≤ γk-1} and error inside the band.
• Under a log-concave distribution, for a suitable choice of γk-1, the error of w outside the band is ≤ 1/2^{k+1}.
• So it is enough to ensure that the error contribution inside the band is also ≤ 1/2^{k+1}; can do with only O(d) labels per round.
[Figure: hypotheses wk-1, w, w* and the band of width γk-1 around wk-1]
Margin Based Analysis [Balcan-Long, COLT13]
Theorem (Active, Realizable): D log-concave in Rd: only O(d log 1/ε) labels to find w with err(w) ≤ ε.
Theorem (Passive, Realizable): Any w consistent with O(d/ε) labeled examples satisfies err(w) ≤ ε, with prob. 1-δ.
Also leads to an optimal bound for ERM passive learning (cf. the classical bounds of [Ehrenfeucht et al., 1989; Blumer et al., 1989]).
• First tight bound for poly-time PAC algos for an infinite class of functions under a general class of distributions.
• Solves an open question for the uniform distribution [Long’95,’03], [Bshouty’09].
Margin Based Active-Learning, Agnostic Case
Draw m1 unlabeled examples, label them, add them to W.
iterate k = 2, …, s
• find wk-1 of small τk-1-hinge loss w.r.t. W, among separators in a ball of radius rk-1 around the previous hypothesis [localization in concept space].
• Clear the working set W.
• sample mk unlabeled examples x satisfying |wk-1 · x| ≤ γk-1 [localization in instance space];
• label them and add them to W.
end iterate
Localization is used both in concept space (the ball) and in instance space (the band).
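A minimal one-round sketch of these two forms of localization (Python/numpy). The τ-hinge loss minimization over a ball is approximated here by projected subgradient descent, and the constants band, r, and tau are illustrative, not the values from the analysis:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
w_star = rng.normal(size=d); w_star /= np.linalg.norm(w_star)

def noisy_oracle(X, eta=0.05):
    """True halfspace labels with a small fraction eta flipped (agnostic-style noise)."""
    y = np.sign(X @ w_star)
    y[rng.random(len(X)) < eta] *= -1
    return y

def tau_hinge_min(X, y, w_prev, r, tau, steps=2000, lr=0.05):
    """Projected subgradient descent on the tau-hinge loss
    l_tau(w; x, y) = max(0, 1 - y*(w.x)/tau), constrained to the ball ||w - w_prev|| <= r."""
    w = w_prev.copy()
    for _ in range(steps):
        viol = y * (X @ w) / tau < 1
        grad = -(X[viol] * (y[viol] / tau)[:, None]).sum(axis=0) / len(X)
        w -= lr * grad
        diff = w - w_prev
        if np.linalg.norm(diff) > r:                     # project back onto B(w_prev, r)
            w = w_prev + r * diff / np.linalg.norm(diff)
    return w / np.linalg.norm(w)

# One round: w_prev plays the role of the (already fairly accurate) previous hypothesis.
w_prev = w_star + 0.4 * rng.normal(size=d); w_prev /= np.linalg.norm(w_prev)
band, r, tau = 0.25, 0.5, 0.1                            # illustrative gamma, r, tau

pool = rng.normal(size=(20_000, d))
X_band = pool[np.abs(pool @ w_prev) <= band][:200]       # localization in instance space
y_band = noisy_oracle(X_band)                            # labels requested only inside the band
w_new = tau_hinge_min(X_band, y_band, w_prev, r, tau)    # localization in concept space
angle = lambda u, v: np.arccos(np.clip(u @ v, -1, 1))
print(f"angle to w*: before {angle(w_prev, w_star):.3f}, after {angle(w_new, w_star):.3f}")
```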
Analysis: the Agnostic Case
Theorem: If D is log-concave in Rd and the band widths γk, radii rk, and hinge parameters τk are set appropriately, then err(ws) ≤ O(η) + ε, where η = err(best linear separator).
Key ideas:
• As before, need the error of the current hypothesis within the band to be small.
• For w in the ball around the previous hypothesis, the true error is bounded in terms of the hinge loss over the clean examples plus the influence of the noisy points.
• Sufficient to set τk proportional to the band width γk.
• A careful variance analysis leads to the claimed label complexity.
Analysis: Malicious Noise
The adversary can corrupt both the label and the feature part of a noisy example.
Theorem: If D is log-concave in Rd and the parameters are set appropriately, then err(ws) ≤ O(η) + ε, where η is the malicious noise rate.
Key ideas:
• As before, need the error of the current hypothesis within the band to be small.
• Soft localized outlier removal and a careful variance analysis.
Improves over Passive Learning too!
Noise tolerance for learning linear separators (D log-concave):
• Agnostic noise: prior passive work [KKMS’05], [KLS’09]; no prior active learning result (NA); our work (passive & active): info-theoretically optimal.
• Malicious noise: prior passive work [KLS’09]; no prior active learning result (NA); our work (passive & active): info-theoretically optimal.
Slightly better results for the uniform distribution case.
Useful for active and passive learning!
Localization is both an algorithmic and an analysis tool!
Important direction: richer interactions with the expert.
[Diagram: raw data → Classifier ↔ Expert Labeler]
Goals: fewer queries, natural interaction, better accuracy.
New Types of Interaction [Balcan-Hanneke COLT’12]
• Class Conditional Query
• Mistake Query
[Diagram: raw data (e.g., images of dog, cat, penguin, wolf) → Expert Labeler → Classifier]
Class Conditional & Mistake Queries
• Used in practice, e.g., Faces in iPhoto.
• Lack of theoretical understanding.
• Realizable case (folklore): far fewer queries than label requests.
Balcan-Hanneke, COLT’12: tight bounds on the number of CCQs needed to learn in the presence of noise (agnostic and bounded noise).
Summary: Interactive Learning
• First noise-tolerant, poly-time, label-efficient algos for high-dimensional cases. [BL’13] [ABL’14]
• Learning with more general queries. [BH’12]
Cool Implications:
• Sample & computational complexity of passive learning.
Related Work:
• Active & Differentially Private learning [Balcan-Feldman, NIPS’13]
Next: communication complexity, distributed learning.
Distributed Learning
• Data distributed across multiple locations. E.g., medical data.
• Each location has a piece of the overall data pie.
• To learn over the combined distribution D, must communicate.
• Communication is expensive.
Important question: how much communication? Plus, privacy & incentives.
(President Obama cited communication-avoiding algorithms in the FY 2012 Department of Energy Budget Request to Congress.)
Distributed PAC learning [Balcan-Blum-Fine-Mansour, COLT 2012; Best Paper Runner-Up]
• X – instance space; k players.
• Player i can sample from Di; samples are labeled by c*.
• Goal: find h that approximates c* w.r.t. D = (1/k)(D1 + … + Dk), with as little communication as possible.
Main Results
• Generic bounds on communication.
• Tight results for interesting cases [intersection-closed classes, parity fns, linear separators over “nice” distributions].
• Broadly applicable communication-efficient distributed boosting.
• Privacy guarantees.
Interesting special case to think about
k=2. One has the positives and one has the negatives.
• How much communication, e.g., for linear separators?
Active learning algos with good label complexity yield distributed learning algos with good communication complexity.
So, for linear separators under a log-concave distribution: only O(d log(1/ε)) examples communicated.
Generic Results
Baseline: O((d/ε) log(1/ε)) examples, 1 round of communication.
• Each player sends O((d/(εk)) log(1/ε)) examples to player 1.
• Player 1 finds a consistent h ∈ C; whp error ≤ ε w.r.t. D.
Distributed Boosting
Only O(d log 1/ε) examples of communication.
Key Properties of Adaboost
Input: S = {(x1, y1), …, (xm, ym)}
• D1 uniform on {x1, …, xm}.
• For t = 1, 2, …, T:
  • Construct Dt on {x1, …, xm}.
  • Run weak algo A on Dt, get ht.
  • Dt+1 increases the weight on xi if ht is incorrect on xi; decreases it if ht is correct:
    Dt+1(i) = (Dt(i)/Zt)·e^{-αt} if yi = ht(xi);  Dt+1(i) = (Dt(i)/Zt)·e^{αt} if yi ≠ ht(xi).
Output: H_final = sgn(Σt αt ht)
Key points:
• Dt+1(xi) depends on h1(xi), …, ht(xi) and a normalization factor that can be communicated efficiently.
• To achieve weak learning it suffices to use O(d) examples.
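For reference, a minimal centralized AdaBoost sketch (Python/numpy) that makes the weight update above concrete; the decision-stump weak learner and the toy data are illustrative choices, not from the talk:

```python
import numpy as np

def adaboost(X, y, T=20):
    """Minimal AdaBoost with decision stumps as the weak learner A.
    The update matches D_{t+1}(i) = D_t(i)/Z_t * exp(-alpha_t) if y_i == h_t(x_i), else * exp(+alpha_t)."""
    m, d = X.shape
    D = np.full(m, 1.0 / m)                      # D_1 uniform on the sample
    stumps, alphas = [], []
    for t in range(T):
        # Weak learner: best threshold stump under the current distribution D_t.
        best = None
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.sign(X[:, j] - thr + 1e-12)
                    err = D[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, s)
        err, j, thr, s = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = s * np.sign(X[:, j] - thr + 1e-12)
        D *= np.exp(-alpha * y * pred)           # e^{-alpha} if correct, e^{+alpha} if not
        D /= D.sum()                             # Z_t normalization
        stumps.append((j, thr, s)); alphas.append(alpha)
    def H(Xnew):                                  # H_final = sgn(sum_t alpha_t h_t)
        votes = sum(a * s * np.sign(Xnew[:, j] - thr + 1e-12)
                    for a, (j, thr, s) in zip(alphas, stumps))
        return np.sign(votes)
    return H

# Toy usage on synthetic data (labels in {-1, +1}).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)); y = np.sign(X[:, 0] + 0.5 * X[:, 1])
H = adaboost(X, y)
print("training error:", np.mean(H(X) != y))
```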
Distributed Adaboost
Each player i has a sample Si from Di.
For t = 1, 2, …, T:
• Each player sends player 1 enough data to produce the weak hypothesis ht. [For t=1, O(d/k) examples each.]
• Player 1 broadcasts ht to the others.
• Player i reweights its own distribution on Si using ht and sends the sum of its weights wi,t to player 1.
• Player 1 determines the number of samples ni,t+1 to request from each player i for the next round [samples O(d) times from the multinomial given by wi,t/Wt].
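A minimal sketch of the per-round communication pattern (Python/numpy). The weak learner and the reweighting by ht are omitted, and `n_draws = 10*d` is an illustrative stand-in for the O(d) draws:

```python
import numpy as np

rng = np.random.default_rng(1)
k, d = 4, 5                                     # players, dimension

# Each player i holds a local sample S_i and maintains local (unnormalized) weights over it.
S = [rng.normal(size=(500, d)) for _ in range(k)]
weights = [np.ones(len(Si)) for Si in S]        # round-1 weights (uniform)

def round_of_communication(S, weights, n_draws):
    """One communication round: each player sends only its weight sum w_{i,t};
    player 1 draws the per-player counts n_{i,t+1} from the multinomial w_{i,t}/W_t
    and requests that many examples, each sampled locally according to the player's weights."""
    w_sums = np.array([w.sum() for w in weights])               # each player sends w_{i,t}
    counts = rng.multinomial(n_draws, w_sums / w_sums.sum())    # player 1: n_{i,t+1} per player
    batch = []
    for Si, wi, ni in zip(S, weights, counts):
        idx = rng.choice(len(Si), size=ni, p=wi / wi.sum())     # player i samples from its own D_t
        batch.append(Si[idx])
    return np.vstack(batch), counts

batch, counts = round_of_communication(S, weights, n_draws=10 * d)   # O(d) draws
print("examples requested per player:", counts, "| total communicated:", len(batch))
```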
Theorem: Any class C can be learned with O(log(1/ε)) rounds, using O(d) examples + 1 hypothesis of communication per round.
• Key: in Adaboost, O(log 1/ε) rounds suffice to achieve error ε.
Theorem: In the agnostic case, can learn to error O(OPT) + ε using only O(k log|C| log(1/ε)) examples of communication.
• Key: a distributed implementation of the Robust Halving technique developed for learning with mistake queries [Balcan-Hanneke’12].
Communication: a fundamental resource in distributed learning.
Distributed Clustering [Balcan-Ehrlich-Liang, NIPS 2013]
k-median: find center pts c1, c2, …, ck to minimize Σx mini d(x, ci)
k-means: find center pts c1, c2, …, ck to minimize Σx mini d²(x, ci)
• Key idea: use coresets, short summaries capturing the relevant info w.r.t. all clusterings.
• [Feldman-Langberg STOC’11] show that in the centralized setting one can construct a coreset whose size depends on k, d, and 1/ε but not on the number of data points.
• By combining local coresets, we get a global coreset – but the size goes up multiplicatively by the number of sites.
• [Balcan-Ehrlich-Liang, NIPS 2013] give a 2-round procedure whose total communication avoids this multiplicative blow-up in the number of sites.
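A minimal sketch of the "combine local summaries, cluster centrally" idea (Python/numpy). The local summary here is a plain weighted subsample used as a placeholder; the actual scheme uses coreset constructions (e.g., sensitivity sampling) rather than uniform sampling:

```python
import numpy as np

rng = np.random.default_rng(3)
n_sites, k, d = 4, 3, 2
local_data = [rng.normal(loc=rng.uniform(-5, 5, d), size=(2000, d)) for _ in range(n_sites)]

def local_summary(X, size=100):
    """Hypothetical local summary: a uniform subsample with weights summing to |X|.
    A true coreset construction would replace this step."""
    idx = rng.choice(len(X), size=size, replace=False)
    return X[idx], np.full(size, len(X) / size)

def weighted_kmeans(P, w, k, iters=50):
    """Lloyd's algorithm on a weighted point set (the combined summaries)."""
    C = P[rng.choice(len(P), size=k, replace=False)]
    for _ in range(iters):
        a = np.argmin(((P[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(a == j):
                C[j] = np.average(P[a == j], axis=0, weights=w[a == j])
    return C

# Each site sends only its small weighted summary; the coordinator clusters the union.
summaries = [local_summary(X) for X in local_data]
P = np.vstack([p for p, _ in summaries]); w = np.concatenate([wi for _, wi in summaries])
centers = weighted_kmeans(P, w, k)
cost = sum(((X[:, None, :] - centers[None]) ** 2).sum(-1).min(axis=1).sum() for X in local_data)
print("communicated points:", len(P), "of", sum(len(X) for X in local_data),
      "| k-means cost:", round(cost, 1))
```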
Discussion and Open Questions
• Communication as a fundamental resource.
• Other learning or optimization tasks.
• Refined trade-offs between communication complexity, computational complexity, and sample complexity.
• Analyze such issues in the context of transfer learning of large collections of multiple related tasks (e.g., NELL).