What is wrong with data challenges
Posted 21-Feb-2017 by balazs-kegl
Page 1: What is wrong with data challenges

Center for Data Science Paris-Saclay

CNRS & University Paris Saclay Center for Data Science

BALÁZS KÉGL

WHAT IS WRONG WITH DATA CHALLENGES

THE HIGGSML STORY: THE GOOD, THE BAD AND THE UGLY

Page 2: What is wrong with data challenges

Why am I so critical?

Why do I downplay our own success with the HiggsML?

Page 3: What is wrong with data challenges

Because I believe that there is enormous potential in open innovation/crowdsourcing in science.

The current data challenge format is a single point in the landscape.

Page 4: What is wrong with data challenges

Olga Kokshagina, 2015

INTERMEDIARIES: THE GROWING INTEREST FOR « CROWDS » -> EXPLOSION OF TOOLS

• Crowdsourcing is a model leveraging novel technologies (web 2.0, mobile apps, social networks)

• to build content and a structured set of information by gathering contributions from large groups of individuals

Page 5: What is wrong with data challenges

CROWDSOURCING ANNOTATION

Page 6: What is wrong with data challenges

CROWDSOURCING COLLECTION AND ANNOTATION

Page 7: What is wrong with data challenges

CROWDSOURCING MATH

Page 8: What is wrong with data challenges

CROWDSOURCING ANALYTICS

Page 9: What is wrong with data challenges

OPEN SOURCE

Page 10: What is wrong with data challenges

NEW PUBLICATION MODELS

Page 11: What is wrong with data challenges

THE BOOK TO READ

Page 12: What is wrong with data challenges


• Summary of our conclusions after the HiggsML challenge

• The good, the bad and the ugly

• Elaborating on some of the points

• Rapid Analytics and Model Prototyping

• an experimental format we have been developing


OUTLINE

Page 13: What is wrong with data challenges


CIML WORKSHOP TOMORROW

Page 14: What is wrong with data challenges


• Publicity, awareness

• both in physics (about the technology) and in ML (about the problem)

• Triggering open data

• http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014

• Learning a lot from Gábor on how to win a challenge

• Gábor getting hired by Google DeepMind

• Benchmarking

• Tool dissemination (xgboost, keras)

THE GOOD

Page 15: What is wrong with data challenges


• No direct access to code

• No direct access to data scientists

• No fundamentally new ideas

• No incentive to collaborate

THE BAD

Page 16: What is wrong with data challenges


• 18 months to prepare

• legal issues, access to data

• problem formulation: intellectually way more interesting than the challenge itself, but difficult to “market” or to crowdsource

• once a problem is formalized and formatted as a challenge, the problem is essentially solved (“learning is easy” - Gaël Varoquaux)

THE UGLY

Page 17: What is wrong with data challenges


• We asked the wrong question, on purpose!

• because the right questions are complex and don’t fit the challenge setup

• would have led to way less participation

• would have led to bitterness among the participants, bad (?) for marketing

THE UGLY

Page 18: What is wrong with data challenges


• The HiggsML challenge on Kaggle

• https://www.kaggle.com/c/higgs-boson

PUBLICITY, AWARENESS

Page 19: What is wrong with data challenges


PUBLICITY, AWARENESS

[Embedded slide 14 from B. Kégl, “Learning to discover” (AppStat@LAL): Classification for discovery]

Page 20: What is wrong with data challenges


AWARENESS DYNAMICS


• HEPML workshop @NIPS14

• JMLR WS proceedings: http://jmlr.csail.mit.edu/proceedings/papers/v42

• CERN Open Data

• http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014

• DataScience@LHC

• http://indico.cern.ch/event/395374/

• Flavors of physics challenge

• https://www.kaggle.com/c/flavours-of-physics

Page 21: What is wrong with data challenges


LEARNING FROM THE WINNER


https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf

Page 22: What is wrong with data challenges


LEARNING FROM THE WINNER


• Sophisticated cross validation, CV bagging

• Sophisticated calibration and model averaging

• The first step for pro participants: check whether the effort is worth it (risk assessment)

• variance estimate of the score

• Don’t use the public leaderboard score for model selection

• None of Gábor’s 200 out-of-the-ordinary ideas worked

https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf
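Several of these tricks are easy to sketch. Below is a minimal numpy illustration (toy data and a simple ridge model are my stand-ins, not Gábor's actual pipeline) of two of them: estimating the variance of the CV score before committing effort, and CV bagging, i.e. averaging the fold models instead of refitting on the full training set.

```python
import numpy as np

rng = np.random.default_rng(42)
# toy regression data standing in for challenge data
n, d = 400, 10
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + rng.standard_normal(n)

def fit_ridge(X, y, lam=1.0):
    # closed-form ridge regression
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

K = 10
folds = np.array_split(rng.permutation(n), K)
fold_scores, fold_models = [], []
for k in range(K):
    val = folds[k]
    trn = np.concatenate([folds[j] for j in range(K) if j != k])
    w = fit_ridge(X[trn], y[trn])
    fold_models.append(w)
    fold_scores.append(np.sqrt(np.mean((X[val] @ w - y[val]) ** 2)))

# variance estimate of the score: is entering worth the effort?
print(f"CV RMSE: {np.mean(fold_scores):.3f} +/- {np.std(fold_scores):.3f}")

# CV bagging: average the K fold models instead of refitting on all the data
w_bag = np.mean(fold_models, axis=0)
```

The fold-score spread is exactly the “variance estimate of the score” used for risk assessment; `w_bag` averages the K fold models rather than refitting once on all the data.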

Page 23: What is wrong with data challenges


BENCHMARKING

[Embedded slide 15 from B. Kégl, “Learning to discover” (AppStat@LAL): Classification for discovery]

Page 24: What is wrong with data challenges


BENCHMARKING

But what score did we optimize?

And why?

Page 25: What is wrong with data challenges


CLASSIFICATION FOR DISCOVERY

[Figure: signal and background distributions, as probability densities (top) and as expected counts per year, i.e. flux × time (bottom), with the selection threshold marked]

Goal: optimize the expected discovery significance

• selection: expected background, say, b = 100 events

• total count, say, 150 events, so the excess is s = 50 events

• AMS = s/√b = 50/√100 = 5 sigma

[Excerpt pasted on the slide:]

…approaches a simple asymptotic form related to the chi-squared distribution in the large-sample limit. In practice the asymptotic formulae are found to provide a useful approximation even for moderate data samples (see, e.g., [6]). Assuming that these hold, the p-value of the background-only hypothesis from an observed value of q0 is found to be

    p = 1 − Φ(√q0),    (11)

where Φ is the standard Gaussian cumulative distribution. In particle physics it is customary to convert the p-value into the equivalent significance Z, defined as

    Z = Φ⁻¹(1 − p),    (12)

where Φ⁻¹ is the standard normal quantile. Eqs. (11) and (12) lead therefore to the simple result

    Z = √q0 = √( 2 ( n ln(n/µb) − n + µb ) )    (13)

if n > µb, and Z = 0 otherwise. The quantity Z measures the statistical significance in units of standard deviations or “sigmas”. Often in particle physics a significance of at least Z = 5 (a five-sigma effect) is regarded as sufficient to claim a discovery. This corresponds to finding the p-value less than 2.9 × 10⁻⁷. (This extremely high threshold for statistical significance is motivated by a number of factors related to multiple testing, accounting for mismodeling, and the high standard one would like to require for an important discovery.)

4.2 The median discovery significance

Eq. (13) represents the significance that we would obtain for a given number of events n observed in the search region G, knowing the background expectation µb. When optimizing the design of the classifier g which defines the search region G = {x : g(x) = s}, we do not know n and µb. As usual in empirical risk minimization [9], we estimate the expectation µb by its empirical counterpart b from Eq. (5). We then replace n by s + b to obtain the approximate median significance

    AMS2 = √( 2 ( (s + b) ln(1 + s/b) − s ) ).    (14)

Taking into consideration that (x + 1) ln(x + 1) = x + x²/2 + O(x³), AMS2 can be rewritten as

    AMS2 = AMS3 × √( 1 + O(s/b) ),

where

    AMS3 = s/√b.    (15)

The two criteria Eqs. (14) and (15) are practically indistinguishable when b ≫ s. This approximation often holds in practice and may, depending on the chosen search region, be a valid surrogate in the Challenge.

In preliminary runs it happened sometimes that AMS2 was maximized in small selection regions G, resulting in a large variance of the AMS. While large variance in the real analysis is not necessarily a problem, it would make it difficult to reliably compare the participants of the Challenge if the optimal region was small. So, in order to decrease the variance of the AMS, we decided to bias the optimal selection region towards larger regions by adding an artificial shift b_reg to b. The value b_reg = 10 was determined using preliminary experiments.
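The formulas above are easy to check numerically; a short Python sketch (function names are mine) reproduces the slide's worked example (s = 50, b = 100) and the five-sigma p-value quoted in the excerpt:

```python
import numpy as np
from math import erf, sqrt

def ams2(s, b):
    # approximate median significance, Eq. (14)
    return np.sqrt(2 * ((s + b) * np.log(1 + s / b) - s))

def ams3(s, b):
    # simplified criterion, Eq. (15), valid when b >> s
    return s / np.sqrt(b)

def p_value(z):
    # p = 1 - Phi(z), inverting Eq. (12)
    return 0.5 * (1 - erf(z / sqrt(2)))

s, b = 50.0, 100.0
print(ams3(s, b))        # 5.0 "sigma", the slide's worked example
print(ams2(s, b))        # ~4.65: with s/b = 0.5 the b >> s approximation is rough
print(p_value(5.0))      # ~2.9e-7, the five-sigma discovery threshold
print(ams2(s, b + 10))   # the challenge objective shifted b by b_reg = 10
```

The last line reflects the b_reg regularization described in the excerpt, which biases the optimum towards larger selection regions.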

Page 26: What is wrong with data challenges


How to handle systematic (model) uncertainties?

• OK, so let's design an objective function that can take background systematics into consideration

• Likelihood with unknown background b ~ N(µb, σb):

    L(µs, µb) = P(n, b | µs, µb, σb) = ( (µs + µb)^n / n! ) e^−(µs + µb) · ( 1 / (√(2π) σb) ) e^−(b − µb)² / (2σb²)

• Profile likelihood ratio λ(0) = L(0, µ̂̂b) / L(µ̂s, µ̂b), where µ̂̂b is the conditional maximum-likelihood estimate of µb with µs fixed to 0

• The new Approximate Median Significance (by Glen Cowan):

    AMS = √( 2 ( (s + b) ln( (s + b)/b0 ) − s − b + b0 ) + (b − b0)²/σb² )

  where

    b0 = (1/2) ( b − σb² + √( (b − σb²)² + 4(s + b)σb² ) )
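A quick numeric sanity check of the new AMS (a sketch; function names are mine): as σb → 0 it should fall back to the old AMS, and a sizeable σb should pull the significance down.

```python
import numpy as np

def ams2(s, b):
    # the old AMS, without systematics
    return np.sqrt(2 * ((s + b) * np.log(1 + s / b) - s))

def ams_syst(s, b, sigma_b):
    # AMS with a Gaussian uncertainty sigma_b on the background estimate b
    b0 = 0.5 * (b - sigma_b**2
                + np.sqrt((b - sigma_b**2) ** 2 + 4 * (s + b) * sigma_b**2))
    return np.sqrt(2 * ((s + b) * np.log((s + b) / b0) - s - b + b0)
                   + (b - b0) ** 2 / sigma_b**2)

s, b = 50.0, 100.0
print(ams_syst(s, b, 1e-4))   # ~4.65: sigma_b -> 0 recovers the old AMS
print(ams_syst(s, b, 10.0))   # ~3.29: a 10% background systematic dilutes it
```

In the σb → 0 limit b0 → b, the last term vanishes, and the expression reduces term by term to the old AMS2.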

Page 27: What is wrong with data challenges


HOW TO HANDLE SYSTEMATIC UNCERTAINTIES


Why didn’t we use it?

Page 28: What is wrong with data challenges

How to handle systematic (model) uncertainties?

• The new Approximate Median Significance:

    AMS = √( 2 ( (s + b) ln( (s + b)/b0 ) − s − b + b0 ) + (b − b0)²/σb² )

  where

    b0 = (1/2) ( b − σb² + √( (b − σb²)² + 4(s + b)σb² ) )

[Figure: curves labeled “New AMS”, “ATLAS”, and “Old AMS”]

Page 29: What is wrong with data challenges


LEARNING FROM THE WINNER


• Sophisticated cross validation, CV bagging

• Sophisticated calibration and model averaging

• The first step for pro participants: check whether the effort is worth it (risk assessment)

• variance estimate of the score

• Don’t use the public leaderboard score for model selection

• None of Gábor’s 200 out-of-the-ordinary ideas worked

Page 30: What is wrong with data challenges


THE TWO MOST COMMON DATA CHALLENGE KILLERS


Leakage

Variance of the test score

Page 31: What is wrong with data challenges


VARIANCE OF THE TEST SCORE

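The second killer is plain sampling noise: a score computed on a finite test set is itself a random variable. A back-of-the-envelope numpy check for an accuracy-type score (p = 0.8 is illustrative):

```python
import numpy as np

# std of an accuracy measured on n test points is ~ sqrt(p(1-p)/n)
p = 0.8
for n in (1_000, 10_000, 100_000):
    print(f"n={n}: accuracy std ~ {np.sqrt(p * (1 - p) / n):.4f}")

# empirical check: 1000 hypothetical test sets of size 10000
rng = np.random.default_rng(0)
n = 10_000
accs = (rng.random((1000, n)) < p).mean(axis=1)
print(f"simulated std: {accs.std():.4f}")   # ~0.004, i.e. 0.4% absolute
```

On a 10,000-point test set the score noise is roughly 0.4% absolute, often larger than the gaps at the top of a leaderboard, which is why the winner's advice was not to use the public leaderboard score for model selection.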

Page 32: What is wrong with data challenges


• Challenges are useful for

• generating visibility in the data science community about novel application domains

• benchmarking state-of-the-art techniques fairly on well-defined problems

• finding talented data scientists

• Limitations

• not necessarily adapted to solving complex and open-ended data science problems in realistic environments

• no direct access to solutions and data scientists

• no incentive to collaborate

DATA CHALLENGES

Page 33: What is wrong with data challenges


We decided to design something better

Page 34: What is wrong with data challenges


• Direct access to code, prototyping

• Incentivizing diversity

• Incentivizing collaboration

• Training

• Networking

RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)

Page 35: What is wrong with data challenges


• Our experience with the HiggsML challenge

• Need to connect data scientists to domain scientists and problems at the Paris-Saclay Center for Data Science

• Collaboration with management scientists specializing in managing innovation

• Michael Nielsen's book: Reinventing Discovery

• 5+ iterations so far

WHERE DOES IT COME FROM?

Page 36: What is wrong with data challenges


UNIVERSITÉ PARIS-SACLAY


+ horizontal multi-disciplinary and multi-partner initiatives to create cohesion

Page 37: What is wrong with data challenges

Center for Data Science Paris-Saclay

A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay

http://www.datascience-paris-saclay.fr/

The Paris-Saclay Center for Data Science: Data Science for scientific Data. 250 researchers in 35 laboratories.

[Slide graphic: the CDS ecosystem. Domain science labs (biology & bioinformatics, chemistry, Earth sciences, economy, neuroscience, particle physics, astrophysics & cosmology), data science labs (statistics, machine learning, information retrieval, signal processing, data visualization, databases), and tool building (software engineering, clouds/grids, high-performance computing, optimization), connected by the roles of domain scientist, data scientist, applied scientist, data engineer, and software engineer.]

datascience-paris-saclay.fr
@SaclayCDS

Page 38: What is wrong with data challenges

THE DATA SCIENCE LANDSCAPE

[Slide graphic: domain science (energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain), data science (statistics, machine learning, information retrieval, signal processing, data visualization, databases), and tool building (software engineering, clouds/grids, high-performance computing, optimization), connected by the roles of data scientist, data trainer, applied scientist, domain scientist, data engineer, and software engineer.]

Page 39: What is wrong with data challenges


https://medium.com/@balazskegl

Page 40: What is wrong with data challenges


TOOLS: LANDSCAPE TO ECOSYSTEM


[Slide graphic: tools turning the landscape into an ecosystem, connecting data science, tool building, and the data domains through the roles of data scientist, data trainer, applied scientist, domain expert, data engineer, and software engineer:]

• interdisciplinary projects • matchmaking tool • design and innovation strategy workshops • data challenges

• coding sprints • Open Software Initiative • code consolidator and engineering projects

• data science RAMPs and TSs • IT platform for linked data • annotation tools • SaaS data science platform

Page 41: What is wrong with data challenges


• Modularizing the collaboration

• independent subtasks

• reduces barriers

• broadens the range of available expertise

• Encouraging small contributions

• Rich and well-structured information commons

• so people can build on earlier work

NIELSEN'S CROWDSOURCING PRINCIPLES

Page 42: What is wrong with data challenges


RAMPS

• Single-day coding sessions

• 20-40 participants

• preparation is similar to challenges

• Goals

• focusing and motivating top talents

• promoting collaboration, speed, and efficiency

• solving (prototyping) real problems

Page 43: What is wrong with data challenges


TRAINING SPRINTS

• Single-day training sessions

• 20-40 participants

• focusing on a single subject (deep learning, model tuning, functional data, etc.)

• preparing RAMPs

Page 44: What is wrong with data challenges


ANALYTICS TOOLS TO PROMOTE COLLABORATION AND CODE REUSE

Page 45: What is wrong with data challenges


ANALYTICS TOOLS TO PROMOTE COLLABORATION AND CODE REUSE

Page 46: What is wrong with data challenges


ANALYTICS TOOLS TO MONITOR PROGRESS


Page 47: What is wrong with data challenges


RAPID ANALYTICS AND MODEL PROTOTYPING

2015 Jan 15 The HiggsML challenge


Page 48: What is wrong with data challenges


RAPID ANALYTICS AND MODEL PROTOTYPING

2015 Apr 10 Classifying variable stars


Page 49: What is wrong with data challenges


VARIABLE STARS


Page 50: What is wrong with data challenges


VARIABLE STARS


accuracy improvement: 89% to 96%

Page 51: What is wrong with data challenges


RAPID ANALYTICS AND MODEL PROTOTYPING

2015 June 16 and Sept 26 Predicting El Niño


Page 52: What is wrong with data challenges


RAPID ANALYTICS AND MODEL PROTOTYPING

RMSE improvement: 0.9°C to 0.4°C

Page 53: What is wrong with data challenges


2015 October 8 Insect classification

RAPID ANALYTICS AND MODEL PROTOTYPING

Page 54: What is wrong with data challenges


RAPID ANALYTICS AND MODEL PROTOTYPING

accuracy improvement: 30% to 70%

Page 55: What is wrong with data challenges


CONCLUSIONS

• Explore the open innovation space

• read Nielsen’s book

• Drop me an email ([email protected]) if you are interested in beta-testing the RAMP tool

• Come to our CIML WS tomorrow

Page 56: What is wrong with data challenges


THANK YOU!

