What is wrong with data challenges

Posted on 21-Feb-2017


Center for Data Science Paris-Saclay

CNRS & University Paris Saclay Center for Data Science

BALÁZS KÉGL

WHAT IS WRONG WITH DATA CHALLENGES

THE HIGGSML STORY: THE GOOD, THE BAD AND THE UGLY


Why am I so critical?

Why do I downplay our own success with the HiggsML challenge?


Because I believe that there is enormous potential in open innovation/crowdsourcing in science.

The current data challenge format is a single point in the landscape.

Olga Kokshagina 2015

INTERMEDIARIES: THE GROWING INTEREST FOR “CROWDS” -> EXPLOSION OF TOOLS

• Crowdsourcing is a model leveraging novel technologies (web 2.0, mobile apps, social networks)

• to build content and a structured set of information by gathering contributions from large groups of individuals


CROWDSOURCING ANNOTATION


CROWDSOURCING COLLECTION AND ANNOTATION


CROWDSOURCING MATH


CROWDSOURCING ANALYTICS


OPEN SOURCE


NEW PUBLICATION MODELS


THE BOOK TO READ


• Summary of our conclusions after the HiggsML challenge

• The good, the bad and the ugly

• Elaborating on some of the points

• Rapid Analytics and Model Prototyping

• an experimental format we have been developing


OUTLINE


CIML WORKSHOP TOMORROW


• Publicity, awareness

• both in physics (about the technology) and in ML (about the problem)

• Triggering open data

• http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014

• Learning a lot from Gábor on how to win a challenge

• Gábor getting hired by Google DeepMind

• Benchmarking

• Tool dissemination (xgboost, keras)


THE GOOD


• No direct access to code

• No direct access to data scientists

• No fundamentally new ideas

• No incentive to collaborate


THE BAD


• 18 months to prepare

• legal issues, access to data

• problem formulation: intellectually way more interesting than the challenge itself, but difficult to “market” or to crowdsource

• once a problem is formalized/formatted as a challenge, the problem is solved (“learning is easy” - Gaël Varoquaux)


THE UGLY


• We asked the wrong question, on purpose!

• because the right questions are complex and don’t fit the challenge setup

• would have led to way less participation

• would have led to bitterness among the participants, bad (?) for marketing


THE UGLY


• The HiggsML challenge on Kaggle

• https://www.kaggle.com/c/higgs-boson


PUBLICITY, AWARENESS

PUBLICITY, AWARENESS

[Slide: “Classification for discovery”, from B. Kégl’s AppStat@LAL “Learning to discover” talk]


AWARENESS DYNAMICS


• HEPML workshop @NIPS14

• JMLR WS proceedings: http://jmlr.csail.mit.edu/proceedings/papers/v42

• CERN Open Data

• http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014

• DataScience@LHC

• http://indico.cern.ch/event/395374/

• Flavors of physics challenge

• https://www.kaggle.com/c/flavours-of-physics


LEARNING FROM THE WINNER


https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf


LEARNING FROM THE WINNER


• Sophisticated cross validation, CV bagging

• Sophisticated calibration and model averaging

• The first step: pro participants check whether the effort is worthwhile (risk assessment)

• variance estimate of the score

• Don’t use the public leaderboard score for model selection

• None of Gábor’s 200 out-of-the-ordinary ideas worked

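Gábor’s “CV bagging” can be sketched roughly as follows. This is an illustrative numpy version under simplified assumptions (a toy ridge regression stands in for his actual models): instead of refitting a single model on the full training set after cross-validation, every fold-model is kept and their predictions are averaged.

```python
import numpy as np

# Rough sketch of CV bagging (illustrative; not Gabor's actual pipeline):
# keep every fold-model from cross-validation and average their
# predictions, instead of refitting a single model on the full data.
rng = np.random.default_rng(0)

# toy regression data: y = 2x + noise
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

def fit_ridge(X, y, lam=1e-3):
    # closed-form ridge regression: w = (X'X + lam I)^-1 X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

K = 5
folds = np.array_split(rng.permutation(len(X)), K)
models = [fit_ridge(np.delete(X, f, axis=0), np.delete(y, f)) for f in folds]

X_new = np.array([[1.0], [-2.0]])
# bagged prediction: the average of the K fold-models
y_bagged = np.mean([X_new @ w for w in models], axis=0)
print(y_bagged)  # close to [2, -4]
```

Averaging the fold-models also reuses the out-of-fold predictions already computed for validation, which is part of why the technique is cheap for challenge participants.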


BENCHMARKING


[Slide: “Classification for discovery”, from B. Kégl’s AppStat@LAL “Learning to discover” talk]


BENCHMARKING


But what score did we optimize?

And why?

CLASSIFICATION FOR DISCOVERY

[Figure: signal and background distributions, shown as counts per year and as probabilities]

Goal: optimize the expected discovery significance

• flux × time and the selection give the expected background: say, b = 100 events

• total count: say, 150 events

• the excess is s = 50 events

• AMS = s/√b = 50/√100 = 5 sigma

[Excerpt from the HiggsML technical documentation:]

…approaches a simple asymptotic form related to the chi-squared distribution in the large-sample limit. In practice the asymptotic formulae are found to provide a useful approximation even for moderate data samples (see, e.g., [6]). Assuming that these hold, the p-value of the background-only hypothesis from an observed value of q₀ is found to be

p = 1 − Φ(√q₀),  (11)

where Φ is the standard Gaussian cumulative distribution. In particle physics it is customary to convert the p-value into the equivalent significance Z, defined as

Z = Φ⁻¹(1 − p),  (12)

where Φ⁻¹ is the standard normal quantile. Eqs. (11) and (12) lead therefore to the simple result

Z = √q₀ = √( 2( n ln(n/μ_b) − n + μ_b ) )  (13)

if n > μ_b, and Z = 0 otherwise. The quantity Z measures the statistical significance in units of standard deviations or “sigmas”. Often in particle physics a significance of at least Z = 5 (a five-sigma effect) is regarded as sufficient to claim a discovery. This corresponds to finding the p-value less than 2.9 × 10⁻⁷.¹¹
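Eqs. (11)-(13) are easy to check numerically. A minimal stdlib-only sketch (the helper names are mine):

```python
import math

# Numerical check of Eqs. (11)-(13); function names are mine.
def p_value(Z):
    # Eqs. (11)-(12): p = 1 - Phi(Z), via the complementary error function
    return 0.5 * math.erfc(Z / math.sqrt(2))

def significance(n, mu_b):
    # Eq. (13): Z = sqrt(2 (n ln(n / mu_b) - n + mu_b)) for n > mu_b
    if n <= mu_b:
        return 0.0
    return math.sqrt(2 * (n * math.log(n / mu_b) - n + mu_b))

print(p_value(5.0))            # ~2.87e-7: the five-sigma threshold
print(significance(150, 100))  # ~4.65 sigma for 150 observed over 100 expected
```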

4.2 The median discovery significance

Eq. (13) represents the significance that we would obtain for a given number of events n observed in the search region G, knowing the background expectation μ_b. When optimizing the design of the classifier g which defines the search region G = {x : g(x) = s}, we do not know n and μ_b. As usual in empirical risk minimization [9], we estimate the expectation μ_b by its empirical counterpart b from Eq. (5). We then replace n by s + b to obtain the approximate median significance

AMS₂ = √( 2( (s + b) ln(1 + s/b) − s ) ).  (14)

Taking into consideration that (x + 1) ln(x + 1) = x + x²/2 + O(x³), AMS₂ can be rewritten as

AMS₂ = AMS₃ × √( 1 + O((s/b)³) ),

where

AMS₃ = s/√b.  (15)

The two criteria Eqs. (14) and (15) are practically indistinguishable when b ≫ s. This approximation often holds in practice and may, depending on the chosen search region, be a valid surrogate in the Challenge.

In preliminary runs it happened sometimes that AMS₂ was maximized in small selection regions G, resulting in a large variance of the AMS. While large variance in the real analysis is not necessarily a problem, it would make it difficult to reliably compare the participants of the Challenge if the optimal region was small. So, in order to decrease the variance of the AMS, we decided to bias the optimal selection region towards larger regions by adding an artificial shift b_reg to b. The value b_reg = 10 was determined using preliminary experiments.

¹¹ This extremely high threshold for statistical significance is motivated by a number of factors related to multiple testing, accounting for mismodeling, and the high standard one would like to require for an important discovery.

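Eqs. (14)-(15) and the regularization shift translate directly into code. A minimal sketch consistent with the formulas above (function names are mine; b_reg = 10 is the value quoted in the text):

```python
import math

# Sketch of Eqs. (14)-(15); b_reg = 10 is the regularization shift
# described above, and the function names are mine.
def ams2(s, b, b_reg=10.0):
    # approximate median significance, Eq. (14), on the shifted background
    b = b + b_reg
    return math.sqrt(2 * ((s + b) * math.log(1 + s / b) - s))

def ams3(s, b):
    # the simple approximation s / sqrt(b), Eq. (15), valid when b >> s
    return s / math.sqrt(b)

print(ams3(50, 100))           # the toy example above: 5.0 "sigma"
print(ams2(50, 100, b_reg=0))  # Eq. (14) on the same numbers: ~4.65
print(ams2(50, 100))           # regularized objective: slightly lower still
```

Note how Eq. (14) already gives a lower value than s/√b when s is not small relative to b, and how the shift b_reg pushes the optimum towards larger selection regions by further penalizing small-b selections.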


How to handle systematic (model) uncertainties?

• OK, so let’s design an objective function that can take background systematics into consideration

• Likelihood with unknown background b ~ N(μ_b, σ_b):

L(μ_s, μ_b) = P(n, b | μ_s, μ_b, σ_b) = ( (μ_s + μ_b)ⁿ / n! ) e^(−(μ_s + μ_b)) · ( 1 / (√(2π) σ_b) ) e^(−(b − μ_b)² / (2σ_b²))

• Profile likelihood ratio λ(0) = L(0, μ̂̂_b) / L(μ̂_s, μ̂_b)

• The new Approximate Median Significance (by Glen Cowan):

AMS = √( 2( (s + b) ln((s + b)/b₀) − s − b + b₀ ) + (b − b₀)²/σ_b² )

where

b₀ = ½ ( b − σ_b² + √( (b − σ_b²)² + 4(s + b)σ_b² ) )
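The new AMS can be written out directly. A sketch of the slide’s formula (function names are mine); as σ_b → 0 it reduces to the systematics-free AMS of Eq. (14):

```python
import math

# Sketch of the systematics-aware AMS from the slide; names are mine.
def ams_syst(s, b, sigma_b):
    # b0: the profiled background estimate from the slide
    b0 = 0.5 * (b - sigma_b**2
                + math.sqrt((b - sigma_b**2) ** 2 + 4 * (s + b) * sigma_b**2))
    return math.sqrt(2 * ((s + b) * math.log((s + b) / b0) - s - b + b0)
                     + (b - b0) ** 2 / sigma_b**2)

def ams2(s, b):
    # systematics-free AMS of Eq. (14), for comparison
    return math.sqrt(2 * ((s + b) * math.log(1 + s / b) - s))

print(ams_syst(50, 100, 1e-3))  # ~4.65: tiny sigma_b recovers Eq. (14)
print(ams_syst(50, 100, 20.0))  # ~2.15: a 20% background systematic
                                # substantially dilutes the significance
```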


HOW TO HANDLE SYSTEMATIC UNCERTAINTIES


Why didn’t we use it?


[Slide: the new Approximate Median Significance again, with a figure comparing the new AMS, the old AMS, and the ATLAS result]


LEARNING FROM THE WINNER


• Sophisticated cross validation, CV bagging

• Sophisticated calibration and model averaging

• The first step: pro participants check whether the effort is worthwhile (risk assessment)

• variance estimate of the score

• Don’t use the public leaderboard score for model selection

• None of Gábor’s 200 out-of-the-ordinary ideas worked


THE TWO MOST COMMON DATA CHALLENGE KILLERS


Leakage

Variance of the test score
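Leakage often enters through preprocessing rather than through the target itself. An illustrative numpy simulation (mine, not from the talk): selecting features on the full data set before cross-validating manufactures signal out of pure noise.

```python
import numpy as np

# Illustrative sketch (not from the talk): feature selection performed on
# the full data set leaks label information into cross-validation.
rng = np.random.default_rng(42)
n, p, k = 60, 2000, 20
X = rng.normal(size=(n, p))        # pure noise features
y = rng.integers(0, 2, size=n)     # random labels: there is nothing to learn

def top_corr_features(X, y, k):
    # indices of the k features most correlated with the labels
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0)
                                * np.linalg.norm(yc) + 1e-12)
    return np.argsort(corr)[-k:]

def cv_accuracy(X, y, fixed_feats=None):
    # 5-fold CV with a nearest-centroid classifier
    accs = []
    for fold in np.array_split(rng.permutation(n), 5):
        train = np.setdiff1d(np.arange(n), fold)
        feats = (fixed_feats if fixed_feats is not None
                 else top_corr_features(X[train], y[train], k))
        mu0 = X[train][y[train] == 0][:, feats].mean(axis=0)
        mu1 = X[train][y[train] == 1][:, feats].mean(axis=0)
        d0 = ((X[fold][:, feats] - mu0) ** 2).sum(axis=1)
        d1 = ((X[fold][:, feats] - mu1) ** 2).sum(axis=1)
        accs.append(np.mean((d1 < d0).astype(int) == y[fold]))
    return float(np.mean(accs))

leaky = cv_accuracy(X, y, fixed_feats=top_corr_features(X, y, k))  # leak!
proper = cv_accuracy(X, y)   # selection redone inside each fold
print(leaky, proper)         # leaky score looks great; proper is near 0.5
```

In a challenge, the same mechanism lets participants exploit any information the organizers accidentally left in the features or the train/test split.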


VARIANCE OF THE TEST SCORE

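The variance problem is easy to demonstrate with an illustrative simulation (mine, not from the talk): two models whose true accuracies differ by one point get mis-ranked by a 1000-example public leaderboard roughly a third of the time.

```python
import numpy as np

# Illustrative simulation (not from the talk): two models with true
# accuracies 0.80 and 0.79, scored on a 1000-example public leaderboard.
rng = np.random.default_rng(0)
n_public, n_trials = 1000, 2000

acc_a, acc_b = 0.80, 0.79
# public-set accuracies are binomial draws around the true accuracies
score_a = rng.binomial(n_public, acc_a, size=n_trials) / n_public
score_b = rng.binomial(n_public, acc_b, size=n_trials) / n_public

flip_rate = np.mean(score_b >= score_a)  # worse model tops the leaderboard
print(flip_rate)  # roughly 0.3
```

This is exactly why the winner did not use the public leaderboard score for model selection: out-of-fold cross-validation estimates on the full training set have far lower variance than the public test score.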


• Challenges are useful for

• generating visibility in the data science community about novel application domains

• benchmarking in a fair way state-of-the-art techniques on well-defined problems

• finding talented data scientists

• Limitations

• not necessarily adapted to solving complex and open-ended data science problems in realistic environments

• no direct access to solutions and data scientists

• no incentive to collaborate


DATA CHALLENGES


We decided to design something better


• Direct access to code, prototyping

• Incentivizing diversity

• Incentivizing collaboration

• Training

• Networking


RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)


• Our experience with the HiggsML challenge

• Need to connect data scientists to domain scientists and problems at the Paris-Saclay Center for Data Science

• Collaboration with management scientists specializing in managing innovation

• Michael Nielsen’s book: Reinventing Discovery

• 5+ iterations so far

WHERE DOES IT COME FROM?


UNIVERSITÉ PARIS-SACLAY


+ horizontal multi-disciplinary and multi-partner initiatives to create cohesion



A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay

http://www.datascience-paris-saclay.fr/

The Paris-Saclay Center for Data Science: Data Science for scientific Data

250 researchers in 35 laboratories

Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale

Chemistry: EA4041/UPSud

Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique

Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE

Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA

Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud

Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry

Visualization: INRIA, LIMSI

Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA

Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech

Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases

Domain science: human, society, life, brain, earth, universe

Tool building: software engineering, clouds/grids, high-performance computing, optimization

Roles: data scientist, applied scientist, domain scientist, data engineer, software engineer


datascience-paris-saclay.fr

@SaclayCDS

LIST/CEA


THE DATA SCIENCE LANDSCAPE

Domain science: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain

Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases

Tool building: software engineering, clouds/grids, high-performance computing, optimization

Roles: data scientist, data trainer, applied scientist, domain scientist, software engineer, data engineer


https://medium.com/@balazskegl


TOOLS: LANDSCAPE TO ECOSYSTEM


Roles: data scientist, data trainer, applied scientist, domain expert, software engineer, data engineer

Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases

Tool building: software engineering, clouds/grids, high-performance computing, optimization

Data domains: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain

• interdisciplinary projects • matchmaking tool • design and innovation strategy workshops • data challenges

• coding sprints • Open Software Initiative • code consolidator and engineering projects

• data science RAMPs and TSs • IT platform for linked data • annotation tools • SaaS data science platform


• Modularizing the collaboration

• independent subtasks

• reduces barriers

• broadens the range of available expertise

• Encouraging small contributions

• Rich and well-structured information commons

• so people can build on earlier work


NIELSEN’S CROWDSOURCING PRINCIPLES


RAMPS

• Single-day coding sessions

• 20-40 participants

• preparation is similar to challenges

• Goals

• focusing and motivating top talents

• promoting collaboration, speed, and efficiency

• solving (prototyping) real problems


TRAINING SPRINTS

• Single-day training sessions

• 20-40 participants

• focusing on a single subject (deep learning, model tuning, functional data, etc.)

• preparing RAMPs


ANALYTICS TOOLS TO PROMOTE COLLABORATION AND CODE REUSE


Center for Data ScienceParis-Saclay

ANALYTICS TOOLS TO MONITOR PROGRESS



RAPID ANALYTICS AND MODEL PROTOTYPING

2015 Jan 15 The HiggsML challenge



RAPID ANALYTICS AND MODEL PROTOTYPING

2015 Apr 10 Classifying variable stars



VARIABLE STARS


VARIABLE STARS


accuracy improvement: 89% to 96%


RAPID ANALYTICS AND MODEL PROTOTYPING

2015 June 16 and Sept 26 Predicting El Nino


RAPID ANALYTICS AND MODEL PROTOTYPING

RMSE improvement: 0.9 °C to 0.4 °C


2015 October 8 Insect classification

RAPID ANALYTICS AND MODEL PROTOTYPING


RAPID ANALYTICS AND MODEL PROTOTYPING

accuracy improvement: 30% to 70%


CONCLUSIONS

• Explore the open innovation space

• read Nielsen’s book

• Drop me a mail (balazs.kegl@gmail.com) if you are interested in beta-testing the RAMP tool

• Come to our CIML WS tomorrow


THANK YOU!