What is wrong with data challenges

Posted on 21-Feb-2017


Center for Data Science Paris-Saclay

CNRS & University Paris Saclay Center for Data Science

BALÁZS KÉGL

WHAT IS WRONG WITH DATA CHALLENGES

THE HIGGSML STORY: THE GOOD, THE BAD AND THE UGLY


Why am I so critical?

Why do I downplay our own success with the HiggsML challenge?


Because I believe that there is enormous potential in open innovation/crowdsourcing in science.

The current data challenge format is a single point in the landscape.

Olga Kokshagina 2015

INTERMEDIARIES: THE GROWING INTEREST FOR “CROWDS” -> EXPLOSION OF TOOLS

• Crowdsourcing is a model leveraging novel technologies (web 2.0, mobile apps, social networks)

• to build content and a structured set of information by gathering contributions from large groups of individuals


CROWDSOURCING ANNOTATION


CROWDSOURCING COLLECTION AND ANNOTATION


CROWDSOURCING MATH


CROWDSOURCING ANALYTICS


OPEN SOURCE


NEW PUBLICATION MODELS


THE BOOK TO READ


• Summary of our conclusions after the HiggsML challenge

• The good, the bad and the ugly

• Elaborating on some of the points

• Rapid Analytics and Model Prototyping

• an experimental format we have been developing


OUTLINE


CIML WORKSHOP TOMORROW


• Publicity, awareness

• both in physics (about the technology) and in ML (about the problem)

• Triggering open data

• http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014

• Learning a lot from Gábor on how to win a challenge

• Gábor getting hired by Google DeepMind

• Benchmarking

• Tool dissemination (xgboost, keras)


THE GOOD


• No direct access to code

• No direct access to data scientists

• No fundamentally new ideas

• No incentive to collaborate


THE BAD


• 18 months to prepare

• legal issues, access to data

• problem formulation: intellectually way more interesting than the challenge itself, but difficult to “market” or to crowdsource

• once a problem is formalized/formatted as a challenge, the problem is solved (“learning is easy” - Gaël Varoquaux)


THE UGLY


• We asked the wrong question, on purpose!

• because the right questions are complex and don’t fit the challenge setup

• would have led to way less participation

• would have led to bitterness among the participants, bad (?) for marketing


THE UGLY


• The HiggsML challenge on Kaggle

• https://www.kaggle.com/c/higgs-boson


PUBLICITY, AWARENESS

PUBLICITY, AWARENESS

[Slide: “Classification for discovery”, from B. Kégl’s AppStat@LAL “Learning to discover” talk]


AWARENESS DYNAMICS


• HEPML workshop @NIPS14

• JMLR WS proceedings: http://jmlr.csail.mit.edu/proceedings/papers/v42

• CERN Open Data

• http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014

• DataScience@LHC

• http://indico.cern.ch/event/395374/

• Flavors of physics challenge

• https://www.kaggle.com/c/flavours-of-physics


LEARNING FROM THE WINNER


https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf


LEARNING FROM THE WINNER


• Sophisticated cross validation, CV bagging

• Sophisticated calibration and model averaging

• The first step: pro participants check whether the effort is worthwhile (risk assessment)

• variance estimate of the score

• Don’t use the public leaderboard score for model selection

• None of Gábor’s 200 out-of-the-ordinary ideas worked

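Gábor’s “CV bagging” can be sketched roughly as follows. This is an illustrative numpy version under simplified assumptions (a toy ridge regression stands in for his actual models): instead of refitting a single model on the full training set after cross-validation, every fold-model is kept and their predictions are averaged.

```python
import numpy as np

# Rough sketch of CV bagging (illustrative; not Gabor's actual pipeline):
# keep every fold-model from cross-validation and average their
# predictions, instead of refitting a single model on the full data.
rng = np.random.default_rng(0)

# toy regression data: y = 2x + noise
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

def fit_ridge(X, y, lam=1e-3):
    # closed-form ridge regression: w = (X'X + lam I)^-1 X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

K = 5
folds = np.array_split(rng.permutation(len(X)), K)
models = [fit_ridge(np.delete(X, f, axis=0), np.delete(y, f)) for f in folds]

X_new = np.array([[1.0], [-2.0]])
# bagged prediction: the average of the K fold-models
y_bagged = np.mean([X_new @ w for w in models], axis=0)
print(y_bagged)  # close to [2, -4]
```

Averaging the fold-models also reuses the out-of-fold predictions already computed for validation, which is part of why the technique is cheap for challenge participants.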


BENCHMARKING


[Slide: “Classification for discovery”, from B. Kégl’s AppStat@LAL “Learning to discover” talk]


BENCHMARKING


But what score did we optimize?

And why?

CLASSIFICATION FOR DISCOVERY

[Figure: signal and background distributions, shown as counts per year and as probabilities]

Goal: optimize the expected discovery significance

• flux × time and the selection give the expected background: say, b = 100 events

• total count: say, 150 events

• the excess is s = 50 events

• AMS = s/√b = 50/√100 = 5 sigma

[Excerpt from the HiggsML technical documentation:]

…approaches a simple asymptotic form related to the chi-squared distribution in the large-sample limit. In practice the asymptotic formulae are found to provide a useful approximation even for moderate data samples (see, e.g., [6]). Assuming that these hold, the p-value of the background-only hypothesis from an observed value of q₀ is found to be

p = 1 − Φ(√q₀),  (11)

where Φ is the standard Gaussian cumulative distribution. In particle physics it is customary to convert the p-value into the equivalent significance Z, defined as

Z = Φ⁻¹(1 − p),  (12)

where Φ⁻¹ is the standard normal quantile. Eqs. (11) and (12) lead therefore to the simple result

Z = √q₀ = √( 2( n ln(n/μ_b) − n + μ_b ) )  (13)

if n > μ_b, and Z = 0 otherwise. The quantity Z measures the statistical significance in units of standard deviations or “sigmas”. Often in particle physics a significance of at least Z = 5 (a five-sigma effect) is regarded as sufficient to claim a discovery. This corresponds to finding the p-value less than 2.9 × 10⁻⁷.¹¹
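Eqs. (11)-(13) are easy to check numerically. A minimal stdlib-only sketch (the helper names are mine):

```python
import math

# Numerical check of Eqs. (11)-(13); function names are mine.
def p_value(Z):
    # Eqs. (11)-(12): p = 1 - Phi(Z), via the complementary error function
    return 0.5 * math.erfc(Z / math.sqrt(2))

def significance(n, mu_b):
    # Eq. (13): Z = sqrt(2 (n ln(n / mu_b) - n + mu_b)) for n > mu_b
    if n <= mu_b:
        return 0.0
    return math.sqrt(2 * (n * math.log(n / mu_b) - n + mu_b))

print(p_value(5.0))            # ~2.87e-7: the five-sigma threshold
print(significance(150, 100))  # ~4.65 sigma for 150 observed over 100 expected
```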

4.2 The median discovery significance

Eq. (13) represents the significance that we would obtain for a given number of events n observed in the search region G, knowing the background expectation μ_b. When optimizing the design of the classifier g which defines the search region G = {x : g(x) = s}, we do not know n and μ_b. As usual in empirical risk minimization [9], we estimate the expectation μ_b by its empirical counterpart b from Eq. (5). We then replace n by s + b to obtain the approximate median significance

AMS₂ = √( 2( (s + b) ln(1 + s/b) − s ) ).  (14)

Taking into consideration that (x + 1) ln(x + 1) = x + x²/2 + O(x³), AMS₂ can be rewritten as

AMS₂ = AMS₃ × √( 1 + O((s/b)³) ),

where

AMS₃ = s/√b.  (15)

The two criteria Eqs. (14) and (15) are practically indistinguishable when b ≫ s. This approximation often holds in practice and may, depending on the chosen search region, be a valid surrogate in the Challenge.

In preliminary runs it happened sometimes that AMS₂ was maximized in small selection regions G, resulting in a large variance of the AMS. While large variance in the real analysis is not necessarily a problem, it would make it difficult to reliably compare the participants of the Challenge if the optimal region was small. So, in order to decrease the variance of the AMS, we decided to bias the optimal selection region towards larger regions by adding an artificial shift b_reg to b. The value b_reg = 10 was determined using preliminary experiments.

¹¹ This extremely high threshold for statistical significance is motivated by a number of factors related to multiple testing, accounting for mismodeling, and the high standard one would like to require for an important discovery.

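Eqs. (14)-(15) and the regularization shift translate directly into code. A minimal sketch consistent with the formulas above (function names are mine; b_reg = 10 is the value quoted in the text):

```python
import math

# Sketch of Eqs. (14)-(15); b_reg = 10 is the regularization shift
# described above, and the function names are mine.
def ams2(s, b, b_reg=10.0):
    # approximate median significance, Eq. (14), on the shifted background
    b = b + b_reg
    return math.sqrt(2 * ((s + b) * math.log(1 + s / b) - s))

def ams3(s, b):
    # the simple approximation s / sqrt(b), Eq. (15), valid when b >> s
    return s / math.sqrt(b)

print(ams3(50, 100))           # the toy example above: 5.0 "sigma"
print(ams2(50, 100, b_reg=0))  # Eq. (14) on the same numbers: ~4.65
print(ams2(50, 100))           # regularized objective: slightly lower still
```

Note how Eq. (14) already gives a lower value than s/√b when s is not small relative to b, and how the shift b_reg pushes the optimum towards larger selection regions by further penalizing small-b selections.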


How to handle systematic (model) uncertainties?

• OK, so let’s design an objective function that can take background systematics into consideration

• Likelihood with unknown background b ~ N(μ_b, σ_b):

L(μ_s, μ_b) = P(n, b | μ_s, μ_b, σ_b) = ( (μ_s + μ_b)ⁿ / n! ) e^(−(μ_s + μ_b)) · ( 1 / (√(2π) σ_b) ) e^(−(b − μ_b)² / (2σ_b²))

• Profile likelihood ratio λ(0) = L(0, μ̂̂_b) / L(μ̂_s, μ̂_b)

• The new Approximate Median Significance (by Glen Cowan):

AMS = √( 2( (s + b) ln((s + b)/b₀) − s − b + b₀ ) + (b − b₀)²/σ_b² )

where

b₀ = ½ ( b − σ_b² + √( (b − σ_b²)² + 4(s + b)σ_b² ) )
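The new AMS can be written out directly. A sketch of the slide’s formula (function names are mine); as σ_b → 0 it reduces to the systematics-free AMS of Eq. (14):

```python
import math

# Sketch of the systematics-aware AMS from the slide; names are mine.
def ams_syst(s, b, sigma_b):
    # b0: the profiled background estimate from the slide
    b0 = 0.5 * (b - sigma_b**2
                + math.sqrt((b - sigma_b**2) ** 2 + 4 * (s + b) * sigma_b**2))
    return math.sqrt(2 * ((s + b) * math.log((s + b) / b0) - s - b + b0)
                     + (b - b0) ** 2 / sigma_b**2)

def ams2(s, b):
    # systematics-free AMS of Eq. (14), for comparison
    return math.sqrt(2 * ((s + b) * math.log(1 + s / b) - s))

print(ams_syst(50, 100, 1e-3))  # ~4.65: tiny sigma_b recovers Eq. (14)
print(ams_syst(50, 100, 20.0))  # ~2.15: a 20% background systematic
                                # substantially dilutes the significance
```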


HOW TO HANDLE SYSTEMATIC UNCERTAINTIES


Why didn’t we use it?


[Slide: the new Approximate Median Significance again, with a figure comparing the new AMS, the old AMS, and the ATLAS result]


LEARNING FROM THE WINNER


• Sophisticated cross validation, CV bagging

• Sophisticated calibration and model averaging

• The first step: pro participants check whether the effort is worthwhile (risk assessment)

• variance estimate of the score

• Don’t use the public leaderboard score for model selection

• None of Gábor’s 200 out-of-the-ordinary ideas worked


THE TWO MOST COMMON DATA CHALLENGE KILLERS


Leakage

Variance of the test score
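Leakage often enters through preprocessing rather than through the target itself. An illustrative numpy simulation (mine, not from the talk): selecting features on the full data set before cross-validating manufactures signal out of pure noise.

```python
import numpy as np

# Illustrative sketch (not from the talk): feature selection performed on
# the full data set leaks label information into cross-validation.
rng = np.random.default_rng(42)
n, p, k = 60, 2000, 20
X = rng.normal(size=(n, p))        # pure noise features
y = rng.integers(0, 2, size=n)     # random labels: there is nothing to learn

def top_corr_features(X, y, k):
    # indices of the k features most correlated with the labels
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0)
                                * np.linalg.norm(yc) + 1e-12)
    return np.argsort(corr)[-k:]

def cv_accuracy(X, y, fixed_feats=None):
    # 5-fold CV with a nearest-centroid classifier
    accs = []
    for fold in np.array_split(rng.permutation(n), 5):
        train = np.setdiff1d(np.arange(n), fold)
        feats = (fixed_feats if fixed_feats is not None
                 else top_corr_features(X[train], y[train], k))
        mu0 = X[train][y[train] == 0][:, feats].mean(axis=0)
        mu1 = X[train][y[train] == 1][:, feats].mean(axis=0)
        d0 = ((X[fold][:, feats] - mu0) ** 2).sum(axis=1)
        d1 = ((X[fold][:, feats] - mu1) ** 2).sum(axis=1)
        accs.append(np.mean((d1 < d0).astype(int) == y[fold]))
    return float(np.mean(accs))

leaky = cv_accuracy(X, y, fixed_feats=top_corr_features(X, y, k))  # leak!
proper = cv_accuracy(X, y)   # selection redone inside each fold
print(leaky, proper)         # leaky score looks great; proper is near 0.5
```

In a challenge, the same mechanism lets participants exploit any information the organizers accidentally left in the features or the train/test split.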


VARIANCE OF THE TEST SCORE

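The variance problem is easy to demonstrate with an illustrative simulation (mine, not from the talk): two models whose true accuracies differ by one point get mis-ranked by a 1000-example public leaderboard roughly a third of the time.

```python
import numpy as np

# Illustrative simulation (not from the talk): two models with true
# accuracies 0.80 and 0.79, scored on a 1000-example public leaderboard.
rng = np.random.default_rng(0)
n_public, n_trials = 1000, 2000

acc_a, acc_b = 0.80, 0.79
# public-set accuracies are binomial draws around the true accuracies
score_a = rng.binomial(n_public, acc_a, size=n_trials) / n_public
score_b = rng.binomial(n_public, acc_b, size=n_trials) / n_public

flip_rate = np.mean(score_b >= score_a)  # worse model tops the leaderboard
print(flip_rate)  # roughly 0.3
```

This is exactly why the winner did not use the public leaderboard score for model selection: out-of-fold cross-validation estimates on the full training set have far lower variance than the public test score.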


• Challenges are useful for

• generating visibility in the data science community about novel application domains

• benchmarking in a fair way state-of-the-art techniques on well-defined problems

• finding talented data scientists

• Limitations

• not necessarily adapted to solving complex and open-ended data science problems in realistic environments

• no direct access to solutions and data scientists

• no incentive to collaborate


DATA CHALLENGES


We decided to design something better


• Direct access to code, prototyping

• Incentivizing diversity

• Incentivizing collaboration

• Training

• Networking


RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)


• Our experience with the HiggsML challenge

• Need to connect data scientists to domain scientists and problems at the Paris-Saclay Center for Data Science

• Collaboration with management scientists specializing in managing innovation

• Michael Nielsen’s book: Reinventing Discovery

• 5+ iterations so far

WHERE DOES IT COME FROM?


UNIVERSITÉ PARIS-SACLAY


+ horizontal multi-disciplinary and multi-partner initiatives to create cohesion



A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay

http://www.datascience-paris-saclay.fr/

The Paris-Saclay Center for Data Science: Data Science for scientific Data

250 researchers in 35 laboratories

Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale

Chemistry: EA4041/UPSud

Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique

Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE

Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA

Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud

Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry

Visualization: INRIA, LIMSI

Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA

Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech

Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases

Domain science: human, society, life, brain, earth, universe

Tool building: software engineering, clouds/grids, high-performance computing, optimization

Roles: data scientist, applied scientist, domain scientist, data engineer, software engineer


datascience-paris-saclay.fr

@SaclayCDS

LIST/CEA


THE DATA SCIENCE LANDSCAPE

Domain science: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain

Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases

Tool building: software engineering, clouds/grids, high-performance computing, optimization

Roles: data scientist, data trainer, applied scientist, domain scientist, software engineer, data engineer


https://medium.com/@balazskegl


TOOLS: LANDSCAPE TO ECOSYSTEM


Roles: data scientist, data trainer, applied scientist, domain expert, software engineer, data engineer

Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases

Tool building: software engineering, clouds/grids, high-performance computing, optimization

Data domains: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain

• interdisciplinary projects • matchmaking tool • design and innovation strategy workshops • data challenges

• coding sprints • Open Software Initiative • code consolidator and engineering projects

• data science RAMPs and TSs • IT platform for linked data • annotation tools • SaaS data science platform


• Modularizing the collaboration

• independent subtasks

• reduces barriers

• broadens the range of available expertise

• Encouraging small contributions

• Rich and well-structured information commons

• so people can build on earlier work


NIELSEN’S CROWDSOURCING PRINCIPLES


RAMPS

• Single-day coding sessions

• 20-40 participants

• preparation is similar to challenges

• Goals

• focusing and motivating top talents

• promoting collaboration, speed, and efficiency

• solving (prototyping) real problems


TRAINING SPRINTS

• Single-day training sessions

• 20-40 participants

• focusing on a single subject (deep learning, model tuning, functional data, etc.)

• preparing RAMPs


ANALYTICS TOOLS TO PROMOTE COLLABORATION AND CODE REUSE


Center for Data ScienceParis-Saclay

ANALYTICS TOOLS TO MONITOR PROGRESS



RAPID ANALYTICS AND MODEL PROTOTYPING

2015 Jan 15 The HiggsML challenge



RAPID ANALYTICS AND MODEL PROTOTYPING

2015 Apr 10 Classifying variable stars



VARIABLE STARS


VARIABLE STARS


accuracy improvement: 89% to 96%


RAPID ANALYTICS AND MODEL PROTOTYPING

2015 June 16 and Sept 26 Predicting El Nino


RAPID ANALYTICS AND MODEL PROTOTYPING

RMSE improvement: 0.9 °C to 0.4 °C


2015 October 8 Insect classification

RAPID ANALYTICS AND MODEL PROTOTYPING


RAPID ANALYTICS AND MODEL PROTOTYPING

accuracy improvement: 30% to 70%


CONCLUSIONS

• Explore the open innovation space

• read Nielsen’s book

• Drop me a mail (balazs.kegl@gmail.com) if you are interested in beta-testing the RAMP tool

• Come to our CIML WS tomorrow


THANK YOU!