Application of Data Mining Algorithms in Atmospheric Neutrino Analyses with IceCube

Tim Ruhe, TU Dortmund



2

Outline

- Data mining is more...
- Why is IceCube interesting (from a machine learning point of view)?
- Data preprocessing and dimensionality reduction
- Training and validation of a learning algorithm
- Results
- Other detector configurations?
- Summary & Outlook

3

Data Mining is more...

Examples (annotated) — historical data, simulations → Learning Algorithm → Model

Model + new data (not annotated) → Application → Information, knowledge → Nobel prize(s)

4

Data Mining is more...

The same pipeline as on the previous slide, with one addition:

Preprocessing

Garbage in/Garbage out

5

Data Mining is more...

The same pipeline again (including preprocessing), with one further addition:

Validation

6

Why is IceCube interesting from a machine learning point of view?

- Huge amount of data
- Highly imbalanced distribution of event classes (signal and background)
- Huge amount of data to be processed by the learner (Big Data)
- A real-life problem

7

Preprocessing (1): Reducing the Data Volume Through Cuts

Background rejection: 91.4%
Signal efficiency: 57.1%

BUT: the remaining background is significantly harder to reject!
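The rejection and efficiency figures are simple ratios of event counts before and after the cut. A minimal sketch; the counts below are made up and merely chosen to reproduce the slide's percentages:

```python
def background_rejection(bg_before, bg_after):
    """Fraction of background events removed by the cut."""
    return 1.0 - bg_after / bg_before

def signal_efficiency(sig_before, sig_after):
    """Fraction of signal events surviving the cut."""
    return sig_after / sig_before

# Hypothetical counts: 1,000,000 background events reduced to 86,000,
# 100,000 signal events reduced to 57,100.
print(f"rejection:  {background_rejection(1_000_000, 86_000):.1%}")   # 91.4%
print(f"efficiency: {signal_efficiency(100_000, 57_100):.1%}")        # 57.1%
```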

8

Preprocessing (2): Variable Selection

Tim Ruhe | Statistische Methoden der Datenanalyse

- Check for missing values → exclude a variable if more than 30% of its values are missing.
- Check for potential bias → exclude everything that is useless, redundant, or a potential source of bias.
- Check for correlations → exclude everything that has a correlation of 1.0 with another variable.

These manual checks, followed by automated feature selection, reduced the 2600 initial variables to 477.

9

Relevance vs. Redundancy: MRMR (continuous case)

Relevance: V_F = (1/|S|) · Σ_{i∈S} F(x_i, c)

Redundancy: W_c = (1/|S|²) · Σ_{i,j∈S} |c(x_i, x_j)|

MRMR: max(V_F − W_c) or max(V_F / W_c)

(F: F-test statistic between feature and class; c(·,·): Pearson correlation; S: the set of selected features.)

10

Feature Selection Stability

Jaccard: J(A, B) = |A ∩ B| / |A ∪ B|

Stability: average the Jaccard index over many sets of selected variables.
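The stability measure follows directly from the definition; the fold selections below are invented variable names for illustration:

```python
from itertools import combinations

def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B| -- 1.0 means identical selections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def selection_stability(selections):
    """Average pairwise Jaccard index over the variable sets chosen
    in different runs (the slide's 'average over many sets')."""
    pairs = list(combinations(selections, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical selections from three folds:
folds = [{"qtot", "zenith", "ndom"},
         {"qtot", "zenith", "speed"},
         {"qtot", "zenith", "ndom"}]
print(round(selection_stability(folds), 3))  # → 0.667
```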

11

Comparing Forward Selection and MRMR

12

Training and Validation of a Random Forest

Use an ensemble of simple decision trees; obtain the final classification as an average over all trees:

s = (1 / n_trees) · Σ_{i=1}^{n_trees} s_i,  with s_i ∈ [0, 1]

13

Training and Validation of a Random Forest


5-fold cross validation to validate the performance of the forest.
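A toy sketch of the averaging step: 500 stand-in "trees" vote 0 (background) or 1 (signal), and the forest output is their mean. The stubs just threshold one invented feature; they are not trained decision trees:

```python
import random
import statistics

random.seed(42)

def forest_score(event, trees):
    """Final classification = average over all trees; a cut on this
    average then selects neutrino candidates."""
    return statistics.mean(tree(event) for tree in trees)

# Stand-in "trees": in the real analysis these are 500 trained decision
# trees; here each stub thresholds a toy feature at a random value.
def make_stub_tree():
    threshold = random.uniform(0.3, 0.7)
    return lambda event: 1 if event["quality"] > threshold else 0

trees = [make_stub_tree() for _ in range(500)]
signal_like = {"quality": 0.95}      # above every threshold
background_like = {"quality": 0.05}  # below every threshold
print(forest_score(signal_like, trees))      # → 1.0 (all trees vote signal)
print(forest_score(background_like, trees))  # → 0.0 (no tree votes signal)
```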

14

Random Forest and Cross Validation in Detail (1)

Background muons: 750,000 in total (CORSIKA, Polygonato); 600,000 available for training.

Neutrinos: 70,000 in total (NuGen, E^-2 spectrum); 56,000 available for training.

Sampling: 27,000 events per class are drawn for each training fold.

15

Random Forest and Cross Validation in Detail (2)

150,000 background muons and 14,000 neutrinos are available for testing; 27,000 events per class are used to train each forest of 500 trees.

Train → Apply → Repeat (×5).

16

Random Forest Output

17

Random Forest Output

We need an additional cut on the output of the Random Forest!

18

Random Forest Output: Cut at 500 trees

Applying the forest and the additional output cut to experimental data yields:

- 28,830 ± 480 expected neutrino candidates
- 27,771 observed neutrino candidates
- Background rejection: 99.9999%; signal efficiency: 18.2%; estimated purity: (99.59 ± 0.37)%

19

Unfolding the spectrum

TRUEE (unfolding software)

This is no Data Mining...

...but it ain't magic either

20

Moving on... IC79

- 212 neutrino candidates per day
- 66,885 neutrino candidates in total
- 330 ± 200 background muons

The entire analysis chain can be applied to other detector configurations with minor changes (e.g. the ice model).

21

Summary and Outlook

- 99.9999% background rejection
- Purities above 99% are routinely achieved
- Future improvements? By starting at an earlier analysis level...
- MRMR + Random Forest

22

Backup Slides

23

RapidMiner in a Nutshell

- Developed at the Department of Computer Science at TU Dortmund (formerly YALE)
- Operator-based, written in Java
- It used to be open source
- Many, many plugins due to a rather active community
- One of the most widely used data mining tools

24

What I like about it

- Data flow is nicely visualized and can easily be followed and comprehended
- Rather easy to learn, even without programming experience
- Large community (updates, bugfixes, plugins)
- Professional tool (they actually make money with it!)
- Good support
- Many tutorials can be found online, even specialized ones
- Most operators work like a charm
- Extendable

25

Relevance vs. Redundancy: MRMR (discrete case)

Relevance: V_I = (1/|S|) · Σ_{i∈S} I(x_i; c)

Redundancy: W_I = (1/|S|²) · Σ_{i,j∈S} I(x_i; x_j)

MRMR: max(V_I − W_I) or max(V_I / W_I)

(I(·;·): mutual information between a feature and the class, or between two features.)
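The discrete relevance and redundancy terms are built from mutual information, which can be estimated from empirical frequencies; a stdlib-only sketch with invented toy labels:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) = sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) p(y))),
    in nats, estimated from the empirical joint distribution of two
    discrete value lists."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# A variable identical to the class label carries maximal information;
# an unrelated one carries none.
label = [0, 0, 1, 1]
print(round(mutual_information(label, label), 3))         # → 0.693 (= log 2)
print(round(mutual_information([0, 1, 0, 1], label), 3))  # → 0.0
```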

26

Feature Selection Stability

Jaccard: J(A, B) = |A ∩ B| / |A ∪ B|

Kuncheva: I_C(A, B) = (r·n − k²) / (k·(n − k)), with k = |A| = |B|, r = |A ∩ B|, and n the total number of features.
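Kuncheva's index follows directly from the formula; the variable names below are invented:

```python
def kuncheva_index(a, b, n):
    """Kuncheva's consistency index for two selections A, B of equal
    size k out of n features: I_C = (r*n - k^2) / (k*(n - k)), with
    r = |A ∩ B|. Unlike the Jaccard index it corrects for the overlap
    expected by chance."""
    a, b = set(a), set(b)
    assert len(a) == len(b), "defined for equal-sized selections"
    k = len(a)
    r = len(a & b)
    return (r * n - k * k) / (k * (n - k))

# Two selections of k = 3 variables out of n = 10, overlapping in r = 2:
print(kuncheva_index({"qtot", "zenith", "ndom"},
                     {"qtot", "zenith", "speed"}, n=10))  # → 11/21 ≈ 0.524
```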

27

Ensemble methods

- With weights (e.g. Boosting)
- Without weights (e.g. Random Forest)

28

Random Forest: What is randomized?

Randomness 1: Events the tree is trained on (bagging)

Randomness 2: Variables that are available for a split
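Both sources of randomness can be sketched in a few lines; the events and feature names are invented toys:

```python
import random

random.seed(1)

def bootstrap_sample(events):
    """Randomness 1 (bagging): each tree sees a bootstrap sample --
    the same number of events, drawn with replacement."""
    return [random.choice(events) for _ in events]

def random_feature_subset(features, m):
    """Randomness 2: at each split only a random subset of m features
    is made available to the tree."""
    return random.sample(features, m)

events = list(range(10))
features = ["qtot", "zenith", "ndom", "speed", "length"]
print(bootstrap_sample(events))           # duplicates are expected
print(random_feature_subset(features, 2))
```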

29

Are we actually better than simpler methods?