Application of Data Mining Algorithms in Atmospheric Neutrino Analyses with IceCube

Tim Ruhe, TU Dortmund



2

Outline

- Data mining is more...
- Why is IceCube interesting (from a machine learning point of view)?
- Data preprocessing and dimensionality reduction
- Training and validation of a learning algorithm
- Results
- Other detector configurations?
- Summary & Outlook

3

Data Mining is more...

Examples (annotated) — historical data, simulations → Learning Algorithm → Model

Model + new data (not annotated) → Application → Information, knowledge → Nobel prize(s)

4

Data Mining is more...

The same pipeline as on the previous slide, with one addition:

Preprocessing

Garbage in/Garbage out

5

Data Mining is more...

The same pipeline again (including preprocessing), with one further addition:

Validation

6

Why is IceCube interesting from a machine learning point of view?

- Huge amount of data
- Highly imbalanced distribution of event classes (signal and background)
- Huge amount of data to be processed by the learner (Big Data)
- A real-life problem

7

Preprocessing (1): Reducing the Data Volume Through Cuts

Background rejection: 91.4%
Signal efficiency: 57.1%

BUT: the remaining background is significantly harder to reject!
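The rejection and efficiency figures are simple ratios of event counts before and after the cut. A minimal sketch; the counts below are made up and merely chosen to reproduce the slide's percentages:

```python
def background_rejection(bg_before, bg_after):
    """Fraction of background events removed by the cut."""
    return 1.0 - bg_after / bg_before

def signal_efficiency(sig_before, sig_after):
    """Fraction of signal events surviving the cut."""
    return sig_after / sig_before

# Hypothetical counts: 1,000,000 background events reduced to 86,000,
# 100,000 signal events reduced to 57,100.
print(f"rejection:  {background_rejection(1_000_000, 86_000):.1%}")   # 91.4%
print(f"efficiency: {signal_efficiency(100_000, 57_100):.1%}")        # 57.1%
```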

8

Preprocessing (2): Variable Selection

Tim Ruhe | Statistische Methoden der Datenanalyse

- Check for missing values → exclude a variable if more than 30% of its values are missing.
- Check for potential bias → exclude everything that is useless, redundant, or a potential source of bias.
- Check for correlations → exclude everything that has a correlation of 1.0 with another variable.

These manual checks, followed by automated feature selection, reduced the 2600 initial variables to 477.

9

Relevance vs. Redundancy: MRMR (continuous case)

Relevance: V_F = (1/|S|) · Σ_{i∈S} F(x_i, c)

Redundancy: W_c = (1/|S|²) · Σ_{i,j∈S} |c(x_i, x_j)|

MRMR: max(V_F − W_c) or max(V_F / W_c)

(F: F-test statistic between feature and class; c(·,·): Pearson correlation; S: the set of selected features.)

10

Feature Selection Stability

Jaccard: J(A, B) = |A ∩ B| / |A ∪ B|

Stability: average the Jaccard index over many sets of selected variables.
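The stability measure follows directly from the definition; the fold selections below are invented variable names for illustration:

```python
from itertools import combinations

def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B| -- 1.0 means identical selections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def selection_stability(selections):
    """Average pairwise Jaccard index over the variable sets chosen
    in different runs (the slide's 'average over many sets')."""
    pairs = list(combinations(selections, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical selections from three folds:
folds = [{"qtot", "zenith", "ndom"},
         {"qtot", "zenith", "speed"},
         {"qtot", "zenith", "ndom"}]
print(round(selection_stability(folds), 3))  # → 0.667
```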

11

Comparing Forward Selection and MRMR

12

Training and Validation of a Random Forest

Use an ensemble of simple decision trees; obtain the final classification as an average over all trees:

s = (1 / n_trees) · Σ_{i=1}^{n_trees} s_i,  with s_i ∈ [0, 1]

13

Training and Validation of a Random Forest


5-fold cross validation to validate the performance of the forest.
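A toy sketch of the averaging step: 500 stand-in "trees" vote 0 (background) or 1 (signal), and the forest output is their mean. The stubs just threshold one invented feature; they are not trained decision trees:

```python
import random
import statistics

random.seed(42)

def forest_score(event, trees):
    """Final classification = average over all trees; a cut on this
    average then selects neutrino candidates."""
    return statistics.mean(tree(event) for tree in trees)

# Stand-in "trees": in the real analysis these are 500 trained decision
# trees; here each stub thresholds a toy feature at a random value.
def make_stub_tree():
    threshold = random.uniform(0.3, 0.7)
    return lambda event: 1 if event["quality"] > threshold else 0

trees = [make_stub_tree() for _ in range(500)]
signal_like = {"quality": 0.95}      # above every threshold
background_like = {"quality": 0.05}  # below every threshold
print(forest_score(signal_like, trees))      # → 1.0 (all trees vote signal)
print(forest_score(background_like, trees))  # → 0.0 (no tree votes signal)
```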

14

Random Forest and Cross Validation in Detail (1)

Background muons: 750,000 in total (CORSIKA, Polygonato); 600,000 available for training.

Neutrinos: 70,000 in total (NuGen, E^-2 spectrum); 56,000 available for training.

Sampling: 27,000 events per class are drawn for each training fold.

15

Random Forest and Cross Validation in Detail (2)

150,000 background muons and 14,000 neutrinos are available for testing; 27,000 events per class are used to train each forest of 500 trees.

Train → Apply → Repeat (×5).

16

Random Forest Output

17

Random Forest Output

We need an additional cut on the output of the Random Forest!

18

Random Forest Output: Cut at 500 trees

Applying the forest and the additional output cut to experimental data yields:

- 28,830 ± 480 expected neutrino candidates
- 27,771 observed neutrino candidates
- Background rejection: 99.9999%; signal efficiency: 18.2%; estimated purity: (99.59 ± 0.37)%

19

Unfolding the spectrum

TRUEE (unfolding software)

This is no Data Mining...

...but it ain't magic either

20

Moving on... IC79

- 212 neutrino candidates per day
- 66,885 neutrino candidates in total
- 330 ± 200 background muons

The entire analysis chain can be applied to other detector configurations with minor changes (e.g. the ice model).

21

Summary and Outlook

- 99.9999% background rejection
- Purities above 99% are routinely achieved
- Future improvements? By starting at an earlier analysis level...
- MRMR + Random Forest

22

Backup Slides

23

RapidMiner in a Nutshell

- Developed at the Department of Computer Science at TU Dortmund (formerly YALE)
- Operator-based, written in Java
- It used to be open source
- Many, many plugins due to a rather active community
- One of the most widely used data mining tools

24

What I like about it

- Data flow is nicely visualized and can easily be followed and comprehended
- Rather easy to learn, even without programming experience
- Large community (updates, bugfixes, plugins)
- Professional tool (they actually make money with it!)
- Good support
- Many tutorials can be found online, even specialized ones
- Most operators work like a charm
- Extendable

25

Relevance vs. Redundancy: MRMR (discrete case)

Relevance: V_I = (1/|S|) · Σ_{i∈S} I(x_i; c)

Redundancy: W_I = (1/|S|²) · Σ_{i,j∈S} I(x_i; x_j)

MRMR: max(V_I − W_I) or max(V_I / W_I)

(I(·;·): mutual information between a feature and the class, or between two features.)
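The discrete relevance and redundancy terms are built from mutual information, which can be estimated from empirical frequencies; a stdlib-only sketch with invented toy labels:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) = sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) p(y))),
    in nats, estimated from the empirical joint distribution of two
    discrete value lists."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# A variable identical to the class label carries maximal information;
# an unrelated one carries none.
label = [0, 0, 1, 1]
print(round(mutual_information(label, label), 3))         # → 0.693 (= log 2)
print(round(mutual_information([0, 1, 0, 1], label), 3))  # → 0.0
```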

26

Feature Selection Stability

Jaccard: J(A, B) = |A ∩ B| / |A ∪ B|

Kuncheva: I_C(A, B) = (r·n − k²) / (k·(n − k)), with k = |A| = |B|, r = |A ∩ B|, and n the total number of features.
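Kuncheva's index follows directly from the formula; the variable names below are invented:

```python
def kuncheva_index(a, b, n):
    """Kuncheva's consistency index for two selections A, B of equal
    size k out of n features: I_C = (r*n - k^2) / (k*(n - k)), with
    r = |A ∩ B|. Unlike the Jaccard index it corrects for the overlap
    expected by chance."""
    a, b = set(a), set(b)
    assert len(a) == len(b), "defined for equal-sized selections"
    k = len(a)
    r = len(a & b)
    return (r * n - k * k) / (k * (n - k))

# Two selections of k = 3 variables out of n = 10, overlapping in r = 2:
print(kuncheva_index({"qtot", "zenith", "ndom"},
                     {"qtot", "zenith", "speed"}, n=10))  # → 11/21 ≈ 0.524
```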

27

Ensemble methods

- With weights (e.g. Boosting)
- Without weights (e.g. Random Forest)

28

Random Forest: What is randomized?

Randomness 1: Events the tree is trained on (bagging)

Randomness 2: Variables that are available for a split
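Both sources of randomness can be sketched in a few lines; the events and feature names are invented toys:

```python
import random

random.seed(1)

def bootstrap_sample(events):
    """Randomness 1 (bagging): each tree sees a bootstrap sample --
    the same number of events, drawn with replacement."""
    return [random.choice(events) for _ in events]

def random_feature_subset(features, m):
    """Randomness 2: at each split only a random subset of m features
    is made available to the tree."""
    return random.sample(features, m)

events = list(range(10))
features = ["qtot", "zenith", "ndom", "speed", "length"]
print(bootstrap_sample(events))           # duplicates are expected
print(random_feature_subset(features, 2))
```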

29

Are we actually better than simpler methods?