Helge Voss, Nikhef, 23rd - 27th April 2007, TMVA Toolkit for Multivariate Data Analysis: ACAT 2007
TMVA Toolkit for Multivariate Data Analysis
with ROOT
Helge Voss, MPI-K Heidelberg
on behalf of: Andreas Höcker, Fredrik Tegenfeld, Joerg Stelzer*
http://tmva.sourceforge.net/
arXiv: physics/0703039
Supply an environment to easily:
apply different sophisticated data selection algorithms
have them all trained, tested and evaluated
find the best one for your selection problem
and contributors:
A.Christov, S.Henrot-Versillé, M.Jachowski, A.Krasznahorkay Jr., Y.Mahalalel, X.Prudent, P.Speckmayer, M.Wolter, A.Zemla
Motivation/Outline
Outline:
introduction
the MVA classifiers available in TMVA
demonstration with toy examples
summary
ROOT is the analysis framework used by most (HEP) physicists
Idea: rather than just implementing new MVA techniques and making them somehow available in ROOT (as e.g. TMultiLayerPerceptron does):
have one common platform/interface for all MVA classifiers
make it easy to use and compare different MVA classifiers
train/test on same data sample and evaluate consistently
[Figure: classifier output y, with y(Bkg) → 0 and y(Signal) → 1]
Multivariate Event Classification
All multivariate classifiers condense (correlated) multi-variable input information into a single scalar output variable: $R^n \to R$
One variable to base your decision on
What is in TMVA
TMVA currently includes:
Rectangular cut optimisation
Projective and multi-dimensional likelihood estimators
Fisher discriminant and H-Matrix (χ² estimator)
Artificial Neural Networks (3 different implementations)
Boosted/bagged Decision Trees
Rule Fitting
Support Vector Machines
TMVA package provides training, testing and evaluation of the classifiers
each classifier provides a ranking of the input variables
classifiers produce weight files that are read by reader class for MVA application
all classifiers are highly customizable
integrated in ROOT (since release 5.11/03) and very easy to use!
support of arbitrary pre-selections and individual event weights
common pre-processing of input: de-correlation, principal component analysis
Commonly realised for all methods in TMVA (centrally in DataSet class):
Note that this "de-correlation" is complete only if:
the input variables are Gaussian distributed
the correlations are linear
In practice, the gain from de-correlation is often rather modest, and can even be harmful.
Preprocessing the Input Variables: Decorrelation
[Figure: variable distributions: original, SQRT de-correlated, PCA de-correlated]
Removal of linear correlations by rotating variables using the square-root of the correlation matrix
using the Principal Component Analysis
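As an illustration of the square-root de-correlation step (a hypothetical sketch for two variables, not TMVA's actual DataSet code), an event can be rotated with the inverse square root of the covariance matrix, so that the transformed variables have unit covariance; names such as `sqrtSym2x2` and `decorrelate` are made up here:

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { double x, y; };

// Analytic square root of a symmetric positive-definite 2x2 matrix
// C = [[a, b], [b, c]]:  sqrt(C) = (C + sqrt(det(C)) * I) / t,
// with t = sqrt(a + c + 2*sqrt(det(C))).
void sqrtSym2x2(double a, double b, double c,
                double& ra, double& rb, double& rc) {
  const double s = std::sqrt(a * c - b * b);
  const double t = std::sqrt(a + c + 2.0 * s);
  ra = (a + s) / t; rb = b / t; rc = (c + s) / t;
}

// Transform an event: x' = sqrt(C)^{-1} * x  (removes linear correlations).
Vec2 decorrelate(const Vec2& v, double a, double b, double c) {
  double ra, rb, rc;
  sqrtSym2x2(a, b, c, ra, rb, rc);
  const double det = ra * rc - rb * rb;  // invert the 2x2 square root
  return { ( rc * v.x - rb * v.y) / det,
           (-rb * v.x + ra * v.y) / det };
}
```

This exactness holds only for linear correlations, which is precisely the caveat stated above.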
Simplest method: cut in a rectangular volume, i.e.
$x_{\mathrm{cut}}(i_{\mathrm{event}}) \in \{0,1\}$, with $x_{\mathrm{cut}} = 1$ if $x_v(i_{\mathrm{event}}) \in [x_{v,\mathrm{min}}, x_{v,\mathrm{max}}]$ for all variables $v$
scan the signal efficiency over [0,1] and maximise the background rejection
from this scan, the optimal working point in terms of S and B numbers can be derived
Technical problem: how to perform the optimisation. TMVA uses: random sampling, Simulated Annealing or a Genetic Algorithm
speed improvement in volume search:
training events are sorted in Binary Search Trees
Cut Optimisation
do this in normal variable space or de-correlated variable space
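A minimal sketch of the cut classifier described above (illustrative only, not TMVA code; the function names are invented): an event is accepted only if every variable lies inside its [min, max] window, and scanning the acceptance fraction for signal vs. background samples yields the efficiency/rejection curve.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// x_cut = 1 only if every variable lies inside its [lo, hi] window.
bool passesCuts(const std::vector<double>& x,
                const std::vector<double>& lo,
                const std::vector<double>& hi) {
  for (size_t v = 0; v < x.size(); ++v)
    if (x[v] < lo[v] || x[v] > hi[v]) return false;
  return true;
}

// Fraction of events accepted by a cut volume.  Evaluating this on signal
// and background samples for many candidate volumes gives the scan of
// background rejection versus signal efficiency.
double efficiency(const std::vector<std::vector<double>>& events,
                  const std::vector<double>& lo,
                  const std::vector<double>& hi) {
  if (events.empty()) return 0.0;
  int pass = 0;
  for (const auto& e : events)
    if (passesCuts(e, lo, hi)) ++pass;
  return static_cast<double>(pass) / events.size();
}
```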
Combine the probabilities from the different variables for an event to be signal- or background-like:

$y_{\mathcal{L}}(i_{\mathrm{event}}) = \dfrac{P_S(i_{\mathrm{event}})}{P_S(i_{\mathrm{event}}) + P_B(i_{\mathrm{event}})}$, with $P_S(i_{\mathrm{event}}) = \prod_{v=1}^{n_{\mathrm{var}}} p_v^{S}\big(x_v(i_{\mathrm{event}})\big)$

discriminating variables $x_v$
species: signal and background
likelihood ratio $y_{\mathcal{L}}$ for event $i_{\mathrm{event}}$
PDFs $p_v^{S}$, $p_v^{B}$
Projected Likelihood Estimator (PDE Approach)

Technical problem: how to implement the reference PDFs. Three ways:
counting: automatic and unbiased, but suboptimal
function fitting: difficult to automate
parametric fitting (splines, kernel estimators): easy to automate, but can create artefacts. TMVA uses: splines of order 0-5, kernel estimators

Optimal if there are no correlations and the PDFs are correct (known)
usually this is not true, hence the development of the other methods
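The likelihood ratio above can be sketched in a few lines (a simplified, hypothetical helper; TMVA's real implementation evaluates binned reference PDFs with spline or kernel smoothing). Given the per-variable PDF values already looked up for one event, the response is the product over variables:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// y_L = P_S / (P_S + P_B), where P_S (P_B) is the product of the
// per-variable signal (background) PDF values for this event.
double likelihoodRatio(const std::vector<double>& pdfS,   // p_v^S(x_v)
                       const std::vector<double>& pdfB) { // p_v^B(x_v)
  double pS = 1.0, pB = 1.0;
  for (size_t v = 0; v < pdfS.size(); ++v) {
    pS *= pdfS[v];
    pB *= pdfB[v];
  }
  return pS / (pS + pB);
}
```

An event whose variables all prefer the signal PDFs gets a response close to 1, and close to 0 in the background-like case.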
Carli-Koblitz, NIM A501, 576 (2003)
Generalisation of 1D PDE approach to Nvar dimensions
Optimal method – in theory – if “true N-dim PDF” were known
Practical challenges: derive N-dim PDF from training sample
TMVA implementation: Range search PDERS
count the number of signal and background events in the "vicinity" of a data event, using a fixed-size or adaptive volume (the latter = kNN-type classifiers)
Multidimensional Likelihood Estimator
[Figure: signal (S) and background (B) events in the (x1, x2) plane, with a test event and its search volume]
speed up range search by sorting training events in Binary Trees
use multi-D kernels (Gaussian, triangular, …) to weight events within a volume
volumes can be rectangular or spherical
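The counting step of the range search can be sketched as follows (an illustrative toy with invented names, not the PDERS implementation: the production code sorts the training events into binary search trees, while a linear scan is enough to show the idea):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Event { double x1, x2; bool isSignal; };

// Count signal and background training events inside a rectangular volume
// of half-width r around the test point; classify by the signal fraction.
double pdersResponse(const std::vector<Event>& training,
                     double x1, double x2, double r) {
  int nS = 0, nB = 0;
  for (const auto& e : training)
    if (std::fabs(e.x1 - x1) < r && std::fabs(e.x2 - x2) < r)
      (e.isSignal ? nS : nB)++;
  return (nS + nB) ? static_cast<double>(nS) / (nS + nB) : 0.5;
}
```

Replacing the flat counting by a kernel weight (Gaussian, triangular, ...) for each event in the volume gives the weighted variant mentioned above.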
Well-known, simple and elegant classifier:
determine linear variable transformation where:
linear correlations are removed
mean values of signal and background are “pushed” as far apart as possible
the computation of the Fisher response is very simple: a linear combination of the event variables with the Fisher coefficients
$y_{\mathrm{Fisher}}(i_{\mathrm{event}}) = \sum_{v=1}^{n_{\mathrm{var}}} F_v\, x_v(i_{\mathrm{event}})$, with the "Fisher coefficients" $F_v$
Fisher Discriminant (and H-Matrix)
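Evaluating the Fisher response for one event is then a single dot product (a sketch with an invented function name; the coefficients themselves come from the class covariance matrices determined during training, which is not shown here):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Fisher response: linear combination of the event variables with the
// pre-computed Fisher coefficients F_v.
double fisherResponse(const std::vector<double>& x,
                      const std::vector<double>& F) {
  double y = 0.0;
  for (size_t v = 0; v < x.size(); ++v) y += F[v] * x[v];
  return y;
}
```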
Feed-Forward Multilayer Perceptron

[Figure: network with 1 input layer, k hidden layers and 1 output layer; the $N_{\mathrm{var}}$ discriminating input variables feed the input layer, the output layer has 2 output classes (signal and background), and the layers are connected by weights $w_{ij}$]

$x_j^{(k)} = A\!\left( w_{0j}^{(k)} + \sum_{i=1}^{M_{k-1}} w_{ij}^{(k)}\, x_i^{(k-1)} \right)$, with inputs $x_i^{(0)}$, $i = 1 \ldots N_{\mathrm{var}}$, and outputs $x_{1,2}^{(k+1)}$

"Activation" function, e.g. the sigmoid $A(x) = \dfrac{1}{1 + e^{-x}}$
Get a non-linear classifier response by feeding linear combinations of the input variables to nodes with a non-linear activation function
Artificial Neural Network (ANN)
Training: adjust the weights using known events such that signal and background are best separated
Nodes (or neurons) are arranged in series
Feed-Forward Multilayer Perceptrons (3 different implementations in TMVA)
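The forward step of a single node can be sketched directly from the formula above (an illustrative toy with invented names, not TMVA's MLP class): a weighted sum of the previous layer's outputs, pushed through the sigmoid activation.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// "Activation" function: the sigmoid A(x) = 1 / (1 + e^{-x}).
double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// One node: x_j = A( w_0 + sum_i w_i * x_i ), with w[0] the bias weight
// and w[i+1] the weight of input i.
double nodeOutput(const std::vector<double>& in,
                  const std::vector<double>& w) {
  double s = w[0];
  for (size_t i = 0; i < in.size(); ++i) s += w[i + 1] * in[i];
  return sigmoid(s);
}
```

Chaining such nodes layer by layer gives the full network response; training then amounts to adjusting the weights w.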
Decision Trees
sequential application of "cuts" which split the data into nodes; the final nodes (leaves) classify an event as signal or background
Training: growing a decision tree
Start with Root node
Split training sample according to cut on best variable at this node
Splitting criterion: e.g., maximum “Gini-index”: purity (1– purity)
Continue splitting until min. number of events or max. purity reached
Bottom up Pruning:
remove statistically insignificant nodes to avoid overtraining
Classify leaf nodes according to the majority of events, or give a weight; unknown test events are classified accordingly
[Figures: decision tree before and after pruning]
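The Gini-index splitting criterion mentioned above can be made concrete (a minimal sketch with invented function names, not TMVA's tree-growing code): the Gini index of a node is purity × (1 − purity), and a split is chosen to maximise the weighted decrease of that index.

```cpp
#include <cassert>
#include <cmath>

// Gini index of a node with the given signal purity.
double gini(double purity) { return purity * (1.0 - purity); }

// Decrease of the event-weighted Gini index when a parent node is split
// into a left daughter (nSL signal / nBL background events) and a right
// daughter (nSR / nBR); larger gain means a better split.
double giniGain(double nSL, double nBL, double nSR, double nBR) {
  const double nL = nSL + nBL, nR = nSR + nBR, n = nL + nR;
  const double parent = gini((nSL + nSR) / n);
  return parent - (nL / n) * gini(nSL / nL) - (nR / n) * gini(nSR / nR);
}
```

A perfect split of a mixed node (pure daughters) has the maximal gain; a split that leaves both daughters as mixed as the parent gains nothing.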
Decision Trees: well known for a long time, but hardly used in HEP (although very similar to "simple cuts")
Disadvantage: instability: small changes in the training sample can give large changes in the tree structure
Boosted Decision Trees
Boosted Decision Trees (1996): combine several decision trees: forest
classifier output is the (weighted) majority vote of individual trees
trees derived from same training sample with different event weights
e.g. AdaBoost: wrong classified training events are given a larger weight
bagging (re-sampling with replacement) uses random event weights
Remark: bagging/boosting create a basis of classifiers
final classifier is a linear combination of base classifiers
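One AdaBoost step can be sketched as follows (a simplified toy with an invented function name, not TMVA's boosting code): misclassified events get their weight multiplied by α = (1 − err)/err, the weights are renormalised, and the tree's vote in the forest is weighted by ln α.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One AdaBoost iteration on the event weights w.  Returns the weight of
// this tree's vote, ln(alpha), for the weighted majority vote.
double boostWeights(std::vector<double>& w,
                    const std::vector<bool>& misclassified) {
  double err = 0.0, sum = 0.0;
  for (size_t i = 0; i < w.size(); ++i) {
    sum += w[i];
    if (misclassified[i]) err += w[i];
  }
  err /= sum;                                 // weighted error fraction
  const double alpha = (1.0 - err) / err;
  double newSum = 0.0;
  for (size_t i = 0; i < w.size(); ++i) {
    if (misclassified[i]) w[i] *= alpha;      // boost the hard events
    newSum += w[i];
  }
  for (double& wi : w) wi /= newSum;          // renormalise to unit sum
  return std::log(alpha);
}
```

Repeating this for every tree in the forest makes later trees concentrate on the events the earlier trees got wrong.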
Following RuleFit from Friedman-Popescu:
The classifier is a linear combination of simple base classifiers, called rules, which here are sequences of cuts:
The procedure is:
1. create the rule ensemble from a set of decision trees
2. fit the coefficients by "gradient directed regularization" (Friedman et al.)
Rule Fitting (Predictive Learning via Rule Ensembles)
$y_{\mathrm{RF}}(\mathbf{x}) = a_0 + \sum_{m=1}^{M_R} a_m\, r_m(\hat{\mathbf{x}}) + \sum_{k=1}^{n_{\mathrm{var}}} b_k\, \hat{x}_k$

rules $r_m$ (cut sequences: $r_m = 1$ if all cuts are satisfied, $= 0$ otherwise)
normalised discriminating event variables $\hat{x}_k$
RuleFit classifier = sum of rules + linear Fisher term
Friedman-Popescu, Tech Rep, Stat. Dpt, Stanford U., 2003
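Evaluating a rule and the full RuleFit response can be sketched as follows (an illustrative toy with invented names and types, not the RuleFit implementation): a rule is 1 only if all of its cuts are satisfied, and the response adds the weighted rules to a linear term.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Cut { int var; double lo, hi; };   // one cut on variable `var`

// r_m = 1 if all cuts of the rule are satisfied, 0 otherwise.
int ruleValue(const std::vector<double>& x, const std::vector<Cut>& rule) {
  for (const Cut& c : rule)
    if (x[c.var] < c.lo || x[c.var] > c.hi) return 0;
  return 1;
}

// y_RF = a0 + sum_m a_m * r_m(x) + sum_k b_k * x_k
double ruleFitResponse(const std::vector<double>& x,
                       double a0,
                       const std::vector<std::vector<Cut>>& rules,
                       const std::vector<double>& a,   // rule coefficients
                       const std::vector<double>& b) { // linear coefficients
  double y = a0;
  for (size_t m = 0; m < rules.size(); ++m) y += a[m] * ruleValue(x, rules[m]);
  for (size_t k = 0; k < x.size(); ++k)     y += b[k] * x[k];
  return y;
}
```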
Support Vector Machines
[Figure: separating hyperplane with maximal margin in the (x1, x2) plane]
Find hyperplane that best separates signal from background
best separation: maximum distance from the closest events (the support vectors) to the hyperplane
linear decision boundary
Non-linear cases:
transform the variables into a higher-dimensional feature space where a linear boundary (hyperplane) can separate the data
the transformation is done implicitly using kernel functions, which effectively introduce a metric for the distance measure that "mimics" the transformation
choose a kernel and fit the hyperplane
[Figure: data not separable in (x1) or (x1, x2) become linearly separable after mapping into the higher-dimensional space (x1, x2, x3)]
Available kernels: Gaussian, polynomial, sigmoid
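The Gaussian kernel, for example, can be written in a few lines (a sketch with an invented function name): it plays the role of an inner product in the implicit feature space, so the hyperplane is never constructed there explicitly.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Gaussian (RBF) kernel: K(x, y) = exp(-|x - y|^2 / (2 sigma^2)).
double gaussianKernel(const std::vector<double>& x,
                      const std::vector<double>& y, double sigma) {
  double d2 = 0.0;
  for (size_t i = 0; i < x.size(); ++i)
    d2 += (x[i] - y[i]) * (x[i] - y[i]);
  return std::exp(-d2 / (2.0 * sigma * sigma));
}
```

K is 1 for identical points and decays with their distance, so it acts as the similarity measure that "mimics" the feature-space transformation.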
A Complete Example Analysis

void TMVAnalysis()
{
   TFile* outputFile = TFile::Open( "TMVA.root", "RECREATE" );

   TMVA::Factory *factory = new TMVA::Factory( "MVAnalysis", outputFile, "!V" );

   TFile *input = TFile::Open("tmva_example.root");
   TTree *signal     = (TTree*)input->Get("TreeS");
   TTree *background = (TTree*)input->Get("TreeB");
   factory->AddSignalTree    ( signal,     1. );
   factory->AddBackgroundTree( background, 1. );

   factory->AddVariable("var1+var2", 'F');
   factory->AddVariable("var1-var2", 'F');
   factory->AddVariable("var3",      'F');
   factory->AddVariable("var4",      'F');

   factory->PrepareTrainingAndTestTree( "",
      "NSigTrain=3000:NBkgTrain=3000:SplitMode=Random:!V" );

   factory->BookMethod( TMVA::Types::kLikelihood, "Likelihood",
      "!V:!TransformOutput:Spline=2:NSmooth=5:NAvEvtPerBin=50" );
   factory->BookMethod( TMVA::Types::kMLP, "MLP",
      "!V:NCycles=200:HiddenLayers=N+1,N:TestRate=5" );

   factory->TrainAllMethods();
   factory->TestAllMethods();
   factory->EvaluateAllMethods();

   outputFile->Close();
   delete factory;
}
create the Factory
give it the training/test trees
tell it which variables to use (the example uses expressions not directly available in the tree, e.g. "var1+var2")
select the MVA methods
train, test and evaluate
Example Application

void TMVApplication()
{
   TMVA::Reader *reader = new TMVA::Reader("!Color");

   Float_t var1, var2, var3, var4;
   reader->AddVariable( "var1+var2", &var1 );
   reader->AddVariable( "var1-var2", &var2 );
   reader->AddVariable( "var3",      &var3 );
   reader->AddVariable( "var4",      &var4 );

   reader->BookMVA( "MLP method", "weights/MVAnalysis_MLP.weights.txt" );

   TFile *input = TFile::Open("tmva_example.root");
   TTree* theTree = (TTree*)input->Get("TreeS");

   Float_t userVar1, userVar2;
   theTree->SetBranchAddress( "var1", &userVar1 );
   theTree->SetBranchAddress( "var2", &userVar2 );
   theTree->SetBranchAddress( "var3", &var3 );
   theTree->SetBranchAddress( "var4", &var4 );

   for (Long64_t ievt=3000; ievt<theTree->GetEntries(); ievt++) {
      theTree->GetEntry(ievt);
      var1 = userVar1 + userVar2;
      var2 = userVar1 - userVar2;
      cout << reader->EvaluateMVA( "MLP method" ) << endl;
   }

   delete reader;
}
create the Reader
tell it about the variables
book the selected MVA method
set the tree variables (the example uses expressions not directly available in the tree)
event loop
calculate the MVA response
Use a data set with 4 linearly correlated, Gaussian-distributed variables:
---------------------------------------
Rank : Variable : Separation
---------------------------------------
   1 : var3     : 3.834e+02
   2 : var2     : 3.062e+02
   3 : var1     : 1.097e+02
   4 : var0     : 5.818e+01
---------------------------------------
A purely academic Toy example
Validating the Classifier Training
average no. of nodes before/after pruning: 4193 / 968
Validating the classifiers
TMVA GUI: projective likelihood PDFs, MLP training, BDTs, ...
TMVA output distributions:
Classifier Output
[Figure captions: "due to correlations" and "correlations removed"]
Likelihood PDERS Fisher
Neural Network Boosted Decision Trees Rule Fitting
TMVA output distributions for Fisher, Likelihood, BDT and MLP…
Evaluation Output
For this case, the Fisher discriminant provides the theoretically 'best' possible method
Same as the de-correlated Likelihood
Cuts and Likelihood without de-correlation are inferior
Note: About All Realistic Use Cases are Much More Difficult Than This One
Evaluation Output (taken from TMVA printout)
Evaluation results ranked by best signal efficiency and purity (area)
------------------------------------------------------------------------------
MVA          Signal efficiency at bkg eff. (error):      | Sepa-    Signifi-
Methods:       @B=0.01    @B=0.10    @B=0.30    Area     | ration:  cance:
------------------------------------------------------------------------------
Fisher       : 0.268(03)  0.653(03)  0.873(02)  0.882    | 0.444    1.189
MLP          : 0.266(03)  0.656(03)  0.873(02)  0.882    | 0.444    1.260
LikelihoodD  : 0.259(03)  0.649(03)  0.871(02)  0.880    | 0.441    1.251
PDERS        : 0.223(03)  0.628(03)  0.861(02)  0.870    | 0.417    1.192
RuleFit      : 0.196(03)  0.607(03)  0.845(02)  0.859    | 0.390    1.092
HMatrix      : 0.058(01)  0.622(03)  0.868(02)  0.855    | 0.410    1.093
BDT          : 0.154(02)  0.594(04)  0.838(03)  0.852    | 0.380    1.099
CutsGA       : 0.109(02)  1.000(00)  0.717(03)  0.784    | 0.000    0.000
Likelihood   : 0.086(02)  0.387(03)  0.677(03)  0.757    | 0.199    0.682
------------------------------------------------------------------------------
Testing efficiency compared to training efficiency (overtraining check)
------------------------------------------------------------------------------
MVA          Signal efficiency: from test sample (from training sample)
Methods:       @B=0.01         @B=0.10         @B=0.30
------------------------------------------------------------------------------
Fisher       : 0.268 (0.275)   0.653 (0.658)   0.873 (0.873)
MLP          : 0.266 (0.278)   0.656 (0.658)   0.873 (0.873)
LikelihoodD  : 0.259 (0.273)   0.649 (0.657)   0.871 (0.872)
PDERS        : 0.223 (0.389)   0.628 (0.691)   0.861 (0.881)
RuleFit      : 0.196 (0.198)   0.607 (0.616)   0.845 (0.848)
HMatrix      : 0.058 (0.060)   0.622 (0.623)   0.868 (0.868)
BDT          : 0.154 (0.268)   0.594 (0.736)   0.838 (0.911)
CutsGA       : 0.109 (0.123)   1.000 (0.424)   0.717 (0.715)
Likelihood   : 0.086 (0.092)   0.387 (0.379)   0.677 (0.677)
------------------------------------------------------------------------------
(better classifiers appear higher in the list)
Check for over-training
More Toys: Linear-, Cross-, Circular Correlations
Illustrate the behaviour of linear and nonlinear classifiers
Circular correlations (same for signal and background)
Weight Variables by Classifier Performance
Example: how do the classifiers deal with the correlation patterns?
Illustration: events weighted by MVA response:
[Figure: events weighted by classifier response.
Linear classifiers: Likelihood, de-correlated Likelihood, Fisher.
Non-linear classifiers: PDERS, Decision Trees]
Circular example
Final Classifier Performance
Background rejection versus signal efficiency curve:
More Toys: “Schachbrett” (chess board)
Performance achieved without parameter adjustments:
PDERS and BDT are best “out of the box”
After some parameter tuning, SVM and ANN (MLP) also perform well
Theoretical maximum
Event Distribution
Events weighted by SVM response
We (finally) have a Users Guide!
Available from tmva.sf.net
TMVA Users Guide: 78 pp., incl. code examples
arXiv: physics/0703039
TMVA unifies highly customizable and well-performing multivariate classification algorithms in a single user-friendly framework
Summary
This ensures objective classifier comparisons and simplifies their use
TMVA is available from tmva.sf.net and in ROOT (>5.11/03)
A typical TMVA analysis requires user interaction with a Factory (for classifier training) and a Reader (for classifier application)
a set of ROOT macros displays the evaluation results
We will continue to improve flexibility and add new classifiers
Bayesian Classifiers
“Committee Method” combination of different MVA techniques
C-code output for trained classifiers (for selected methods…)
More Toys: Linear-, Cross-, Circular Correlations
Illustrate the behaviour of linear and nonlinear classifiers
Linear correlations (same for signal and background)
Linear correlations (opposite for signal and background)
Circular correlations (same for signal and background)
Weight Variables by Classifier Performance
Linear correlations (same for signal and background)
Linear correlations (opposite for signal and background)
Circular correlations (same for signal and background)
How well do the classifiers resolve the various correlation patterns?
Illustration: events weighted by MVA response:
Final Classifier Performance
Background rejection versus signal efficiency curve:
Linear example | Cross example | Circular example
Stability with Respect to Irrelevant Variables
Toy example with 2 discriminating and 4 non-discriminating variables
use only the two discriminating variables in the classifiers
use all variables in the classifiers
Using TMVA in Training and Application
Can be ROOT scripts, C++ executables or python scripts (via PyROOT), or any other high-level language that interfaces with ROOT
Rectangular cuts? A linear boundary? A non-linear one?
[Figure: three choices of decision boundary between signal (S) and background (B) in the (x1, x2) plane: rectangular cuts, a linear boundary, a non-linear boundary]
Introduction: Event Classification
Different techniques use different ways to exploit (all) features
compare and choose
How to place the decision boundary?
Let the machine learn it from training events