
Higgs Boson Machine Learning Challenge

Group Project - CS4622

Team Members:

100112V - Edirisinghe E.A.S.D
100132G - Fernando W.V.D.
100440A - Ranasinghe R.H.T.D.
100498G - Senaratne H. H.
100559V - Vithana Y. G. K.
100577A - Weerasinghe L.A.

Table of Contents

1. Introduction

2. Background

3. Approach Followed

3.1 Preprocessing

3.1.1 Understanding the nature of the given variables

3.1.2 Handling missing values

3.1.3 Converting Data Types

3.1.4 Data Normalization

3.1.5 Feature Selection and Deriving Features

3.2 Training Techniques

3.2.1 Random Forest Classifier

3.2.2 Gradient Boost Classifier

3.2.3 Neural Networks

3.2.4 XGBoost Classifier

4. Results and Discussion

5. References

6. Appendix

1. Introduction

This report describes the procedure our team followed to solve the “Higgs Boson Machine Learning Challenge” hosted on the Kaggle site.

The initial part of this report provides some background on the problem, which is closely related to particle physics. We then describe how we modeled and preprocessed the data, which machine learning techniques and procedures we used to solve the problem, and what results we obtained with each approach. Finally, we analyse and discuss the methods we followed and the outputs we obtained.

2. Background

The discovery of the Higgs boson, an elementary particle of particle physics, was recently claimed by the ATLAS and CMS experiments. The discovery was acknowledged by the 2013 Nobel Prize in Physics, awarded to François Englert and Peter Higgs. The related experiments run at the Large Hadron Collider (LHC) at CERN (the European Organization for Nuclear Research) in Geneva, Switzerland, which began operating in 2009 after about 20 years of design and construction.

The Higgs boson decays through several processes into other particles. In physics, a channel is the term used to denote the decay of a particle into other specific particles. The ATLAS experiment recently reported the first evidence of the Higgs boson decaying to the tau tau channel. The observed signal of this decay is small and buried in background noise.

The goal of the Higgs Boson Machine Learning Challenge is to explore the potential of advanced machine learning methods to improve the discovery significance of the experiment by classifying each event into the correct region out of ‘signal’ and ‘background’, that is, deciding whether the results of a given event are due to the tau tau decay of a Higgs boson (signal) or due to other background noise (background).

The training set consists of several primary and derived attributes related to this event classification, along with signal/background labels and weights. The weights are related to the normalizations of signals and backgrounds. The test set contains the same variables as the training set, but without the labels and weights. The required solution file must contain the fields EventID (a unique identifier for each event), RankOrder (a permutation of the integers from 1 to the test set size) and Class (either “b” or “s”). Higher ranks indicate more signal-like events and lower ranks indicate more background-like events. Since the rank can be calculated from the weight values, the objective is to find a function of the weights, or in simple terms, to predict the weights for the test set after training a machine learning model on the training set. Based on the predicted weight it is possible to predict the event’s class, because two different ranges of weights fall into the two different classes.
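To make the required output format concrete, the following is a minimal sketch (our illustration, not code from the competition) that turns a vector of predicted signal-likeness scores into the EventID/RankOrder/Class file described above; the names and the placeholder threshold are assumptions.

```python
import numpy as np
import pandas as pd

def build_submission(event_ids, scores, threshold=0.5):
    """Build a submission frame from per-event signal scores.

    Higher scores mean more signal-like; `threshold` is a placeholder
    assumption (in practice it must be tuned, see section 3.2.4).
    """
    order = np.argsort(scores)               # ascending: most background-like first
    rank_order = np.empty(len(scores), dtype=int)
    rank_order[order] = np.arange(1, len(scores) + 1)  # rank 1 = most background-like
    classes = np.where(np.asarray(scores) >= threshold, "s", "b")
    return pd.DataFrame({"EventID": event_ids,
                         "RankOrder": rank_order,
                         "Class": classes})
```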

Figure 1: Graphical representation of a Higgs boson decaying to two tau particles in the ATLAS detector

3. Approach Followed

In this section we discuss how we preprocessed the training data before feeding it to a machine learning model, and which machine learning techniques we used for training.

3.1 Preprocessing

Data preprocessing plays an important part in any machine learning challenge. In the Higgs Boson Machine Learning Challenge we used several data preprocessing methods, which are described in this section.

3.1.1 Understanding the nature of the given variables

Before starting the preprocessing work, we tried to identify any directly visible relationships between the classification and the variables. To do so, we represented the data graphically in ways that would show any information directly associated with the classification. The following figures (Figure 2 to Figure 5) show how the classification is distributed with respect to the ranges of values of a few of the variables.

Figure 2: Classification relative to the distribution of the variable Der_lep_eta_centrality

Figure 3: Classification relative to the distribution of the variable Weight

Figure 4: Classification relative to the distribution of the variable DER_mass_MMC

Figure 5: Classification relative to the distribution of the variable PRI_lep_eta

Through these visualizations we found that no variable is directly associated with the classification except for the weight. From this fact we learned that if we predict the weight for the given test scenarios, we can do both the classification and the ranking at the same time.

3.1.2 Handling missing values

In the data given for the competition, missing values are stored as -999. Exploring the data, we discovered that it contains a lot of missing values.

Figure 6: Variable statistics

As Figure 6 shows, many columns, such as DER_deltaeta_jet_jet and DER_massdelta_jet_jet, have -999 for more than half of their values (more than the median). It was clear that dropping the training instances containing missing values was not an option, because we also need predictions for the test entries, which contain the same missing values. So as a first approach, we tried dropping the variables in which missing values are present. This did not improve the results, because of the large number of missing values: after dropping those variables there was not enough data left to predict from, and important relationships and variables disappeared for the sake of handling missing values. It is therefore not a good approach for handling missing values.

The next approach we considered was traditional imputation, but the results were not good either. Here we substituted each missing value with the average of the corresponding variable column, ignoring the missing values in the calculation. The main reason for the lack of improvement is that the missing values are “actually” missing: a value for that feature cannot exist in that particular training instance. So the best way to handle the missing values is to interpret -999 as a special missing value and use algorithms that treat -999 as a special category.
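For illustration, a minimal pandas sketch of the mean-imputation variant we tried (our own reconstruction; the file name and the DataFrame df are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("training.csv")   # assumed file name

# Mean imputation: compute per-column means while ignoring the -999
# sentinel, then substitute those means for the sentinel values.
feature_cols = [c for c in df.columns if c.startswith(("DER_", "PRI_"))]
masked = df[feature_cols].replace(-999.0, np.nan)
df[feature_cols] = masked.fillna(masked.mean())
```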

3.1.3 Converting Data Types

In order to apply the XGBoost and gradient boosting techniques, the value of the label must be numeric. So during preprocessing we converted the Label column to 0/1: 0 if the label equals “b” and 1 if it equals “s”.
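A one-line sketch of this mapping, continuing with the df from the previous sketch:

```python
df["Label"] = (df["Label"] == "s").astype(int)   # "s" -> 1, "b" -> 0
```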

3.1.4 Data Normalization

As Figure 6 shows, the distribution of data values varies widely across columns. For example, DER_pt_h ranges from 0 to 2835, while DER_met_phi_centrality ranges only from -1.4 to +1.4. To guarantee stable convergence of the weights and biases in our model, we had to normalize all the columns. In this competition we used min-max normalization, where each value in a column is mapped to a value between 0 and 1.
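A minimal sketch of min-max normalization over the feature columns (continuing the previous sketches; note that in practice the minima and maxima must be computed on the training set and reused on the test set, a detail we gloss over here):

```python
# Min-max normalization: map each feature column to [0, 1].
mins = df[feature_cols].min()
maxs = df[feature_cols].max()
df[feature_cols] = (df[feature_cols] - mins) / (maxs - mins)
```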

3.1.5 Feature Selection and Deriving Features

Figure 7 represents the correlation between the label and the other features in the data set.

Figure 7: The correlation between the label and other features in the data set

As Figure 7 shows, some variables, such as PRI_tau_eta, can be dropped when building the model, since they are insignificant to the Label value. The other thing to notice in the diagram is that no single variable can be considered significant to the Label value that we need to predict. So deriving new features was required.

We identified four features[1] which could be important to our model:

assymenj = (MET - MHT) / (MHT + MET)

dijet = sum of the two jet masses

deltaphi = jet1_phi - jet2_phi

deltaphimet = (jet1_phi + jet2_phi) / 2

The feature dijet is already included as a variable in the data sets, as DER_mass_jet_jet. We derived the other three variables from the available data as follows. Since MHT (the missing energy calculated from the jets) was not readily available, we used a derived variable (estimatedMHT) which is proportional to this quantity.

estimatedMHT = PRI_jet_all_pt - PRI_jet_leading_pt - PRI_jet_subleading_pt

assymenj = (PRI_met - estimatedMHT) / (PRI_met + estimatedMHT)

deltaphi = PRI_jet_leading_phi - PRI_jet_subleading_phi

deltaphimet = (PRI_jet_leading_phi + PRI_jet_subleading_phi) / 2

Using a greedy approach, we also identified a variable that had a 0.2 correlation with the label:

Special = DER_mass_MMC × DER_pt_ratio_lep_tau / (DER_sum_pt + 0.0000001)
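A sketch of deriving these columns with pandas (our reconstruction from the formulas above; the new column names are our own):

```python
# Derived features, computed directly from the primitive columns.
df["estimatedMHT"] = (df["PRI_jet_all_pt"]
                      - df["PRI_jet_leading_pt"]
                      - df["PRI_jet_subleading_pt"])
df["assymenj"] = ((df["PRI_met"] - df["estimatedMHT"])
                  / (df["PRI_met"] + df["estimatedMHT"]))
df["deltaphi"] = df["PRI_jet_leading_phi"] - df["PRI_jet_subleading_phi"]
df["deltaphimet"] = (df["PRI_jet_leading_phi"]
                     + df["PRI_jet_subleading_phi"]) / 2
df["Special"] = (df["DER_mass_MMC"] * df["DER_pt_ratio_lep_tau"]
                 / (df["DER_sum_pt"] + 1e-7))
```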

These newly added columns improved our public leaderboard score for submissions using the XGBoost algorithm. The initial version of these new variables simply calculated the values from the relevant data columns without checking whether any of the inputs were -999; in that case, the variable-creation algorithm simply took -999 as a valid value for the respective field and calculated the result. However, -999 is not a valid value, only an indicator that the value is not available.

With the above in mind, we decided to filter out the entries with invalid inputs. We changed the variable-creation algorithm to output -999 whenever at least one of the inputs had an invalid value. Unfortunately, the results were not as expected. From an analysis of the change and its results, we concluded that the success rate decreased due to the elimination of diversity. To clarify this, consider the following example entries.

EventID | Value 1 | Value 2 | New variable neglecting -999 | New variable considering -999
--------|---------|---------|------------------------------|------------------------------
1       | -999    | 2.14    | -996.86                      | -999
15      | 1.52    | -999    | 1000.52                      | -999
122     | -999    | -999    | 0                            | -999

For the three entries above, the variable that ignores the invalid inputs takes three different values, and its value range is directly associated with the combination of invalid inputs. In the variable created with consideration of the invalid inputs, all three entries share the same value, -999. This clearly shows why the success rate decreased with the new variable: it eliminated the variability of the previous variable and thereby hid information that is very important for classification.
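The difference between the two variants can be sketched as follows, on a hypothetical pair of input columns v1 and v2 (the combining function here is illustrative, not one of our actual formulas):

```python
import numpy as np

v1 = df["v1"].to_numpy()   # hypothetical input columns
v2 = df["v2"].to_numpy()

# Variant 1: neglect the sentinel. -999 leaks into the arithmetic and
# (accidentally) encodes which inputs were missing.
neglecting = v1 + v2

# Variant 2: propagate the sentinel. Any missing input yields -999,
# collapsing all missing-input combinations to a single value.
considering = np.where((v1 == -999) | (v2 == -999), -999, v1 + v2)
```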

The new variables, before the invalid inputs were considered, seem to have introduced new measurements of the relationships among the variables taken as groups. Encouraged by this, we explored the possibility of creating more derived variables to impose measurements of the collective relationships among the primitive variables for a specific result.

We came up with a few more columns by randomly combining primitive variables, with the intention of improving the success rate by introducing new variables carrying combined information from other variables. But these reduced the success rate. We concluded that introducing variables with a known relationship to the classification may improve the success rate, while others may decrease it by introducing unimportant relationships.

3.2 Training Techniques

3.2.1 Random Forest Classifier

We used the random forest classifier for the Higgs boson challenge in the earlier stages, developing the solution with the scikit-learn package for Python. The random forest classifier is part of the sklearn.ensemble package of scikit-learn.

The basic functionality of a random forest is as follows[2]. Instead of building a single classification tree, it builds a number of them. When a new input needs to be classified, it is given to all the trees and each returns an answer. The final answer is then obtained by a voting mechanism: the result from each tree counts as a vote, and the class with the most votes is selected.

When building the trees in a random forest, some guidelines are followed. First, if there are N cases in the training set, N cases are sampled (with replacement) to train each tree. Second, at each node, m variables are selected at random from the total of M input variables. Third, the trees are grown without any pruning.

One major feature of the random forest classifier is that it runs efficiently on large data sets and can handle a large number of input variables. It can handle missing values effectively and maintains accuracy when a large proportion of the data is missing. Furthermore, it can identify which variables are most important and the relationships between variables. It also does not overfit the inputs.

When training the trees in a random forest, about one third of the data is left out of each tree's bootstrap sample; this "out-of-bag" data is used to obtain running unbiased error estimates and to measure the importance of variables. The rest of the data is used as the bootstrap sample to train the tree. Each tree's out-of-bag data is put back through that tree to get a classification, and the class that receives the most out-of-bag votes is taken; the resulting error is used as the error estimate for the random forest classifier.

Measuring the importance of variables is another useful feature of random forest classification. It is done by running the out-of-bag data through each tree in the forest and counting the votes for the correct class. The values of the variable being checked are then altered, the data is run through the trees again, and the correct-class votes are counted once more. Subtracting the altered-input votes from the original votes and averaging the difference over the forest gives an importance score for the variable. If the number of variables in the data set is very high, the forest can be run once with all the variables and then again with only the most important ones.

Proximities are another important feature of the random forest classifier. They are formed by building an N x N matrix over all the data, including the training and out-of-bag data. Since an N x N matrix is not feasible for large data sets, an N x T matrix is formed instead, where T is the number of trees in the forest.

The random forest classifier has two methods for filling missing values in the data set. The faster way is to fill the missing values with the median. The more accurate way is to start by filling the missing values with rough estimates and then run the forest to compute the proximities, which are used to refine the estimates.

Outliers are identified by the random forest method using the proximity values: entries whose proximities to the rest of their class are small are flagged as outliers.

The random forest classifier in the scikit-learn package has several parameters that can be used to tune the results[3]. n_estimators specifies the number of trees in the forest. max_depth specifies the maximum depth of the trees; its default is None, in which case nodes are expanded until all leaves are pure. oob_score is a boolean parameter that specifies whether to use out-of-bag samples to estimate the generalization accuracy.

It also provides several methods for the prediction work. The fit method builds the forest from the training set, and the predict method predicts the results for the test data. There is also a transform method that can be used to reduce the input data matrix to the most important features.
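A minimal sketch of using this classifier with the n_estimators=150 setting reported below (a reconstruction against the scikit-learn API, not our original script; X_train, y_train and X_test are assumed to be the preprocessed arrays):

```python
from sklearn.ensemble import RandomForestClassifier

# 150 trees; out-of-bag scoring enabled to get a free error estimate.
clf = RandomForestClassifier(n_estimators=150, oob_score=True, n_jobs=-1)
clf.fit(X_train, y_train)

print("OOB score:", clf.oob_score_)
pred = clf.predict(X_test)               # predicted 0/1 labels
importances = clf.feature_importances_   # per-feature importance scores
```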

Initially we used the random forest classifier to predict the Label value of the data as signal (s) or background (b) directly, without predicting the weight value; that way we could not obtain a rank value for the test data. For the initial submission we also removed the derived features from the training data, and added them back later. We then made submissions replacing the -999 values with the column averages, and also tried removing those columns from the training and test data sets. But the random forest classifier did not give very good results with either of those methods. The best we were able to score with the random forest method was 2.90576 on the private leaderboard, with n_estimators set to 150. When we then tried to estimate the weight value using the random forest classifier, it failed because it required a huge amount of memory. So we decided to move to other available options for better results.

3.2.2 Gradient Boost Classifier

Another classifier we tested in the initial stages was the gradient boosting classifier. Gradient boosting algorithms use an ensemble of weak decision trees built to optimize a customizable loss function; the trees are built via boosting in a staged manner. Gradient boosting can be used for both regression and classification, handles data of mixed types, and is very robust to outliers.

We used the gradient boosted regression trees algorithm from the scikit-learn library in Python for this problem. This model used all the features in the data set to train the classifier. To improve the accuracy we used hyperparameter tuning along with stratified cross-validation to set the best values for the parameters.
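A sketch of this tuning setup (a reconstruction; the parameter grid shown is an assumption, not the grid we actually searched):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {                          # illustrative grid, not the original one
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(GradientBoostingClassifier(),
                      param_grid,
                      cv=StratifiedKFold(n_splits=5),
                      scoring="roc_auc")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```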

We also tried multiple loss functions, such as the default ‘deviance’ function as well as the AMS metric used in this competition. Using the AMS function as the loss function improved our results slightly. Even with all this effort we were unable to match the performance we got from the XGBoost algorithm, so this approach was dropped.

3.2.3 Neural Networks

Artificial neural networks provide a general, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples. Algorithms such as backpropagation use gradient descent to tune network parameters to best fit a training set of input-output pairs. Neural network learning is robust to errors in the training data and has been successfully applied to problems such as interpreting visual scenes, speech recognition, and learning robot control strategies.

We used the PyBrain[5] Python library to build a neural network, trained with the backpropagation algorithm. While training the neural network, we faced a number of problems, such as:

1. Number of hidden layers to use

Number of hidden layers | Result
------------------------|-------
none | Only capable of representing linearly separable functions/decisions.
1    | Can approximate any function that contains a continuous mapping from one finite space to another.
2    | Can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions, and can approximate any smooth mapping to any accuracy.

The table above summarizes the knowledge we acquired from various research papers. Unfortunately, we were unable to find a specific method for determining the number of hidden layers, and hence tested various numbers of hidden layers, ranging from 2 to 50. We could not increase the number of hidden layers further because of the huge amount of time taken by the network training phase.

2. Number of neurons in each hidden layer

We were unable to find a specific formula to calculate the number of neurons in a particular hidden layer, although we found many rule-of-thumb methods for determining it, such as the following:

- The number of hidden neurons should be between the size of the input layer and the size of the output layer.
- The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
- The number of hidden neurons should be less than twice the size of the input layer.

We applied the above rules, and additionally tried choosing the number of neurons in a hidden layer based on a combination of the prime number series and the Fibonacci number series.

3. Neural network training time

Training the neural network took a lot of time. As a last resort we tried using genetic algorithms[7][8] and pruning algorithms[6] to optimize the neural network, but the result was not satisfactory.

4. How to decide the cut-off mark between signal and background noise

The output of the neural network was a floating-point value between 0 and 1; values closer to 1 indicate signal and values closer to 0 indicate background noise. Using 10-fold cross-validation, we found that values above 0.65 should be considered signal and values below 0.65 background noise.

However, all the prediction results obtained through the neural network model performed poorly compared to the other models during cross-validation.
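For illustration, a minimal PyBrain sketch of the setup described above (a reconstruction against the PyBrain API; the hidden-layer size, epoch count and data wiring are assumptions, not our original script):

```python
from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

n_features = X_train.shape[1]
net = buildNetwork(n_features, 20, 1)    # one hidden layer of assumed size

ds = SupervisedDataSet(n_features, 1)
for x, y in zip(X_train, y_train):
    ds.addSample(x, (y,))

trainer = BackpropTrainer(net, ds)
trainer.trainEpochs(10)                  # assumed epoch count

# Apply the 0.65 cut-off found via 10-fold cross-validation.
scores = [net.activate(x)[0] for x in X_test]
labels = ["s" if s > 0.65 else "b" for s in scores]
```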

3.2.4 XGBoost Classifier

We used the xgboost package[4] for the R language, which implements an extreme gradient boosting classifier. Extreme gradient boosting is an efficient and scalable implementation of the gradient boosting framework described earlier. The package includes an efficient linear model solver and tree learning algorithms, and it can automatically parallelize computation with OpenMP, making it more than 10 times faster than the gradient boosting implementation we used previously. XGBoost supports various objective functions, including regression, classification and ranking. We used the “rank” objective to rank the probabilities of the events being due to signals, as required for the submissions. The two classes, “s” and “b”, were separated using a threshold value, chosen after careful analysis, applied to the test entries sorted by their probabilities of being signal.

Unlike the gradient boosting classifier, the XGBoost classifier provides a special way of handling missing values: it automatically learns the best direction to take when a value is missing. We fed -999 to the XGBoost classifier as the missing value, and by doing so we improved the scores we obtained on the public leaderboard.

We then tuned the parameters, using GridSearch from the scikit-learn package[9] to select a better set of parameters than the defaults. With the following set of parameters we obtained good results. Since gradient boosting generalizes dramatically better at lower learning rates (heavier shrinkage), we reduced the default value of eta; and since lower learning rates need more iterations, increasing the nround variable had a positive impact on our results. It is well known that lower learning rates reduce overfitting.

eta = 0.05

max_depth = 6

silent = 1

nthread = 16

nround = 500
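We ran this in R; for illustration, a roughly equivalent sketch with the xgboost Python API is shown below (the “rank:pairwise” objective name is our assumption for the “rank” objective mentioned above):

```python
import xgboost as xgb

# Declare -999 as the missing-value sentinel so XGBoost can learn a
# default split direction for missing entries.
dtrain = xgb.DMatrix(X_train, label=y_train, missing=-999.0)
dtest = xgb.DMatrix(X_test, missing=-999.0)

params = {
    "objective": "rank:pairwise",   # assumed name of the ranking objective
    "eta": 0.05,
    "max_depth": 6,
    "silent": 1,
    "nthread": 16,
}
bst = xgb.train(params, dtrain, num_boost_round=500)
scores = bst.predict(dtest)         # higher = more signal-like
```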

4. Results and Discussion

As mentioned in the previous parts of this document, we tried multiple classification systems to solve this challenge, with varying results:

1. Gradient Boosting
2. Random Forests
3. Neural Networks
4. XGBoost

We found that the XGBoost algorithm produced the best results for this problem. Using the XGBoost algorithm as described above, we were able to get a score of 3.64655, which gave us a rank of 437.

While we were able to get higher scores on the public leaderboard, those scores misled us about the overall predictive ability of our models. Our best submission, which scored 3.64672 and was ranked 223 on the public leaderboard, was a result of overfitting, which caused us to drop on the private leaderboard, which determines the final positions.

The biggest issue with our process in this competition was the lack of good cross-validation. We relied so heavily on the public leaderboard to assess the quality of our models that we were unable to avoid overfitting the predictions to the leaderboard. As the public leaderboard was computed from only 18% of the data, relying on it to gauge improvements led to overfitting. A valuable lesson learned through this contest is the importance of maintaining good standards of cross-validating our predictions, which would have allowed us to perform much better.

One of the major challenges we faced during this competition was coming up with derived features. Since we had no knowledge of the field of high-energy particle physics, we had to read a couple of research papers in that area in order to come up with the features mentioned in section 3.1.5, and in the process we gained a considerable amount of knowledge of the field.

This competition was a valuable opportunity for us to learn important machine learning and data

mining concepts while contributing to a very important scientific cause. Through this competition we

were able to get a better understanding of the challenges in the field and also the methodologies

practically used to overcome them.

5. References

[1] http://www.lps.ens.fr/~laetitia/HIGGS.pdf

[2] http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

[3] http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html

[4] http://cran.r-project.org/web/packages/xgboost/index.html

[5] Schaul, T., Bayer, J., Wierstra, D., Sun, Y., Felder, M., Sehnke, F., ... & Schmidhuber, J. (2010). PyBrain. Journal of Machine Learning Research, 11, 743-746.

[6] Karnin, E. D. (1990). A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks, 1(2), 239-242.

[7] Leung, F. H. F., Lam, H. K., Ling, S. H., & Tam, P. K. S. (2003). Tuning of the structure and parameters of a neural network using an improved genetic algorithm. IEEE Transactions on Neural Networks, 14(1), 79-88.

[8] http://www.pybrain.org/docs/api/optimization/optimization.html#population-based

[9] Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.

6. Appendix

Appendix A: Public and private scores for Random Forest models and Logistic Regression models

Appendix B: Public and private scores for Gradient Boosting models

Appendix C: Public and private scores for some XGBoost models

Appendix D: Public and private scores for some Neural Network models

