
Titanic - Presentation

Date post: 15-Apr-2017
Upload: sonali-haldar
Business Analytics and Insights Final Project Pallavi Herekar | Sonali Haldar
Transcript
Page 1: Titanic - Presentation

Business Analytics and Insights Final Project

Pallavi Herekar | Sonali Haldar

Page 2: Titanic - Presentation

Introduction

• RMS Titanic was a British passenger liner that began its voyage with about 2,200 people on board and sank four days later in the North Atlantic Ocean, in the early morning of 15 April 1912. Around 1,500 people died and about 700 survived the tragedy

• According to Encyclopedia Titanica, of the 712 survivors, 500 were passengers (369 women and children, 131 men) and 212 were crew (20 women, 192 men)

Page 3: Titanic - Presentation

Problem Statement

• Hypothesis: Certain sources claim that the survivors belonged to one of the following categories: women, children and/or upper class

• Our problem is to confirm whether this hypothesis is true, using the given sample, which includes data on 342 survivors, and to derive conclusions using different models in R

Page 4: Titanic - Presentation

Data Visualization

Page 5: Titanic - Presentation

Data Visualization

Page 6: Titanic - Presentation

Data Acquisition & Processing

• Data source: Kaggle - https://www.kaggle.com/c/titanic

• Data Processing
  • Analyzed the data types using the str() function and converted some columns to factor data types
  • Identified the columns with NA/NaN values
  • Filled the empty values with the median or mode of the column
    • Populated the missing ages with the median age
    • Populated the empty Embarked values with Cherbourg
    • Populated the empty Sex values with Male, as the distribution was mostly male
  • Converted the values into factors before generating the Association model
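The median and mode imputation described above can be sketched as follows. The presentation did this work in R; this is only a minimal Python illustration using the standard library, and the toy rows and the helper name `impute` are hypothetical, not the authors' code.

```python
from statistics import median, mode

def impute(rows, column, strategy):
    """Return copies of rows with None in `column` replaced by the
    median or mode of the observed (non-missing) values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed) if strategy == "median" else mode(observed)
    return [dict(r, **{column: fill}) if r[column] is None else dict(r)
            for r in rows]

# Hypothetical toy rows standing in for the Kaggle training data.
passengers = [
    {"Age": 22.0, "Embarked": "S"},
    {"Age": None, "Embarked": "C"},
    {"Age": 38.0, "Embarked": None},
    {"Age": 26.0, "Embarked": "C"},
]

passengers = impute(passengers, "Age", "median")     # numeric column: median
passengers = impute(passengers, "Embarked", "mode")  # categorical column: mode
```

The median is used for the numeric column because it is less sensitive to outliers than the mean, and the mode is the natural choice for a categorical column.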

Page 7: Titanic - Presentation

Data Analysis - Statistical Models

• Generalized Linear Model (GLM)
  • Model 1 - Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + AgeGrp
  • Model 2 - Pclass + Sex + Age
  • Model 3 - poly(Age, 2) * Sex * Pclass + SibSp

• LDA Model
• QDA Model
• KNN Model
• Recursive Decision Tree
• Random Forest
• Association Rules

Page 8: Titanic - Presentation

Model Result Matrix

Model            Accuracy   95% Confidence Interval
GLM - Model 1    79.03%     (0.7365, 0.8375)
GLM - Model 2    78.28%     (0.7284, 0.8307)
GLM - Model 3    80.90%     (0.7566, 0.8543)
LDA              79.40%     (0.7405, 0.8409)
QDA              77.53%     (0.7204, 0.8239)
KNN              62.92%     (0.5682, 0.6873)
Decision Tree    78.65%     (0.7324, 0.8341)
Random Forest    81.65%     (0.7647, 0.861)
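To show how an accuracy figure relates to its confidence interval, here is a minimal Python sketch of an exact (Clopper-Pearson) binomial interval. Two things are assumptions rather than statements from the slides: that the validation set had 267 rows (inferred from confusion-matrix counts quoted elsewhere in the deck, such as 186 + 81), and that the intervals were computed with an exact binomial method. The function names are hypothetical.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); each term is computed via logs."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log(1.0 - p))
        total += math.exp(log_term)
    return min(total, 1.0)

def exact_binomial_ci(successes, n, alpha=0.05):
    """Clopper-Pearson interval for a proportion, found by bisection."""
    def bisect(move_up):
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if move_up(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # Lower bound: p at which P(X >= successes) reaches alpha / 2.
    lower = 0.0 if successes == 0 else bisect(
        lambda p: 1.0 - binom_cdf(successes - 1, n, p) < alpha / 2)
    # Upper bound: p at which P(X <= successes) falls to alpha / 2.
    upper = 1.0 if successes == n else bisect(
        lambda p: binom_cdf(successes, n, p) > alpha / 2)
    return lower, upper

# GLM Model 1: 79.03% of an assumed 267 validation rows is 211 correct.
lower, upper = exact_binomial_ci(211, 267)
print(f"accuracy = {211 / 267:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

Under these assumptions the result should land close to the interval listed for GLM Model 1. The intervals of all models except KNN overlap heavily, so the accuracy differences between them are small relative to the sampling uncertainty.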

Page 9: Titanic - Presentation

Result Interpretation - GLM - Model 1

• Model 1 indicates that Pclass and Sexmale have the highest significance, followed by Age and SibSp
• Moving from 1st class to 2nd class, or from 2nd class to 3rd class, reduces the odds of survival by a factor of 1.3 in each case
• Being male reduces the odds of survival by a factor of 2.7 compared to being female
• Age and SibSp reduce the odds of survival by a factor of .09 and .33 respectively
• In Model 1, the predictors Parch, Fare, EmbarkedQ, EmbarkedS and AgeGrp are not statistically significant
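The slide does not say whether the figures quoted are raw model coefficients or odds ratios. If they are raw log-odds coefficients from a logistic regression (an assumption, not something the slides confirm), the multiplicative effect on the odds is obtained by exponentiating. A minimal Python sketch, with hypothetical function names:

```python
import math

def odds_ratio(coefficient):
    """Multiplicative change in the odds for a one-unit increase in the
    predictor, given a logistic-regression (log-odds) coefficient."""
    return math.exp(coefficient)

def odds_reduction_factor(coefficient):
    """How many times smaller the odds become for a negative coefficient."""
    return 1.0 / odds_ratio(coefficient)

# Hypothetical reading: the 2.7 quoted for Sexmale is a coefficient of -2.7.
sex_male_coefficient = -2.7
print(f"odds ratio: {odds_ratio(sex_male_coefficient):.4f}")
print(f"odds divided by: {odds_reduction_factor(sex_male_coefficient):.1f}")
```

Under that reading, a coefficient of -2.7 would mean the odds of survival for men are divided by roughly 15, which is a much larger effect than "a factor of 2.7".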

Page 10: Titanic - Presentation

Result Interpretation - GLM - Model 2

• In Model 2, we just picked the statistically significant predictors based on the analysis from Model 1

• Considering only the significant predictors decreased the accuracy from 79.03% to 78.28%. This suggests that we left out a few other predictors that can also influence predictive accuracy. In this case these may be AgeGrp and EmbarkedS

Page 11: Titanic - Presentation

Result Interpretation - GLM - Model 3

• Model 3 gives Age a much higher weight by entering it as a second-degree polynomial
• The most discriminating predictors are Sex, Pclass and SibSp
• This model has the highest accuracy of the GLM models, at 80.9%
• 37 out of 186 non-survivors were wrongly predicted to survive, an error rate of 19.89% (Type I)
• Only 14 out of 81 survivors were missed by this model, i.e. 17.28% (Type II)
• This is the best of the three GLM models for this data and shows that, along with Age, other factors such as Sex, Pclass and SibSp are statistically significant
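The error rates quoted for Model 3 can be recomputed from the counts on this slide. The sketch below is a Python illustration with a hypothetical helper name. It assumes that 186 is the number of actual non-survivors and 81 the number of actual survivors in the validation set.

```python
def error_rates(false_pos, negatives, false_neg, positives):
    """Accuracy, Type I rate (false positives / actual negatives) and
    Type II rate (false negatives / actual positives) from raw counts."""
    total = negatives + positives
    correct = (negatives - false_pos) + (positives - false_neg)
    return {
        "accuracy": correct / total,
        "type_1": false_pos / negatives,
        "type_2": false_neg / positives,
    }

# Counts quoted for GLM Model 3.
rates = error_rates(false_pos=37, negatives=186, false_neg=14, positives=81)
for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```

With these counts the accuracy comes out at 80.90%, which matches the figure in the result matrix, and the Type II rate at 17.28%. The Type I rate is 37 / 186, or 19.89%.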

Page 12: Titanic - Presentation

Result Interpretation - GLM using ROC Curve

Model 1 - Black
Model 2 - Green
Model 3 - Blue

Area Under ROC Curve:

Model 1 - 0.8321
Model 2 - 0.8291
Model 3 - 0.8309
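For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen survivor receives a higher predicted score than a randomly chosen non-survivor. A minimal Python sketch of that pairwise definition follows. The labels and scores are made-up toy values, not the models' output.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison: the share of
    (positive, negative) pairs ranked correctly, counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 = survived, 0 = did not survive.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.5]
print(f"AUC = {auc(labels, scores):.4f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The three GLM values on this slide differ only in the third decimal place, so by this measure the models rank passengers almost equally well.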

Page 13: Titanic - Presentation

Result Interpretation - LDA

• LDA was fitted on 70% of the sample (Train Set) and validated on the remaining 30% (Validation Set) to identify the factors affecting survival

• The LDA model has an accuracy of 79.4%

• Of the 104 passengers predicted to survive, 73 actually survived. Hence 31 of the 170 non-survivors were incorrectly predicted to have survived, and 24 survivors were missed by the model
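The 70/30 train and validation split can be sketched as below. This is a Python illustration, not the authors' R code. The row count of 891 is the size of the Kaggle training file and is not stated on the slides, and the seed and function name are arbitrary choices made for the example.

```python
import random

def train_validation_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and cut it into train and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Assumed: 891 rows, identified here only by their index.
row_ids = list(range(891))
train, validation = train_validation_split(row_ids)
print(len(train), len(validation))
```

With 891 rows this gives a validation set of 267, which is consistent with the totals implied by the confusion-matrix counts on the interpretation slides (for example 170 non-survivors plus 73 + 24 survivors for LDA).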

Page 14: Titanic - Presentation

Result Interpretation - QDA

• QDA was fitted on the same sample split as LDA to identify the factors affecting survival

• The QDA model has an accuracy of 77.53%

• Of the 104 passengers predicted to survive, 71 actually survived. Hence 33 of the 169 non-survivors were incorrectly predicted to have survived, and 27 survivors were missed by this model

Page 15: Titanic - Presentation

Result Interpretation - KNN

• We used K=1 for the KNN model. It does not indicate which factors are significant in determining survival

• KNN has the lowest accuracy (62.92%) of the classification methods, which indicates that the data may have a complex non-linear relationship that this non-parametric method cannot capture

• The confusion matrix indicates that 63 out of 190 actual non-survivors, i.e. 33.16%, were predicted as survivors (Type I error)

• Of the 77 survivors, 36 were missed by the model, i.e. 46.75% (Type II error), which implies that this is not a good model to use
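A K=1 nearest-neighbour classifier simply copies the label of the single closest training point. The Python sketch below illustrates the idea. The two features, the toy values and the function name are hypothetical, and the slides do not say which features or distance measure were used.

```python
import math

def predict_1nn(train_points, train_labels, query):
    """Return the label of the training point closest to `query`
    (Euclidean distance), which is KNN with K=1."""
    distances = [math.dist(point, query) for point in train_points]
    nearest = distances.index(min(distances))
    return train_labels[nearest]

# Hypothetical toy data: (Age, Fare) pairs, 1 = survived, 0 = did not.
train_points = [(22.0, 7.25), (38.0, 71.28), (26.0, 7.93), (35.0, 53.10)]
train_labels = [0, 1, 1, 1]

print(predict_1nn(train_points, train_labels, (23.0, 7.5)))
```

Because each prediction depends on one neighbour only, K=1 is sensitive to noisy points, and features on larger numeric scales dominate the distance unless they are standardised first.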

Page 16: Titanic - Presentation

Result Interpretation - Decision Tree

● The decision tree highlights that, apart from Sex and Pclass, Age, Fare, SibSp and Embarked are also significant predictors

● The tree also highlights that females from 3rd class have a 21% survival rate

● The tree also highlights that males older than 6.5 years have an 80% non-survival rate

Page 17: Titanic - Presentation

Result Interpretation - Random Forest

● The Mean Decrease Accuracy measure indicates that Sex is the most important predictor of passenger survival, followed by Pclass, Fare and Age

Page 18: Titanic - Presentation

Result Interpretation - Association Analysis

● Based on the Association Analysis, we found that 3rd class male passengers who embarked from Southampton and are in the age group of 20-30 years are 46.6% more likely not to survive compared to others.
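The slide does not say which association-rule measure the 46.6% refers to. For orientation, the three standard measures for a rule "antecedent implies consequent" are support, confidence and lift, sketched below in Python on made-up transactions. The item names and the helper are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent => consequent."""
    n = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_consequent = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n                       # share of rows matching the rule
    confidence = n_both / n_antecedent         # P(consequent | antecedent)
    lift = confidence / (n_consequent / n)     # confidence relative to baseline
    return support, confidence, lift

# Hypothetical toy transactions, one set of items per passenger.
transactions = [
    {"Pclass=3", "Sex=male", "Survived=No"},
    {"Pclass=3", "Sex=male", "Survived=No"},
    {"Pclass=3", "Sex=male", "Survived=Yes"},
    {"Pclass=1", "Sex=female", "Survived=Yes"},
    {"Pclass=2", "Sex=female", "Survived=Yes"},
    {"Pclass=1", "Sex=male", "Survived=No"},
]

support, confidence, lift = rule_metrics(
    transactions, {"Pclass=3", "Sex=male"}, {"Survived=No"})
print(f"support={support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

A statement of the form "X% more likely than others" corresponds most closely to lift: a lift of 1.466, for instance, would mean the outcome is 46.6% more frequent among passengers matching the antecedent than in the data overall.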

Page 19: Titanic - Presentation

Conclusion

• Comparing the accuracy of the different models, Random Forest is the best, followed by GLM Model 3, in which Age is entered as a second-degree polynomial

• The Random Forest model highlights the importance of the predictors Sex, Pclass, Fare and Age

• GLM Model 3 gives similar results, with Age given a higher weight, followed by Sex, Pclass and SibSp

• After analyzing all the models, we can conclude that the predictors Sex, Pclass and Age played a major role in survival on the Titanic

