Goal: Predict who survived the Titanic disaster
Hypotheses: Women and children first
Get Data: Read dataset into Excel, R, etc.
Data Management: Some Age data missing, analyze gender only
Statistics & Analysis: 74% women, 19% men
Submit Predictions: 320 / 418 = 76.5%
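A gender-only rule of the kind summarized above can be sketched in a few lines. This is a minimal illustration only: the rows are a tiny hypothetical sample (not the Titanic data), and the helper names `survival_rate_by_sex` and `predict_gender_only` are made up for this sketch.

```python
from collections import defaultdict

# Hypothetical toy rows (sex, survived); NOT the real Titanic data.
rows = [
    ("female", 1), ("female", 1), ("female", 0), ("female", 1),
    ("male", 0), ("male", 0), ("male", 1), ("male", 0),
]

def survival_rate_by_sex(rows):
    """Return {sex: fraction survived} for the given rows."""
    counts = defaultdict(lambda: [0, 0])  # sex -> [survivors, total]
    for sex, survived in rows:
        counts[sex][0] += survived
        counts[sex][1] += 1
    return {sex: s / n for sex, (s, n) in counts.items()}

def predict_gender_only(sex, rates):
    """Predict survival (1) when the group's survival rate exceeds 50%."""
    return 1 if rates[sex] > 0.5 else 0

rates = survival_rate_by_sex(rows)
predictions = [predict_gender_only(sex, rates) for sex, _ in rows]
accuracy = sum(p == y for p, (_, y) in zip(predictions, rows)) / len(rows)
print(rates, accuracy)
```

The accuracy here is measured on the same toy rows used to compute the rates; a submitted score would instead be computed on held-out passengers.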
Predictor Variables
Variable   Description                          Type                   Hypothesis
pclass     Passenger Class                      Categorical, Ordinal   1st class 3rd
name       Name                                 Text
Sex        Sex                                  Categorical
age        Age                                  Numeric
sibsp      Number of Siblings/Spouses Aboard    Integer
parch      Number of Parents/Children Aboard    Integer
ticket     Ticket Number                        Text
fare       Passenger Fare                       Numeric
cabin      Cabin                                Text
embarked   Port of Embarkation                  Categorical
Age
All: N = 891
Missing: N = 177
Data: N = 714
[Figure: Age distribution (x-axis 0 to 90), Survived vs. Not]
Decision Trees
• Dependent variable (Y)
  • Continuous
  • Categorical
• Independent variables (X's)
  • Continuous
  • Categorical
The decision tree looks for the split of the sample at each node that leads to the most differentiation on Y.
Survived: Age Less Than X vs. Age Greater Than X
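The search for a cut point X on Age can be sketched as a scan over candidate thresholds. This is a minimal sketch under stated assumptions: the sample is hypothetical (not the Titanic data), the function names are illustrative, and the gap in survival rate between the two sides is used as a simple stand-in for "most differentiation on Y"; real implementations typically use criteria such as deviance or Gini impurity.

```python
# Hypothetical toy sample of (age, survived); NOT the real Titanic data.
sample = [(2, 1), (5, 1), (9, 1), (20, 0), (28, 0), (35, 1), (47, 0), (60, 0)]

def split_table(sample, thresholds):
    """For each candidate X, compare survival below and above the cut."""
    table = []
    for x in thresholds:
        below = [y for age, y in sample if age < x]
        above = [y for age, y in sample if age >= x]
        if not below or not above:
            continue  # a split must leave rows on both sides
        rate_below = sum(below) / len(below)
        rate_above = sum(above) / len(above)
        table.append((x, rate_below, rate_above, abs(rate_below - rate_above)))
    return table

def best_split(sample, thresholds):
    """Pick the X with the largest gap in survival rate between the sides."""
    return max(split_table(sample, thresholds), key=lambda row: row[3])

x, rate_below, rate_above, gap = best_split(sample, range(5, 65, 5))
print(x, rate_below, rate_above, gap)
```

When several thresholds tie, `max` keeps the first one it encounters, so the smallest tied X is reported.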
Decision Trees
[Figure: Age; legend: A, B, Delta, N]
• Maximize data likelihood (minimize deviance).
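To make "minimize deviance" concrete, the sketch below computes the binomial deviance of a node (minus twice its log-likelihood at the node's own survival rate) and the reduction obtained by a split. The counts are hypothetical, and the function names are illustrative rather than taken from any particular package.

```python
import math

def node_deviance(survived, total):
    """Binomial deviance of one node: -2 * log-likelihood at the node's own rate."""
    if total == 0 or survived in (0, total):
        return 0.0  # a pure (or empty) node fits its rows perfectly
    p = survived / total
    log_lik = survived * math.log(p) + (total - survived) * math.log(1 - p)
    return -2.0 * log_lik

def deviance_reduction(left, right):
    """Drop in deviance from splitting a parent into (survived, total) children."""
    parent = (left[0] + right[0], left[1] + right[1])
    return node_deviance(*parent) - node_deviance(*left) - node_deviance(*right)

# Hypothetical counts, NOT taken from the Titanic data.
informative = deviance_reduction((3, 3), (1, 5))
uninformative = deviance_reduction((2, 4), (2, 4))
print(informative, uninformative)
```

A split whose children have the same survival rate as the parent yields no reduction, while a split that separates survivors from non-survivors yields a large one.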
Prediction and Missing Values
Variable   Description
pclass     Passenger Class
name       Name
Sex        Sex
age        Age
sibsp      Number of Siblings/Spouses Aboard
parch      Number of Parents/Children Aboard
ticket     Ticket Number
fare       Passenger Fare
cabin      Cabin
embarked   Port of Embarkation
Correlation, Association of Age with other Variables?
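If Age turns out to be associated with another variable, one option is to fill missing ages from that variable. The sketch below is an assumption, not a method stated on the slide: it fills each missing age with the median age of passengers in the same `pclass`, using hypothetical rows and an illustrative helper name.

```python
from statistics import median

# Hypothetical toy passengers; None marks a missing age. NOT the real data.
passengers = [
    {"pclass": 1, "age": 40.0}, {"pclass": 1, "age": 50.0}, {"pclass": 1, "age": None},
    {"pclass": 3, "age": 20.0}, {"pclass": 3, "age": 24.0}, {"pclass": 3, "age": None},
]

def impute_age_by_group(passengers, key="pclass"):
    """Fill missing ages with the median age of rows sharing the same key."""
    known = {}
    for p in passengers:
        if p["age"] is not None:
            known.setdefault(p[key], []).append(p["age"])
    group_median = {k: median(v) for k, v in known.items()}
    return [
        {**p, "age": group_median[p[key]]} if p["age"] is None else dict(p)
        for p in passengers
    ]

filled = impute_age_by_group(passengers)
print([p["age"] for p in filled])
```

The sketch assumes every group has at least one known age; a group with none would need a fallback such as the overall median.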
Gender
Gender and Age
• The tree grows by optimizing only the split at the current node rather than optimizing the entire tree.
• The tree stops when a further split becomes ineffective.
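The two points above (greedy, node-by-node growth and a stopping rule) can be sketched as a small recursive procedure. The assumptions here are loud ones: the rows are hypothetical, Gini impurity is used as the split criterion (a CART-style choice, not necessarily the one behind these slides), and "ineffective" is interpreted as an impurity gain below a made-up `min_gain` threshold.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows):
    """Return (gain, feature, threshold) for the best single split of rows."""
    labels = [y for _, y in rows]
    best = (0.0, None, None)
    for f in range(len(rows[0][0])):
        for t in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] < t]
            right = [y for x, y in rows if x[f] >= t]
            if not left or not right:
                continue
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            gain = gini(labels) - weighted
            if gain > best[0]:
                best = (gain, f, t)
    return best

def grow(rows, min_gain=0.01):
    """Greedy growth: choose the best split here, then recurse on each side."""
    gain, f, t = best_split(rows)
    if f is None or gain < min_gain:  # stop: no further split is effective
        labels = [y for _, y in rows]
        return {"predict": int(sum(labels) * 2 >= len(labels))}  # majority leaf
    left = [(x, y) for x, y in rows if x[f] < t]
    right = [(x, y) for x, y in rows if x[f] >= t]
    return {"feature": f, "threshold": t,
            "left": grow(left, min_gain), "right": grow(right, min_gain)}

def predict(tree, x):
    """Walk from the root to a leaf and return its prediction."""
    while "predict" not in tree:
        tree = tree["left"] if x[tree["feature"]] < tree["threshold"] else tree["right"]
    return tree["predict"]

# Hypothetical toy rows ((is_female, age), survived); NOT the real Titanic data.
rows = [((1, 30), 1), ((1, 8), 1), ((1, 45), 1), ((0, 6), 1),
        ((0, 30), 0), ((0, 40), 0), ((0, 52), 0), ((0, 25), 0)]
tree = grow(rows)
print(tree)
```

Each call to `grow` looks only at the rows that reached its node, so an early split is never revisited even if a different first split would have produced a better tree overall.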
[Figure: Female Survival % (0% to 100%), x-axis 0 to 70]
Prediction: Gender + Age
Age + Gender
Kitchen Sink
Decision Trees
• Popular implementations
  • CART: Classification And Regression Tree
  • CHAID: CHi-squared Automatic Interaction Detector
• CART uses binary splits
• CHAID allows multi-branch splits, giving a wider tree
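CHAID's name refers to the chi-squared test it relies on when evaluating predictors. As a rough illustration of that statistic only (not of the full CHAID procedure), the sketch below computes Pearson's chi-squared for a contingency table with one row per predictor category, which is what allows a split into more than two branches. The counts are hypothetical.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table (list of rows)."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: one row per category, columns are (survived, not survived).
three_way = [[30, 10], [20, 20], [10, 30]]
no_association = [[10, 10], [20, 20], [5, 5]]
print(chi_squared(three_way), chi_squared(no_association))
```

A larger statistic indicates a stronger association between the predictor's categories and the outcome; the sketch assumes no row or column total is zero.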