Date post: 21-Oct-2019
Dean Abbott Abbott Analytics, SmarterHQ KNIME Fall Summit 2018 Email: [email protected] Twitter: @deanabb © Abbott Analytics 2001-2018 1 Doing the Data Science Dance
Page 1: Doing the Data Science Dance - files.knime.com · • Linear Regression, Logistic Regression, kNN, kMeans clustering, PCA…. are main effect models • Decision trees are greedy

Dean Abbott

Abbott Analytics, SmarterHQ

KNIME Fall Summit 2018

Email: [email protected]

Twitter: @deanabb

© Abbott Analytics 2001-2018

Doing the Data Science Dance

Data Science vs. Other Labels


Google Trends


What do Predictive Modelers do?

The CRISP-DM Process Model

• CRoss-Industry Standard Process Model for Data Mining

• Describes Components of Complete Data Mining Cycle from the Project Manager’s Perspective

• Shows Iterative Nature of Data Mining

[Figure: the CRISP-DM cycle. Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment arranged around the Data]

What we Want to Do!


How The Citizen Data Scientist Will Democratize Big Data

Published on April 6, 2016

Retailer Sears, for example, recently empowered 400 staff from its business intelligence (BI) operations to carry out advanced, Big Data driven customer segmentation – work which would previously have been carried out by specialist Big Data analysts, probably with PhDs.


Is it a Recipe?

Can we apply a recipe to machine learning and data science modeling processes?

Good Set of Data Prep Steps!

https://www.knime.org/files/knime_seventechniquesdatadimreduction.pdf

Data Preparation Dependencies

• Fill missing values

• Explode categorical variables

• *Outliers and scale very influential

• Sometimes automatic in software; beware of how!

• Categoricals are fine

• Numeric data must be binned (except some decision trees)

• Outliers don’t matter

• Missing values a category
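
As a minimal sketch of the first two steps in the list above, the following fills missing numeric values with the median and explodes a categorical field into 0/1 dummy columns. The field names (`age`, `state`) and the toy values are made up for illustration, and median fill is only one of several possible choices; software that does these steps automatically may do them differently.

```python
from statistics import median


def fill_missing(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def explode_categorical(values):
    """Turn one categorical column into 0/1 dummy columns, one per level."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Hypothetical toy data
age = [34, None, 51, 29, None, 46]
state = ["CA", "TX", "CA", "NY", "TX", "CA"]

age_filled = fill_missing(age)
state_dummies = explode_categorical(state)
print(age_filled)
print(state_dummies)
```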


Why Are Outliers a Problem?

Squares…

Linear Regression: Mean Squared Error
https://en.wikipedia.org/wiki/Mean_squared_error

K-Means Clustering: Euclidean distance
https://en.wikipedia.org/wiki/Euclidean_distance
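
To see why the squares matter, here is a small sketch with made-up residuals. One large error accounts for almost all of the mean squared error, and the same squaring happens inside the Euclidean distance used by K-Means. The numbers are illustrative only.

```python
import math


def mean_squared_error(errors):
    """Average of the squared errors."""
    return sum(e * e for e in errors) / len(errors)


def euclidean_distance(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical residuals: nine small errors and one outlier
errors = [1, -1, 2, -2, 1, -1, 2, -2, 1, 30]
mse = mean_squared_error(errors)
outlier_share = 30 ** 2 / sum(e * e for e in errors)
print(f"MSE = {mse:.1f}, share of squared error from the outlier = {outlier_share:.1%}")

# The same effect in a distance: one extreme coordinate dominates
print(euclidean_distance([1, 1, 1], [2, 2, 100]))
```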


Effect of Outliers on Correlations (and Regression)

• 4,843 records

Corresponds to R^2 increase from 0.42 to 0.53
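
The sketch below illustrates the general effect on synthetic data. It does not reproduce the 4,843-record data set or the 0.42 and 0.53 figures from the slide; the trend, noise level, and the three outlier records are all assumptions chosen for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from sums of squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)
x = [random.uniform(0, 10) for _ in range(200)]
y = [2 * xi + random.gauss(0, 2) for xi in x]
r_clean = pearson_r(x, y)

# Add three extreme records that do not follow the trend
x_out = x + [100, 120, 150]
y_out = y + [0, 5, -10]
r_outliers = pearson_r(x_out, y_out)

print(f"r without outliers = {r_clean:.2f} (R^2 = {r_clean ** 2:.2f})")
print(f"r with outliers    = {r_outliers:.2f} (R^2 = {r_outliers ** 2:.2f})")
```

Three records out of 203 are enough to change the correlation substantially, because each contributes a squared deviation far larger than any ordinary record.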


Decision Trees Can Handle it


Effect of Distance on Clusters


Log transform the heavily skewed fields
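
A minimal sketch of the idea, assuming a strictly positive, right-skewed field simulated from a lognormal distribution (the data are synthetic). It uses log10, in line with field names such as `NGIFTALL_log10` that appear later in the deck. For fields that contain zeros, a common variant is `log10(x + 1)`.

```python
import math
import random


def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n


random.seed(1)
# Hypothetical heavily right-skewed field, e.g. a dollar amount
amounts = [random.lognormvariate(3, 1) for _ in range(5000)]
amounts_log10 = [math.log10(a) for a in amounts]

print(f"skew before log10: {skewness(amounts):.2f}")
print(f"skew after log10:  {skewness(amounts_log10):.2f}")
```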


Dummy Vars

Note: stdevs are typically 0.5 or less (a 0/1 dummy has stdev √(p(1−p)), which peaks at 0.5)
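
The size of a dummy variable's standard deviation follows directly from the proportion of 1s. This short sketch tabulates it for a few proportions; the proportions are arbitrary examples.

```python
import math


def dummy_stdev(p):
    """Standard deviation of a 0/1 variable that equals 1 with proportion p."""
    return math.sqrt(p * (1 - p))


for p in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(f"p = {p:.2f}  stdev = {dummy_stdev(p):.3f}")
```

A z-scored numeric field has standard deviation 1, so an unscaled dummy sits on a different scale from the scaled numeric fields around it.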

Try K-Means with Different Normalization Approaches


K-Means Clustering: Magnitude and Dummy Bias

Measurements are F statistics
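
The sketch below does not compute the F statistics shown on the slide. As a rough proxy, it reports each field's share of total variance under three normalizations, on the reasoning that K-Means minimizes summed squared Euclidean distance, so fields with more variance pull harder on the clusters. The field names and values are hypothetical.

```python
from statistics import mean, pstdev


def zscore(col):
    """Center to mean 0 and scale to standard deviation 1."""
    m, s = mean(col), pstdev(col)
    return [(v - m) / s for v in col]


def minmax(col):
    """Rescale to the range [0, 1]."""
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]


def variance_share(columns):
    """Fraction of total variance carried by each column."""
    variances = {name: pstdev(col) ** 2 for name, col in columns.items()}
    total = sum(variances.values())
    return {name: v / total for name, v in variances.items()}


# Hypothetical fields: a dollar amount, a small count, and a 0/1 dummy
raw = {
    "income": [20000, 35000, 50000, 65000, 80000, 120000],
    "n_gifts": [1, 3, 2, 8, 5, 4],
    "is_member": [0, 1, 0, 1, 1, 0],
}

scalers = [("natural units", lambda c: c), ("min-max", minmax), ("z-score", zscore)]
for label, scaler in scalers:
    scaled = {name: scaler(col) for name, col in raw.items()}
    shares = variance_share(scaled)
    print(label, {k: round(v, 3) for k, v in shares.items()})
```

In natural units the dollar field dominates (magnitude bias). After min-max scaling the balanced 0/1 dummy carries the largest share (dummy bias), because its values sit only at the two extremes of the range.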

PCA: Natural Units


PCA: Scaled Units


PCA: Scaled and Dummy Scaling
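
To illustrate the difference between natural and scaled units, here is a two-variable sketch. It uses the closed-form eigenvector of a 2x2 covariance matrix instead of a PCA library, and the two fields are hypothetical. With more than two variables a proper eigen-decomposition would be needed.

```python
import math
from statistics import mean, pstdev


def first_pc_loadings(x, y):
    """Loadings of the first principal component for two variables,
    from the closed-form eigenvector of the 2x2 covariance matrix."""
    mx, my = mean(x), mean(y)
    n = len(x)
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the major axis
    return math.cos(theta), math.sin(theta)


def zscore(col):
    """Center to mean 0 and scale to standard deviation 1."""
    m, s = mean(col), pstdev(col)
    return [(v - m) / s for v in col]


# Hypothetical fields on very different scales
income = [20000, 35000, 50000, 65000, 80000, 120000]
n_gifts = [1, 3, 2, 8, 5, 4]

natural = first_pc_loadings(income, n_gifts)
scaled = first_pc_loadings(zscore(income), zscore(n_gifts))
print("natural units:", natural)
print("scaled units: ", scaled)
```

In natural units the first component is essentially the large-magnitude field by itself. After scaling, both fields load equally.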


Missing Value Imputation

• Delete the record (row), or delete the field (column)

• Replace with a constant

• Replace missing value with mean, median, or distribution

• Replace missing with random self-substitution

• Surrogate Splits (CART)

• Make missing a category

  • Simple for “rule-based” algorithms; turn continuous into categorical for numeric algorithms

• Replace the missing value with an estimate

  • Select value from another field having high correlation with variable containing missing values

  • Build a model with variable containing missing values as output, and other variables without missing values as inputs
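
A few of the simpler options in the list above can be sketched as follows. The field `last_gift` and its values are hypothetical, and the model-based and surrogate-split options are not shown.

```python
import random
from statistics import mean, median


def impute_constant(values, constant):
    """Replace every missing value (None) with the same constant."""
    return [constant if v is None else v for v in values]


def impute_random_self(values, rng):
    """Random self-substitution: draw each replacement from the observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


def missing_indicator(values):
    """1 where the value was missing, so a model can treat 'missing' as information."""
    return [1 if v is None else 0 for v in values]


# Hypothetical field with gaps
last_gift = [10.0, None, 25.0, 5.0, None, 100.0, 15.0]
observed = [v for v in last_gift if v is not None]

print(impute_constant(last_gift, mean(observed)))    # mean fill
print(impute_constant(last_gift, median(observed)))  # median fill
print(impute_random_self(last_gift, random.Random(42)))
print(missing_indicator(last_gift))
```

Mean and median fill shrink the field's variance, while random self-substitution preserves the observed distribution at the cost of adding noise to individual records.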

CHAID Trees: Missing Values are Just Another Category


Summary

Data Preparation Step, by algorithm (Linear Regression, K-NN, K-Means Clustering, PCA, Neural Networks, Decision Trees):

• Fill Missing Values: Y Y Y Y Y *

• Correlation Filtering: Y Y Y

• De-Skew (log, box-cox): Y Y Y Y

• Mitigate Outliers: Y Y Y Y * *

• Remove Magnitude Bias (Scale): Y Y Y Y *

• Remove Categorical "Dummy" Bias: Y Y Y Y

• Mitigate Categorical Cardinality Bias: -- -- -- -- -- Y

Stratify or Not to Stratify… That is the Question!?

5.1% TARGET_B = 1: unbalanced data

Comparing Logistic Regression with and without Equal Size Sampling

Equal Sampling

No Stratified Sampling

https://www.predictiveanalyticsworld.com/sanfrancisco/2013/pdf/Day2_1050_Abbott.pdf

Don’t Need to Stratify With Many Algorithms

https://www.predictiveanalyticsworld.com/sanfrancisco/2013/pdf/Day2_1050_Abbott.pdf

Know the Algorithm when Developing Sampling Strategy

Stratified: Coeff., Std. Err., P>|z|. Natural (orig): Coeff._natural, Std. Err._natural, P>|z|_natural.

Variable        Coeff.        Std. Err.  P>|z|  Coeff._natural  Std. Err._natural  P>|z|_natural  coeff diff  coeff compare
RFA_2F          -0.133532984  0.0338     0.000  -0.1563345      0.024              0.000          0.023       within SE
D_RFA_2A        -0.163727182  0.1210     0.176  -0.0934212      0.079              0.237          0.070       within SE
F_RFA_2A        0.038231571   0.0884     0.665  0.0357819       0.062              0.565          0.002       within SE
G_RFA_2A        0.316663027   0.1267     0.012  0.2779701       0.091              0.002          0.039       within SE
DOMAIN2         -0.068966948  0.0767     0.369  -0.1169964      0.056              0.036          0.048       within SE
DOMAIN1         -0.266408264  0.0837     0.001  -0.2845323      0.060              0.000          0.018       within SE
NGIFTALL_log10  -0.46212497   0.0998     0.000  -0.4444304      0.072              0.000          0.018       within SE
LASTGIFT_log10  0.062766545   0.2044     0.759  0.1813683       0.141              0.199          0.119       within SE
Constant        0.695770991   0.2785     0.012  3.5393926       0.194              0.000          2.844       outside SE
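
Only the Constant falls outside the standard error, which matches a well-known property of logistic regression: resampling on the target mainly shifts the intercept by the change in the log-odds of the base rate. The sketch below compares that expected shift with the one in the table. It assumes the equal-size sample is a 50/50 mix, which the slides do not state explicitly, and it compares magnitudes only.

```python
import math


def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))


natural_rate = 0.051   # from the deck: 5.1% of records have TARGET_B = 1
stratified_rate = 0.5  # assumption: equal-size sampling gives a 50/50 mix

expected_shift = abs(log_odds(stratified_rate) - log_odds(natural_rate))
observed_shift = abs(3.5393926 - 0.695770991)  # the two Constant values in the table

print(f"expected intercept shift: {expected_shift:.3f}")
print(f"observed intercept shift: {observed_shift:.3f}")
```

The two values are close (about 2.92 versus 2.84), so the difference in the Constant is roughly what the change in base rate alone would produce.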

Input Variable Interactions

• Algorithms are mixed on interactions in theory

  • Linear Regression, Logistic Regression, kNN, kMeans clustering, PCA… are main-effect models

  • Decision trees are greedy searchers

    • Built to find interactions

    • But only if they can be found in sequence (one at a time, stepwise)

  • Neural Networks find interactions well (XOR)

  • Naïve Bayes finds intersections, not interactions

• Algorithms don’t always identify interactions well, or well enough, in practice


Simple Interaction Function

• Two uniform variables: x and y

• 2,564 records

• if ( x*y > 0 ) return ("1");

• else return("0");
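
The function above can be sketched in Python as follows. The slide does not give the range of the uniform variables; the sketch assumes they are uniform on [-1, 1], since the rule only produces both classes when the range straddles zero. It then checks how well a main-effect rule on x alone does compared with the interaction term.

```python
import random


def label(x, y):
    """The slide's rule: class "1" when x*y > 0, else "0"."""
    return "1" if x * y > 0 else "0"


random.seed(7)
n = 2564  # record count from the slide
# Assumption: x and y are uniform on [-1, 1]
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
labels = [label(x, y) for x, y in points]

# A main-effect rule that looks at x alone does no better than chance
acc_x_only = sum((x > 0) == (lab == "1") for (x, _), lab in zip(points, labels)) / n
# The interaction term x*y separates the classes by construction
acc_product = sum((x * y > 0) == (lab == "1") for (x, y), lab in zip(points, labels)) / n

print(f"accuracy using x alone: {acc_x_only:.3f}")
print(f"accuracy using x*y:     {acc_product:.3f}")
```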


Four Classifiers

[Figure panels: Naïve Bayes; Decision Tree, min leaf node 50 records; Logistic Regression; Rprop Neural Net, 300 epochs]

Errors

[Figure panels: Naïve Bayes; Decision Tree, min leaf node 50 records; Logistic Regression; Rprop Neural Net, 300 epochs]

Legend: True correct; False incorrect; False correct; True incorrect

Don’t Build Interactions Manually*

• Too many…too many

• So what do you do?

* Except for those you know about

Automatic Interaction Detection

• Trees: build 2-level trees

  • Pros: works with continuous and categoricals

  • Cons: greedy, only finds one solution at a time (Battery)

• Association rules: build 2-antecedent rules

  • Pros: exhaustive

  • Cons: only works with categoricals

• Use the linear/logistic regression algorithm itself, loop over all 2-way interactions

  • Pros: context is the model you may want to use, easy to do in R, Matlab, Python, SAS (coding)

  • Cons: slow, have to code, what to do with dummies
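
The loop over all 2-way interactions might look like the sketch below. To stay short and dependency-free it does not fit a regression per pair; it swaps in a cheaper screen that ranks each product term by its absolute correlation with the target. The inputs `x1` to `x5` and the target are synthetic, and the function name `screen_two_way` is hypothetical.

```python
import itertools
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from sums of squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def screen_two_way(columns, target):
    """Score every pair of inputs by |corr(x_i * x_j, target)|, best first."""
    scores = []
    for a, b in itertools.combinations(sorted(columns), 2):
        product = [u * v for u, v in zip(columns[a], columns[b])]
        scores.append((abs(pearson_r(product, target)), a, b))
    return sorted(scores, reverse=True)


random.seed(3)
n = 1000
# Hypothetical inputs; only the x1*x3 interaction drives the target
columns = {f"x{i}": [random.uniform(-1, 1) for _ in range(n)] for i in range(1, 6)}
target = [1 if u * v > 0 else 0 for u, v in zip(columns["x1"], columns["x3"])]

ranked = screen_two_way(columns, target)
print("pairs to test:", math.comb(len(columns), 2))
for score, a, b in ranked[:3]:
    print(f"{a}*{b}: {score:.3f}")
```

The number of pairs grows as n(n-1)/2, so 5 inputs give 10 candidates while 100 inputs give 4,950, which is why the search is worth automating.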


Summing up what we’ve covered

Is this a Recipe?

Is it a Recipe?....YES!

Can we apply a recipe to machine learning and data science modeling processes?

Conclusions

• Know what the algorithms can do (and not do!) before deciding on data preparation

• When are data shapes and data ranges important?

• It’s not hard… it just requires some thought

• Once you know what to do, you have your recipe!
