Mastering the 80% of Analytics: What Data Scientists Really Do

Post on 22-Mar-2017

Mastering the 80% of Analytics

What Data Scientists Really Do

Mik, PhD @AvrioAnalytics

[Image panels: Boss, Parents, Me, Reality]

Caveat

•Data Science is very broad

•This is a particular perspective

•Mathematician

•Predictive algorithm developer

•Very brief

A Day in the Life

[Diagram labels: Wrangling, Features, Modeling, Results]

What is “Wrangling”?

•Data:

•Getting

•Formatting

•Cleaning

Data Janitorial Work

Getting the Data

•Myriad of sources

•Varying collection, storage and maintenance

•Most people just don’t care

•At least not soon enough

Got it. Now what?

•Structured: in a consistent and defined format

•Unstructured: no consistent format

•Text data

Structured:

Movie      Rating
Star Wars  5 Stars

Unstructured:

I loved the new Star Wars, definitely 5/5 stars!
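As a rough illustration of turning the unstructured review into the structured row, here is a minimal sketch. The function name `extract_rating` and the regular expression are assumptions made for this example, and the movie title is hard-coded rather than detected.

```python
import re

def extract_rating(review: str):
    """Return (stars, out_of) parsed from free text, or None if no rating is found."""
    # Matches patterns such as "5/5 stars" or "4 / 5 star".
    match = re.search(r"(\d+)\s*/\s*(\d+)\s*stars?", review, flags=re.IGNORECASE)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

review = "I loved the new Star Wars, definitely 5/5 stars!"
# The title is supplied by hand here; finding it in the text is a harder problem.
record = {"movie": "Star Wars", "rating": extract_rating(review)}
print(record)
```

Real text rarely follows one pattern, so a rule like this would likely cover only a fraction of reviews.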

Formatting

•Alignment

•Unions, intersections, grouping

•Transformations

Formatting

Time   Username  Views
12:30  jsmith    32
12:45  mik       27
1:00   dmartin   8
1:15   jsmith    46

Time   Username  Views
12:20  gwarren   12
12:30  lpeabody  53
12:40  dmartin   20
12:50  hjohnson  5

Formatting

Username Views

jsmith 32, 46
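The union, intersection and grouping steps can be sketched on the two logs from the slide. This uses only the Python standard library, and the variable names are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict

# The two logs from the slide, as (time, username, views) rows.
log_a = [("12:30", "jsmith", 32), ("12:45", "mik", 27),
         ("1:00", "dmartin", 8), ("1:15", "jsmith", 46)]
log_b = [("12:20", "gwarren", 12), ("12:30", "lpeabody", 53),
         ("12:40", "dmartin", 20), ("12:50", "hjohnson", 5)]

# Union: stack both sources into a single table.
combined = log_a + log_b

# Grouping: collect every view count under its username.
views_by_user = defaultdict(list)
for _time, username, views in combined:
    views_by_user[username].append(views)

# Intersection: usernames that appear in both sources.
in_both = {u for _, u, _ in log_a} & {u for _, u, _ in log_b}

print(views_by_user["jsmith"])  # [32, 46]
print(sorted(in_both))          # ['dmartin']
```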

Data is Dirty Business

•Duplicates

•Missing values

• Ill-formed values

•Wrong values

Similar in effect
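A minimal sketch of flagging the four kinds of dirt. The records and the rule that a view count cannot be negative are hypothetical, chosen only to make each category visible.

```python
# Hypothetical records; None marks a missing value.
rows = [
    {"username": "jsmith", "views": "32"},
    {"username": "jsmith", "views": "32"},      # duplicate
    {"username": "mik", "views": None},         # missing value
    {"username": "dmartin", "views": "eight"},  # ill-formed value
    {"username": "gwarren", "views": "-12"},    # well-formed but wrong
]

def classify(row, seen):
    """Label one row as duplicate, missing, ill-formed, wrong, or ok."""
    key = (row["username"], row["views"])
    if key in seen:
        return "duplicate"
    seen.add(key)
    if row["views"] is None:
        return "missing"
    try:
        views = int(row["views"])
    except ValueError:
        return "ill-formed"
    # Assumed domain rule: a view count cannot be negative.
    return "wrong" if views < 0 else "ok"

seen = set()
labels = [classify(row, seen) for row in rows]
print(labels)
```

Wrong values are the hardest to catch because they parse cleanly; spotting them usually takes a domain rule like the one assumed here.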

Types of Missing-ness

•MCAR: Missing Completely at Random

•MAR: Missing at Random

•MNAR: Missing Not at Random

Bad

Worse
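To see why the mechanism matters, the sketch below compares the sample mean after MCAR loss and after one form of MNAR loss. The data are simulated, the 30% loss rate is arbitrary, and MAR is not simulated because it needs a second observed variable.

```python
import random
from statistics import mean

random.seed(0)
# Hypothetical complete data set: 1,000 draws from a normal distribution.
values = [random.gauss(50, 10) for _ in range(1000)]

# MCAR: every value has the same 30% chance of being lost.
mcar = [v for v in values if random.random() > 0.3]

# MNAR: whether a value is lost depends on the value itself;
# here the largest 30% of values go unrecorded.
cutoff = sorted(values)[int(0.7 * len(values))]
mnar = [v for v in values if v < cutoff]

print(f"full mean: {mean(values):.2f}")
print(f"MCAR mean: {mean(mcar):.2f}")
print(f"MNAR mean: {mean(mnar):.2f}")
```

Under MCAR the estimate stays close to the full-data mean and only the sample size shrinks. Under this MNAR pattern the mean is biased downward.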

Dealing with Missing Data

•Deletion

•Pairwise

•Listwise

X Y Z
129 1 40
110 3
210 32
989 65

Pairwise

X Z
129 40
210 32

Listwise

X Y Z
129 40
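The two deletion strategies can be sketched as follows. The rows are modeled on the first three rows of the slide's table, and where the missing cells fall is an assumption, since the slide layout did not survive extraction.

```python
# Three rows modeled on the slide's table; None marks a missing value.
rows = [
    {"X": 129, "Y": 1, "Z": 40},
    {"X": 110, "Y": 3, "Z": None},
    {"X": 210, "Y": None, "Z": 32},
]

def listwise(rows):
    """Keep only rows with no missing value in any column."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, a, b):
    """Keep rows where both columns of interest are present."""
    return [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]

print(listwise(rows))            # one complete row survives
print(pairwise(rows, "X", "Z"))  # [(129, 40), (210, 32)]
print(pairwise(rows, "X", "Y"))  # [(129, 1), (110, 3)]
```

Pairwise deletion keeps more data per analysis, but each pair of columns then rests on a different subset of rows.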

Dealing with Missing Data

• Imputation

•Mean substitution

•Regression
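Both imputation styles can be sketched in a few lines. The function names and the toy data are assumptions for illustration, and the regression variant fits a plain least-squares line on the complete pairs only.

```python
from statistics import mean

def mean_substitution(values):
    """Replace each missing value with the mean of the observed values."""
    observed_mean = mean(v for v in values if v is not None)
    return [observed_mean if v is None else v for v in values]

def regression_imputation(xs, ys):
    """Fill missing ys with predictions from a least-squares line fit on complete pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    slope = (sum((x - mx) * (y - my) for x, y in pairs)
             / sum((x - mx) ** 2 for x, _ in pairs))
    intercept = my - slope * mx
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]

# Hypothetical data where y is roughly 2 * x and one y is missing.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, None, 8.0]
print(mean_substitution(ys))
print(regression_imputation(xs, ys))
```

Mean substitution ignores the relationship with x, so it fills the gap with about 4.67. The regression fill lands near 6, which matches the trend in the data.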

Dealing with Missing Data

•Multiple Imputation

•Stochastic simulation

•Must know distribution
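A simplified sketch of the idea: fill the gaps several times with random draws from an assumed distribution, analyse each completed data set, then pool the results. The normal distribution and the sample values are assumptions. A full procedure would also carry the uncertainty of the fitted parameters and pool variances, which is omitted here.

```python
import random
from statistics import mean, stdev

random.seed(1)
# Hypothetical sample; None marks a missing value.
data = [4.1, 5.3, None, 6.0, 4.8, None, 5.5]
observed = [v for v in data if v is not None]

# Assumed distribution for the missing values: normal, fitted to the observed data.
mu, sigma = mean(observed), stdev(observed)

def impute_once(values):
    """Fill each gap with one random draw from the assumed distribution."""
    return [random.gauss(mu, sigma) if v is None else v for v in values]

# Build m completed data sets, analyse each, then pool the estimates.
m = 20
estimates = [mean(impute_once(data)) for _ in range(m)]
pooled_mean = mean(estimates)
print(f"pooled mean over {m} imputations: {pooled_mean:.3f}")
```

The spread of `estimates` gives a rough feel for how much the missing values contribute to the uncertainty of the result.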

Gotchas

•Sampling Error

•Statistical Power

•Population Parameters

•Propagation

So what do I do?

•Approaches vary quite a lot

•MCAR, MAR hard to prove

•Principle of Least Harm

60% - 80% of Work

Cleaning Done! Now the fun!

•Almost…

•Clean data is still “raw”

•Features: pre-processed for modeling

Feature Engineering

•A lot of data is useless

•Filter, slice, transform

•Singular idea: What’s the main driver?

Feature Engineering

•Considerations

•Relevance

•Redundancy

•Curse of Dimensionality
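One simple way to act on redundancy is to drop features that are almost perfectly correlated with a feature already kept. The sketch below assumes a 0.95 threshold and made-up columns. It is one possible filter, not the method used in the talk.

```python
from math import sqrt
from statistics import mean

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length columns."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

# Hypothetical features: height in cm and in inches carry the same information.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [38.0, 44.0, 39.0, 45.0, 41.0],
}

def drop_redundant(features, threshold=0.95):
    """Greedily keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

print(drop_redundant(features))  # ['height_cm', 'shoe_size']
```

Fewer, less redundant features also ease the curse of dimensionality, since the model has fewer directions to fit with the same amount of data.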

Feature Engineering Methods

• PCA

• Edge Detection

• Blob Detection

• Auto encoding

• Kernel PCA

• Partial Least Squares

• Generalized Least Squares

• Direct Modeling

• Isomapping

• Mutual Information Theory

• Information Entropy Theory

• ICA

• MDR

• Latent Factors

• MPCA

• LSA

• Statistical Moments

• Random Projections

•De-Noising

•Weighting

•Patch Extraction

•Functional Mapping

•Discretization

•Filtering

•FFT

•Smoothing

•Density Mapping

Feature Engineering

• It’s hard

•Analysis + Domain knowledge

•…Deserves a presentation on its own

•Features are input to machine learning

Now the fun stuff (finally)

•ML: computer acts without being explicitly programmed

•Utilizes empirical data to “teach” a process

•Pattern Rec. -> ML -> Deep Learning

•Buzzwords abound

•Fairly simple, lots of libraries

ML Approaches

•Classes of problems

•Continuous (regression)

•Discrete (classification)

•Classes of solutions

•Supervised

•Unsupervised

ML Algorithms

•Neural Networks

•Genetic Algorithms

•Bayesian Classification

•Support Vector Machines

•Many are also used as a type of feature extraction

Neural Networks

•Motivated by brain function

•Neurons fire, activate paths

•Non-linear

•Simplest: Perceptron

[Diagram: inputs X1 and X2, with weights w1 and w2, feeding a Logic Layer]

Neural Networks

• Inputs feed the neuron, each with a weight

•Logic Layer: activation function

•Fires (or not) based on inputs

•Weights come from minimizing a cost function

•Backpropagation
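The bullets above can be sketched with a single neuron trained by gradient descent. The AND data set, the bias term, the learning rate and the cross-entropy cost are all assumptions chosen to keep the example small. With only one neuron, backpropagation reduces to this one gradient step.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Training data for logical AND (an illustrative choice, not from the talk).
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]

w1, w2, bias = 0.0, 0.0, 0.0
learning_rate = 1.0

for _ in range(5000):
    # Forward pass: weighted sum of inputs through the activation function.
    outputs = [sigmoid(w1 * x1 + w2 * x2 + bias) for x1, x2 in inputs]
    # Gradient of the mean cross-entropy cost with respect to each weight.
    errors = [o - t for o, t in zip(outputs, targets)]
    n = len(inputs)
    w1 -= learning_rate * sum(e * x[0] for e, x in zip(errors, inputs)) / n
    w2 -= learning_rate * sum(e * x[1] for e, x in zip(errors, inputs)) / n
    bias -= learning_rate * sum(errors) / n

def fires(x1, x2):
    """The neuron 'fires' when its activation exceeds 0.5."""
    return sigmoid(w1 * x1 + w2 * x2 + bias) > 0.5

print([fires(x1, x2) for x1, x2 in inputs])  # [False, False, False, True]
```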

Sigmoid Logic Layer

[Plot: sigmoid curves for w = 1 and w = 2; x-axis from -10 to 10, y-axis from 0 to 1]

$$\frac{1}{1 + e^{-w^{T}x}}$$
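A small sketch that evaluates the slide's sigmoid for a scalar input at w = 1 and w = 2. The function signature is an assumption for this example.

```python
import math

def sigmoid(x, w=1.0):
    """Logistic activation 1 / (1 + exp(-w * x)) for a scalar input and weight."""
    return 1.0 / (1.0 + math.exp(-w * x))

# Compare the two curves from the plot at a few points on the x-axis.
for x in (-10, -2, 0, 2, 10):
    print(x, round(sigmoid(x, w=1), 4), round(sigmoid(x, w=2), 4))
```

Both curves pass through 0.5 at x = 0. The larger weight makes the transition from 0 to 1 steeper.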

Neural Networks

•Most networks are bigger

[Diagram: inputs X1, X2; hidden units A1 ... AM; outputs Y1 ... YK]
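A forward pass through a network of this shape might look like the sketch below. The layer sizes, the random weights, the bias terms and the helper names are assumptions, and no training is shown.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_layer(n_in, n_out, rng):
    """One weight row per output unit; the last entry in each row is the bias."""
    return [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_out)]

def forward(layer, inputs):
    """Weighted sum plus bias for each unit, passed through the activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + row[-1]) for row in layer]

rng = random.Random(0)
N, M, K = 2, 4, 3          # inputs, hidden units, outputs (arbitrary sizes)
hidden_layer = make_layer(N, M, rng)
output_layer = make_layer(M, K, rng)

x = [0.5, -1.2]
a = forward(hidden_layer, x)   # A1 ... AM
y = forward(output_layer, a)   # Y1 ... YK
print(y)
```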

Machine Learning

•Got data, features and algorithm

•Just plug in and profit!

•Not quite

•Tuning and training

Tuning

•What about N, M and K?

•Hyper-parameters

•Size of layers, thresholds, etc.

•Static specifics of the algorithm

[Diagram: inputs X1, X2; hidden units A1 ... AM; outputs Y1 ... YK]
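To make the idea of hyper-parameters concrete, the sketch below enumerates a small grid and shows how the choice of M changes the number of weights to be learned. The grid values and the fully connected, one-bias-per-unit layout are assumptions. In practice each configuration would be trained and scored on held-out data.

```python
from itertools import product

def weight_count(n_inputs, n_hidden, n_outputs):
    """Trainable weights in a fully connected N-M-K network, one bias per unit."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs

# Hypothetical search grid: fixed before training, never learned from the data.
grid = {"n_hidden": [2, 8, 32], "threshold": [0.3, 0.5]}

N, K = 2, 1
for n_hidden, threshold in product(grid["n_hidden"], grid["threshold"]):
    print(f"M={n_hidden:>2} threshold={threshold} -> {weight_count(N, n_hidden, K)} weights")
```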

Training

• It’s all about the teaching

•Representative data set

•Large, clean

Training

•Don’t teach to the test

•Causes overfitting

•Training (80%) and Testing (20%) data

•Cross-validation
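The 80/20 split and k-fold cross-validation can be sketched with the standard library alone. The helper names and the choice of five folds are assumptions. Libraries such as scikit-learn offer equivalents, but none are used here.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data, then hold out the last test_fraction for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold(items, k=5):
    """Yield (train, validation) pairs; each item lands in exactly one validation fold."""
    folds = [items[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

data = list(range(100))
train, test = train_test_split(data)
print(len(train), len(test))  # 80 20

# Cross-validate inside the training portion only; the test set stays untouched.
for fold_train, fold_val in k_fold(train, k=5):
    print(len(fold_train), len(fold_val))  # 64 16
```

Keeping the test set out of every tuning decision is what prevents teaching to the test.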

With all the open source libraries, isn’t machine learning easy now?

I got results!

•Why doesn’t anyone care?

•Kaggle vs. Real Life Syndrome

• It’s all in the presentation

It’s all in the presentation

•Complex topic

•Non-technical audience

•Several stakeholders

•Many likely skeptics

It’s all in the presentation

•Avoid buzzwords

•Focus on a business problem

•Show value

•Keep in mind cost

Is it actually science?

•Sometimes

•…but often not

•Data Science vs. Data Engineering

• It should be: focus on why

Is it actually science?

[Venn diagram: Applied Math, Computer Science, Domain Expertise]

[Venn diagram: Applied Math, Computer Science, Physics; labeled Physicist]

Why Data Science?

•Big problems, fun challenges

•Both science and business

•Consistently awesome

2012: Sexiest Job of the Century

2016: Best Job of the Year

2016: Hottest Job of the Year

2016: Best Career Opportunity

Why Data Science?

Salary

So, want to get started?

•Theano

•TensorFlow

•Torch

•Pandas

Tomorrow is here

www.avrioanalytics.com