BUILDING BETTER CREDIT MODELS THROUGH DEPLOYABLE ANALYTICS IN R

Robert Krzyzanowski, Director of Data Engineering at Avant

Posted on 22-Dec-2015

Our Initial Problem

• Gap between complicated non-deployable models and need for production-ready solutions.

Netflix never implemented the algorithm that won the Netflix $1 million challenge.

• Frustration in developing models in one language (R, Python, etc.) and productionizing them in another (C++, Java, etc.).

• Result: advanced and complicated models are rarely used in production.

Development = Deployment

Industry Standard

• Software Engineer: get real-time data, validate deployed models, implement sources
• Statistician: refit models, write new analytical methods, model card creation
• Separate processes in different groups, leading to translation errors

Data Scientist

• Performs both Software Engineer and Statistician functions
• No separate process to re-code model for production
• Days vs. months
• Agile development
• Development = Production

The Problem of Data Preparation

Problem:

"Data scientists spend 50 to 80 percent of their time collecting and preparing digital data."
- New York Times 08/18/2014

"Good feature engineering is often more important for classifier performance than model selection."
- Google Research Paper (2015)

Solution:

Re-define machine learning.

Why Data Preparation?

Definition: A machine learning model is a trained statistical predictor applied to a cleaned up data set.

Wrong.

Definition: A machine learning model is

1. A trained data preparation applied to raw production data.

2. A trained statistical predictor applied to the results of the trained data preparation.

Proof of Definition

Two Models:

Direct Mail Response in Eastern States:
Variable state levels: New York, Massachusetts, New Jersey

Direct Mail Response in Western States:
Variable state levels: California, Oregon, Washington

Need two data pipelines to restore categorical levels.
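A minimal sketch of why each model needs its own pipeline, in base R. The helper names train_levels and restore_state_levels are hypothetical and are not part of the framework described in the talk; the point is only that the levels learned in training must be replayed on each new row.

```r
# Hypothetical helper: remember the factor levels seen in training.
train_levels <- function(x) levels(factor(x))

# Re-create the factor in production with the trained levels, so that a
# single new value still carries every level the predictor expects.
restore_state_levels <- function(x, trained) factor(x, levels = trained)

east <- train_levels(c("New York", "Massachusetts", "New Jersey", "New York"))
west <- train_levels(c("California", "Oregon", "Washington"))

new_customer <- "New Jersey"
east_value <- restore_state_levels(new_customer, east)
# Under the Western model's levels the same value is unseen and becomes NA.
west_value <- restore_state_levels(new_customer, west)
```

The same raw value is prepared differently depending on which model's trained levels are applied, which is why the preparation has to travel with the model.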

Proof of Definition

Two Models:

Model A – Seed 100
Impute variable “inquiries” with mean of 0.2

Model A – Seed 101
Impute variable “inquiries” with mean of 0.3

Need two data pipelines to replace NA with mean in production.
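The seed dependence can be sketched in base R as follows. The data and the train_impute helper are made up for illustration (they are assumptions, not the talk's code): each seed selects a different training split, so each trained imputation step remembers its own mean.

```r
# Hypothetical data: a credit variable with missing values.
inquiries <- c(0, 1, NA, 0, 2, NA, 0, 0, 1, 3, NA, 0)

# Train an imputation step on a seeded training split and return a
# closure that replays the learned mean on new data.
train_impute <- function(x, seed, trainpct = 0.5) {
  set.seed(seed)
  train_rows <- sample(length(x), size = floor(trainpct * length(x)))
  learned_mean <- mean(x[train_rows], na.rm = TRUE)
  function(new_x) {
    new_x[is.na(new_x)] <- learned_mean
    new_x
  }
}

impute_seed_100 <- train_impute(inquiries, seed = 100)
impute_seed_101 <- train_impute(inquiries, seed = 101)

# Each pipeline fills a missing production value with its own trained mean.
impute_seed_100(NA_real_)
impute_seed_101(NA_real_)
```

Returning a closure is one way to keep the learned mean attached to the step that will run in production.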

How to Apply the Definition

1. Define a framework and grammar so data scientists can clean data without ever having to repeat the process when “productionizing”.

2. Define a mapping from the development framework to a production system: then production-ready machine learning is free and identical to the experimental environment.

3. Data scientists now both write the data preparation on raw production data and apply the statistical classifier.

To master data science you must master data preparation.

- Confucius

A Harder Example

Goal: Impute some variables and discretize some variables, so the resulting data preparation is production-ready.

    list(
      "Impute variables"     = list(impute, c("var1", "var2")),
      "Discretize variables" = list(discretize ~ restore_levels, is.numeric)
    )

New real-time customers must be scored in < 1 second on EC2. Cannot discretize a 1-row data.frame (new customer).

Must be careful to use a different function to achieve the same behavior.

A Harder Example

    discretize <- function(variable) {
      variable <- arules::discretize(variable)
      # If levels are [0, 3), [3, 6), [6, 10),
      # cutoffs will be 3, 6, Inf
      cutoffs <- c(as.numeric(
        gsub("^[^0-9]*([0-9]+).*$", "\\1", levels(variable)[-1])), Inf)
      # list("[0, 3)" = 3, "[3, 6)" = 6, "[6, 10)" = Inf)
      input$cutoffs <- setNames(cutoffs, levels(variable))
      variable
    }

    restore_levels <- function(variable) {
      factor(vapply(variable, function(val) {
        names(input$cutoffs)[which.max(val < input$cutoffs)]
      }, character(1)), levels = names(input$cutoffs))
    }

(can replace with Rcpp version to make it faster)

A Harder Example

    input <- new.env()

    var <- discretize(1:10)
    # input$cutoffs is:
    # list("[0, 3)" = 3, "[3, 6)" = 6, "[6, 10)" = Inf)
     [1] [0, 3)  [0, 3)  [3, 6)  [3, 6)  [3, 6)
     [6] [6, 10) [6, 10) [6, 10) [6, 10) [6, 10)
    Levels: [0, 3) [3, 6) [6, 10)

    discretize(5)
    Error: 'breaks' are not unique

    restore_levels(5)
    [1] [3, 6)
    Levels: [0, 3) [3, 6) [6, 10)

    restore_levels(0:11)
     [1] [0, 3)  [0, 3)  [0, 3)  [3, 6)  [3, 6)  [3, 6)
     [7] [6, 10) [6, 10) [6, 10) [6, 10) [6, 10) [6, 10)
    Levels: [0, 3) [3, 6) [6, 10)

A Harder Example

Key Point:

Identical mathematical operations require different logic in train versus predict.

You must train your data preparation just like you train your model.
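The train/predict split can be sketched without extra packages. This version swaps in base R cut() and findInterval() for arules::discretize, and the names discretize_train and discretize_predict are hypothetical; it is an illustration of the idea, not the framework's implementation.

```r
# Shared state written by the train step and read by the predict step
# (an assumption that mirrors the `input` environment on the slide).
input <- new.env()

# Train: choose equal-width bins from the data and remember the breaks.
discretize_train <- function(variable, bins = 3) {
  breaks <- seq(min(variable), max(variable), length.out = bins + 1)
  input$breaks <- breaks
  cut(variable, breaks = breaks, include.lowest = TRUE, right = FALSE)
}

# Predict: reuse the stored breaks, so a single value can be binned.
discretize_predict <- function(variable) {
  breaks <- input$breaks
  labels <- levels(cut(breaks[1], breaks = breaks,
                       include.lowest = TRUE, right = FALSE))
  # all.inside = TRUE clamps out-of-range values into the first or last bin.
  idx <- findInterval(variable, breaks, rightmost.closed = TRUE,
                      all.inside = TRUE)
  factor(labels[idx], levels = labels)
}

trained <- discretize_train(1:10)
one_row <- discretize_predict(5)
```

The predict function never recomputes breaks from its input, which is what lets it handle a 1-row data.frame column.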

A Harder Example

    list(
      import = indicator ~ data_source1 + data_source2,
      data = list(
        "Impute variables"     = list(impute, c("var1", "var2")),
        "Discretize variables" = list(discretize ~ restore_levels, is.numeric)
      ),
      model = list("glmnet", link = "binomial", alpha = 0.5),
      export = list(s3 = "location/of/model")
    )

Final toy model

The Result

• Our most complex models went from 1,000s of lines of code to 100.

• Zero code deployment: development = production

• Modularity: can re-use credit model data preparation and methods for lead conversion model, or direct mail for collections model

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity
    )

• delinquency(...): 0/1 or continuous-valued dependent variable
• transunion: 612 variables
• tu313: 312 variables
• tradeline: 217 variables
• The import entry is an R formula object.

4 Avant R packages using database connections + caching layers instantly translate this R formula to a live data source yielding a data.frame with

• 100,000s – 1,000,000s of rows
• 1,000s of cols

Adding new data sources = Easy
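Because the import entry is an ordinary R formula, its parts can be inspected without evaluating them. The sketch below uses only base R; how Avant's packages actually map these parts to database queries is not shown in the slides, so the variable names here are illustrative assumptions.

```r
# The import formula from the slide, trimmed to three sources.
import <- delinquency(num_installments = 3, days_delinquent = 7) ~
  transunion + tu313 + tradeline

# A formula is a language object, so neither side is evaluated here.
lhs <- import[[2]]

# Name of the indicator and the arguments passed to it.
indicator_name <- as.character(lhs[[1]])
indicator_args <- as.list(lhs)[-1]

# Names of the data sources on the right-hand side.
data_sources <- all.vars(import[[3]])
```

A framework could dispatch on indicator_name and data_sources to decide which queries to run, which is one plausible reason adding a new source only needs a new term in the formula.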

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,

      data = list(
          "Make validation set"              = list(make_validation_set, seed = seed, trainpct = trainpct)
        , "Set numeric cols"                 = list(coltrans(as.numeric), numer…
        , "Remove rows w > 90% missing"      = list(select_rows ~ NULL, …
        , "Convert dates to num"             = list(parse_datetime, date_vars)
        , "Handle mixed factor-numeric"      = list(parse_mixed_vars, mix…
        , "Handle TU NA codes"               = list(value_replace, lapply(c('T',…
        , "Change NAs with new level"        = list(value_replace, is.factor…
        , "Sweep up remaining levels"        = list(group_minor, min = 0.05…
        , "Restore categorical variables"    = list(restore_factors, is…
        , "Remove 0-variance columns"        = list(drop_variables, funct…
        , "Remove highly correlated columns" = list(drop_highly_corr…
        , "Sure independence screening"      = list(SIS, exclude = "dep_…
        , . . .

• 1 line of code per data preparation step encourages DRY (Don’t Repeat Yourself) code and easier testing.

• Parsing engine translates the "select_rows ~ NULL" grammar to mean “do this in training but not in live production”.

• Data scientist is forced to ensure data preparation works in production while developing the model. Avoid later headaches and angry girlfriends when the model breaks at midnight.

• Train versus predict duality is present on the final model object.

• Model will predict on raw (not clean) production data, either 1-row or 1,000,000-row data.frames, in interactive R mode or on a deployed Amazon EC2 instance, without extra code.

• Data scientist can re-run data steps and examine or visualize data to interactively build and debug data preparation. No need to start from scratch or keep duplicate copies of data. R memory problem = Solved

    > run("data/Sweep")    # special notation defined by framework
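One way to read the step grammar is as "train function ~ predict function", with NULL meaning "skip in that mode". The runner below is a hypothetical sketch of that interpretation in base R; run_steps, to_numeric, and drop_sparse_rows are invented names and this is not the framework's parsing engine.

```r
# Hypothetical step runner: each step is list(fn, ...) where fn is either
# a function (used in both modes) or a formula `train_fn ~ predict_fn`.
run_steps <- function(df, steps, mode = c("train", "predict")) {
  mode <- match.arg(mode)
  for (name in names(steps)) {
    step <- steps[[name]]
    fn <- step[[1]]
    if (inherits(fn, "formula")) {
      side <- if (mode == "train") fn[[2]] else fn[[3]]
      fn <- eval(side, environment(fn))
    }
    if (is.null(fn)) next  # NULL means "skip this step in this mode"
    df <- do.call(fn, c(list(df), step[-1]))
  }
  df
}

# Illustrative steps, loosely modeled on two entries of the slide.
to_numeric <- function(df, cols) {
  df[cols] <- lapply(df[cols], as.numeric)
  df
}
drop_sparse_rows <- function(df, max_missing = 0.9) {
  df[rowMeans(is.na(df)) <= max_missing, , drop = FALSE]
}

steps <- list(
  "Set numeric cols"            = list(to_numeric, cols = "a"),
  "Remove rows w > 90% missing" = list(drop_sparse_rows ~ NULL)
)

raw <- data.frame(a = c("1", NA, "3"), b = c(NA, NA, 2))
trained <- run_steps(raw, steps, "train")        # sparse row is dropped
scored  <- run_steps(raw[2, ], steps, "predict") # row is kept and scored
```

In predict mode the row filter is skipped, so a new customer with mostly missing data still receives a prepared row rather than disappearing from the output.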

Rapid Experimentation

Answering new business questions is easy: change the indicator query.

    list(
      import = fraud(num_installments = 2, collection_window = 30) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,

      data = list(…),
      . . .

Comment out or hyper-parameterize data preparation and re-run the full model to test the effect of feature engineering on final model performance.

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,

      data = list(
          "Make validation set"              = list(make_validation_set, seed = …
        , "Set numeric cols"                 = list(coltrans(as.numeric), numer…
        , "Remove rows w > 90% missing"      = list(select_rows ~ NULL, …
        , "Convert dates to num"             = list(parse_datetime, date_vars)
       #, "Handle mixed factor-numeric"      = list(parse_mixed_vars, mix…
        , "Handle TU NA codes"               = list(value_replace, lapply(c('T',…
        , "Change NAs with new level"        = list(value_replace, is.factor…
        , "Sweep up remaining levels"        = list(group_minor, min = 0.05…
        , "Restore categorical variables"    = list(restore_factors, is…
        , "Remove 0-variance columns"        = list(drop_variables, funct…
        , "Remove highly correlated columns" = list(drop_highly_corr…
        , "Sure independence screening"      = list(SIS, exclude = "dep_…
        , . . .

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,

      data = list(…),

      model = list("glmnet", link = "bernoulli", alpha = 0.5)
    )

Any of 6,500+ CRAN, BioConductor, or in-house stats packages can be incorporated with a light-weight wrapper. Only the model entry changes:

      model = list("mars", nk = knots, degree = deg)

      model = list("gbm", trees = 20000, shrinkage = 0.001)

Can use parallel, snow, etc. to parallelize models that require many cores/nodes.

Parametrization of the full modeling process is configurable.

Different grammar exists for ensemble models, sequential models, etc.
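The slides do not show what a light-weight wrapper looks like. As a rough sketch under that caveat, a wrapper could be a pair of train and predict functions around any fitting routine; make_wrapper and the column name y are hypothetical, and stats::glm stands in for a CRAN modeling package.

```r
# Hypothetical wrapper shape: pair a train function with a predict
# function and return a scorer that closes over the fitted object.
make_wrapper <- function(train, predict) {
  function(df, ...) {
    fit <- train(df, ...)
    function(newdata) predict(fit, newdata)
  }
}

# Wrap stats::glm as a logistic classifier on a response column `y`.
logistic <- make_wrapper(
  train = function(df, ...) {
    glm(y ~ ., data = df, family = binomial(), ...)
  },
  predict = function(fit, newdata) {
    unname(stats::predict(fit, newdata = newdata, type = "response"))
  }
)

# Simulated data, for illustration only.
set.seed(1)
df <- data.frame(x = rnorm(50))
df$y <- rbinom(50, 1, plogis(df$x))

scorer <- logistic(df)
p <- scorer(data.frame(x = 0.5))
```

Keeping the interface to two functions is what would make swapping "glmnet" for "gbm" a one-line change in the model entry.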

Rapid Experimentation

Train models in parallel to solve different business questions.

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,
      data = list(…),
      model = list("glmnet", link = "bernoulli", alpha = 0.5)
    )

    list(
      import = fraud(num_installments = 3, collection_window = 30) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,
      data = list(…),
      model = list("gbm", trees = 20000, shrinkage = 0.001)
    )

    list(
      import = collection(delinquency_window = 60) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,
      data = list(…),
      model = list("mars", nk = knots, degree = deg)
    )

    list(
      import = conversion(response_window = 3) ~ transunion,
      data = list(…),
      model = list("ensemble",
        lapply(seq(0, 1, by = 0.1), function(alpha) {
          list("glmnet", link = "bernoulli", alpha = alpha)
        })
      )
    )

• Framework of 25 R packages.

• Deploy models same-day on raw production data with no extra code.

• Provides smooth transition from R to “Big Data”.
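The ensemble entry is plain R, so the specification it builds can be inspected on its own. A small sketch (the variable names alphas, submodels, and ensemble_spec are introduced here for illustration; how the framework trains and combines the sub-models is not shown in the slides):

```r
# Sweep the elastic-net mixing parameter from 0 to 1 in steps of 0.1.
alphas <- seq(0, 1, by = 0.1)

# One sub-model specification per alpha, as in the slide's lapply call.
submodels <- lapply(alphas, function(alpha) {
  list("glmnet", link = "bernoulli", alpha = alpha)
})

# The full model entry: a tag plus the list of sub-model specifications.
ensemble_spec <- list("ensemble", submodels)

length(submodels)  # one specification per value of alpha
```

Because the specification is just a nested list, hyper-parameter grids like this can be generated with ordinary R code instead of being written out by hand.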


Contact: robert.k@avant.com

peterhurford/batchman kirillseva/ruigi michaelochurch/fixedwidth-hs robertzk/cachemeifyoucan