Date post: | 11-Apr-2017 |
Category: | Data & Analytics |
Upload: | t-scott-clendaniel |
AGENDA
1. Background: Explaining why boosting performance is relevant.
2. Strategy: How to design a strategy for boosting performance.
3. Features: How to use Feature Engineering to boost model performance.
4. Bonus Round: A collection of free resources for boosting model performance.
5. Questions: Time for questions from the audience.
SECTION 1: Background
TIPS SOURCES
Where do the recommendations originate?
• 197 Kaggle winner interviews: How did they win?
• 50 in-depth case studies: Which factors mattered?
• 25,000 head-to-head tests: What made the difference?
WHERE HAVE THESE TIPS WORKED?
IMPORTANT: All views expressed are solely my own, and should not be taken as being those of current or past employers, clients or others.
TWO CATEGORIES OF TIPS
Presentation Focus
• Part 1, Model Strategy: The plan, method, series of tactics or stratagems for building your model.
• Part 2, Data Preparation: The process for identifying, building, developing, standardizing, normalizing and engineering the correct inputs for one or more analytics processes.
SECTION 2: Strategy
TIP 1: Leverage Extreme Ensembles
The performance boost from combining models with non-correlated errors is consistently higher than that from single models or smaller ensembles.
2015 Liberty Mutual Contest, Owen Zhang
• 6-layer process
• 5 distinct data prep steps
• 31 combined feature sets
• 2 layers of 3 models each
Source: Owen Zhang, Chief Product Officer at DataRobot, https://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions
2015 KDD Cup, Jeong-Yoon Lee
• 7 feature sets
• 64 component models
• 15 models in Level 1 Ensemble
• 2 models in Level 2 Ensemble
Source: Jeong-Yoon Lee, Chief Data Scientist at Conversion Logic, https://www.slideshare.net/jeongyoonlee/data-science-competition-72596610
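A much smaller version of the layered ensembles described above can be sketched with scikit-learn's StackingClassifier. This is only an illustration, not either competitor's actual pipeline: the synthetic data, the three Level 1 models, and the logistic-regression Level 2 model are all assumptions chosen to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: any tabular binary-classification set would do).
X, y = make_classification(n_samples=600, n_features=12, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Level 1: deliberately different model families, so their errors are less correlated.
level_1 = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("gbm", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ("logit", LogisticRegression(max_iter=1000)),
]

# Level 2: a simple model trained on cross-validated Level 1 predictions.
stack = StackingClassifier(
    estimators=level_1,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"Stacked ensemble accuracy: {stack_accuracy:.3f}")
```

The contest solutions listed above used far more component models and feature sets; the point here is only the two-level structure.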
TIP 2: Reduce the Decision Space
MARKETING: Eliminate irrelevant populations
• Seed lists
• Old, unusable lead sources
• Discontinued markets
FRAUD: Eliminate “safer” populations
• Low dollar thresholds
• “Best” customers
• Higher authentication transactions
• “Standing” transactions
• Canceled transfers
GENERAL: Other instances
• What do you already know?
• What is beyond your influence?
• Which problems can be handled separately?
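For the marketing case, reducing the decision space can be as simple as filtering out-of-scope records before any model is trained. The sketch below uses pandas; the table and its column names (is_seed_list, lead_source_active, market_active) are hypothetical and exist only to illustrate the three exclusions listed above.

```python
import pandas as pd

# Hypothetical lead table; all column names are illustrative assumptions.
leads = pd.DataFrame({
    "lead_id": [1, 2, 3, 4, 5, 6],
    "is_seed_list": [False, True, False, False, False, False],
    "lead_source_active": [True, True, False, True, True, True],
    "market_active": [True, True, True, False, True, True],
    "responded": [0, 0, 1, 0, 1, 0],
})

# Records the model should never have to score:
# seed lists, old/unusable lead sources, and discontinued markets.
out_of_scope = (
    leads["is_seed_list"]
    | ~leads["lead_source_active"]
    | ~leads["market_active"]
)

# Train and score only on the remaining population.
modeling_population = leads.loc[~out_of_scope].reset_index(drop=True)
print(modeling_population["lead_id"].tolist())
```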
TIP 3: Use Targeted AUC Instead of Total AUC
Match model objective to organizational objective. Example courtesy of ORACLE.
TARGETED AUC: Optimizes targeted model performance
• Less common approach
• Perfect for projects with target thresholds such as limited marketing budgets or maximum fraud referral/turndown rates
• Sacrifices overall accuracy for accuracy at lower threshold targets
TOTAL AUC: Optimizes overall model performance
• Traditional approach
• Perfect for many Kaggle competitions
• Sacrifices accuracy at lower threshold targets for overall accuracy
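One way to score a model on a targeted region of the ROC curve is the partial AUC that scikit-learn exposes through the max_fpr argument of roc_auc_score. Whether this matches the exact "targeted AUC" definition used in the Oracle example is an assumption; the synthetic data and the 10% false-positive cap are also illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data, loosely mimicking a fraud or response problem (assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# Total AUC: area under the whole ROC curve.
total_auc = roc_auc_score(y_test, scores)

# Targeted AUC: standardized partial AUC restricted to false-positive rates <= 10%,
# e.g. when only a small share of cases can ever be referred or contacted.
targeted_auc = roc_auc_score(y_test, scores, max_fpr=0.10)

print(f"Total AUC:    {total_auc:.3f}")
print(f"Targeted AUC: {targeted_auc:.3f}")
```

When comparing candidate models, ranking them by the targeted figure rather than the total one keeps model selection aligned with the operating range the organization will actually use.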
TIP 4: Cross-Validate Everywhere
Reducing overfitting while extracting maximum learning from your data
OUT-OF-SAMPLE VALIDATION: Traditional methodology
CROSS-VALIDATION: Used to reduce both overfitting and outlier influence
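The contrast between a single out-of-sample split and cross-validation can be sketched as follows. The data, the gradient boosting model, and the choice of five folds are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Out-of-sample validation: one train/holdout split, one score.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model.fit(X_train, y_train)
holdout_auc = roc_auc_score(y_hold, model.predict_proba(X_hold)[:, 1])

# Cross-validation: every row is used for both training and validation,
# and the spread across folds shows how stable the estimate is.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(f"Holdout AUC: {holdout_auc:.3f}")
print(f"CV AUC:      {cv_aucs.mean():.3f} +/- {cv_aucs.std():.3f}")
```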
TIP 5: Algorithm Arsenal
Leverage diverse modeling arsenal
• Bayesian Network
• Gradient Boosting Machines
• Random Forests
• Logistic Regression
• Factorization Machines
• Neural Network
• Genetic Algorithms
• Support Vector Machines
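A quick way to check whether an arsenal is actually diverse is to compare the out-of-fold errors of each algorithm. The sketch below uses three of the listed families that ship with scikit-learn; the synthetic data and the use of error correlation as the diversity measure are assumptions, not something the slides prescribe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)

arsenal = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Out-of-fold probability errors for each algorithm (one row per model).
errors = []
for name, model in arsenal.items():
    oof_prob = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    errors.append(oof_prob - y)

# Pairwise correlation of errors: lower off-diagonal values suggest
# the models make different mistakes and are better ensemble partners.
error_corr = np.corrcoef(np.vstack(errors))
print(np.round(error_corr, 2))
```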
SECTION 3: Features
TIPS 8-13: Univariate Tree Feature Engineering
Features
1. Derive “Stumps”: “Stumps” represent the first split in decision trees, and make powerful “weak learners.” Create a derived feature for each input.
2. Bin Continuous Inputs: Using trees creates bin “boundaries” directly associated with the dependent variable, rather than relying on a more arbitrary approach. Assign bins for each continuous input.
3. Handle Missing Values: Assigning missing values to a separate, unique category preserves information content and eliminates arbitrary replacement approaches.
4. Derive High-Impact Flags: Calling out tree nodes with uniquely powerful splitting capabilities as derived features extracts the most benefit from single inputs.
5. Normalize Scaling: Each input, regardless of data type, can have consistent, normalized scaling by using something like NORM Sigmoid or Yule’s Q for each terminal node from each univariate tree.
6. Overall Transformation: Re-coding the original input into the values from the terminal nodes makes interpretation much easier.
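Several of the univariate-tree steps above can be sketched for a single input with scikit-learn and pandas. Everything specific here is an assumption made for illustration: the synthetic income column, the tree sizes, the -1 and "missing" codes, and the choice of Yule's Q computed as node-versus-rest on a 2x2 table. High-impact flags (step 4) are not shown.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic single input with roughly 10% missing values (illustrative assumption).
rng = np.random.default_rng(0)
n = 1000
income = rng.normal(50, 15, n)
target = (income + rng.normal(0, 10, n) > 55).astype(int)
income[rng.random(n) < 0.10] = np.nan
df = pd.DataFrame({"income": income, "target": target})

known = df["income"].notna()
X_known, y_known = df.loc[known, ["income"]], df.loc[known, "target"]

# Stump: a depth-1 tree gives one target-driven split point for this input.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_known, y_known)
stump_threshold = stump.tree_.threshold[0]
df["income_stump"] = np.where(known, (df["income"] > stump_threshold).astype(int), -1)

# Bins: a small univariate tree; each terminal node becomes one bin.
binner = DecisionTreeClassifier(
    max_leaf_nodes=4, min_samples_leaf=50, random_state=0
).fit(X_known, y_known)
df["income_bin"] = "missing"  # missing values get their own category
leaf_labels = pd.Series(binner.apply(X_known), index=X_known.index).astype(str)
df.loc[known, "income_bin"] = "leaf_" + leaf_labels

# Overall transformation: recode the input as the target rate of its node.
node_rate = df.groupby("income_bin")["target"].mean()
df["income_recode"] = df["income_bin"].map(node_rate)

def yules_q(in_node, y):
    """Yule's Q for the 2x2 table of (in node vs. not) by (target 1 vs. 0)."""
    a = (in_node & (y == 1)).sum()
    b = (in_node & (y == 0)).sum()
    c = (~in_node & (y == 1)).sum()
    d = (~in_node & (y == 0)).sum()
    denom = a * d + b * c
    return (a * d - b * c) / denom if denom else 0.0

# Normalized scaling: every node gets a value between -1 and 1.
node_q = {b: yules_q(df["income_bin"] == b, df["target"]) for b in df["income_bin"].unique()}
df["income_q"] = df["income_bin"].map(node_q)

print(df.groupby("income_bin")[["income_recode", "income_q"]].first())
```

In practice the node rates and Q values should be estimated on training folds only, since they are derived from the target.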
TIP 14: Think “Crafts-person-ship”
Less “Assembly Line,” More “Fine Craftsmanship”
Moving Away From… Moving Toward…
SECTION 4: Bonus Round
Bonus Round: Patent-Application IMPACT Features
Patent application approach for transforming and combining model inputs
1. Univariate Tree Models
2. Create Common Table of Values for Each Node
3. Calculate Z-Score Across Entire Table
4. Assign New Value to New Derived Feature
5. Calculate Avg., High and Low
6. Gradient Boosting
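The slides name the six steps but give no detail on how each is carried out, so the following is only one possible reading of them and should not be taken as the patent-application method itself. The assumptions are: the "value" of a node is its target rate, the z-score uses the mean and standard deviation of all node values in the common table, and the average, high and low are taken per row across the derived features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_raw, y = make_classification(n_samples=800, n_features=5, n_informative=3, random_state=0)
X = pd.DataFrame(X_raw, columns=[f"x{i}" for i in range(5)])

# Steps 1-2: one small tree per input, then a common table of node values.
leaf_ids, node_rows = {}, []
for col in X.columns:
    tree = DecisionTreeClassifier(max_leaf_nodes=4, min_samples_leaf=40, random_state=0)
    tree.fit(X[[col]], y)
    leaf_ids[col] = tree.apply(X[[col]])
    rates = pd.Series(y).groupby(leaf_ids[col]).mean()  # assumed node value: target rate
    for leaf, rate in rates.items():
        node_rows.append({"feature": col, "leaf": leaf, "value": rate})
node_table = pd.DataFrame(node_rows)

# Step 3: z-score each node value against the entire table.
node_table["z"] = (node_table["value"] - node_table["value"].mean()) / node_table["value"].std()

# Step 4: each row receives the z-score of the node it falls into, per input.
derived = pd.DataFrame(index=X.index)
for col in X.columns:
    z_by_leaf = node_table.loc[node_table["feature"] == col].set_index("leaf")["z"]
    derived[f"{col}_z"] = pd.Series(leaf_ids[col], index=X.index).map(z_by_leaf)

# Step 5: row-level average, high and low across the derived features.
z_cols = list(derived.columns)
derived["z_avg"] = derived[z_cols].mean(axis=1)
derived["z_high"] = derived[z_cols].max(axis=1)
derived["z_low"] = derived[z_cols].min(axis=1)

# Step 6: gradient boosting on the derived features.
# Caveat: node values were computed on all rows, so these CV scores are optimistic.
gbm = GradientBoostingClassifier(n_estimators=50, random_state=0)
cv_auc = cross_val_score(gbm, derived, y, cv=5, scoring="roc_auc")
print(f"CV AUC on derived features: {cv_auc.mean():.3f}")
```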
Get in Touch
See you soon....
MktgSciences
3719 Yolando Road
Baltimore, MD 21218
USA 1-443-810-8066
MODEL STRATEGY TIP 1: Cross-validate everywhere.
Source: Jeong-Yoon Lee, Chief Data Scientist at Conversion Logic, https://www.slideshare.net/jeongyoonlee/data-science-competition-72596610
Source: Owen Zhang, Chief Product Officer at DataRobot, https://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions
DEFINITIONS
performance (noun): “the manner in which or the efficiency with which something reacts or fulfills its intended purpose.”
PERFORMANCE WILL DETERMINE COMPENSATION
Like it or not, Data Science compensation will become more closely tied to model performance.