Date post: | 26-Oct-2014 |
Category: |
Documents |
Upload: | nektarios-leonardos |
JMP Visual Data Mining

Where is Massachusetts?

Where in Massachusetts?

Williams College

Reason for Data Mining
Data = $$

Data Mining Is…
“the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” --- Fayyad
“finding interesting structure (patterns, statistical models, relationships) in data bases.” --- Fayyad, Chaudhuri and Bradley
“a knowledge discovery process of extracting previously unknown, actionable information from very large data bases” --- Zornes
“a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions.” --- Edelstein

What is Data Mining?

Data Mining Models – a partial list
Traditional statistical models
  Linear regression, logistic regression, splines, smoothers etc.
  Vendors are adding these to DM software
Clustering and Visualization
Neural networks
Decision trees
Naïve Bayes
K Nearest Neighbor Methods
K-means
Combining Models – Bagging and Boosting

What makes Data Mining Different?
Massive amounts of data
  Number of rows (cases)
  Number of columns (variables)
Low signal to noise
  Many irrelevant variables
  Subtle relationships
  Variation
UPS
  16 TB – U.S. Library of Congress
  Mostly tracking
Facebook
  200-300 TB of photos every month
Google
  1 PB every 72 minutes

Why Is Data Mining Taking Off Now?
Because we can
  Computer power
  The price of digital storage is near zero
Data warehouses already built
Companies want return on data investment

Users are Also Different
Users
  Domain experts, not statisticians
  Have too much data
  Want automatic methods
  Want useful information without spending all their time doing statistical analysis

Data Mining Myths
Find answers to unasked questions
Continuously monitor your data base for interesting patterns
Eliminate the need to understand your business
Eliminate the need to collect good data
Eliminate the need to have good data analysis skills

Examples of Data Mining Applications -- Customer Relationship Management
Transactional Data
  Customer retention
  Upselling opportunities
  Customer optimization across different areas
Marketing Experiments
  Often, new hypotheses are generated by data mining a planned experiment.
  Segmentation

Manufacturing Applications
Product reliability and quality control
Process control
  What can I do to improve batch yields?
Warranty analysis
  Product problems
  Service assessment
  Adverse experiences – link to production

Medical Applications
Medical procedure effectiveness
  Who are good candidates for surgery?
Physician effectiveness
  Which tests are ineffective?
Which physicians are likely to over-prescribe treatments?
What combinations of tests are most effective?

E-commerce
Automatic web page design
Recommendations for new purchases
Cross selling
Social Network Marketing

Pharmaceutical Applications
High throughput screening
  Predict actions in assays
  Predict results in animals or humans
Rational drug design
  Relating chemical structure with chemical properties
  Inverse regression to predict chemical properties from desired structure
DNA SNPs (“snips”)
Genomics
  Associate genes with diseases
  Find relationships between genotype and drug response (e.g., dosage requirements, adverse effects)
  Find individuals most susceptible to placebo effect

Pharmaceutical Applications
Combine clinical trial results with extensive medical/demographic information
Non-traditional uses of clinical trial data warehouse to explore:
  Prediction of adverse experiences – combining more than one trial
  Who is likely to be non-compliant or drop out?
  What are alternative (i.e., non-approved) uses supported by the data?

Fraud and Terrorist Detection
Identify false:
  Medical insurance claims
  Accident insurance claims
Which stock trades are based on insider information?
Whose cell phone numbers have been stolen?
Which credit card transactions are from stolen cards?
Which documents are “interesting”?
When are changes in networks signs of potential illegal activity?

Lesson 1: Learn to Make Friends
PVA (Paralyzed Veterans of America) is a philanthropic organization, sanctioned by the US Govt to represent the disabled veterans
They send out 4 million “free gifts” every 6 weeks and hope for donations
Data were used for the KDD 1998 cup
  200,000 donors (100,000 training, 100,000 test)
  481 demographic variables
    Past giving, income, age, etc.
  Recent campaign (only for training set)
    Did they give? (Target B)
    How much did they give? (Target D)
To optimize profit, who should receive the current solicitation?
What is the most cost effective strategy?
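
One way to read the two questions above is as an expected-value rule: mail a donor only when the predicted chance of giving times the predicted gift exceeds the cost of the mailing. The sketch below illustrates that rule. The donor records, the probabilities, the gift sizes and the cost per mailing are all made-up values for illustration and do not come from the PVA data.

```python
def should_mail(p_give, expected_gift, cost_per_mailing):
    """Mail only when the expected donation exceeds the cost of the mailing."""
    return p_give * expected_gift > cost_per_mailing

# Hypothetical donors: p_give plays the role of a Target B model,
# expected_gift the role of a Target D model.
donors = [
    {"id": "A", "p_give": 0.10, "expected_gift": 15.0},
    {"id": "B", "p_give": 0.02, "expected_gift": 10.0},
    {"id": "C", "p_give": 0.05, "expected_gift": 40.0},
]
COST_PER_MAILING = 0.50  # assumed cost, not taken from the slides

mail_list = [d["id"] for d in donors
             if should_mail(d["p_give"], d["expected_gift"], COST_PER_MAILING)]
print(mail_list)
```

Donor B is skipped because an expected gift of 0.20 does not cover the assumed 0.50 mailing cost.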

What’s “Hard”? -- Example

T-Code

More Tcode

Transformation?

Categories?

What does it mean?
T-Code Title | T-Code Title | T-Code Title | T-Code Title
0 _ | 16 DEAN | 48 CORPORAL | 109 LIC.
1 MR. | 17 JUDGE | 50 ELDER | 111 SA.
1001 MESSRS. | 17002 JUDGE & MRS. | 56 MAYOR | 114 DA.
1002 MR. & MRS. | 18 MAJOR | 59002 LIEUTENANT & MRS. | 116 SR.
2 MRS. | 18002 MAJOR & MRS. | 62 LORD | 117 SRA.
2002 MESDAMES | 19 SENATOR | 63 CARDINAL | 118 SRTA.
3 MISS | 20 GOVERNOR | 64 FRIEND | 120 YOUR MAJESTY
3003 MISSES | 21002 SERGEANT & MRS. | 65 FRIENDS | 122 HIS HIGHNESS
4 DR. | 22002 COLNEL & MRS. | 68 ARCHDEACON | 123 HER HIGHNESS
4002 DR. & MRS. | 24 LIEUTENANT | 69 CANON | 124 COUNT
4004 DOCTORS | 26 MONSIGNOR | 70 BISHOP | 125 LADY
5 MADAME | 27 REVEREND | 72002 REVEREND & MRS. | 126 PRINCE
6 SERGEANT | 28 MS. | 73 PASTOR | 127 PRINCESS
9 RABBI | 28028 MSS. | 75 ARCHBISHOP | 128 CHIEF
10 PROFESSOR | 29 BISHOP | 85 SPECIALIST | 129 BARON
10002 PROFESSOR & MRS. | 31 AMBASSADOR | 87 PRIVATE | 130 SHEIK
10010 PROFESSORS | 31002 AMBASSADOR & MRS | 89 SEAMAN | 131 PRINCE AND PRINCESS
11 ADMIRAL | 33 CANTOR | 90 AIRMAN | 132 YOUR IMPERIAL MAJEST
11002 ADMIRAL & MRS. | 36 BROTHER | 91 JUSTICE | 135 M. ET MME.
12 GENERAL | 37 SIR | 92 MR. JUSTICE | 210 PROF.
12002 GENERAL & MRS. | 38 COMMODORE | 100 M. |
13 COLONEL | 40 FATHER | 103 MLLE. |
13002 COLONEL & MRS. | 42 SISTER | 104 CHANCELLOR |
14 CAPTAIN | 43 PRESIDENT | 106 REPRESENTATIVE |
14002 CAPTAIN & MRS. | 44 MASTER | 107 SECRETARY |
15 COMMANDER | 46 MOTHER | 108 LT. GOVERNOR |
15002 COMMANDER & MRS. | 47 CHAPLAIN | |

Relational Data Bases
Data are stored in tables
Items
ItemID ItemName price
C56621 top hat 34.95
T35691 cane 4.99
RS5292 red shoes 22.95
Shoppers
Person ID person name ZIPCODE item bought
135366 Lyle 19103 T35691
135366 Lyle 19103 C56621
259835 Dick 01267 RS5292
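
The two tables above can be joined on the item ID to produce the flat, one-row-per-purchase view that modeling tools usually expect. The following is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`items`, `shoppers`, `item_id` and so on) are illustrative spellings of the slide's headings, not names from any real system.

```python
import sqlite3

# In-memory database holding the two tables from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (item_id TEXT PRIMARY KEY, item_name TEXT, price REAL);
    CREATE TABLE shoppers (person_id INTEGER, person_name TEXT, zipcode TEXT,
                           item_bought TEXT REFERENCES items(item_id));
""")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", [
    ("C56621", "top hat", 34.95),
    ("T35691", "cane", 4.99),
    ("RS5292", "red shoes", 22.95),
])
# ZIP codes are stored as text so that the leading zero in 01267 survives.
conn.executemany("INSERT INTO shoppers VALUES (?, ?, ?, ?)", [
    (135366, "Lyle", "19103", "T35691"),
    (135366, "Lyle", "19103", "C56621"),
    (259835, "Dick", "01267", "RS5292"),
])

# Join on the foreign key, then total the spending per shopper.
rows = conn.execute("""
    SELECT s.person_name, ROUND(SUM(i.price), 2) AS total
    FROM shoppers AS s JOIN items AS i ON s.item_bought = i.item_id
    GROUP BY s.person_id, s.person_name
    ORDER BY s.person_name
""").fetchall()
spend_by_person = dict(rows)
print(spend_by_person)
```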

Metadata
The data survey describes the data set contents and characteristics
  Table name
  Description
  Primary key/foreign key relationships
  Collection information: how, where, conditions
  Timeframe: daily, weekly, monthly
  Cosynchronous: every Monday or Tuesday

Data Preparation
Build data mining database
Explore data
Prepare data for modeling
60% to 95% of the time is spent preparing the data

Data Challenges
Data definitions
  Types of variables
Data consolidation
  Combine data from different sources
  NASA Mars lander
Data heterogeneity
  Homonyms
  Synonyms
Data quality

Data Quality

Missing Values
Random missing values
  Delete row?
    Paralyzed Veterans
  Substitute value
    Imputation
    Multiple Imputation
    JMP 8 (!)
Systematic missing data
  Now what?
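
As a rough illustration of the "substitute value" option, the sketch below fills missing entries with the mean of the observed ones and keeps a flag recording which rows were filled. The income figures are invented. Mean substitution is only one simple choice and is reasonable only when values are missing at random, which is exactly what the systematic case on this slide calls into question.

```python
from statistics import mean

def impute_mean(values):
    """Replace None with the mean of the observed values.

    Also returns a parallel list of flags so a later model can still
    see which rows were originally missing.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return imputed, was_missing

incomes = [40000, None, 60000, 50000, None]  # made-up example data
imputed, was_missing = impute_mean(incomes)
print(imputed)
print(was_missing)
```

Keeping the `was_missing` flag matters when missingness is itself informative, because a tree or regression can then use it as a predictor.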

Missing Values -- Systematic
Credit Card Bank finds that “Income” field is missing
Wharton Ph.D. Student questionnaire on survey attitudes
Bowdoin College applicants have mean SAT verbal score above 750
Clinical Trial of Depression Medication – what does missing mean?

Results for PVA Data Set
If the entire list (100,000 donors) is mailed, net donation is $10,500
Using data mining techniques, this was increased 41.37%

KDD CUP 98 Results

KDD CUP 98 Results 2

Students in Data Mining Class
Student #1 $15,024
Student #2 $14,695
Student #3 $14,345

Data Mining and OLAP
On-line analytical processing (OLAP): users deductively analyze data to verify hypotheses
  Descriptive, not predictive
Data mining: software uses data to inductively find patterns – models!
  Predictive or descriptive
Associations?
  Most associated variables in the census
  Most associated variables in a supermarket
  Association Rules

Why Models?
Beer and Diapers
“In the convenience stores we looked at, on Friday nights, purchases of beer and purchases of diapers are highly associated”
Conclusions?
Actions?
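
"Highly associated" is usually quantified with support, confidence and lift. The sketch below computes all three for the rule beer → diapers on a handful of invented baskets. The data are made up for illustration and say nothing about the stores in the quote.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "chips"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
    {"milk"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(lhs, rhs):
    """Estimated P(rhs in basket | lhs in basket)."""
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    """Confidence relative to the base rate of rhs; above 1 means positive association."""
    return confidence(lhs, rhs) / support(rhs)

rule_support = support({"beer", "diapers"})
rule_confidence = confidence({"beer"}, {"diapers"})
rule_lift = lift({"beer"}, {"diapers"})
print(rule_support, rule_confidence, rule_lift)
```

A lift above 1 only says the items co-occur more often than chance would suggest. It does not by itself answer the slide's questions about conclusions or actions.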

Beer and Diapers
Picture from Tandem™ ad

Models
Models are:
  Powerful summaries for understanding
  Used for exploration and prediction
Of course, models are not reality
George Box
  “All models are wrong, but some are useful”
  “Statisticians, like artists, have the bad habit of falling in love with their models”.

Twyman’s Law and Corollaries
“If it looks interesting, it must be wrong”
De Veaux’s Corollary 1 to Twyman’s Law
  “If it’s perfect, it’s wrong”
De Veaux’s Corollary 2 to Twyman’s Law
  “If it isn’t wrong, you probably knew it already”

Lesson 2 – An Example of Twyman’s Law
Ingot cracking
  3935 30,000 lb. ingots
  Up to 25% cracking rate
  $30,000 per recast
  90 potential explanatory variables
    Water composition (reduced)
    Metal composition
    Process variables
    Other environmental variables

Data Processing
Five months to consolidate process data
Three months to analyze and reduce dimension of water data
Eight months after starting projects, statisticians received flat file:
  960 ingots (rows)
  149 variables

Decision Trees – Mortgage Defaults
[Figure: decision tree with Yes/No splits on Household Income > $40000, Debt > $10000 and On Job > 5 Yr; leaf default rates 0.05, 0.01, 0.06 and 0.11]

Decision Tree -- Titanic
[Figure: decision tree for Titanic survival with splits on sex (M/F), class (1, 2, 3, Crew) and age (Child/Adult); leaf survival rates 46%, 93%, 27%, 100%, 33%, 23% and 14%]

Cook County Hospital -- “ER”
The 3 “Urgent” Risk Factors:
1. Is the reported pain unstable angina?
2. Is there fluid in patient’s lungs?
3. Is the patient’s systolic BP < 100?
The ECG Tests:
• MI: myocardial infarction (heart attack)
• Ischemia – heart muscle not getting enough blood

Confusion Matrix
Doctors in ER | No Heart Attack | Actual Heart Attack
Predict No Heart Attack | 0.25 | 0.11
Predict Heart Attack | 0.75 | 0.89
Tree Algorithm (Goldman) | No Heart Attack | Actual Heart Attack
Predict No Heart Attack | 0.92 | 0.04
Predict Heart Attack | 0.08 | 0.96
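
The proportions above sum to 1 within each "actual" column, so each cell can be read as the share of a given actual outcome that received a given prediction. A minimal sketch of how such column-normalized rates can be computed from raw labels follows. The eight labelled cases are invented and are not the Cook County data.

```python
def column_rates(actual, predicted):
    """Return rates[(a, p)] = share of cases with actual label a predicted as p.

    Labels are 0 (no heart attack) and 1 (heart attack). Rates are
    normalized within each actual class, matching the layout of the slide.
    """
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    rates = {}
    for a in (0, 1):
        total = counts[(a, 0)] + counts[(a, 1)]
        for p in (0, 1):
            rates[(a, p)] = counts[(a, p)] / total
    return rates

# Made-up example: 4 patients without and 4 with a heart attack.
actual = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 1, 1, 1, 1, 1, 1, 0]
rates = column_rates(actual, predicted)
print(rates[(0, 0)], rates[(0, 1)], rates[(1, 0)], rates[(1, 1)])
```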

Regression Tree
[Figure: regression tree with splits Price<9446.5, Weight<2280, Disp.<134, Weight<3637.5, Price<11522, Reliability:abde and HP<154; leaf means 34.00, 30.17, 26.22, 24.17, 21.67, 20.40, 22.57 and 18.60]

Ingots – First Tree
Node | Count | Mean | Std Dev
All Rows | 3935 | 0.2356 | 0.4244
Alloy (6045,7348,8234,2345,3234) | 3005 | 0.159 | 0.3661
Alloy (5434,5894,2439) | 930 | 0.4817 | 0.4999
We know that – some alloys are hard to make. That’s why we gave you the data in the first place.

Second Tree
Node | Count | Mean | Std Dev
All Rows | 3935 | 0.2356 | 0.4244
MG<3.9 | 3055 | 0.1673 | 0.3723
MG>=3.9 | 880 | 0.4727 | 0.4999
What do you think is in those alloys?

One More Time
Looks like Chrome
  OH!
Did that solve it? No, but
  Experimental design
  Enabled us to focus on important variables
Oh, that’s funny!
- Isaac Asimov

What did we learn?
Data mining gave clues for generating hypotheses
Followed up with DOE
DOE led to substantial process improvement

Herb’s Tree – Twyman’s Law again
Node | Count | G^2 | Prob (Level 0) | Prob (Level 1)
All Rows | 94649 | 37928.436 | 0.9494 | 0.0506
TARGET_D>=1 | 4792 | 0 | 0.0000 | 1.0000
TARGET_D<1 | 89857 | 0 | 1.0000 | 0.0000

Doing it Right – Knowledge Discovery
Define business problem
Build data mining database
Explore data
Prepare data for modeling
Build model
Evaluate model
Deploy model and results
Note: This process model borrows from CRISP-DM: CRoss Industry Standard Process for Data Mining

Successful Data Mining
The keys to success:
  Formulating the problem
  Using the right data
  Flexibility in modeling
  Acting on results
Success depends more on the way you mine the data than on the specific tool

Types of Models
Descriptions
Classification (categorical or discrete values)
Regression (continuous values)
Time series (continuous values)
Clustering
Association

Model Building
Model building
  Train
  Test
Evaluate
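
The train/test/evaluate cycle can be sketched in a few lines: fit on one half of the data, then report R-squared on both halves. The example below uses simulated data and a one-predictor least-squares line purely for illustration. The data-generating formula, the 50/50 split and the seed are all assumptions, not anything from the deck.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(xs, ys, intercept, slope):
    """1 - SS_residual / SS_total, evaluated on whatever data is passed in."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [rng.random() for _ in range(200)]
ys = [2 + 3 * x + rng.gauss(0, 0.5) for x in xs]  # simulated response

# Train on the first half, hold out the second half for testing.
train_x, train_y = xs[:100], ys[:100]
test_x, test_y = xs[100:], ys[100:]

intercept, slope = fit_line(train_x, train_y)
r2_train = r_squared(train_x, train_y, intercept, slope)
r2_test = r_squared(test_x, test_y, intercept, slope)
print(f"R-squared: {r2_train:.3f} train, {r2_test:.3f} test")
```

The test figure is the honest one, since the model never saw those rows while being fitted.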

Overfitting in Regression
Classical overfitting:
  Fit 6th order polynomial to 6 data points
[Figure: polynomial fitted through the data points; x-axis from -1 to 6, y-axis from 0.0 to 3.5]
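
The effect is easy to reproduce. A polynomial of degree five already passes exactly through any six points with distinct x values, so the training error is zero however noisy the points are. The sketch below builds that interpolating polynomial in Lagrange form. The six y values are invented, roughly linear with noise.

```python
def lagrange(xs, ys):
    """Return the unique polynomial of degree len(xs) - 1 through the points."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 2.1, 1.9, 3.2, 2.8, 3.4]  # made-up noisy observations

p = lagrange(xs, ys)
train_error = max(abs(p(x) - y) for x, y in zip(xs, ys))
print("largest error on the training points:", train_error)
print("prediction just outside the data, at x = 6:", p(6))
```

A perfect fit on the training points says nothing about how well the curve predicts between or beyond them, which is the point of the slide.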

Overfitting
Fitting non-explanatory variables to data
Overfitting is the result of
  Including too many predictor variables
  Lack of regularizing the model
    Neural net run too long
    Decision tree too deep

Avoiding Overfitting
Avoiding overfitting is a balancing act – Occam’s Razor
  Fit fewer variables rather than more
  Have a reason for including a variable (other than it is in the database)
  Regularize (don’t overtrain)
  Know your field.
All models should be as simple as possible but no simpler than necessary
Albert Einstein

“Toy” Problem
[Figure: ten scatterplots of the response train2$y (about 5 to 25) against each predictor train2[, i] (0.0 to 1.0)]

Tree Model
[Figure: large regression tree with root split x4<0.512146 and further splits on x1, x2, x3, x4, x5 and x8; leaf predictions range from 4.400 to 25.320]
R-squared: 82.3% Train, 67.2% Test

Predictions for Example
R-squared: 82.3% Train, 67.2% Test
[Figure: y plotted against y Predictor]

Tree Advantages
Model explains its reasoning -- builds rules
Build model quickly
Handles non-numeric data
No problems with missing data
  Missing data as a new value
  Surrogate splits
Works fine with many dimensions

What’s Wrong With Trees?
Outputs are step functions – big errors near boundaries
Greedy algorithms for splitting – small changes in the data change the model
Uses less data after every split
Model has high order interactions -- all splits are dependent on previous splits
Often non-interpretable

Linear Regression
Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept   -0.900     0.482       -1.860    0.063
x1           4.658     0.292       15.950    <.0001
x2           4.685     0.294       15.920    <.0001
x3          -0.040     0.291       -0.140    0.892
x4           9.806     0.298       32.940    <.0001
x5           5.361     0.281       19.090    <.0001
x6           0.369     0.284        1.300    0.194
x7           0.001     0.291        0.000    0.998
x8          -0.110     0.295       -0.370    0.714
x9           0.467     0.301        1.550    0.122
x10         -0.200     0.289       -0.710    0.479
R-squared: 73.5% Train, 69.4% Test

Stepwise Regression
Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept   -0.625     0.309       -2.019    0.0439
x1           4.619     0.289       15.998    <.0001
x2           4.665     0.292       15.984    <.0001
x4           9.824     0.296       33.176    <.0001
x5           5.366     0.28        19.145    <.0001
R-squared: 73.4% Train, 69.8% Test

Stepwise 2nd Order Model
Term                          Estimate   Std Error   t Ratio   Prob>|t|
Intercept                     -2.026     0.264       -7.68     <.0001
x1                             4.311     0.184       23.47     <.0001
x2                             4.808     0.185       26.04     <.0001
x3                            -0.506     0.181       -2.79     0.0054
x4                            10         0.186       53.79     <.0001
x5                             5.212     0.176       29.67     <.0001
x8                            -0.181     0.186       -0.97     0.3301
x9                             0.427     0.188        2.28     0.0232
(x1-0.51811)*(x1-0.51811)     -0.932     0.711       -1.31     0.1905
(x2-0.48354)*(x1-0.51811)      8.972     0.634       14.14     <.0001
(x3-0.48517)*(x1-0.51811)     -1.367     0.65        -2.1      0.0358
(x3-0.48517)*(x2-0.48354)     -0.8       0.639       -1.25     0.2111
(x3-0.48517)*(x3-0.48517)     20.515     0.69        29.71     <.0001
(x4-0.49647)*(x1-0.51811)      1.014     0.651        1.56     0.1197
(x4-0.49647)*(x2-0.48354)     -1.159     0.65        -1.78     0.075
(x5-0.50509)*(x2-0.48354)     -0.794     0.62        -1.28     0.2008
(x5-0.50509)*(x3-0.48517)      1.105     0.619        1.78     0.0748
(x5-0.50509)*(x4-0.49647)      0.127     0.635        0.2      0.8414
(x8-0.52029)*(x5-0.50509)      1.065     0.63         1.69     0.0914
R-squared: 89.9% Train, 88.8% Test

Next Steps
Higher order terms?
When to stop?
Transformations?
Too simple: underfitting – bias
Too complex: inconsistent predictions, overfitting – high variance
Selecting models is Occam’s razor
  Keep goals of interpretation vs. prediction in mind

Logistic Regression
What happens if we use linear regression on 1-0 (yes/no) data?
[Figure: 0/1 response plotted against Income (20000 to 80000) with a fitted straight line; y-axis from 0.0 to 1.0]

Logistic Regression II
Points on the line can be interpreted as probability, but don’t stay within [0,1]
Use a sigmoidal function instead of linear function to fit the data
$f(I) = \frac{1}{1 + e^{-I}}$
[Figure: the sigmoid curve; x-axis from -10 to 10, y-axis from 0 to 1.2]
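
To make the contrast concrete, the sketch below evaluates a straight line and the sigmoid at a few incomes. The line's intercept and slope and the centering and scaling of income are arbitrary illustrative values, not fitted coefficients.

```python
import math

def logistic(z):
    """Sigmoid f(z) = 1 / (1 + exp(-z)); the output always lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_probability(income, intercept=-0.2, slope=0.00002):
    """A straight-line 'probability' with made-up coefficients."""
    return intercept + slope * income

for income in (5000, 40000, 80000):
    z = (income - 40000) / 10000  # arbitrary centering and scaling for illustration
    print(income, round(linear_probability(income), 2), round(logistic(z), 3))
```

With these assumed coefficients the line gives a negative value at an income of 5000 and a value above 1 at 80000, while the sigmoid stays inside (0, 1) throughout.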

Logistic Regression III
[Figure: Accept (0.0 to 1.0) plotted against Income (20000 to 80000) with the fitted logistic curve]

Neural Nets
Don’t resemble the brain
  Are a statistical model
  Closest relative is projection pursuit regression

A Single Neuron
[Figure: inputs x1 to x5 with weights 0.3, 0.7, -0.2, 0.4 and -0.5, plus a constant input x0 with weight 0.8, feed one node whose input is z1 and whose output is h(z1)]
z1 = 0.8 + .3x1 + .7x2 - .2x3 + .4x4 - .5x5
Output: h(z1)
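
The neuron's arithmetic can be written out directly. The weights and the constant 0.8 below are the ones on the slide. The slide does not say which function h is, so the logistic sigmoid used here is an assumption.

```python
import math

BIAS = 0.8                              # weight on the constant input x0
WEIGHTS = [0.3, 0.7, -0.2, 0.4, -0.5]   # weights on x1..x5, as on the slide

def neuron_input(x):
    """z1 = 0.8 + .3x1 + .7x2 - .2x3 + .4x4 - .5x5"""
    return BIAS + sum(w * xi for w, xi in zip(WEIGHTS, x))

def h(z):
    """Activation function; the logistic sigmoid is assumed here."""
    return 1.0 / (1.0 + math.exp(-z))

z1 = neuron_input([1, 1, 1, 1, 1])
print(z1, h(z1))
```

With every input set to 1 the weighted sum is 0.8 + 0.3 + 0.7 - 0.2 + 0.4 - 0.5 = 1.5, up to floating-point rounding.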

Single Node
Input to outer layer from “hidden node”:
$I = z_k = \sum_j w_{1jk} x_j + \theta_k$
Output:
$\hat{y}_k = h(z_k)$

Layered Architecture
[Figure: network with an input layer (x1, x2), a hidden layer (z1, z2, z3) and an output layer (y)]

Neural Networks
Create lots of features – hidden nodes:
$z_k = \sum_j w_{1jk} x_j + \theta_k$
Use them in an additive model:
$\hat{y} = w_{21} h(z_1) + w_{22} h(z_2) + \dots + \theta$

Put It Together
$\hat{y} = \tilde{h}\left(\sum_k w_{2k}\, h\left(\sum_j w_{1jk} x_j + \theta_k\right) + \theta\right)$
The resulting model is just a flexible non-linear regression of the response on a set of predictor variables.
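
Written as code, the whole network is a short nested computation: a weighted sum per hidden node, a nonlinearity, then a weighted sum of the hidden outputs. The sketch below is one possible reading with a single output. The choice of `tanh` for the hidden nodes, the identity for the output and all weight values are assumptions for illustration.

```python
import math

def forward(x, w1, theta1, w2, theta2, h=math.tanh, h_out=lambda z: z):
    """One-hidden-layer network with a single output.

    w1[j][k] is the weight from input j to hidden node k, theta1[k] the
    bias of hidden node k, w2[k] the weight from hidden node k to the
    output, and theta2 the output bias.
    """
    hidden = []
    for k in range(len(theta1)):
        z_k = theta1[k] + sum(w1[j][k] * x[j] for j in range(len(x)))
        hidden.append(h(z_k))
    return h_out(theta2 + sum(w2[k] * hidden[k] for k in range(len(hidden))))

# Two inputs and three hidden nodes; the weights are arbitrary.
x = [1.0, 2.0]
w1 = [[0.5, -0.3, 0.1],
      [0.2, 0.4, -0.6]]
theta1 = [0.0, 0.1, -0.1]
w2 = [1.0, -2.0, 0.5]
theta2 = 0.25

y_hat = forward(x, w1, theta1, w2, theta2)
print(y_hat)
```

Setting every `w2[k]` to zero collapses the prediction to the output bias, a quick check that the hidden nodes enter only through the outer sum.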

Predictions for Example
R-squared: 89.5% Train, 87.7% Test

What Does This Get Us?
Enormous flexibility
Ability to fit anything
  Including noise
Interpretation?

Neural Net Pro
Advantages
  Handles continuous or discrete values
  Complex interactions
  In general, highly accurate for fitting due to flexibility of model
  Can incorporate known relationships
    So called grey box models
    See De Veaux et al, Environmetrics 1999

Neural Net Con
Disadvantages
  Model is not descriptive (black box)
  Difficult, complex architectures
  Slow model building
  Categorical data explosion
  Sensitive to input variable selection

K-Nearest Neighbors (KNN)
To predict y for an x:
  Find the k most similar x's
  Average their y's
Find k by cross validation
No training (estimation) required
Works embarrassingly well
  Friedman, KDDM 1996
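
The recipe above fits in a few lines. The sketch below uses Euclidean distance as the measure of similarity and leave-one-out cross validation to choose k. Both are common choices but are assumptions here, as are the five made-up training points.

```python
import math

def knn_predict(train_x, train_y, x, k):
    """Average the y values of the k training points closest to x."""
    neighbors = sorted(zip(train_x, train_y),
                       key=lambda pair: math.dist(pair[0], x))
    return sum(y for _, y in neighbors[:k]) / k

def leave_one_out_error(train_x, train_y, k):
    """Mean squared error when each point is predicted from all the others."""
    total = 0.0
    for i in range(len(train_x)):
        rest_x = train_x[:i] + train_x[i + 1:]
        rest_y = train_y[:i] + train_y[i + 1:]
        total += (knn_predict(rest_x, rest_y, train_x[i], k) - train_y[i]) ** 2
    return total / len(train_x)

# Made-up data: two predictors, one numeric response.
train_x = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
train_y = [1, 2, 3, 10, 12]

near_origin = knn_predict(train_x, train_y, (0.2, 0.1), k=3)
near_corner = knn_predict(train_x, train_y, (5.5, 5), k=2)
best_k = min((1, 2, 3), key=lambda k: leave_one_out_error(train_x, train_y, k))
print(near_origin, near_corner, best_k)
```

There is no fitting step: all the work happens at prediction time, which is what "no training required" means.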

Collaborative Filtering
Goal: predict what movies people will like
Data: list of movies each person has watched
  Lyle: André, Starwars
  Ellen: André, Starwars, Destin
  Fred: Starwars, Batman
  Dean: Starwars, Batman, Rambo
  Jason: Destin d’Amélie Poulin, Caché

Data Base
Data can be represented as a sparse matrix
Karen likes André. What else might she like?
CDNow doubled e-mail responses
Columns: Starwars | Rambo | Batman | My Dinner w/André | Destin d'Amélie | Caché
Lyle: y y
Ellen: y y y
Fred: y y
Dean: y y y
Jason: y y y
Karen: ? ? ? ? ? ?
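
One simple way to answer the question about Karen is a nearest-neighbor vote: weight each person by how much their list overlaps hers, then score the movies she has not seen. The sketch below uses Jaccard similarity, which is an assumed choice, and takes the viewing lists from the preceding slide, with "Destin" standing for Destin d'Amélie Poulin.

```python
watched = {
    "Lyle": {"André", "Starwars"},
    "Ellen": {"André", "Starwars", "Destin"},
    "Fred": {"Starwars", "Batman"},
    "Dean": {"Starwars", "Batman", "Rambo"},
    "Jason": {"Destin", "Caché"},
}

def jaccard(a, b):
    """Overlap between two sets of movies: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)

def recommend(liked, watched):
    """Rank unseen movies by the summed similarity of the people who watched them."""
    scores = {}
    for other in watched.values():
        sim = jaccard(liked, other)
        if sim == 0:
            continue  # people with nothing in common contribute no votes
        for movie in other - liked:
            scores[movie] = scores.get(movie, 0.0) + sim
    return sorted(scores, key=lambda m: (-scores[m], m))

recommendations = recommend({"André"}, watched)
print(recommendations)
```

Only Lyle and Ellen share a movie with Karen, so Starwars (on both of their lists) ranks ahead of Destin (on Ellen's only).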

Lesson 3: Know When to Hold ‘em
Breast cancer data from mammograms
Error rates by trained radiologists are near 25% for both false positives and false negatives
Newer equipment is prohibitively expensive for the developing world
Early detection of breast cancer is crucial
Cumulative type I error over a decade is near 100% leading to needless biopsies

The Data
1618 mammograms showing clustered microcalcifications
Biostatistics Dept, Institut Curie
Variables
  Response: Malignant or not
  Predictors: Age, Tissue Type (light/dense), Size (mm), Number of microcalc, Number of suspicious clusters, Shape of microcalc (1-5), Polyshape? (y/n), Shape of cluster (1,2,3), Retro (cluster near nipple?), Deep? (y/n)

Tree model

Combining Models
In the 1950s, forecasters found that combining forecasting models worked better on average than any single forecast model
Reduces variance by averaging
Can reduce bias if the collection is broader than a single model

Bagging and Boosting
Bagging (Bootstrap Aggregation)
  Bootstrap a data set repeatedly
  Take many versions of same model (e.g. tree)
    Random Forest Variation
  Form a committee of models
  Take majority rule of predictions
Boosting
  Create repeated samples of weighted data
  Weights based on misclassification
  Combine by majority rule, or linear combination of predictions
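
The bagging steps above can be sketched with a very small base model. The example below bootstraps a toy data set, fits a one-split "stump" to each resample and combines them by majority vote. The stump learner, the eight data points, the 25 models and the seed are all illustrative assumptions, and boosting is not shown.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Best single-threshold classifier on one numeric feature (labels 0/1)."""
    best = None
    for t in sorted(set(xs)):
        for left_label in (0, 1):
            errors = sum((left_label if x < t else 1 - left_label) != y
                         for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left_label)
    _, t, left_label = best
    return lambda x: left_label if x < t else 1 - left_label

def bagged_classifier(xs, ys, n_models, rng):
    """Fit one stump per bootstrap resample and predict by majority vote."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]
    return predict

# Made-up one-dimensional data: class 0 on the left, class 1 on the right.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

predict = bagged_classifier(xs, ys, n_models=25, rng=random.Random(0))
print(predict(0.05), predict(0.95))
```

An odd number of models avoids tied votes for a two-class problem.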

Results
Model | False Positives | False Negatives
Simple Tree | 32.20% | 33.70%
Neural Network | 25.50% | 31.70%
Boosted Trees | 24.90% | 32.50%
Bagged Trees | 19.30% | 28.80%
Radiologists | 22.40% | 35.80%
• Split data into train and test (62.5% - 37.5%)
• Repeat random splits 1000 times
  For each iteration, count false positives and false negatives on the 600 test set cases

How Do We Really Start?
Life is not so kind
  Categorical variables
  Missing data
  500 variables, not 10
481 variables – where to start?

Where to Start
Three rules of data analysis
  Draw a picture
  Draw a picture
  Draw a picture
Ok, but how? There are 90 histogram/bar charts and 4005 scatterplots to look at (or at least 90 if you look only at y vs. X)
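
The plot counts follow from simple counting: one histogram or bar chart per variable, and one scatterplot per unordered pair of variables. A short check, where the 481-variable line is an extra illustration using the PVA variable count quoted elsewhere in the deck:

```python
import math

n_variables = 90
one_variable_plots = n_variables                 # one histogram or bar chart each
pairwise_scatterplots = math.comb(n_variables, 2)  # 90 * 89 / 2 unordered pairs
print(one_variable_plots, pairwise_scatterplots)

# The same count for 481 variables shows how quickly this grows.
print(math.comb(481, 2))
```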

Exploratory Data Models
Use a tree to find a smaller subset of variables to investigate
Explore this set graphically
  Start the modeling process over
Build model
  Compare model on small subset with full predictive model

More Realistic
250 predictors
  200 Continuous
  50 Categorical
10,000 rows
Why is this still easy?
  No missing values
  Relatively high signal/noise

Start With a Simple Model
Tree?
[Figure: regression tree with root split x4<0.477873 and further splits on x1, x2, x4 and x5; leaf predictions range from -2.560 to 12.200]

Brushing

Lesson 4: Know when to Fold ‘em
Liability for churches
Some Predictors
  Net Premium Value
  Property Value
  Coastal (yes/no)
  Inner100 (a.k.a., highly-urban) (yes/no)
  High property value Neighborhood (yes/no)
  Indicator Class
    1 (Church/House of worship)
    2 (Sexual Misconduct – Church)
    3 (Add’l Sex. Misc. Covg Purchased)
    4 (Not-for-profit daycare centers)
    5 (Dwellings – One family (Lessor’s risk))
    6 (Bldg or Premises – Office – Not for profit)
    7 (Corporal Punishment – each faculty member)
    8 (Vacant land – not for profit)
    9 (Private, not for profit, elementary, Kindergarten and Jr. High Schools)
    10 (Stores – no food or drink – not for profit)
    11 (Bldg or Premises – Maintained by insured (lessor’s risk) – not for profit)
    12 (Sexual misconduct – diocese)

Fast Fail
Not every modeling effort is a success
A model search can save lots of queries
Data took 8 months to get ready
Analyst spent 2 months exploring it
Tree models, stepwise regression (and a neural network running for several hours) found no out of sample predictive ability

Lesson 5: Machines are Smart – You are Smarter
Why do statisticians like interpretability?
Black boxes are not interpretable, but there may be important information

Case Study – Warranty Data
A new backpack inkjet printer is showing higher than expected warranty claims
What are the important variables?
What’s going on?
A neural network shows that Zip code is the most important predictor

Zip Code?

Data Mining – DOE Synergy
Data Mining is exploratory
Efforts can go on simultaneously
Learning cycle oscillates naturally between the two

What Did We Learn?
Toy problem
  Functional form of model
PVA data
  Useful predictor – increased sales 40%
Depression Study
  Identified critical intervention point at 2 weeks
Ingots
  Gave clues as to where to look
  Experimental design followed
Churches
  When to quit
Printers
  When to experiment – what factors

Challenges for data mining
Not algorithms
Overfitting
Finding an interpretable model that fits reasonably well

Recap – Success in Data Mining
Problem formulation
Data preparation
  Data definitions
  Data cleaning
  Feature creation, transformations
EDM – exploratory modeling
  Reduce dimensions

Success in Data Mining II
Don’t forget Graphics
Second phase modeling
Testing, validation, implementation
Constant re-evaluation of models

Which Method(s) to Use?
No method is best
Which methods work best when?
Which method to use?
  YES!

For More Information
Two Crows: http://www.twocrows.com
KDNuggets: http://www.kdnuggets.com
Berry, M. and Linoff, G., Data Mining Techniques, Wiley, 1997
Pyle, D., Data Preparation for Data Mining, Morgan Kaufmann, 1999
Hand, D.J., Mannila, H. and Smyth, P., Principles of Data Mining, MIT Press, 2001
Tan, P.N., Steinbach, M. and Kumar, V., Introduction to Data Mining, Addison-Wesley, 2006
Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical Learning, 2nd edition, Springer