Dimension Reduction in Workers Compensation
CAS Predictive Modeling Seminar
Louise Francis, FCAS, MAAA
Francis Analytics and Actuarial Data Mining, [email protected]
www.data-mines.com
Objectives
• Answer questions: What is dimension reduction and why use it?
• Introduce key methods of dimension reduction
• Illustrate with examples in Workers Compensation
• There will be some formulas, but emphasis is on insight into basic mechanisms of the procedures
Introduction
• “How do mere observations become data for analysis?”
• “Specific variable values are never immutable characteristics of the data”
  • Jacoby, Data Theory and Dimension Analysis, Sage Publications
• Many of the dimension reduction/measurement techniques originated in the social sciences and dealt with how to create scales from responses on attitudinal and opinion surveys
Unsupervised learning
• Dimension reduction methods are generally unsupervised learning methods
• Supervised Learning
  • A dependent or target variable
• Unsupervised learning
  • No target variable
  • Group like variables or like records together
The Data
• BLS Economic indexes
  • Components of inflation
  • Employment data
  • Health insurance inflation
• Texas Department of Insurance closed claim data for 2002 and 2003
  • Employment related injury
  • Excludes small claims
  • About 1800 records
What is a dimension?
• Jacoby – The number of separate and interesting sources of variation
• In many studies each variable is a dimension
• However, we can also view each record in a database as a dimension
Dimensions
Year   Medical Care   Medical Services   Transportation   Electricity
1980   $74.90         $74.80             $83.10           $26.70
1981    82.90          82.80              93.20            31.55
1982    92.50          92.60              97.00            36.01
1983   100.60         100.70              99.30            37.18
1984   106.80         106.70             103.70            38.60
1985   113.50         113.20             106.40            38.98
1986   122.00         121.90             102.30            40.22
1987   130.10         130.00             105.40            40.02
1988   138.60         138.30             108.70            40.20
1989   149.30         148.90             114.10            40.83
1990   162.80         162.70             120.50            41.66
The Two Major Categories of Dimension Reduction
• Variable reduction
  • Factor Analysis
  • Principal Components Analysis
• Record reduction
  • Clustering
• Other methods tend to be developments on these
Principal Components Analysis
• A form of dimension (variable) reduction
• Suppose we want to combine all the information related to the “inflation” dimension of insurance costs
  • Medical care costs
  • Employment (wage) costs
  • Other
    • Energy
    • Transportation
    • Services
Principal Components
• These variables are correlated but not perfectly correlated
• We replace many variables with a weighted sum of the variables
• These are then used as independent variables in a predictive model
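As an illustration of the "weighted sum" idea, here is a minimal sketch that computes the first principal component of the four index series in the Dimensions table (1980 to 1990). It is not the procedure used for the slides: it assumes the weights are the leading eigenvector of the correlation matrix and finds them by simple power iteration, and all function names are made up for this example.

```python
import math

# Index values for 1980-1990, copied from the Dimensions table.
DATA = {
    "Medical Care":     [74.9, 82.9, 92.5, 100.6, 106.8, 113.5, 122.0, 130.1, 138.6, 149.3, 162.8],
    "Medical Services": [74.8, 82.8, 92.6, 100.7, 106.7, 113.2, 121.9, 130.0, 138.3, 148.9, 162.7],
    "Transportation":   [83.1, 93.2, 97.0, 99.3, 103.7, 106.4, 102.3, 105.4, 108.7, 114.1, 120.5],
    "Electricity":      [26.70, 31.55, 36.01, 37.18, 38.60, 38.98, 40.22, 40.02, 40.20, 40.83, 41.66],
}

def standardize(xs):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / sd for x in xs]

def correlation_matrix(columns):
    """Correlation matrix of already standardized columns."""
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in columns] for ci in columns]

def leading_eigenpair(matrix, iterations=500):
    """Power iteration: (largest eigenvalue, unit-length eigenvector)."""
    v = [1.0] * len(matrix)
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(m * x for m, x in zip(row, v)) for row in matrix]
    eigenvalue = sum(a * b for a, b in zip(v, mv))
    return eigenvalue, v

z = [standardize(col) for col in DATA.values()]
R = correlation_matrix(z)
eigenvalue, weights = leading_eigenpair(R)

# Component score for each year = weighted sum of the standardized variables.
scores = [sum(w * col[t] for w, col in zip(weights, z)) for t in range(len(z[0]))]

for name, w in zip(DATA, weights):
    print(f"{name:17s} weight {w:.3f}")
print(f"share of variance explained: {eigenvalue / len(R):.3f}")
```

The `scores` list is the single replacement variable that would stand in for the four original series as a predictor.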
Factor Analysis: A Latent Factor
[Diagram: the latent factor Social Inflation linked to three observed variables: litigation rates, # procedures, and index of tort climate]
Factor/Principal Components Analysis
• Linear methods – use linear correlation matrix
• Correlation matrix is decomposed to find a smaller number of factors that are related to the same underlying drivers
• Highly correlated variables tend to load highly on the same factor
Factor/Principal Components Analysis
                   Medical   Medical    Transpor-   Electri-
                   Care      Services   tation      city       Utility   Fuel Oil   Gas     Bread
Medical Care       1.000
Medical Services   1.000     1.000
Transportation     0.993     0.992      1.000
Electricity        0.888     0.884      0.910       1.000
Utility            0.872     0.873      0.875       0.771      1.000
Fuel Oil           0.448     0.451      0.468       0.281      0.704     1.000
Gas                0.586     0.592      0.601       0.402      0.752     0.926      1.000
Bread              0.983     0.983      0.975       0.844      0.847     0.459      0.595   1.000
Factor/Principal Components Analysis
• Uses eigenvectors and eigenvalues
• R is the correlation matrix, V the matrix of eigenvectors, and $\Lambda$ the diagonal matrix of eigenvalues $\lambda$
$$V^{T} R V = \Lambda$$
Inflation Data
Component Matrix (a)
                   Component
Variable           1        2
Medical Care        .986    -.086
Medical Services    .986    -.081
Transportation      .990    -.073
Electricity         .895    -.205
Utility             .877     .303
Fuel Oil            .551     .761
Gas                 .709     .639
Bread               .973    -.078
Eggs                .587     .337
Apples              .766     .077
Coffee              .457    -.644
Employment          .967    -.202
UEP                -.695     .521
EmpCost             .986    -.048
Extraction Method: Principal Component Analysis.
a. 2 components extracted.
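Factor Rotation
• Find simpler, more easily interpretable factors
• Use notion of factor complexity
$$q_i = \frac{1}{r}\sum_{j=1}^{r}\left(b_{ij}^{2} - \bar{b}_{i}^{2}\right)^{2}$$
where $r$ is the number of factors, $b_{ij}$ is the loading of variable $i$ on factor $j$, and $\bar{b}_{i}^{2}$ is the mean squared loading on the factors for row $i$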
Factor Rotation
• Quartimax Rotation
  • Maximize q
• Varimax Rotation
  • Maximizes the variance of squared loadings for each factor rather than for each variable
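To make the two criteria concrete, the sketch below rotates a handful of unrotated loadings from the Inflation Data component matrix and searches for the angle that maximizes the varimax criterion. It is a simplified illustration under stated assumptions: only two factors, a brute-force grid of angles, a subset of six variables, and no Kaiser normalization, so it will not reproduce the statistical package's rotated matrix exactly. Function names are invented for this example.

```python
import math

# Unrotated loadings (component 1, component 2) for six variables,
# copied from the Inflation Data component matrix.
LOADINGS = {
    "Medical Care": (0.986, -0.086),
    "Electricity":  (0.895, -0.205),
    "Fuel Oil":     (0.551, 0.761),
    "Gas":          (0.709, 0.639),
    "Coffee":       (0.457, -0.644),
    "UEP":          (-0.695, 0.521),
}

def rotate(loadings, angle):
    """Orthogonal rotation of the two factor axes by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(b1 * c + b2 * s, -b1 * s + b2 * c) for b1, b2 in loadings]

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def varimax_criterion(loadings):
    """Sum over factors (columns) of the variance of squared loadings."""
    return sum(variance([row[j] ** 2 for row in loadings]) for j in range(2))

def quartimax_criterion(loadings):
    """Sum over variables (rows) of the variance of squared loadings."""
    return sum(variance([b ** 2 for b in row]) for row in loadings)

rows = list(LOADINGS.values())

# Grid search over rotation angles from 0 to 90 degrees in 0.1 degree steps.
angles = [math.radians(step / 10) for step in range(0, 901)]
best_angle = max(angles, key=lambda a: varimax_criterion(rotate(rows, a)))
rotated = rotate(rows, best_angle)

print(f"best angle: {math.degrees(best_angle):.1f} degrees")
for name, (r1, r2) in zip(LOADINGS, rotated):
    print(f"{name:13s} {r1:6.3f} {r2:6.3f}")
```

Because the rotation is orthogonal, each variable's communality (the sum of its squared loadings) is unchanged; only the way it is split between the two factors moves.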
Varimax Rotation
Rotated Component Matrix (a)
                   Component
Variable           1        2
Medical Care        .834     .533
Medical Services    .831     .537
Transportation      .829     .546
Electricity         .835     .383
Utility             .510     .775
Fuel Oil           -.028     .939
Gas                 .172     .939
Bread               .818     .532
Eggs                .260     .625
Apples              .560     .529
Coffee              .755    -.232
Employment          .890     .429
UEP                -.869    -.011
EmpCost             .811     .563
Extraction Method: Principal Component Analysis. Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 3 iterations.
Plot of Loadings on Factors
How Many Factors to Keep?
• Eigenvalues provide information on how much variance is explained
• Proportion explained by a given component = corresponding eigenvalue / n, where n is the number of variables
• Use Scree Plot
• Rule of thumb: keep all factors with eigenvalues > 1
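A short sketch of the arithmetic behind these rules, using the three eigenvalues reported in this deck for the transformed (optimally scaled) variables: 2.138, 0.590 and 0.272. The function names are illustrative only.

```python
def proportion_explained(eigenvalues):
    """Share of total variance per component.

    For a correlation matrix the eigenvalues sum to n, the number of variables.
    """
    n = len(eigenvalues)
    return [ev / n for ev in eigenvalues]

def kaiser_rule(eigenvalues):
    """1-based indices of the components whose eigenvalue exceeds 1."""
    return [i + 1 for i, ev in enumerate(eigenvalues) if ev > 1.0]

eigenvalues = [2.138, 0.590, 0.272]
shares = proportion_explained(eigenvalues)
kept = kaiser_rule(eigenvalues)

for i, (ev, share) in enumerate(zip(eigenvalues, shares), start=1):
    print(f"component {i}: eigenvalue {ev:.3f}, proportion explained {share:.3f}")
print("components kept by the eigenvalue > 1 rule:", kept)
```

Here the first component explains about 71% of the variance and is the only one the rule of thumb retains.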
WC Severity vs Factor 1
WC Severity vs Factor 2
What About Categorical Data?
• Factor analysis is performed on numeric data
• You could code data as binary dummy variables
• Categorical Variables from Texas data
  • Injury
  • Cause of loss
  • Business Class
  • Health Insurance (Y/N)
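Binary dummy coding can be sketched as follows. The injury labels are hypothetical values invented for the example, not categories taken from the Texas data, and `dummy_code` is an illustrative helper rather than a library function.

```python
def dummy_code(values, drop_first=False):
    """Turn one categorical column into 0/1 indicator columns.

    Returns (levels, rows): `levels` names the indicator columns and
    `rows` holds one list of 0/1 flags per record.
    """
    levels = sorted(set(values))
    if drop_first:
        # Drop one level to avoid perfectly collinear indicators.
        levels = levels[1:]
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows

# Hypothetical injury categories, for illustration only.
injuries = ["back", "burn", "back", "fracture"]
levels, rows = dummy_code(injuries)

print(levels)
for injury, flags in zip(injuries, rows):
    print(f"{injury:9s} {flags}")
```

A variable with many categories produces many sparse indicator columns, which is one motivation for methods that assign a single numeric score per category instead.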
Optimal Scaling
• A method of dealing with categorical variables
• Can be used to model nonlinear relationships
• Uses regression to
  • Assign numbers to categories
  • Fit regression coefficients
  • Y*=f(X*)
• In each round of fitting, a new Y* and X* is created
Variable Correlations
Correlations Original Variables
                 injury    cause    Business class
injury           1.000     -.019     .049
cause            -.019     1.000     .105
Business class    .049      .105    1.000
Dimension        1         2        3
Eigenvalue       1.109     1.014     .877
Correlations Transformed Variables
Dimension: 1
                 injury    cause    Business class
injury           1.000      .710     .433
cause             .710     1.000     .552
Business class    .433      .552    1.000
Dimension        1         2        3
Eigenvalue       2.138      .590     .272
Visualizations of Scaled Variables
Can we use scaled variables in prediction?
Average Paid Loss by Ntile of Optimal Score
Ntile   First Score   Second Score
1       294,305       163,736
2       270,763       188,733
3       233,056       206,497
4       151,455       261,773
5       147,751       277,389
Tree Using Optimal Scaling Scores
Tree for Subrogation
Row Reduction: Cluster Analysis
• Records are grouped in categories that have similar values on the variables
• Examples
  • Marketing: People with similar values on demographic variables (e.g., age, gender, income) may be grouped together for marketing
  • Text analysis: Use words that tend to occur together to classify documents
  • Fraud modeling
  • Territory definition
• Note: no dependent variable used in analysis
Clustering
• Common methods: k-means, hierarchical
• No dependent variable – records are grouped into classes with similar values on the variables
• Start with a measure of similarity or dissimilarity
• Maximize dissimilarity between members of different clusters
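The k-means idea can be sketched in a few lines: assign each record to its nearest center, recompute the centers, and repeat until assignments stop changing. This is a bare-bones illustration, not the software used for the results in this deck. The records are made-up (report lag, settlement lag) pairs and the starting centers are chosen by hand.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(records, initial_centers, max_iterations=100):
    """Plain k-means; returns (cluster label per record, final centers)."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each record joins its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: euclidean(r, centers[c]))
            for r in records
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(len(centers)):
            members = [r for r, lab in zip(records, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical (report lag, settlement lag) pairs in days, invented for illustration.
records = [(10, 200), (12, 220), (15, 210), (300, 900), (320, 950), (310, 870)]
labels, centers = k_means(records, initial_centers=[records[0], records[3]])

print(labels)
print(centers)
```

In practice the variables would usually be standardized first, since a variable measured on a large scale otherwise dominates the distance.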
Dissimilarity (Distance) Measure – Continuous Variables
• Euclidean Distance
$$d_{ij} = \left(\sum_{k=1}^{m}\left(x_{ik} - x_{jk}\right)^{2}\right)^{1/2}, \qquad i, j = \text{records},\ k = \text{variable}$$
• Manhattan Distance
$$d_{ij} = \sum_{k=1}^{m}\left|x_{ik} - x_{jk}\right|$$
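Both distance measures translate directly into code. The two records below are hypothetical and have m = 2 variables each.

```python
import math

def euclidean_distance(x_i, x_j):
    """Square root of the summed squared differences over the m variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

def manhattan_distance(x_i, x_j):
    """Sum of the absolute differences over the m variables."""
    return sum(abs(a - b) for a, b in zip(x_i, x_j))

# Two hypothetical records measured on m = 2 variables.
record_i = (0.0, 0.0)
record_j = (3.0, 4.0)

print(euclidean_distance(record_i, record_j))  # 5.0
print(manhattan_distance(record_i, record_j))  # 7.0
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of records, and it is less sensitive to one very large difference on a single variable.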
Binary Variables
                        Row Variable
                        1        0
Column Variable   0     a        b        a+b
                  1     c        d        c+d
                        a+c      b+d
Binary Variables
• Simple Matching
$$d = \frac{b + c}{a + b + c + d}$$
• Rogers and Tanimoto
$$d = \frac{2(b + c)}{(a + d) + 2(b + c)}$$
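A small sketch of the two binary dissimilarity measures. It assumes, as the formulas imply, that a and d count the variables on which two records agree and b and c count those on which they disagree. The code takes a as "both 1" and d as "both 0", which is an assumption; because both formulas are symmetric in a and d, the choice does not change the result. The two records are invented for illustration.

```python
def contingency(x, y):
    """Counts (a, b, c, d) for two equal-length 0/1 records.

    Assumed convention: a = both 1, b = (1, 0), c = (0, 1), d = both 0.
    """
    pairs = list(zip(x, y))
    a = pairs.count((1, 1))
    b = pairs.count((1, 0))
    c = pairs.count((0, 1))
    d = pairs.count((0, 0))
    return a, b, c, d

def simple_matching(x, y):
    """Share of variables on which the two records disagree."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)

def rogers_tanimoto(x, y):
    """Like simple matching, but disagreements are given double weight."""
    a, b, c, d = contingency(x, y)
    return 2 * (b + c) / ((a + d) + 2 * (b + c))

# Two hypothetical records with five binary variables each.
x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]

print(contingency(x, y))
print(simple_matching(x, y))
print(rogers_tanimoto(x, y))
```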
Example: Texas Data
• Data from 2002 and 2003 closed claim database by Texas Ins Dept
• Only claims over a threshold included
• Variables used for clustering:
  • Report Lag
  • Settlement Lag
  • County (ranked by how often in data)
  • Injury
  • Cause of Loss
  • Business class
Results Using Only Numeric Variables
• Used Euclidean distance measure
Final Cluster Centers
                                                               Cluster
Variable                                                       1         2         3
RANK of NCounty                                                10.741    5.158     14.500
RANK of SumLoss                                                25.155    7.342     53.000
RANK of numSuit                                                11.204    4.553     14.000
age                                                            40.67     42.26     63.00
Elapsed time between date of injury and date reported to insurer   233   8264      13893
Elapsed time between date of injury and date suit filed        391       7439      13843
Elapsed time between date of injury and date of trial          172       0         14627
BackInj                                                        .39       .00       .00
MultInj                                                        .35       .00       .00
Two Stage Clustering With Categorical Variables
• First compute dissimilarity measures
• Then get clusters
• Find optimum number of clusters
Loadings of Injuries on Cluster
Age and Cluster
County vs Cluster
Means of Financial Variables by Cluster
Average of Financial Variables by Cluster
Mean
TwoStep Cluster Number   Paidloss        Total allocated loss adjustment expense
1                        257,111.7426    38,831.05
2                         78,186.5918    53,273.24
3                        263,851.2863    57,535.26
4                        174,739.1995    25,522.39
Total                    219,854.6705    38,853.73
Tying Things Together: Multidimensional Scaling
• A mathematical way to connect clustering and factor analysis
• Data can be decomposed into key row dimensions times a diagonal weight matrix times key column dimensions
$$\hat{X}_{k} = U_{k} D_{k} V_{k}^{T}$$
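The decomposition can be illustrated for k = 1, where the approximation is a single row dimension u times a single weight d times a single column dimension v. The sketch below finds that leading triple by power iteration, which is a simplification chosen to keep the example dependency-free; a full singular value decomposition routine would normally be used. The small matrix is hypothetical and has exact rank 1, so the rank-1 approximation recovers it.

```python
import math

def rank_one_svd(X, iterations=200):
    """Leading singular triple (u, d, v) of X via power iteration on X^T X.

    Assumes the all-ones starting vector is not orthogonal to the leading
    right singular vector.
    """
    n_rows, n_cols = len(X), len(X[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        xv = [sum(x * vj for x, vj in zip(row, v)) for row in X]               # X v
        w = [sum(X[i][j] * xv[i] for i in range(n_rows)) for j in range(n_cols)]  # X^T (X v)
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    xv = [sum(x * vj for x, vj in zip(row, v)) for row in X]
    d = math.sqrt(sum(t * t for t in xv))  # singular value (the weight)
    u = [t / d for t in xv]                # row dimension
    return u, d, v

def reconstruct(u, d, v):
    """Rank-1 approximation: outer product of u and v scaled by d."""
    return [[ui * d * vj for vj in v] for ui in u]

# Hypothetical 3 x 2 data matrix of rank 1 (second column = 2 x first column).
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
u, d, v = rank_one_svd(X)
X_hat = reconstruct(u, d, v)

print(f"weight d = {d:.4f}")
for row in X_hat:
    print([round(value, 4) for value in row])
```

The vector u scores the records (rows) and v scores the variables (columns), which is the sense in which one decomposition connects record reduction and variable reduction.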
Modern dimension reduction
• Hidden layer in neural networks is like a nonlinear principal components analysis
• Projection Pursuit Regression – a nonlinear PCA
• Kohonen self-organizing maps – a kind of neural network that does clustering
• These can be understood as enhancements of factor analysis or clustering
Kohonen SOM for Fraud
[Figure: Kohonen self-organizing map grid, axes labeled 1 to 16 and S1 to S16, shaded by value band: 0-1, 1-2, 2-3, 3-4, 4-5]
Recommended References
• Hatcher, 1994, A Step-by-Step Approach for Using the SAS System for Factor Analysis and Structural Equation Modeling, SAS Publications
• Jacoby, 1991, Data Theory and Dimension Analysis, Sage Publications
• Kaufman and Rousseeuw,1990, Finding Groups in Data, Wiley
• Kim and Mueller, 1978, Factor Analysis: Statistical Methods and Practical Issues, Sage Publications
Questions?