1
Tests of structural equation models do not work: What to do?
Willem E. Saris
ESADE
Universitat Ramon Llull
2
Concern about testing
I have been worried about the testing procedures in SEM since my first contact with the field.
More than 25 years ago Albert Satorra and I wrote our first paper on the power of the test.
Our worries were not shared by the SEM community until recently (publication in SEM).
I am very pleased that today I have the opportunity to convince you of our point of view.
3
Importance of testing
The purpose of SEM is to estimate the strength of relationships between variables correcting for measurement error
All estimates are conditional on the specified model
Therefore testing the models is essential for SEM
4
Content of my lecture
• Brief intro in SEM and the standard test
• Our criticism
• The alternative direction of the SEM community: fit indices
• The special case of RMSEA
• Why fit indices are not the solution
• Back to the basics
• An illustration
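5
Introduction SEM by example
A frequently discussed issue nowadays is whether Social Trust is related to Political Trust.
Both latent variables are normally measured by three indicators
• Path analysis suggests, for indicator i of factor k and indicator j of factor m:
• $\sigma_{ij} = \lambda_{ik} \lambda_{jm}$ if $k = m$
• $\sigma_{ij} = \lambda_{ik} \phi_{km} \lambda_{jm}$ if $k \neq m$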
6
Estimation of effects
The parameters are estimated by minimizing the following quadratic form:
$f = \sum_{i,j} w_{ij} (s_{ij} - \sigma_{ij})^2$
The estimates are the values which minimize this function
The value of this function at its minimum is denoted by $f_0$
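To make the quadratic form concrete, here is a minimal sketch that only evaluates f for a given set of expected correlations; it does not minimize anything. The function name, the unit weights used by default and the choice to sum over the lower triangle (i >= j) are assumptions made for illustration, not something the slides specify.

```python
def fitting_function(observed, implied, weights=None):
    """Return f = sum over i >= j of w_ij * (s_ij - sigma_ij)**2.

    Unit weights are used when no weight matrix is given (an assumption).
    """
    size = len(observed)
    total = 0.0
    for i in range(size):
        for j in range(i + 1):  # lower triangle, diagonal included
            w = 1.0 if weights is None else weights[i][j]
            total += w * (observed[i][j] - implied[i][j]) ** 2
    return total


# Hypothetical 2 x 2 example: one observed correlation of .64
observed = [[1.00, 0.64], [0.64, 1.00]]
perfect = [[1.00, 0.64], [0.64, 1.00]]
off_by_a_tenth = [[1.00, 0.54], [0.54, 1.00]]

f_perfect = fitting_function(observed, perfect)       # no residuals at all
f_off = fitting_function(observed, off_by_a_tenth)    # one residual of .10
```

When the expected correlations reproduce the observed ones exactly, f is zero; a single residual of .10 contributes .10 squared.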
7
Imagine that this is the observed correlation matrix
Correlation Matrix
y1 y2 y3 y4 y5 y6
-------- -------- -------- -------- -------- --------
y1 1.00
y2 0.64 1.00
y3 0.64 0.64 1.00
y4 0.32 0.32 0.32 1.00
y5 0.32 0.32 0.32 0.64 1.00
y6 0.32 0.32 0.32 0.64 0.64 1.00
8
The estimates
LAMBDA-Y
F 1 F 2
-------- --------
y1 0.80 - -
y2 0.80 - -
y3 0.80 - -
y4 - - 0.80
y5 - - 0.80
y6 - - 0.80
Correlation of F1 with F2 = 0.50
We can estimate the relationship between latent variables and observed variables but also between latent variables
9
The residuals = differences between observed and expected correlations
Residuals
y1 y2 y3 y4 y5 y6
-------- -------- -------- -------- -------- --------
y1 0.00
y2 0.00 0.00
y3 0.00 0.00 0.00
y4 0.00 0.00 0.00 0.00
y5 0.00 0.00 0.00 0.00 0.00
y6 0.00 0.00 0.00 0.00 0.00 0.00
10
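Imagine that the model in the population is different
(Path diagram: F1 and F2 correlate .5; the loadings are .8; one additional path of .2; indicators Y1 to Y6 with error terms e1 to e6)
          y1       y2       y3       y4       y5       y6
       --------  --------  --------  --------  --------  --------
y1       1.00
y2       0.64     1.00
y3       0.64     0.64     1.00
y4       0.48     0.48     0.48     1.00
y5       0.32     0.32     0.32     0.72     1.00
y6       0.32     0.32     0.32     0.72     0.64     1.00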
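The two correlation matrices shown so far can be reproduced from the path-analysis rules. The sketch below is an illustration under one assumption: the extra path of .2 is read as a cross-loading of Y4 on F1, which is the reading that reproduces the correlations of .48 and .72. The function and variable names are hypothetical.

```python
def implied_correlations(loadings, factor_corr):
    """Correlation of i and j = sum over k, m of lambda_ik * phi_km * lambda_jm.

    The diagonal is set to 1 because we work with correlations.
    """
    n_vars = len(loadings)
    n_factors = len(factor_corr)
    sigma = [[1.0] * n_vars for _ in range(n_vars)]
    for i in range(n_vars):
        for j in range(n_vars):
            if i != j:
                sigma[i][j] = sum(
                    loadings[i][k] * factor_corr[k][m] * loadings[j][m]
                    for k in range(n_factors)
                    for m in range(n_factors)
                )
    return sigma


phi = [[1.0, 0.5], [0.5, 1.0]]  # correlation of F1 with F2 = .50

# Hypothesized model: each indicator loads .8 on one factor only
hypothesized = [[0.8, 0.0]] * 3 + [[0.0, 0.8]] * 3

# Population model (assumed reading): Y4 also loads .2 on F1
population = [[0.8, 0.0]] * 3 + [[0.2, 0.8]] + [[0.0, 0.8]] * 2

sigma_hyp = implied_correlations(hypothesized, phi)
sigma_pop = implied_correlations(population, phi)
```

Only the row for y4 differs between the two matrices: its correlations with y1 to y3 rise from .32 to .48 and those with y5 and y6 from .64 to .72.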
11
Now the estimates are also different
These estimates deviate somewhat from the values in the population
The deviations are due to the misspecification
Can we detect that the hypothesized model is wrong?
LAMBDA-Y
            F 1       F 2
         --------  --------
VAR 1      0.80      - -
          (0.04)
          19.96
VAR 2      0.80      - -
          (0.04)
          19.96
VAR 3      0.80      - -
          (0.04)
          19.96
VAR 4      - -       0.95
                    (0.04)
                    26.14
VAR 5      - -       0.77
                    (0.04)
                    19.42
VAR 6      - -       0.77
                    (0.04)
                    19.42
12
The fitted residuals
Based on these estimates the expected correlations can be calculated.
The residuals (observed-expected correlations) can indicate that the model is misspecified
In this case the residuals are:
          VAR 1     VAR 2     VAR 3     VAR 4     VAR 5     VAR 6
         --------  --------  --------  --------  --------  --------
VAR 1      0.00
VAR 2      0.00      0.00
VAR 3      0.00      0.00      0.00
VAR 4      0.02      0.02      0.02      0.00
VAR 5     -0.05     -0.05     -0.05     -0.01      0.00
VAR 6     -0.05     -0.05     -0.05     -0.01      0.05      0.00
13
When should the model be rejected?
Residuals can differ from zero due to misspecification of the model,
but also due to sampling fluctuations.
So when should the model be rejected?
14
The quality the test should have
MacCallum, Browne and Sugawara (1996: 131)
“if the model is truly a good model in terms of its fit in the population, we wish to avoid concluding that the model is a bad one.
Alternatively, if the model is truly a bad one, we wish to avoid concluding that it is a good one.”
15
In statistical terms
Required is:
A small probability of a type I error, i.e. the probability of rejecting a good model
A small probability of a type II error, i.e. the probability of accepting a bad model
16
Bad models are misspecified models
Hu and Bentler (1998: 427):
“a model is said to be misspecified when
(a) one or more parameters are estimated whose population values are zeros (i.e. an over-parameterised misspecified model)
(b) one or more parameters are fixed to zeros whose population values are non-zeros (i.e. an under-parameterised misspecified model)
(c) or both.”
17
Definition of the size of a misspecification
The size of the misspecification is the absolute difference between
the true value of the parameter and
the value specified in the analysis
In the above example the size of the misspecification was .2
18
The standard chi2 test
It can be shown that under very general conditions
the test statistic $T = n f_0$ has a $\chi^2(df)$ distribution if the model is correct
The model is rejected if $T > C$,
where $C$ is the value for which
$\Pr(\chi^2(df) > C) = \alpha$
19
Criticism
The specified test does not test directly for misspecifications in the model
The test checks possible consequences of misspecifications present in the residuals
The specified test only controls the type I errors and not the type II errors
20
Can we evaluate type II errors?
It is well known that
T has a noncentral $\chi^2(df, ncp)$ distribution if the model is incorrect
Due to a misspecification in the model, the mean of the distribution of T increases by what is called the noncentrality parameter (NCP)
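As a rough illustration of how the critical value and the power follow from these two distributions, the sketch below handles only the case df = 1, where the noncentral chi2 variable is the square of a normal variable with mean sqrt(NCP) and both quantities have a closed form. The function names are hypothetical; for other degrees of freedom one would need a noncentral chi2 routine such as `scipy.stats.ncx2`.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def critical_value_df1(alpha=0.05):
    """Value C with Pr(chi2(1) > C) = alpha."""
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return z * z


def power_df1(ncp, alpha=0.05):
    """Pr(T > C) when T follows a noncentral chi2(1, ncp) distribution.

    T = (Z + sqrt(ncp))**2 with Z standard normal, so T > C means
    Z + sqrt(ncp) falls outside the interval (-sqrt(C), sqrt(C)).
    """
    root_c = sqrt(critical_value_df1(alpha))
    shift = sqrt(ncp)
    upper_tail = 1 - _STD_NORMAL.cdf(root_c - shift)
    lower_tail = _STD_NORMAL.cdf(-root_c - shift)
    return upper_tail + lower_tail


c_05 = critical_value_df1(0.05)     # about 3.84
power_correct_model = power_df1(0)  # equals alpha when NCP = 0
power_ncp_785 = power_df1(7.85)     # about .80
```

With NCP = 0 the power equals the type I error rate, and it rises towards 1 as the NCP grows.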
21
The Central and noncentral chi2 distribution and the power
22
The non-centrality parameter NCP
The NCP can be computed as shown by Satorra and Saris (1985) by generating population data and estimating the parameters with an incorrect model.
The difference between the two models is the misspecification in the model
In that case the value of the test statistic T is equal to the NCP for this misspecification given that the rest of the model is correct.
23
21 April 2023
An illustration
24
High Power (left) and low Power (right)
• High power is good for big errors, not for small errors.
• Low power is good for small errors, not for big errors.
• With loadings of .8 the left side applies. With loadings of .5 the right side applies for the same error.
25
The standard test is not good enough
The standard test can only detect misspecifications to which the test is sensitive (high power).
Rejection of the model can be due to very small misspecifications to which the test is very sensitive.
Non-rejection does not mean that the model is correct: the test can be insensitive to the misspecifications that are present.
26
The reasons for the problems
Only type I errors are taken into account
It is not a direct test of misspecifications but of consequences of misspecifications.
These consequences (residuals) are also affected by other characteristics of the model
27
This was not the mainstream problem
Hu and Bentler say:
“the decision for accepting or rejecting a particular model may vary as a function of sample size, which is certainly not desirable.”
This problem with the chi2 test has led to the development of a plethora of Fit indices.
28
Fit indices with cut-off criteria
29
Model evaluation with Fit indices
The traditional model evaluation method has been replaced by a similar procedure using fit indices (FIs).
For fit indices that have a theoretical upper value of 1 for good-fitting models (such as AGFI and GFI), the model is rejected if:
FI < Cfi
There are, however, also FIs for which a theoretical lower value of 0 indicates a good fit; for them the model is rejected if:
FI > Cfi
where Cfi is a fixed cut-off value developed specifically for each FI.
30
Criticism
For most indices the distribution is unknown. Critical values are argued for only on the basis of Monte Carlo experiments on specific cases.
Only consequences for the residuals are evaluated and not the misspecifications themselves.
31
Goodness of fit by approximation
Steiger (1990), Browne & Cudeck (1993) and MacCallum et al. (1996) have argued:
models are always simplifications of reality and are therefore always misspecified.
This has led to the most popular fit index nowadays: Root Mean Squared Error of Approximation or RMSEA
Although there is truth in this argument, this is not a good reason to completely change the approach to model testing.
32
This is not necessary
One has to design tests which take into account type I and type II errors so that:
Models with substantively relevant misspecifications are rejected and
Models with substantively irrelevant misspecifications are accepted.
33
Serious problems
The fit indices are functions of the fitting function
So they have the same serious problems as the standard test
Let us show that with very simple but fundamental models.
34
A model M0 with a substantively relevant misspecification
Population model M1 Hypothesized model M0
The misspecification is in the correlated disturbance terms
The size of the misspecification is .2
If the misspecification goes undetected, b21 is estimated as .2 instead of .0!
This model should be rejected
35
A model M0 with a substantively irrelevant misspecification
Population model M1 Hypothesized model M0
The misspecification is in the correlated factors
The size of the misspecification is .05
For all practical purposes this model should be accepted
36
Population data
y1 y2 x1 x2
y1 1.00
y2 0.20 1.00
x1 0.40 0.00 1.00
x2 0.00 0.10 0.00 1.00
37
Population study with different values of γ22
γ22 CHI2 power RMSEA CFI AGFI SRMR MI of ψ21
0.1 3.20 0.34 0.00 1.00 0.99 0.025 3.20
0.2 3.30 0.35 0.00 1.00 0.99 0.025 3.30
0.3 3.49 0.37 0.00 1.00 0.99 0.025 3.49
0.4 3.80 0.38 0.00 1.00 0.99 0.025 3.80
0.5 4.20 0.43 0.01 1.00 0.99 0.025 4.20
0.6 5.07 0.50 0.03 1.00 0.98 0.025 5.07
0.7 6.47 0.62 0.04 0.99 0.98 0.025 6.47
0.8 9.50 0.79 0.06 0.98 0.97 0.025 9.50
0.9 20.27 0.99 0.10 0.96 0.94 0.025 20.27
38
A model M0 with a substantively irrelevant misspecification
Population model M1 Hypothesized model M0
The misspecification is in the correlated factors
The size of the misspecification is .05
For all practical purposes this model should be accepted
39
Population study of the factor model
The better the measures are, the more likely it is that the model is rejected
This is not a very attractive test
(Figure: RMSEA and SRMR)
40
These examples show
The model with a substantively relevant misspecification will most likely not be rejected
The model with a substantively irrelevant misspecification will most likely be rejected
This is the opposite of what all of us would like
41
We see what should not happen
In contrast to what MacCallum, Browne and Sugawara (1996: 131) required:
A bad model will not be rejected
A good model will be rejected
42
Conclusion
We could say, paraphrasing Hu and Bentler:
“the decision for accepting or rejecting a particular model may vary as a function of irrelevant parameters, which is certainly not desirable.”
So there are reasons enough to consider alternative procedures for testing these models.
43
Can information about the power help?
We thought that information about the power of the test could help to test hypotheses about single parameters or small sets of parameters
Let me illustrate this with the last example
44
We want to test whether the factors measure the same thing, i.e. correlate perfectly
Population model M1 Hypothesized model M0
The misspecification is in the correlated factors
What is the power of the chi2 test if the size of the misspecification is .10?
45
The power of the test
46
Now we can design the test
Given that the loadings are around .8,
and we accept a type I error of .05 (α),
and we want to have a high power (.8) to detect a deviation of .1 or more,
then we should have a sample size of at least 300 cases.
In this case the model should be rejected if T > 3.84
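The design reasoning can be sketched numerically for a test with df = 1. Because T = n times the minimum of the fitting function, the NCP of a given misspecification grows in proportion to the sample size, so an NCP obtained at one sample size can be rescaled to the size that yields the desired power. The sketch uses the common normal approximation that ignores the negligible opposite tail; the function names and the input values in the example are hypothetical, not taken from the lecture.

```python
from math import ceil
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def required_ncp_df1(target_power=0.8, alpha=0.05):
    """Approximate NCP needed for the target power of a 1-df chi2 test.

    Uses NCP = (z_{1 - alpha/2} + z_{power})**2, ignoring the far tail.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(target_power)
    return (z_alpha + z_power) ** 2


def required_sample_size(ncp_ref, n_ref, target_power=0.8, alpha=0.05):
    """Rescale a reference NCP, obtained with n_ref cases, to the needed n."""
    return ceil(n_ref * required_ncp_df1(target_power, alpha) / ncp_ref)


needed_ncp = required_ncp_df1()              # about 7.85
# Hypothetical input: a misspecification that gives NCP = 4.0 with 100 cases
needed_n = required_sample_size(4.0, 100)
```

The reference NCP itself has to come from a population study of the model at hand, as described earlier for the Satorra and Saris procedure.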
47
Criticism
The problem with this test is that we have to suppose that there are no other misspecifications in the model
If there are other misspecifications they can be the cause of the rejection of the model
48
There are many other possible errors
49
The situation is even worse
The model test requires a test for all parameters
But the tests are unequally sensitive to misspecifications in different parameters
We can only expect that the test detects misspecifications to which the test is sensitive
This sensitivity depends on characteristics of the model that have nothing to do with the size of the misspecification.
50
For example
NCP
51
Model test is impossible
Given these differences in power between the different parameters, one can never formulate a test for all parameters of the model
If one increases the power,
minimal misspecifications in some parameters will lead to rejection of the model.
If one does not increase the power,
some misspecifications will never be detected.
52
Our proposal: back to the basics
We have to test for misspecifications in the models
In this test type I and type II errors (or power) have to be taken into account
A single serious misspecification is enough to reject the model
53
A halfway solution from 1987: Estimation of the EPC and MI
MI
EPC
54
What do we get for each constrained parameter?
For each constrained parameter we can get the EPC and the MI.
This means that we get an estimate of the misspecification (EPC) and the test statistic (MI) for this misspecification.
What we still miss is an indication of the power of the test for each EPC.
55
The power of the test
What counts as a relevant misspecification depends on the progress in a discipline
In the social sciences the following sizes of misspecifications are certainly relevant
.1 for a causal effect or correlated error
.4 for a loading
56
The power of the test
We call the critical value for a misspecification δ
So a misspecification larger than δ should be detected with high likelihood
So models with a misspecification of δ or more should be rejected with high likelihood = high power
It can be shown that
$NCP = (MI / EPC^2)\,\delta^2$
Given the NCP, one can determine the power
57
Decision table for detecting misspecifications of single parameters
                        Low power                          High power
MI not significant      Not informative: inconclusive (I)  No misspecification (nm)
MI significant          Misspecification present (m)       Inspect EPC (EPC)
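The NCP formula and the decision table can be combined in a few lines. This is a sketch under stated assumptions, not the JRule implementation: the MI is treated as a chi2 test with df = 1, "high power" is taken to mean at least .8 (the value used in the design example), and all function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_df1(ncp, alpha=0.05):
    """Power of a 1-df chi2 test for a given noncentrality parameter."""
    root_c = _STD_NORMAL.inv_cdf(1 - alpha / 2)  # sqrt of the critical value
    shift = sqrt(ncp)
    return (1 - _STD_NORMAL.cdf(root_c - shift)) + _STD_NORMAL.cdf(-root_c - shift)


def ncp_from_mi_epc(mi, epc, delta):
    """NCP = (MI / EPC**2) * delta**2 for a misspecification of size delta."""
    return (mi / epc ** 2) * delta ** 2


def judge(mi, epc, delta, alpha=0.05, high_power=0.8):
    """Return the cell label of the decision table: 'I', 'nm', 'm' or 'EPC'."""
    critical = _STD_NORMAL.inv_cdf(1 - alpha / 2) ** 2  # about 3.84
    significant = mi > critical
    powerful = power_df1(ncp_from_mi_epc(mi, epc, delta), alpha) >= high_power
    if significant:
        return "EPC" if powerful else "m"
    return "nm" if powerful else "I"


# Hypothetical MI and EPC values, with delta = .1 as for a causal effect
example_m = judge(mi=10.0, epc=0.20, delta=0.1)     # significant, low power
example_nm = judge(mi=1.0, epc=0.02, delta=0.1)     # not significant, high power
example_i = judge(mi=1.0, epc=0.20, delta=0.1)      # not significant, low power
example_epc = judge(mi=10.0, epc=0.05, delta=0.1)   # significant, high power
```

In the last cell the table only says to inspect the EPC, so the sketch returns that label and leaves the comparison of the EPC with δ to the researcher.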
58
An illustration: the model of Blok and Saris
59
What the traditional tests tell
chi2 = 161 with df = 9, SRMR = .073, RMSEA = .21, CFI = .95 and AGFI = .67.
According to the suggested cut-off values, all fit indices, with the sole exception of the CFI, would reject the model.
How can we be sure of this conclusion?
It is also possible that there are only very small misspecifications to which these test statistics and fit indices are very sensitive.
60
Testing all parameters with JRule
(William van der Veld et al.)
61
Test for variances and covariances
62
The corrected model
63
What the traditional tests tell
chi2 = 3.88 with df = 5 (p-value = .57), SRMR = .0076, RMSEA = .0, CFI = 1.0 and AGFI = .98.
Now all indices suggest that the model fits the data.
However, this decision is also doubtful.
It is possible that the power of the tests is so low for this model that the misspecifications are not detected.
64
Testing all parameters with JRule
65
Conclusions
The traditional chi2 test does not provide the information that is needed
It does not test for misspecifications and
It ignores the power of the test
The fit indices have the same problems
Our new approach directly tests for misspecifications and takes the power of the test into account.
66
Conclusions
For any parameter, our new approach can determine whether the parameter is misspecified, whether it is not, or whether there is not enough information to decide.
If a misspecification is detected, the model should be rejected
If the power is too low to make a decision, further research is needed to test the quality of the model
The latter option is completely ignored in the traditional model tests