Tests of structural equation models do not work: What to do?

1

Willem E. Saris

ESADE

Universitat Ramon Llull

2

Concern about testing

I have been worried about the testing procedures in SEM since my first contact with the field

More than 25 years ago, Albert Satorra and I wrote our first paper on the power of the test.

Until recently, our worries were not shared by the SEM community (publication in SEM)

I am very pleased that today I have the opportunity to convince you of our point of view

3

Importance of testing

The purpose of SEM is to estimate the strength of relationships between variables correcting for measurement error

All estimates are conditional on the specified model

Therefore testing the models is essential for SEM

4

Content of my lecture

•Brief introduction to SEM and the standard test
•Our criticism
•The alternative direction of the SEM community: fit indices
•The special case of RMSEA
•Why fit indices are not the solution
•Back to the basics
•An illustration

5

Introduction SEM by example

A frequently discussed issue nowadays is whether Social Trust is related to Political Trust.

Both latent variables are normally measured by three indicators

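•Path analysis suggests, for indicator i of factor k and indicator j of factor m:

• $\sigma_{ij} = \lambda_{ik}\lambda_{jm}$ if k=m

• $\sigma_{ij} = \lambda_{ik}\phi_{km}\lambda_{jm}$ if k≠m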
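As a minimal sketch of these two rules, the snippet below computes the implied correlation for a pair of indicators that each load on a single factor. The function name and its arguments are hypothetical; the loadings of .8 and the factor correlation of .5 are the values used in the example that follows.

```python
def implied_correlation(loading_i, loading_j, same_factor, factor_corr):
    """Tracing rule for two indicators that each load on exactly one factor."""
    if same_factor:
        # k = m: the only connecting path runs through the common factor.
        return loading_i * loading_j
    # k != m: the path also passes through the correlation between the factors.
    return loading_i * factor_corr * loading_j


within = implied_correlation(0.8, 0.8, same_factor=True, factor_corr=0.5)
between = implied_correlation(0.8, 0.8, same_factor=False, factor_corr=0.5)
print(round(within, 2), round(between, 2))
```

With these inputs the two values correspond to the .64 and .32 entries of the correlation matrix shown a few slides later.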

6

Estimation of effects

The parameters are estimated by minimizing the following quadratic form:

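$f = \sum_{i,j} w_{ij} (s_{ij} - \sigma_{ij})^2$

where $s_{ij}$ is an observed correlation, $\sigma_{ij}$ the corresponding correlation implied by the model, and $w_{ij}$ a weight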

The estimates are the values which minimize this function

The value of this function at its minimum is denoted by f0
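The quadratic form can be sketched in a few lines of Python. This is only an illustration under stated assumptions: unit weights are used by default (which corresponds to unweighted least squares, not necessarily the estimator used in the lecture), the model is a hypothetical one-factor model with three equal loadings, and the helper names are invented for this example.

```python
def fit_function(observed, implied, weights=None):
    """Weighted sum of squared residuals over the lower triangle of the matrices."""
    total = 0.0
    for i in range(len(observed)):
        for j in range(i + 1):
            w = 1.0 if weights is None else weights[i][j]
            total += w * (observed[i][j] - implied[i][j]) ** 2
    return total


def one_factor_implied(loading, size=3):
    """Implied correlations for `size` indicators with the same loading (assumption)."""
    return [[1.0 if i == j else loading * loading for j in range(size)]
            for i in range(size)]


observed = [[1.00, 0.64, 0.64],
            [0.64, 1.00, 0.64],
            [0.64, 0.64, 1.00]]

f_at_08 = fit_function(observed, one_factor_implied(0.8))  # loadings that reproduce the data
f_at_07 = fit_function(observed, one_factor_implied(0.7))  # loadings that do not
print(f_at_08, f_at_07)
```

The function is (numerically) zero at the loadings that reproduce the observed correlations and positive elsewhere, which is why its minimum f0 can serve as a measure of misfit.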

7

Imagine that this is the observed correlation matrix

Correlation Matrix

y1 y2 y3 y4 y5 y6

-------- -------- -------- -------- -------- --------

y1 1.00

y2 0.64 1.00

y3 0.64 0.64 1.00

y4 0.32 0.32 0.32 1.00

y5 0.32 0.32 0.32 0.64 1.00

y6 0.32 0.32 0.32 0.64 0.64 1.00

8

The estimates

LAMBDA-Y

F 1 F 2

-------- --------

y1 0.80 - -

y2 0.80 - -

y3 0.80 - -

y4 - - 0.80

y5 - - 0.80

y6 - - 0.80

Correlation of F1 with F2 = 0.50

We can estimate the relationship between latent variables and observed variables but also between latent variables

9

The residuals = differences between observed and expected correlations

Residuals

y1 y2 y3 y4 y5 y6

-------- -------- -------- -------- -------- --------

y1 0.00

y2 0.00 0.00

y3 0.00 0.00 0.00

y4 0.00 0.00 0.00 0.00

y5 0.00 0.00 0.00 0.00 0.00

y6 0.00 0.00 0.00 0.00 0.00 0.00

10

Imagine that the model in the population is different

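[Path diagram: F1 and F2 correlate .5; indicators Y1 to Y6 with error terms e1 to e6; path coefficients of .8 and .2 are shown]

Population correlation matrix

       y1     y2     y3     y4     y5     y6
y1   1.00
y2    .64   1.00
y3    .64    .64   1.00
y4    .48    .48    .48   1.00
y5    .32    .32    .32    .72   1.00
y6    .32    .32    .32    .72    .64   1.00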
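The population correlations above can be reproduced with a short sketch. It assumes that the extra coefficient of .2 is a cross-loading of Y4 on F1; this is an inference from the correlations themselves, since the diagram did not survive, and the function name is hypothetical. Diagonal elements are set to 1 because the variables are standardized.

```python
def implied_correlations(loadings, factor_corr):
    """Off-diagonal elements of Lambda * Phi * Lambda'; diagonal fixed at 1."""
    p = len(loadings)
    k = len(factor_corr)
    sigma = [[1.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i != j:
                sigma[i][j] = sum(
                    loadings[i][a] * factor_corr[a][b] * loadings[j][b]
                    for a in range(k)
                    for b in range(k)
                )
    return sigma


phi = [[1.0, 0.5],
       [0.5, 1.0]]
loadings = [
    [0.8, 0.0],  # y1
    [0.8, 0.0],  # y2
    [0.8, 0.0],  # y3
    [0.2, 0.8],  # y4, with the assumed cross-loading of .2 on F1
    [0.0, 0.8],  # y5
    [0.0, 0.8],  # y6
]

sigma = implied_correlations(loadings, phi)
for row_index, row in enumerate(sigma):
    print(" ".join(f"{value:5.2f}" for value in row[: row_index + 1]))
```

Setting the cross-loading to 0 gives back the first correlation matrix of the lecture, so the only difference between the two population models is this single parameter.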

11

Now the estimates are also different

These estimates deviate somewhat from the values in the population

The deviations are due to the misspecification

Can we detect that the hypothesized model is wrong?

LAMBDA-Y (standard errors in parentheses, t-values below them)

            F 1        F 2
         --------   --------
VAR 1      0.80       - -
          (0.04)
          19.96
VAR 2      0.80       - -
          (0.04)
          19.96
VAR 3      0.80       - -
          (0.04)
          19.96
VAR 4      - -        0.95
                     (0.04)
                     26.14
VAR 5      - -        0.77
                     (0.04)
                     19.42
VAR 6      - -        0.77
                     (0.04)
                     19.42

12

The fitted residuals

Based on these estimates the expected correlations can be calculated.

The residuals (observed-expected correlations) can indicate that the model is misspecified

In this case the residuals are:

         VAR 1   VAR 2   VAR 3   VAR 4   VAR 5   VAR 6
        ------  ------  ------  ------  ------  ------
VAR 1     0.00
VAR 2     0.00    0.00
VAR 3     0.00    0.00    0.00
VAR 4     0.02    0.02    0.02    0.00
VAR 5    -0.05   -0.05   -0.05   -0.01    0.00
VAR 6    -0.05   -0.05   -0.05   -0.01    0.05    0.00

13

When should the model be rejected?

Residuals can differ from zero due to misspecification of the model

But also due to sampling fluctuations.

So when should the model be rejected?

14

The quality the test should have

MacCallum, Browne and Sugawara (1996: 131)

“if the model is truly a good model in terms of its fit in the population, we wish to avoid concluding that the model is a bad one.

Alternatively, if the model is truly a bad one, we wish to avoid concluding that it is a good one.”

15

In statistical terms

Required is:

A small probability of a type I error, i.e. the probability of rejection of a good model

A small probability of a type II error i.e. the probability of acceptance of a bad model

16

Bad models are misspecified models

Hu and Bentler (1998: 427):

“a model is said to be misspecified when

(a) one or more parameters are estimated whose population values are zeros (i.e. an over-parameterised misspecified model)

(b) one or more parameters are fixed to zeros whose population values are non-zeros (i.e. an under-parameterised misspecified model)

(c) or both.”

17

Definition of the size of a misspecification

The size of the misspecification is the absolute difference between

the true value of the parameter and

the value specified in the analysis

In the above example the size of the misspecification was .2

18

The standard chi2 test

It can be shown that under very general conditions:

the test statistic $T = n f_0$ has a $\chi^2(df)$ distribution if the model is correct

The model is rejected if $T > C$

where $C$ is the value for which

$\Pr(\chi^2(df) > C) = \alpha$
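The decision rule can be sketched with the standard library only. Rejecting when T exceeds C is equivalent to rejecting when the p-value of T is below α, which is what the hypothetical helper `chi2_test` does. The chi-square tail probability is computed with a plain series expansion of the incomplete gamma function, chosen here for self-containment rather than numerical sophistication. The two test statistics used as input are the ones reported later in the lecture for the Blok and Saris illustration.

```python
import math


def regularized_lower_gamma(a, x):
    """Series expansion of P(a, x), the regularized lower incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-16 * total and k < 10000:
        term *= x / (a + k)
        total += term
        k += 1
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def chi2_sf(x, df):
    """P(chi2(df) > x), clipped at 0 to guard against rounding error."""
    return max(0.0, 1.0 - regularized_lower_gamma(df / 2.0, x / 2.0))


def chi2_test(t, df, alpha=0.05):
    """Return the p-value of T and whether the model is rejected at level alpha."""
    p_value = chi2_sf(t, df)
    return p_value, p_value < alpha


print(chi2_test(3.88, 5))   # corrected model in the illustration
print(chi2_test(161.0, 9))  # initial model in the illustration
```

Note that the rule only fixes the probability of rejecting a correct model; nothing in it says how likely an incorrect model is to be rejected.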

19

Criticism

The specified test does not test directly for misspecifications in the model

The test checks possible consequences of misspecifications present in the residuals

The specified test only controls the type I errors and not the type II errors

20

Can we evaluate type II errors?

It is well known that

T has a noncentral $\chi^2(df, ncp)$ distribution if the model is incorrect

Due to a misspecification in the model, the mean of the distribution of T increases by what is called the noncentrality parameter (NCP)
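Given df, α and the NCP, the power is the probability that a noncentral chi-square variable exceeds the critical value of the central one. The sketch below is one way to compute it without external libraries: it writes the noncentral distribution as a Poisson-weighted mixture of central chi-square distributions. All function names are hypothetical, and the fixed number of mixture terms is an assumption that is adequate for NCP values of moderate size only.

```python
import math


def chi2_sf(x, df):
    """P(chi2(df) > x) via a series for the regularized incomplete gamma function."""
    a, half = df / 2.0, x / 2.0
    if half <= 0.0:
        return 1.0
    term = total = 1.0 / a
    k = 1
    while term > 1e-16 * total and k < 10000:
        term *= half / (a + k)
        total += term
        k += 1
    lower = total * math.exp(-half + a * math.log(half) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def chi2_critical(df, alpha=0.05):
    """Critical value C with P(chi2(df) > C) = alpha, found by bisection."""
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def power(ncp, df, alpha=0.05, terms=500):
    """P(noncentral chi2(df, ncp) > C) as a Poisson mixture of central chi2 tails."""
    c = chi2_critical(df, alpha)
    weight = math.exp(-ncp / 2.0)  # Poisson probability of 0 with mean ncp / 2
    total = 0.0
    for j in range(terms):
        total += weight * chi2_sf(c, df + 2 * j)
        weight *= (ncp / 2.0) / (j + 1)
    return total


print(round(power(3.20, 2), 2), round(power(9.50, 2), 2))
```

With 2 degrees of freedom (an assumption about the small population study presented later in the lecture), NCP values of 3.20 and 9.50 give a power of roughly .34 and .79, in line with the power column reported there.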

21

The Central and noncentral chi2 distribution and the power

22

The non-centrality parameter NCP

The NCP can be computed as shown by Satorra and Saris (1985) by generating population data and estimating the parameters with an incorrect model.

The difference between the two models is the misspecification in the model

In that case the value of the test statistic T is equal to the NCP for this misspecification given that the rest of the model is correct.

23

An illustration

24

High Power (left) and low Power (right)

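•High power is desirable for large errors, not for small ones: with high power even small errors lead to rejection.

•Low power is acceptable for small errors, not for large ones: with low power even large errors can go undetected.

•With loadings of .8 the left side applies; with loadings of .5 the right side applies, for the same error.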

25

The standard test is not good enough

The standard test can only detect misspecifications to which the test is sensitive (high power).

Rejection of the model can be due to very small misspecifications to which the test is very sensitive

Non-rejection does not mean that the model is correct: the test can be insensitive to the misspecifications present

26

The reasons for the problems

Only type I errors are taken into account

It is not a direct test of misspecifications but of consequences of misspecifications.

These consequences (residuals) are also affected by other characteristics of the model

27

This was not the mainstream problem

Hu and Bentler say:

“the decision for accepting or rejecting a particular model may vary as a function of sample size, which is certainly not desirable.”

This problem with the chi2 test has led to the development of a plethora of Fit indices.

28

Fit indices with cut-off criteria

29

Model evaluation with Fit indices

The traditional model evaluation method has been replaced by a similar procedure using fit indices (FI).

For fit indices that have a theoretical upper value of 1 for good-fitting models (such as AGFI and GFI), the model is rejected if: FI < Cfi

There are, however, also FIs for which a theoretical lower value of 0 indicates a good fit; for them the model is rejected if: FI > Cfi

where Cfi is a fixed cut-off value developed specifically for each FI.
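The two rejection rules can be written as one small function. The cut-off values in this sketch are placeholders chosen for illustration, not the values from the lecture's cut-off table (which did not survive in this transcript), and the names `CUTOFFS` and `rejects` are hypothetical. The index values fed in are the ones reported later in the lecture for the Blok and Saris model.

```python
# "high": values near 1 indicate good fit; "low": values near 0 indicate good fit.
# The cut-offs themselves are illustrative assumptions.
CUTOFFS = {
    "CFI": ("high", 0.95),
    "AGFI": ("high", 0.90),
    "RMSEA": ("low", 0.05),
    "SRMR": ("low", 0.05),
}


def rejects(index, value, cutoffs=CUTOFFS):
    """Apply FI < Cfi for 'high' indices and FI > Cfi for 'low' indices."""
    direction, cutoff = cutoffs[index]
    if direction == "high":
        return value < cutoff
    return value > cutoff


reported = {"SRMR": 0.073, "RMSEA": 0.21, "CFI": 0.95, "AGFI": 0.67}
decisions = {name: rejects(name, value) for name, value in reported.items()}
print(decisions)
```

The procedure has exactly the same structure as the chi-square test: a single number derived from the residuals is compared with a fixed threshold.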

30

Criticism

For most indices the distribution is unknown. Critical values are justified only by Monte Carlo experiments based on specific cases

Only consequences for the residuals are evaluated and not the misspecifications themselves.

31

Goodness of fit by approximation

Steiger (1990), Browne & Cudeck (1993) and MacCallum et al. (1996) have argued: models are always simplifications of reality and are therefore always misspecified. This has led to the most popular fit index nowadays: the Root Mean Square Error of Approximation, or RMSEA

Although there is truth in this argument, this is not a good reason to completely change the approach to model testing.

32

This is not necessary

One has to design tests which take into account type I and type II errors so that:

Models with substantially relevant misspecifications should be rejected and

Models with substantially irrelevant misspecifications should be accepted.

33

Serious problems

The fit indices are functions of the fitting function

So they have the same serious problems as the standard test

Let us show this with very simple but fundamental models.
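RMSEA is a convenient example of an index that is a direct transformation of the test statistic. The sketch below uses one commonly cited definition, sqrt(max(T - df, 0) / (df (n - 1))); this formula is an assumption here, since the lecture does not spell it out, and some programs divide by n instead of n - 1. The population value of the fitting function and the sample sizes are hypothetical.

```python
import math


def t_statistic(n, f0):
    """T = n * f0, the statistic of the standard chi-square test."""
    return n * f0


def rmsea(t, df, n):
    """One common RMSEA definition, written as a function of T, df and n."""
    return math.sqrt(max(t - df, 0.0) / (df * (n - 1)))


# Same hypothetical discrepancy f0, two hypothetical sample sizes.
f0 = 0.01
for n in (200, 2000):
    t = t_statistic(n, f0)
    print(n, round(t, 2), round(rmsea(t, df=2, n=n), 3))
```

Because the index depends on the data only through T, anything that makes T insensitive or oversensitive to a particular misspecification carries over to the index.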

34

A model M0 with a substantively relevant misspecification

[Path diagrams: population model M1 and hypothesized model M0]

The misspecification is in the correlated disturbance terms

The size of the misspecification is .2

If the misspecification is not detected, one concludes that b21 = .2 rather than .0!

This model should be rejected

35

A model M0 with a substantively irrelevant misspecification


The misspecification is in the correlated factors


For all practical purposes this model should be accepted

36

Population data

y1 y2 x1 x2

y1 1.00

y2 0.20 1.00

x1 0.40 0.00 1.00

x2 0.00 0.10 0.00 1.00

37

Population study with different values of γ22

γ22    CHI2    power   RMSEA   CFI    AGFI   SRMR    MI of ψ21
0.1     3.20   0.34    0.00    1.00   0.99   0.025    3.20
0.2     3.30   0.35    0.00    1.00   0.99   0.025    3.30
0.3     3.49   0.37    0.00    1.00   0.99   0.025    3.49
0.4     3.80   0.38    0.00    1.00   0.99   0.025    3.80
0.5     4.20   0.43    0.01    1.00   0.99   0.025    4.20
0.6     5.07   0.50    0.03    1.00   0.98   0.025    5.07
0.7     6.47   0.62    0.04    0.99   0.98   0.025    6.47
0.8     9.50   0.79    0.06    0.98   0.97   0.025    9.50
0.9    20.27   0.99    0.10    0.96   0.94   0.025   20.27

38

A model M0 with a substantively irrelevant misspecification




For all practical purposes this model should be accepted

39

Population study of the factor model

The better the measures are, the more likely it is that the model is rejected

This is not a very attractive test

[Figure: results of the population study; the legend includes RMSEA and SRMR]

40

These examples show

The model with a substantively relevant misspecification will most likely not be rejected

The model with a substantively irrelevant misspecification will most likely be rejected

This is the opposite of what all of us would like

41

We see what should not happen

In contrast to what MacCallum, Browne and Sugawara (1996: 131) required:

A bad model will not be rejected

A good model will be rejected

42

Conclusion

We could say, paraphrasing Hu and Bentler:

“the decision for accepting or rejecting a particular model may vary as a function of irrelevant parameters, which is certainly not desirable.”

So there are reasons enough to consider alternative procedures for testing these models.

43

Can information about the power help?

We thought that information about the power of the test could help to test hypotheses about single parameters or small sets of parameters

Let me illustrate this by the last example

44

We want to test whether the factors measure the same thing, i.e. correlate perfectly

What is the power of the chi2 test if the size of the misspecification is .10?

45

The power of the test

46

Now we can design the test

Given that the loadings are around .8

And we accept a type I error of .05 (α)

And we want to have a high power (.8) to detect a deviation of .1 or more

Then we should have a sample size of at least 300 cases

In this case the model should be rejected if T > 3.84
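The design logic can be sketched for a test with 1 degree of freedom. The code searches for the NCP that gives the target power and converts it into a sample size, assuming that the NCP grows proportionally with the number of cases (in line with T = n f0; some programs use n - 1). The value `ncp_per_case` is purely hypothetical: in practice it has to come from a population study of the model and the misspecification of interest, so the sketch does not attempt to reproduce the 300 cases mentioned above.

```python
import math
from statistics import NormalDist


def power_df1(ncp, alpha=0.05):
    """Power of a 1-df chi-square test for a given noncentrality parameter."""
    z = NormalDist()
    crit = z.inv_cdf(1.0 - alpha / 2.0)  # crit ** 2 is about 3.84 for alpha = .05
    shift = math.sqrt(ncp)
    return (1.0 - z.cdf(crit - shift)) + z.cdf(-crit - shift)


def required_ncp(target_power=0.80, alpha=0.05):
    """Smallest NCP reaching the target power, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if power_df1(mid, alpha) < target_power:
            lo = mid
        else:
            hi = mid
    return hi


def required_sample_size(ncp_per_case, target_power=0.80, alpha=0.05):
    """Cases needed if every case contributes ncp_per_case to the NCP (assumption)."""
    return math.ceil(required_ncp(target_power, alpha) / ncp_per_case)


print(round(required_ncp(), 2))
print(required_sample_size(ncp_per_case=0.02))  # hypothetical contribution per case
```

The same calculation shows why the answer depends on the loadings: lower loadings reduce the NCP contributed by each case, so more cases are needed for the same power.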

47

Criticism

The problem of this test is that we have to suppose that there are no other misspecifications in the model

If there are other misspecifications they can be the cause of the rejection of the model

48

There are many other possible errors

49

The situation is even worse

The model test requires a test for all parameters

But the tests are unequally sensitive to misspecifications in different parameters

We can only expect that the test detects misspecifications to which the test is sensitive

This sensitivity depends on characteristics of the model that have nothing to do with the size of the misspecification.

50

For example

NCP

51

Model test is impossible

Given these differences in power between the different parameters, one can never formulate a test for all parameters of the model

If one increases the power

minimal misspecifications in some parameters will lead to rejection of the model.

If one does not increase the power

some misspecifications will never be detected.

52

Our proposal: back to the basics

We have to test for misspecifications in the models

In this test type I and type II errors (or power) have to be taken into account

A single serious misspecification is enough to reject the model

53

A halfway solution from 1987: Estimation of the expected parameter change (EPC) and the modification index (MI)

MI

EPC

54

What do we get for each constrained parameter ?

For each constrained parameter we can get the EPC and the MI.

This means that we get an estimate of the misspecification (EPC) and the test statistic (MI) for this misspecification.

What we still miss is an indication of the power of the test for each EPC.

55


What a relevant misspecification is depends on the progress in a discipline

In the social sciences the following sizes of misspecifications are certainly relevant

.1 for a causal effect or correlated error

.4 for a loading

56


We call the critical value for a misspecification δ. So a misspecification larger than δ should be detected with high likelihood

So models with a misspecification of δ should be rejected with high likelihood = high power

It can be shown that

$NCP = (MI / EPC^2)\,\delta^2$

Given the NCP, one can determine the power
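As a minimal sketch of these two steps, the code below computes the NCP from an MI, an EPC and a chosen δ, and then the power of the corresponding 1-df test. The MI and EPC values are hypothetical, and the function names are invented for this example; the power formula is the standard result for a noncentral chi-square with 1 degree of freedom, expressed through the normal distribution.

```python
import math
from statistics import NormalDist


def ncp_for_delta(mi, epc, delta):
    """NCP = (MI / EPC^2) * delta^2 for a misspecification of size delta."""
    return (mi / epc ** 2) * delta ** 2


def power_df1(ncp, alpha=0.05):
    """Power of the 1-df test (the MI test) for a given NCP."""
    z = NormalDist()
    crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = math.sqrt(ncp)
    return (1.0 - z.cdf(crit - shift)) + z.cdf(-crit - shift)


mi, epc, delta = 4.0, 0.2, 0.1  # hypothetical MI and EPC; delta = .1 as for a causal effect
ncp = ncp_for_delta(mi, epc, delta)
print(round(ncp, 2), round(power_df1(ncp), 2))
```

In this hypothetical case the MI is just significant, but the power to detect a misspecification of .1 is low, so a non-significant MI for this parameter would have carried little information.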

57

Decision table for the detection of misspecifications in single parameters

                                  Power
Modification index    Low                                   High
-------------------   -----------------------------------   ------------------------
Not significant       Not informative. Inconclusive (I)     No misspecification (nm)
Significant           Misspecification present (m)          Inspect EPC (EPC)
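The four cells of the decision table translate directly into a small function. This is a sketch, not the JRule implementation: the function name is hypothetical, the critical value of 3.84 corresponds to α = .05 with 1 degree of freedom, and treating a power of .8 or more as "high" is an assumption taken from the design example earlier in the lecture.

```python
def judge_parameter(mi, power, critical_value=3.84, high_power=0.80):
    """Classify one constrained parameter from its MI and the power of the MI test."""
    significant = mi > critical_value
    powerful = power >= high_power
    if not significant and not powerful:
        return "inconclusive"              # (I): the test could not have detected much
    if not significant and powerful:
        return "no misspecification"       # (nm): a relevant error would have shown up
    if significant and not powerful:
        return "misspecification present"  # (m): significant despite low power
    return "inspect EPC"                   # (EPC): the test may be oversensitive


for mi, power in [(2.0, 0.30), (2.0, 0.90), (10.0, 0.30), (10.0, 0.90)]:
    print(mi, power, judge_parameter(mi, power))
```

The "inconclusive" outcome is the one that has no counterpart in the traditional test, where a non-significant result is read as support for the model regardless of power.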

58

An illustration: the model of Blok and Saris

59

What the traditional tests tell

chi2 = 161 with df = 9, SRMR = .073, RMSEA = .21, CFI = .95 and AGFI = .67.

According to the suggested cut-off values, all fit indices, with the only exception of CFI, would reject the model.

How can we be sure of this conclusion?

It is also possible that there are only very small misspecification(s) for which these test statistics and fit indices are very sensitive.

60

Testing all parameters with JRule (William van der Veld et al.)

61

Test for variances and covariances

62

The corrected model

63

What the traditional tests tell

chi2 = 3.88, with df=5 (p-value =.57), SRMR =.0076, RMSEA = .0, CFI= 1.0 AGFI = .98.

Now all indices suggest that the model fits the data.

However, this decision is also doubtful

It is possible that the power of the tests is so low for this model that the misspecifications are not detected.

64

Testing all parameters with JRule

65

Conclusions

The traditional chi2 test does not provide the information that is needed

It does not test for misspecifications and

It ignores the power of the test

The fit indices have the same problems

Our new approach directly tests for misspecifications and takes the power of the test into account.

66

Conclusions

For any parameter, our new approach can detect whether the parameter is misspecified, whether it is not, or whether there is not enough information to decide.

If a misspecification is detected, the model should be rejected

If the power is too low to make a decision, further research is needed to test the quality of the model

The latter option is completely ignored in the traditional model tests

Date post:	23-Jan-2016