CONSTRUCT VALIDITY OF ASSESSMENT CENTRES:
LEST WE FORGET THAT ASSESSMENT CENTRES ARE STILL PSYCHOLOGICAL ASSESSMENTS
31st Annual ACSG Conference • March 2011
What is known about the construct validity currently:
Over the last 50 years, assessment centres (ACs) have been popular in the assessment of individual differences for managerial development purposes
Multi-occupation, multi-company application with high face validity
In AC post-exercise dimension ratings (PEDRs), the exercise effect is more pervasive than cross-situational stability in candidate ratings
Bowler, M. C., & Woehr, D. J. (2006). A meta-analytic evaluation of the impact of dimension and exercise factors on assessment center ratings. Journal of Applied Psychology, 91, 1114–1124.
Lance, Lambert, Gewin, Lievens, and Conway (2004) found in a meta-analysis that exercise effects explain almost three times more variance than dimension effects
Problematic for construct validity: PEDRs are a function of exercise design and not of person competencies
What is known about the construct validity currently:
Recently there have been two schools of thought on assessing the construct validity of ACs: Confirmatory Factor Analysis (CFA) [MTMM] and Generalizability Theory
FOUR basic models within the CFA tradition:
• Correlated Dimensions, Correlated Exercises (CDCE) model [MTMM]
• One-Dimension, Correlated Exercises (1DCE) model
• Uncorrelated Dimensions, Correlated Exercises, plus g (UDCE + g) model
• Correlated Dimensions, Correlated Uniqueness (CDCU) model
Lance, Woehr & Meade (2007). A Monte Carlo Investigation of Assessment Center Construct Validity Models. Organizational Research Methods, 10(3), 430-448
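To make the competing specifications concrete, here is a minimal sketch of how the CDCE and 1DCE models could be written with the Python semopy package; the indicator names (d1_ex1, etc.) are hypothetical PEDRs, and in practice additional identification constraints (e.g., on trait–method factor correlations) would still have to be imposed:

```python
import semopy  # assumed available; uses lavaan-style model syntax

# CDCE: correlated dimension factors AND correlated exercise factors;
# each hypothetical PEDR (d<dimension>_ex<exercise>) loads on one of each.
CDCE = """
Dim1 =~ d1_ex1 + d1_ex2 + d1_ex3
Dim2 =~ d2_ex1 + d2_ex2 + d2_ex3
Ex1  =~ d1_ex1 + d2_ex1
Ex2  =~ d1_ex2 + d2_ex2
Ex3  =~ d1_ex3 + d2_ex3
"""

# 1DCE: a single general dimension plus correlated exercise factors.
ONE_DCE = """
G   =~ d1_ex1 + d1_ex2 + d1_ex3 + d2_ex1 + d2_ex2 + d2_ex3
Ex1 =~ d1_ex1 + d2_ex1
Ex2 =~ d1_ex2 + d2_ex2
Ex3 =~ d1_ex3 + d2_ex3
"""

# model = semopy.Model(CDCE)
# model.fit(pedr_frame)            # pedr_frame: pandas DataFrame of PEDRs
# print(semopy.calc_stats(model))  # chi-square, CFI, RMSEA, ...
```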
Advantages of the CFA approach: it partitions out error variance, and ALSO partitions out exercise effects
Thus PEDRs are modelled as a function of both exercise and dimension effects
However, the CDCE model is technically difficult to estimate (empirical under-identification)
Construct validity is a prerequisite before partitioning out exercise effects
Thus the critical first step was to assess the construct validity of the dimensions with actual DAC data
An Example: Achievement Motivation and Financial Perspective
An Example: Achievement Motivation (AM)
DIMENSION: ACHIEVEMENT MOTIVATION
Exercises: Analysis Problem (AP); Simulated In-Basket (SIB)
Traits and indicators:
• Innovation: IN_AP
• Energy: EN_AP
• Process Skills: PS_AP, PS_SIB
[Figure: correlation matrix of the AM indicators]
AM, Option 1: the CDCE model would be preferable. WHY? It differentiates sources of variance:
[Path diagram: the Achievement Motivation dimension factor and two correlated exercise factors (Analysis Problem, Simulated In-Basket) jointly loading on the indicators IN_AP, EN_AP, PS_AP and PS_SIB]
• Empirical under-identification: we have 13 parameters to estimate in the model, yet only 10 pieces of information in the covariance matrix
• Thus we have too many model parameters to gauge with too little information (−3 df)
• Similar to the equation X + Y = 6: unlimited possible combinations solve it
[Path diagram detail: factor loadings (λ), measurement errors (δ1–δ4) and the factor correlation (φ21) linking the four indicators to the Achievement Motivation, Analysis Problem and Simulated In-Basket factors]
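To make the parameter counting above concrete, here is a small sketch in plain Python (the function name is ours, purely illustrative):

```python
def cfa_df(n_indicators: int, n_free_params: int) -> int:
    """Degrees of freedom of a CFA model: unique elements of the observed
    covariance matrix, p(p+1)/2, minus the number of free parameters."""
    information = n_indicators * (n_indicators + 1) // 2
    return information - n_free_params

# The AM CDCE model above: 4 indicators give 10 pieces of information,
# but 13 free parameters must be estimated.
print(cfa_df(4, 13))  # -3  -> empirically under-identified
```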
AM: Technical Problems
The Simulated In-Basket measures only one dimension (trait): Process Skills
Whereas Innovation, Energy and Process Skills are all gauged with the Analysis Problem exercise
For a basic CFA we need at least three indicators for each dimension
However, if we model a single dimension plus a single exercise effect, we need a minimum of five indicators
This has DAC design implications if we want to gauge the measurement (exercise) effect in addition to the dimension effects
A literature review by Lievens and Conway (2001) suggests a median of three exercises and five dimensions
AM: Option 2
[Path diagram: the Achievement Motivation factor plus a single global method-effect factor, both loading on the four indicators]
• Still not enough degrees of freedom; we need at least 5 indicators (10 possible pieces of information, yet 12 parameters to estimate, thus −2 df)
• SOLUTION: include more exercises per dimension
Financial Perspective (FP)
DIMENSION: FINANCIAL PERSPECTIVE
Exercises: Analysis Problem (AP); Group Discussion (GD); One:One (ONE); Simulated In-Basket (SIB)
Traits and indicators:
• Broker Market (BM): BM_AP, BM_GD, BM_ONE, BM_SIB
• Cross Up Selling (CUS): CUS_AP, CUS_GD, CUS_ONE, CUS_SIB
• Profit (PROF): PROF_AP, PROF_GD, PROF_ONE, PROF_SIB
[Figure: correlation matrix of the FP indicators, showing large correlations between exercises]
FP: CDCE model
[Path diagram: three correlated trait factors (Broker Market, Cross Up Selling, Profit) and four correlated exercise factors (Analysis Problem, Group Discussion, One:One, Simulated In-Basket) loading on the twelve indicators]
FP: The CDCE model did not converge, although there were enough df (78 − 44 = 34 df)
Singularity problems, chiefly because of multicollinearity
Go back to the dimension level only, without exercise effects
Thus model Broker Market, Cross Up Selling and Profit individually
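A quick way to screen for this kind of singularity before fitting is to inspect the eigenvalues of the indicator correlation matrix; a minimal sketch assuming numpy/pandas and a hypothetical file holding the twelve FP ratings:

```python
import numpy as np
import pandas as pd

# Hypothetical input: one column per FP indicator (BM_*, CUS_*, PROF_*).
ratings = pd.read_csv("fp_pedrs.csv")

R = ratings.corr().to_numpy()
eigenvalues = np.linalg.eigvalsh(R)

print("smallest eigenvalue:", eigenvalues.min())  # near 0 -> near-singular
print("condition number:", eigenvalues.max() / eigenvalues.min())
```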
CFA: Broker Market: FIT
CHI-SQUARE = 2.600 BASED ON 2 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS 0.27247
THE NORMAL THEORY RLS CHI-SQUARE FOR THIS ML SOLUTION IS 2.428.
FIT INDICES
-----------
BENTLER-BONETT NORMED FIT INDEX = 0.942
BENTLER-BONETT NON-NORMED FIT INDEX = 0.954
COMPARATIVE FIT INDEX (CFI) = 0.985
BOLLEN'S (IFI) FIT INDEX = 0.986
MCDONALD'S (MFI) FIT INDEX = 0.997
JORESKOG-SORBOM'S GFI FIT INDEX = 0.988
JORESKOG-SORBOM'S AGFI FIT INDEX = 0.938
ROOT MEAN-SQUARE RESIDUAL (RMR) = 0.013
STANDARDIZED RMR = 0.035
ROOT MEAN-SQUARE ERROR OF APPROXIMATION (RMSEA) = 0.056
90% CONFIDENCE INTERVAL OF RMSEA (0.000, 0.217)
RELIABILITY COEFFICIENTS
------------------------
CRONBACH'S ALPHA = 0.607
RELIABILITY COEFFICIENT RHO = 0.613
CFA: BM: Parameter estimates
[Figure X: EQS 6 Broker Market trait-only model; Chi-sq. = 2.60, p = 0.27, CFI = 0.98, RMSEA = 0.06. Standardized loadings 0.41, 0.53, 0.64 and 0.61 on BM_AP, BM_GD, BM_ONE and BM_SIB, with error terms 0.91, 0.85, 0.77 and 0.80]
• Thus BM showed good fit and parameter estimates
• The Simulated In-Basket was the best predictor of Broker Market
• All factor loadings were statistically significant (p < 0.05)
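For readers without EQS, a sketch of how the same trait-only model could be re-fit in Python with semopy (file and column selection hypothetical):

```python
import pandas as pd
import semopy

ratings = pd.read_csv("fp_pedrs.csv")  # hypothetical PEDR file

# Single Broker Market trait factor, no exercise factors --
# the specification that converged above.
model = semopy.Model("BrokerMarket =~ BM_AP + BM_GD + BM_ONE + BM_SIB")
model.fit(ratings[["BM_AP", "BM_GD", "BM_ONE", "BM_SIB"]])

print(semopy.calc_stats(model))  # chi-square, CFI, RMSEA, etc.
print(model.inspect())           # loadings and error variances
```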
CFA: CUS: FIT
• Problems with fit: BBNFI, IFI and reliability
• ERROR MESSAGE IN EQS DUE TO SINGULARITY OF COVARIANCE MATRIX
CHI-SQUARE = 1.456 BASED ON 2 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS 0.48280
THE NORMAL THEORY RLS CHI-SQUARE FOR THIS ML SOLUTION IS 1.391.
FIT INDICES
-----------
BENTLER-BONETT NORMED FIT INDEX = 0.972
BENTLER-BONETT NON-NORMED FIT INDEX = 1.036
COMPARATIVE FIT INDEX (CFI) = 1.000
BOLLEN'S (IFI) FIT INDEX = 1.011
MCDONALD'S (MFI) FIT INDEX = 1.003
JORESKOG-SORBOM'S GFI FIT INDEX = 0.993
JORESKOG-SORBOM'S AGFI FIT INDEX = 0.964
ROOT MEAN-SQUARE RESIDUAL (RMR) = 0.010
STANDARDIZED RMR = 0.026
ROOT MEAN-SQUARE ERROR OF APPROXIMATION (RMSEA) = 0.000
90% CONFIDENCE INTERVAL OF RMSEA (0.000, 0.183)
RELIABILITY COEFFICIENTS
------------------------
CRONBACH'S ALPHA = 0.630
RELIABILITY COEFFICIENT RHO = 0.635
MAXIMAL WEIGHTED INTERNAL CONSISTENCY RELIABILITY = 0.684
CFA: CUS: Parameter estimates
[Figure X: EQS 6 Cross Up Selling trait-only model; Chi-sq. = 1.46, p = 0.48, CFI = 1.00, RMSEA = 0.00. Standardized loadings 0.27, 0.36, 0.39 and 0.29]
• The indicators did not do that well this time
• The best predictor was Group Discussion
CFA: PROFIT: FIT
CHI-SQUARE = 0.634 BASED ON 2 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS 0.72820
THE NORMAL THEORY RLS CHI-SQUARE FOR THIS ML SOLUTION IS 0.621.
FIT INDICES
-----------
BENTLER-BONETT NORMED FIT INDEX = 0.988
BENTLER-BONETT NON-NORMED FIT INDEX = 1.090
COMPARATIVE FIT INDEX (CFI) = 1.000
BOLLEN'S (IFI) FIT INDEX = 1.027
MCDONALD'S (MFI) FIT INDEX = 1.007
JORESKOG-SORBOM'S GFI FIT INDEX = 0.997
JORESKOG-SORBOM'S AGFI FIT INDEX = 0.984
ROOT MEAN-SQUARE RESIDUAL (RMR) = 0.007
STANDARDIZED RMR = 0.018
ROOT MEAN-SQUARE ERROR OF APPROXIMATION (RMSEA) = 0.000
90% CONFIDENCE INTERVAL OF RMSEA (0.000, 0.143)
RELIABILITY COEFFICIENTS
------------------------
CRONBACH'S ALPHA = 0.633
RELIABILITY COEFFICIENT RHO = 0.642
MAXIMAL WEIGHTED INTERNAL CONSISTENCY RELIABILITY = 0.688
MAXIMAL RELIABILITY CAN
CFA: PROFIT: Parameter estimates
[Figure X: EQS 6 Profit trait-only model; Chi-sq. = 0.63, p = 0.73, CFI = 1.00, RMSEA = 0.00. Standardized loadings 0.26, 0.42, 0.34 and 0.31]
• Group discussion is once again the best predictor
CFA: Three dimensions no Exercise effects
[Path diagram: three correlated dimension factors (Broker Market, Cross Up Selling, Profit) loading on their twelve indicators (BM_AP … PROF_SIB), with error terms E1–E12 and no exercise factors]
• The model did not work; neither did a single universal dimension model
Conclusion
The Broker Market sub-dimension worked individually, but the Cross Up Selling and Profit sub-dimensions did not
For this reason we cannot expect the combined CFA model incorporating all three dimensions to work
Problems have to be worked out at the sub-scale level before moving on to the global level
Because construct validity is lacking at the sub-scale level, it does not make sense to look at the exercise effects: construct validity must be sorted out at the sub-scale level first
G-theory
Generalizability theory (G-theory) extends the framework of classical test theory to take into account the multiple sources of variability that can affect test scores (Lynch & McNamara, 1999)
For a DAC, the following sources of variance are often considered: Person; Exercise; Dimension; Person×Dimension interaction (cross-situational specificity); Person×Exercise interaction (low construct validity); Dimension×Exercise (observability of a particular dimension)
A G-study is then designed to estimate the relative effects of these facets on test performance data
The overall index of reliability (similar to Cronbach's coefficient alpha) is expressed as the phi (Φ) coefficient, also referred to generally as an “index of dependability”
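As an illustration of how Φ is assembled from the variance components, here is a sketch in plain Python for a fully crossed person × dimension × exercise design; the component values below are invented purely for illustration:

```python
def phi_coefficient(var: dict, n_d: int, n_e: int) -> float:
    """Index of dependability (Phi) for absolute decisions in a fully
    crossed person x dimension x exercise G-study design."""
    absolute_error = (var["d"] / n_d + var["e"] / n_e
                      + var["pd"] / n_d + var["pe"] / n_e
                      + var["de"] / (n_d * n_e)
                      + var["pde"] / (n_d * n_e))  # residual confounded with p*d*e
    return var["p"] / (var["p"] + absolute_error)

# Invented variance components, for illustration only.
components = {"p": .20, "d": .05, "e": .32,
              "pd": .10, "pe": .25, "de": .04, "pde": .30}
print(round(phi_coefficient(components, n_d=9, n_e=2), 2))  # ~0.38
```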
Meaning of the different sources of variance in a DAC:
• Dimension effect: variance in ratings attributed to certain dimensions, i.e. certain dimensions receiving higher/lower ratings compared to others
• Person effect: general performance factor of persons
• Exercise effect: certain exercises overall receive higher/lower ratings in comparison with others
• Person×Dimension effect: variance attributed to a person's performance on a dimension across exercises; indicative of cross-situational specificity
• Person×Exercise effect: variance attributed to a person receiving high/low ratings on certain exercises regardless of the dimension being measured
• Dimension×Exercise effect: variance attributed to a specific dimension being measured in a specific exercise; referred to as the observability of a particular dimension
Construct validity in a G-study: the person, dimension and person×dimension variance components must collectively exceed the exercise and person×exercise effects
Consider a practical DAC example with G-theory
N = 372
Nine dimensions, mostly with two exercises each: Simulated In-Basket and Role Play
A Practical example

Dimension | SIB | Role Play | Interview
Change Orientation | ✓ | ✓ |
Communication | ✓ | |
Customer Service Orientation | ✓ | ✓ |
Interpersonal Interaction | ✓ | ✓ |
Planning & Organizing | ✓ | ✓ |
Problem Analysis & Decision-making | ✓ | ✓ |
Self-Management | ✓ | ✓ |
Team Management | ✓ | ✓ |
A Practical example: Variance Components for the entire DAC
[Table: estimated variance components for the full person × dimension × exercise design]
A Practical example: Important note
In SPSS, for the ANOVA and MINQUE methods, negative variance component estimates may occur. Possible reasons for their occurrence are: (a) the specified model is not the correct model, or (b) the true value of the variance equals zero
In light of the foregoing example: the variance attributed to exercise effects (.322) exceeds the variance attributed to dimension effects (.108)
This finding seems to be in line with Lance et al.'s (2004) contention that method effects are three times larger than trait effects
In the current example, 2.9 times more variance was explained by exercise effects compared to dimension effects
A Practical example: Variance Components for selected dimensions
[Table: variance components estimated separately for individual dimensions]
However, could it be that the G-study on the entire DAC ironed out some robust dimension effects at the sub-dimension level?
I.e., are we throwing out the good with the bad?
To investigate the relative contribution of each dimension to the overall G-coefficient, one could conduct a forward G-analysis at the individual dimension level
However, when we calculate the Φ coefficient at the subscale level, there will be no variance component for the dimension, dimension×exercise, dimension×person, or dimension×person×exercise effects
The biggest problem with this approach is that it cannot compare person×dimension variance with person×exercise variance, since no person×dimension variance component is generated
However, it is still possible to compare person variance with person×exercise variance (see the sketch below)
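A sketch of that subscale-level comparison (components again invented for illustration): once the dimension facet drops out, only person, exercise, and person × exercise components remain.

```python
def phi_single_dimension(var_p: float, var_e: float,
                         var_pe: float, n_e: int) -> float:
    """Phi for one dimension rated across n_e exercises: no dimension
    facet, so absolute error holds only exercise and person x exercise."""
    absolute_error = var_e / n_e + var_pe / n_e
    return var_p / (var_p + absolute_error)

# Invented components for a dimension measured in two exercises;
# person variance can still be weighed against person x exercise variance.
print(round(phi_single_dimension(var_p=.25, var_e=.10, var_pe=.30, n_e=2), 2))
```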
A Practical example: Variance components of Communication
[Table: variance components for the Communication dimension]
A Practical example: Variance components of Team Management
[Table: variance components for the Team Management dimension]
Final Verdict: G-study and DAC
Investigate dimensions individually to assess the contribution of the different sources of variance
Poorly designed dimensions may inflate the observed variance attributed to exercise, exercise×dimension, and exercise×person effects
The way G-studies are conducted has design implications for the DAC: the “all vs. some” approach to design
IRT ANALYSIS
Previously we noted that recently there have been two schools of thought on assessing the construct validity of ACs: Confirmatory Factor Analysis (CFA) [MTMM] and Generalizability Theory
A fairly new area: IRT modelling with interval data
Consider the Achievement Motivation example discussed earlier
IRT Approach
The logistic model dictates that a respondent's response to an item should depend on two parameters only: the difficulty of endorsing the item (item location parameter) and the standing of the respondent on the latent trait (person location parameter)
The expectation is that persons with a higher standing on the latent trait should have a higher probability of endorsing a particular item compared to persons with a lower standing on the same trait (see the sketch below)
This is a key requirement for a DAC, since the central aim is to discriminate between persons who are low and high on the trait (dimension)
Deviations from these expectations might suggest that the DAC exercises are not operating as intended
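A minimal sketch of that expectation under the dichotomous Rasch model, in plain Python/numpy (parameter values invented):

```python
import numpy as np

def rasch_prob(theta: float, b: float) -> float:
    """P(endorse) for a person at latent level theta on an item with
    location (difficulty) b under the dichotomous Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Higher standing on the latent trait -> higher endorsement probability.
print(round(rasch_prob(1.0, 0.0), 2))   # ~0.73
print(round(rasch_prob(-1.0, 0.0), 2))  # ~0.27
```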
Rating scale
The current DAC was rated on a 5-point response scale with non-integer (i.e. decimal) values
Common wisdom: more response categories = a more reliable measure that resembles interval data
However, it remains to be seen whether people actually make distinctions between the response categories
It is expected that the thresholds between the 5 response categories will be sequentially ordered along the latent trait
We can examine the graphed category response functions to see whether each response category becomes the modal category at some point on the latent trait continuum (i.e. whether the 4 thresholds are ordered); a sketch of these category probabilities follows
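A sketch of those category response functions under Andrich's rating scale model (threshold values invented); with ordered thresholds, every category should be modal somewhere on the continuum:

```python
import numpy as np

def rsm_category_probs(theta: float, b: float, taus) -> np.ndarray:
    """Category probabilities under the rating scale model:
    P(k) proportional to exp(sum_{j<=k} (theta - b - tau_j)), tau_0 = 0."""
    taus = np.concatenate(([0.0], np.asarray(taus, dtype=float)))
    logits = np.cumsum(theta - b - taus)
    expl = np.exp(logits - logits.max())  # subtract max for stability
    return expl / expl.sum()

# Invented, ordered thresholds for a 5-category scale.
for theta in (-4.0, -2.0, 0.0, 2.0, 4.0):
    print(theta, np.round(rsm_category_probs(theta, 0.0, [-3, -1, 1, 3]), 2))
```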
[Figures: empirical response category curves for IN_AP, EN_AP, PS_AP and PS_SIB]
Empirical response categories
ITEM DIFFICULTY MEASURE OF -1.13 ADDED TO MEASURES
-------------------------------------------------------------------
|CATEGORY OBSERVED|OBSVD SAMPLE|INFIT OUTFIT||STRUCTURE|CATEGORY|
|LABEL SCORE COUNT %|AVRGE EXPECT| MNSQ MNSQ||CALIBRATN| MEASURE|
|-------------------+------------+------------++---------+--------|
| 1 1 1 1| -4.95 -4.41| .29 .10|| NONE |( -9.04)| 1
| 2 2 27 54| -.60 -.43| .55 .62|| -6.82 | -3.31 | 2
| 3 3 15 30| 1.89 1.67| .72 .42|| 2.46 | 2.28 | 3
| 4 4 7 14| 3.55 3.31| .68 .60|| 4.36 |( 4.43)| 4
| 5 1 1| | || NONE | | 5
-------------------------------------------------------------------
OBSERVED AVERAGE is mean of measures in category. It is not a parameter estimate.
Response Scales
What we see here is that, although there are supposed to be 5 response categories, raters effectively make use of three response categories when rating PEDRs
Furthermore, person reliability is not very good. This index estimates the confidence we have that people would be allocated to the same rank order if exposed to the Achievement Motivation DAC again
This is similar to the person×dimension effect in G-studies
Fit Statistics
SUMMARY OF 97 MEASURED (NON-EXTREME) PERSON
-------------------------------------------------------------------------------
|          TOTAL                  MODEL     INFIT           OUTFIT             |
|          SCORE  COUNT  MEASURE  ERROR   MNSQ  ZSTD      MNSQ  ZSTD           |
|-----------------------------------------------------------------------------|
| MEAN      34.6    8.0     .66    .64    1.01   -.2      1.05   -.1           |
| S.D.       6.0     .0    2.32    .16     .94   1.3      1.29   1.3           |
| MAX.      48.0    8.0    5.09   1.29    5.40   3.4      8.32   4.2           |
| MIN.      19.0    8.0   -8.77    .45     .10  -2.5       .08  -2.7           |
|-----------------------------------------------------------------------------|
| REAL RMSE   .76  TRUE SD  2.19  SEPARATION  2.88  PERSON RELIABILITY  .89    |
| MODEL RMSE  .66  TRUE SD  2.22  SEPARATION  3.39  PERSON RELIABILITY  .92    |
| S.E. OF PERSON MEAN = .24                                                    |
-------------------------------------------------------------------------------
Fit Statistics: PERSON AND ITEM PARAMETERS
ITEM STATISTICS: MISFIT ORDER
--------------------------------------------------------------------------------------------------
|ENTRY   TOTAL  TOTAL          MODEL|  INFIT   |  OUTFIT  |PT-MEASURE |EXACT MATCH|              |
|NUMBER  SCORE  COUNT  MEASURE  S.E.|MNSQ  ZSTD|MNSQ  ZSTD|CORR.  EXP.| OBS%  EXP%| ITEM       G |
|-----------------------------------+----------+----------+-----------+-----------+--------------|
|     8    590     98     -.24   .17|1.48   2.7|2.45   4.3|A .61   .74| 53.6  64.1| TRANS_SIB  0 |
|     6    632     98     1.63   .14|1.24   1.4|1.14    .6|C .80   .81| 62.9  62.0| TRANS_EN   0 |
|     7    545     98    -1.53   .14| .95   -.3| .95   -.1|D .80   .78| 57.7  56.6| TRANS_PS   0 |
|     5    545     98    -1.53   .14| .92   -.5| .83   -.6|d .81   .78| 58.8  56.6| TRANS_IN   0 |
|-----------------------------------+----------+----------+-----------+-----------+--------------|
| MEAN   426.4   98.0      .00   .18|1.00   -.1|1.05   -.1|           | 65.7  64.4|              |
| S.D.   154.5     .0     1.52   .03| .29   1.9| .58   2.0|           |  8.2   5.4|              |
--------------------------------------------------------------------------------------------------
THUS, from the high ZSTD infit and outfit statistics in this table, we can see that PS_SIB (TRANS_SIB) underestimates expected item scores
[Figures: expected item characteristic curves for PS_SIB and EN_AP]
Validation problems of DAC’s
If the SEM approach is to preferred: Empirical Considerations
At least 5 exercises per dimension for an uni-dimensional construct and single exercise effect
If the 1DCE approach is used with multiple sub-dimensions than at least 3 exercises per sub-dimension is needed
Multiple raters for each dimension Sample size > 150 Minimum of 5-point rating scale
Validation problems of DAC’s Substantive considerations:
Theoretical underpinnings of DAC dimensions
Are we really measuring more than fluid intelligence (g) in DAC’s?
Have we considered discriminant and convergent validity outside the MTMM doctrine: Cross-validation with paper & pencil measures?
Rater calibration: Higher inter-rater agreement at the expense of restriction of range and construct validity
PEDR’s lies at the heart of the problem: What are we rating?
Competency potential
CompetenceObservableBehaviour
PEDR’s
? ??
PEDR’s lies at the heart of the problem: What are we rating? If we are proponing to measure competency potential -
would it not be better to use paper & pencil measures with more control (standardisation) and objectivity?
When designing exercises to measure AC dimensions – what is the constitutive meaning of the proposed dimensions? “Creative thinking & Entrepreneuric Energy”
Why not cross-validate AC constructs with “known constructs”?
For example: Empowering Leadership (DAC) – Transformational leadership (Bass & Avolio, 1995).
Rating calibration: Guidelines vs Rules! More variance in PEDR’s when raters are given more
discretion (i.e. guidelines not rules)
PEDR’s lies at the heart of the problem: What are we rating? Exercises: Uni-dimensionality is paramount Avoid conglomeration of constructs when designing
exercises Be adamant about micro measurement through
thoroughly designed scoring reports Attach scoring scale to each elicited behaviour Can raters list all observable behaviors without
guidance? Finally: Is DAC a new science? OR Can we apply some known psychometric truths to
DAC or are “behaviour to complex to measure”
Legislative Pitfalls!! LEST WE FORGET THAT ASSESSMENT CENTRES ARE STILL PSYCHOLOGICAL ASSESSMENTS
EEA implications:
The usage of psychometric tests in South Africa is monitored and guided by the Employment Equity Act (Republic of South Africa, 1998), which prohibits the use of psychological tests unless it can be shown that the tests are valid and not biased against any employee or group (i.e., without measurement bias)
In paragraph 8 of the Employment Equity Act (Republic of South Africa, 1998, p. 16) this position is reiterated and qualified:
Psychological testing and other similar assessments of an employee are prohibited unless the test or assessment being used:
a) has been scientifically shown to be valid and reliable; b) can be applied fairly to all employees; c) is not biased against any employee or group.
Legislative Pitfalls!! According to the main propositions of the EEA, users of psychometric tests are obliged to provide evidence that their selection processes adhere to the Act.
THUS, whenever allegations of discrimination are advanced, the burden of proof shifts to the employer to demonstrate the job-relatedness of the selection procedure and that the inferences derived from the predictor scores are fair.
This interpretation is reinforced in Chapter II of the EEA under the heading “Burden of proof”, paragraph 11: whenever unfair discrimination is alleged in terms of the Act, the employer against whom the allegation is made must establish that it is fair.
Is it possible to immunize oneself from EEA legislation by claiming to use the DAC for developmental rather than selection purposes?
Ultimately, a developmental DAC can still discriminate unfairly, especially in promotional practices
To avoid legislative pitfalls: make sure to get psychometric “INTEL” on the DAC
LEST WE FORGET THAT ASSESSMENT CENTRES ARE STILL PSYCHOLOGICAL ASSESSMENTS