PART 2: RELIABILITY & VALIDITY
EDRS 709, Summer 2013
RELIABILITY AND VALIDITY
These two concepts are fundamentally important in testing and survey work.
Validity: are we measuring what we set out to measure? There are several different kinds of validity: criterion, content, construct, measurement, and even face validity are some of the major ones.
Reliability: are we measuring the same thing each time we make our measurements?
2013 W. David Scales
RELIABILITY
A measure is reliable if it is free of measurement error.
No instrument is completely reliable, but we can quantify the degree of reliability easily enough. A reliability coefficient quantifies the degree of reliability.
Reliability is an indication of the consistency or stability of the measuring instrument. We want to be able to measure the same thing the same way each time, and we need our readings to be accurate.
Example: If I weighed you three times on the same scale, the scale should return the same number each time.
However, errors creep into our observations all the time.
RELIABILITY & ERROR
Two major kinds of error: method error and trait error.
Method error: problems associated with the experimenter and the testing situation. Examples: faulty equipment; researcher writes down the wrong number or gives the wrong instruction; unwanted distractions during testing, etc.
Trait error: problems associated with the participants in a study. Examples: participant is lying; participant is sick/sleepy/hung over at test time, etc.
RELIABILITY & ERROR
Any score that we observe is some combination of a subject’s true level of ability and the degree of error that enters into our observations
Obviously, we want to minimize the error associated with our measurements so that the score we observe matches up to the true score as closely as possible
In practical terms, it’s impossible for an instrument to be perfectly reliable – we’re trying to infer into existence something that doesn’t exist in an empirical sense
RELIABILITY
Conceptually, reliability looks like this:

Reliability = True score / (True score + Error score)

You can see that the smaller the error is, the closer the reliability score gets to 1.0. For your purposes, assume that reliability exists on a continuum between zero and 1.0, where higher scores mean the instrument is more reliable.
Generally, anything over 0.8 is considered to have good reliability.
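This ratio can be illustrated with a quick simulation; the numbers below (a true-score mean of 100, true-score SD of 15, error SD of 7.5) are hypothetical and chosen only so that the theoretical reliability works out to 225 / 281.25 = 0.80:

```python
import random
from statistics import pvariance

random.seed(42)

# Hypothetical model: observed score = true score + random error
n = 100_000
true_scores = [random.gauss(100, 15) for _ in range(n)]   # true ability
errors = [random.gauss(0, 7.5) for _ in range(n)]         # measurement error
observed = [t + e for t, e in zip(true_scores, errors)]

# Reliability = true-score variance / observed-score variance
# Expected value here: 15**2 / (15**2 + 7.5**2) = 0.80
reliability = pvariance(true_scores) / pvariance(observed)
```

With a large simulated sample, the computed ratio lands very close to the theoretical 0.80.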
MEASURING RELIABILITY: CORRELATIONS
Correlation coefficients are measures of the degree of relationship between two variables.
A correlation coefficient ranges between –1.0 and +1.0. A coefficient of 0.0 = no relationship between two variables. Negative values reflect a negative relationship; positive values reflect a positive relationship.
The closer the coefficient is to +1.0 or –1.0, the stronger the relationship is. Like reliability, a correlation above 0.8 is considered a strong correlation.
CORRELATION DOES NOT MEAN CAUSATION!
Note the subtlety!
CORRELATION ≠ CAUSATION
An observed association between variables does NOT indicate that one variable causes the other—though often we would like this to be the case.
Sometimes it is difficult to tell which variable is causing the other.
Often the two variables are both related to some other underlying variable that we haven’t considered.
We can’t always control for the effects that other variables might have on our dependent variable.
More often than not, there's no relationship at all, but we want to think that there is.
CORRELATION ≠ CAUSATION
In cities, there is a high correlation between the number of police officers and the murder rate—does having more police cause more murders? (Perhaps the opposite is true, but even that is an over-simplification.)
There is a high correlation between the number of pedestrians killed by cars and the number of crosswalks in a city—do more crosswalks cause more deaths, or vice versa?
A game of logical fallacies: In the summer, murder rates increase. In the summer, ice cream sales increase. Therefore, ice cream causes people to murder.
TYPES OF RELIABILITY
Several ways to establish reliability:
Test/retest reliability: give the test to the same subjects twice, then correlate the results
Parallel-forms reliability (or alternate-forms reliability): a correlation of scores from similar tests with similar variances
Split-half reliability: split a test in two (odd/even or first-half/second-half) and correlate the halves
Interrater reliability: a measure of consistency of agreement between two or more judges or raters
These are all measures of consistency (the split-half approach, in particular, measures internal consistency)
RELIABILITY
The concept of true score can be a little slippery. A subject's true score is an abstract representation of his/her latent ability on some construct.
With any test, assessment or survey, we're attempting to make manifest something that is inherently intangible or latent.
Since this knowledge, these skills and/or abilities (KSAs) don't exist tangibly in the real world, by definition the true score of any subject as a reflection of these KSAs is also an abstraction.
An observed score, therefore, is an inference of a true score (and, by extension, the underlying KSA)
RELIABILITY
Reliability is calculated using one of many special forms of a correlation coefficient, and interpreted in much the same way
Reliability coefficients above .80 are considered good; coefficients below .4 are considered poor
ρXX′ is the generalized form of the reliability coefficient that measures the correlation between parallel tests, X and X′ ("X-prime")
Just like any correlation coefficient, reliability presumes that you are working with the population instead of a sample, and this is reflected in all the calculations
RELIABILITY
Also, just like with any correlation coefficient, squaring it will give you the proportion of variance in X accounted for by X’ – it’s very similar to any other measure of effect size
Reliability is also a measure of the ratio of true-score variance to observed-score variance:
The less error there is in your measurement, the more reliable the instrument is
ρXX′ = σ²T / σ²X

ρXX′ → 1.0 as Error → 0.0
ρXX′ → 0.0 as Error → 1.0 (with error expressed as the proportion of observed-score variance)
IMPLICATIONS OF ρXX′ = 1.0
This means that all measurements are made without any errors whatsoever.
The observed score equals the true score (X = T) for all examinees.
All observed-score variance solely reflects true-score variance: σ²X = σ²T
The correlation between true scores and observed scores is ρXT = 1.0
The correlation between observed scores and errors is ρXE = 0.0
Adapted from Luecht (2004)
IMPLICATIONS OF ρXX′ = 0.0
This means that all measurements are nothing but random error.
The observed score is nothing but random error (X = E) for all examinees.
All observed-score variance solely reflects error variance: σ²X = σ²E
The correlation between true scores and observed scores is ρXT = 0.0
The correlation between observed scores and errors is ρXE = 1.0
IMPLICATIONS OF 0 < ρXX′ < 1.0
The observed score (or the measures used to obtain the score, or both) contains some error: X = T + E for all examinees.
The variance of the observed scores contains true-score variance and error variance: σ²X = σ²T + σ²E
The correlation between true scores and observed scores is ρXT = √ρXX′
The correlation between observed scores and error scores is ρXE = √(1 − ρXX′)
IMPLICATIONS OF 0 < ρXX′ < 1.0
Differences in observed scores can reflect differences in true scores (and therefore the underlying KSAs), random errors of measurement, or a combination of both.
Reliability is therefore the proportion of observed-score variance that is due to true-score variance:

ρXX′ = σ²T / σ²X

The higher the reliability coefficient is, the more confidently we can estimate true scores (and therefore, the underlying KSA) from observed scores.
ESTIMATING RELIABILITY
There are several different methods for calculating reliability, and each method is used for a specific reason
Reliability coefficients are all forms of correlations!
Some of the earliest work in reliability stems from the late 19th century, and reliability has always been closely linked to validity
Thorndike’s (1918) groundbreaking work concerning criterion validity -- he defined validity as how well a measure correlates with another similar measure (we’ll look at this later…)
ESTIMATING RELIABILITY
By implication, Thorndike (1918) said that reliability and validity were inherently linked – in his conception, validity and reliability had a direct, almost linear relationship – high correlations between an instrument and its criterion meant that the instrument was mathematically reliable and conceptually valid
Even though his conclusion concerning the relationship is inherently flawed, Thorndike laid out the conceptual power of using correlations as measures of reliability, and we use the same ideas today
TEST / RETEST RELIABILITY
Test-retest reliability implies that the same examinees are tested twice using the same test
The reliability coefficient is simply the Pearson correlation between the two sets of test scores
This is often used with personality tests or behavioral assessments. This is not used with cognitive assessments – more in a minute…
Example: two sets of scores on a personality inventory, in which scores range from 0 to 100
TEST / RETEST RELIABILITY

Case   A    B      zA        zB        zA·zB
1      72   87     0.7551    0.7173    0.5417
2      58   79    -0.4935   -0.2286    0.1128
3      70   81     0.5767    0.0079    0.0045
4      46   67    -1.5638   -1.6475    2.5762
5      68   83     0.3984    0.2444    0.0973
6      74   86     0.9335    0.5991    0.5592
7      45   75    -1.6530   -0.7016    1.1596
8      56   73    -0.6719   -0.9380    0.6302
9      88   97     2.1821    1.8997    4.1454
10     68   86     0.3984    0.5991    0.2387
11     68   81     0.3984    0.0079    0.0031
12     62   92    -0.1368    1.3085   -0.1789
13     48   65    -1.3854   -1.8839    2.6100
14     62   86    -0.1368    0.5991   -0.0819
15     68   76     0.3984   -0.5833   -0.2324

Mean        63.533  80.933          Σ zA·zB = 12.1856
Std. Dev.   11.212   8.457

rAB = Σ(zA·zB) / n = 12.1856 / 15 = 0.812
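The table's arithmetic can be reproduced in a few lines. This sketch standardizes each set of scores (using population standard deviations, as the worked example does) and averages the products of the z-scores:

```python
from statistics import fmean, pstdev

# the two administrations of the personality inventory from the example
A = [72, 58, 70, 46, 68, 74, 45, 56, 88, 68, 68, 62, 48, 62, 68]
B = [87, 79, 81, 67, 83, 86, 75, 73, 97, 86, 81, 92, 65, 86, 76]

def pearson_r(x, y):
    """Pearson correlation computed as the mean of z-score products."""
    mx, my = fmean(x), fmean(y)
    sx, sy = pstdev(x), pstdev(y)   # population standard deviations
    products = [((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)]
    return sum(products) / len(products)

r_ab = pearson_r(A, B)  # test-retest reliability, about 0.81
```

Any Pearson-correlation routine (for example, `scipy.stats.pearsonr`) gives the same value; the z-score form is shown because it mirrors the table.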
TEST / RETEST RELIABILITY
This very simple method to calculate reliability is nothing more than Thorndike’s (1918) work
However, there are some problems with using this method
Carry-over effects: an examinee’s attitude or general performance could be dependent on their prior trial performance
Practice effects: being exposed to the same information more than once may influence how they respond to items
Time between testing: if the time period is too short, examinees may remember their responses; if it’s too long, maturation may occur
PARALLEL-FORMS RELIABILITY
Alternate forms can be pre-constructed to be parallel, meaning they have similar true-score variances and equivalent error variances. However, this requires a great deal of assumption on the part of the test maker, a deep level of conceptual understanding of the latent constructs of interest, and a large amount of empirical verification.
Like before, we give the two tests C and D, correlate the results, and that gives us a measure of reliability
Problem: if either C or D or both are unreliable on their own, reliability will be confounded with parallelism
PARALLEL-FORMS RELIABILITY
Like with test / retest reliability, parallel-forms reliability is simply the Pearson correlation between the two forms C and D
Alternate forms are usually assumed to be parallel, and later tested for parallelism using the means and variances of the observed scores, and correlations with other tests
If the true scores are not equal, it's almost impossible to make a reasonable case for establishing the reliability of either form, much less being able to say that one is a reliable proxy for the other
RELIABILITY & TEST LENGTH
In general, adding more items to a test increases reliability
The Spearman-Brown formula is used to evaluate the likely increase in reliability due to increasing the length of a test or survey
The same formula is also used to adjust the correlation obtained from a split-half reliability analysis, which we’ll see later
RELIABILITY & TEST LENGTH
Assume that Y is a test of length k (or, test Y has k items) and that ρYY′ is the reliability coefficient for Test Y.
What would happen if we increased the length of Test Y from k to m items?
The Spearman-Brown formula (sometimes called the Spearman-Brown prophecy formula) will compute the new reliability:

ρXX′ = (m/k) ρYY′ / (1 + (m/k − 1) ρYY′)
RELIABILITY & TEST LENGTH
Suppose we have a 10-item test with ρYY′ = .70. What will happen to the reliability if we double the test length to 20 items?
ρYY′ = .70, k = 10, and m = 20:

ρXX′ = (20/10)(.70) / (1 + (20/10 − 1)(.70)) = 1.40 / 1.70 = 0.824

Doubling the test length will raise reliability by about 18%
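The prophecy formula is a one-liner in code; this sketch reproduces the worked example (ρYY′ = .70, doubling a 10-item test to 20 items):

```python
def spearman_brown(rel, k, m):
    """Predicted reliability when a k-item test with reliability `rel`
    is lengthened (or shortened) to m items."""
    ratio = m / k
    return ratio * rel / (1 + (ratio - 1) * rel)

new_rel = spearman_brown(0.70, k=10, m=20)  # about 0.824
```

Note that the formula also works in reverse: with m < k it predicts the reliability lost by shortening a test.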
RELIABILITY & TEST LENGTH
If you start with an instrument that is unreliable, you’ll have to add an awful lot of items just to get mediocre reliability, much less good reliability
However, if you start with good reliability, you don’t need to do much work adding items to improve the reliability
There is a tradeoff: raising reliability from .65 to .75 is a lot easier than raising reliability from .90 to .95
Going from .90 to .95 may require tripling or quadrupling the test length – that’s a lot of effort for a very small payoff (the Law of Diminishing Returns)
RELIABILITY & TEST LENGTH
[Figures showing the Spearman-Brown relationship between test length and reliability; the graphics did not survive extraction]
INTERNAL CONSISTENCY
Internal consistency is determined by splitting the instrument in half in some way, then correlating the halves together
First-half / second-half split: an instrument is split right down the middle (e.g., the first 20 items are correlated against the second 20 items)
Problem: many tests are designed so that the simplest items are given first, and the test increases in difficulty as one moves through the items – this method only works if the items are of equal difficulty or are randomized in presentation
This method also will not work for any computer-based test that uses a right/wrong algorithm to determine which item the subject will see next
INTERNAL CONSISTENCY
Even-odd split: an instrument is split into the even-numbered and odd-numbered items
This method works substantially better than comparing the first & second half of a test – item ordering effects can be negated by more closely assuring that items of different difficulties are represented in both halves
Reliability for both first-half / second-half and even-odd splits is calculated in the same way – there's a simplified form of the Spearman-Brown formula to determine internal consistency
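The simplified correction is just the Spearman-Brown formula with m/k = 2, since the full test is twice the length of either half. A minimal sketch:

```python
def split_half_reliability(r_half):
    """Spearman-Brown correction for a half-test correlation:
    the full test is twice the length of each half (m/k = 2)."""
    return 2 * r_half / (1 + r_half)

# e.g., if the two halves correlate at .70, the full-length
# test's estimated reliability is about 0.824
full_rel = split_half_reliability(0.70)
```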
INTERNAL CONSISTENCY
To determine internal consistency, a single form of a test is given to all examinees
The items are scored dichotomously as right or wrong (0, 1).
Once the items are scored, they are either:
Partitioned into two groups of the same number of items; each group of items is summed to form a subtest total score; these total scores are correlated, and the correlation is then corrected for disattenuation (Spearman-Brown), or
Used to directly compute item-level variances that are then exploited to derive a general estimate of reliability (Cronbach's alpha, KR-20, KR-21)
PROBLEMS WITH SPLIT-HALF METHODS
Any test in which the items are not truly random in order is going to interfere with reliability. Many tests present items that increase in difficulty; the first-half / second-half split should never be used with these items.
Local independence: a response to an item is not dependent on any item that has come before, and has no bearing on the response to any other item that follows. Some tests present a scenario followed by a series of items based on that scenario, and often the response to one item influences the response to a following item.
PROBLEMS WITH SPLIT-HALF METHODS
Nuisance dimensions: any outside factor or KSA that produces systematic changes over the course of the test (e.g., speeded tests, pacing skills, reading speed, language comprehension, dexterity, induced rapid guessing)
Technically speaking, split-half methods only work for parallel halves. Unequal means are not a problem, in and of themselves, as long as the halves have at least approximately equal true-score means.
However, unequal variances of the halves can lead to poor and/or inappropriate reliability estimates
CRONBACH’S ALPHA
Developed by Cronbach (1951) as the first in an intended series of coefficients
Cronbach’s α provides the lower bound of reliability among all possible split halves, assuming that the halves are parallel – the value of Cronbach’s α is the most conservative estimate of reliability
Cronbach’s α is the most widely-used reliability estimate
Cronbach’s α is the only one that works for dichotomous and polytomous items
CRONBACH’S ALPHA
k is the total number of test items
σ²i is the item variance, which equals σ²i = piqi for dichotomous items (ui ∈ {0, 1}) -- the variances of all the items are summed together
σ²X is the variance of the sum scores, computed as σ²X = Σ(X − X̄)² / n

α = ρXX′ = (k / (k − 1)) (1 − Σσ²i / σ²X)
ITEM RESPONSES
Subject   i1  i2  i3  i4  i5  i6  i7  i8  i9  i10   Total
1          0   0   0   1   0   0   0   0   0   0       1
2          1   1   1   0   1   1   0   1   0   1       7
3          1   0   0   0   1   0   0   0   0   0       2
4          1   1   1   1   1   1   0   1   1   1       9
5          1   1   1   1   1   1   1   0   1   1       9
6          0   0   1   0   0   0   0   0   0   0       1
7          1   1   1   0   0   1   0   1   0   0       5
8          1   0   0   0   1   0   0   1   0   1       4
9          0   1   1   1   1   0   0   0   1   1       6
10         1   1   1   1   1   1   1   1   1   1      10
11         1   1   1   1   1   1   0   0   0   1       7
12         1   1   0   1   1   1   0   0   1   1       7
13         0   0   1   1   1   0   0   0   0   0       3
14         1   1   1   1   1   1   1   1   1   1      10
15         1   1   1   1   1   1   0   1   1   1       9
16         1   0   1   0   0   0   0   0   1   0       3
17         1   1   1   0   1   1   1   1   0   1       8
18         1   1   1   1   0   1   0   0   1   1       7
19         0   1   1   1   0   1   0   0   0   1       5
20         1   0   0   1   0   0   1   0   0   0       3

Mean      0.8 0.7 0.8 0.7 0.7 0.6 0.3 0.4 0.5 0.7     5.8
Variance  0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2     8.26
ITEM RESPONSES
Subject   i1     i2     i3     i4     i5     i6     i7     i8     i9     i10     Total
…         …      …      …      …      …      …      …      …      …      …       …
Mean      0.75   0.65   0.75   0.65   0.65   0.60   0.25   0.40   0.45   0.65    5.8
Variance  0.188  0.228  0.188  0.228  0.228  0.240  0.187  0.240  0.248  0.228   8.26

Σσ²i = 0.188 + 0.228 + … = 2.20
σ²X = 8.26
CRONBACH’S ALPHA
Σσ²i = 2.20 and σ²X = 8.26, so:

α = ρXX′ = (k / (k − 1)) (1 − Σσ²i / σ²X) = (10 / 9) (1 − 2.20 / 8.26) = 0.815
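The whole calculation can be checked in code. This sketch recomputes Cronbach's α from the item-response matrix above, using population variances as the slides do (for dichotomous items, each item's population variance is exactly piqi):

```python
from statistics import pvariance

# item-response matrix from the example: 20 examinees x 10 dichotomous items
responses = [
    [0,0,0,1,0,0,0,0,0,0], [1,1,1,0,1,1,0,1,0,1],
    [1,0,0,0,1,0,0,0,0,0], [1,1,1,1,1,1,0,1,1,1],
    [1,1,1,1,1,1,1,0,1,1], [0,0,1,0,0,0,0,0,0,0],
    [1,1,1,0,0,1,0,1,0,0], [1,0,0,0,1,0,0,1,0,1],
    [0,1,1,1,1,0,0,0,1,1], [1,1,1,1,1,1,1,1,1,1],
    [1,1,1,1,1,1,0,0,0,1], [1,1,0,1,1,1,0,0,1,1],
    [0,0,1,1,1,0,0,0,0,0], [1,1,1,1,1,1,1,1,1,1],
    [1,1,1,1,1,1,0,1,1,1], [1,0,1,0,0,0,0,0,1,0],
    [1,1,1,0,1,1,1,1,0,1], [1,1,1,1,0,1,0,0,1,1],
    [0,1,1,1,0,1,0,0,0,1], [1,0,0,1,0,0,1,0,0,0],
]

def cronbach_alpha(matrix):
    """Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / total variance)."""
    k = len(matrix[0])
    items = list(zip(*matrix))                       # columns = items
    sum_item_vars = sum(pvariance(col) for col in items)
    totals = [sum(row) for row in matrix]            # each examinee's sum score
    return (k / (k - 1)) * (1 - sum_item_vars / pvariance(totals))

alpha = cronbach_alpha(responses)  # about 0.815, matching the hand calculation
```

Because the items are dichotomous, the same value is also the KR-20 coefficient for these data.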
KUDER-RICHARDSON FORMULA 20 (KR-20)
KR-20 is used for dichotomous items.
k is the total number of test items
pi is the item-level proportion-correct score
qi = 1 − pi is the item-level proportion-incorrect score
σ²X is the variance of the sum scores, computed as σ²X = Σ(X − X̄)² / n

α = ρXX′ = (k / (k − 1)) (1 − Σpiqi / σ²X)
FOR DICHOTOMOUS ITEMS: KR-20 = CRONBACH’S α!
KUDER-RICHARDSON FORMULA 21 (KR-21)
KR-21 is also used for dichotomous items. However, KR-21 is used when you can assume that pi does not vary extensively across items.
k is the total number of test items
p̄ is the average item-level proportion-correct score
q̄ = 1 − p̄ is the average item-level proportion-incorrect score
σ²X is the variance of the sum scores, computed as σ²X = Σ(X − X̄)² / n

ρXX′ = (k / (k − 1)) (1 − k·p̄q̄ / σ²X)
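A minimal sketch of KR-21, written in terms of the mean total score (so that p̄ = mean / k). Applied to the earlier example (mean total 5.8, total-score variance 8.26), it gives a lower value than the α of 0.815 because the item difficulties in that dataset actually do vary:

```python
def kr21(k, mean_total, var_total):
    """KR-21 reliability estimate for k dichotomous items,
    assuming roughly equal difficulty across items."""
    p_bar = mean_total / k          # average proportion correct
    q_bar = 1 - p_bar
    return (k / (k - 1)) * (1 - k * p_bar * q_bar / var_total)

kr21_estimate = kr21(10, 5.8, 8.26)  # about 0.783
```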
INTERRATER RELIABILITY
Kappa (κ), often called Cohen’s κ, is a form of χ² that uses contingency tables to measure interrater agreement and is used as a measure of reliability (Cohen, 1960)
Cohen’s κ is used to measure the reliability of judges or raters – we want judges to rate what they see in the same way, using the same rubric or understanding of the rating criteria
Just like any piece of machinery (or assessment, survey or diagnostic tool used in psychology), raters need to be calibrated (or trained) in order to yield the most accurate measurements possible
INTERRATER RELIABILITY
Like most effect sizes, there aren’t hard-and-fast rules for interpreting κ, but there are some commonly-understood guidelines (Landis & Koch, 1977):
κ             Interpretation
< 0.1         Poor agreement
0.1 – 0.20    Slight agreement
0.21 – 0.40   Fair agreement
0.41 – 0.60   Moderate agreement
0.61 – 0.80   Substantial agreement
0.81 – 1.00   Almost perfect agreement
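Cohen's κ compares the raters' observed agreement with the agreement expected by chance from each rater's marginal frequencies. A minimal sketch; the two sets of ratings below are invented purely for illustration:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = len(rater1)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # chance agreement from the raters' marginal proportions
    p_chance = sum(c1[cat] * c2[cat] for cat in c1) / n**2
    return (p_observed - p_chance) / (1 - p_chance)

# hypothetical ratings: two judges scoring six subjects pass/fail
r1 = ["pass", "pass", "fail", "pass", "fail", "pass"]
r2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(r1, r2)  # 2/3, "substantial agreement" per the table
```

Here the judges agree on 5 of 6 cases (p_observed = 0.833) but would agree on half by chance (p_chance = 0.5), giving κ ≈ 0.67.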
INTRACLASS CORRELATION
An intraclass correlation is another measure of interrater reliability
It’s actually highly similar to Cohen’s κ Since Cohen’s κ is based on a χ² test of
independence, it’s used when you have two judges
An intraclass correlation is used when you have three or more judges and you want to examine their level of agreement
INTRACLASS CORRELATION
One could in theory correlate the ratings of all possible pairwise combinations of judges and then average the results, but this will ignore differences between judges. Example: if one judge always rates five points higher than another judge, their correlation will be 1.00, but the judges are saying different things about the subjects.
An intraclass correlation will allow you to take individual differences of judges into account
VALIDITY
Validity's big question: are we measuring what we set out to measure?
There are several major kinds of validity we need to take into account: criterion validity, content validity, construct validity, and consequential validity.
There are also several other kinds of validity that must be considered when looking at test results or experimental results: internal and external validity.
VALIDITY
On top of that, there are other kinds of validity that apply only to qualitative studies or action research, and others still that apply to evaluation projects
It doesn’t help clarify things much when we realize that even if a test is a valid measure of a latent knowledge, skill or ability, often the validity of the inference made from that score is invalid
We make inferences about what we can’t see based on what we can see, but if a student’s performance isn’t adequately captured by the test, we can arrive at the wrong conclusion… and often do
VALIDITY
Popham (2010) states on p. 19 that "[t]here is no such thing as a valid test", saying that tests themselves are neither valid nor invalid.
In essence, he says that the inferences we make from the tests are what are either valid or invalid.
This is only part of the story – a test can most certainly be valid or invalid when one looks at the fundamental definition of validity: "does a test measure what it sets out to measure?" Psychometrically speaking, we can easily determine the degree to which a test is valid, and help define the parameters for who should take the test in order for it to produce valid results and therefore lead to valid interpretations.
CRITERION VALIDITY
First defined by Thorndike (1918), who defined validity as how well a measure correlates with another similar measure
Tests were judged valid based on their diagnostic value, defined in terms of their correlations with other measures.
Concurrent validity: how well the measure correlates with another given at the same time
Predictive validity: how well the measure correlates with another given at some point in the future
CRITERION VALIDITY
Advantage: Clearly relevant to the plausibility of proposed interpretations; can be very objective.
Disadvantages:
Obtaining a good criterion is often almost impossible.
Sometimes the criterion is clearly superior to the measure; why not just use the better one rather than make a new measure?
What if the external criterion is not valid?
Of all the forms of validity evidence, criterion validity is used the least – it conflates validity with reliability, and can easily lead to the erroneous conclusion that reliability and validity are the same thing… and they're NOT.
CONTENT VALIDITY
Formally introduced by Cureton (1951).
A test is valid to the extent that it covers the relevant content, and whether that coverage is at the right level of cognitive complexity.
All content domains must be specified completely and to the right degree of accuracy.
Additionally, all items on the instrument are written at the proper level for the intended audience; they're not too tough or too easy.
A sample of responses in some area is thought to be representative of someone’s overall ability throughout that area
Modern computer-adaptive testing and automated testing assembly has its roots here
CONTENT VALIDITY
Advantages:
No external criterion needed.
If the content domain has been specified accurately, and the test is constructed to be representative of the domain, validity is easy to establish.
Disadvantages:
Dependent on subjective opinion as to the worth of the measure and/or the relevance and representativeness of the items to the content domain.
Confirmatory bias: if the researcher is also the author of the instrument, there is a tendency to find evidence in one's favor (Kane, 2006).
CONSTRUCT VALIDITY
Defined in a major work by Cronbach & Meehl (1955).
Construct validity is to be used when the instrument measures something that cannot be observed directly, has no content-related operational definitions, and has no adequate criterion measure.
A researcher develops a sort of map of what s/he believes the relationships of the underlying constructs to be measured are, and the validity of the instrument is defined by how well it mirrors this conceptual map (or nomological network)
A construct is defined by its role in the nomological network
CONSTRUCT VALIDITY
If what the researcher predicts should happen doesn't match up with what is seen, then at least one of three things has happened:
The theory is incorrect
The measurements aren't adequate to measure the constructs
The relationships between the underlying constructs have been misspecified
Modern cognitive psychology owes much of its existence to these ideas; they gave researchers a way to quantify psychological phenomena that the behaviorists ignored.
CONSTRUCT VALIDITY
Advantages:
Validity theory & practice moved from simple correlations and prediction to explanation (establishing causality)
Moved focus on test scores from being summative numbers and end results to formative or process-related indications – the test no longer drives anything; it reflects what’s going on underneath
Relevance, usefulness and importance of these scores cannot be evaluated without a deeper understanding of the processes or constructs that drive the performance on the measure
Best of all: it includes all other types of validity evidence as well as reliability and methodological evidence
CONSTRUCT VALIDITY
Disadvantages:
Defining the constructs and establishing what is and isn't valid is extremely difficult to do and to explain to someone without a similar level of knowledge of the constructs
There are no specific rules for the process of establishing validity; should validity simply be in the eye of the beholder?
Heavily driven by theory; following what few rules there are can make a researcher’s job intractable – some validation studies take longer than the original study, without any clear answers at the end
CONSEQUENTIAL VALIDITY
Consequential validity: what are the consequences of taking a particular test and making a particular score? (Messick, 1989)
Validity is defined as an integrated evaluative judgment of the extent that theory and empirical evidence “support the adequacy and appropriateness of inferences and actions based on test scores” (Messick, 1989; italics in original)
This draws questions of ethics into research in psychology and education
CONSEQUENTIAL VALIDITY
Advantages:
Researchers should specify in advance what the criteria for validity are; no more guesswork
Focuses on the uses (and possible misuses) of
test scores, putting responsibility on the researcher to interpret and use these scores correctly
Requires evidence of validity from numerous different sources
Validity as a property of an instrument turned into validation as a continuing process
CONSEQUENTIAL VALIDITY
Disadvantages:
There are no specific rules for conducting a validity study
Heavily dependent on theory; less dependent on practice
Treats validity as a toolbox, in which one can pick and choose whatever method they want to use
Establishing the validity of an instrument becomes almost immaterial; focus is on what you do with the scores after you get them
In this system, the process of validation never ends – validity must always be refined and re-addressed
FACE VALIDITY
Does an instrument look good on the surface? Does it look professional?
Many studies are often killed off early on because the tests/surveys used are full of typos, or are poorly designed or laid out
Often, the language used is an indicator of face validity:
Is the survey dumbed down or too advanced for the intended audience?
Is the font used easily readable?
Are the instructions clear?
Is there a logical flow to the instrument?
VALIDITY: FURTHER ISSUES
Some call for theory and practice to be separated to make the researcher's job easier, using more advanced psychometric techniques to more accurately define constructs.
Kane (2006) suggested an argument-based approach to provide evidence for validity, much like attorneys in a court case.
Borsboom et al. (2004) argue for a simple causality-based approach: a test is valid only if (a) the attribute exists and (b) changes in the attribute cause changes in how someone responds to the test.
RELIABILITY & VALIDITY
The relationship between reliability and validity can be easily shown using The Bull’s-Eye Example
In short, an instrument can be: Reliable and valid Reliable and not valid Not reliable and not valid
You cannot have an instrument that is valid but not reliable!
VALIDITY AND ACTION RESEARCH
There are types of validity unique to action research:
Outcome validity: the extent to which actions lead to the resolution of the problem
Process validity: the adequacy of the processes used in the different phases of the research
Democratic validity: the extent to which the action research is done in collaboration with all parties who have a stake in the problem
Catalytic validity: the extent to which an action research project focuses and energizes participants, opening them to a transformed view of reality in relation to their practice (the "oh, wow!" factor)
Dialogic validity: the use of extensive dialogue with peers in the formation and review of the researcher's findings and interpretations
INTERNAL VALIDITY
In short, a good researcher needs to maximize the level of internal validity of the study.
Internal validity: the extent to which the results of a study can be attributed to the manipulation of the independent variable (IV), rather than to some confounding variable.
A study with good internal validity has no (or only a few minor) confounds and accurately offers only one explanation of the results.
There are eleven major threats to internal validity, as well as some minor ones…
THREATS TO INTERNAL VALIDITY
1. Nonequivalent control group: at the outset, the control and experimental groups must be as similar as possible – if the control group differs from the experimental group significantly (especially on the construct that you want to measure), it will be impossible to make any reasonable conclusions
2. History effect: an event outside the scope of the experiment can affect performance on the construct being measured by the dependent variable (DV)
Example: a two-month stress-reduction program in which the posttest is given during finals week – what do you think might go wrong?
THREATS TO INTERNAL VALIDITY
3. Maturation effect: often, naturally-occurring changes within the participants can be responsible for the observed results
4. Testing effect (also known as practice effect): repeated testing may lead to subjects getting more familiar with the test (or the way the experimenter gives tests), rather than growth on the construct of interest due to the manipulation of the IV – the subjects are merely better at taking tests
Fatigue effect: performance decreases on repeated tests because the subjects tire of taking them
THREATS TO INTERNAL VALIDITY
5. Regression to the mean: subjects with extreme scores, either high or low, will tend to score closer to the mean upon retesting
Often, subjects are selected for a study based on a score on some measure, but that score may not truly indicate their ability
Maybe they did well simply because they were lucky, or they did poorly because they hadn’t slept
Example: The Sports Illustrated hex
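Regression to the mean is easy to demonstrate by simulation (a sketch with made-up numbers): give everyone a stable true ability plus independent luck on each test, select the top scorers on test 1, and watch their test 2 average slide back toward the population mean with no change in ability at all.

```python
import random

random.seed(0)

# Each person has a stable true ability; each test score adds independent luck.
true_ability = [random.gauss(100, 10) for _ in range(10_000)]
test1 = [a + random.gauss(0, 10) for a in true_ability]
test2 = [a + random.gauss(0, 10) for a in true_ability]

# Select people with extreme scores on test 1 (e.g., athletes who just made
# a magazine cover)...
top = [i for i, s in enumerate(test1) if s > 120]

mean_t1 = sum(test1[i] for i in top) / len(top)
mean_t2 = sum(test2[i] for i in top) / len(top)

# ...and their second scores drift back toward the overall mean of 100,
# even though nothing about their ability changed between tests.
print(f"top group, test 1: {mean_t1:.1f}")
print(f"top group, test 2: {mean_t2:.1f}")
```

The extreme group was partly selected for good luck, and luck does not repeat; that is the whole "hex."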
THREATS TO INTERNAL VALIDITY
6. Instrumentation effect: Changes in the DV may be related to changes in the measuring device – this can be a major problem when the instrument is a person, like an observer – not all observers are trained equally
7. Mortality / Attrition: Most studies are affected by dropout – people leave studies for lots of reasons
Usually the dropout rates across the different groups are the same
If one group has a higher dropout rate, this could lead to inequalities across groups
Why might this be a problem?
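Here is why differential dropout matters, as a minimal sketch with hypothetical numbers: even when a treatment does nothing, losing the low scorers from one group manufactures a spurious group difference.

```python
import random

random.seed(1)

N = 5000

# Both groups start out equivalent on the construct (mean 50, SD 10),
# and in this sketch the treatment itself has NO effect at all.
treatment = [random.gauss(50, 10) for _ in range(N)]
control = [random.gauss(50, 10) for _ in range(N)]

# But suppose the treatment is demanding, so low scorers tend to quit:
# participants scoring under 40 stay in the study only 30% of the time.
stayers = [x for x in treatment if x >= 40 or random.random() < 0.3]

mean_stayers = sum(stayers) / len(stayers)
mean_control = sum(control) / len(control)

# The surviving treatment group now outscores the control group,
# purely because of who left, not because of any real effect.
print(f"treatment completers: {mean_stayers:.1f}")
print(f"control group:        {mean_control:.1f}")
```

Comparing completers to the full control group answers a different question than the one the study asked, which is why attrition rates should always be reported per group.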
THREATS TO INTERNAL VALIDITY
8. Diffusion of treatment: Observed changes in behaviors or responses may be due to information received from other subjects in the study – someone tipped them off, and this affected their behavior (a version of the Hawthorne effect)
Example: if you knew there was a police roadblock ahead, wouldn’t you slow down?
This happens a lot in psychology departments that require participation in research – students regularly tell each other which experiments are the easy ones
THREATS TO INTERNAL VALIDITY
9. Experimenter effect: when the experimenter consciously or unconsciously affects the study – maybe they smile more when the subjects do as they’re expected
Single-blind experiment: either the participants or the researcher doesn’t know which condition the subjects are in
Double-blind experiment: neither the researcher nor the participants know which condition the participants are in
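One common way to implement blinding (a sketch, not a prescribed procedure) is to have a third party generate the allocation and keep the key, so the experimenter works only with opaque participant codes until data collection ends.

```python
import random

random.seed(2013)

participants = [f"P{i:02d}" for i in range(1, 21)]

# A third party generates the random allocation and keeps the key; the
# experimenter (and, in a double-blind design, the participants) never
# see group labels during the study.
shuffled = participants[:]
random.shuffle(shuffled)
allocation_key = {p: ("treatment" if i < len(shuffled) // 2 else "control")
                  for i, p in enumerate(shuffled)}

# What the blinded experimenter works with: codes only, no group labels.
blinded_view = sorted(allocation_key.keys())
print(blinded_view[:3])

# The key is unsealed only at analysis time.
n_treat = sum(1 for g in allocation_key.values() if g == "treatment")
print(f"{n_treat} of {len(participants)} assigned to treatment")
```

Shuffling before assignment gives equal group sizes while keeping the allocation unpredictable to anyone without the key.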
THREATS TO INTERNAL VALIDITY
10. Participant effect: sometimes, subjects bias the experiment based on their own expectations – maybe they get nervous because they know someone’s watching them
Often, subjects try to be “good participants”, trying to determine what the researcher wants to see and adjusting their behavior accordingly
Placebo effect: a special kind of participant effect in which the subjects believe they’re getting the treatment when they’re not; they believe they’re getting better not because of the treatment, but because of their expectation that the treatment is having an effect
THREATS TO INTERNAL VALIDITY
11. Floor & ceiling effects: any DV must be sensitive enough to detect differences between groups – If it’s not sensitive enough, real differences will be missed
Floor effect: A limitation of the measuring instrument that decreases the capability of detecting differences in scores at the bottom end of the scale – e.g., measuring rats in pounds
Ceiling effect: A limitation of the measuring instrument that decreases the capability of detecting differences in scores at the top end of the scale – e.g., measuring elephants on a bathroom scale
A pretest will usually tell you if your measure is sensitive enough
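A ceiling effect can also be shown by simulation (hypothetical group means and ceiling value): when the instrument maxes out below where the groups actually score, both groups pile up at the maximum and a real difference largely disappears.

```python
import random

random.seed(7)

N = 5000

# Two groups that genuinely differ by 10 points on the underlying construct.
group_a = [random.gauss(80, 5) for _ in range(N)]
group_b = [random.gauss(90, 5) for _ in range(N)]

def observed(raw, ceiling):
    """An instrument that maxes out at `ceiling` cannot record anything higher."""
    return min(raw, ceiling)

def mean_diff(ceiling):
    a = sum(observed(x, ceiling) for x in group_a) / N
    b = sum(observed(x, ceiling) for x in group_b) / N
    return b - a

# With plenty of headroom the full 10-point difference is visible; with a
# ceiling at 80, both groups stack up at the maximum and most of the real
# difference vanishes from the observed scores.
print(f"difference, ceiling=200: {mean_diff(200):.1f}")
print(f"difference, ceiling=80:  {mean_diff(80):.1f}")
```

A floor effect is the mirror image: replace `min` with `max` and the compression happens at the bottom of the scale instead.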
EXTERNAL VALIDITY
External validity is the extent to which the results of an experiment can be generalized to a larger population, beyond the participants in the study and the laboratory in which the experiment was conducted
The biggest loss of external validity comes from the fact that experiments using human participants often use small samples obtained from a single geographic location or with idiosyncratic features (e.g., volunteers).
Because of this, one cannot be sure that the conclusions drawn about causal relationships actually apply to people in other geographic locations or without these features
THREATS TO EXTERNAL VALIDITY
Generalization to populations: since most psychological research is done using undergraduates, it can be difficult to generalize results to a larger population
The college sophomore problem: most conclusions in psychological studies are based on studying young people with a late-adolescent mentality who are still maturing
In education, many of our results may apply only to a specific school or to a small subset of students with a particular educational characteristic not found in the general population (e.g., students with learning disabilities)
THREATS TO EXTERNAL VALIDITY
Generalization from laboratory settings: Lab settings allow researchers to maximize control, but sometimes at the cost of making the environment too artificial
Of course, we don’t want to give up control any more than we have to, because doing so will hurt our internal validity
The best way around this is through exact replication, which is repeating a study elsewhere using the exact same IVs and DVs as the original study
Conceptual replication repeats a study with a different IV or a different DV
Systematic replication repeats a study several times, each time changing only one feature of the original study
THREATS TO EXTERNAL VALIDITY
Aptitude-treatment interaction: The sample may have certain features that may interact with the independent variable – e.g., will a treatment that works on severely depressed individuals work on those with minor depression?
Reactivity: when causal relationships are found, they might not be generalizable if the effects found only occurred as an effect of studying the situation (e.g., Hawthorne effect, placebo effect)
THREATS TO EXTERNAL VALIDITY
Situation effects: All situational specifics (e.g. treatment conditions, time, location, lighting, noise, treatment administration, investigator, timing, scope and extent of measurement, etc. etc.) of a study potentially limit generalizability
Pretest or posttest effects: when causal relationships can only be found when pretests or posttests are carried out