PART 2: RELIABILITY & VALIDITY
EDRS 709, Summer 2013
RELIABILITY AND VALIDITY
These two concepts are fundamentally important in testing and survey work.
Validity: are we measuring what we set out to measure? There are several different kinds of validity: criterion, content, construct, measurement, and even face validity are some of the major ones.
Reliability: are we measuring the same thing each time we make our measurements?
2013 W. David Scales
RELIABILITY
A measure is reliable if it is free of measurement error.
No instrument is completely reliable, but we can quantify the degree of reliability easily enough. A reliability coefficient quantifies the degree of reliability.
Reliability is an indication of the consistency or stability of the measuring instrument. We want to be able to measure the same thing the same way each time, and we need our readings to be accurate.
Example: If I weighed you three times on the same scale, the scale should return the same number each time.
However, errors creep into our observations all the time.
RELIABILITY & ERROR
Two major kinds of error: method error and trait error.
Method error: problems associated with the experimenter and the testing situation. Examples: faulty equipment; researcher writes down the wrong number or gives the wrong instruction; unwanted distractions during testing, etc.
Trait error: problems associated with the participants in a study. Examples: participant is lying; participant is sick/sleepy/hung over at test time, etc.
RELIABILITY & ERROR
Any score that we observe is some combination of a subject’s true level of ability and the degree of error that enters into our observations
Obviously, we want to minimize the error associated with our measurements so that the score we observe matches up to the true score as closely as possible
In practical terms, it’s impossible for an instrument to be perfectly reliable – we’re trying to infer into existence something that doesn’t exist in an empirical sense
RELIABILITY
Conceptually, reliability looks like this:

Reliability = True score / (True score + Error score)

You can see that the smaller the error is, the closer the reliability score gets to 1.0. For your purposes, assume that reliability exists on a continuum between zero and 1.0, where higher scores mean the instrument is more reliable.
Generally, anything over 0.8 is considered to have good reliability.
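This ratio can be illustrated with a quick simulation; the numbers below (a true-score mean of 100, true-score SD of 15, error SD of 7.5) are hypothetical and chosen only so that the theoretical reliability works out to 225 / 281.25 = 0.80:

```python
import random
from statistics import pvariance

random.seed(42)

# Hypothetical model: observed score = true score + random error
n = 100_000
true_scores = [random.gauss(100, 15) for _ in range(n)]   # true ability
errors = [random.gauss(0, 7.5) for _ in range(n)]         # measurement error
observed = [t + e for t, e in zip(true_scores, errors)]

# Reliability = true-score variance / observed-score variance
# Expected value here: 15**2 / (15**2 + 7.5**2) = 0.80
reliability = pvariance(true_scores) / pvariance(observed)
```

With a large simulated sample, the computed ratio lands very close to the theoretical 0.80.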
MEASURING RELIABILITY: CORRELATIONS
Correlation coefficients are measures of the degree of relationship between two variables.
A correlation coefficient ranges between –1.0 and +1.0. A coefficient of 0.0 = no relationship between two variables. Negative values reflect a negative relationship; positive values reflect a positive relationship.
The closer the coefficient is to +1.0 or –1.0, the stronger the relationship is. Like reliability, a correlation above 0.8 is considered a strong correlation.
CORRELATION DOES NOT MEAN CAUSATION!
Note the subtlety!
CORRELATION ≠ CAUSATION
An observed association between variables does NOT indicate that one variable causes the other—though often we would like this to be the case.
Sometimes it is difficult to tell which variable is causing the other.
Often the two variables are both related to some other underlying variable that we haven’t considered.
We can’t always control for the effects that other variables might have on our dependent variable.
More often than not, there's no relationship at all, but we want to think that there is.
CORRELATION ≠ CAUSATION
In cities, there is a high correlation between the number of police officers and the murder rate—does having more police cause more murders? (Perhaps the opposite is true, but even that is an over-simplification.)
There is a high correlation between the number of pedestrians killed by cars and the number of crosswalks in a city—do more crosswalks cause more deaths, or vice versa?
A game of logical fallacies: In the summer, murder rates increase. In the summer, ice cream sales increase. Therefore, ice cream causes people to murder.
TYPES OF RELIABILITY
Several ways to establish reliability:
Test/retest reliability: give the test to the same subjects twice, then correlate the results
Parallel-forms reliability (or alternate-forms reliability): a correlation of scores from similar tests with similar variances
Split-half reliability: split a test in two (odd/even or first-half/second-half) and correlate the halves
Interrater reliability: a measure of consistency of agreement between two or more judges or raters
These are all measures of consistency (the split-half approach, in particular, measures internal consistency)
RELIABILITY
The concept of true score can be a little slippery. A subject's true score is an abstract representation of his/her latent ability on some construct.
With any test, assessment or survey, we're attempting to make manifest something that is inherently intangible or latent.
Since this knowledge, these skills and/or abilities (KSAs) don't exist tangibly in the real world, by definition the true score of any subject as a reflection of these KSAs is also an abstraction.
An observed score, therefore, is an inference of a true score (and, by extension, the underlying KSA)
RELIABILITY
Reliability is calculated using one of many special forms of a correlation coefficient, and interpreted in much the same way
Reliability coefficients above .80 are considered good; coefficients below .4 are considered poor
ρXX′ is the generalized form of the reliability coefficient that measures the correlation between parallel tests, X and X′ ("X-prime")
Just like any correlation coefficient, reliability presumes that you are working with the population instead of a sample, and this is reflected in all the calculations
RELIABILITY
Also, just like with any correlation coefficient, squaring it will give you the proportion of variance in X accounted for by X’ – it’s very similar to any other measure of effect size
Reliability is also a measure of the ratio of true-score variance to observed-score variance:
The less error there is in your measurement, the more reliable the instrument is
ρXX′ = σ²T / σ²X

ρXX′ → 1.0 as Error → 0.0
ρXX′ → 0.0 as Error → 1.0 (with error expressed as the proportion of observed-score variance)
IMPLICATIONS OF ρXX′ = 1.0
This means that all measurements are made without any errors whatsoever.
The observed score equals the true score (X = T) for all examinees.
All observed-score variance solely reflects true-score variance: σ²X = σ²T
The correlation between true scores and observed scores is ρXT = 1.0
The correlation between observed scores and errors is ρXE = 0.0
Adapted from Luecht (2004)
IMPLICATIONS OF ρXX′ = 0.0
This means that all measurements are nothing but random error.
The observed score is nothing but random error (X = E) for all examinees.
All observed-score variance solely reflects error variance: σ²X = σ²E
The correlation between true scores and observed scores is ρXT = 0.0
The correlation between observed scores and errors is ρXE = 1.0
IMPLICATIONS OF 0 < ρXX′ < 1.0
The observed score (or the measures used to obtain the score, or both) contains some error: X = T + E for all examinees.
The variance of the observed scores contains true-score variance and error variance: σ²X = σ²T + σ²E
The correlation between true scores and observed scores is ρXT = √ρXX′
The correlation between observed scores and error scores is ρXE = √(1 − ρXX′)
IMPLICATIONS OF 0 < ρXX′ < 1.0
Differences in observed scores can reflect differences in true scores (and therefore the underlying KSAs), random errors of measurement, or a combination of both.
Reliability is therefore the proportion of observed-score variance that is due to true-score variance:

ρXX′ = σ²T / σ²X

The higher the reliability coefficient is, the more confidently we can estimate true scores (and therefore, the underlying KSA) from observed scores.
ESTIMATING RELIABILITY
There are several different methods for calculating reliability, and each method is used for a specific reason
Reliability coefficients are all forms of correlations!
Some of the earliest work in reliability stems from the late 19th century, and reliability has always been closely linked to validity
Thorndike’s (1918) groundbreaking work concerning criterion validity -- he defined validity as how well a measure correlates with another similar measure (we’ll look at this later…)
ESTIMATING RELIABILITY
By implication, Thorndike (1918) said that reliability and validity were inherently linked – in his conception, validity and reliability had a direct, almost linear relationship – high correlations between an instrument and its criterion meant that the instrument was mathematically reliable and conceptually valid
Even though his conclusion concerning the relationship is inherently flawed, Thorndike laid out the conceptual power of using correlations as measures of reliability, and we use the same ideas today
TEST / RETEST RELIABILITY
Test-retest reliability implies that the same examinees are tested twice using the same test
The reliability coefficient is simply the Pearson correlation between the two sets of test scores
This is often used with personality tests or behavioral assessments. This is not used with cognitive assessments – more in a minute…
Example: two sets of scores on a personality inventory, in which scores range from 0 to 100
TEST / RETEST RELIABILITY

Case   A    B      zA        zB        zA·zB
1      72   87     0.7551    0.7173    0.5417
2      58   79    -0.4935   -0.2286    0.1128
3      70   81     0.5767    0.0079    0.0045
4      46   67    -1.5638   -1.6475    2.5762
5      68   83     0.3984    0.2444    0.0973
6      74   86     0.9335    0.5991    0.5592
7      45   75    -1.6530   -0.7016    1.1596
8      56   73    -0.6719   -0.9380    0.6302
9      88   97     2.1821    1.8997    4.1454
10     68   86     0.3984    0.5991    0.2387
11     68   81     0.3984    0.0079    0.0031
12     62   92    -0.1368    1.3085   -0.1789
13     48   65    -1.3854   -1.8839    2.6100
14     62   86    -0.1368    0.5991   -0.0819
15     68   76     0.3984   -0.5833   -0.2324

Mean        63.533  80.933          Σ zA·zB = 12.1856
Std. Dev.   11.212   8.457

rAB = Σ(zA·zB) / n = 12.1856 / 15 = 0.812
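The table's arithmetic can be reproduced in a few lines. This sketch standardizes each set of scores (using population standard deviations, as the worked example does) and averages the products of the z-scores:

```python
from statistics import fmean, pstdev

# the two administrations of the personality inventory from the example
A = [72, 58, 70, 46, 68, 74, 45, 56, 88, 68, 68, 62, 48, 62, 68]
B = [87, 79, 81, 67, 83, 86, 75, 73, 97, 86, 81, 92, 65, 86, 76]

def pearson_r(x, y):
    """Pearson correlation computed as the mean of z-score products."""
    mx, my = fmean(x), fmean(y)
    sx, sy = pstdev(x), pstdev(y)   # population standard deviations
    products = [((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)]
    return sum(products) / len(products)

r_ab = pearson_r(A, B)  # test-retest reliability, about 0.81
```

Any Pearson-correlation routine (for example, `scipy.stats.pearsonr`) gives the same value; the z-score form is shown because it mirrors the table.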
TEST / RETEST RELIABILITY
This very simple method to calculate reliability is nothing more than Thorndike’s (1918) work
However, there are some problems with using this method
Carry-over effects: an examinee’s attitude or general performance could be dependent on their prior trial performance
Practice effects: being exposed to the same information more than once may influence how they respond to items
Time between testing: if the time period is too short, examinees may remember their responses; if it’s too long, maturation may occur
PARALLEL-FORMS RELIABILITY
Alternate forms can be pre-constructed to be parallel, meaning they have similar true-score variances and equivalent error variances. However, this requires a great deal of assumption on the part of the test maker, a deep level of conceptual understanding of the latent constructs of interest, and a large amount of empirical verification.
Like before, we give the two tests C and D, correlate the results, and that gives us a measure of reliability
Problem: if either C or D or both are unreliable on their own, reliability will be confounded with parallelism
PARALLEL-FORMS RELIABILITY
Like with test / retest reliability, parallel-forms reliability is simply the Pearson correlation between the two forms C and D
Alternate forms are usually assumed to be parallel, and later tested for parallelism using the means and variances of the observed scores, and correlations with other tests
If the true scores are not equal, it's almost impossible to make a reasonable case for establishing the reliability of either form, much less being able to say that one is a reliable proxy for the other
RELIABILITY & TEST LENGTH
In general, adding more items to a test increases reliability
The Spearman-Brown formula is used to evaluate the likely increase in reliability due to increasing the length of a test or survey
The same formula is also used to adjust the correlation obtained from a split-half reliability analysis, which we’ll see later
RELIABILITY & TEST LENGTH
Assume that Y is a test of length k (or, test Y has k items) and that ρYY′ is the reliability coefficient for Test Y.
What would happen if we increased the length of Test Y from k to m items?
The Spearman-Brown formula (sometimes called the Spearman-Brown prophecy formula) will compute the new reliability:

ρXX′ = (m/k) ρYY′ / (1 + (m/k − 1) ρYY′)
RELIABILITY & TEST LENGTH
Suppose we have a 10-item test with ρYY′ = .70. What will happen to the reliability if we double the test length to 20 items?
ρYY′ = .70, k = 10, and m = 20:

ρXX′ = (20/10)(.70) / (1 + (20/10 − 1)(.70)) = 1.40 / 1.70 = 0.824

Doubling the test length will raise reliability by about 18%
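The prophecy formula is a one-liner in code; this sketch reproduces the worked example (ρYY′ = .70, doubling a 10-item test to 20 items):

```python
def spearman_brown(rel, k, m):
    """Predicted reliability when a k-item test with reliability `rel`
    is lengthened (or shortened) to m items."""
    ratio = m / k
    return ratio * rel / (1 + (ratio - 1) * rel)

new_rel = spearman_brown(0.70, k=10, m=20)  # about 0.824
```

Note that the formula also works in reverse: with m < k it predicts the reliability lost by shortening a test.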
RELIABILITY & TEST LENGTH
If you start with an instrument that is unreliable, you’ll have to add an awful lot of items just to get mediocre reliability, much less good reliability
However, if you start with good reliability, you don’t need to do much work adding items to improve the reliability
There is a tradeoff: raising reliability from .65 to .75 is a lot easier than raising reliability from .90 to .95
Going from .90 to .95 may require tripling or quadrupling the test length – that’s a lot of effort for a very small payoff (the Law of Diminishing Returns)
RELIABILITY & TEST LENGTH
[Figures showing the Spearman-Brown relationship between test length and reliability; the graphics did not survive extraction]
INTERNAL CONSISTENCY
Internal consistency is determined by splitting the instrument in half in some way, then correlating the halves together
First-half / second-half split: an instrument is split right down the middle (e.g., the first 20 items are correlated against the second 20 items)
Problem: many tests are designed so that the simplest items are given first, and the test increases in difficulty as one moves through the items – this method only works if the items are of equal difficulty or are randomized in presentation
This method also will not work for any computer-based test that uses a right/wrong algorithm to determine which item the subject will see next
INTERNAL CONSISTENCY
Even-odd split: an instrument is split into the even-numbered and odd-numbered items
This method works substantially better than comparing the first & second half of a test – item ordering effects can be negated by more closely assuring that items of different difficulties are represented in both halves
Reliability for both first-half / second-half and even-odd splits is calculated in the same way – there's a simplified form of the Spearman-Brown formula to determine internal consistency
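The simplified correction is just the Spearman-Brown formula with m/k = 2, since the full test is twice the length of either half. A minimal sketch:

```python
def split_half_reliability(r_half):
    """Spearman-Brown correction for a half-test correlation:
    the full test is twice the length of each half (m/k = 2)."""
    return 2 * r_half / (1 + r_half)

# e.g., if the two halves correlate at .70, the full-length
# test's estimated reliability is about 0.824
full_rel = split_half_reliability(0.70)
```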
INTERNAL CONSISTENCY
To determine internal consistency, a single form of a test is given to all examinees
The items are scored dichotomously as right or wrong (0, 1).
Once the items are scored, they are either:
Partitioned into two groups of the same number of items; each group of items is summed to form a subtest total score; these total scores are correlated, and the correlation is then corrected for disattenuation (Spearman-Brown), or
Used to directly compute item-level variances that are then exploited to derive a general estimate of reliability (Cronbach's alpha, KR-20, KR-21)
PROBLEMS WITH SPLIT-HALF METHODS
Any test in which the items are not truly random in order is going to interfere with reliability. Many tests present items that increase in difficulty; the first-half / second-half split should never be used with these items.
Local independence: a response to an item is not dependent on any item that has come before, and has no bearing on the response to any other item that follows. Some tests present a scenario followed by a series of items based on that scenario, and often the response to one item influences the response to a following item.
PROBLEMS WITH SPLIT-HALF METHODS
Nuisance dimensions: any outside factor or KSA that produces systematic changes over the course of the test (e.g., speeded tests, pacing skills, reading speed, language comprehension, dexterity, induced rapid guessing)
Technically speaking, split-half methods only work for parallel halves. Unequal means are not a problem, in and of themselves, as long as the halves have at least approximately equal true-score means.
However, unequal variances of the halves can lead to poor and/or inappropriate reliability estimates
CRONBACH’S ALPHA
Developed by Cronbach (1951) as the first in an intended series of coefficients
Cronbach’s α provides the lower bound of reliability among all possible split halves, assuming that the halves are parallel – the value of Cronbach’s α is the most conservative estimate of reliability
Cronbach’s α is the most widely-used reliability estimate
Cronbach’s α is the only one that works for dichotomous and polytomous items
CRONBACH’S ALPHA
k is the total number of test items
σ²i is the item variance, which equals σ²i = piqi for dichotomous items (ui ∈ {0, 1}) -- the variances of all the items are summed together
σ²X is the variance of the sum scores, computed as σ²X = Σ(X − X̄)² / n

α = ρXX′ = (k / (k − 1)) (1 − Σσ²i / σ²X)
ITEM RESPONSES
Subject   i1  i2  i3  i4  i5  i6  i7  i8  i9  i10   Total
1          0   0   0   1   0   0   0   0   0   0       1
2          1   1   1   0   1   1   0   1   0   1       7
3          1   0   0   0   1   0   0   0   0   0       2
4          1   1   1   1   1   1   0   1   1   1       9
5          1   1   1   1   1   1   1   0   1   1       9
6          0   0   1   0   0   0   0   0   0   0       1
7          1   1   1   0   0   1   0   1   0   0       5
8          1   0   0   0   1   0   0   1   0   1       4
9          0   1   1   1   1   0   0   0   1   1       6
10         1   1   1   1   1   1   1   1   1   1      10
11         1   1   1   1   1   1   0   0   0   1       7
12         1   1   0   1   1   1   0   0   1   1       7
13         0   0   1   1   1   0   0   0   0   0       3
14         1   1   1   1   1   1   1   1   1   1      10
15         1   1   1   1   1   1   0   1   1   1       9
16         1   0   1   0   0   0   0   0   1   0       3
17         1   1   1   0   1   1   1   1   0   1       8
18         1   1   1   1   0   1   0   0   1   1       7
19         0   1   1   1   0   1   0   0   0   1       5
20         1   0   0   1   0   0   1   0   0   0       3

Mean      0.8 0.7 0.8 0.7 0.7 0.6 0.3 0.4 0.5 0.7     5.8
Variance  0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2     8.26
ITEM RESPONSES
Subject   i1     i2     i3     i4     i5     i6     i7     i8     i9     i10     Total
…         …      …      …      …      …      …      …      …      …      …       …
Mean      0.75   0.65   0.75   0.65   0.65   0.60   0.25   0.40   0.45   0.65    5.8
Variance  0.188  0.228  0.188  0.228  0.228  0.240  0.187  0.240  0.248  0.228   8.26

Σσ²i = 0.188 + 0.228 + … = 2.20
σ²X = 8.26
CRONBACH’S ALPHA
Σσ²i = 2.20 and σ²X = 8.26, so:

α = ρXX′ = (k / (k − 1)) (1 − Σσ²i / σ²X) = (10 / 9) (1 − 2.20 / 8.26) = 0.815
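The whole calculation can be checked in code. This sketch recomputes Cronbach's α from the item-response matrix above, using population variances as the slides do (for dichotomous items, each item's population variance is exactly piqi):

```python
from statistics import pvariance

# item-response matrix from the example: 20 examinees x 10 dichotomous items
responses = [
    [0,0,0,1,0,0,0,0,0,0], [1,1,1,0,1,1,0,1,0,1],
    [1,0,0,0,1,0,0,0,0,0], [1,1,1,1,1,1,0,1,1,1],
    [1,1,1,1,1,1,1,0,1,1], [0,0,1,0,0,0,0,0,0,0],
    [1,1,1,0,0,1,0,1,0,0], [1,0,0,0,1,0,0,1,0,1],
    [0,1,1,1,1,0,0,0,1,1], [1,1,1,1,1,1,1,1,1,1],
    [1,1,1,1,1,1,0,0,0,1], [1,1,0,1,1,1,0,0,1,1],
    [0,0,1,1,1,0,0,0,0,0], [1,1,1,1,1,1,1,1,1,1],
    [1,1,1,1,1,1,0,1,1,1], [1,0,1,0,0,0,0,0,1,0],
    [1,1,1,0,1,1,1,1,0,1], [1,1,1,1,0,1,0,0,1,1],
    [0,1,1,1,0,1,0,0,0,1], [1,0,0,1,0,0,1,0,0,0],
]

def cronbach_alpha(matrix):
    """Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / total variance)."""
    k = len(matrix[0])
    items = list(zip(*matrix))                       # columns = items
    sum_item_vars = sum(pvariance(col) for col in items)
    totals = [sum(row) for row in matrix]            # each examinee's sum score
    return (k / (k - 1)) * (1 - sum_item_vars / pvariance(totals))

alpha = cronbach_alpha(responses)  # about 0.815, matching the hand calculation
```

Because the items are dichotomous, the same value is also the KR-20 coefficient for these data.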
KUDER-RICHARDSON FORMULA 20 (KR-20)
KR-20 is used for dichotomous items.
k is the total number of test items
pi is the item-level proportion-correct score
qi = 1 − pi is the item-level proportion-incorrect score
σ²X is the variance of the sum scores, computed as σ²X = Σ(X − X̄)² / n

α = ρXX′ = (k / (k − 1)) (1 − Σpiqi / σ²X)
FOR DICHOTOMOUS ITEMS: KR-20 = CRONBACH’S α!
KUDER-RICHARDSON FORMULA 21 (KR-21)
KR-21 is also used for dichotomous items. However, KR-21 is used when you can assume that pi does not vary extensively across items.
k is the total number of test items
p̄ is the average item-level proportion-correct score
q̄ = 1 − p̄ is the average item-level proportion-incorrect score
σ²X is the variance of the sum scores, computed as σ²X = Σ(X − X̄)² / n

ρXX′ = (k / (k − 1)) (1 − k·p̄q̄ / σ²X)
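A minimal sketch of KR-21, written in terms of the mean total score (so that p̄ = mean / k). Applied to the earlier example (mean total 5.8, total-score variance 8.26), it gives a lower value than the α of 0.815 because the item difficulties in that dataset actually do vary:

```python
def kr21(k, mean_total, var_total):
    """KR-21 reliability estimate for k dichotomous items,
    assuming roughly equal difficulty across items."""
    p_bar = mean_total / k          # average proportion correct
    q_bar = 1 - p_bar
    return (k / (k - 1)) * (1 - k * p_bar * q_bar / var_total)

kr21_estimate = kr21(10, 5.8, 8.26)  # about 0.783
```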
INTERRATER RELIABILITY
Kappa (κ), often called Cohen’s κ, is a form of χ² that uses contingency tables to measure interrater agreement and is used as a measure of reliability (Cohen, 1960)
Cohen’s κ is used to measure the reliability of judges or raters – we want judges to rate what they see in the same way, using the same rubric or understanding of the rating criteria
Just like any piece of machinery (or assessment, survey or diagnostic tool used in psychology), raters need to be calibrated (or trained) in order to yield the most accurate measurements possible
INTERRATER RELIABILITY
Like most effect sizes, there aren’t hard-and-fast rules for interpreting κ, but there are some commonly-understood guidelines (Landis & Koch, 1977):
κ             Interpretation
< 0.1         Poor agreement
0.1 – 0.20    Slight agreement
0.21 – 0.40   Fair agreement
0.41 – 0.60   Moderate agreement
0.61 – 0.80   Substantial agreement
0.81 – 1.00   Almost perfect agreement
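Cohen's κ compares the raters' observed agreement with the agreement expected by chance from each rater's marginal frequencies. A minimal sketch; the two sets of ratings below are invented purely for illustration:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = len(rater1)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # chance agreement from the raters' marginal proportions
    p_chance = sum(c1[cat] * c2[cat] for cat in c1) / n**2
    return (p_observed - p_chance) / (1 - p_chance)

# hypothetical ratings: two judges scoring six subjects pass/fail
r1 = ["pass", "pass", "fail", "pass", "fail", "pass"]
r2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(r1, r2)  # 2/3, "substantial agreement" per the table
```

Here the judges agree on 5 of 6 cases (p_observed = 0.833) but would agree on half by chance (p_chance = 0.5), giving κ ≈ 0.67.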
INTRACLASS CORRELATION
An intraclass correlation is another measure of interrater reliability
It’s actually highly similar to Cohen’s κ Since Cohen’s κ is based on a χ² test of
independence, it’s used when you have two judges
An intraclass correlation is used when you have three or more judges and you want to examine their level of agreement
INTRACLASS CORRELATION
One could in theory correlate the ratings of all possible pairwise combinations of judges and then average the results, but this will ignore differences between judges. Example: if one judge always rates five points higher than another judge, their correlation will be 1.00, but the judges are saying different things about the subjects.
An intraclass correlation will allow you to take individual differences of judges into account
VALIDITY
Validity's big question: are we measuring what we set out to measure?
There are several major kinds of validity we need to take into account: criterion validity, content validity, construct validity, and consequential validity.
There are also several other kinds of validity that must be considered when looking at test results or experimental results: internal and external validity.
VALIDITY
On top of that, there are other kinds of validity that apply only to qualitative studies or action research, and others still that apply to evaluation projects
It doesn’t help clarify things much when we realize that even if a test is a valid measure of a latent knowledge, skill or ability, often the validity of the inference made from that score is invalid
We make inferences about what we can’t see based on what we can see, but if a student’s performance isn’t adequately captured by the test, we can arrive at the wrong conclusion… and often do
VALIDITY
Popham (2010) states on p. 19 that "[t]here is no such thing as a valid test", saying that tests themselves are neither valid nor invalid.
In essence, he says that the inferences we make from the tests are what are either valid or invalid.
This is only part of the story – a test can most certainly be valid or invalid when one looks at the fundamental definition of validity: "does a test measure what it sets out to measure?" Psychometrically speaking, we can easily determine the degree to which a test is valid, and help define the parameters for who should take the test in order for it to produce valid results and therefore lead to valid interpretations.
CRITERION VALIDITY
First defined by Thorndike (1918), who defined validity as how well a measure correlates with another similar measure
Tests were judged valid based on their diagnostic value, defined in terms of their correlations with other measures.
Concurrent validity: how well the measure correlates with another given at the same time
Predictive validity: how well the measure correlates with another given at some point in the future
CRITERION VALIDITY
Advantage: Clearly relevant to the plausibility of proposed interpretations; can be very objective.
Disadvantages:
Obtaining a good criterion is often almost impossible.
Sometimes the criterion is clearly superior to the measure; why not just use the better one rather than make a new measure?
What if the external criterion is not valid?
Of all the forms of validity evidence, criterion validity is used the least – it conflates validity with reliability, and can easily lead to the erroneous conclusion that reliability and validity are the same thing… and they're NOT.
CONTENT VALIDITY
Formally introduced by Cureton (1951).
A test is valid to the extent that it covers the relevant content, and whether that coverage is at the right level of cognitive complexity.
All content domains must be specified completely and to the right degree of accuracy.
Additionally, all items on the instrument are written at the proper level for the intended audience; they're not too tough or too easy.
A sample of responses in some area is thought to be representative of someone’s overall ability throughout that area
Modern computer-adaptive testing and automated testing assembly has its roots here
CONTENT VALIDITY
Advantages:
No external criterion needed.
If the content domain has been specified accurately, and the test is constructed to be representative of the domain, validity is easy to establish.
Disadvantages:
Dependent on subjective opinion as to the worth of the measure and/or the relevance and representativeness of the items to the content domain.
Confirmatory bias: if the researcher is also the author of the instrument, there is a tendency to find evidence in one's favor (Kane, 2006).
CONSTRUCT VALIDITY
Defined in a major work by Cronbach & Meehl (1955).
Construct validity is to be used when the instrument measures something that cannot be observed directly, has no content-related operational definitions, and has no adequate criterion measure.
A researcher develops a sort of map of what s/he believes the relationships of the underlying constructs to be measured are, and the validity of the instrument is defined by how well it mirrors this conceptual map (or nomological network)
A construct is defined by its role in the nomological network
CONSTRUCT VALIDITY
If what the researcher predicts should happen doesn't match up with what is seen, then at least one of three things has happened:
The theory is incorrect
The measurements aren't adequate to measure the constructs
The relationships between the underlying constructs have been misspecified
Modern cognitive psychology owes much of its existence to these ideas; they gave researchers a way to quantify psychological phenomena that the behaviorists ignored.
CONSTRUCT VALIDITY
Advantages:
Validity theory & practice moved from simple correlations and prediction to explanation (establishing causality)
Moved focus on test scores from being summative numbers and end results to formative or process-related indications – the test no longer drives anything; it reflects what’s going on underneath
Relevance, usefulness and importance of these scores cannot be evaluated without a deeper understanding of the processes or constructs that drive the performance on the measure
Best of all: it includes all other types of validity evidence as well as reliability and methodological evidence
CONSTRUCT VALIDITY
Disadvantages:
Defining the constructs and establishing what is and isn't valid is extremely difficult to do and to explain to someone without a similar level of knowledge of the constructs
There are no specific rules for the process of establishing validity; should validity simply be in the eye of the beholder?
Heavily driven by theory; following what few rules there are can make a researcher’s job intractable – some validation studies take longer than the original study, without any clear answers at the end
CONSEQUENTIAL VALIDITY
Consequential validity: what are the consequences of taking a particular test and making a particular score? (Messick, 1989)
Validity is defined as an integrated evaluative judgment of the extent that theory and empirical evidence “support the adequacy and appropriateness of inferences and actions based on test scores” (Messick, 1989; italics in original)
This draws questions of ethics into research in psychology and education
CONSEQUENTIAL VALIDITY
Advantages:
Researchers should specify in advance what the criteria for validity are; no more guesswork
Focuses on the uses (and possible misuses) of
test scores, putting responsibility on the researcher to interpret and use these scores correctly
Requires evidence of validity from numerous different sources
Validity as a property of an instrument turned into validation as a continuing process
CONSEQUENTIAL VALIDITY
Disadvantages:
There are no specific rules for conducting a validity study
Heavily dependent on theory; less dependent on practice
Treats validity as a toolbox, in which one can pick and choose whatever method they want to use
Establishing the validity of an instrument becomes almost immaterial; focus is on what you do with the scores after you get them
In this system, the process of validation never ends – validity must always be refined and re-addressed
FACE VALIDITY
Does an instrument look good on the surface? Does it look professional?
Many studies are often killed off early on because the tests/surveys used are full of typos, or are poorly designed or laid out
Often, the language used is an indicator of face validity:
Is the survey dumbed down or too advanced for the intended audience?
Is the font used easily readable?
Are the instructions clear?
Is there a logical flow to the instrument?
VALIDITY: FURTHER ISSUES
Some call for theory and practice to be separated to make the researcher's job easier, using more advanced psychometric techniques to more accurately define constructs.
Kane (2006) suggested an argument-based approach to provide evidence for validity, much like attorneys in a court case.
Borsboom et al. (2004) argue for a simple causality-based approach: a test is valid only if (a) the attribute exists and (b) changes in the attribute cause changes in how someone responds to the test.
RELIABILITY & VALIDITY
The relationship between reliability and validity can be easily shown using The Bull’s-Eye Example
In short, an instrument can be: Reliable and valid Reliable and not valid Not reliable and not valid
You cannot have an instrument that is valid but not reliable!
VALIDITY AND ACTION RESEARCH
There are types of validity unique to action research:
Outcome validity: the extent to which actions lead to the resolution of the problem
Process validity: the adequacy of the processes used in the different phases of the research
Democratic validity: the extent to which the action research is done in collaboration with all parties who have a stake in the problem
Catalytic validity: the extent to which an action research project focuses and energizes participants, opening them to a transformed view of reality in relation to their practice (the "oh, wow!" factor)
Dialogic validity: the use of extensive dialogue with peers in the formation and review of the researcher's findings and interpretations
INTERNAL VALIDITY
In short, a good researcher needs to maximize the level of internal validity of the study.
Internal validity: the extent to which the results of a study can be attributed to the manipulation of the independent variable (IV), rather than to some confounding variable.
A study with good internal validity has no (or only a few minor) confounds and accurately offers only one explanation of the results.
There are eleven major threats to internal validity, as well as some minor ones…
THREATS TO INTERNAL VALIDITY
1. Nonequivalent control group: at the outset, the control and experimental groups must be as similar as possible – if the control group differs from the experimental group significantly (especially on the construct that you want to measure), it will be impossible to make any reasonable conclusions
2. History effect: an event outside the scope of the experiment can affect performance on the construct being measured by the dependent variable (DV)
Example: a two-month stress-reduction program in which the posttest is given during finals week – what do you think might go wrong?
THREATS TO INTERNAL VALIDITY
3. Maturation effect: often, naturally-occurring changes within the participants can be responsible for the observed results
4. Testing effect (also known as practice effect): repeated testing may lead to subjects getting more familiar with the test (or the way the experimenter gives tests), rather than growth on the construct of interest due to the manipulation of the IV – the subjects are merely better at taking tests
Fatigue effect: performance decreases on repeated tests because the subjects tire of taking them
THREATS TO INTERNAL VALIDITY
5. Regression to the mean: subjects with extreme scores, either high or low, will tend to score closer to the mean upon retesting
Often, subjects are selected for a study based on a score on some measure, but that score may not truly indicate their ability
Maybe they did well simply because they were lucky, or they did poorly because they hadn’t slept
Example: The Sports Illustrated hex
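Regression to the mean is easy to demonstrate by simulation (a sketch with made-up numbers): give everyone a stable true ability plus independent luck on each test, select the top scorers on test 1, and watch their test 2 average slide back toward the population mean with no change in ability at all.

```python
import random

random.seed(0)

# Each person has a stable true ability; each test score adds independent luck.
true_ability = [random.gauss(100, 10) for _ in range(10_000)]
test1 = [a + random.gauss(0, 10) for a in true_ability]
test2 = [a + random.gauss(0, 10) for a in true_ability]

# Select people with extreme scores on test 1 (e.g., athletes who just made
# a magazine cover)...
top = [i for i, s in enumerate(test1) if s > 120]

mean_t1 = sum(test1[i] for i in top) / len(top)
mean_t2 = sum(test2[i] for i in top) / len(top)

# ...and their second scores drift back toward the overall mean of 100,
# even though nothing about their ability changed between tests.
print(f"top group, test 1: {mean_t1:.1f}")
print(f"top group, test 2: {mean_t2:.1f}")
```

The extreme group was partly selected for good luck, and luck does not repeat; that is the whole "hex."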
THREATS TO INTERNAL VALIDITY
6. Instrumentation effect: Changes in the DV may be related to changes in the measuring device – this can be a major problem when the instrument is a person, like an observer – not all observers are trained equally
7. Mortality / Attrition: Most studies are affected by dropout – people leave studies for lots of reasons
Usually the dropout rates across the different groups are the same
If one group has a higher dropout rate, this could lead to inequalities across groups
Why might this be a problem?
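Here is why differential dropout matters, as a minimal sketch with hypothetical numbers: even when a treatment does nothing, losing the low scorers from one group manufactures a spurious group difference.

```python
import random

random.seed(1)

N = 5000

# Both groups start out equivalent on the construct (mean 50, SD 10),
# and in this sketch the treatment itself has NO effect at all.
treatment = [random.gauss(50, 10) for _ in range(N)]
control = [random.gauss(50, 10) for _ in range(N)]

# But suppose the treatment is demanding, so low scorers tend to quit:
# participants scoring under 40 stay in the study only 30% of the time.
stayers = [x for x in treatment if x >= 40 or random.random() < 0.3]

mean_stayers = sum(stayers) / len(stayers)
mean_control = sum(control) / len(control)

# The surviving treatment group now outscores the control group,
# purely because of who left, not because of any real effect.
print(f"treatment completers: {mean_stayers:.1f}")
print(f"control group:        {mean_control:.1f}")
```

Comparing completers to the full control group answers a different question than the one the study asked, which is why attrition rates should always be reported per group.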
THREATS TO INTERNAL VALIDITY
8. Diffusion of treatment: Observed changes in behaviors or responses may be due to information received from other subjects in the study – someone tipped them off, and this affected their behavior (a version of the Hawthorne effect)
Example: if you knew there was a police roadblock ahead, wouldn’t you slow down?
This happens a lot in psychology departments that require participation in research – students regularly tell each other which experiments are the easy ones
THREATS TO INTERNAL VALIDITY
9. Experimenter effect: when the experimenter consciously or unconsciously affects the study – maybe they smile more when the subjects do as they’re expected
Single-blind experiment: either the participants or the researcher doesn’t know which condition the subjects are in
Double-blind experiment: neither the researcher nor the participants know which condition the participants are in
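One common way to implement blinding (a sketch, not a prescribed procedure) is to have a third party generate the allocation and keep the key, so the experimenter works only with opaque participant codes until data collection ends.

```python
import random

random.seed(2013)

participants = [f"P{i:02d}" for i in range(1, 21)]

# A third party generates the random allocation and keeps the key; the
# experimenter (and, in a double-blind design, the participants) never
# see group labels during the study.
shuffled = participants[:]
random.shuffle(shuffled)
allocation_key = {p: ("treatment" if i < len(shuffled) // 2 else "control")
                  for i, p in enumerate(shuffled)}

# What the blinded experimenter works with: codes only, no group labels.
blinded_view = sorted(allocation_key.keys())
print(blinded_view[:3])

# The key is unsealed only at analysis time.
n_treat = sum(1 for g in allocation_key.values() if g == "treatment")
print(f"{n_treat} of {len(participants)} assigned to treatment")
```

Shuffling before assignment gives equal group sizes while keeping the allocation unpredictable to anyone without the key.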
THREATS TO INTERNAL VALIDITY
10. Participant effect: sometimes, subjects bias the experiment based on their own expectations – maybe they get nervous because they know someone’s watching them
Often, subjects try to be “good participants”, trying to determine what the researcher wants to see and adjusting their behavior accordingly
Placebo effect: a special kind of participant effect in which the subjects believe they’re getting the treatment when they’re not; they believe they’re getting better not because of the treatment, but because of their expectation that the treatment is having an effect
THREATS TO INTERNAL VALIDITY
11. Floor & ceiling effects: any DV must be sensitive enough to detect differences between groups – If it’s not sensitive enough, real differences will be missed
Floor effect: A limitation of the measuring instrument that decreases the capability of detecting differences in scores at the bottom end of the scale – e.g., measuring rats in pounds
Ceiling effect: A limitation of the measuring instrument that decreases the capability of detecting differences in scores at the top end of the scale – e.g., measuring elephants on a bathroom scale
A pretest will usually tell you if your measure is sensitive enough
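A ceiling effect can also be shown by simulation (hypothetical group means and ceiling value): when the instrument maxes out below where the groups actually score, both groups pile up at the maximum and a real difference largely disappears.

```python
import random

random.seed(7)

N = 5000

# Two groups that genuinely differ by 10 points on the underlying construct.
group_a = [random.gauss(80, 5) for _ in range(N)]
group_b = [random.gauss(90, 5) for _ in range(N)]

def observed(raw, ceiling):
    """An instrument that maxes out at `ceiling` cannot record anything higher."""
    return min(raw, ceiling)

def mean_diff(ceiling):
    a = sum(observed(x, ceiling) for x in group_a) / N
    b = sum(observed(x, ceiling) for x in group_b) / N
    return b - a

# With plenty of headroom the full 10-point difference is visible; with a
# ceiling at 80, both groups stack up at the maximum and most of the real
# difference vanishes from the observed scores.
print(f"difference, ceiling=200: {mean_diff(200):.1f}")
print(f"difference, ceiling=80:  {mean_diff(80):.1f}")
```

A floor effect is the mirror image: replace `min` with `max` and the compression happens at the bottom of the scale instead.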
EXTERNAL VALIDITY
External validity is the extent to which the results of an experiment can be generalized to a larger population, beyond the participants in the study and the laboratory in which the experiment was conducted
The biggest loss of external validity comes from the fact that experiments using human participants often use small samples obtained from a single geographic location or with idiosyncratic features (e.g., volunteers).
Because of this, one cannot be sure that the conclusions drawn about causal relationships actually apply to people in other geographic locations or without these features
THREATS TO EXTERNAL VALIDITY
Generalization to populations: since most psychological research is done using undergraduates, it can be difficult to generalize results to a larger population
The college sophomore problem: most conclusions in psychological studies are based on studying young people with a late-adolescent mentality who are still maturing
In education, many of our results may apply only to a specific school or to a small subset of students with a particular educational characteristic not found in the general population (e.g., students with learning disabilities)
THREATS TO EXTERNAL VALIDITY
Generalization from laboratory settings: Lab settings allow researchers to maximize control, but sometimes at the cost of making the environment too artificial
Of course, we don’t want to give up control any more than we have to, because doing so will hurt our internal validity
The best way around this is through exact replication, which is repeating a study elsewhere using the exact same IVs and DVs as the original study
Conceptual replication repeats a study with a different IV or a different DV
Systematic replication repeats a study several times, each time changing only one feature of the original study
THREATS TO EXTERNAL VALIDITY
Aptitude-treatment interaction: The sample may have certain features that may interact with the independent variable – e.g., will a treatment that works on severely depressed individuals work on those with minor depression?
Reactivity: when causal relationships are found, they might not be generalizable if the effects found only occurred as an effect of studying the situation (e.g., Hawthorne effect, placebo effect)
THREATS TO EXTERNAL VALIDITY
Situation effects: All situational specifics (e.g. treatment conditions, time, location, lighting, noise, treatment administration, investigator, timing, scope and extent of measurement, etc. etc.) of a study potentially limit generalizability
Pretest or posttest effects: when causal relationships can only be found when pretests or posttests are carried out