LECTURE 06B BEGINS HERE THIS IS WHERE MATERIAL FOR EXAM 3 BEGINS.


RIGOR OF ASSESSMENT IN NORM-REFERENCED TESTING (HUTCHINSON, 1996)

RIGOR OF ASSESSMENT (PART OF ASSESSING PSYCHOMETRIC ADEQUACY)

Validity: Extent to which a procedure actually measures what it claims to measure

Reliability: Consistency of response/performance elicitation

Remember: Can be applied to both norm-referenced and criterion-referenced testing

RIGOR OF ASSESSMENT IN NORM-REFERENCED TESTING: SUBTOPIC = VALIDITY

ASSESSING VALIDITY IN NORM-REFERENCED TESTING

Definition of and evidence for validity

Extent to which a procedure actually measures what it is supposed to measure

Defined relative to a specific purpose, e.g. valid for screening, but not valid for Tx planning

Issue of the quality and extent of available evidence
--Logical analysis
--Empirical data

TYPES OF VALIDITY (H&P, 2012)

Construct validity: “Degree to which a test measures the theoretical construct it is intended to measure”

Content validity: Degree to which the content of a test is consistent with the purpose of the test

--appropriateness of items
--completeness of the item sample
--the way in which the items assess the content

Cf. face validity, which has the surface appearance of content validity

Criterion-related validity: Degree to which test performance predicts performance on other (external) criteria

--subtype = predictive: ability to predict score on a future test in a related area

--subtype = concurrent: compared to present performance on other tests in a related area

SOURCES OF EVIDENCE OF VALIDITY (HUTCHINSON, 1996)

Evidence used to support the argument that a test is valid for its stated purpose

First source category = Logical evidence
--Test’s purpose well stated
--Construct (theory/framework) well defined
--Good rationale for content of the test, which includes documentation that both easy and hard test items have been included, to discriminate disorder

Key concept: Are the test authors’ logically-based arguments convincing?

SOURCES OF EVIDENCE OF VALIDITY (HUTCHINSON, 1996) (CONT.)

Second source category = Empirical evidence

Correlation (r), a measure of relationship between ____________________ and _____________________

Good prediction of group membership with measures of __________________ and _____________________

Pattern of relationship among sub-test results should match the pattern predicted by the construct
--Via correlation
--Via factor analysis

Key concept: Are the test authors’ empirically-based arguments convincing?

What are the labels on the axes when one uses correlation as evidence for validity?

Empirical evidence for validity, using correlation…

Measure of relationship between _____________ and ____________

Is the test authors’ empirical argument convincing?

What evidence is given to describe the relationship between the test of interest and others considered to be similar?

Note that valid tests should also have low correlations with tests measuring different parameters
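To make the correlation idea concrete, here is a minimal sketch of Pearson's r computed by hand. The scores and the names new_test, similar_test, and unrelated_test are invented for illustration; they are not from any test manual. The point is only that a valid test would be expected to correlate highly with a similar measure and weakly with a measure of something different.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [xi - mean_x for xi in x]
    dev_y = [yi - mean_y for yi in y]
    # Sum of cross-products divided by the root of the two sums of squares
    cross = sum(a * b for a, b in zip(dev_x, dev_y))
    ss_x = sum(a * a for a in dev_x)
    ss_y = sum(b * b for b in dev_y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical scores for the same six clients on three measures
new_test = [10, 12, 15, 18, 20, 25]        # test being evaluated
similar_test = [11, 13, 14, 19, 21, 24]    # established test of the same area
unrelated_test = [7, 3, 9, 4, 8, 5]        # test of a different parameter

r_similar = pearson_r(new_test, similar_test)      # expected to be high
r_unrelated = pearson_r(new_test, unrelated_test)  # expected to be near zero
print(round(r_similar, 2), round(r_unrelated, 2))
```

In a scatterplot of either pair, each axis would carry the scores from one of the two measures being compared.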

Sensitivity--the test’s accuracy in correctly identifying the clients WITH the disorder

Specificity-- the test’s accuracy in correctly identifying the clients WITHOUT the disorder
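The two definitions above can be sketched as simple proportions. The counts below are hypothetical, and the function and parameter names are chosen for this illustration only.

```python
def sensitivity(true_positives, false_negatives):
    """Proportion of clients WITH the disorder that the test correctly identifies."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Proportion of clients WITHOUT the disorder that the test correctly identifies."""
    return true_negatives / (true_negatives + false_positives)

# Hypothetical validation sample:
# 50 clients with the disorder, of whom the test flags 45 (5 are missed)
# 100 clients without the disorder, of whom the test clears 90 (10 are flagged)
sens = sensitivity(true_positives=45, false_negatives=5)
spec = specificity(true_negatives=90, false_positives=10)
print(sens, spec)
```

Both values range from 0 to 1, and a test manual may report them as percentages instead.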


Empirical evidence for validity, using measures of sensitivity and specificity… Let’s “visualize” these concepts

Empirical evidence for validity, using measures of sensitivity and specificity…

In the test manual, we’re looking for reports of high specificity and high sensitivity.

Is the test authors’ empirical argument convincing?

What evidence is given to support the accuracy of this test in classifying subjects into already-established performance categories?

Do you see how this type of evidence for validity is directly related to the purpose of norm-referenced tests?

Empirical evidence for validity, using patterns of correlations among subtests, to see if the patterns fit what the construct would predict (construct in this example = what makes up writing ability?)

What statistical data support the relationship among separate components of the test or their relationship with the overall construct?

Empirical evidence for validity, using factor analysis of sub-test scores, e.g. to see if patterns of factor loadings follow what the construct of writing ability would predict

I: “Writer’s development of the work”
II: “Writer’s fluency with mechanics”
III: “Sentence structure”
IV: “Writer’s orientation to the reader”

RIGOR OF ASSESSMENT IN NORM-REFERENCED TESTING: SUBTOPIC = RELIABILITY

Remember: Reliability is the consistency of response/performance elicitation (includes consistency of scoring and measurement)

TYPES OF RELIABILITY, AND EVIDENCE FOR THEM

Agreement OR inter-rater reliability: Correlation of scores of two raters (good = .85-.90)*

--Item by item or total score

Stability OR test-retest reliability: Correlation of scores from two separate test administrations with the same person, across testees (good = .85-.90)* (continued….)

Can you see why the authors should optimally provide reliability scores for: 1) each age group separately? 2) both normal and disordered groups?

TYPES OF RELIABILITY, AND EVIDENCE FOR THEM (CONT.)

Internal consistency OR split-half reliability: Split the test into two halves and obtain the correlation between the two sets; measured as r
--E.g. split top from bottom
--E.g. split even items from odd items

Test items assigned to two halves through random assignment, and obtain r. Then do this again, and again, and again….. “Average” all the r’s = Cronbach’s coefficient alpha
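As a rough sketch of these two ideas, the code below computes one odd/even split-half r and then coefficient alpha. Note the assumption: alpha is computed here with the standard variance-based formula, which is not shown on the slides, rather than by literally averaging many split-half r's. The item scores (1 = correct, 0 = incorrect) are made up for illustration.

```python
from math import sqrt
from statistics import pvariance

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

def split_half_r(item_scores):
    """Correlate each person's total on odd-numbered items with their total on even-numbered items."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, ...
    return pearson_r(odd_totals, even_totals)

def cronbach_alpha(item_scores):
    """Coefficient alpha from item variances and total-score variance."""
    k = len(item_scores[0])  # number of items
    item_variances = [pvariance([row[i] for row in item_scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical data: one row per test-taker, one column per item
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(split_half_r(scores), 3), round(cronbach_alpha(scores), 3))
```

A single split-half r depends on how the items happen to be divided, which is why a summary across many possible splits is preferred.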

What are the labels on the axes when one uses correlation as evidence for
--inter-rater reliability?
--test/retest reliability?
--split-half reliability?

Empirical evidence for reliability, using patterns of correlations…

Think:

Even when a test is very carefully designed and reliable (consistent) in its ability to measure a construct (e.g. narrative comprehension), a client’s responses to test items may not always reflect a true picture of his underlying ability (e.g. his true ability to understand narrative passages).

Error in measurement cannot be avoided, especially when measuring human performance. Even with the most reliable test, what are some of the other factors that affect a client’s performance on a test, on a given day?

Transition slide from topic of reliability to topic of Standard Error of Measurement (SEM)

Observed score = the actual raw score that a test-taker earns

True score = hypothetical “ideal” score that the person would have earned if there were no error in measurement

STANDARD ERROR OF MEASUREMENT (SEM)

If a person took a test 100 times, their scores:

1) would tend to fall near some central score (represented by a measure of central tendency, such as the average), e.g. 42

2) would deviate from the central score (due to error of measurement) in a predictable way, with most of them not too far from the center

The “average deviation” (or “average distance”) from the central score is known as the standard deviation, e.g. 2.

This standard deviation (“average deviation”) due to error of measurement is called the standard error of measurement (SEM), e.g. 2 away from 42 (either above or below)

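[Figure: distribution of scores along a "Score" axis, centered at 42, with 40 and 44 marked one SEM away and the values two SEM away left blank (____)]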

Can you fill in the values that would be two SEM away from the average?

Now, test-makers don’t really calculate SEM by giving people a test 100 times! They calculate SEM using:

1) estimates of the test’s reliability (at least one of the three types)

2) the distribution of scores earned by the normative sample

3) the way in which reliability varies at different score levels

SO, clinicians don’t calculate SEM. SEM is provided in the test manual to help guide us in our interpretation of a client’s score.
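For reference, one widely used classical formula combines the first two ingredients listed above: SEM = SD × sqrt(1 − reliability), where SD is the standard deviation of scores in the normative sample. The sketch below applies it to hypothetical values. It does not account for the third ingredient (reliability varying at different score levels), and a given test manual may use a different method.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """Classical estimate: SEM = SD * sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return sd * sqrt(1 - reliability)

# Hypothetical values: normative-sample SD of 15, reliability coefficient of .91
sem = standard_error_of_measurement(sd=15, reliability=0.91)
print(round(sem, 2))  # about 4.5

# Higher reliability gives a smaller SEM for the same SD
print(round(standard_error_of_measurement(sd=15, reliability=0.96), 2))
```

The formula shows why reliability matters for interpretation: the less reliable the test, the wider the band of error around any one observed score.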


68% of the scores would be predicted to fall within one SEM of the average

e.g. we could predict that 68/100 would fall between 40 and 44

95% of the scores would be predicted to fall within two SEMs of the average

e.g. we could predict that 95/100 would fall between ____ and ____


SEM AND ITS RELATIONSHIP TO CONFIDENCE INTERVALS (See Hutchinson and H&P readings)

Observed score: The actual raw score that the test taker earns

True score: The score that the person would have earned if there were no measurement error


+ 1 SEM to -1 SEM = 68% confidence interval. We can have 68% confidence that the client’s true score would fall somewhere in this range

+ 2 SEM to -2 SEM = 95% confidence interval. We can have 95% confidence that the client’s true score would fall somewhere in this range
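The arithmetic behind these intervals can be sketched as follows. The observed score of 85 and SEM of 3 are hypothetical values chosen for illustration, and the function name is an assumption of this sketch.

```python
def confidence_interval(observed_score, sem, n_sems):
    """Return (low, high): the observed score minus and plus n_sems * SEM."""
    margin = n_sems * sem
    return (observed_score - margin, observed_score + margin)

# Hypothetical client: observed score 85, SEM from the manual = 3
ci_68 = confidence_interval(85, 3, n_sems=1)  # one SEM either side
ci_95 = confidence_interval(85, 3, n_sems=2)  # two SEMs either side
print(ci_68, ci_95)
```

The trade-off is visible in the output: higher confidence comes at the cost of a wider range.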

INTERPRETATION OF CONFIDENCE INTERVAL RELATIVE TO CUT-OFF SCORE

How do we interpret performance when the confidence interval:

a) is completely above the cut-off score?

b) is completely below the cut-off score?

c) straddles the cut-off score?
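The three cases can be told apart mechanically, as in the sketch below. It only labels where the interval sits; the clinical interpretation of each case is the question posed above. Treating an interval that just touches the cut-off as "straddles" is an assumption of this sketch, and all numbers are hypothetical.

```python
def position_relative_to_cutoff(ci_low, ci_high, cutoff):
    """Label where a confidence interval sits relative to a cut-off score."""
    if ci_low > cutoff:
        return "completely above"
    if ci_high < cutoff:
        return "completely below"
    # Assumption: an interval that touches or contains the cut-off counts as straddling
    return "straddles"

# Hypothetical 68% confidence interval of 82 to 88, checked against three cut-offs
case_a = position_relative_to_cutoff(82, 88, cutoff=80)
case_b = position_relative_to_cutoff(82, 88, cutoff=90)
case_c = position_relative_to_cutoff(82, 88, cutoff=85)
print(case_a, case_b, case_c)
```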

LECTURE 06B ENDS HERE
