Post on 28-Dec-2015
LECTURE 06B BEGINS HERE
THIS IS WHERE MATERIAL FOR EXAM 3 BEGINS
RIGOR OF ASSESSMENT IN NORM-REFERENCED TESTING (HUTCHINSON, 1996)
RIGOR OF ASSESSMENT (PART OF ASSESSING PSYCHOMETRIC ADEQUACY)
Validity: Extent to which a procedure actually measures what it claims to measure
Reliability: Consistency of response/performance elicitation
Remember: Can be applied to both norm-referenced and criterion-referenced testing
RIGOR OF ASSESSMENT IN NORM-REFERENCED TESTING: SUBTOPIC = VALIDITY
ASSESSING VALIDITY IN NORM-REFERENCED TESTING
Definition of and evidence for validity
Extent to which a procedure actually measures what it is supposed to measure
Defined relative to a specific purpose, e.g. valid for screening, but not valid for Tx planning
Issue of the quality and extent of available evidence
--Logical analysis
--Empirical data
TYPES OF VALIDITY (H&P, 2012)
Construct validity: “Degree to which a test measures the theoretical construct it is intended to measure”
Content validity: Degree to which the content of a test is consistent with the purpose of a test
--appropriateness of items
--completeness of the item sample
--the way in which the items assess the content
Cf. face validity, which has only the surface appearance of content validity
TYPES OF VALIDITY (H&P, 2012)
Criterion-related validity: Degree to which test performance predicts performance on other (external) criteria
--subtype = predictive: ability to predict score on a future test in a related area
--subtype = concurrent: compared to present performance on other tests in a related area
SOURCES OF EVIDENCE OF VALIDITY (HUTCHINSON, 1996)
Evidence used to support the argument that a test is valid for its stated purpose
First source category = Logical evidence
--Test’s purpose well stated
--Construct (theory/framework) well defined
--Good rationale for content of the test, which includes documentation that both easy and hard test items have been included, to discriminate disorder
Key concept: Are the test authors’ logically-based arguments convincing?
SOURCES OF EVIDENCE OF VALIDITY (HUTCHINSON, 1996)
Evidence used to support the argument that a test is valid for its stated purpose
Second source category = Empirical evidence
Correlation (r), a measure of relationship between ____________________ and _____________________
Good prediction of group membership with measures of __________________ and _____________________
Pattern of relationship among sub-test results should match the pattern predicted by the construct
--Via correlation
--Via factor analysis
Key concept: Are the test authors’ empirically-based arguments convincing?
What are the labels on the axes when one uses correlation as evidence for validity?
Empirical evidence for validity, using correlation…Measure of relationship between _____________ and ____________
Is the test authors’ empirical argument convincing?
What evidence is given to describe the relationship between the test of interest and others considered to be similar?
Note that valid tests should also have low correlations with tests measuring different parameters
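The pattern described above (high correlation with a similar test, low correlation with a test of something different) can be sketched in a few lines. This is only an illustration: the three score lists are hypothetical, and `pearson_r` is a helper written for this sketch, not something taken from a test manual.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for the same six clients on three tests.
new_test = [10, 12, 15, 18, 20, 25]          # the test of interest
similar_test = [11, 13, 14, 19, 21, 24]      # assumed to measure a similar construct
different_test = [30, 12, 25, 14, 28, 16]    # assumed to measure a different parameter

r_similar = pearson_r(new_test, similar_test)
r_different = pearson_r(new_test, different_test)
print(f"r with similar test:   {r_similar:.2f}")
print(f"r with different test: {r_different:.2f}")
```

With these made-up numbers the first correlation is high and the second is weak, which is the pattern one would hope to see reported in a manual.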
Sensitivity--the test’s accuracy in correctly identifying the clients WITH the disorder
Specificity--the test’s accuracy in correctly identifying the clients WITHOUT the disorder
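As a minimal sketch of these two definitions, the function below counts correct and incorrect classifications and turns them into proportions. The sample of 20 clients is hypothetical, and the function and variable names are illustrative only.

```python
def sensitivity_specificity(has_disorder, test_positive):
    """Return (sensitivity, specificity) from paired true status and test result."""
    pairs = list(zip(has_disorder, test_positive))
    true_pos = sum(1 for d, t in pairs if d and t)
    false_neg = sum(1 for d, t in pairs if d and not t)
    true_neg = sum(1 for d, t in pairs if not d and not t)
    false_pos = sum(1 for d, t in pairs if not d and t)
    # Sensitivity: proportion of clients WITH the disorder that the test flags.
    sensitivity = true_pos / (true_pos + false_neg)
    # Specificity: proportion of clients WITHOUT the disorder that the test clears.
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity

# Hypothetical sample: 10 clients with the disorder, then 10 without.
has_disorder = [True] * 10 + [False] * 10
test_positive = [True] * 9 + [False] * 1 + [True] * 2 + [False] * 8

sens, spec = sensitivity_specificity(has_disorder, test_positive)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

In this made-up sample the test identifies 9 of the 10 clients with the disorder and clears 8 of the 10 without it.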
Empirical evidence for validity, using measures of sensitivity and specificity…
Let’s “visualize” these concepts
In the test manual, we’re looking for reports of high specificity and high sensitivity.
Is the test authors’ empirical argument convincing?
What evidence is given to support the accuracy of this test in classifying subjects into already-established performance categories?
Do you see how this type of evidence for validity is directly related to the purpose of norm-referenced tests?
Empirical evidence for validity, using patterns of correlations among subtests, to see if the patterns fit what the construct would predict (construct in this example = what makes up writing ability?)
Is the test authors’ empirical argument convincing?
What statistical data support the relationship among separate components of the test or their relationship with the overall construct?
Empirical evidence for validity, using factor analysis of sub-test scores, e.g. to see if patterns of factor loadings follow what the construct of writing ability would predict
I: “Writer’s development of the work”
II: “Writer’s fluency with mechanics”
III: “Sentence structure”
IV: “Writer’s orientation to the reader”
Is the test authors’ empirical argument convincing?
RIGOR OF ASSESSMENT IN NORM-REFERENCED TESTING: SUBTOPIC = RELIABILITY
Reliability Consistency of response/performance elicitation (includes consistency of scoring and measurement)
Remember….
TYPES OF RELIABILITY, AND EVIDENCE FOR THEM
Agreement OR Inter-rater reliability
--Correlation of scores of two raters (good = .85-.90)*
--Item by item or total score
Stability OR Test-retest reliability
--Correlation of scores from two separate test administrations with the same person, across test-takers (good = .85-.90)* (continued….)
Can you see why the authors should optimally provide reliability scores for: 1) each age group separately? 2) both normal and disordered groups?
TYPES OF RELIABILITY, AND EVIDENCE FOR THEM (CONT.)
Internal consistency OR split-half reliability
Split test in two halves and obtain correlation between the two sets: Measured as r
--E.g. split top from bottom
--E.g. split even items from odd items
Assign test items to two halves at random and obtain r. Then do this again, and again, and again… “Average” all the r’s = Cronbach’s coefficient alpha
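A rough sketch of both ideas follows. The odd/even split mirrors the example above. For coefficient alpha, the sketch uses the standard variance-based formula rather than literally averaging many random splits; that substitution is this sketch's choice, made because it is the usual computational shortcut. The item scores (5 test-takers, 4 items) are hypothetical.

```python
import math
from statistics import variance

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def split_half_r(item_scores):
    """Correlate each person's odd-item total with their even-item total."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, ...
    return pearson_r(odd_totals, even_totals)

def cronbach_alpha(item_scores):
    """Coefficient alpha from item variances and total-score variance."""
    k = len(item_scores[0])
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: one row per test-taker, one column per item.
item_scores = [
    [3, 3, 2, 3],
    [2, 2, 2, 1],
    [1, 0, 1, 1],
    [3, 2, 3, 3],
    [0, 1, 0, 0],
]

r_half = split_half_r(item_scores)
alpha = cronbach_alpha(item_scores)
print(f"odd/even split-half r = {r_half:.2f}")
print(f"coefficient alpha     = {alpha:.2f}")
```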
What are the labels on the axes when one uses correlation as evidence for
--inter-rater reliability?
--test/retest reliability?
--split half reliability?
Empirical evidence for reliability, using patterns of correlations…
Think:
Even when a test is very carefully designed and reliable (consistent) in its ability to measure a construct (e.g. narrative comprehension), a client’s responses to test items may not always reflect a true picture of his underlying ability (e.g. his true ability to understand narrative passages).
Error in measurement cannot be avoided, especially when measuring human performance. Even with the most reliable test, what are some of the other factors that affect a client’s performance on a test, on a given day?
Transition slide from topic of reliability to topic of Standard Error of Measurement (SEM)
Observed score = the actual raw score that a test-taker earns
True score = hypothetical “ideal” score that the person would have earned if there were no error in measurement
STANDARD ERROR OF MEASUREMENT (SEM)
If a person took a test 100 times, their scores:
1) would tend to fall near some central score (represented by a measure of central tendency, such as the average), e.g. 42
2) would deviate from the central score (due to error of measurement) in a predictable way, with most of them not too far from the center
The “average deviation” (or “average distance”) from the central score is known as the standard deviation, e.g. 2.
This standard deviation (“average deviation”) due to error of measurement is called the standard error of measurement (SEM), e.g. 2 away from 42 (either above or below)
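Figure: distribution of the scores the person earned. Vertical axis: number of times the person earned the score (few to many). Horizontal axis: Score, marked 40, 42, 44, and ____.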
Can you fill in the values that would be two SEM away from the average?
STANDARD ERROR OF MEASUREMENT (SEM)
Now, test-makers don’t really calculate SEM by giving people a test 100 times! They calculate SEM using:
1) estimates of the test’s reliability (at least one of the three types)
2) the distribution of scores earned by the normative sample
3) the way in which reliability varies at different score levels
So, clinicians don’t calculate SEM. SEM is provided in the test manual to help guide us in our interpretation of a client’s score.
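For intuition only, here is a sketch of the classical textbook formula that combines the first two ingredients: SEM = SD × sqrt(1 − reliability), where SD is the standard deviation of the normative sample's scores. Whether a particular test manual uses exactly this formula, or adjusts it by score level as in item 3, is an assumption; the SD and reliability values below are hypothetical.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical SEM: normative-sample SD scaled by sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1 - reliability)

# Hypothetical values: normative SD of 15 and a reliability coefficient of .91.
sem = standard_error_of_measurement(sd=15, reliability=0.91)
print(f"SEM = {sem:.2f}")
```

The formula shows why higher reliability means a smaller SEM: a perfectly reliable test (reliability of 1) would have no measurement error at all.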
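Figure: distribution of the scores the person earned. Vertical axis: number of times the person earned the score (few to many). Horizontal axis: Score, marked 40, 42, 44, and ____.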
Can you fill in the values that would be two SEM away from the average?
STANDARD ERROR OF MEASUREMENT (SEM)
68% of the scores would be predicted to fall within one SEM of the average
e.g. we could predict that 68/100 would fall between 40 and 44
95% of the scores would be predicted to fall within two SEMs of the average
e.g. we could predict that 95/100 would fall between ____ and ____
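Figure: distribution of the scores the person earned. Vertical axis: number of times the person earned the score (few to many). Horizontal axis: Score, marked 40, 42, 44, and ____.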
SEM AND ITS RELATIONSHIP TO CONFIDENCE INTERVALS (See Hutchinson and H&P readings)
Observed score: The actual raw score that the test taker earns
True score: The score that the person would have earned if there were no measurement error
SEM AND ITS RELATIONSHIP TO CONFIDENCE INTERVALS (See Hutchinson and H&P readings)
+1 SEM to -1 SEM = 68% confidence interval. We can have 68% confidence that the client’s true score would fall somewhere in this range.
+2 SEM to -2 SEM = 95% confidence interval. We can have 95% confidence that the client’s true score would fall somewhere in this range.
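The two intervals above amount to simple arithmetic around the observed score, sketched here. The observed score of 85 and SEM of 3 are hypothetical values chosen for illustration, and `confidence_interval` is a name made up for this sketch.

```python
def confidence_interval(observed_score, sem, n_sem):
    """Return (low, high): the observed score plus or minus n_sem SEMs."""
    margin = n_sem * sem
    return (observed_score - margin, observed_score + margin)

# Hypothetical client: observed score of 85 on a test whose manual reports SEM = 3.
observed, sem = 85, 3
ci_68 = confidence_interval(observed, sem, 1)   # -1 SEM to +1 SEM
ci_95 = confidence_interval(observed, sem, 2)   # -2 SEM to +2 SEM
print("68% confidence interval:", ci_68)
print("95% confidence interval:", ci_95)
```

For this client the 68% interval runs from 82 to 88 and the 95% interval from 79 to 91.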
INTERPRETATION OF CONFIDENCE INTERVAL RELATIVE TO CUT-OFF SCORE
How do we interpret performance when the confidence interval:
a) is completely above the cut-off score?
b) is completely below the cut-off score?
c) straddles the cut-off score?
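Telling the three cases apart is mechanical, as the sketch below shows; what each case means clinically is left to the question above. The cut-off of 80 and the three intervals are hypothetical, and treating an interval that just touches the cut-off as "straddles" is an assumption of this sketch.

```python
def interval_vs_cutoff(low, high, cutoff):
    """Classify a confidence interval (low, high) relative to a cut-off score."""
    if low > cutoff:
        return "above"       # whole interval lies above the cut-off
    if high < cutoff:
        return "below"       # whole interval lies below the cut-off
    return "straddles"       # cut-off falls inside (or on the edge of) the interval

# Hypothetical cut-off score and three hypothetical confidence intervals.
cutoff = 80
case_a = interval_vs_cutoff(82, 88, cutoff)
case_b = interval_vs_cutoff(70, 76, cutoff)
case_c = interval_vs_cutoff(77, 83, cutoff)
print(case_a, case_b, case_c)
```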
LECTURE 06B ENDS HERE