Post on 09-Mar-2020
Psychological testing
Lecture 12
Mikołaj Winiewski, PhD
Test Construction Strategies
• Content validation
• Empirical Criterion
• Factor Analysis
• Mixed approach (all of the above)
Content Validation
• Define all aspects of the construct and create test items
– Derived from theory or based on the purpose of the test
– Content of the item is of the primary importance
• Consulting experts about the constructs – using qualitative methods
• Employ experts as judges to assess each potential item – using quantitative measures
• Perform psychometric analyses of items
Empirical Keying
• Create test items to measure one or more traits
– Derived from theory or based on purpose of the test
– Content of the item is not of the primary importance
• Administer test items to a criterion and control group
• Select items that best distinguish between these two groups
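The selection step above can be sketched in a few lines. This is only an illustration under stated assumptions: the response data are hypothetical, the function name `select_items` is invented here, and ranking items by the absolute difference in endorsement rates is one simple way to operationalize "best distinguish"; the slides do not prescribe a particular statistic.

```python
def select_items(criterion, control, n_keep):
    """Return indices of the n_keep items whose endorsement rates
    differ most between the criterion and the control group."""
    n_items = len(criterion[0])

    def rate(group, j):
        # Proportion of persons in the group who endorsed item j
        return sum(person[j] for person in group) / len(group)

    diffs = [(abs(rate(criterion, j) - rate(control, j)), j)
             for j in range(n_items)]
    # Largest difference first; ties broken by item index
    diffs.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in diffs[:n_keep]]


# Hypothetical data: rows = persons, columns = items, 1 = item endorsed
criterion_group = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
]
control_group = [
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]

selected = select_items(criterion_group, control_group, n_keep=2)
print(selected)
```

Note that the content of the selected items plays no role here: only their ability to separate the two groups matters, which is the point of empirical keying.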
Factor Analysis
• Create test items to measure one or more traits – derived from theory
– Content of the item is of the primary importance
• Administer test items to an appropriate sample derived from the population of interest
– large (depending on the technique and the base number of items)
– possibly representative
• Employ factor analysis (family of correlational techniques used to determine underlying structure of data)
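Because factor analysis is a family of correlational techniques, its usual starting point is the inter-item correlation matrix. The sketch below computes only that matrix (Pearson correlations), not the factor extraction itself. The response data are hypothetical and the helper names `pearson` and `correlation_matrix` are invented for this illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(responses):
    """Inter-item correlation matrix; rows of `responses` are persons."""
    columns = list(zip(*responses))  # one tuple of scores per item
    return [[pearson(ci, cj) for cj in columns] for ci in columns]


# Hypothetical 1-5 ratings: rows = persons, columns = items 0..3
responses = [
    [1, 2, 5, 4],
    [2, 2, 1, 1],
    [3, 3, 4, 5],
    [4, 5, 2, 2],
    [5, 5, 3, 3],
    [3, 3, 3, 3],
]

r = correlation_matrix(responses)
for row in r:
    print(["%.2f" % v for v in row])
```

In this made-up data, items 0 and 1 correlate strongly with each other, as do items 2 and 3, while correlations across the two pairs are weak. A pattern like this is what would lead a factor analysis to suggest two underlying factors.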
Mixed approach
• Employing mixed strategy
For example
• Define all aspects of the construct and create test items
• Employ experts as judges to assess each potential item or use experts to create the item pool
• In parallel:
– Employ factor analysis
– Administer test items to a criterion and control group
Adaptation
• Producing instruments (by adjusting existing ones) that measure target constructs adequately in target cultures
• Using a set of procedures and techniques to create an equivalent tool
• Purpose / application - central issue in adaptations.
Main Applications of Translations/Adaptations
• Comparative Studies (diagnosis & research)
Focus: Comparison of construct or mean scores across cultures
Strategy: Maximizing comparability
• Studies in target culture (diagnosis & research)
Focus: validity in new context
Strategy: Maximizing local suitability
Considerations: cultural equivalence
• Psychological theory / dimensions
• Psychological concepts / terms
• Behavioral indicators
• procedures
Considerations: test equivalence
• Face equivalence (superficial)
• Psychometrical – statistics, validity and reliability
• Psychological - functional
• Translation
• Construction
Adoption / translation
• Not only language!
• Literal/close translation: “What is the name of the queen of England?”
– Problem: the item is more difficult for American children than for English children
• Adaptation: “What is the name of the president of the USA?”
– Problem: the queen and the president are not equally well known in their respective countries
Equivalence
• Words – linguistic
• Meanings - psychological
Linguistic Equivalence
• (Broader than similarity of words)
• Linguistic equivalence refers to similarity of linguistic features of a text.
• Examples of relevant linguistic features are:
– Lexical similarity
– Grammatical accuracy
– In general: emphasis on formal-textual characteristics (cf. automatic translations)
Psychological Equivalence
• Psychological equivalence refers to similarity of (psychological) meaning and scores
• Similarity in a broad sense:
– Textual, e.g.,
• Connotation of words, implied context of text
• Comprehensibility
– Metrical:
• Score comparability
Relationship between Two Perspectives
Three possible relations between linguistic and psychological features, depending on the overlap:
a. Complete overlap: translatable
b. Partial overlap: poorly translatable
c. No overlap: essentially non-translatable
(Figure: diagram of the overlap between psychological and linguistic features)
Cultural adaptation: options / strategies
• Adoption / transcription (close “literal” translation)
– Advantage: maintains metric equivalence
– Disadvantage: adequacy is (too) readily assumed; it should be demonstrated
• Adaptation (translation, travesty, paraphrase)
– Advantage: more flexible, more tailored to the context
– Disadvantage: fewer statistical techniques available to compare scores across cultures
• Assembly (re-assembly) (composing a new instrument)
– Advantage: very flexible
– Disadvantage: almost no comparability maintained
Adoption / transcription
Literal translation of all items
• Focus: extreme translation fidelity
• Assumption: universality of constructs and behaviors
• Pros:
– metric equivalence
– possibility of straightforward comparisons
• Cons:
– language and psychometric problems
Adaptation: translation
Faithful translation of original pool of items with possible changes
• Focus: translation fidelity
• Assumption: universality of constructs and behaviors, but not language
• Pros:
– better psychometric properties
– better construct and ecological validity
• Cons:
– fewer comparison options
– still some language and psychometric problems
Adaptation: travesty
Free translation of the original pool of items – keeping the meaning while adjusting the wording to linguistic and psychological needs
• Focus: psychological meaning
• Assumption: universality of constructs but not language, and possible cultural differences in behaviors
• Pros:
– better cultural adjustment
– less metric equivalence but still pretty good
– better psychometric properties
• Cons:
– few comparison options
– major differences between versions of the tests
Adaptation: paraphrase
• Creating new tool using original items as inspiration rather than base
• Focus: psychological meaning
• Assumption: universality of constructs but not behaviors and language
• Pros:
– good cultural adjustment
– good psychometric properties
– cultural equivalence
• Cons:
– no metric equivalence
– major differences between versions of the tests
Assembly (re-assembly)
Composing new instrument using original theoretical model and development strategy
• Focus: adaptation of tool and theory
• Assumption: no cultural universality of behaviors and language, and possible differences in constructs
• Pros:
– best cultural adjustment
• Cons:
– no metric equivalence
– two different tools
Item Analysis
Purpose of Item Analysis
• Evaluates the quality of each item
• Rationale: the quality of items determines the quality of test (i.e., reliability & validity)
• May suggest ways of improving the measurement of a test
• Can help with understanding why certain tests predict some criteria but not others
Item Analysis
• When analyzing the test items, we have several questions about the performance of each item. Some of these questions include:
Are the items congruent with the test objectives?
Are the items valid? Do they measure what they're supposed to measure?
Are the items reliable? Do they measure consistently?
How long does it take an examinee to complete each item?
What items are most difficult to answer correctly?
What items are easy?
Are there any poorly performing items that need to be discarded?
Types of Item Analyses for CTT
Three major types:
1. Assess quality of the distractors
2. Assess difficulty of the items
3. Assess how well an item differentiates between high and low performers
(Figure: a multiple-choice question with options A to D; one option is the correct answer and the remaining options are distractors.)
Distractor Analysis
First question of item analysis: How many people choose each response?
If there is only one best response, then all other response options are distractors.
Example (N = 35):
Which method has the best internal consistency?
Response               # choosing
a) projective test      1
b) peer ratings         1
c) forced choice       21
d) differences n.s.    12
A perfect test item would have 2 characteristics:
1. Everyone who knows the item gets it right
2. People who do not know the item will have responses equally distributed across the wrong answers.
• It is not desirable to have one of the distractors chosen more often than the correct answer.
• Such a result indicates a potential problem with the question. The distractor may be too similar to the correct answer and/or there may be something in either the stem or the alternatives that is misleading.
Distractor Analysis (cont’d)
Calculate the # of people expected to choose each of the distractors. If choices are random, the expected number is the same for each wrong response (Figure 10-1).
N answering incorrectly = 14
Number of distractors = 3
# of persons expected to choose each distractor = N answering incorrectly / number of distractors = 14 / 3 ≈ 4.7
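The same calculation, as a minimal sketch using the response counts from the N = 35 example. Treating c) as the keyed answer is an assumption: the slides do not mark it, but it is the only option consistent with 14 people answering incorrectly (35 - 21 = 14).

```python
# Response counts from the N = 35 example item
counts = {"a": 1, "b": 1, "c": 21, "d": 12}
correct = "c"  # assumption: inferred from 35 - 21 = 14 incorrect answers

n_incorrect = sum(n for option, n in counts.items() if option != correct)
n_distractors = len(counts) - 1

# Under random guessing, wrong answers spread evenly over the distractors
expected = n_incorrect / n_distractors

print(n_incorrect, n_distractors, round(expected, 1))
```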
Distractor Analysis (cont’d)
When the number of persons choosing a distractor significantly exceeds the number expected, there are 2 possibilities:
1. It is possible that the choice reflects partial knowledge
2. The item is a poorly worded trick question
• An unpopular distractor may lower item and test difficulty because it is easily eliminated
• An extremely popular distractor is likely to lower the reliability and validity of the test
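One way to check whether distractor choices depart from the even spread expected under guessing is a chi-square goodness-of-fit test over the distractor counts. The slides do not name a specific test, so this choice is an assumption. The sketch reuses the counts from the N = 35 example and compares the statistic with 5.991, the critical value for 2 degrees of freedom at the .05 level. With expected counts below 5, as here, the chi-square approximation is rough, so the result is indicative only.

```python
# Observed choices of the three distractors in the N = 35 example
observed = {
    "a) projective test": 1,
    "b) peer ratings": 1,
    "d) differences n.s.": 12,
}

n_incorrect = sum(observed.values())
expected = n_incorrect / len(observed)  # same expected count per distractor

# Chi-square goodness-of-fit statistic against the uniform expectation
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())

CRITICAL_05_DF2 = 5.991  # chi-square critical value, df = 2, alpha = .05
departs_from_random = chi_square > CRITICAL_05_DF2

# Distractors chosen more often than expected deserve a closer look
over_chosen = [name for name, o in observed.items() if o > expected]

print(round(chi_square, 2), departs_from_random, over_chosen)
```

Here only "d) differences n.s." is chosen more often than expected, so it is the distractor to inspect for partial knowledge or misleading wording.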
Item Difficulty
Percentage of test takers who respond correctly
What if p = .00?
What if p = 1.00?
Item Difficulty
• An item with a p value of .0 or 1.0 does not contribute to measuring individual differences and thus is certain to be useless
• When comparing 2 test scores, we are interested in who had the higher score or in the differences in scores
• Items with a p value of .5 have the most variation, so seek items in this range and remove those with extreme values
• The p value can also be examined to determine the proportion answering in a particular way for items that don’t have a “correct” answer
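A short sketch of why p = .5 gives the most variation: for an item scored 0/1, the variance of the item scores is p(1 - p), which is 0 at p = .0 and p = 1.0 and peaks at .25 when p = .5. The response matrix below is hypothetical and the function names are invented for this illustration.

```python
def item_difficulty(responses):
    """p value per item: proportion of test takers answering correctly."""
    n_persons = len(responses)
    return [sum(column) / n_persons for column in zip(*responses)]


def item_variance(p):
    """Variance of a 0/1-scored item with difficulty p."""
    return p * (1 - p)


# Hypothetical data: rows = persons, columns = items, 1 = correct
responses = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]

p_values = item_difficulty(responses)
variances = [item_variance(p) for p in p_values]

print(p_values)   # first item answered correctly by all, last by none
print(variances)  # those two items have zero variance
```

The first and last items cannot distinguish between test takers at all, while the two items with p = .5 split the group as evenly as possible.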
Item Difficulty (cont.)
What is the best p-value?
– optimal p-value = .50
– maximum discrimination between good and poor performers
Should we only choose items of .50?
When shouldn’t we?
Item Difficulty (cont.)
Should we only choose items of .50? Not necessarily ...
• When wanting to screen the very top group of applicants (e.g., admission to university or medical school), cutoffs may be much higher
• Other institutions want a minimum level (e.g., minimum reading level), so cutoffs may be much lower
Item Difficulty (cont’d)
General Rules of Item Difficulty…
p low (< .20)             difficult item
p moderate (.20 - .80)    moderately difficult item
p high (> .80)            easy item
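These rules of thumb translate directly into a small helper. The function name `classify_difficulty` is invented for this sketch, and treating the boundary values .20 and .80 as "moderately difficult" follows the ranges as written above.

```python
def classify_difficulty(p):
    """Label an item by its p value using the general rules above."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    if p < 0.20:
        return "difficult"
    if p > 0.80:
        return "easy"
    return "moderately difficult"


for p in (0.10, 0.20, 0.50, 0.80, 0.95):
    print(p, classify_difficulty(p))
```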