Psychological testingicelab.psych.uw.edu.pl/wp-content/uploads/2016/01/...Empirical Keying •Create...


Psychological testing

Lecture 12

Mikołaj Winiewski, PhD

Test Construction Strategies

• Content validation

• Empirical Criterion

• Factor Analysis

• Mixed approach (all of the above)

Content Validation

• Define all aspects of the construct and create test items

– Derived from theory or based on purpose of the test

– Content of the item is of the primary importance

• Consulting experts about the constructs – using qualitative methods

• Employ experts as judges to assess each potential item – using quantitative measures

• Perform psychometric analyses of items

Empirical Keying

• Create test items to measure one or more traits

– Derived from theory or based on purpose of the test

– Content of the item is not of the primary importance

• Administer test items to a criterion and control group

• Select items that best distinguish between these two groups
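The selection step can be sketched in Python as follows. This is only an illustration, assuming dichotomous (0/1) items and using the difference in endorsement rates between the two groups as the selection statistic; the data, the function names and the 0.3 threshold are hypothetical and do not come from the lecture.

```python
def endorsement_rates(responses):
    """Proportion of respondents endorsing each item (rows of 0/1 answers)."""
    n = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n for i in range(n_items)]


def select_items(criterion, control, min_diff=0.3):
    """Indices of items whose endorsement rates differ by at least min_diff.

    min_diff is an arbitrary illustrative threshold, not a recommended value.
    """
    p_criterion = endorsement_rates(criterion)
    p_control = endorsement_rates(control)
    return [
        i
        for i, (a, b) in enumerate(zip(p_criterion, p_control))
        if abs(a - b) >= min_diff
    ]


# Hypothetical data: 4 respondents per group, 3 items.
criterion_group = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 0, 0]]
control_group = [[0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 0, 0]]

selected = select_items(criterion_group, control_group)
print(selected)  # only item 0 separates the two groups
```

Note that the item content plays no role in the selection: an item is kept purely because the two groups answer it differently.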

Factor Analysis

• Create test items to measure one or more traits – derived from theory

– Content of the item is of the primary importance

• Administer test items to an appropriate sample drawn from the population of interest

– large (depending on the technique and the base number of items)

– possibly representative

• Employ factor analysis (family of correlational techniques used to determine underlying structure of data)
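Because factor analysis is a family of correlational techniques, its usual starting point is the matrix of inter-item correlations. The sketch below computes only that matrix, using the Python standard library; the factor extraction itself is not shown, and the ratings are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long sequences of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_matrix(scores):
    """Item-by-item correlations; scores is a list of respondent rows."""
    items = list(zip(*scores))  # transpose: one tuple of scores per item
    return [[pearson(a, b) for b in items] for a in items]


# Hypothetical 1-5 ratings from 5 respondents on 3 items.
scores = [
    [1, 2, 5],
    [2, 3, 4],
    [3, 3, 3],
    [4, 5, 2],
    [5, 5, 1],
]

corr = correlation_matrix(scores)
for row in corr:
    print([round(r, 2) for r in row])
```

In this made-up data, items 1 and 2 correlate strongly and positively, while item 3 is a perfect mirror image of item 1, which is the kind of pattern a factor analysis would summarize as a single underlying dimension.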

Mixed approach

• Employing mixed strategy

For example

• Defining all aspects of the construct and creating test items

• Employ experts as judges to assess each potential item or use experts to create the item pool

• In parallel:

– Employ factor analysis

– Administer test items to a criterion and control group

Adaptation

• Producing instruments (by adjusting existing ones) that measure target constructs adequately in target cultures

• Using a set of procedures and techniques to create an equivalent tool

• Purpose / application is the central issue in adaptations.

Main Applications of Translations/Adaptations

• Comparative Studies (diagnosis & research)

Focus: Comparison of construct or mean scores across cultures

Strategy: Maximizing comparability

• Studies in target culture (diagnosis & research)

Focus: validity in new context

Strategy: Maximizing local suitability

Considerations

cultural equivalence

• Psychological theory / dimensions

• Psychological concepts / terms

• Behavioral indicators

• procedures

Considerations

test equivalence

• Face equivalence (superficial)

• Psychometrical – statistics, validity and reliability

• Psychological - functional

• Translation

• Construction

Adoption / translation

• Not only language!

• Literal/close translation: “What is the name of the Queen of England?”

– Problem: Item more difficult for American children than for English children

• Adaptation: “What is the name of the president of the USA?”

– Problem: Queen and president are not equally known in their respective countries

equivalence

• Words – linguistic

• Meanings - psychological

Linguistic Equivalence

• (Broader than similarity of words)

• Linguistic equivalence refers to similarity of linguistic features of a text.

• Examples of relevant linguistic features are:

– Lexical similarity

– Grammatical accuracy

– In general: emphasis on formal-textual characteristics (cf. automatic translations)

Psychological Equivalence

• Psychological equivalence refers to similarity of (psychological) meaning and scores

• Similarity in a broad sense:

– Textual, e.g.,

• Connotation of words, implied context of text

• Comprehensibility

– Metrical:

• Score comparability

Relationship between Two Perspectives

Three possible relations between linguistic and psychological features, depending on the overlap:

a. complete overlap: translatable

b. partial overlap: poorly translatable

c. no overlap: essentially non-translatable

[Figure: overlapping circles labeled “psych.” and “linguistic”]

Cultural adaptation Options / strategies

Adoption / transcription (close “literal” translation)

– Advantage: maintains metric equivalence

– Disadvantage: adequacy (too) readily assumed, should be demonstrated

Adaptation

• translation

• travesty

• paraphrase

– Advantage: more flexible, more tailored to the context

– Disadvantage: fewer statistical techniques available to compare scores across cultures

Assembly (re-assembly) (composing a new instrument)

– Advantage: very flexible

– Disadvantage: almost no comparability maintained

Adoption / transcription

Literal translation of all items

• Focus: extreme translation fidelity

• Assumption: universality of constructs and behaviors

• Pros:

– metric equivalence

– possibility of straightforward comparisons

• Cons:

– language and psychometric problems

Adaptation: translation

Faithful translation of original pool of items with possible changes

• Focus: translation fidelity

• Assumption: universality of constructs and behaviors, but not language

• Pros:

– better psychometric properties

– better construct and ecological validity

• Cons:

– Fewer comparison options

– Still some language and psychometric problems

Adaptation: travesty

Free translation of the original pool of items: keeping the meaning while changing the wording to fit linguistic and psychological needs

• Focus: psychological meaning

• Assumption: universality of constructs but not language, and possible cultural differences in behaviors

• Pros:

– better cultural adjustment

– less metric equivalence but still pretty good

– better psychometric properties

• Cons:

– Few comparison options

– Major differences between versions of the tests

Adaptation: paraphrase

• Creating a new tool using original items as inspiration rather than as a base

• Focus: psychological meaning

• Assumption: universality of constructs but not behaviors and language

• Pros:

– good cultural adjustment

– good psychometric properties

– cultural equivalence

• Cons:

– No metric equivalence

– Major differences between versions of the tests

Assembly (re-assembly)

Composing new instrument using original theoretical model and development strategy

• Focus: adaptation of tool and theory

• Assumption: no cultural universality of behaviors and language, and possible differences in constructs

• Pros:

– Best cultural adjustment

• Cons:

– No metric equivalence

– Two different tools

Item Analysis

Purpose of Item Analysis

• Evaluates the quality of each item

• Rationale: the quality of items determines the quality of test (i.e., reliability & validity)

• May suggest ways of improving the measurement of a test

• Can help with understanding why certain tests predict some criteria but not others

Item Analysis

• When analyzing the test items, we have several questions about the performance of each item. Some of these questions include:

Are the items congruent with the test objectives?

Are the items valid? Do they measure what they're supposed to measure?

Are the items reliable? Do they measure consistently?

How long does it take an examinee to complete each item?

What items are most difficult to answer correctly?

What items are easy?

Are there any poor performing items that need to be discarded?

Types of Item Analyses for CTT

Three major types:

1. Assess quality of the distractors

2. Assess difficulty of the items

3. Assess how well an item differentiates between high and low performers
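The third type of analysis can be illustrated with an upper-lower discrimination index, D = p(upper) - p(lower). This is one common choice rather than necessarily the index used in the lecture, which gives no formula in this part. The sketch assumes 0/1 scored items, splits examinees by total score into a top and bottom half (other fractions are also used in practice), and runs on hypothetical data.

```python
def discrimination_index(responses, item, fraction=0.5):
    """D = proportion correct in the upper group minus that in the lower group.

    Groups are formed by total score; fraction is the share of examinees
    placed in each group (0.5 here is an illustrative assumption).
    """
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    upper, lower = ranked[:k], ranked[-k:]
    p_upper = sum(row[item] for row in upper) / k
    p_lower = sum(row[item] for row in lower) / k
    return p_upper - p_lower


# Hypothetical scored answers: 6 examinees, 3 items (1 = correct).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]

d_values = [discrimination_index(responses, i) for i in range(3)]
print([round(d, 2) for d in d_values])
```

Values near 1 mean the item is answered correctly mostly by high scorers; values near 0 mean it does not separate the groups.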

[Figure: a sample multiple-choice question with options A to D; one option is marked as the correct answer and the others as distractors]

Distractor Analysis

First question of item analysis: How many people choose each response?

If there is only one best response, then all other response options are distractors.

Example (N = 35):

Which method has the best internal consistency?

Response              # choosing
a) projective test     1
b) peer ratings        1
c) forced choice      21
d) differences n.s.   12
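Tallying how many people chose each response is a simple count. A minimal Python sketch, rebuilding the raw answers from the counts in the example above (the variable names are illustrative):

```python
from collections import Counter

# Raw answers reconstructed from the example counts (N = 35).
answers = ["a"] * 1 + ["b"] * 1 + ["c"] * 21 + ["d"] * 12

counts = Counter(answers)
n = len(answers)

for option in sorted(counts):
    share = counts[option] / n
    print(f"{option}) {counts[option]:2d}  ({share:.0%})")
```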

A perfect test item would have 2 characteristics:

1. Everyone who knows the item gets it right

2. People who do not know the item will have responses equally distributed across the wrong answers.

• It is not desirable to have one of the distractors chosen more often than the correct answer.

• This result indicates a potential problem with the question. This distractor may be too similar to the correct answer and/or there may be something in either the stem or the alternatives that is misleading.


Distractor Analysis (cont’d)

Calculate the number of people expected to choose each of the distractors. If responding is random, the same number is expected for each wrong response (Figure 10-1).

N answering incorrectly: 14

Number of distractors: 3

Number of persons expected to choose each distractor = N answering incorrectly / number of distractors = 14 / 3 ≈ 4.7
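The same arithmetic as a short Python sketch. It assumes option c (chosen by 21 people) is the keyed answer, which is inferred from the 14 incorrect responses out of N = 35 rather than stated on the slide.

```python
n_total = 35        # all examinees in the example
n_correct = 21      # assumed: option c is the correct answer
n_distractors = 3   # options a, b and d

n_incorrect = n_total - n_correct
expected_per_distractor = n_incorrect / n_distractors

print(n_incorrect)                        # 14
print(round(expected_per_distractor, 1))  # 4.7
```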

Distractor Analysis (cont’d)

When the number of persons choosing a distractor significantly exceeds the number expected, there are 2 possibilities:

1. It is possible that the choice reflects partial knowledge

2. The item is a poorly worded trick question

• An unpopular distractor may lower item and test difficulty because it is easily eliminated

• An extremely popular distractor is likely to lower the reliability and validity of the test
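The slides do not say how "significantly exceeds" should be tested. One possible check, sketched below as an assumption, is a chi-square goodness-of-fit test of the distractor counts against an even split. The counts come from the earlier example (a = 1, b = 1, d = 12), and 5.991 is the standard critical value for 2 degrees of freedom at alpha = .05. With expected counts below 5, as here, the chi-square approximation is rough, so the result is only indicative.

```python
# Observed choices of the three distractors in the example (option c is
# assumed to be the keyed answer and is therefore excluded).
observed = {"a": 1, "b": 1, "d": 12}

# Under random guessing, each distractor is expected equally often.
expected = sum(observed.values()) / len(observed)

chi_square = sum((o - expected) ** 2 / expected for o in observed.values())

CRITICAL_DF2_ALPHA_05 = 5.991  # chi-square critical value, df = 2, alpha = .05

over_chosen = [opt for opt, o in observed.items() if o > expected]

print(round(chi_square, 2))                # 17.29
print(chi_square > CRITICAL_DF2_ALPHA_05)  # True
print(over_chosen)                         # ['d']
```

Here distractor d is chosen far more often than the roughly 4.7 expected, so it is the option to inspect for partial knowledge or misleading wording.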

Item Difficulty

Item difficulty (p): the proportion of test takers who respond correctly

What if p = .00?

What if p = 1.00?

Item Difficulty

An item with a p value of .0 or 1.0 does not contribute to measuring individual differences and is therefore useless for that purpose.

When comparing two test scores, we are interested in who had the higher score or in the difference between the scores.

Items with a p value of .5 have the most variation, so seek items in this range and remove those with extreme values.

The p value can also be examined to determine the proportion answering in a particular way for items that don’t have a “correct” answer.
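Why p = .5 gives the most variation can be seen from the variance of a dichotomous (0/1) item, which is p(1 - p). The sketch below computes p from scored responses and prints the variance for a few values; the function names are illustrative.

```python
def item_difficulty(item_scores):
    """Proportion of test takers answering correctly (scores are 0 or 1)."""
    return sum(item_scores) / len(item_scores)


def item_variance(p):
    """Variance of a 0/1 item with difficulty p."""
    return p * (1 - p)


# Difficulty of the example item: 21 correct out of 35.
p_example = item_difficulty([1] * 21 + [0] * 14)
print(round(p_example, 2))  # 0.6

for p in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(p, round(item_variance(p), 2))
```

The variance is zero at p = .0 and p = 1.0, where everyone gets the same score, and peaks at .25 when p = .5.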

Item Difficulty (cont.)

What is the best p-value?

– optimal p-value = .50

– maximum discrimination between good and poor performers

Should we only choose items of .50?

When shouldn’t we?

Item Difficulty (cont.)

Should we only choose items of .50? Not necessarily ...

• When wanting to screen the very top group of applicants (e.g., admission to university or medical school), cutoffs may be much higher

• Other institutions want a minimum level (e.g., a minimum reading level); cutoffs may be much lower

Item Difficulty (cont’d)

General Rules of Item Difficulty…

p low (< .20): difficult test item

p moderate (.20 - .80): moderately difficult item

p high (> .80): easy item
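These rules of thumb translate directly into a small classifier. A minimal Python sketch, using the cutoffs from the slide (the function name and labels are illustrative):

```python
def classify_difficulty(p):
    """Label an item by its p value using the .20 / .80 rules of thumb."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if p < 0.20:
        return "difficult"
    if p > 0.80:
        return "easy"
    return "moderately difficult"


for p in (0.10, 0.20, 0.60, 0.80, 0.95):
    print(p, classify_difficulty(p))
```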
