Variations in Mean Response Times for Questions on the
Computer-Adaptive GRE® General Test: Implications for Fair Assessment
Brent Bridgeman and Frederick Cline
GRE No. 96-20P
June 2000
This report presents the findings of a research project funded by and carried
out under the auspices of the Graduate Record Examinations Board
Educational Testing Service, Princeton, NJ 08541
Researchers are encouraged to express freely their professional judgment. Therefore, points of view or opinions stated in Graduate
Record Examinations Board Reports do not necessarily represent official Graduate Record Examinations Board position or policy.
********************
The Graduate Record Examinations Board and Educational Testing Service are dedicated to the principle of equal opportunity, and their programs,
services, and employment policies are guided by that principle.
EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, GRADUATE RECORD EXAMINATIONS, and GRE are registered trademarks of Educational Testing Service.
Copyright © 2000 by Educational Testing Service. All rights reserved.
Abstract
In a computer-adaptive test (CAT), different examinees receive different sets of questions.
Questions at the same overall difficulty level and meeting roughly the same content specifications could
vary substantially in the amount of time needed to answer them. If the CAT is speeded (that is, if
substantial numbers of students either do not finish or must guess randomly at the end to finish),
individuals who happened to get a disproportionate number of questions that took a long time to answer
could be disadvantaged. The purpose of this study was to determine whether--in computer-adaptive
testing situations--the administration of a disproportionate number of questions with long expected
response times unfairly disadvantages examinees. Data from 5,957 examinees who took the computer-
delivered Graduate Record Examination (GRE®) quantitative measure and 14,745 examinees who took
the computer-delivered GRE analytical measure were used to investigate variation in response time in
light of other factors, such as mean test score and the position of the question on the test. For both
measures, substantial variation in response times was found, even for items with the same specifications
and same difficulty level. But despite these differences, there was no indication that the scores of
students who were administered items with long expected response times were disadvantaged.
Key Words:
test fairness
validity
speededness
computer-adaptive tests
Contents
Introduction
Study 1
    Method
    Results
Study 2
    Method
    Results
Conclusion
References
List of Tables
Table 1. Number of GRE Quantitative Items in Response-Time Categories by Item Type
Table 2. Number of Easy GRE Quantitative Items in Response-Time Categories by Item Type
Table 3. Number of Medium-Difficulty GRE Quantitative Items in Response-Time Categories by Item Type
Table 4. Number of Difficult GRE Quantitative Items in Response-Time Categories by Item Type
Table 5. Comparison of Two Category-10 Items
Table 6. Comparison of Two Category-2 Items
Table 7. Test-Developer Ratings of Response Times for Seven Category-10 Items
Table 8. Means and Standard Deviations for Examinees Taking Item Sets in Group A, Position 5-9 and Position 25-29
Table 9. Response Times for Group A and Group E Sets in Position 5-9 and of Comparable Difficulty
Table 10. Number of Logical Reasoning Items in Response-Time Categories by Position and Level of Difficulty
List of Figures
Figure 1. Scatterplot of item position by mean response time
Figure 2. Scatterplot of item difficulty by mean response time
Introduction
Tests used for admissions to programs of professional or graduate education, such as the
Graduate Record Examination (GRE®) General Test and Subject Tests, are generally designed to be
“power” tests rather than “speed” tests. That is, they are intended to be tests of ability rather than how
quickly a student can answer. According to the GRE Technical Manual (Briel, O’Neill, & Scheuneman,
1993), the purpose of the GRE General Test is to assess “reasoning skills considered fundamental in
graduate study: verbal reasoning, quantitative reasoning, and analytical reasoning” (p. 7), and the “GRE
General and Subject Tests are not intended to be speeded” (p. 32).
Nevertheless, the tests have strict time limits. Our preliminary analyses showed that 20% of
examinees fail to finish the quantitative section of the General Test, and over 35% fail to finish the
analytical section. Even examinees who answer every question may still be negatively impacted by the
time limit; they may engage in rapid-guessing behavior near the end of the test just so they can answer
every question. Such rapid-guessing behavior has been observed on the Test of English as a Foreign
Language (Bejar, 1985; Yamamoto, 1995) as well as on a computer-administered, but nonadaptive,
version of the GRE quantitative test (Schnipke & Scrams, 1997).
Although the addition of a speed component to a power test can be problematic for any type of
test, additional unique considerations arise with a computer-adaptive test (CAT). The unidimensional
models used with CATs implicitly assume that only knowledge of the correct answer, and not speed, is
needed for a high score (Hambleton & Swaminathan, 1985). Although this is also true of paper-based
power tests, the additional complication with a CAT is that each examinee is getting a different set of
questions. For example, on the paper-based version of the GRE, post-administration equating procedures
could provide a correction if questions on one form could be answered more quickly than questions on a
different form. Although the item response theory (IRT) model used to score the computer-adaptive
version of the General Test takes difficulty differences into consideration, it does not adjust for possible
differences in the speed demands of different questions or sets of questions.
Theoretically, GRE test questions at the same overall difficulty level or meeting roughly the
same content specifications could vary substantially in the amount of time needed to answer them.
Preliminary results from the experimental, computer-adaptive Graduate Management Admissions Test
(GMAT®) suggest that some quantitative question types take twice as long, on average, to answer as
other quantitative question types (Bridgeman, Anderson, & Wightman, 1997). In this case, average
response times across question types varied from a low of 65 seconds for geometry data-sufficiency
questions to a high of 135 seconds for algebra problem-solving questions, with standard deviations
within question types ranging from 35 to 89 seconds.
Variations in mean solution time across question types provide useful information about the
cognitive processing demands of different question types, but by themselves they do not raise fairness
concerns, as long as the test specifications standardize the number of questions of each type that an
individual will receive. On the other hand, substantial variation within question type could signify an
equity problem. Because the verbal, quantitative, and analytical sections of the GRE General Test are
speeded--that is, substantial numbers of students either do not finish or must guess randomly at the end
to finish--individuals who happen to get a disproportionate number of questions that take a long time to
answer could be disadvantaged. Although the GMAT analysis suggests that there might indeed be
reasons for concern, that study did not explore the data in enough depth to pinpoint the nature and extent
of this potential problem. In particular, the analysis did not control for question difficulty within question
type, so in this case, it is possible that within-question-type variability could be largely explained by
difficulty differences.
In the current study, we investigated mean response times on GRE General Test questions while
controlling for question difficulty. Two separate analyses are reported here. In Study 1, we examine CAT
items from the GRE quantitative measure. Because 24 out of the 28 questions in the GRE quantitative
test are discrete questions, rather than being part of question sets, analyses for this measure were less
complex and provided a useful starting point. In Study 2, we look at CAT items from the GRE analytical
measure, which makes extensive use of question sets (that is, a series of questions based on a single
problem presentation).
Study 1
Method
The data for the GRE quantitative questions came from a 1997 item pool in which 5,957
examinees took CAT versions of the General Test. This particular item pool was administered after a
procedure called proportional adjustment for incomplete tests, which imposes a penalty for leaving
questions unanswered, was instituted. Therefore, candidates were motivated to complete the test.
Examinees had 45 minutes to try to answer 28 questions, or about 96 seconds per question. We extracted
mean response times for each item in each position in the test in which it was administered. For example,
question 12 might be administered as the third item in one person’s test and as the 24th item in another
person’s test, so we computed separate means for each position.¹
If time ran out for an examinee before the last question attempted was answered, we excluded
that examinee’s time from the computation of the mean time for that question. We also excluded all items
that were part of sets (each examinee who finished the test responded to two sets [two questions per set]
and 24 discrete questions). The item pool consisted of 252 discrete questions. In addition to mean times
for all examinees who were administered an item, we computed mean times for those who chose the
correct answer to the question and for those who chose an incorrect answer. We also computed mean
GRE quantitative scores for examinees who got the question right and for those who got the question
wrong.
For each item, the database contained the three IRT parameters: discrimination (a), difficulty (b),
and pseudo-guessing (c). The b parameter expresses difficulty as a standard score with a mean of 0 and
standard deviation of 1. We classified items into five difficulty categories as follows:
1. very easy (b < -1.5)
2. easy (-1.5 < b < -.5)
3. medium (-.5 < b < .5)
4. difficult (.5 < b < 1.5)
5. very difficult (b > 1.5)
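These cut points amount to a simple lookup. As a sketch (the function name and return codes are ours; the report supplies only the five bands):

```python
def difficulty_category(b):
    """Map an IRT difficulty parameter b (standard-score scale,
    mean 0, SD 1) to the report's five difficulty bands."""
    if b < -1.5:
        return 1  # very easy
    elif b < -0.5:
        return 2  # easy
    elif b < 0.5:
        return 3  # medium
    elif b < 1.5:
        return 4  # difficult
    else:
        return 5  # very difficult
```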
Items had been classified by test developers into 16 content categories, which the automated,
item-selection algorithm uses to make certain that CATs for all individuals are comparable on these
dimensions. The first category level classifies items as either “problem solving” (PS) questions or
“quantitative comparison” (QC) questions. The PS items are standard multiple-choice questions, each (
which offers five answer choices. The QC items, which offer four answer choices, ask the examinee to
determine which of two quantities is larger, whether they are equal, or whether there is not enough
information to determine the answer. (Each examinee’s test contained 10 PS questions and 14 QC
questions, plus the four items from two sets that were excluded from the current analyses.)
¹ Interpreting means for skewed distributions can be problematic, and response time distributions tend to be positively skewed. However, the skewness for these items was generally not too severe because examinees knew that spending too much time on any question would impede their ability to finish the test. Thus, means were adequate for our purposes, and we have also supplied medians for certain analyses.
All questions are also categorized as either “pure” or “real.” Pure questions deal only with
numbers and symbols, while real questions refer to a name or object from the real world and are
frequently word problems. The test specifications indicate that each examinee’s test should contain 18
pure items and six real ones.
All items are further coded into four subject-matter categories: arithmetic (eight or nine items per
examinee), algebra (six or seven items), geometry (five or six items), and data interpretation (two to five
items). We assigned numerical codes to these 16 categories as follows:
1. QC, pure, arithmetic
2. QC, pure, algebra
3. QC, pure, geometry
4. QC, pure, data interpretation
5. QC, real, arithmetic
6. QC, real, algebra
7. QC, real, geometry
8. QC, real, data interpretation
9. PS, pure, arithmetic
10. PS, pure, algebra
11. PS, pure, geometry
12. PS, pure, data interpretation
13. PS, real, arithmetic
14. PS, real, algebra
15. PS, real, geometry
16. PS, real, data interpretation.
Beyond these 16 categories, items are further classified into one of 79 categories that provide a
more detailed description of the item content--such as “negative exponents,” “linear inequality,” and
“ratio and proportion.” These categories are not used to select items for inclusion on individual CATs;
also, there are more of these item-content categories than there are items on any one examinee’s test.
However, a few of these categories are used to provide the upper limit of the number of questions of a
given content type that should be included on each test; for example, no individual’s test should contain
more than two “ratio and proportion” questions, though it need not contain any items in this category at
all.
Results
Position effects. We first looked at the relationship between position in the test and mean
latencies. If each of the 252 items appeared in each of the 28 possible positions, there would be 7,056
means representing all of the possible item-by-position possibilities. In fact, however, the operation of
the item selection algorithm is such that a given item actually appears in only a limited number of the 28
positions. In this data set, 1,760 means represented all of the data points. For this and other analyses
described here, we required that a mean be based on the performance of at least 20 students; this brought
the total number of means to 1,131.
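The item-by-position aggregation with the 20-student minimum can be sketched as follows. The (item, position, seconds) record format is an assumption for illustration; the report does not describe its data layout:

```python
from collections import defaultdict

def item_position_means(records, min_n=20):
    """Compute the mean response time for each (item, position) cell,
    keeping only cells observed for at least min_n examinees.
    `records` is an iterable of (item_id, position, seconds) tuples."""
    cells = defaultdict(list)
    for item_id, position, seconds in records:
        cells[(item_id, position)].append(seconds)
    return {key: sum(times) / len(times)
            for key, times in cells.items()
            if len(times) >= min_n}
```

Applied to the full pool, this kind of filter reduced the 1,760 observed item-by-position cells to the 1,131 means analyzed.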
With this constraint, the correlation between mean time to complete the item and item position
was not significant (r = -.04). This lack of relationship between position and completion time can be seen
in Figure 1. Similar nonsignificant differences were found for analyses run separately for the PS and QC
item types. Furthermore, mean time to correct solution was almost the same for the early PS items
(positions 1-15) as it was for PS items in the last two positions (101 seconds vs. 99 seconds, with
standard deviations of 74 and 52, respectively). These results do not imply that individual examinees are
necessarily all working at a uniform rate, but only that examinees who are responding rapidly at the end
are to some extent balanced by examinees who are responding more slowly at the end. Indeed, Scrams
and Schnipke (1999) suggested that about 23% of examinees speeded up as they proceeded through a
linear version of the GRE quantitative test, and about 22% slowed down, with the remainder keeping a
relatively even pace throughout.
Difficulty. As suggested in Figure 2, a correlation of .44 (or .56 including a b² term to reflect the
curvilinear increase) was observed between mean time and difficulty (b). Despite this generally positive
relationship between time to answer and difficulty, four of the easiest items (b < -1.5) took over 100
seconds, on average, to answer, and four of the hardest items were answered in less than 80 seconds.
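The two correlations just described can be reproduced in principle by correlating b with mean time directly, and then fitting a quadratic in b and correlating fitted with observed times. A sketch with illustrative data only, not the study's actual values:

```python
import numpy as np

def linear_and_quadratic_r(b, mean_time):
    """Pearson r between item difficulty (b) and mean response time,
    plus the multiple R when a b**2 term is added to capture the
    curvilinear increase in time with difficulty."""
    b = np.asarray(b, dtype=float)
    t = np.asarray(mean_time, dtype=float)
    r_linear = np.corrcoef(b, t)[0, 1]
    # Regress t on [1, b, b^2]; the multiple R is the correlation
    # between fitted and observed values.
    X = np.column_stack([np.ones_like(b), b, b ** 2])
    fitted = X @ np.linalg.lstsq(X, t, rcond=None)[0]
    r_multiple = np.corrcoef(fitted, t)[0, 1]
    return r_linear, r_multiple
```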
Item type. Table 1 shows the relationship of item type to mean response time for 244 test items.
(Because eight of the 252 questions in the item pool were seen by fewer than 20 examinees, they were
excluded from the analyses). Item types 1-8 (QC items), which are designed to be answered quickly,
generally did appear to take less time than the problem solving items. However, there was still substantial
variability within the QC and PS categories, with 11 ostensibly quick QC items requiring an average of
more than 100 seconds and 12 PS items taking less than a minute.
Even within a specific question type there was substantial variation. For example, three items
classified as category 2 (QC, pure, algebra) averaged response times of over 100 seconds, while another
three items in the same category averaged less than 40 seconds. Variation was especially wide for
questions in categories 9 (PS, pure, arithmetic), 10 (PS, pure, algebra), and 11 (PS, pure, geometry). Of
the 26 category 9 items, three required an average of over two minutes for examinees to answer, while
another five took less than one minute. Similarly, examinees answered two category 10 questions in an
average of less than a minute, while they needed an average of more than three minutes for another
question in this category. And average response times for items in category 11 also ranged from under
one minute to over three minutes.
The time differences within question types displayed in Table 1 might be explainable by
considering differences in item difficulty. But Tables 2, 3, and 4--which show the relationship between
item type and mean response time for easy (-1.5 < b < -0.5), medium-difficulty (-0.5 < b < 0.5), and
difficult (0.5 < b < 1.5) GRE quantitative items, respectively--indicate that even within a relatively
narrow difficulty band, mean times for individual question types still varied over a broad range.
However, the trend for more difficult items generally taking longer also was apparent. Only one
out of the 59 easy items required more than two minutes, on average, to answer, but 14 out of the 55
difficult items required more than two minutes. Within question category 10 (PS, pure, algebra), four out
of six difficult questions averaged more than two minutes, while all of the easy questions in this category
averaged less than 100 seconds each to be solved.
Nevertheless, it is the within-category/within-difficulty-level variation that is most disturbing
from a fairness perspective. For example, if the CAT item-selection algorithm called for a difficult,
category 10 question, one examinee might get a question that, on average, students answer in less than
100 seconds, while another student could get a question that, on average, required more than 180 seconds
to answer. Such time differences would be of no concern only if ample time were allowed to finish the
test.
Table 5 presents a comparison of the characteristics of one such pair of items. The items shown
in Table 5 are from the same category, they are of nearly identical difficulty, and both were
administered to relatively large samples of examinees. However, the items differ markedly in their mean
solution times. Table 6 shows the same phenomenon for a different pair of items.
Questions that are of equal difficulty can have very different solution times, because the number
of steps needed to solve a problem is not necessarily closely linked to difficulty. For example, Item C on
Table 6 is a linear inequality that requires some processing time just to understand what is being asked.
Item D, on the other hand, is a negative exponents question, the difficulty of which apparently stems
from some examinees not knowing how to solve this type of problem; examinees who understand
negative exponents can solve it quite quickly.
The potential fairness implications of these results can be clearly seen by imagining the
difference in the testing experience of two hypothetical examinees based on a lucky or unlucky break in
item selection. Suppose the two examinees, call them Mary and Jim, took identical tests, except that the
questions administered to each in positions 15 and 18 were different. If Mary got the B and C pair while
Jim got the A and D pair (and both students got the correct answer in the average amount of time), Jim
would have almost three more minutes to complete the test than Mary.
A possible solution to this problem would be to include a category for solution time in the item-
selection algorithm. This would ensure that no individual would get a disproportionate number of
questions that require long or short response-times. This, of course, leads to the question of the
availability of solution times. For items that are pretested and calibrated during a CAT administration,
solution times are available. However, a significant proportion of GRE test items that are used to create
new item pools are calibrated in paper-and-pencil administrations, and so no solution times are available.
For these items, expert ratings of estimated solution times could be obtained.
To evaluate the likely success of this procedure, we asked three people with considerable
experience in developing items for the GRE quantitative measure to rank order a set of seven items from
shortest time to correct solution to longest time to correct solution. The seven items were all of the same
category (category 10: PS, pure, algebra) and difficulty level (medium: -0.5 < b < 0.5), and all had
been answered correctly by at least 450 examinees. Mean and median times, along with the rater
rankings, are shown in Table 7.
As Table 7 shows, median times are about 10 seconds shorter than mean times because of the
previously noted positive skew of the time distributions, but both mean and median times tell the same
story as to which items take the longest to solve. The rankings by the test development experts were
reasonably close to the rank order of actual solution times. The item with the shortest actual solution time
was ranked in the top three shortest time categories by all three test developers, and the item with the
longest actual solution time was rated as the longest item by one rater and as the third longest item by the
other two raters. However, there were also some misclassifications. For example, all three raters placed
the actual fourth-place question in sixth place. Two out of the three raters thought that this item (which
actually took 90 seconds, on average, to answer) would take longer to answer than the item that averaged
123 seconds. Thus, although useful and certainly better than no solution time estimates at all, ratings by
test developers would not substitute for actual solution times.
Although examinees who happened to get an item that took much longer than average to answer
would seem to be at a disadvantage, we saw no evidence of this here in terms of total GRE quantitative
scores. Looking back at Tables 5 and 6, note that in both comparisons, mean quantitative scores were
slightly higher for the students who answered the longer item in each pair--exactly the opposite of the
lower scores that would be expected if this were to hurt their chances to fully consider later items.
We correlated the mean time needed to respond to each item with the mean score of the
examinees who took that item, separately for the two main item types, QC and PS. Other things being
equal, if having to take an item with a longer response time lowered scores, this correlation might be
expected to be negative. However, both correlations were positive (.37 for the 134 QC items and .55 for
the 110 PS items). And of course, other things are not equal in a CAT. More difficult items, which tend
to take longer, are administered disproportionately to higher ability students; hence the positive
correlation. We attempted to compensate for this with a regression approach, predicting the mean score
of the examinees taking the item from item difficulty (IRT parameter b), and then determining whether
adding mean time made any incremental contribution to the prediction. Difficulty was indeed
substantially correlated with mean score (.82 for QC and .88 for PS), but mean time did not make a
significant incremental contribution to the prediction of mean scores (multiple R increased by less than
.01).
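The incremental-prediction check can be sketched the same way: compute the multiple R from difficulty alone, then from difficulty plus mean time, and compare the two. This is a generic hierarchical-regression sketch with made-up data, not the authors' actual code:

```python
import numpy as np

def incremental_r(b, mean_time, mean_score):
    """Multiple R predicting mean examinee score from item difficulty (b)
    alone, and the increment in R from adding mean response time as a
    second predictor (the report found the increment was under .01)."""
    def multiple_r(predictors, y):
        # Least-squares fit of y on an intercept plus the predictors;
        # multiple R is the correlation of fitted with observed values.
        X = np.column_stack([np.ones(len(y))] + list(predictors))
        fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]
        return np.corrcoef(fitted, y)[0, 1]
    y = np.asarray(mean_score, dtype=float)
    b = np.asarray(b, dtype=float)
    t = np.asarray(mean_time, dtype=float)
    r_b = multiple_r([b], y)
    r_bt = multiple_r([b, t], y)
    return r_b, r_bt - r_b
```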
Gender differences. Tables 5 and 6 illustrate the potential for fairness problems from an
individual perspective. Because the item selection algorithm would not systematically assign time-
consuming questions to a particular gender group, there is less concern that these time differences would
have an impact on fairness from a group perspective. Nevertheless, we attempted to determine whether
particular items could be answered more quickly by one gender group than the other.
Over the 252 items in the pool, we correlated the mean time to a correct answer for men with the
mean time to a correct answer for women. The correlation was .92, suggesting that in general, the items
that were most time consuming for men were also the items that were most time consuming for women.
For the 160 items that were answered correctly by at least 100 men and 100 women, we computed the
differences in mean times for the two gender groups. Only 8 items demonstrated differences of more than
20 seconds; men were faster for seven of these eight items. This result is not surprising, given the higher
mean scores of men in this sample (for men, mean = 596 [SD = 129]; for women, mean = 524 [SD =
123]).
A closer look at the one item that women answered more quickly than men illustrates the
difficulty in interpreting time differences on a multiple-choice examination. The item that women, on
average, answered more quickly was a very difficult (b = 1.52), category 3 (QC, pure, geometry)
problem. The mean time to a correct solution was 174 seconds for men (SD = 94; median = 148) and 141
seconds for women (SD = 92; median = 112). But only 110 of the 592 women (19%) who were
administered this item answered it correctly, which is below the random guess rate of 25% for an item
with four answer choices. (For men the comparable figures were 105 out of 332, or 32%.) The mean
GRE quantitative score of men who answered this item correctly was 706, while the mean score of men
who answered it incorrectly was 606; for women, the comparable scores were 595 and 551, respectively.
Thus, the women who got this item right were at about the same overall level as men who got it wrong.
One possible interpretation is that women were faster, on average, in this case because they gave up
sooner and made a random guess. Relatively short times to a correct solution on a multiple-choice test
may reflect either a high degree of mastery--or nonmastery with lucky guessing.
Study 2
Method
Data for the GRE analytical questions--the subject of Study 2--were obtained from a CAT item
pool that was administered in 1998. For the analytical section, examinees had 60 minutes to answer 35
questions. The analytical section has questions of two types. One type, logical reasoning (LR), consists
of discrete questions which test “the ability to understand, analyze, and evaluate arguments . . .
[including] recognizing the point of an argument, recognizing assumptions on which an argument is
based, drawing conclusions and formulating hypotheses, identifying methods of argument, evaluating
arguments and counterarguments, and evaluating evidence” (Briel, O’Neill, & Scheuneman, 1993, p. 10).
“Each item or group of logical reasoning items is based on a short argument or on a simple graph or
table, generally an excerpt from the kind of material graduate students are likely to encounter in their
academic and personal reading” (pp. 10-11).
The second major category of items in the GRE analytical measure, analytical reasoning (AR),
consists of item sets which test:
. . . the ability to understand a given structure of arbitrary relationships among fictitious persons, places, things or events, and to deduce information from the relationships given. Each analytical reasoning group consists of (1) a set of approximately three to seven related statements or conditions (and sometimes other explanatory material) describing the structure of relationships, and (2) three or more items that test understanding of that structure and its implications by requiring examinees to infer what is necessarily true or to determine what is or is not possible. (Briel, O’Neill, & Scheuneman, 1993, p. 10)
A given examinee is administered nine discrete LR items, plus 26 AR items arranged in six sets
(four four-item sets and two five-item sets). Each AR grouping consists of a problem stimulus with four
to eight associated questions, but any individual examinee would see only four or five of these questions,
and each would also likely see a different combination of questions. For example, one examinee might
see only questions 1, 3, 5, 7, and 8 from a given group, while another examinee would see only questions
2, 3, 4, 6, and 7. Two examinees may see the same items based on the same stimulus, but they could
receive them in a different order. In order to simplify the analyses, we decided to study in detail only the
10 groups that were administered to the largest number of examinees. Five of these groups were
associated with four-item AR sets, and the remaining five were associated with five-item AR sets.
Results
Position effects. We first evaluated the two five-item AR sets that were administered to each
examinee. Typically, an examinee would receive one five-item AR set in position 5-9 and the other five-
item set in position 25-29. Consider one problem statement (call it Stimulus A) with eight attached
questions. Stimulus A might be used in position 5-9 for some exarninees and position 25-29 for others. In
either position, a given examinee would be administered only 5 of the 8 possible questions, so
theoretically, 56 possible combinations of items could be generated from this one problem statement.
However, in practice, a relatively small number of combinations accounted for all of the patterns
actually administered. In position 5-9, 1,900 examinees received questions from Stimulus A. Four
combinations of items (call them sets A1-A4) accounted for all but 365 of these examinees. In position
25-29, only 832 examinees were administered questions from Stimulus A, but three of the same sets (A1,
A2, and A4) accounted for all but 99 examinees. Set A3 was administered to 113 examinees in position
5-9 but to only 4 examinees in position 25-29.
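The figure of 56 possible combinations is simply the number of ways to choose 5 questions from the 8 attached to the stimulus, which can be verified directly (Python, for illustration only; not part of the original analysis):

```python
from math import comb

# Choosing which 5 of the 8 attached questions an examinee sees:
# C(8, 5) = 8! / (5! * 3!) = 56 possible item combinations.
print(comb(8, 5))  # 56
```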
Table 8 shows the mean test scores and solution times for sets A1-A4 in both positions. We
included GRE quantitative score in the table because it is correlated with GRE analytical score (r = .68),
but would itself be uninfluenced by performance on any of these sets. In all four sets, the average amount
of time spent on the set was much shorter in the 25-29 position than in the 5-9 position. Consider set A4,
which was seen by relatively large samples of examinees in both positions. The samples of students who
were administered this set in the 5-9 and 25-29 positions were very comparable in terms of their mean
GRE analytical and quantitative scores, yet the sample of examinees who took this set in the latter
position spent six minutes less to answer it than the sample who took it in the earlier position.
This result could reflect a learning effect that would allow later sets to be answered more
quickly, or it could reflect hurrying because time was running short near the end of the test. A learning
effect would suggest that the number of items answered correctly in the set should be higher for those
who were administered the set in the latter position, while a hurrying effect would suggest that fewer
items should be answered correctly. As indicated in the last column of Table 8, for A4--and indeed for
every set--the number correct was lower when the set was administered in the later position, suggesting a
substantial hurrying effect. A similar pattern was noted with all of the other five-item sets studied. The
lack of position effects for the quantitative items and strong position effects for the AR items could result
from one or more differences between the two tests. Because AR is more speeded, greater position
effects would be expected. In addition, the task requirements of AR items may produce greater time
flexibility. AR sets require checking proposed solutions against a complex set of task requirements. If
time were running short, some of these checks could be skipped. However, in a quantitative problem, it
would be much more difficult to omit steps and still hope to get a correct answer; hence solution time
would be relatively constant whether hurried or not.
Time differences for paired sets. We paired five-item AR sets that were administered in the same
position (5-9), were of comparable difficulty (similar b values), and were administered to examinee
groups of comparable ability (as measured by GRE quantitative score). We then compared the mean time
examinees took to complete each pair of sets. Three sets from Group A were paired with three sets from
Group E. Statistics for the three A-E pairs are presented in Table 9. The table shows the IRT b parameter
for each question in each set listed in order from lowest to highest. This is not necessarily the order in
which the items were administered; there are many different administration orders possible for each set.
For each pair, response times were shortest for the set from Group E.
The most closely matched pair was A4-E7, with mean b values within .05 points and mean test
scores within 5 points (on the 200-800 GRE scale). Yet, on average, it took 78 seconds longer for
examinees to respond to set A4 than to set E7. An extra 78 seconds to spend on the rest of the test could
provide a significant advantage on an examination as speeded as the GRE analytical test. However, such
an advantage was not evident in the current data, as mean analytical scores were not systematically
higher for examinees who took items from Group E. Nevertheless, individuals who were unlucky enough
to have to take several long sets could still be disadvantaged relative to individuals who had several short
sets. (This issue is addressed more fully in a later section.)
In an ordinary, linear test, students of higher ability would generally be expected to be faster--
especially when speed is seen as an inherent feature of the construct (as it is for AR items). A linear
test is then often more speeded for lower ability students--that is, they have greater difficulty finishing in
the time allowed. However, as Table 9 shows, in this testing situation higher ability students (on the
bottom of the table) take substantially longer than the lower ability students (on the top of the table),
because higher ability students are administered more difficult items. Thus, in a CAT, the usual
relationship between ability and speededness can be totally reversed so that the test is more speeded for
higher ability students.
Similar analyses of the four-item AR sets provided further evidence that some sets can be
answered much more quickly than others. The most closely matched pair of four-item sets, in terms of
mean test scores, was administered in position 16-19. Mean GRE quantitative scores for the 225
examinees that comprised one group of the pair were within two points of the mean for the 165
examinees who made up the other group (653 and 655, with standard deviations of 96 and 104), but the
mean response time for one group was over two minutes longer than the mean response time of the other
group (473 seconds vs. 597 seconds, with standard deviations of 145 and 153, respectively). The mean
GRE analytical scores of these groups were very similar, 664 and 658 (SDs = 90 and 97) for the shorter
response-time and longer response-time groups, respectively.
Time differences for LR items. For the LR items, which are not administered in sets, Table 10
shows the spread of mean latencies for questions with approximately the same difficulty level and
position in the test. For example, the first line of the table is for very easy questions (b less than -1.5)
that were administered in positions 1-4. Of the seven questions that met these conditions, average
response times were 60-80 seconds for two items, 80-100 seconds for four items, and 100 to 120 seconds
for one item. The next row of the table refers to questions administered in positions 10 or 15. Most of the
items administered in one position were also administered in a different position, so that within a
difficulty range, the same items may appear on more than one row in the table. Some items were not
included in certain positions because they did not meet the minimum standard that means be based on at
least 20 examinees.
At each difficulty level, response times were notably shorter for positions 30 and 35, indicating
that at this point examinees were hurrying to complete the test. For each position in the test, response
times were longer for more difficult questions. For the easy and very easy questions in positions 1-4, only
5 out of 20 questions (25%) took over 100 seconds to complete, but for questions of medium difficulty or
harder (b > -0.5), 27 out of 40 (68%) questions took over 100 seconds to complete. For the easy and very
easy items, the range of mean response times in a given position was fairly narrow; at each position at
both of these difficulty levels, two adjacent time categories were sufficient to describe almost all of the
items. However, there was a considerably greater range in the mean times for the more difficult items--
even within a given position in the test. For example, mean times for the 13 items with b values over 1.5
administered in positions 1-4 ranged over six 20-second categories. Two questions in this difficulty level
had mean times greater than 140 seconds, and five items had mean times under 100 seconds. The same
pattern of decreasing times with later positions in the test that was noted for the less difficult items was
also found for these difficult items.
Impact of long tests on total scores. This range of mean times led again to the suspicion that
examinees who had more items with long mean times might be at a disadvantage. To test this, we first
determined the mean response time for each item. Because some items were administered more
frequently later in the test, and because items taken later are answered more quickly, the mean time for an
item was defined as the unweighted average of the times across positions in the test. Thus, mean time for
an item was independent of whether it happened to be administered primarily early or primarily late in
the test. For the first item in each AR set, an adjustment for time to read the stimulus was created by
comparing the time to answer an item when it was first in the set to the time for the same item when it
occurred later in the set. Next, we computed the expected mean time for each individual’s test by
summing the mean times for each item taken by that individual. Our hypothesis was that, after controlling
for general ability (using GRE quantitative score and GRE verbal score), GRE analytical scores would be
lower for examinees with the longest expected times. To test this, we ran a regression with GRE
analytical score as the dependent variable, entering GRE quantitative score and GRE verbal score at the
first step, and expected time at the second step.
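This kind of two-step (hierarchical) regression can be sketched as follows. The data here are synthetic and the variable names and coefficients are our own illustrative assumptions, not the study's data; the point is only the mechanics of entering the ability scores first and then checking how much expected time adds to R-square:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-ins for the real variables (illustration only):
quant = rng.normal(580, 100, n)        # GRE quantitative score
verbal = rng.normal(540, 100, n)       # GRE verbal score
exp_time = rng.normal(2400, 300, n)    # summed expected item times (seconds)
analytic = 0.3 * quant + 0.2 * verbal + 0.1 * exp_time + rng.normal(0, 60, n)

def r_squared(X, y):
    """R-square from an ordinary least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

# Step 1: ability scores only; Step 2: add expected time.
r2_step1 = r_squared(np.column_stack([quant, verbal]), analytic)
r2_step2 = r_squared(np.column_stack([quant, verbal, exp_time]), analytic)
print(round(r2_step1, 2), round(r2_step2, 2))
```

The change in R-square between the two steps is the quantity of interest: a significant increase at step 2 means expected time predicts the analytical score beyond general ability.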
Our hypothesis was not supported. Indeed, the opposite was true. Expected time had a significant
positive weight, and increased R-square from .52 to .66. Standardized weights were .30, .16, and .51 for
GRE quantitative score, GRE verbal score, and expected time, respectively. Thus, examinees who took
tests that should have taken longer got higher scores.
Next, we looked to see if the expected negative relationship might emerge within a narrow ability
range. We selected only the 862 examinees with GRE verbal plus GRE quantitative scores in the 1,030-
1,060 range. Within this group, as expected, GRE verbal score and GRE quantitative score were no
longer significant predictors of GRE analytical score, but expected time still had a substantial positive
weight, increasing R-square from .00 to .36. Similar results were found for a low-scoring group (650-700
range) and a high-scoring group (1,300-1,350 range), and for analyses run separately for expected times
on sets and on discrete items.
Apparently, these positive weights for expected time remained because of the relationship
between item difficulty and expected time (r = .52)--that is, more difficult items take longer. With a
CAT, more difficult items are administered to higher ability examinees, so higher ability examinees
generally get tests that take longer. Adjusting for GRE verbal score and GRE quantitative score, which
were correlated .71 with GRE analytical score (a substantial correlation, but still leaving half of the
variance unexplained), was not enough to overcome this relationship of time to item difficulty and of
item difficulty to test score.
In an attempt to compensate for the relationship of time to difficulty, we used the IRT b
parameter to predict mean response time separately for LR and AR items. Each item then had a predicted
time associated with it. We computed a time discrepancy score for each item as the difference between
the time predicted from the item’s difficulty and the actual mean time. The time discrepancy score for an
individual was the sum of the time discrepancy scores for all of the items taken by that individual. A high
time discrepancy score would then reflect a test that was especially long, taking difficulty into account.
For both the LR and AR items, the time discrepancy score was entered after GRE quantitative score and
GRE verbal score, with GRE analytical score as the dependent variable. The time discrepancy score had
a small positive weight, which did not contribute significantly (p > .05) to the prediction, even with the
sample of over 10,000 examinees. Thus, there was no evidence that examinees who got long tests, taking
item difficulty into account, were disadvantaged in terms of their total scores.
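The construction of the time discrepancy score can be sketched as below. The item pool, sample sizes, and regression of mean time on the b parameter are synthetic assumptions for illustration; the sign convention follows the report's interpretation that a high discrepancy score reflects an unexpectedly long test given item difficulty:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical item pool: IRT b parameter and observed mean time (seconds).
b = rng.normal(0.0, 1.0, 200)
mean_time = 90 + 25 * b + rng.normal(0, 15, 200)   # harder items take longer

# Predict mean response time from difficulty (simple linear regression).
slope, intercept = np.polyfit(b, mean_time, 1)
predicted = intercept + slope * b

# Item time discrepancy: actual mean time minus difficulty-predicted time,
# so positive values mark items that run long for their difficulty.
discrepancy = mean_time - predicted

# An examinee's score is the sum of discrepancies over the items taken.
items_taken = rng.choice(200, size=35, replace=False)
examinee_score = discrepancy[items_taken].sum()
print(round(examinee_score, 1))
```

Because the discrepancies are residuals from the time-on-difficulty regression, they average to zero over the pool, which is what makes the summed score an index of test length net of difficulty.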
For the above analyses, expected times were based on the full set of 35 items, and examinees
with incomplete tests were excluded from the analysis. An additional analysis was run that was able to
include all examinees by defining a long test only in terms of the first nine items (4 LR items and the first
five-item AR set). Mean item times were recomputed to reflect only performance on these items in
positions 1-9, and as before, time discrepancy scores were defined as the difference between the time
predicted from the item’s difficulty and the actual mean time. The time discrepancy score for an
individual was the sum of the time discrepancy scores for each of the items (1-9) taken by that individual.
The distribution of these summed time discrepancy scores for individuals indicated a difference
of about one minute between times at the 25th and 75th percentiles. This summed time discrepancy score
had a small negative weight for predicting the number of questions attempted after GRE verbal score and
GRE quantitative score were entered; though statistically significant in this very large sample of 14,745
examinees, the change in R was very small (.015 to .066, or a change in R-square from .000 to .004).
The more important question was whether the summed time discrepancy score had a significant
negative weight for predicting the GRE analytical score. It did not. It had a positive weight; although
statistically significant, the change in R-square was infinitesimal (from .459 to .460). This analysis was
repeated separately for male and female examinees, and for African American, Asian, Hispanic/Latino,
and White subgroups. Results were remarkably consistent across these subgroups with a small positive
weight in each subgroup.
Conclusion
For both the GRE quantitative and analytical measures, some items clearly take longer to answer
than others. Because estimated solution time is not included in the item selection algorithm, this
difference in time could potentially create a fairness problem on a timed test. However, we could find no
evidence of an impact on total scores for examinees who got unusually long questions or tests.
Despite this lack of an overall effect on test scores, we cannot rule out the possibility that certain
individuals could be negatively impacted by receiving items that take an unusually long time to answer.
Even without clear evidence of an impact on test scores, it might be desirable to include some measure of
estimated solution time in the item selection algorithm so that no individual gets more than a fair share of
time-consuming items. The feasibility of such an approach has already been demonstrated (van der
Linden, Scrams, & Schnipke, 1999).
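In outline, a time-aware selection rule might look like the following sketch. This is not van der Linden, Scrams, and Schnipke's actual constraint model; the item pool, time budget, per-item budget share, and one-parameter-logistic information function are all illustrative assumptions:

```python
import math
import random

random.seed(0)

# Hypothetical item pool: each item has an IRT b parameter and an
# estimated mean response time in seconds (both made up here).
pool = [{"b": random.gauss(0, 1), "time": random.gauss(90, 20)}
        for _ in range(100)]

def info(b, theta):
    """Fisher information of a 1PL item at ability theta."""
    p = 1 / (1 + math.exp(-(theta - b)))
    return p * (1 - p)

def pick_next(pool, used, theta, time_left, items_left):
    """Pick the most informative unused item whose estimated time fits
    the per-item share of the remaining time budget."""
    budget = time_left / items_left
    candidates = [i for i in range(len(pool))
                  if i not in used and pool[i]["time"] <= budget]
    if not candidates:  # relax the time constraint rather than stall
        candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: info(pool[i]["b"], theta))

used = set()
time_left = 35 * 90.0  # assumed overall budget: 35 items at 90 seconds
for k in range(35, 0, -1):
    i = pick_next(pool, used, theta=0.0, time_left=time_left, items_left=k)
    used.add(i)
    time_left -= pool[i]["time"]
print(len(used))  # 35 items selected
```

The design choice this sketch illustrates is the trade-off the report implies: the selector filters on estimated time before maximizing information, so no examinee accumulates more than a fair share of time-consuming items.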
However, the current results suggest that including such estimates may be more complicated than
previously imagined, because an item, or a set of items, does not have a single response time, but rather
many response times, depending on the position in which it is administered in the test. This is especially
true for AR items. Suppose the time needed for a five-question AR set were estimated from a pretest that
administered the set near the end of the test, but the set was actually administered to some examinees
near the beginning of the test. The estimated time could be in error by more than six minutes.
Knowledge of position effects might lead to appropriate adjustments in estimated time depending on
position. In addition to clarifying these position effects, future research could investigate possible
relationships between individual differences in pacing styles (of the type described by Scrams and
Schnipke, 1999) and the time demands created by particular sets of items.
References
Bejar, I. I. (1985). Test speededness under number-right scoring: An analysis of the Test of English as a Foreign Language (Research Report RR-85-11). Princeton, NJ: Educational Testing Service.
Briel, J. B., O’Neill, K. A., & Scheuneman, J. D. (1993). GRE technical manual. Princeton, NJ: Educational Testing Service.
Bridgeman, B., Anderson, D., & Wightman, L. (1997). Overview of the GMAT CAT Pilot Test. Unpublished report.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.
Schnipke, D. L., & Scrams, D. J. (1997). Modeling item response times with a two-state mixture model: A new method of measuring speededness. Journal of Educational Measurement, 34, 213-232.
Scrams, D. J., & Schnipke, D. L. (1999). Response time feedback on computer-administered tests. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal.
van der Linden, W. J., Scrams, D. J., & Schnipke, D. L. (1999). Using response-time constraints to control for differential speededness in computerized adaptive testing. Applied Psychological Measurement, 23, 195-210.
Yamamoto, K. (1995). Estimating the effects of test length and test time on parameter estimates using the HYBRID model (TOEFL Technical Report No. TR-10). Princeton, NJ: Educational Testing Service.
Table 1. Number of GRE Quantitative Items in Response-Time Categories by Item Type
Mean Time in 20-Second Intervals Item 0- 20- 40- 60- 80- 100- 120- 140- 160- Type 20 40 60 80 100 120 140 160 180 >180 Total
1 QC,P,Arith. 1 7 18 5 2 QC,P,Alg. 3 11 15 3 QC,P,Geom. 3 9 8 4 QC,P,DI 3 2 5 QC,R,Arith. 2 3 6 QC,R,Alg. 1 3 7 QC,R,Geom. 5 8 QC,R,DI 3 9 PS,P,Arith. 2 3 5 10 PS,P,Alg. 2 2 11 PS,P,Geom. 3 8 12 PS,P,DI 2 1 13 PS,R,Arith. 3 14 PS,R,Alg. 1 15 PS,R,Geom. 2 16 PS,R,DI 2
2 10 8 6
3 1 3
1
3
2 2 1 1 1
3 2 1 4
1 2
36 38 24 8 9 8 6 5
1 26 2 1 20 1 1 25
7 1 9
1 2 10 5 8
Total 1 15 54 68 52 23 8 13 5 5 244
Note. QC = quantitative comparison; PS = problem solving; P = pure (numbers only); R = real; DI = data interpretation.
Table 2. Number of Easy (-1.5 < b < -0.5) GRE Quantitative Items in Response-Time Categories
by Item Type
Item Type
Mean Time in 20-Second Intervals 0- 20- 40- 60- 80- 100- 120- 140- 160- 20 40 60 80 100 120 140 160 180 >180 Total
1 QC,P,Arith. 2 QC,P,Alg. 3 QC,P,Geom. 4 QC,P,DI 5 QC,R,Arith. 6 QC,R,Alg. 7 QC,R,Geom. 8 QC,R,DI 9 PS,P,Arith. 10 PS,P,Alg. 11 PS,P,Geom. 12 PS,P,DI 13 PS,R,Arith. 14 PS,R,Alg. 15 PS,R,Geom. 16 PS,R,DI
5 5 4
3 2 1
1 1
1 2 1 2 1
2
2 1
1 1 2
1
3 3 3
1 2 1
1
1
11 6 6 1 2 3 1 0 8 5 7 2 3 1 0 3
Total 0 9 21 11 15 2 1 59
Note. QC = quantitative comparison; PS = problem solving; P = pure; R = real; DI = data interpretation.
Table 3. Number of Medium-Difficulty (-0.5 < b < 0.5) GRE Quantitative Items in Response-Time
Categories by Item Type
Item Type
Mean Time in 20-Second Intervals 0- 20- 40- 60- 80- 100- 120- 140- 160- 20 40 60 80 100 120 140 160 180 >180 Total
1 QC,P,Arith. 2 QC,P,Alg. 3 QC,P,Geom. 4 QC,P,DI 5 QC,R,Arith. 6 QC,R,Alg. 7 QC,R,Geom. 8 QC,R,DI 9 PS,P,Arith. 10 PS,P,Alg. 11 PS,P,Geom. 12 PS,P,DI 13 PS,R,Arith. 14 PS,R,Alg. 15 PS,R,Geom. 16 PS,R,DI
1 9 3 2 6 8
3 5 2 2
1 1 4
1 1 2 1
1 5 1
2
1 2
2
13 18 8 4
1 3 3 4 2
1 7 1 1 1 7 2 9
1 1 1
1 1 1 3 2
1 1 2 Total 0 4 22 33 15 6 2 3 1 1 87
Note. QC = quantitative comparison; PS = problem solving; P = pure; R = real; DI = data interpretation.
Table 4. Number of Difficult (0.5 < b < 1.5) GRE Quantitative Items in Response-Time
Categories by Item Type
Mean Time in 20-Second Intervals Item 0- 20- 40- 60- 80- 100- 120- 140- 160- Type 20 40 60 80 100 120 140 160 180 >180 Total
1 QC,P,Arith. 2 QC,P,Alg. 3 QC,P,Geom. 4 QC,P,DI 5 QC,R,Arith. 6 QC,R,Alg. 7 QC,R,Geom. 8 QC,R,DI 9 PS,P,Arith. 10 PS,P,Alg. 11 PS,P,Geom. 12 PS,P,DI 13 PS,R,Arith. 14 PS,R,Alg. 15 PS,R,Geom. 16 PS,R,DI
Total
2 1 4 5 3
3 2 2 1 1
1 4 1 2
1
1 1 1 1
1 1
2 1
0 0 6 11 18 1
6 5
7 9 8 3 0 1 0 1
1 5 2 1 6
1 5 1 1
1 2 3
3 1 2 5 3 1 55
Note. QC = quantitative comparison; PS = problem solving; P = pure; R = real; DI = data interpretation.
Table 5. Comparison of Two Category-10 Items

Comparison Characteristics                                        Item A     Item B
Difficulty (IRT b parameter)                                      0.30       0.39
Number answering correctly                                        665        599
Number answering incorrectly                                      440        414
Mean GRE-quantitative score for examinees answering correctly     641 (69)   645 (71)
Mean GRE-quantitative score for examinees answering incorrectly   544 (74)   555 (63)
Mean time, in seconds, to correct answer                          58 (33)    153 (89)
Mean time, in seconds, to wrong answer                            70 (48)    162 (89)
Mean time, correct, position 15 (n = 168 for A and 134 for B)     53 (25)    162 (75)
Mean time, wrong, position 15 (n = 106 for A and 89 for B)        69 (40)    161 (88)

Note. Category 10: PS, pure, algebra.
Table 6. Comparison of Two Category-2 Items

Comparison Characteristics                                        Item C     Item D
Difficulty (IRT b parameter)                                      0.40       0.39
Number answering correctly                                        372        592
Number answering incorrectly                                      181        198
Mean GRE-quantitative score for examinees answering correctly     685 (79)   660 (68)
Mean GRE-quantitative score for examinees answering incorrectly   643 (63)   584 (68)
Mean time, in seconds, to correct answer                          94 (35)    30 (23)
Mean time, in seconds, to wrong answer                            74 (47)    51 (30)
Mean time, correct, position 18 (n = 137 for C and 230 for D)     101 (40)   32 (28)
Mean time, wrong, position 18 (n = 56 for C and 90 for D)         81 (37)    55 (30)

Note. Category 2: QC, pure, algebra.
Table 7. Test-Developer Ratings of Response Times for Seven Category-10 Items

Mean Time (SD)   Median Time   Actual Rank   Rater 1   Rater 2   Rater 3
58 (33)          48            1             3         1         3
77 (50)          58            2             1         2         1
77 (46)          66            3             4         3         4
90 (54)          80            4             6         6         6
123 (77)         118           5             2         4         2
136 (64)         125           6             5         7         7
153 (89)         143           7             7         5         5

Note. Category 10: PS, pure, algebra.
Table 8. Means and Standard Deviations for Examinees Taking Item Sets in Group A, Position 5-9 and
Position 25-29

                               Mean (SD)
Set   Position   n     GRE-analytical   GRE-quantitative   Time in Seconds   Number Correct in Set
A1    5-9        563   442 (99)         461 (131)          565 (261)         2.1 (1.4)
      25-29      204   397 (51)         437 (124)          296 (200)         1.6 (1.2)
A2    5-9        223   533 (93)         522 (116)          663 (206)         2.8 (1.4)
      25-29      40    489 (66)         520 (91)           280 (240)         0.7 (0.8)
A3    5-9        113   527 (95)         528 (112)          637 (206)         3.3 (1.3)
      25-29      4     500 (29)         478 (15)           483 (107)         2.5 (1.3)
A4    5-9        636   593 (100)        582 (116)          782 (226)         4.1 (1.0)
      25-29      489   598 (59)         575 (100)          402 (228)         2.4 (1.6)
Table 9. Response Times for Group A and Group E Sets in Position 5-9 and of Comparable Difficulty

           Item Difficulty (b)                                Mean (SD)
Set   Q1      Q2      Q3      Q4     Q5     Mean    n     GRE-quantitative   GRE-analytical   Time         Median Time
A1    -1.21   -1.20   -0.97   0.28   0.30   -0.56   563   461 (131)          422 (99)         565 (261)    560
E1    -1.58   -0.60   -0.14   0.61   0.31   -0.28   378   472 (126)          449 (85)         544 (213)    521
A2    -1.20   -0.97   0.23    0.30   0.59   -0.21   223   522 (116)          533 (93)         663 (206)    650
E4    -1.58   -0.60   -0.14   0.16   0.59   -0.31   262   540 (115)          529 (93)         615 (205)*   579
A4    -0.97   0.23    0.30    0.59   0.72   0.17    636   582 (116)          593 (100)        782 (226)    752
E7    -0.60   -0.14   0.16    0.59   0.60   0.12    262   578 (116)          588 (97)         704 (221)*   672

*p < .02
Table 10. Number of Logical Reasoning Items in Response-Time Categories by Position and Level of Difficulty

Position    Mean Time in 20-Second Intervals
            0-20  20-40  40-60  60-80  80-100  100-120  120-140  140-160  160-180

Very Easy Items (b less than -1.5)
1 to 4      2 4 1
10 and 15   1 2 5
20          3 2
30          3 4
35          2 5 1

Easy Items (b between -1.5 and -0.5)
1 to 4      1 8 4
10 and 15   1 6 5
20          3 8
30          2 10
35          1 11

Moderately Difficult Items (b between -0.5 and +0.5)
1 to 4      4 8 3
10 and 15   8 9 1
20          3 12 4
30          1 16 1
35          1 10 7

Difficult Items (b between +0.5 and +1.5)
1 to 4      4 6 2
10 and 15   3 7 3
20          2 4 5 1
30          1 11
35          2 8 1

Very Difficult Items (b greater than +1.5)
1 to 4      1 4 1 5 1 1
10 and 15   1 3 3 4 5 1
20          2 6 4 6 1
30          4 6 3 1
35          1 6 6
[Figure 1 is a scatterplot of item position (x-axis, 0 to 30) against mean response time in seconds (y-axis, 0 to 300). Positions 9, 10, 22, and 23 were for excluded Data Interpretation sets.]

Figure 1. Scatterplot of item position by mean response time
[Figure 2 is a scatterplot of item difficulty (IRT b parameter, -5 to 3 on the x-axis) against mean response time in seconds (0 to 300 on the y-axis).]

Figure 2. Scatterplot of item difficulty by mean response time