Variations in Mean Response Times for Questions on the
Computer-Adaptive GRE® General Test: Implications for Fair Assessment
Brent Bridgeman and Frederick Cline
GRE No. 96-20P
June 2000
This report presents the findings of a research project funded by and carried
out under the auspices of the Graduate Record Examinations Board
Educational Testing Service, Princeton, NJ 08541
Researchers are encouraged to express freely their professional judgment. Therefore, points of view or opinions stated in Graduate
Record Examinations Board Reports do not necessarily represent official Graduate Record Examinations Board position or policy.
********************
The Graduate Record Examinations Board and Educational Testing Service are dedicated to the principle of equal opportunity, and their programs,
services, and employment policies are guided by that principle.
EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, GRADUATE RECORD EXAMINATIONS, and GRE are registered trademarks of Educational Testing Service.
Copyright © 2000 by Educational Testing Service. All rights reserved.
Abstract
In a computer-adaptive test (CAT), different examinees receive different sets of questions.
Questions at the same overall difficulty level and meeting roughly the same content specifications could
vary substantially in the amount of time needed to answer them. If the CAT is speeded (that is, if
substantial numbers of students either do not finish or must guess randomly at the end to finish),
individuals who happened to get a disproportionate number of questions that took a long time to answer
could be disadvantaged. The purpose of this study was to determine whether--in computer-adaptive
testing situations--the administration of a disproportionate number of questions with long expected
response times unfairly disadvantages examinees. Data from 5,957 examinees who took the computer-
delivered Graduate Record Examination (GRE®) quantitative measure and 14,745 examinees who took
the computer-delivered GRE analytical measure were used to investigate variation in response time in
light of other factors, such as mean test score and the position of the question on the test. For both
measures, substantial variation in response times was found, even for items with the same specifications
and same difficulty level. But despite these differences, there was no indication that the scores of
students who were administered items with long expected response times were disadvantaged.
Key Words:
test fairness
validity
speededness
computer-adaptive tests
Contents
Introduction
Study 1
    Method
    Results
Study 2
    Method
    Results
Conclusion
References
List of Tables
Table 1. Number of GRE Quantitative Items in Response-Time Categories by Item Type
Table 2. Number of Easy GRE Quantitative Items in Response-Time Categories by Item Type
Table 3. Number of Medium-Difficulty GRE Quantitative Items in Response-Time Categories by Item Type
Table 4. Number of Difficult GRE Quantitative Items in Response-Time Categories by Item Type
Table 5. Comparison of Two Category-10 Items
Table 6. Comparison of Two Category-2 Items
Table 7. Test-Developer Ratings of Response Times for Seven Category-10 Items
Table 8. Means and Standard Deviations for Examinees Taking Item Sets in Group A, Position 5-9 and Position 25-29
Table 9. Response Times for Group A and Group E Sets in Position 5-9 and of Comparable Difficulty
Table 10. Number of Logical Reasoning Items in Response-Time Categories by Position and Level of Difficulty
List of Figures
Figure 1. Scatterplot of item position by mean response time
Figure 2. Scatterplot of item difficulty by mean response time
Introduction
Tests used for admissions to programs of professional or graduate education, such as the
Graduate Record Examination (GRE®) General Test and Subject Tests, are generally designed to be
“power” tests rather than “speed” tests. That is, they are intended to be tests of ability rather than how
quickly a student can answer. According to the GRE Technical Manual (Briel, O’Neill, & Scheuneman,
1993), the purpose of the GRE General Test is to assess “reasoning skills considered fundamental in
graduate study: verbal reasoning, quantitative reasoning, and analytical reasoning” (p. 7), and the “GRE
General and Subject Tests are not intended to be speeded” (p. 32).
Nevertheless, the tests have strict time limits. Our preliminary analyses showed that 20% of
examinees fail to finish the quantitative section of the General Test, and over 35% fail to finish the
analytical section. Even examinees who answer every question may still be negatively impacted by the
time limit; they may engage in rapid-guessing behavior near the end of the test just so they can answer
every question. Such rapid-guessing behavior has been observed on the Test of English as a Foreign
Language (Bejar, 1985; Yamamoto, 1995) as well as on a computer-administered, but nonadaptive,
version of the GRE quantitative test (Schnipke & Scrams, 1997).
Although the addition of a speed component to a power test can be problematic for any type of
test, additional unique considerations arise with a computer-adaptive test (CAT). The unidimensional
models used with CATs implicitly assume that only knowledge of the correct answer, and not speed, is
needed for a high score (Hambleton & Swaminathan, 1985). Although this is also true of paper-based
power tests, the additional complication with a CAT is that each examinee is getting a different set of
questions. For example, on the paper-based version of the GRE, post-administration equating procedures
could provide a correction if questions on one form could be answered more quickly than questions on a
different form. Although the item response theory (IRT) model used to score the computer-adaptive
version of the General Test takes difficulty differences into consideration, it does not adjust for possible
differences in the speed demands of different questions or sets of questions.
Theoretically, GRE test questions at the same overall difficulty level or meeting roughly the
same content specifications could vary substantially in the amount of time needed to answer them.
Preliminary results from the experimental, computer-adaptive Graduate Management Admissions Test
(GMAT®) suggest that some quantitative question types take twice as long, on average, to answer as
other quantitative question types (Bridgeman, Anderson, & Wightman, 1997). In this case, average
response times across question types varied from a low of 65 seconds for geometry data-sufficiency
questions to a high of 135 seconds for algebra problem-solving questions, with standard deviations
within question types ranging from 35 to 89 seconds.
Variations in mean solution time across question types provide useful information about the
cognitive processing demands of different question types, but by themselves they do not raise fairness
concerns, as long as the test specifications standardize the number of questions of each type that an
individual will receive. On the other hand, substantial variation within question type could signify an
equity problem. Because the verbal, quantitative, and analytical sections of the GRE General Test are
speeded--that is, substantial numbers of students either do not finish or must guess randomly at the end
to finish--individuals who happen to get a disproportionate number of questions that take a long time to
answer could be disadvantaged. Although the GMAT analysis suggests that there might indeed be
reasons for concern, that study did not explore the data in enough depth to pinpoint the nature and extent
of this potential problem. In particular, the analysis did not control for question difficulty within question
type, so in this case, it is possible that within-question-type variability could be largely explained by
difficulty differences.
In the current study, we investigated mean response times on GRE General Test questions while
controlling for question difficulty. Two separate analyses are reported here. In Study 1, we examine CAT
items from the GRE quantitative measure. Because 24 out of the 28 questions in the GRE quantitative
test are discrete questions, rather than being part of question sets, analyses for this measure were less
complex and provided a useful starting point. In Study 2, we look at CAT items from the GRE analytical
measure, which makes extensive use of question sets (that is, a series of questions based on a single
problem presentation).
Study 1
Method
The data for the GRE quantitative questions came from a 1997 item pool in which 5,957
examinees took CAT versions of the General Test. This particular item pool was administered after a
procedure called proportional adjustment for incomplete tests, which imposes a penalty for leaving
questions unanswered, was instituted. Therefore, candidates were motivated to complete the test.
Examinees had 45 minutes to try to answer 28 questions, or about 96 seconds per question. We extracted
mean response times for each item in each position in the test in which it was administered. For example,
question 12 might be administered as the third item in one person’s test and as the 24th item in another
person’s test, so we computed separate means for each position.¹
If time ran out for an examinee before the last question attempted was answered, we excluded
that examinee’s time from the computation of the mean time for that question. We also excluded all items
that were part of sets (each examinee who finished the test responded to two sets [two questions per set]
and 24 discrete questions). The item pool consisted of 252 discrete questions. In addition to mean times
for all examinees who were administered an item, we computed mean times for those who chose the
correct answer to the question and for those who chose an incorrect answer. We also computed mean
GRE quantitative scores for examinees who got the question right and for those who got the question
wrong.
For each item, the database contained the three IRT parameters: discrimination (a), difficulty (b),
and pseudo-guessing (c). The b parameter expresses difficulty as a standard score with a mean of 0 and
standard deviation of 1. We classified items into five difficulty categories as follows:
1. very easy (b < -1.5)
2. easy (-1.5 < b < -.5)
3. medium (-.5 < b < .5)
4. difficult (.5 < b < 1.5)
5. very difficult (b > 1.5)
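These cut points amount to a simple lookup. As a sketch (the function name and return codes are ours; the report supplies only the five bands):

```python
def difficulty_category(b):
    """Map an IRT difficulty parameter b (standard-score scale,
    mean 0, SD 1) to the report's five difficulty bands."""
    if b < -1.5:
        return 1  # very easy
    elif b < -0.5:
        return 2  # easy
    elif b < 0.5:
        return 3  # medium
    elif b < 1.5:
        return 4  # difficult
    else:
        return 5  # very difficult
```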
Items had been classified by test developers into 16 content categories, which the automated,
item-selection algorithm uses to make certain that CATs for all individuals are comparable on these
dimensions. The first category level classifies items as either “problem solving” (PS) questions or
“quantitative comparison” (QC) questions. The PS items are standard multiple-choice questions, each (
which offers five answer choices. The QC items, which offer four answer choices, ask the examinee to
determine which of two quantities is larger, whether they are equal, or whether there is not enough
information to determine the answer. (Each examinee’s test contained 10 PS questions and 14 QC
questions, plus the four items from two sets that were excluded from the current analyses.)
¹ Interpreting means for skewed distributions can be problematic, and response time distributions tend to be positively skewed. However, the skewness for these items was generally not too severe because examinees knew that spending too much time on any question would impede their ability to finish the test. Thus, means were adequate for our purposes, and we have also supplied medians for certain analyses.
All questions are also categorized as either “pure” or “real.” Pure questions deal only with
numbers and symbols, while real questions refer to a name or object from the real world and are
frequently word problems. The test specifications indicate that each examinee’s test should contain 18
pure items and six real ones.
All items are further coded into four subject-matter categories: arithmetic (eight or nine items per
examinee), algebra (six or seven items), geometry (five or six items), and data interpretation (two to five
items). We assigned numerical codes to these 16 categories as follows:
1. QC, pure, arithmetic
2. QC, pure, algebra
3. QC, pure, geometry
4. QC, pure, data interpretation
5. QC, real, arithmetic
6. QC, real, algebra
7. QC, real, geometry
8. QC, real, data interpretation
9. PS, pure, arithmetic
10. PS, pure, algebra
11. PS, pure, geometry
12. PS, pure, data interpretation
13. PS, real, arithmetic
14. PS, real, algebra
15. PS, real, geometry
16. PS, real, data interpretation.
Beyond these 16 categories, items are further classified into one of 79 categories that provide a
more detailed description of the item content--such as “negative exponents,” “linear inequality,” and
“ratio and proportion.” These categories are not used to select items for inclusion on individual CATs;
also, there are more of these item-content categories than there are items on any one examinee’s test.
However, a few of these categories are used to provide the upper limit of the number of questions of a
given content type that should be included on each test; for example, no individual’s test should contain
more than two “ratio and proportion” questions, though it need not contain any items in this category at
all.
Results
Position effects. We first looked at the relationship between position in the test and mean
latencies. If each of the 252 items appeared in each of the 28 possible positions, there would be 7,056
means representing all of the possible item-by-position possibilities. In fact, however, the operation of
the item selection algorithm is such that a given item actually appears in only a limited number of the 28
positions. In this data set, 1,760 means represented all of the data points. For this and other analyses
described here, we required that a mean be based on the performance of at least 20 students; this brought
the total number of means to 1,131.
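The item-by-position aggregation with the 20-student minimum can be sketched as follows. The (item, position, seconds) record format is an assumption for illustration; the report does not describe its data layout:

```python
from collections import defaultdict

def item_position_means(records, min_n=20):
    """Compute the mean response time for each (item, position) cell,
    keeping only cells observed for at least min_n examinees.
    `records` is an iterable of (item_id, position, seconds) tuples."""
    cells = defaultdict(list)
    for item_id, position, seconds in records:
        cells[(item_id, position)].append(seconds)
    return {key: sum(times) / len(times)
            for key, times in cells.items()
            if len(times) >= min_n}
```

Applied to the full pool, this kind of filter reduced the 1,760 observed item-by-position cells to the 1,131 means analyzed.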
With this constraint, the correlation between mean time to complete the item and item position
was not significant (r = -.04). This lack of relationship between position and completion time can be seen
in Figure 1. Similar nonsignificant differences were found for analyses run separately for the PS and QC
item types. Furthermore, mean time to correct solution was almost the same for the early PS items
(positions 1-15) as it was for PS items in the last two positions (101 seconds vs. 99 seconds, with
standard deviations of 74 and 52, respectively). These results do not imply that individual examinees are
necessarily all working at a uniform rate, but only that examinees who are responding rapidly at the end
are to some extent balanced by examinees who are responding more slowly at the end. Indeed, Scrams
and Schnipke (1999) suggested that about 23% of examinees speeded up as they proceeded through a
linear version of the GRE quantitative test, and about 22% slowed down, with the remainder keeping a
relatively even pace throughout.
Difficulty. As suggested in Figure 2, a correlation of .44 (or .56 including a b² term to reflect the
curvilinear increase) was observed between mean time and difficulty (b). Despite this generally positive
relationship between time to answer and difficulty, four of the easiest items (b < -1.5) took over 100
seconds, on average, to answer, and four of the hardest items were answered in less than 80 seconds.
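The two correlations just described can be reproduced in principle by correlating b with mean time directly, and then fitting a quadratic in b and correlating fitted with observed times. A sketch with illustrative data only, not the study's actual values:

```python
import numpy as np

def linear_and_quadratic_r(b, mean_time):
    """Pearson r between item difficulty (b) and mean response time,
    plus the multiple R when a b**2 term is added to capture the
    curvilinear increase in time with difficulty."""
    b = np.asarray(b, dtype=float)
    t = np.asarray(mean_time, dtype=float)
    r_linear = np.corrcoef(b, t)[0, 1]
    # Regress t on [1, b, b^2]; the multiple R is the correlation
    # between fitted and observed values.
    X = np.column_stack([np.ones_like(b), b, b ** 2])
    fitted = X @ np.linalg.lstsq(X, t, rcond=None)[0]
    r_multiple = np.corrcoef(fitted, t)[0, 1]
    return r_linear, r_multiple
```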
Item type. Table 1 shows the relationship of item type to mean response time for 244 test items.
(Because eight of the 252 questions in the item pool were seen by fewer than 20 examinees, they were
excluded from the analyses). Item types 1-8 (QC items), which are designed to be answered quickly,
generally did appear to take less time than the problem solving items. However, there was still substantial
variability within the QC and PS categories, with 11 ostensibly quick QC items requiring an average of
more than 100 seconds and 12 PS items taking less than a minute.
Even within a specific question type there was substantial variation. For example, three items
classified as category 2 (QC, pure, algebra) averaged response times of over 100 seconds, while another
three items in the same category averaged less than 40 seconds. Variation was especially wide for
questions in categories 9 (PS, pure, arithmetic), 10 (PS, pure, algebra), and 11 (PS, pure, geometry). Of
the 26 category 9 items, three required an average of over two minutes for examinees to answer, while
another five took less than one minute. Similarly, examinees answered two category 10 questions in an
average of less than a minute, while they needed an average of more than three minutes for another
question in this category. And average response times for items in category 11 also ranged from under
one minute to over three minutes.
The time differences within question types displayed in Table 1 might be explainable by
considering differences in item difficulty. But Tables 2, 3, and 4--which show the relationship between
item type and mean response time for easy (-1.5 < b < -0.5), medium-difficulty (-0.5 < b < 0.5), and
difficult (0.5 < b < 1.5) GRE quantitative items, respectively--indicate that even within a relatively
narrow difficulty band, mean times for individual question types still varied over a broad range.
However, the trend for more difficult items generally taking longer also was apparent. Only one
out of the 59 easy items required more than two minutes, on average, to answer, but 14 out of the 55
difficult items required more than two minutes. Within question category 10 (PS, pure, algebra), four out
of six difficult questions averaged more than two minutes, while all of the easy questions in this category
averaged less than 100 seconds each to be solved.
Nevertheless, it is the within-category/within-difficulty-level variation that is most disturbing
from a fairness perspective. For example, if the CAT item-selection algorithm called for a difficult,
category 10 question, one examinee might get a question that, on average, students answer in less than
100 seconds, while another student could get a question that, on average, required more than 180 seconds
to answer. Such time differences would be of no concern only if ample time were allowed to finish the
test.
Table 5 presents a comparison of the characteristics of one such pair of items. The items shown
in Table 5 are from the same category, they are of nearly identical difficulty, and both were
administered to relatively large samples of examinees. However, the items differ markedly in their mean
solution times. Table 6 shows the same phenomenon for a different pair of items.
Questions that are of equal difficulty can have very different solution times, because the number
of steps needed to solve a problem is not necessarily closely linked to difficulty. For example, Item C on
Table 6 is a linear inequality that requires some processing time just to understand what is being asked.
Item D, on the other hand, is a negative exponents question, the difficulty of which apparently stems
from some examinees not knowing how to solve this type of problem; examinees who understand
negative exponents can solve it quite quickly.
The potential fairness implications of these results can be clearly seen by imagining the
difference in the testing experience of two hypothetical examinees based on a lucky or unlucky break in
item selection. Suppose the two examinees, call them Mary and Jim, took identical tests, except that the
questions administered to each in positions 15 and 18 were different. If Mary got the B and C pair while
Jim got the A and D pair (and both students got the correct answer in the average amount of time), Jim
would have almost three more minutes to complete the test than Mary.
A possible solution to this problem would be to include a category for solution time in the item-
selection algorithm. This would ensure that no individual would get a disproportionate number of
questions that require long or short response-times. This, of course, leads to the question of the
availability of solution times. For items that are pretested and calibrated during a CAT administration,
solution times are available. However, a significant proportion of GRE test items that are used to create
new item pools are calibrated in paper-and-pencil administrations, and so no solution times are available.
For these items, expert ratings of estimated solution times could be obtained.
To evaluate the likely success of this procedure, we asked three people with considerable
experience in developing items for the GRE quantitative measure to rank order a set of seven items from
shortest time to correct solution to longest time to correct solution. The seven items were all of the same
category (category 10: PS, pure, algebra) and difficulty level (medium: -0.5 < b < 0.5), and all had
been answered correctly by at least 450 examinees. Mean and median times, along with the rater
rankings, are shown in Table 7.
As Table 7 shows, median times are about 10 seconds shorter than mean times because of the
previously noted positive skew of the time distributions, but both mean and median times tell the same
story as to which items take the longest to solve. The rankings by the test development experts were
reasonably close to the rank order of actual solution times. The item with the shortest actual solution time
was ranked in the top three shortest time categories by all three test developers, and the item with the
longest actual solution time was rated as the longest item by one rater and as the third longest item by the
other two raters. However, there were also some misclassifications. For example, all three raters placed
the actual fourth-place question in sixth place. Two out of the three raters thought that this item (which
actually took 90 seconds, on average, to answer) would take longer to answer than the item that averaged
123 seconds. Thus, although useful and certainly better than no solution time estimates at all, ratings by
test developers would not substitute for actual solution times.
Although examinees who happened to get an item that took much longer than average to answer
would seem to be at a disadvantage, we saw no evidence of this here in terms of total GRE quantitative
scores. Looking back at Tables 5 and 6, note that in both comparisons, mean quantitative scores were
slightly higher for the students who answered the longer item in each pair--exactly the opposite of the
lower scores that would be expected if this were to hurt their chances to fully consider later items.
We correlated the mean time needed to respond to each item with the mean score of the
examinees who took that item, separately for the two main item types, QC and PS. Other things being
equal, if having to take an item with a longer response time lowered scores, this correlation might be
expected to be negative. However, both correlations were positive (.37 for the 134 QC items and .55 for
the 110 PS items). And of course, other things are not equal in a CAT. More difficult items, which tend
to take longer, are administered disproportionately to higher ability students; hence the positive
correlation. We attempted to compensate for this with a regression approach, predicting the mean score
of the examinees taking the item from item difficulty (IRT parameter b), and then determining whether
adding mean time made any incremental contribution to the prediction. Difficulty was indeed
substantially correlated with mean score (.82 for QC and .88 for PS), but mean time did not make a
significant incremental contribution to the prediction of mean scores (multiple R increased by less than
.01).
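The incremental-prediction check can be sketched the same way: compute the multiple R from difficulty alone, then from difficulty plus mean time, and compare the two. This is a generic hierarchical-regression sketch with made-up data, not the authors' actual code:

```python
import numpy as np

def incremental_r(b, mean_time, mean_score):
    """Multiple R predicting mean examinee score from item difficulty (b)
    alone, and the increment in R from adding mean response time as a
    second predictor (the report found the increment was under .01)."""
    def multiple_r(predictors, y):
        # Least-squares fit of y on an intercept plus the predictors;
        # multiple R is the correlation of fitted with observed values.
        X = np.column_stack([np.ones(len(y))] + list(predictors))
        fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]
        return np.corrcoef(fitted, y)[0, 1]
    y = np.asarray(mean_score, dtype=float)
    b = np.asarray(b, dtype=float)
    t = np.asarray(mean_time, dtype=float)
    r_b = multiple_r([b], y)
    r_bt = multiple_r([b, t], y)
    return r_b, r_bt - r_b
```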
Gender differences. Tables 5 and 6 illustrate the potential for fairness problems from an
individual perspective. Because the item selection algorithm would not systematically assign time-
consuming questions to a particular gender group, there is less concern that these time differences would
have an impact on fairness from a group perspective. Nevertheless, we attempted to determine whether
particular items could be answered more quickly by one gender group than the other.
Over the 252 items in the pool, we correlated the mean time to a correct answer for men with the
mean time to a correct answer for women. The correlation was .92, suggesting that in general, the items
that were most time consuming for men were also the items that were most time consuming for women.
For the 160 items that were answered correctly by at least 100 men and 100 women, we computed the
differences in mean times for the two gender groups. Only 8 items demonstrated differences of more than
20 seconds; men were faster for seven of these eight items. This result is not surprising, given the higher
mean scores of men in this sample (for men, mean = 596 [SD = 129]; for women, mean = 524 [SD =
123]).
A closer look at the one item that women answered more quickly than men illustrates the
difficulty in interpreting time differences on a multiple-choice examination. The item that women, on
average, answered more quickly was a very difficult (b = 1.52), category 3 (QC, pure, geometry)
problem. The mean time to a correct solution was 174 seconds for men (SD = 94; median = 148) and 141
seconds for women (SD = 92; median = 112). But only 110 of the 592 women (19%) who were
administered this item answered it correctly, which is below the random guess rate of 25% for an item
with four answer choices. (For men the comparable figures were 105 out of 332, or 32%.) The mean
GRE quantitative score of men who answered this item correctly was 706, while the mean score of men
who answered it incorrectly was 606; for women, the comparable scores were 595 and 551, respectively.
Thus, the women who got this item right were at about the same overall level as men who got it wrong.
One possible interpretation is that women were faster, on average, in this case because they gave up
sooner and made a random guess. Relatively short times to a correct solution on a multiple-choice test
may reflect either a high degree of mastery--or nonmastery with lucky guessing.
Study 2
Method
Data for the GRE analytical questions--the subject of Study 2--were obtained from a CAT item
pool that was administered in 1998. For the analytical section, examinees had 60 minutes to answer 35
questions. The analytical section has questions of two types. One type, logical reasoning (LR), consists
of discrete questions which test “the ability to understand, analyze, and evaluate arguments . . .
[including] recognizing the point of an argument, recognizing assumptions on which an argument is
based, drawing conclusions and formulating hypotheses, identifying methods of argument, evaluating
arguments and counterarguments, and evaluating evidence” (Briel, O’Neill, & Scheuneman, 1993, p. 10).
“Each item or group of logical reasoning items is based on a short argument or on a simple graph or
table, generally an excerpt from the kind of material graduate students are likely to encounter in their
academic and personal reading” (pp. 10-11).
The second major category of items in the GRE analytical measure, analytical reasoning (AR),
consists of item sets which test:
. . . the ability to understand a given structure of arbitrary relationships among fictitious persons, places, things or events, and to deduce information from the relationships given. Each analytical reasoning group consists of (1) a set of approximately three to seven related statements or conditions (and sometimes other explanatory material) describing the structure of relationships, and (2) three or more items that test understanding of that structure and its implications by requiring examinees to infer what is necessarily true or to determine what is or is not possible. (Briel, O’Neill, & Scheuneman, 1993, p. 10)
A given examinee is administered nine discrete LR items, plus 26 AR items arranged in six sets
(four four-item sets and two five-item sets). Each AR grouping consists of a problem stimulus with four
to eight associated questions, but any individual examinee would see only four or five of these questions,
and each would also likely see a different combination of questions. For example, one examinee might
see only questions 1, 3, 5, 7, and 8 from a given group, while another examinee would see only questions
2, 3, 4, 6, and 7. Two examinees may see the same items based on the same stimulus, but they could
receive them in a different order. In order to simplify the analyses, we decided to study in detail only the
10 groups that were administered to the largest number of examinees. Five of these groups were
associated with four-item AR sets, and the remaining five were associated with five-item AR sets.
Results
Position effects. We first evaluated the two five-item AR sets that were administered to each
examinee. Typically, an examinee would receive one five-item AR set in position 5-9 and the other five-
item set in position 25-29. Consider one problem statement (call it Stimulus A) with eight attached
questions. Stimulus A might be used in position 5-9 for some exarninees and position 25-29 for others. In
either position, a given examinee would be administered only 5 of the 8 possible questions, so
theoretically, 56 possible combinations of items could be generated from this one problem statement.
However, in practice, a relatively small number of combinations accounted for all of the patterns
actually administered. In position 5-9, 1,900 examinees received questions from Stimulus A. Four
combinations of items (call them sets A1-A4) accounted for all but 365 of these examinees. In position
25-29, only 832 examinees were administered questions from Stimulus A, but three of the same sets (A1,
A2, and A4) accounted for all but 99 examinees. Set A3 was administered to 113 examinees in position
5-9 but to only 4 examinees in position 25-29.
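The figure of 56 possible combinations is simply the number of ways to choose 5 questions from the 8 attached to the stimulus, which can be verified directly (Python, for illustration only; not part of the original analysis):

```python
from math import comb

# Choosing which 5 of the 8 attached questions an examinee sees:
# C(8, 5) = 8! / (5! * 3!) = 56 possible item combinations.
print(comb(8, 5))  # 56
```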
Table 8 shows the mean test scores and solution times for sets A1-A4 in both positions. We
included GRE quantitative score in the table because it is correlated with GRE analytical score (r = .68),
but would itself be uninfluenced by performance on any of these sets. In all four sets, the average amount
of time spent on the set was much shorter in the 25-29 position than in the 5-9 position. Consider set A4,
which was seen by relatively large samples of examinees in both positions. The samples of students who
were administered this set in the 5-9 and 25-29 positions were very comparable in terms of their mean
GRE analytical and quantitative scores, yet the sample of examinees who took this set in the latter
position spent six minutes less to answer it than the sample who took it in the earlier position.
This result could reflect a learning effect that would allow later sets to be answered more
quickly, or it could reflect hurrying because time was running short near the end of the test. A learning
effect would suggest that the number of items answered correctly in the set should be higher for those
who were administered the set in the latter position, while a hurrying effect would suggest that fewer
items should be answered correctly. As indicated in the last column of Table 8, for A4--and indeed for
every set--the number correct was lower when the set was administered in the later position, suggesting a
substantial hurrying effect. A similar pattern was noted with all of the other five-item sets studied. The
lack of position effects for the quantitative items and strong position effects for the AR items could result
from one or more differences between the two tests. Because AR is more speeded, greater position
effects would be expected. In addition, the task requirements of AR items may produce greater time
flexibility. AR sets require checking proposed solutions against a complex set of task requirements. If
time were running short, some of these checks could be skipped. However, in a quantitative problem, it
would be much more difficult to omit steps and still hope to get a correct answer; hence solution time
would be relatively constant whether hurried or not.
Time differences for paired sets. We paired five-item AR sets that were administered in the same
position (5-9), were of comparable difficulty (similar b values), and were administered to examinee
groups of comparable ability (as measured by GRE quantitative score). We then compared the mean time
examinees took to complete each pair of sets. Three sets from Group A were paired with three sets from
Group E. Statistics for the three A-E pairs are presented in Table 9. The table shows the IRT b parameter
for each question in each set listed in order from lowest to highest. This is not necessarily the order in
which the items were administered; there are many different administration orders possible for each set.
For each pair, response times were shortest for the set from Group E.
The most closely matched pair was A4-E7, with mean b values within .05 points and mean test
scores within 5 points (on the 200-800 GRE scale). Yet, on average, it took 78 seconds longer for
examinees to respond to set A4 than to set E7. An extra 78 seconds to spend on the rest of the test could
provide a significant advantage on an examination as speeded as the GRE analytical test. However, such
an advantage was not evident in the current data, as mean analytical scores were not systematically
higher for examinees who took items from Group E. Nevertheless, individuals who were unlucky enough
to have to take several long sets could still be disadvantaged relative to individuals who had several short
sets. (This issue is addressed more fully in a later section.)
In an ordinary, linear test, students of higher ability would generally be expected to be faster--
especially when speed is seen as an inherent feature of the construct (as it is for AR items). A linear
test is then often more speeded for lower ability students--that is, they have greater difficulty finishing in
the time allowed. However, as Table 9 shows, in this testing situation higher ability students (on the
bottom of the table) take substantially longer than the lower ability students (on the top of the table),
because higher ability students are administered more difficult items. Thus, in a CAT, the usual
relationship between ability and speededness can be totally reversed so that the test is more speeded for
higher ability students.
Similar analyses of the four-item AR sets provided further evidence that some sets can be
answered much more quickly than others. The most closely matched pair of four-item sets, in terms of
mean test scores, was administered in position 16-19. Mean GRE quantitative scores for the 225
examinees that comprised one group of the pair were within two points of the mean for the 165
examinees who made up the other group (653 and 655, with standard deviations of 96 and 104), but the
mean response time for one group was over two minutes longer than the mean response time of the other
group (473 seconds vs. 597 seconds, with standard deviations of 145 and 153, respectively). The mean
GRE analytical scores of these groups were very similar, 664 and 658 (SDs = 90 and 97) for the shorter
response-time and longer response-time groups, respectively.
Time differences for LR items. For the LR items, which are not administered in sets, Table 10
shows the spread of mean latencies for questions with approximately the same difficulty level and
position in the test. For example, the first line of the table is for very easy questions (b less than -1.5)
that were administered in positions 1-4. Of the seven questions that met these conditions, average
response times were 60-80 seconds for two items, 80-100 seconds for four items, and 100 to 120 seconds
for one item. The next row of the table refers to questions administered in positions 10 or 15. Most of the
items administered in one position were also administered in a different position, so that within a
difficulty range, the same items may appear on more than one row in the table. Some items were not
included in certain positions because they did not meet the minimum standard that means be based on at
least 20 examinees.
At each difficulty level, response times were notably shorter for positions 30 and 35, indicating
that at this point examinees were hurrying to complete the test. For each position in the test, response
times were longer for more difficult questions. For the easy and very easy questions in positions 1-4, only
5 out of 20 questions (25%) took over 100 seconds to complete, but for questions of medium difficulty or
harder (b > -0.5), 27 out of 40 (68%) questions took over 100 seconds to complete. For the easy and very
easy items, the range of mean response times in a given position was fairly narrow; at each position at
both of these difficulty levels, two adjacent time categories were sufficient to describe almost all of the
items. However, there was a considerably greater range in the mean times for the more difficult items--
even within a given position in the test. For example, mean times for the 13 items with b values over 1.5
administered in positions 1-4 ranged over six 20-second categories. Two questions in this difficulty level
had mean times greater than 140 seconds, and five items had mean times under 100 seconds. The same
pattern of decreasing times with later positions in the test that was noted for the less difficult items was
also found for these difficult items.
Impact of long tests on total scores. This range of mean times led again to the suspicion that
examinees who had more items with long mean times might be at a disadvantage. To test this, we first
determined the mean response time for each item. Because some items were administered more
frequently later in the test, and because items taken later are answered more quickly, the mean time for an
item was defined as the unweighted average of the times across positions in the test. Thus, mean time for
an item was independent of whether it happened to be administered primarily early or primarily late in
the test. For the first item in each AR set, an adjustment for time to read the stimulus was created by
comparing the time to answer an item when it was first in the set to the time for the same item when it
occurred later in the set. Next, we computed the expected mean time for each individual’s test by
summing the mean times for each item taken by that individual. Our hypothesis was that, after controlling
for general ability (using GRE quantitative score and GRE verbal score), GRE analytical scores would be
lower for examinees with the longest expected times. To test this, we ran a regression with GRE
analytical score as the dependent variable, entering GRE quantitative score and GRE verbal score at the
first step, and expected time at the second step.
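This kind of two-step (hierarchical) regression can be sketched as follows. The data here are synthetic and the variable names and coefficients are our own illustrative assumptions, not the study's data; the point is only the mechanics of entering the ability scores first and then checking how much expected time adds to R-square:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-ins for the real variables (illustration only):
quant = rng.normal(580, 100, n)        # GRE quantitative score
verbal = rng.normal(540, 100, n)       # GRE verbal score
exp_time = rng.normal(2400, 300, n)    # summed expected item times (seconds)
analytic = 0.3 * quant + 0.2 * verbal + 0.1 * exp_time + rng.normal(0, 60, n)

def r_squared(X, y):
    """R-square from an ordinary least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

# Step 1: ability scores only; Step 2: add expected time.
r2_step1 = r_squared(np.column_stack([quant, verbal]), analytic)
r2_step2 = r_squared(np.column_stack([quant, verbal, exp_time]), analytic)
print(round(r2_step1, 2), round(r2_step2, 2))
```

The change in R-square between the two steps is the quantity of interest: a significant increase at step 2 means expected time predicts the analytical score beyond general ability.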
Our hypothesis was not supported. Indeed, the opposite was true. Expected time had a significant
positive weight, and increased R-square from .52 to .66. Standardized weights were .30, .16, and .51 for
GRE quantitative score, GRE verbal score, and expected time, respectively. Thus, examinees who took
tests that should have taken longer got higher scores.
Next, we looked to see if the expected negative relationship might emerge within a narrow ability
range. We selected only the 862 examinees with GRE verbal plus GRE quantitative scores in the 1,030-
1,060 range. Within this group, as expected, GRE verbal score and GRE quantitative score were no
longer significant predictors of GRE analytical score, but expected time still had a substantial positive
weight, increasing R-square from .00 to .36. Similar results were found for a low-scoring group (650-700
range) and a high-scoring group (1,300-1,350 range), and for analyses run separately for expected times
on sets and on discrete items.
Apparently, these positive weights for expected time remained because of the relationship
between item difficulty and expected time (r = .52)--that is, more difficult items take longer. With a
CAT, more difficult items are administered to higher ability examinees, so higher ability examinees
generally get tests that take longer. Adjusting for GRE verbal score and GRE quantitative score, which
were correlated .71 with GRE analytical score (a substantial correlation, but still leaving half of the
variance unexplained), was not enough to overcome this relationship of time to item difficulty and of
item difficulty to test score.
In an attempt to compensate for the relationship of time to difficulty, we used the IRT b
parameter to predict mean response time separately for LR and AR items. Each item then had a predicted
time associated with it. We computed a time discrepancy score for each item as the difference between
the time predicted from the item’s difficulty and the actual mean time. The time discrepancy score for an
individual was the sum of the time discrepancy scores for all of the items taken by that individual. A high
time discrepancy score would then reflect a test that was especially long, taking difficulty into account.
For both the LR and AR items, the time discrepancy score was entered after GRE quantitative score and
GRE verbal score, with GRE analytical score as the dependent variable. The time discrepancy score had
a small positive weight, which did not contribute significantly (p > .05) to the prediction, even with the
sample of over 10,000 examinees. Thus, there was no evidence that examinees who got long tests, taking
item difficulty into account, were disadvantaged in terms of their total scores.
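The construction of the time discrepancy score can be sketched as below. The item pool, sample sizes, and regression of mean time on the b parameter are synthetic assumptions for illustration; the sign convention follows the report's interpretation that a high discrepancy score reflects an unexpectedly long test given item difficulty:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical item pool: IRT b parameter and observed mean time (seconds).
b = rng.normal(0.0, 1.0, 200)
mean_time = 90 + 25 * b + rng.normal(0, 15, 200)   # harder items take longer

# Predict mean response time from difficulty (simple linear regression).
slope, intercept = np.polyfit(b, mean_time, 1)
predicted = intercept + slope * b

# Item time discrepancy: actual mean time minus difficulty-predicted time,
# so positive values mark items that run long for their difficulty.
discrepancy = mean_time - predicted

# An examinee's score is the sum of discrepancies over the items taken.
items_taken = rng.choice(200, size=35, replace=False)
examinee_score = discrepancy[items_taken].sum()
print(round(examinee_score, 1))
```

Because the discrepancies are residuals from the time-on-difficulty regression, they average to zero over the pool, which is what makes the summed score an index of test length net of difficulty.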
For the above analyses, expected times were based on the full set of 35 items, and examinees
with incomplete tests were excluded from the analysis. An additional analysis was run that was able to
include all examinees by defining a long test only in terms of the first nine items (4 LR items and the first
five-item AR set). Mean item times were recomputed to reflect only performance on these items in
positions 1-9, and as before, time discrepancy scores were defined as the difference between the time
predicted from the item’s difficulty and the actual mean time. The time discrepancy score for an
individual was the sum of the time discrepancy scores for each of the items (1-9) taken by that individual.
The distribution of these summed time discrepancy scores for individuals indicated a difference
of about one minute between times at the 25th and 75th percentiles. This summed time discrepancy score
had a small negative weight for predicting the number of questions attempted after GRE verbal score and
GRE quantitative score were entered; though statistically significant in this very large sample of 14,745
examinees, the change in R was very small (.015 to .066, or a change in R-square from .000 to .004).
The more important question was whether the summed time discrepancy score had a significant
negative weight for predicting the GRE analytical score. It did not. It had a positive weight; although
statistically significant, the change in R-square was infinitesimal (from .459 to .460). This analysis was
repeated separately for male and female examinees, and for African American, Asian, Hispanic/Latino,
and White subgroups. Results were remarkably consistent across these subgroups with a small positive
weight in each subgroup.
Conclusion
For both the GRE quantitative and analytical measures, some items clearly take longer to answer
than others. Because estimated solution time is not included in the item selection algorithm, this
difference in time could potentially create a fairness problem on a timed test. However, we could find no
evidence of an impact on total scores for examinees who got unusually long questions or tests.
Despite this lack of an overall effect on test scores, we cannot rule out the possibility that certain
individuals could be negatively impacted by receiving items that take an unusually long time to answer.
Even without clear evidence of an impact on test scores, it might be desirable to include some measure of
estimated solution time in the item selection algorithm so that no individual gets more than a fair share of
time-consuming items. The feasibility of such an approach has already been demonstrated (van der
Linden, Scrams, & Schnipke, 1999).
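In outline, a time-aware selection rule might look like the following sketch. This is not van der Linden, Scrams, and Schnipke's actual constraint model; the item pool, time budget, per-item budget share, and one-parameter-logistic information function are all illustrative assumptions:

```python
import math
import random

random.seed(0)

# Hypothetical item pool: each item has an IRT b parameter and an
# estimated mean response time in seconds (both made up here).
pool = [{"b": random.gauss(0, 1), "time": random.gauss(90, 20)}
        for _ in range(100)]

def info(b, theta):
    """Fisher information of a 1PL item at ability theta."""
    p = 1 / (1 + math.exp(-(theta - b)))
    return p * (1 - p)

def pick_next(pool, used, theta, time_left, items_left):
    """Pick the most informative unused item whose estimated time fits
    the per-item share of the remaining time budget."""
    budget = time_left / items_left
    candidates = [i for i in range(len(pool))
                  if i not in used and pool[i]["time"] <= budget]
    if not candidates:  # relax the time constraint rather than stall
        candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: info(pool[i]["b"], theta))

used = set()
time_left = 35 * 90.0  # assumed overall budget: 35 items at 90 seconds
for k in range(35, 0, -1):
    i = pick_next(pool, used, theta=0.0, time_left=time_left, items_left=k)
    used.add(i)
    time_left -= pool[i]["time"]
print(len(used))  # 35 items selected
```

The design choice this sketch illustrates is the trade-off the report implies: the selector filters on estimated time before maximizing information, so no examinee accumulates more than a fair share of time-consuming items.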
However, the current results suggest that including such estimates may be more complicated than
previously imagined, because an item, or a set of items, does not have a single response time, but rather
many response times, depending on the position in which it is administered in the test. This is especially
true for AR items. Suppose the time needed for a five-question AR set were estimated from a pretest that
administered the set near the end of the test, but the set was actually administered to some examinees
near the beginning of the test. The estimated time could be in error by more than six minutes.
Knowledge of position effects might lead to appropriate adjustments in estimated time depending on
position. In addition to clarifying these position effects, future research could investigate possible
relationships between individual differences in pacing styles (of the type described by Scrams and
Schnipke, 1999) and the time demands created by particular sets of items.
References
Bejar, I. I. (1985). Test speededness under number-right scoring: An analysis of the Test of English as a Foreign Language (Research Report RR-85-11). Princeton, NJ: Educational Testing Service.
Briel, J. B., O’Neill, K. A., & Scheuneman, J. D. (1993). GRE technical manual. Princeton, NJ: Educational Testing Service.
Bridgeman, B., Anderson, D., & Wightman, L. (1997). Overview of the GMAT CAT Pilot Test. Unpublished report.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.
Schnipke, D. L., & Scrams, D. J. (1997). Modeling item response times with a two-state mixture model: A new method of measuring speededness. Journal of Educational Measurement, 34, 213-232.
Scrams, D. J., & Schnipke, D. L. (1999). Response time feedback on computer-administered tests. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal.
van der Linden, W. J., Scrams, D. J., & Schnipke, D. L. (1999). Using response-time constraints to control for differential speededness in computerized adaptive testing. Applied Psychological Measurement, 23, 195-210.
Yamamoto, K. (1995). Estimating the effects of test length and test time on parameter estimates using the HYBRID model (TOEFL Technical Report No. TR-10). Princeton, NJ: Educational Testing Service.
Table 1. Number of GRE Quantitative Items in Response-Time Categories by Item Type
Mean Time in 20-Second Intervals Item 0- 20- 40- 60- 80- 100- 120- 140- 160- Type 20 40 60 80 100 120 140 160 180 >180 Total
1 QC,P,Arith. 1 7 18 5 2 QC,P,Alg. 3 11 15 3 QC,P,Geom. 3 9 8 4 QC,P,DI 3 2 5 QC,R,Arith. 2 3 6 QC,R,Alg. 1 3 7 QC,R,Geom. 5 8 QC,R,DI 3 9 PS,P,Arith. 2 3 5 10 PS,P,Alg. 2 2 11 PS,P,Geom. 3 8 12 PS,P,DI 2 1 13 PS,R,Arith. 3 14 PS,R,Alg. 1 15 PS,R,Geom. 2 16 PS,R,DI 2
2 10 8 6
3 1 3
1
3
2 2 1 1 1
3 2 1 4
1 2
36 38 24 8 9 8 6 5
1 26 2 1 20 1 1 25
7 1 9
1 2 10 5 8
Total 1 15 54 68 52 23 8 13 5 5 244
Note. QC = quantitative comparison; PS = problem solving; P = pure (numbers only); R = real; DI = data interpretation.
Table 2. Number of Easy (-1.5 < b < -0.5) GRE Quantitative Items in Response-Time Categories
by Item Type
Item Type
Mean Time in 20-Second Intervals 0- 20- 40- 60- 80- 100- 120- 140- 160- 20 40 60 80 100 120 140 160 180 >180 Total
1 QC,P,Arith. 2 QC,P,Alg. 3 QC,P,Geom. 4 QC,P,DI 5 QC,R,Arith. 6 QC,R,Alg. 7 QC,R,Geom. 8 QC,R,DI 9 PS,P,Arith. 10 PS,P,Alg. 11 PS,P,Geom. 12 PS,P,DI 13 PS,R,Arith. 14 PS,R,Alg. 15 PS,R,Geom. 16 PS,R,DI
5 5 4
3 2 1
1 1
1 2 1 2 1
2
2 1
1 1 2
1
3 3 3
1 2 1
1
1
11 6 6 1 2 3 1 0 8 5 7 2 3 1 0 3
Total 0 9 21 11 15 2 1 59
Note. QC = quantitative comparison; PS = problem solving; P = pure; R = real; DI = data interpretation.
Table 3. Number of Medium-Difficulty (-0.5 < b < 0.5) GRE Quantitative Items in Response-Time
Categories by Item Type
Item Type
Mean Time in 20-Second Intervals 0- 20- 40- 60- 80- 100- 120- 140- 160- 20 40 60 80 100 120 140 160 180 >180 Total
1 QC,P,Arith. 2 QC,P,Alg. 3 QC,P,Geom. 4 QC,P,DI 5 QC,R,Arith. 6 QC,R,Alg. 7 QC,R,Geom. 8 QC,R,DI 9 PS,P,Arith. 10 PS,P,Alg. 11 PS,P,Geom. 12 PS,P,DI 13 PS,R,Arith. 14 PS,R,Alg. 15 PS,R,Geom. 16 PS,R,DI
1 9 3 2 6 8
3 5 2 2
1 1 4
1 1 2 1
1 5 1
2
1 2
2
13 18 8 4
1 3 3 4 2
1 7 1 1 1 7 2 9
1 1 1
1 1 1 3 2
1 1 2 Total 0 4 22 33 15 6 2 3 1 1 87
Note. QC = quantitative comparison; PS = problem solving; P = pure; R = real; DI = data interpretation.
Table 4. Number of Difficult (0.5 < b < 1.5) GRE Quantitative Items in Response-Time
Categories by Item Type
Mean Time in 20-Second Intervals Item 0- 20- 40- 60- 80- 100- 120- 140- 160- Type 20 40 60 80 100 120 140 160 180 >180 Total
1 QC,P,Arith. 2 QC,P,Alg. 3 QC,P,Geom. 4 QC,P,DI 5 QC,R,Arith. 6 QC,R,Alg. 7 QC,R,Geom. 8 QC,R,DI 9 PS,P,Arith. 10 PS,P,Alg. 11 PS,P,Geom. 12 PS,P,DI 13 PS,R,Arith. 14 PS,R,Alg. 15 PS,R,Geom. 16 PS,R,DI
Total
2 1 4 5 3
3 2 2 1 1
1 4 1 2
1
1 1 1 1
1 1
2 1
0 0 6 11 18 1
6 5
7 9 8 3 0 1 0 1
1 5 2 1 6
1 5 1 1
1 2 3
3 1 2 5 3 1 55
Note. QC = quantitative comparison; PS = problem solving; P = pure; R = real; DI = data interpretation.
Table 5. Comparison of Two Category-10 Items

Comparison Characteristics                                        Item A     Item B
Difficulty (IRT b parameter)                                      0.30       0.39
Number answering correctly                                        665        599
Number answering incorrectly                                      440        414
Mean GRE-quantitative score for examinees answering correctly     641 (69)   645 (71)
Mean GRE-quantitative score for examinees answering incorrectly   544 (74)   555 (63)
Mean time, in seconds, to correct answer                          58 (33)    153 (89)
Mean time, in seconds, to wrong answer                            70 (48)    162 (89)
Mean time, correct, position 15 (n = 168 for A and 134 for B)     53 (25)    162 (75)
Mean time, wrong, position 15 (n = 106 for A and 89 for B)        69 (40)    161 (88)

Note. Category 10: PS, pure, algebra.
Table 6. Comparison of Two Category-2 Items

Comparison Characteristics                                        Item C     Item D
Difficulty (IRT b parameter)                                      0.40       0.39
Number answering correctly                                        372        592
Number answering incorrectly                                      181        198
Mean GRE-quantitative score for examinees answering correctly     685 (79)   660 (68)
Mean GRE-quantitative score for examinees answering incorrectly   643 (63)   584 (68)
Mean time, in seconds, to correct answer                          94 (35)    30 (23)
Mean time, in seconds, to wrong answer                            74 (47)    51 (30)
Mean time, correct, position 18 (n = 137 for C and 230 for D)     101 (40)   32 (28)
Mean time, wrong, position 18 (n = 56 for C and 90 for D)         81 (37)    55 (30)

Note. Category 2: QC, pure, algebra.
Table 7. Test-Developer Ratings of Response Times for Seven Category-10 Items

Mean Time (SD)   Median Time   Actual Rank   Rater 1   Rater 2   Rater 3
58 (33)          48            1             3         1         3
77 (50)          58            2             1         2         1
77 (46)          66            3             4         3         4
90 (54)          80            4             6         6         6
123 (77)         118           5             2         4         2
136 (64)         125           6             5         7         7
153 (89)         143           7             7         5         5

Note. Category 10: PS, pure, algebra.
Table 8. Means and Standard Deviations for Examinees Taking Item Sets in Group A, Position 5-9 and
Position 25-29

                               Mean (SD)
Set   Position   n     GRE-analytical   GRE-quantitative   Time in Seconds   Number Correct in Set
A1    5-9        563   442 (99)         461 (131)          565 (261)         2.1 (1.4)
      25-29      204   397 (51)         437 (124)          296 (200)         1.6 (1.2)
A2    5-9        223   533 (93)         522 (116)          663 (206)         2.8 (1.4)
      25-29      40    489 (66)         520 (91)           280 (240)         0.7 (0.8)
A3    5-9        113   527 (95)         528 (112)          637 (206)         3.3 (1.3)
      25-29      4     500 (29)         478 (15)           483 (107)         2.5 (1.3)
A4    5-9        636   593 (100)        582 (116)          782 (226)         4.1 (1.0)
      25-29      489   598 (59)         575 (100)          402 (228)         2.4 (1.6)
Table 9. Response Times for Group A and Group E Sets in Position 5-9 and of Comparable Difficulty

           Item Difficulty (b)                                Mean (SD)
Set   Q1      Q2      Q3      Q4     Q5     Mean    n     GRE-quantitative   GRE-analytical   Time         Median Time
A1    -1.21   -1.20   -0.97   0.28   0.30   -0.56   563   461 (131)          422 (99)         565 (261)    560
E1    -1.58   -0.60   -0.14   0.61   0.31   -0.28   378   472 (126)          449 (85)         544 (213)    521
A2    -1.20   -0.97   0.23    0.30   0.59   -0.21   223   522 (116)          533 (93)         663 (206)    650
E4    -1.58   -0.60   -0.14   0.16   0.59   -0.31   262   540 (115)          529 (93)         615 (205)*   579
A4    -0.97   0.23    0.30    0.59   0.72   0.17    636   582 (116)          593 (100)        782 (226)    752
E7    -0.60   -0.14   0.16    0.59   0.60   0.12    262   578 (116)          588 (97)         704 (221)*   672

*p < .02
Table 10. Number of Logical Reasoning Items in Response-Time Categories by Position and Level of Difficulty

Position    Mean Time in 20-Second Intervals
            0-20  20-40  40-60  60-80  80-100  100-120  120-140  140-160  160-180

Very Easy Items (b less than -1.5)
1 to 4      2 4 1
10 and 15   1 2 5
20          3 2
30          3 4
35          2 5 1

Easy Items (b between -1.5 and -0.5)
1 to 4      1 8 4
10 and 15   1 6 5
20          3 8
30          2 10
35          1 11

Moderately Difficult Items (b between -0.5 and +0.5)
1 to 4      4 8 3
10 and 15   8 9 1
20          3 12 4
30          1 16 1
35          1 10 7

Difficult Items (b between +0.5 and +1.5)
1 to 4      4 6 2
10 and 15   3 7 3
20          2 4 5 1
30          1 11
35          2 8 1

Very Difficult Items (b greater than +1.5)
1 to 4      1 4 1 5 1 1
10 and 15   1 3 3 4 5 1
20          2 6 4 6 1
30          4 6 3 1
35          1 6 6
[Figure 1 is a scatterplot of item position (x-axis, 0 to 30) against mean response time in seconds (y-axis, 0 to 300). Positions 9, 10, 22, and 23 were for excluded Data Interpretation sets.]

Figure 1. Scatterplot of item position by mean response time
[Figure 2 is a scatterplot of item difficulty (IRT b parameter, -5 to 3 on the x-axis) against mean response time in seconds (0 to 300 on the y-axis).]

Figure 2. Scatterplot of item difficulty by mean response time