A Comparison of Free-Response and Multiple-Choice Forms of Verbal Aptitude Tests

William C. Ward
Educational Testing Service

Three verbal item types employed in standardized aptitude tests were administered in four formats: a conventional multiple-choice format and three formats requiring the examinee to produce rather than simply to recognize correct answers. For two item types, Sentence Completion and Antonyms, the response format made no difference in the pattern of correlations among the tests. Only for a multiple-answer open-ended Analogies test were any systematic differences found; even the interpretation of these is uncertain, since they may result from the speededness of the test rather than from its response requirements. In contrast to several kinds of problem-solving tasks that have been studied, discrete verbal item types appear to measure essentially the same abilities regardless of the format in which the test is administered.

Tests in which an examinee must generate answers may require different abilities than do tests in which it is necessary only to choose among alternatives that are provided. A free-response test of behavioral science problem solving, for example, was found to have a very low correlation with a test employing similar problems presented in a machine-scorable (modified multiple-choice) format; it differed from the latter in its relations to a set of reference tests for cognitive factors (Ward, Frederiksen, & Carlson, 1980). Comparable differences were obtained between free-response and machine-scorable tests employing nontechnical problems, which were designed to simulate tasks required in making medical diagnoses (Frederiksen, Ward, Case, Carlson, & Samph, 1981).

There is also suggestive evidence that the use of free-response items could make a contribution in standardized admissions testing. The open-ended behavioral science problems were found to have some potential as predictors of the professional activities and accomplishments of first-year graduate students in psychology; the Graduate Record Examination Aptitude and Advanced Psychology tests are not good predictors of such achievements (Frederiksen & Ward, 1978).

Problem-solving tasks like these, however, provide very inefficient measurement. They require a large investment of examinee time to produce scores with acceptable reliability, and they yield complex responses, the evaluation of which is demanding and time consuming. It was the purpose of the present investigation to explore the effects of an open-ended format with item types like those used in conventional examinations.

The content area chosen was verbal knowledge and verbal reasoning, as represented by the item types Antonyms, Sentence Completion, and Analogies. The selection of these item types has several bases.

First, their relevance for aptitude assessment needs no special justification, given that they make up one-half of verbal ability tests such as the Graduate Record Examination and the Scholastic Aptitude Test (SAT). Thus, if it can be shown that recasting these item types into an open-ended format makes a substantial difference in the abilities they measure, a strong case will be made for the importance of the response format in determining the mix of items that enter into such tests. Second, such item types produce reliable scores with relatively short tests. Finally, open-ended forms of these item types require only single-word or, in the case of Analogies, two-word answers. They should thus be relatively easy to score, in comparison with free-response problems whose responses may be several sentences in length and may embody two or three separate ideas. Although not solving the difficulties inherent in the use of open-ended items in large-scale testing, therefore, they would serve to some extent to reduce their magnitude.

Surprisingly, no published comparisons of free-response and multiple-choice forms of these item types are available. Several investigators have, however, examined the effects of response format on Synonyms items, items in which the examinee must choose or generate a word with essentially the same meaning as a given word (Heim & Watts, 1967; Traub & Fisher, 1977; Vernon, 1962). All found high correlations across formats, but only Traub and Fisher attempted to answer the question of whether the abilities measured in the two formats were identical or only closely related. They concluded that the response format does affect the attribute measured by the test and that there was some evidence of a factor specific to open-ended verbal items. Unfortunately, they did not have scores on a sufficient variety of tests to provide an unambiguous test for the existence of a free-response verbal factor.

The present study was designed to allow a factor-analytic examination of the influence of response format. Each of three item types was given in each of four formats, which varied in the degree to which they require the production of answers.

It was thus possible to examine the fit of the data to each of two "ideal" patterns of factor structure: one in which only item-type factors would be found, indicating that items of a given type measure essentially the same thing regardless of the format; and one involving only format factors, indicating that the response requirements of the task are of greater importance than are differences in the kind of knowledge being tested.

Method

Design of the Tests

Three item types were employed. Antonyms (as given in the standard multiple-choice format) required the examinee to select the one of five words that was most nearly opposite in meaning to a given word. Sentence Completions required the identification of the one word which, when inserted into a blank space in a sentence, best fit the meaning of the sentence as a whole. Analogies, finally, asked for the selection of the pair of words expressing a relationship most similar to that expressed in a given pair.

Three formats in addition to the multiple-choice one were used. For Antonyms, for example, the "single-answer" format required the examinee to think of an opposite and to write that word in an answer space. The "multiple-answer" format was still more demanding: the examinee was to think of and write up to three different opposites for each word given. Finally, the "keylist" format required the examinee to think of an opposite, to locate this word in a 90-item alphabetized list, and to record its number on the answer sheet. This latter format was included as a machine-scorable approximation to a truly free-response test.

With two exceptions, all items were ones requiring single-word answers. The exceptions were the single-answer and multiple-answer Analogies tests. Here the examinee was

required to produce pairs of words having the same relationship to one another as that shown by the two words in the stem of the question.

Instructions for each test closely paraphrased those employed in the GRE Aptitude Test, except as dictated by the specific requirements of each format. With each set of instructions was given one sample question and a brief rationale for the answer or answers suggested. For the open-ended tests, two or three fully acceptable answers were provided for each sample question.

The tests varied somewhat in number of items

and in time limits. Each multiple-choice test consisted of 20 items to be completed in 12 minutes. Slightly longer times (15 minutes) were allowed for forms including 20 single-answer or 20 keylist items. The multiple-answer forms allowed still more time per item: 15 minutes for 15 Antonyms or Analogies items or for 18 Sentence Completion items. On the basis of extensive pretesting, it was judged that these time limits would be sufficient to avoid problems of test speededness and that the numbers of items would be sufficient to yield scores with reliabilities on the order of .7.

Test Administration

Subjects were 315 paid volunteers from a state university. Slightly more than two-thirds were juniors and seniors. The small number (13%) for whom GRE Aptitude Test scores were obtained were a somewhat select group, with means of 547 and 616 on the Verbal and Analytical scales, respectively. It appears that the sample is a somewhat more able one than college students in general but probably less select than the graduate school applicant pool.

Each student participated in one 4-hour testing session. Included in the session were 12 tests representing all combinations of the three item types with the four response formats, as well as a brief questionnaire relating to the student's academic background, accomplishments, and interests.

The tests were presented in a randomized order, subject to the restriction that no two successive tests should share either the same item type or the same response format. Four systematic variations of this order were employed to permit an examination of, and adjustment for, possible practice or fatigue effects. Each of the first four groups tested, including 51 to 60 subjects, received the tests in one of these sequences; the remainder of the sample, tested in groups of 30 to 40, were all given the tests in the first of the four orders.

Scoring

For each of the open-ended tests, scoring keys were developed that distinguished two degrees of appropriateness of an answer. Answers in one set were judged fully acceptable, while those in the second were of marginal appropriateness. An example of the latter would be an Antonyms response that identified the evaluation implied by a word but failed to capture an important nuance or the force of the evaluation. It was determined through a trial scoring that partial credits were unnecessary for two of the keylist tests, Antonyms and Analogies. Responses to the remaining tests were coded to permit computer generation of several different scores, depending on the credit to be given to marginally acceptable answers.

Preliminary scoring keys were checked for adequacy by an examination of about 20% of the answer sheets. Most of the tests were then scored by a highly experienced clerk and checked by her supervisor. Two tests, however, presented more complex scoring problems. For both single-answer and multiple-answer Analogies, the scoring keys consisted of rationales and examples rather than a listing of possible answers. Many decisions therefore involved a substantial exercise of judgment. A research assistant scored each of these tests, and the author scored 25 answer sheets of each independently. Total scores derived from the two

scorings correlated .95 for one test and .97 for the other.

Results

Quality of data. No instances were found in which subjects appeared not to take their task seriously. Three answer sheets were missing or spoiled; sample mean scores were substituted for these. On 32 occasions a subject failed to attempt at least half the items on a test, but no individual subject was responsible for more than two of these. It appeared that data from all subjects were of acceptable quality.

Score derivation. The three multiple-choice

tests were scored using a standard correction for guessing: for a five-choice item, the score was number correct minus one-fourth the number incorrect. Two of the keylist tests were simply scored for number correct. It would have been

possible to treat those tests as 90-alternative,

multiple-choice tests and to apply the guessing correction, but the effect on the scores would have been of negligible magnitude.
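As a minimal sketch (not the scoring program actually used in the study), the guessing correction, and the negligible effect of treating a keylist item as a 90-alternative choice, can be expressed as follows; the function name and the example numbers are hypothetical.

```python
def formula_score(num_correct: int, num_incorrect: int, num_choices: int = 5) -> float:
    """Standard correction for guessing: rights minus wrongs/(choices - 1).

    For a five-choice item this is R - W/4; omitted items contribute nothing.
    """
    return num_correct - num_incorrect / (num_choices - 1)

# Hypothetical example: 14 right, 4 wrong, 2 omitted on a 20-item five-choice test.
print(formula_score(14, 4))       # 13.0
# Treating a keylist item as a 90-alternative choice changes the penalty to W/89,
# a difference of negligible magnitude:
print(formula_score(14, 4, 90))   # about 13.96
```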

For the remaining tests, scores were generated in several ways. In one, scoring credit was given only for answers deemed fully acceptable; in a second, the same credit was given to both fully and marginally acceptable answers; and in a third, marginal answers received half the credit given to fully acceptable ones. This third approach was found to yield slightly more reliable scores than either of the others and was therefore employed for all further analyses.
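A minimal sketch of the three scoring rules, assuming each response has already been coded as fully acceptable, marginal, or wrong; the coding labels and the sample answer sheet are hypothetical.

```python
from typing import Iterable

# Each response is assumed to be pre-coded as "full", "marginal", or "wrong".
CREDIT_RULES = {
    "strict":  {"full": 1.0, "marginal": 0.0, "wrong": 0.0},  # fully acceptable only
    "lenient": {"full": 1.0, "marginal": 1.0, "wrong": 0.0},  # marginal = full credit
    "half":    {"full": 1.0, "marginal": 0.5, "wrong": 0.0},  # rule retained in the study
}

def test_score(coded_responses: Iterable[str], rule: str = "half") -> float:
    """Sum item credits for one answer sheet under the chosen scoring rule."""
    credits = CREDIT_RULES[rule]
    return sum(credits[code] for code in coded_responses)

# Hypothetical answer sheet with ten coded responses.
sheet = ["full", "marginal", "wrong", "full", "full",
         "marginal", "wrong", "full", "marginal", "full"]
print(test_score(sheet, "strict"), test_score(sheet, "lenient"), test_score(sheet, "half"))
# 5.0 8.0 6.5
```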

Test order. Possible differences among groups receiving the tests in different orders were examined in two ways. One analysis was concerned with the level of performance; another considered the standard error of measurement, a statistic that combines information about both the standard deviation and the reliability of a test score and that indicates the precision of measurement. In neither case were there systematic differences associated with the order in which the tests were administered. Order was therefore ignored in all further analyses.
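For reference, a small sketch of the standard error of measurement as defined from a score's standard deviation and reliability; the numbers below are hypothetical and are not values from Table 1.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability); smaller values indicate more precise scores."""
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical example: a test with SD = 4.0 and coefficient alpha = .70.
print(round(standard_error_of_measurement(4.0, 0.70), 2))  # 2.19
```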

Test difficulty. Test means and standard deviations are shown in Table 1. Most of the tests were of middle difficulty for this sample; two of the keylist tests were easy, whereas multiple-choice Antonyms was very difficult. Means for the multiple-answer tests were low in relation to the maximum possible score but represent one to one-and-a-half fully acceptable answers per item.

Test speededness. Tests such as the GRE Aptitude Test are considered unspeeded if at least 75% of the examinees attempt all items and if virtually everyone attempts at least three-fourths of the items. By these criteria only one of the tests, multiple-answer Analogies, had any problems with speededness: About 75% of the sample reached the last item, but 14% failed to attempt the 12th item, which represents the three-fourths point. For all the remaining tests, 95% or more of the subjects reached at least all but the final two items. Table 1 shows the percent of the sample completing each test.
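A sketch of the speededness screen described above, applied to hypothetical attempt counts rather than to the study's actual response records.

```python
import math

def speededness_check(items_attempted: list[int], num_items: int) -> dict:
    """Apply the two conventional criteria for an unspeeded test: at least 75% of
    examinees attempt every item, and virtually all examinees attempt at least
    three-fourths of the items (the 12th item on a 15-item test)."""
    n = len(items_attempted)
    three_fourths_item = math.ceil(num_items * 0.75)
    return {
        "pct_attempting_all": 100 * sum(a >= num_items for a in items_attempted) / n,
        "pct_attempting_three_fourths": 100 * sum(a >= three_fourths_item for a in items_attempted) / n,
    }

# Hypothetical data: items attempted by each of five examinees on a 15-item test.
print(speededness_check([15, 15, 12, 15, 11], num_items=15))
```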

Test reliability. Reliabilities (coefficient alpha) are also shown in Table 1. They ranged from .45 to .80, with a median of .69. There were no differences in reliabilities associated with the response format of the test; the medians ranged from .68 for multiple-choice tests to .75 for multiple-answer forms. There were differences associated with item type; medians were .75 for Antonyms, .71 for Sentence Completions, and .58 for Analogies. The least reliable of all the tests was the multiple-choice Analogies. The differences apparently represent somewhat less success in creating good analogies items rather than any differences inherent in the open-ended formats.
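A minimal sketch of coefficient alpha computed from an examinees-by-items score matrix; the toy data are invented and far smaller than the actual samples.

```python
import numpy as np

def coefficient_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a matrix of shape (examinees, items):
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5 examinees x 4 items of 0/1 (or partial-credit) scores.
toy = np.array([[1, 1, 0.5, 1],
                [0, 1, 0.0, 0],
                [1, 1, 1.0, 1],
                [0, 0, 0.5, 0],
                [1, 0, 1.0, 1]])
print(round(coefficient_alpha(toy), 2))  # about 0.74
```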

Relations Among the Tests

Correlations among tests. Zero-order correlations among the 12 tests are shown in the upper part of Table 2. The correlations ranged from .29 to .69, with a median of .53. The seven lowest coefficients in the table, the only ones below .40, are correlations involving the multiple-answer Analogies test.

Table 1

Descriptive Statistics for Tests

Table 2

Zero-Order and Attenuated Correlations Among Tests

Decimal points omitted. Zero-order correlations are presented above the main diagonal, while correlations corrected for attenuation are presented below.

Correlations corrected for attenuation are shown in the lower part of the table; the correction is based on the coefficient alpha reliabilities. The corrected correlations ranged from .45 to .97 and have a median of .80.
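A small sketch of the correction for attenuation applied to each coefficient, using hypothetical values rather than entries from Table 2.

```python
import math

def correct_for_attenuation(r_xy: float, alpha_x: float, alpha_y: float) -> float:
    """Disattenuate an observed correlation using the two tests' coefficient alpha
    reliabilities: r_corrected = r_xy / sqrt(alpha_x * alpha_y)."""
    return r_xy / math.sqrt(alpha_x * alpha_y)

# Hypothetical example: observed r = .53 between tests with alphas .75 and .69.
print(round(correct_for_attenuation(0.53, 0.75, 0.69), 2))  # about 0.74
```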

These coefficients indicate that the various tests share a substantial part of their true variance, but they do not permit a conclusion as to whether there are systematic differences among the tests. Three analyses that address this question are presented below.

Factor analyses. A preliminary principal components analysis produced the set of eigenvalues displayed in Table 3.

Table 3

Principal Components of the Correlation Matrix

The first component was very large, accounting for 57% of the total variance, while the next largest accounted for only 7% of the variance. By one rule of thumb for determining the number of factors, that of the number of eigenvalues greater than one, there is only a single factor represented in these results. By another, that of the differences in magnitude of successive eigenvalues, there is some evidence for a second factor but none at all for more than two.
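A sketch of the two rules of thumb applied to a vector of eigenvalues. The eigenvalues below are invented for illustration (chosen only so that they sum to 12, as the eigenvalues of a 12-variable correlation matrix must); they are not the values in Table 3.

```python
import numpy as np

def kaiser_count(eigenvalues: np.ndarray) -> int:
    """Number of factors by the eigenvalues-greater-than-one rule."""
    return int((eigenvalues > 1.0).sum())

def scree_gaps(eigenvalues: np.ndarray) -> np.ndarray:
    """Differences between successive eigenvalues; one large early gap followed by
    uniformly small gaps is evidence for a single dominant factor."""
    return -np.diff(eigenvalues)

# Hypothetical eigenvalues for a 12-variable correlation matrix.
eigs = np.array([6.8, 0.9, 0.7, 0.6, 0.55, 0.5, 0.45, 0.4, 0.35, 0.3, 0.25, 0.2])
print(kaiser_count(eigs))             # 1
print(np.round(scree_gaps(eigs), 2))  # the first gap dwarfs all the others
```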

It was originally planned to use a confirmatory factor analytic approach to the analysis (Jöreskog, 1970) in order to contrast two idealized models of test relations, one involving three item-type factors and one involving four response-format factors. In view of the results of the principal components analysis, however, either of these would clearly be a distortion of the data. It was decided, therefore, to use an exploratory factor analysis, which could be followed by confirmatory analyses comparing simpler models if such a comparison seemed warranted by the results. The analysis was a principal axes factor analysis with iterated communalities.

A varimax (orthogonal) rotation of the two-factor solution produced unsatisfactory results: 10 of the 12 scores had appreciable loadings on both factors. The results of the oblimin (oblique) rotation for two factors are presented in Table 4. The two factors were correlated (r = .67). Ten of the 12 scores had their highest loading on Factor I, one (single-answer Analogies) divided about equally between the two, and only one (multiple-answer Analogies) had its highest loading on the second factor.
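For readers unfamiliar with the method, the following is a rough sketch of principal axes factoring with iterated communalities, written from the standard textbook description rather than from the analysis software actually used; an oblique (oblimin) rotation of the returned loadings, not shown here, would complete the analysis.

```python
import numpy as np

def principal_axis_factor(R: np.ndarray, n_factors: int, n_iter: int = 50) -> np.ndarray:
    """Principal axes factoring with iterated communalities on a correlation matrix R.
    Returns the unrotated loading matrix (variables x factors)."""
    # Initial communality estimates: squared multiple correlations.
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(n_iter):
        R_reduced = R.copy()
        np.fill_diagonal(R_reduced, h2)          # replace diagonal with communalities
        vals, vecs = np.linalg.eigh(R_reduced)
        top = np.argsort(vals)[::-1][:n_factors]
        loadings = vecs[:, top] * np.sqrt(np.clip(vals[top], 0, None))
        h2 = (loadings ** 2).sum(axis=1)         # update communalities and iterate
    return loadings

# Hypothetical 3-variable correlation matrix, one factor extracted.
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
print(np.round(principal_axis_factor(R, n_factors=1), 2))
```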

For two item types, Sentence Completion and Antonyms, these results leave no ambiguity as to the effects of response format. The use of an open-ended format makes no difference in the attribute the test measures. The interpretation for the Analogies tests is less clear. The second factor is small (accounting for just under 5% of the common factor variance), and it is poorly defined, with only one test having its primary loading on that factor. However, the one test that did load heavily on Factor II was also the only test in the battery that was at all speeded. There is a reasonable interpretation of Factor II as a speed factor (Donlon, 1980); the rank-order correlation between Factor II loadings and the number of subjects failing to attempt the last item of a test was .80 (p < .01).

Factor analyses were also performed taking into account the academic level of the student. The sample included two groups large enough to be considered for separate analyses: seniors (N = 75) and juniors (N = 141). For each group a one-factor solution was indicated. A combined analysis was also carried out after adjusting for mean and variance differences in the data for the two groups.

Table 4
Factor Pattern for Two-Factor Analysis

The eigenvalues suggested either a one- or a two-factor solution; in the two-factor solution, however, all tests had their highest loading on the first factor, with only multiple-answer Analogies showing an appreciable division of its variance between the two factors.

Thus, there was no strong evidence for the existence of a format factor in the data. There were weak indications that the multiple-answer Analogies and, to a much lesser extent, the single-answer Analogies provided somewhat distinct measurement from the remainder of the tests in the battery. The evidence is clear that the Sentence Completion and Antonyms item types measure the same attribute regardless of the format in which the item is administered.

Multitrait-multimethod analysis. The data

may also be considered within the framework

provided by multitrait-multimethod analysis (Campbell & Fiske, 1959). Each of the three item types constitutes a "trait," while each of the four response formats constitutes a "method." The data were analyzed following a scheme suggested by Goldberg and Werts (1966). All the correlations relevant for each comparison were corrected for attenuation and then averaged using Fisher's r-to-z transformation. Results are summarized in Table 5.

Each row in the upper part of the table provides the average of all those correlations that represent relations for a single item type as measured in different formats and the average of all those correlations that represent relations between that item type and other item types when the two tests employ different response formats. Thus, for the Sentence Completion item type, the entry in the first column is an average of all six correlations among Sentence Completion scores from the four formats. The entry in the second column is an average of 24 correlations: for each of the four Sentence Completion scores, the six correlations representing relations to each item type other than Sentence Completion in each of three formats. The lower part of the table is organized similarly; it compares, for each response format, a set of average correlations within format with those between formats for all test pairs involving different item types.
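A minimal sketch of the averaging scheme: disattenuated correlations for a given cell of the summary table are transformed to Fisher's z, averaged, and transformed back. The correlations in the example are hypothetical, not entries from Table 2 or Table 5.

```python
import math

def fisher_z(r: float) -> float:
    """Fisher's r-to-z transformation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def average_correlation(rs: list[float]) -> float:
    """Average a set of (already disattenuated) correlations in the z metric,
    then transform the mean back to a correlation."""
    mean_z = sum(fisher_z(r) for r in rs) / len(rs)
    return math.tanh(mean_z)

# Hypothetical disattenuated correlations among one item type across the four formats.
same_item_across_formats = [0.81, 0.77, 0.85, 0.79, 0.74, 0.83]
print(round(average_correlation(same_item_across_formats), 2))  # about 0.80
```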

Results in the upper part of the table show that there was some variance associated with item type for both the Sentence Completion and Antonyms item types (by Mann-Whitney U test, p < .05). Analogies tests did not, however, relate to one another any more strongly than they related to tests of other item types.

Table 5

Multitrait-Multimethod Summary of Average Correlations

*By Mann-Whitney U Test, the two entries in a row are significantly different at the 5% level of confidence.

The lower part of the table shows differences attributable to response format. There is an apparent tendency toward stronger relations among multiple-choice tests than those tests have with tests in other formats, but this tendency did not approach significance (p > .05). For the truly open-ended response formats there were no differences whatsoever. Like the factor analyses, this approach to correlational comparisons showed no tendency for open-ended tests to cluster according to the response format; to the slight degree that any differences were found, they represented clustering on the basis of the item type rather than the response format employed in a test.

Correlations corrected for "alternate forms" reliabilities. The multitrait-multimethod correlational comparison made use of internal consistency reliability coefficients to correct correlations for their unreliability. Several interesting comparisons can also be made using a surrogate for alternate forms reliability coefficients. The battery, of course, contained only one instance of each item-type by response-format combination,

so that no true alternate form examinations could be made. It may be reasonable, however, to consider the two truly open-ended forms of a test (multiple-answer and single-answer) as two forms of the same test given under "open" conditions, and the two remaining forms (multiple-choice and keylist) as two forms of the same test given under "closed" conditions. On this assumption, relations across open and closed formats for a given item type can be estimated by the average of the four relevant correlations and corrected for reliabilities represented by the correlations within open and within closed formats.

The corrected correlations were .97 for Sentence Completion, .88 for Analogies, and 1.05 for Antonyms. It appears that relations across the two kinds of formats did not differ from 1.0, except for error in the data, for two item types. Analogies tests may fail to share some of their reliable variance across open and closed formats but still appear to share most of it.
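A sketch of that estimate, with hypothetical averages standing in for the study's actual values: the mean open/closed correlation is divided by the square root of the product of the within-open and within-closed averages, which serve as surrogate alternate-forms reliabilities.

```python
import math

def cross_format_corrected_r(mean_r_open_closed: float,
                             mean_r_within_open: float,
                             mean_r_within_closed: float) -> float:
    """r_corrected = mean cross-format r / sqrt(within-open r * within-closed r)."""
    return mean_r_open_closed / math.sqrt(mean_r_within_open * mean_r_within_closed)

# Hypothetical averages for one item type (not the study's values).
print(round(cross_format_corrected_r(0.58, 0.57, 0.66), 2))  # about 0.95
```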

Relations with Other Variables

Students completed a questionnaire dealing with their academic background, accomplishments, and interests. Included were questions

concerning (1) plans for graduate school attendance and advanced degrees, (2) undergraduate grade-point average overall and in the major field of study, (3) preferred career activities, (4) self-assessed skills and competencies within the major field, and (5) independent activities and accomplishments within the current academic year. Correlations were obtained between questionnaire variables and scores on the 12 verbal tests.

Most of the correlations were very low. Only four of the questions produced a correlation with any test as high as .20; these were level of degree planned, self-reported grade-point average (both overall and for the major field of study), and the choice of writing as the individual's single most preferred professional activity. No systematic differences in correlations associated with item type or response format were evident.

Information was also available on the student's gender and year in school. No significant correlations with gender were obtained. Advanced students tended to obtain higher test scores, with no evidence of differences among the tests in the magnitude of the relations.

GRE Aptitude Test scores were available for a small number of students (N = 41). Correlations with the GRE Verbal score were substantial in magnitude, ranging from .50 to .74 with a median of .59. Correlations with the GRE Quantitative and Analytical scores were lower but still appreciable, having medians of .36 and .47, respectively. Here also there were no systematic differences associated with item types or test formats.

These results, like the analyses of correlations among the experimental tests, suggest that response format has little effect on the nature of the attributes measured by the item types under examination.

Discussion

This study has shown that it is possible to develop open-ended forms of several verbal aptitude item types that are approximately as good, in terms of score reliability, as multiple-choice items and that require only slightly greater time limits than do the conventional items. These

open-ended items, however, provide little new information. There was no evidence whatsoever for a general factor associated with the use of a free-response format. There was strong evidence against any difference in the abilities measured by Antonyms or Sentence Completion items as a function of the response format of the task. Only Analogies presented some ambiguity in interpretation, and there is some reason to suspect that that difference should be attributed to the slight speededness of the multiple-answer Analogies test employed.

It is clear that an open-ended response format was not in itself sufficient to determine what these tests measured. Neither the requirement to generate a single response, nor the more difficult task of producing and writing several different answers to an item, could alone change the abilities that were important for successful performance. What, then, are the characteristics of an item that will measure different attributes depending on the response format employed? A comparison of the present tests with those employed in the earlier problem-solving research of Ward et al. (1980) and Frederiksen et al. (1981) suggests a number of possibilities. In the problem-solving work, subjects had to read and to comprehend passages containing a number of items of information relevant to a problem. They were required to determine the relevance of such information for themselves and often to apply reasoning and inference to draw conclusions from several items of information. Moreover, they needed to draw on information not presented: specialized knowledge concerning the design and interpretation of research studies, for the behavioral science problems, and more general knowledge obtained from everyday life experiences, for the nontechnical problems. Finally, subjects composed responses that often entailed relating several complex ideas to one another.

The verbal aptitude items, in contrast, are

much more self-contained. The examinee has

only to deal with the meaning of one word, of a pair of words, or at most of the elements of a short sentence. In a sense, the statement of the problem includes a specification of what information is relevant for a solution and of what kind of solution is appropriate. Thus, the verbal tests might be described as "well-structured" and the problem-solving tests as "ill-structured" problems (Simon, 1973). The verbal tests also, of course, require less complex responses: a single word or, at most, a pair of words.

Determining which of these features are critical in distinguishing tests in which an open-ended format makes a difference will require comparing a number of different item types in multiple-choice and free-response formats. It will be of particular interest to develop item types that eliminate the confounding of complexity in the information search required by a problem with complexity in the response that is to be produced.

For those concerned with standardized aptitude testing, the present results indicate that one important component of existing tests amounts to sampling from a broader range of possible test questions than had previously been demonstrated. The discrete verbal item types presently employed by the GRE and other testing programs appear to suffer no lack of generality because of exclusive use of a multiple-choice format; for these item types at least, use of open-ended questions would not lead to measurement of a noticeably different ability cutting across the three item types examined here. It remains to be seen whether a similar statement can be made about other kinds of questions employed in the standardized tests and whether there are ways in which items that will tap "creative" or "divergent thinking" abilities can be presented so as to be feasible for inclusion in large-scale testing.

References

Campbell, D. T., & Fiske, D. W. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 1959, 56, 81-105.

Donlon, T. F. An exploratory study of the implications of test speededness (GRE Board Professional Report GREB No. 76-9P). Princeton, NJ: Educational Testing Service, 1980.

Frederiksen, N., & Ward, W. C. Measures for the study of creativity in scientific problem-solving. Applied Psychological Measurement, 1978, 2, 1-24.

Frederiksen, N., Ward, W. C., Case, S. M., Carlson, S. B., & Samph, T. Development of methods for selection and evaluation in undergraduate medical education (Final report to the Robert Wood Johnson Foundation). Princeton, NJ: Educational Testing Service, 1981.

Goldberg, L. P., & Werts, C. W. The reliability of clinicians' judgments: A multitrait-multimethod approach. Journal of Counseling Psychology, 1966, 30, 199-206.

Heim, A. W., & Watts, K. P. An experiment on multiple-choice versus open-ended answering in a vocabulary test. British Journal of Educational Psychology, 1967, 37, 339-346.

Jöreskog, K. G. A general method for analysis of covariance structures. Biometrika, 1970, 57, 239-251.

Simon, H. A. The structure of ill-structured problems. Artificial Intelligence, 1973, 4, 181-201.

Steel, R. G. D., & Torrie, J. H. Principles and procedures of statistics. New York: McGraw-Hill, 1960.

Traub, R. E., & Fisher, C. W. On the equivalence of constructed-response and multiple-choice tests. Applied Psychological Measurement, 1977, 1, 355-369.

Vernon, P. E. The determinants of reading comprehension. Educational and Psychological Measurement, 1962, 22, 269-286.

Ward, W. C., Frederiksen, N., & Carlson, S. B. Construct validity of free-response and machine-scorable forms of a test. Journal of Educational Measurement, 1980, 17, 11-29.

Acknowledgments

Appreciation is due to Carol erg, Fred Godshalk, and Leslie Peirce for their assistance in developing and reviewing items; to Sybil Carlson and David Dupree for arranging and conducting test administrations; to Henrietta Gallagher and Hazel Klein for carrying out most of the test scoring; and to Kirsten

Yocum for assistance in data analysis. Ledyard Tucker provided extensive advice on the analysis and interpretation of results. This research was supported by a grant from the Graduate Record Examination Board.

Author's Address

Send requests for reprints or further information to William C. Ward, Senior Research Psychologist, Educational Testing Service, Princeton, NJ 08541, U.S.A.

Downloaded from the Digital Conservancy at the University of Minnesota, http://purl.umn.edu/93227. May be reproduced with no cost by students and faculty for academic use. Non-academic reproduction requires payment of royalties through the Copyright Clearance Center, http://www.copyright.com/

