ARI Research Note 91-03

Test Analysis Program Evaluation:
Item Statistics as Feedback to Test Developers

Peter J. Legree
U.S. Army Research Institute

Field Unit at Fort Gordon, Georgia
Michael G. Sanders, Chief

Training Research Laboratory
Jack H. Hiller, Director

October 1990

United States Army
Research Institute for the Behavioral and Social Sciences

Approved for public release; distribution is unlimited.

U.S. ARMY RESEARCH INSTITUTE
FOR THE BEHAVIORAL AND SOCIAL SCIENCES

A Field Operating Agency Under the Jurisdiction
of the Deputy Chief of Staff for Personnel

EDGAR M. JOHNSON, Technical Director
JON W. BLADES, COL, IN, Commanding

Technical review by

William J. York, Jr.

NOTICES

DISTRIBUTION: This report has been cleared for release to the Defense Technical Information Center (DTIC) to comply with regulatory requirements. It has been given no primary distribution other than to DTIC and will be available only through DTIC or the National Technical Information Service (NTIS).

FINAL DISPOSITION: This report may be destroyed when it is no longer needed. Please do not return it to the U.S. Army Research Institute for the Behavioral and Social Sciences.

NOTE: The views, opinions, and findings in this report are those of the author(s) and should not be construed as an official Department of the Army position, policy, or decision, unless so designated by other authorized documents.

UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE

REPORT DOCUMENTATION PAGE (Form Approved, OMB No. 0704-0188)

1a. REPORT SECURITY CLASSIFICATION: Unclassified
1b. RESTRICTIVE MARKINGS: --
2a. SECURITY CLASSIFICATION AUTHORITY:
2b. DECLASSIFICATION/DOWNGRADING SCHEDULE:
3. DISTRIBUTION/AVAILABILITY OF REPORT: Approved for public release; distribution is unlimited.
4. PERFORMING ORGANIZATION REPORT NUMBER(S): ARI Research Note 91-03
5. MONITORING ORGANIZATION REPORT NUMBER(S):
6a. NAME OF PERFORMING ORGANIZATION: U.S. Army Research Institute, Fort Gordon Field Unit
6b. OFFICE SYMBOL: PERI-IG
6c. ADDRESS (City, State, and ZIP Code): ATTN: PERI-IG (Bldg. 41203), Fort Gordon, GA 30905-5230
7a. NAME OF MONITORING ORGANIZATION:
7b. ADDRESS (City, State, and ZIP Code):
8a. NAME OF FUNDING/SPONSORING ORGANIZATION: U.S. Army Research Institute for the Behavioral and Social Sciences
8b. OFFICE SYMBOL: PERI-I
8c. ADDRESS (City, State, and ZIP Code): 5001 Eisenhower Avenue, Alexandria, VA 22333-5600
9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER: --
10. SOURCE OF FUNDING NUMBERS: Program Element No. 63007A; Project No. 795; Task No. 3303; Work Unit Accession No. H01
11. TITLE (Include Security Classification): Test Analysis Program Evaluation: Item Statistics as Feedback to Test Developers
12. PERSONAL AUTHOR(S): Legree, Peter J.
13a. TYPE OF REPORT: Final
13b. TIME COVERED: From 89/02 to 90/03
14. DATE OF REPORT (Year, Month, Day): 1990, October
15. PAGE COUNT: 16
16. SUPPLEMENTARY NOTATION:
17. COSATI CODES (FIELD, GROUP, SUB-GROUP):
18. SUBJECT TERMS: Test analysis; Item analysis; Reliability; Systems Approach to Training (SAT); Item consistency; Validity; Item difficulty
19. ABSTRACT: The test analysis program was evaluated to determine the feasibility of using a personal computer to provide course developers with item statistics. This project was undertaken because of Signal School concern that course tests do not accurately measure student school performance. The evaluation focused on the usefulness of providing item statistics to course test developers and demonstrated that many of the tests contain poorly written items. The evaluation indicates that a computerized test analysis program can be used to identify questionable test items and help ensure Signal School tests are adequate to validate lessons and courses.
20. DISTRIBUTION/AVAILABILITY OF ABSTRACT: Unclassified/Unlimited
21. ABSTRACT SECURITY CLASSIFICATION: Unclassified
22a. NAME OF RESPONSIBLE INDIVIDUAL: Peter J. Legree
22b. TELEPHONE (Include Area Code): (404) 791-5523/5524
22c. OFFICE SYMBOL: PERI-IG

DD Form 1473, JUN 86. Previous editions are obsolete. SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED

TEST ANALYSIS PROGRAM EVALUATION: ITEM STATISTICS AS FEEDBACK TO TEST DEVELOPERS

CONTENTS

INTRODUCTION AND BACKGROUND
    Statement of Problem
    Criterion Referenced Testing at the Signal School
    Description of SQT Test Item Development and Standards
    Test Analysis Program Recommendation

METHOD

RESULTS AND DISCUSSION
    Feedback to the Course Developers
    Item Statistics Summary

CONCLUSIONS

REFERENCES

APPENDIX A. RELATIONSHIP BETWEEN TEST RELIABILITY AND TRAINING EVALUATION

LIST OF TABLES

Table 1. Summary statistics for the 29E and 31M tests

Table 2. Proportion of test items not meeting ATSC standards

TEST ANALYSIS PROGRAM EVALUATION: ITEM STATISTICS AS FEEDBACK TO TEST DEVELOPERS

Introduction and Background

This project was undertaken to address the concern that performance on Signal course tests does not accurately measure student school performance. The project goal was to assess one method that could be used to improve Signal School test quality: a computerized item analysis program that identifies questionable test items.

The Signal School is concerned with test accuracy because the quality of the school's lessons is partially dependent on the quality of the course tests: changes in the lessons must be validated using actual students (TRADOC Regulation 350-7, 1988). This policy applies to minor and major lesson modifications and revisions, as well as to the implementation of new training technologies and approaches to teaching lessons. Course test quality is important because practical constraints usually limit the collection of validation data to student performance on course tests. The Signal School also recognizes that the accuracy of course test scores limits the quality of student-related managerial decisions and is important to the maintenance and development of student motivation.

Statement of Problem

The Deputy Assistant Commandant at the Signal School requested assistance from the Army Research Institute in implementing procedures to improve the accuracy of Signal School tests. This research project focused on methods that could be used to improve the technical quality of Signal School course tests. Test content issues were not addressed because the Signal School closely follows the Systems Approach to Training (SAT) guidelines to insure appropriate test content, and because Subject Matter Experts (SMEs) indicate that test content is not problematic within the Signal School.

Criterion Referenced Testing at the Signal School

At the Signal School, the development of course tests is similar to the development of Skill Qualification Tests (SQTs). Both types of tests are criterion referenced and are based on task lists. A correspondence between the task lists and the test items is required for both types of tests to insure that the tests are representative of the tasks for that Military Occupational Specialty (MOS). In fact, SMEs at the Signal School are often tasked to revise Signal Programs of Instruction (POIs) and Signal SQTs; that is, SMEs are given dual responsibilities. The Signal School has promoted test quality by requiring course test developers to follow SAT guidance and by sponsoring test construction workshops for SQT developers. The primary content difference between the two types of tests is that the SQTs are designed to test a broader range of skills because the field experience of soldiers includes activities that cannot be taught or tested at the Signal School.

An important procedural difference between the development and refinement of SQTs and that of course tests is that test item information is provided to SQT developers by the Army Training Support Command (ATSC). Signal School SQT developers use the information to help identify and correct problematic test items. In contrast, test item information is not available to Signal School course test developers and has not been integrated into the SAT guidance on course test development. The successful use of item information by SQT developers suggests that this information might also be used to refine Signal course tests.

Description of SQT Test Item Development and Standards

The initial development of SQT items is based on a comparison of the test performance of groups of soldiers that can perform a task with groups that cannot perform the task. To be included on an SQT, an item must be answered correctly by over 50 percent of the performers, and the performers must score higher than the non-performers (M. Andriliunas, personal communication, March 1990, ATSC, Fort Eustis, VA; TRADOC Reg 351-2). This approach is problematic because practical constraints on item development resources limit the size of the two groups to a maximum of ten individuals. It is noteworthy that if this procedure were modified to use large groups, it would insure item consistency by identifying test items that discriminate between groups of soldiers.

After the SQTs have been formally administered to large groups of soldiers, the ATSC calculates test item statistics. The statistics are returned to each Military Occupational Specialty proponent and are provided to SQT developers for use in revising the tests. The ATSC sets standards for the proponents to follow to insure that the MOS proponents utilize similar guidelines during SQT revision.

Although the ATSC recommends reviewing test items and their distractors on the basis of the item statistics, it does not require that test items be changed. The item statistics and recommendations are designed as an aid to help SQT developers identify problematic test items.

The ATSC uses a computer program to identify questionable test items by monitoring item difficulty, item consistency, and distractor attractiveness. The following paragraphs describe the item statistic standards that have been set for SQT scores.

Item difficulty is defined as the proportion of examinees who correctly answer each test item. The item difficulty value ranges from 0 to 100 percent, indicating that between 0 and 100 percent of the responses were correct for that test item. According to TRADOC Regulation 351-2, test item difficulty should vary between 50 percent and 95 percent. Item difficulty values that are less than 50 percent usually indicate an error in the answer key and are relatively rare. More commonly, an unchallenging or poorly written question will be correctly answered by a high proportion of the examinees. Item difficulty is monitored to insure that the test items measure variance in the content areas that are tested.
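
To make the computation concrete, the short Python sketch below (not taken from the TAP or the ATSC software) computes item difficulty from a scored response matrix and flags items falling outside the 50 to 95 percent band described above; the data and function names are illustrative.

    # Illustrative sketch: item difficulty from a 0/1 scored response matrix,
    # flagged against the 50%-95% band described in the text.
    import numpy as np

    def item_difficulty(scores):
        """scores: (n_examinees, n_items) array of 0/1 item scores."""
        return scores.mean(axis=0) * 100.0        # percent answering correctly

    def flag_difficulty(diff_pct, low=50.0, high=95.0):
        """True where an item falls outside the acceptable difficulty band."""
        return (diff_pct < low) | (diff_pct > high)

    # Made-up data: 6 examinees by 3 items.
    scores = np.array([[1, 1, 0],
                       [1, 1, 1],
                       [1, 1, 1],
                       [0, 1, 1],
                       [1, 1, 1],
                       [1, 1, 0]])
    diff = item_difficulty(scores)                # approximately [83.3, 100.0, 66.7]
    print(diff, flag_difficulty(diff))            # second item flagged as too easy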

Item consistency is estimated by the point biserial correlation between performance on each test item and performance on the remainder of the test. The point biserial ranges from -1.0 to 1.0. Positive values indicate that performance on the item is consistent with performance on the rest of the test, while negative values indicate that the better test performers were below average on that item. The item consistency index can be used to identify test questions that discriminate between the better and poorer students. The ATSC has the goal of obtaining an item consistency index greater than 0.20 for SQT items (M. Andriliunas, personal communication, March 1990, ATSC, Fort Eustis, VA).
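
For 0/1 item scores, the point biserial can be obtained as an ordinary Pearson correlation between the item score and the total score on the remaining items. The sketch below is a minimal illustration under that assumption; it is not the ATSC or TAP implementation.

    # Illustrative sketch: point biserial between each 0/1 item score and the
    # total of the remaining items (items with no variance return nan).
    import numpy as np

    def item_consistency(scores):
        """scores: (n_examinees, n_items) 0/1 matrix; returns one r per item."""
        n_items = scores.shape[1]
        total = scores.sum(axis=1)
        r = np.empty(n_items)
        for j in range(n_items):
            rest = total - scores[:, j]           # remainder of the test
            r[j] = np.corrcoef(scores[:, j], rest)[0, 1]
        return r

    # Items with r below the 0.20 goal would be flagged for review:
    # flags = item_consistency(scores) < 0.20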

Distractor attractiveness quantifies the extent to which examinees find the incorrect distractors plausible on multiple choice tests. This is important because examinees can eliminate implausible distractors and choose the correct answer without adequate knowledge of the content area of the item. According to TRADOC Regulation 351-2, the attractiveness of each distractor should exceed 5 percent. This implies that the item difficulty of multiple choice test items should be less than .85, because the three distractors on a standard four-choice question should together be chosen by over 15 percent of the examinees; however, the less restrictive standard, .95, was adopted to be consistent with the ATSC item difficulty standard.
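
A hypothetical tally for a single four-choice item follows; the responses and the 5 percent screen are illustrative of the standard described above, not output from the actual analyses, and 'A' is assumed to be the keyed answer.

    # Illustrative sketch for one four-choice item with made-up responses.
    from collections import Counter

    def distractor_attractiveness(responses, options="ABCD"):
        """Proportion of examinees choosing each option for one item."""
        counts = Counter(responses)
        n = len(responses)
        return {opt: counts.get(opt, 0) / n for opt in options}

    responses = ["A"] * 17 + ["B"] * 2 + ["C"]    # 20 hypothetical examinees
    props = distractor_attractiveness(responses)
    weak = [opt for opt, p in props.items() if opt != "A" and p < 0.05]
    print(props)                                  # A .85, B .10, C .05, D .00
    print("distractors under 5 percent:", weak)   # ['D']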

Across Army SQTs, a high proportion of the test items do not meet the ATSC standards; for example, 4,000 of 17,000 recently analyzed test items had item consistencies less than 0.20, while 800 of the items had consistencies less than 0.00 (M. Andriliunas, personal communication, March 1990, ATSC, Fort Eustis, VA).

Test Analysis Program Recommendation

In order to provide course developers with item statistics, ARI recommended utilizing the Test Analysis Program (TAP) to analyze the course tests of two Basic Non-commissioned Officer Courses (BNCOCs). The TAP runs on a microcomputer and has been integrated with a Scantron sheet reader. This system allows student responses to be read by a Scantron form reader and placed into a data file. The data are then analyzed by the TAP, and item statistics are calculated for each test item. This system can quickly provide the course developer with information that would otherwise be impractical to calculate.

The TAP statistics include estimates of item consistency, item difficulty, and distractor attractiveness. These statistics are very similar to those calculated by the ATSC. The main difference between the ATSC system and the TAP is that the TAP is designed for a microcomputer and can be used at the Signal School by course developers. The TAP can also compute item consistencies for competency areas, so it is possible to determine whether responses on items are consistent with either the entire test or with related groups of items. The TAP also has item banking capabilities that are designed to produce equivalent forms of tests. This capability could be used by the Signal School to fulfill TRADOC regulations requiring three test versions for each course.

In addition to the item statistics, the TAP computes two single-form estimates of test reliability. (Test reliability estimates have implications for power analysis; see Appendix A.) The reliability estimates are based on the coefficient alpha and the Spearman-Brown split-half formulae. The TAP will also compute reliability estimates for test subscales or competency areas.
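
For readers unfamiliar with these indices, the sketch below shows one common way to compute coefficient alpha and a Spearman-Brown corrected split-half estimate from an item-score matrix. The odd/even split is an assumption; the TAP's actual splitting rule is not documented here.

    # Illustrative sketch of two single-form reliability estimates.
    import numpy as np

    def coefficient_alpha(scores):
        """scores: (n_examinees, n_items) matrix of item scores."""
        k = scores.shape[1]
        item_var = scores.var(axis=0, ddof=1).sum()
        total_var = scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1.0 - item_var / total_var)

    def split_half(scores):
        """Spearman-Brown corrected correlation between odd and even half tests."""
        odd = scores[:, 0::2].sum(axis=1)
        even = scores[:, 1::2].sum(axis=1)
        r_half = np.corrcoef(odd, even)[0, 1]
        return 2.0 * r_half / (1.0 + r_half)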

Method

Two BNCOCs, the 29E and the 31M, were chosen for this evaluation because these courses utilize multiple choice tests and have a higher throughput than other Signal School BNCOCs. The Signal School estimated that 120 students would be trained in the 29E BNCOC and 420 in the 31M BNCOC during 1989. The multiple choice tests used by these courses were adapted for computerized scoring and grading using Scantron forms as answer sheets.

During the data collection period, a total of 212 students were tested in the 31M course and 72 in the 29E course. Each item statistic is based on the test results of a subset of the two sets of students because the Signal School is required to maintain three versions of each annex test to avoid compromising tests and to allow the retesting of students who fail course tests. Only test items for which more than 20 soldiers were tested were analyzed for this report.

Results and Discussion

Feedback to the Course Developers

The Test Analysis Program was used to compute item statistics and analyze the responses to the multiple choice tests. For each test item, summary statistics were calculated and were provided to the course developers. Questionable items were flagged in the feedback given to the course developers.

The test items were evaluated with standards that are very similar to those used by the ATSC. An item was to be flagged if the point biserial correlation was less than 0.20 or if the item difficulty was higher than .95. In addition, item distractors were to be flagged if less than 5 percent of the soldiers chose that distractor.
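
The following sketch illustrates how these unmodified review standards could be applied to one test form. The inputs (difficulty percentages, point biserials, per-item distractor proportions, and the answer key) are assumed to come from helpers like those sketched earlier; the function is illustrative rather than the procedure actually used to generate the feedback.

    # Illustrative sketch of the unmodified review standards for one test form.
    def flag_items(diff_pct, consistency, distractor_props, key):
        """Return {item index: list of reasons} for items needing review."""
        flags = {}
        for j, (d, r, props) in enumerate(zip(diff_pct, consistency, distractor_props)):
            reasons = []
            if r < 0.20:
                reasons.append("item consistency below .20")
            if d > 95.0:
                reasons.append("item difficulty above .95")
            weak = [o for o, p in props.items() if o != key[j] and p < 0.05]
            if weak:
                reasons.append("distractors under 5 percent: " + ", ".join(weak))
            if reasons:
                flags[j] = reasons
        return flags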

It was necessary to modify the standards for feedback to the course developers because the original standards led to an extremely high proportion of flagged items. The standards were changed so that no more than half of the items for any one test were flagged. Distractor attractiveness information was provided to the course test developers, but questionable distractors were not individually flagged because most of the test items contained distractors that did not meet the ATSC standard.

Item Statistics Summary

Table 1 summarizes the item statistics by class and contains class estimates of item consistency, item difficulty, and distractor attractiveness for the BNCOC tests. The three distractor attractiveness columns contain the proportion of students who chose the most attractive, second most attractive, and least attractive distractor for each test item. For comparison purposes, Table 1 also summarizes the item statistics for the 29E Skill Level 2 SQT for 1989.

Table 1. Summary Statistics for the 29E and 31M Tests

             Mean   Mean                   Percentage Choosing           Reliability
             Item   Item   Test   Sample   Corr.   Distractor            (Coefficient
    Test     Diff.  Cons.  Size   Size     Answ.    1      2      3      Alpha)

    29E
      Av3    90.9   .30     20     46.9     91      7      2      0      .55
      Bv2    88.3   .30     30     24.0     88     10      2      0      .54
      Bv3    88.0   .23     30     28.9     88      9      2      0      .57
      Cv3    92.6   .24     20     42.9     93      6      1      0      .35
      E1v2   92.6   .19     40     33.8     93      6      1      0      .55
      E1v3   86.0   .31     20     31.0     87     10      2      1      .59
      SQT    84.5   .26    118    227.0     84     10      4      1      .80

    31M
      Av1    85.6   .21     50    210.9     86     11      2      1      .60
      D      89.5   .26     50    163.8     90      9      1      0      .75
      Fv1    89.3   .14     50     74.8     89      8      2      0      .38
      Fv2    86.4   .18     50     34.0     86     10      3      1      .62
      Fv3    87.9   .17     50     61.9     88      9      2      0      .42

Table 2 contains the proportion of items that did not meet the ATSC standards for each of three statistics: item difficulty, item consistency, and distractor attractiveness. The table demonstrates that a very high proportion of the items did not meet the standards for each test. This was true regardless of whether the original standards were used to identify questionable test items or whether the standards for the test items were lowered to limit the number of flagged items.

According to Table 2, many of the test items are extremely easy; approximately 37 percent of the course test items have an ease index greater than .95, while 72 percent of the items have an ease index greater than .85. Table 2 also indicates that 48 percent of the test items across the course tests have low item consistency estimates, i.e., less than .20.

The distractor attractiveness columns in Table 2 indicate that only 50 percent of the test items have at least one distractor that attracts more than 5 percent of the responses. Table 2 also indicates that very few items have more than one attractive distractor, as shown by the fact that 89 and 98 percent of the test items do not have second and third distractors chosen by more than 5 percent of the students.

A comparison of the item difficulty and item consistency estimates indicates that 78 percent of the items that do not meet the item difficulty standard (.95) also do not meet the item consistency standard. This overlap indicates that the item ease index is nearly as effective as the item consistency measure in identifying items with low consistency estimates. This may be relevant to test redesign because most SMEs find item difficulty estimates easier to understand and compute than item consistency estimates.

Table 2. Proportion of Test Items Not Meeting ATSC Standards

                    Item      Item              Answer Distribution
    Test            Diff.     Cons.     Correct    Distractor
                                        Response    1      2      3

    31M
      Av1           .24       .45        .63       .39    .84    .85
      D             .30       .34        .78       .38    .96   1.00
      Fv1           .38       .67        .77       .65    .88   1.00
      Fv2           .35       .59        .63       .46    .74    .96
      Fv3           .33       .64        .71       .53    .89    .98

    29E
      Av3           .47       .29        .76       .53    .94   1.00
      Bv2           .52       .52        .57       .52    .87   1.00
      Bv3           .40       .40        .70       .40    .80   1.00
      Cv3           .40       .40        .85       .65   1.00   1.00
      E1v2          .53       .57        .84       .63    .95   1.00
      E1v3          .21       .37        .63       .37    .89   1.00

    Summary
      Mean CTs      .37       .48        .72       .50    .89    .98
      29E SQT       .20       .33        .56       .31    .74    .96

    ATSC          50%-95%     >.20       <85%      >5%    >5%    >5%
    Standards

The relationship between item difficulty and item consistency also indicates that the more difficult questions are more consistent with overall test performance. This suggests that while the procedures followed by the course developers are adequate to insure item consistency, too many of the items are not sufficiently challenging. By increasing the difficulty of the test items, the item consistency estimates can also be expected to increase.

The distractor attractiveness data are relevant to the issue of increasing item difficulty because the data suggest that many Signal test items utilize distractors that are not plausible to the examinees. In effect, many of the test items function as two-choice rather than four-choice questions because two of the distractors are not reasonable choices. It follows that improving course test item distractor attractiveness would produce more challenging and useful questions.

Tables 1 and 2 allow the comparison of test item statistics obtained for the SQTs and the course tests. The tables indicate that the SQTs have more acceptable item characteristics than the course tests. The difference in item statistics is also reflected in the higher reliability estimates of the SQTs. Given the similarity in expertise between the SQT developers and the course test developers, a major reason for this difference may be the availability of item information to the SQT developers. This interpretation is supported by informal feedback from course developers, who report that the item information is useful in identifying poor test items.

Conclusions

Data from the TAP were utilized to identify questionable items and item distractors. By using the TAP, it can be expected that better test item distractors will be identified and that more challenging test items will be created. This process will help insure that Signal School course tests can be used for the validation of lessons and courses. This conclusion is consistent with the comparisons between the technical quality of the SQTs and the course tests and is reinforced by reports from Signal School SMEs that the item information is useful in identifying questionable test items.

The comparisons between the technical qualities of the SQTs and the course tests indicate that test items could be improved by providing distractor attractiveness and item difficulty information to Signal School course test developers. The impact of providing item consistency data would probably be minimal because most of the items flagged for item consistency are also flagged for item difficulty and distractor attractiveness. The analyses support the view of Signal School SMEs that the content of the test items is adequate and that the test items should be designed to be more challenging.

References

Bloom, B. S. (1984). The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational Researcher, 13, 4-16.

Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York, NY: Academic Press.

Gulliksen, H. (1950). Theory of mental tests. New York, NY: Wiley.

Jensen, A. R. (1980). Bias in mental testing. New York, NY: Free Press.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

McNemar, Q. (1969). Psychological statistics. New York, NY: Wiley.

TRADOC Regulation 350-7. (1988). Training: Systems approach to training.

TRADOC Regulation 351-2. (1986). Schools: Skill qualification test and common task test development policy and procedures.

Appendix A.

Relationship Between Test Reliability and Training Evaluation

Test reliability is of limited importance if a test is used solely to determine whether the performance of a student has reached an agreed-upon criterion. However, the Signal School has recognized the importance of using the tests for other purposes, such as training evaluation. Low test reliability limits the ability of the researcher to evaluate the effectiveness of new training procedures by making group differences harder to demonstrate and by causing the magnitude of an experimentally induced effect to be underestimated.

One effect of a deficiency in test reliability is that the sample size needed to evaluate a new training procedure will increase. The increase occurs because an inference can be drawn only when the ratio of the mean difference between groups to the standard error of the mean exceeds some constant. For example, the t-ratio is given by McNemar (1969) as:

t = (mean_Y1 - mean_Y2) / s_meanY,  where s_meanY = s_Y / SQRT(N).

Note that a decrease in reliability is equivalent to an increase in observed score variance, as shown by (e.g., Gulliksen, 1950):

r = sdt^2 / sdx^2,

where r equals the test's reliability, sdx^2 measures the test's observed score variance, and sdt^2 estimates the variance that would have been obtained had a perfectly reliable test been used.

It follows from the definition of s_meanY that a decrease in test reliability may be offset by an increase in sample size, N. Thus, the data collection cost of a training evaluation will increase when tests with low reliability are used.
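
A small numerical illustration of this trade-off, using made-up values and the simplified t-ratio quoted above: holding the true-score spread fixed, lower reliability inflates the observed standard deviation (sdx = sdt / SQRT(r)), so the sample size needed to preserve a given t-ratio grows in proportion to 1/r.

    # Made-up numbers; follows the simplified t-ratio quoted above.
    import math

    mean_diff = 5.0     # hypothetical mean difference between training groups
    sd_true = 10.0      # hypothetical true-score standard deviation

    for r in (0.90, 0.60, 0.40):
        sd_obs = sd_true / math.sqrt(r)               # observed sd under reliability r
        t_at_30 = mean_diff / (sd_obs / math.sqrt(30))
        n_to_match = 30 * 0.90 / r                    # N needed to recover the r = .90 t-ratio
        print(f"r={r:.2f}  sdx={sd_obs:.1f}  t(N=30)={t_at_30:.2f}  "
              f"N to match r=.90: {n_to_match:.0f}")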

A second effect of low test reliability occurs because the effect size of a new training procedure relative to an alternative approach is calculated as the ratio of the mean difference between the groups to the square root of the variance of the test (Cohen, 1977; Bloom, 1984):

ES = (m1 - m2) / sdx.

The relationship between test reliability and variance is given by (e.g., Gulliksen, 1950):

r = sdt^2 / sdx^2,

where r equals the test's reliability, sdx^2 measures the test's observed score variance, and sdt^2 estimates the variance that would have been obtained had a perfectly reliable test been used. Because a decrease in test score reliability results in an increase in test score variance, sdx^2, it follows that attenuation of the reliability of a test leads to an underestimation of the effect size that is being calculated.

It is noteworthy that these formulae can be rearranged to correct the effect size estimate for attenuation due to unreliability:

ES_corrected = (m1 - m2) / sdt = (m1 - m2) / (sdx * SQRT(r)).

This corrected estimate is most credible when a liberal estimate of test reliability is used.

The Test Analysis Program uses coefficient alpha and the split-half approach to estimate a test's reliability. Both of these estimates are frequently used because they represent a lower bound on test reliability (Jensen, 1980).

An alternative approach to estimating test reliability is to separately calculate coefficient alpha for each subscale or content area. The subscale reliability estimates and the correlations among content areas may then be used to obtain a higher estimate of the test's reliability. This approach is most useful when estimating the reliability of a heterogeneous test because the effect of low subscale correlations is minimized. The approach and methodology can be found in Lord and Novick (1968).
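
This note does not reproduce the formula; the sketch below uses stratified coefficient alpha (one common subscale-based estimate) as an assumed stand-in for the approach described: alpha_strat = 1 - SUM_i [var_i * (1 - alpha_i)] / var_total, where var_i is the variance of the subscale i totals and var_total the variance of the full-test totals.

    # Illustrative sketch of stratified coefficient alpha (an assumption about
    # which subscale-based formula is intended, not necessarily the Lord and
    # Novick derivation cited above). Each subscale needs at least two items.
    import numpy as np

    def coefficient_alpha(scores):
        """Ordinary coefficient alpha for an (n_examinees, n_items) matrix."""
        k = scores.shape[1]
        return (k / (k - 1)) * (1.0 - scores.var(axis=0, ddof=1).sum()
                                / scores.sum(axis=1).var(ddof=1))

    def stratified_alpha(scores, subscales):
        """subscales: list of item-index lists, one list per content area."""
        total_var = scores.sum(axis=1).var(ddof=1)
        penalty = sum(scores[:, idx].sum(axis=1).var(ddof=1)
                      * (1.0 - coefficient_alpha(scores[:, idx]))
                      for idx in subscales)
        return 1.0 - penalty / total_var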
