DOCUMENT RESUME

ED 196 935    TM 810 050

AUTHOR       Anderson, Beverly L.; And Others
TITLE        Educational Testing Facts and Issues: A Layperson's Guide to Testing in the Schools.
INSTITUTION  California State Dept. of Education, Sacramento. Office of Program Evaluation and Research.; Nero and Associates, Inc., Portland, Oreg.; Northwest Regional Educational Lab., Portland, Oreg.
SPONS AGENCY National Inst. of Education (ED), Washington, D.C.
PUB DATE     Sep 80
CONTRACT     400-79-0059
NOTE         56p.; For related documents, see TM 810 047-049.
EDRS PRICE   MF01/PC03 Plus Postage.
DESCRIPTORS  *Educational Practices; *Educational Testing; Elementary Secondary Education; Lay People; Public Schools; *Testing Problems; *Test Interpretation
IDENTIFIERS  *Test Use

ABSTRACT
This booklet addresses the role of testing in today's public education system, and presents a series of questions and answers which will be of particular interest to school board members, legislators, lawyers and journalists. These questions are grouped into two major categories: (1) test purposes and users; and (2) current testing issues. Current testing issues include how teachers view testing, why achievement test scores are declining, the meaning of the truth in testing legislation, the meaning of test bias, issues related to IQ testing, educational and legal issues surrounding minimum competency testing, and the evaluation of teachers in the schools. In addition, an annotated bibliography, a glossary of measurement terms, and a summary of common test scores are included that aid in the layperson's quest for information related to the issues. (Author/EL)

Reproductions supplied by EDRS are the best that can be made from the original document.


Educational Testing Facts and Issues:

Beverly L. Anderson
Richard J. Stiggins
David W. Gordon

a layperson's guide to testing in the schools

National Institute of Education, U.S. Education Department
Contract No. 400-79-0059

Coordinated by:
Nero and Associates, Inc.
520 S.W. Sixth Avenue, Suite 820
Portland, OR 97204
Susan W. Rath, Project Director

Materials developed by:
Northwest Regional Educational Laboratory
Assessment and Measurement Program
710 S.W. Second Avenue
Portland, OR 97204

California State Department of Education
Office of Program Evaluation and Research
721 Capitol Mall, 4th floor
Sacramento, CA 95814


Acknowledgements:
Special thanks are due to the many workshop participants and sponsors who provided helpful comments during the development of this booklet. Appreciation is also due to the legislators, school board members, journalists, measurement specialists, lawyers, and test publisher representatives who reviewed it, and Carol DeWitte who was responsible for its production.

Designed and illustrated by Warren Schlegel

Edited by Jane Loftus

September 1980

This booklet is intended to be used in conjunction with workshops and seminars conducted by measurement specialists using the training methods described in Training Citizen Groups on Educational Testing Issues: A Trainer's Manual, developed under this same contract.

These materials are in the public domain and may be reproduced without permission. The following acknowledgement is requested on materials which are reproduced: Developed by the Northwest Regional Educational Laboratory, Portland, Oregon and the California Department of Education.

This booklet was prepared by the Northwest Regional Educational Laboratory, a private nonprofit corporation, and the California Department of Education under a subcontract with Nero and Associates, Inc., Portland, OR. The work contained herein has been developed under a contract with the National Institute of Education, U.S. Education Department pursuant to Contract No. 400-79-0059/5130408(a)-79-C-197. The opinions expressed in this publication do not necessarily reflect the position of the National Institute of Education, and no official endorsement by the Institute should be inferred. Mention of trade names, commercial products, or organizations does not imply endorsement by the U.S. Government.


Table of Contents

INTRODUCTION

OVERVIEW OF TEST PURPOSES AND USERS

Who uses tests?
What are the most common types of tests?
What are the major purposes of testing?
What are the limitations of tests?
Who is responsible for initiating testing?
Who constructs tests?
What are the costs of testing?

CURRENT TESTING ISSUES

How do teachers view testing?
Why are achievement test scores declining?
What is the meaning of the truth in testing legislation?
What is the meaning of test bias?
What are the issues related to IQ testing?
What are the educational and legal issues surrounding minimum competency testing?
Are tests being used to evaluate teachers in schools?

ANNOTATED BIBLIOGRAPHY

APPENDIX A: A GLOSSARY OF MEASUREMENT TERMS

APPENDIX B: SUMMARY OF COMMON TEST SCORES


Introduction

This booklet addresses the role of testing in today's public education system, and presents a series of questions and answers which will be of particular interest to school board members, legislators, lawyers and journalists. These questions are grouped into two major categories:

Test Purposes and Users
Current Testing Issues

Before presenting these issues, a short scenario from a typical school may help in establishing a context for the role of testing in schools today.

An interviewer recently visited a junior high school to learn more about the role of testing in the school. Walking down the hall, the first person the interviewer met was a student leaving a room marked with a sign "Testing - Do Not Disturb."

The interviewer said, "Hi! I'm visiting your school, and want to find out what kind of testing is done here. It looks like you just took some tests."

"Yes," the student replied."We're taking a series of tests thisweek to find out what classes weshould be taking. They just gave mesome tests in math and reading."

The interviewer asked a teacher about the testing that was being done. "Yes, we use those results to group students. But if a teacher disagrees with the placement of a student, the teacher's opinion is taken into account as well as the test results."

After several more stops, the interviewer found that in the history and social studies classes, no standardized achievement tests were given; rather, all the testing done in those classes was designed by the classroom teacher.

At the district testing specialist's office located at the junior high, the interviewer discussed the district testing program with the specialist.

INTERVIEWER: What are the major reasons for testing in your district?

SPECIALIST: The districtwide testing is for three major purposes: first, to determine trends in student performance over the years; second, for program evaluation; and third, to determine student placement. Diagnostic testing is done at the discretion of teachers and principals. It is not determined at a district level.

INTERVIEWER: What types of tests are used?

SPECIALIST: Let me give you an example of what a typical student would experience in grades K through 12. During their first two months in kindergarten, students are given a screening test. It is essentially an observation of a student's physical development, verbal and other academic skills.

In grades 1 through 6, the student takes a standardized reading and math test each spring. In grades 7, 9 and 11, the student takes a language arts test as well as the reading and math test. In grades 3, 7, 9 and 11, an aptitude test is given along with the achievement battery. The purpose of the aptitude test is to establish expected levels of performance on the achievement test.

INTERVIEWER: How many hours of testing do you think the typical student experiences?


SPECIALIST: Well, the districtwide testing I mentioned takes about two hours in the first grade, with the amount of time increasing progressively to nearly six hours in the fifth grade. From the fifth grade on, it fluctuates between four and six hours.

INTERVIEWER: What about students who are having difficulties in certain areas or appear to be in need of special education?

SPECIALIST: Now you have hit on an important purpose for testing. Students in special programs such as Title I, Follow Through, or a bilingual program experience much more testing. Nearly all federal or state funded programs require program evaluation; typically, students are tested both in fall and spring for this purpose. We wish the testing could be coordinated with districtwide testing, but an evaluation frequently requires a different test; thus these students take at least two more tests during the year. Furthermore, programs like Title I frequently require diagnostic testing throughout the year. Students in such programs may participate in double or triple the amount of testing of the typical student.

INTERVIEWER: I hear a lot about minimum competency testing. Are you doing such testing in your district?

SPECIALIST: Not yet, but we will be starting next year. Our school board feels that minimum competency testing will be very useful in identifying students who should receive remedial instruction. They are still debating whether or not to require passage of the test for graduation. They have decided to wait until after next year's testing to decide. We have spent a lot of time this year working with teachers, administrators and community members to decide what competencies to test with the MCT, as we call it. We contracted with an educational service agency to prepare the test once we had the competencies and skills identified.

INTERVIEWER: Are people concerned about cultural bias in testing?

SPECIALIST: Yes, there is much talk about cultural bias. Unfortunately there are so many different interpretations of what cultural bias is that we have a very difficult time dealing with it. I'm going to a workshop next month on the topic which will hopefully help me determine how to handle this issue. Partly out of concern about cultural bias, we are seriously considering eliminating our aptitude testing, but I'm not ready to recommend that yet.

INTERVIEWER: Another topic I am hearing more and more about is teacher evaluation and the use of testing for that purpose. Is that an issue in your district?

SPECIALIST: Do you mean the use of student test scores in evaluating teacher performance or actually testing teacher competencies?

INTERVIEWER: I was thinking of the former, but both topics are of interest.

SPECIALIST: Because of the many problems inherent in using student test scores for teacher evaluation, we do not use them for that purpose. We are getting pressure from parents, however, to at least consider looking at the scores of students over several years when a particular teacher's performance is questioned. As far as testing teachers, we just started giving teacher applicants a test of basic skills competencies. Teachers already in the district are not tested.

The district described in this imaginary interview is meant to be representative of many districts across the country. The issues raised here are discussed in the following pages.


Overview of Test Purposes and Users

Who uses tests?

Tests are used by many people. Teachers use tests to determine students' progress in learning specific skills. Parents use test scores to tell them how their child is doing in school or to see how their school compares with other schools. School board members and legislators use test data to help set policy and allocate funds. School principals, guidance counselors, district personnel and state department of education staff also require information on how well students are learning. News reporters often request student test scores for reports on the quality of schools. Lawyers may find test scores to be important in certain legal cases. State, federal, or private agencies which fund special programs often require student test scores to evaluate the program's effectiveness. And, of course, students use test scores to determine if they are learning what they are expected to learn.

What are the most common types of tests?*

There are several types of measurement devices used in the schools. Some tests measure knowledge and skills and some measure other characteristics. There are two main types of cognitive measures used in today's elementary and secondary schools--achievement tests and aptitude tests. Other measures such as attitude inventories and interest inventories are also used.

*See Appendices for a glossary of measurement terms and descriptions of test scores.

ACHIEVEMENT TESTS

These tests measure how much a student has learned or what skills the student has acquired. Achievement tests are developed by teachers for classroom use or by test publishers for use by schools and school districts in large-scale testing programs. In either case, the test is developed by outlining the material to be tested and writing test items representative of that material. Achievement test scores are used by teachers and students to help plan and manage instruction (diagnose weaknesses, assign grades, etc.), to certify mastery of minimum essential skills, to select students for admission to college, to plan career directions, and to evaluate the quality of educational programs.

Achievement tests come in two basic forms: those used to compare one student's learning with that of another student and those used to determine if a student has mastered particular knowledge and skills regardless of how other students score. Many achievement tests given are standardized. These tests cover material taught in most schools in subject matter areas such as reading, language arts, mathematics, science, and social studies. Once developed, the tests are administered to large national samples of several thousand students. Student performance is then analyzed and a ranking by scores is established. These comparative or norm referenced tests are then used at the local district level, where they allow the comparison of student test scores within the district. For example, a student may be at the 40th percentile compared to a national norm group, but at the 50th percentile compared to a local norm group. This would indicate that the district as a whole was performing lower than the national group.
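To make the comparison concrete, here is a minimal sketch of how a percentile rank is computed against two different norm groups. All raw scores below are hypothetical and chosen only to reproduce the 40th-versus-50th percentile contrast described above; actual norm groups contain thousands of students.

```python
# Percentile rank: the percentage of a norm group scoring below a
# given raw score. All data here are hypothetical illustrations.

def percentile_rank(score, norm_group):
    below = sum(1 for s in norm_group if s < score)
    return 100.0 * below / len(norm_group)

national_norms = [35, 42, 48, 55, 61, 66, 72, 78, 84, 90]
local_norms = [30, 38, 44, 50, 58, 63, 69, 75, 80, 86]

raw_score = 60
print(percentile_rank(raw_score, national_norms))  # 40.0 -- 40th percentile nationally
print(percentile_rank(raw_score, local_norms))     # 50.0 -- 50th percentile locally
```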

Norm referenced tests are used to select students for remedial or advanced programs. In addition, these tests are used as a guidance tool for the long-term educational and vocational planning of the student.

Achievement tests can also show the quantity of specific knowledge and skills (learning objectives) that the student has mastered. These tests, known as criterion or objective referenced tests, are most useful for diagnosing specific strengths and weaknesses in individual students, for certifying mastery of minimal competencies, and for evaluating specific educational programs. Objective referenced tests are most often developed by teachers. However, nearly all major test publishers have objective referenced tests available. In some cases, test publishers may provide both objective and norm referenced interpretations for the same test. Increasing numbers of local districts employ testing specialists to develop their own objective referenced diagnostic tests--either for districtwide testing or for local diagnostic use by teachers. Some states--California, Michigan, Oregon, Texas and New Jersey, among others--are also developing objective referenced tests for statewide assessment purposes.

APTITUDE TESTS

Aptitude tests are designed to measure the ability to do school work. These tests can measure the ability to use language, to solve problems, to deal with mechanics and to think in terms of mathematics.


These abilities are not inherent or unchanging. They can be influenced by many factors: experience, family, culture, emotions and health. Aptitude relates to achievement in that abilities provide a basis for achieving. Aptitude influences the amount of learning that takes place. Aptitude test scores are commonly norm referenced or comparative.

A summary of the various test scores commonly used for the different cognitive measures is presented in Appendix B.

ATTITUDE INVENTORIES

Another common test investigates how students feel toward school, or toward a particular subject or person within the educational system. Such inventories are frequently used in evaluating special programs. Seldom are they administered districtwide. While such measures are available from commercial publishers, these inventories are usually developed locally to answer questions of interest to a particular district. They often have low or unknown validity and, even when appropriately used, must be interpreted cautiously and in conjunction with other data.

INTEREST INVENTORIES

These instruments attempt to pinpoint any interest that may influence a student's learning or career plans. Usually a guidance counselor or teacher has responsibility for interpreting the results.

What are the major purposes of testing?

Tests are used for three purposes: instructional management, entry-exit decisions and programmatic decisions. Instructional management and entry-exit decisions require test data for each student. Programmatic decisions can be made based on group data, which allows a sampling of students rather than testing every student.

INSTRUCTIONAL MANAGEMENT

Tests play an important role in instructional management decisions. Data from these tests are used for the diagnosis of students' strengths and weaknesses, student placement, and educational-vocational student guidance.

Diagnosis. Perhaps the most frequent use of tests is to diagnose the educational development of individual students. Here, the teacher is the primary decision maker, although students may also be involved. Teachers often use tests and other performance indicators to assess the student's current development so that the next, most appropriate instructional unit is selected. Tests useful in diagnostic decision making are those that reveal precisely what skills and knowledge the student has or has not mastered.

Placement. If diagnosis determines what instructional units within a course a student needs to master, then placement groups the student according to the next level of instruction best suited to that student's skills. In this case, the decisions are made by administrators, teachers, and guidance counselors who must place each student in the most appropriate course. Math tests, for example, might be used to place students at the appropriate level in a high school math course sequence. A test which indicates student ability in math will ensure that students will not be assigned to courses which are too advanced or too elementary for them. Placement tests usually cover a broader range of knowledge and skills than diagnostic tests and are only used once or twice a year. Diagnostic tests may be used on a day-to-day basis. However, completion of grades and courses are also considered in placement decisions.

Testing is the major method used to identify students who would benefit from placement in special programs (bilingual programs, special education programs, remedial reading and math) or particular educational experiences. Standardized achievement tests are the most frequently used measures for placing students in compensatory education programs. In addition, aptitude and psychomotor tests are often used to identify students who need special education.

Guidance. While diagnosis matches the student to an instructional unit, and placement matches a student to a course, guidance can determine an entire program of study. Here, students and their parents, assisted by guidance counselors, make the decisions. When students decide which educational and vocational program to pursue, they must consider their chances of success and satisfaction. These career planning decisions, typically made in junior and senior high school, are assisted by the use of tests that cover broad academic areas and tell the students where they stand in relation to other students. These test scores can also determine students' strengths and weaknesses which will aid them in making choices. Test scores, of course, should never serve as the sole basis for any guidance decision. The student's academic record, interests and aspirations all merit consideration.

Guidance testing, which is generally determined by school or district administrators and guidance counselors, is usually a secondary result of placement or diagnostic testing.


ENTRY OR EXIT DECISIONS

Tests are also used to determine if a student should be placed in an educational program or to determine if a student has completed a program's requirements. For example, tests may be administered in order to select students for programs with limited enrollment (e.g., college entrance or trade school), or to certify minimum competencies (e.g., for high school graduation or occupational licensing).

Selection. The difference between selection and placement is not always clear. Placement, as previously described, groups students in the most appropriate level of instruction. This is an instructional management decision. Selection refers to a process whereby students are screened for admission to an educational program which has a limited number of participants. Admission is based on who is likely to benefit. Here, the key decision makers are teachers and administrators. A test used for the purpose of selection focuses on students' skills and knowledge considered essential for success in the program, and compares students' relevant skills and knowledge so that those most likely to succeed are identified. Admission to college or into a particular course (for example, airline pilot training) are prime examples of selection. However, test scores are not the sole basis for selection decisions. Previous academic record and other performance criteria may also be considered.

Perhaps the most common use of selection testing is the college entrance examination. Colleges require a specific entrance examination, and interested students register with test publishers who carefully control the administration of the tests at various locations across the country.

Certification. Tests often play an important role in certifying acceptable minimum levels of educational development in students. For example, a teacher might use a test to certify mastery of beginning verbal skills required for completion of a certain course. Or, a district administrator applying Board of Education graduation standards might use an examination in order to test a student's mastery of minimally acceptable skills. Or, members of a certain technical profession might use a test to certify competence in that profession. Since, in each case, those taking the exam must pass the test to be certified, the test must focus specifically on clearly stated minimal competencies.

PROGRAMMATIC DECISIONS

A third use of tests is to assist in program planning. In this instance, test data may be helpful in providing the basis for developing a new program, allocating funds or evaluating existing programs. Such testing falls into three categories: survey assessment, formative program evaluation and summative program evaluation.

Survey Assessment. Probably the most common use of testing in education is to survey student achievement and analyze trends over time in order to assist in program planning. This kind of testing is usually designed to raise issues for further investigation. For example, the test results might prompt such questions as, why are math scores gradually declining in the district (or state or nation)? Or, why are reading scores of fourth graders consistently below national averages while those in other grades are above average? The test data are used to identify which aspects of the educational system need to be more thoroughly investigated as well as possible reasons for unsatisfactory performance. For this purpose, achievement test scores--sometimes from random samples of students--are gathered annually, then averaged across the entire school, district or state, and used to indicate the level of student development. In order to show trends, test scores are frequently compared from year to year. This information then becomes a basis for setting educational policy and allocating funds. Typically, educational administrators are the primary decision makers, but they must justify the decisions to the ultimate decision maker, the taxpayer. Tests used to assess an educational program must cover broad content and skill areas in order to provide valid information for program changes.

Formative Evaluation. In formative evaluation, the goal is to determine which instructional units or features of a specific educational program (e.g., remedial reading) are effective and which need revision. In this instance, tests are used to measure what the students learn in a specific program and the results are used to help shape or revise the program during its formative stages.

Summative Evaluation. Summative evaluation reveals a program's overall merit, and suggests whether a program should be continued, terminated, or expanded. Tests designed to assess knowledge gained from a program are an important part of such an evaluation. Teachers; program, building or district administrators; and the public, represented by the board of education, may be involved in summative evaluation decisions. Tests may be given both before and after instruction, with retesting after an interval to determine the student's retention of knowledge.

It should now be obvious that tests are used for many different purposes in education. Many decisions using test data affect individual students, while other decisions affect whole groups. The implications of these decisions vary: some have far-reaching, long-term effects, others do not. Test data can be invaluable, but each decision must be carefully made.

What are the limitations of tests?

Test users should consider that tests represent only one of many types of performance indicators. In the classroom, day-to-day classroom activities and classwork represent important and valuable sources of information about student development that should be used to supplement test information in making educational decisions. Tests are also supplemented with professional teacher judgments.

Tests are designed for certain uses; a single test cannot serve all purposes. Tests are limited in terms of the range of decisions they can help with. Generally, a test is capable of assisting in one or two of the decisions previously discussed. The key to using tests effectively is to know what decision is to be made, to determine what material needs to be tested to aid that decision and to be certain that the test used actually covers that material.

Tests are also limited in the material they cover. Generally, tests cover only a sample of the content or skills taught. It is almost never feasible, both in terms of time and money, to test every aspect of the subject matter taught. As a result of this sampling procedure, as well as uncontrollable factors such as motivation and fatigue, test scores are subject to some variability. That is, if the same test were taken twice by the same student, the score might vary slightly due to the imprecision of the test. Therefore, a score should seldom be seen as completely precise or unchanging. Rather, it should be seen as a general performance index.
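Measurement specialists quantify this variability with the standard error of measurement. The formula below is a standard psychometric convention rather than something presented in this booklet, and the numbers used are hypothetical.

```python
# Standard error of measurement (SEM), a standard psychometric formula:
#   SEM = SD * sqrt(1 - reliability)
# where SD is the test's standard deviation and reliability is its
# reliability coefficient. Values below are hypothetical.

import math

def standard_error_of_measurement(sd, reliability):
    return sd * math.sqrt(1 - reliability)

sem = standard_error_of_measurement(sd=15, reliability=0.91)  # 4.5
score = 100
# Read the score as a band, not a point: roughly 95.5 to 104.5 here.
print(f"Score {score} is best read as about {score - sem:.1f} to {score + sem:.1f}.")
```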



Another limitation of tests is that they are easy to misuse. They are readily available and relatively easy to construct, especially if quality is disregarded. Misuse can only be avoided by knowing precisely how the test score is to be used and by selecting or building a test specifically designed to serve that purpose.

Who is responsible for initiating testing?

Often it is assumed that tests are initiated, for the most part, by teachers who need information to improve instruction. This is generally true, however, mainly of teacher-made tests and curriculum-related tests. It is not the case with most standardized tests or district and state-developed tests. Decision makers at all levels--federal, state, and district--need information from these tests.

At the federal level, the primary impetus for testing comes from federally-funded special programs, which usually require the evaluation obtained by using standardized achievement testing. Title I of the Elementary and Secondary Education Act, which provides funding for compensatory education, is a case in point. As the largest single item in the United States education budget, Title I programs are subjected to rigorous evaluation to demonstrate effectiveness. Although current Title I evaluation procedures require local programs to either use standardized tests or the combination of nonnormed tests and a standardized test, specific recommendations for which particular tests to use are carefully avoided.

At the state level, the most common reasons for testing are statewide assessment for accountability, minimum competency testing, and evaluation of state-funded special programs. Legislators, who wish evidence that schools are doing the job they're being funded to do, often call for statewide assessment testing. The late 1960's saw many such assessment programs established. Following the state assessment movement was the public outcry for students to achieve certain minimum competencies before high school graduation. In response, at least 38 states have enacted legislation requiring minimum competency testing. Evaluation of state-funded special programs also provides an impetus for state-level testing.

Generally, federal and state regulations allow state and local education agencies considerable latitude in setting their own testing procedures. For example, although Title I evaluation requires the use of standardized tests, many different standardized tests are available. Although states may put some limitations on which tests are acceptable, final selection is generally a local decision.

Most district-initiated testing is done to ensure accountability, to place students in special programs, to evaluate program results, and to make instructional management decisions. Typically, the district decides which test is to be used for evaluating federal- and state-funded special programs. District level testing policy beyond that required by federal and state regulations is determined by many factors: public pressure for accountability, teacher and administrator demands that tests be reflective of program goals and content, pressures from teachers' associations to avoid using student test results in teacher evaluation, and requests from teachers and administrators to reduce the amount of testing. District administrators and school boards are frequently in a quandary when establishing a testing program that responds to these conflicting pressures. At the building level, the amount of additional testing beyond district requirements varies greatly. Generally, districts allow schools considerable autonomy, and the principal's perspective on testing can be a major influence.

At the classroom level, teachers as individuals or teams often conduct additional testing at their discretion. Some teachers employ comprehensive diagnostic systems, particularly in the basic skill areas of reading and math. They also may administer unit tests which accompany textbooks. Teachers generally need more diagnostic test information on lower performing students than on others.

In general, the frequency of tests is determined by federal, state and district mandates for evaluation, accountability, student placement and certification rather than by requests from teachers or local administrators.

Who constructs tests?

Until recently, tests were almost exclusively constructed by either the classroom teacher or the commercial test publisher. But within the last 15 years, state departments of education and local school districts have begun to develop their own tests.

Classroom teachers generally construct tests to measure the specific instructional content being taught. These tests often take the form of a short weekly quiz, a mid-term examination or an end-of-the-course test. The test results are primarily used for grading or for helping students identify specific course content which they have not mastered.

The most frequently used tests developed by commercial publishers are the standardized achievement and aptitude measures. These tests require careful development of questions as well as extensive administration to establish interpretable test scores. During development, tests are administered to a carefully selected sample of students in a specified age or grade level. The results are used to establish scales which permit comparison of a student's score to national averages. The development of these "normative" scales is a costly process.

Commercial publishers also develop criterion or objective referenced tests. These tests are not tied to any one textbook series, but are focused on particular knowledge or skills that can be taught by a variety of methods or materials. These tests, for example, may measure a student's ability to add whole numbers regardless of the textbook or method of instruction used.

Publishers also develop tests which are contained in or related to specific textbooks. These tests, which may be used at the end of a unit, are tied to information in a particular text or set of curriculum materials.

The tests developed by state departments of education and local school districts are frequently designed to measure the school's success in teaching course content considered important in that state or district. Publishers' tests, based on the content most frequently taught across the nation, may not exactly match local curriculum content. Such tests should be carefully screened and selected to match local needs.

What are the costs of testing?

The actual cost of testing varies with the type of test used and its origin. For instance, objective tests, scored by counting the number of test items answered correctly, and performance tests, which require the observation and evaluation of a process or product by a qualified judge, differ in cost. These tests may be purchased from a test developer or test publisher, or they may be developed by local educators for local use. The costs of testing depend on the combination of these factors.

In all cases, there are three categories of costs: developmental costs, costs of test administration, and test scoring costs.

When an objective test is purchased, developmental costs include (1) the cost of time required to plan the testing context, which includes thinking through the decision to be made and the kind of test needed, (2) the cost of time to review available tests, and (3) the costs incurred in actually purchasing test booklets, answer sheets, administration manuals, etc. Test administration costs will include time to (1) plan test administration, (2) train test administrators, (3) coordinate distribution of materials, and (4) administer the test and collect materials. Test scoring costs include (1) the time required to count the items answered correctly or (2) costs of optical scanning and computer scoring of answer sheets. There are also costs involved in disseminating the scores and interpretative information to the decision maker in a timely manner.

When an objective test is to be developed locally for local use, developmental costs include time required to (1) plan the test context, (2) write the test items, and (3) assemble the final test. If the test is to be used for very important large group decisions such as certifying proficiency for graduation, additional developmental costs will be incurred to pilot test the items before they are used in order to ensure a high quality test. Test administration and scoring costs will be the same as those previously discussed.

When a performance-based test is to be used, the scoring becomes more expensive because qualified judges must be used to score the test. When such a test is to be purchased, developmental costs include (1) time to plan the test context, (2) time to locate, review and evaluate available test exercises and scoring (rating) procedures, and (3) the costs of purchasing test materials. Test administration costs will generally be the same as those involved in the objective test. Test scoring costs, when such tests are used on a large scale, include time required to (1) plan scoring procedures, (2) select judges, (3) train judges, (4) score the test, and (5) process scores for the decision makers. Individual classroom use of these tests requires only planning the scoring procedures, scoring the test, and preparing results.

And finally, when a performance test is to be locally developed for local use, the test developer must (1) plan the test context, (2) develop exercises, (3) plan scoring standards and procedures and (4) conduct quality control research (for large-scale use). Test administration and scoring costs will be the same as those discussed above.
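The way these categories combine can be sketched with a simple cost model. The dollar figures below are placeholders invented for illustration, not estimates from this booklet; the point is only that development, administration, and scoring add up differently for different kinds of tests.

```python
# A toy cost model for the three cost categories discussed above.
# All dollar amounts are hypothetical placeholders.

def total_testing_cost(development, administration, scoring):
    return development + administration + scoring

# Purchased objective test: low development and scoring costs.
purchased_objective = total_testing_cost(
    development=500,    # planning time plus booklets and answer sheets
    administration=300, # training administrators, distributing materials
    scoring=150,        # machine scanning and score reports
)

# Locally developed performance test: exercises, pilot work, and judges.
local_performance = total_testing_cost(
    development=2000,   # writing exercises, scoring standards, pilot testing
    administration=300,
    scoring=1200,       # selecting, training and paying qualified judges
)

print(purchased_objective, local_performance)
```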

The point is that there are real and significant costs associated with sound (fair and useful) testing. However, money spent for good assessment will pay dividends in the form of high quality educational decisions.


Current Testing Issues

In view of the variety of test purposes and users previously discussed, there are several important issues that need to be addressed.

Issue 1: How do teachers view testing?

Throughout the educational community there is growing concern about the role of testing in the schools. At all levels--federal, state, and local--educators are aware of the possibility of overtesting. Administrators are reviewing testing programs to ensure that the fewest number of tests are being used and that the purposes for testing are clearly defined. Teachers as well as other educators are opposed to tests which damage a student's self-concept, perpetuate negative expectations, are biased against economically disadvantaged students or students with different cultural or linguistic backgrounds, or which are used as the basis for inappropriate comparisons of students or schools. Many educators are also opposed to the use of standardized tests for teacher evaluation and are particularly concerned that tests not be used as the sole criterion for important educational decisions. They are, however, supportive of testing to diagnose learning needs, prescribe instructional activities and measure progress in the curriculum content using tests prepared or selected by classroom teachers. Two major teachers' associations, the National Education Association and the American Federation of Teachers, have taken steps to investigate the issue of testing. For example, the National Education Association last year published two booklets, Parents & Testing and Teachers & Testing (see bibliography), to assist its members in understanding testing issues. The American Federation of Teachers is in the process of preparing a handbook to improve understanding and use of standardized tests in the classroom.

Issue 2: Why are achievement test scores declining?

Since the mid-1960s there has been a well-publicized decline in the achievement test scores of students in the United States. This decline has been found in nearly all subjects and all regions of the country, and in almost all national testing programs, ranging from college entrance tests to elementary school achievement test batteries. Although precise amounts of score decline are difficult to determine, declines tend to be more pronounced through the higher grade levels and there seem to be differences in decline between male and female students. As we move into the 1980s, there is some evidence that the decline may have leveled out, but year to year test score patterns will have to be carefully observed in the future.

During the mid and late 1970s, a great deal of educational research focused on reasons for the decline. Early studies dealt with explanations related to test characteristics, hypothesizing that the decline might be a technical, rather than a real, phenomenon. These hypotheses were not supported,1 leading to the conclusion that the decline was a real and significant socio-educational fact. Subsequent efforts focused on social-educational reasons for the decline.

1See Modu, C.C. and J. Stern. The stability of the SAT score scale. Research Bulletin BB-75-9, April 1975. Educational Testing Service, Berkeley, CA.

One example is the work done at CEMREL, a research institute in St. Louis. In this study (consult the annotated bibliography for the complete reference), researchers collected and summarized evidence on the test score decline and sought possible causes in the school environment. Information was gathered and interpreted on the potential role of such factors as curriculum, course enrollments, and amount of schooling, as well as television watching and family background and environment. The researchers concluded that there is no evidence of changing teacher qualifications, and school organization and student motivation do not seem related to the decline. However, there is evidence of declining dropout rates accompanied by increasing absenteeism. This has the effect of leaving more low-achieving pupils in school. There is also evidence of a pronounced decline in the number of and enrollment in academic and college preparatory courses in high schools. In addition, some evidence was found that such non-school factors as TV watching, drug use, and family structure are potential contributors to the decline. From these initial exploratory efforts, the researchers concluded that there are many causes for the score decline and much added research is needed to provide a more concrete explanation for achievement drops.

Two additional attempts to find explanations for the declining college admission test scores were conducted by the College Entrance Examination Board (CEEB) and The American College Testing Program (ACT). CEEB formed an advisory panel of noted scholars and educators to examine the decline in Scholastic Aptitude Test (SAT) scores. After a year of study, the committee concluded that the decline can probably best be explained in terms of changes in the population of students taking this particular test and changes in the socio-educational fabric of the United States. Since SAT and ACT tests are taken by a select group of students, the panel concluded that the current SAT tested group is more broadly representative of American youth today than it was a decade ago when colleges were being more selective. Factors discovered to influence the socio-educational environment included increasing electives in high school, declining seriousness of educational purpose in society, television watching, changing family roles, the social unrest of the early 1970s, and motivation of students.

ACT assembled evidence of declining ACT Assessment Program test scores and combined it with evidence from other national testing programs to conclude, as had CEEB, that the college bound student population is changing. With more middle and low achieving students now considering college and participating in college entrance testing--because of available opportunities and financial aid--the effect has resulted in a lowering of the average test score. In this instance, the test score decline could be interpreted as evidence of increasing diversity in educational opportunity--a positive statement--rather than an indictment of the educational system.

The conclusion from these studies is that there is no single explanation for the decline in test scores. Rather, a large number of complex factors has caused the score patterns we now observe. However, even in the absence of a clear explanation for the decline, the publicity it has received has had a pronounced impact on schools. That impact has been felt in testing and instruction. Teachers have carefully scrutinized the tests used to show declining achievement and have challenged their appropriateness. And in response to the demand for alternatives, newly developed and specifically focused minimum competency tests covering relevant school and life skills have emerged. The effects on instruction have also been profound. Much more attention is being given to basic skills instruction in reading, writing and math from elementary school through college.

Issue 3: What is the meaning of the "Truth in Testing" legislation?

The debate over "truth in testing"resembles many of the arguments overconsumer protection laws in the1960s. At the center of the debateare two definitions of "fairness." Onone side are the proponents ofdisclosure legislation, who argue thatas a matter of simple fairness

students should be able to see thetest instrument (including thequestions, the answers and relatedtest data) used to make importantdecisions about their lives.Proponents feel that tests are socialpolicy instruments that should, in ademocratic society, be open toscrutiny. The opponents of suchlegislation argue that test securityinsures fairness, so disclosure of thetests will, by breaching security,affect the validity of the tests,increase the costs and lessen collegeadmissions officers' confidence instandardized tests, all of which willmake fair decision-making moredifficult. They feel that securestandardized tests give everyone anequal chance and are more democraticinstruments for policy making than arealternatives that permit theintroduction of various biases.

Proponents of the legislation believe that the principle of fairness outweighs technical objections to open testing. They contend that security is not essential for test validity and that the burden of proof rests upon the test companies. Specifically, they ask that the test companies prove their allegations that full disclosure will weaken test validity, increase development costs, exhaust the number of test questions that can be asked, erode confidence in tests and lead to unfairness in decisions that involve test scores.

Opponents of the legislation, on the other hand, argue that the burden of proof rests upon the supporters of testing legislation. They ask for proof that a substantial problem with test use or abuse exists, that the legislation will correct any misuses and abuses, that the added complexity of test development required for open testing is necessary and that substantial benefits will accrue to individuals and society through test disclosure.

CURRENT LEGISLATIVE ACTION

The first law requiring test publishers to disclose information to test takers and the public was California's SB 2005, enacted in September 1978. The law applies to any standardized test used for postsecondary education admissions selection of more than 3,000 students--in other words, such tests as the Scholastic Aptitude Test (SAT) and the American College Testing (ACT) Assessment. The law requires that a test's sponsor must file with the California Postsecondary Education Commission various kinds of data describing the test's features, limitations and use; must provide test takers with various kinds of information about the test and how it will be used; and must submit data about the administration of the test, the income realized and the expenses incurred in its administration.

New York enacted a similar law in 1979. Like the California law, it applies only to tests used for postsecondary or professional school admissions and requires test publishers to file background reports about their tests and provide test takers with test information. In addition, the New York law requires the test agencies to file the contents of the tests with the New York Commissioner of Education within 30 days of release of scores, and, thereafter, to provide them to test takers upon request.

In addition to these laws, similar bills--some requiring total disclosure of the test (such as the New York bill stipulates)--have been filed in Florida, Maryland, Ohio, Texas, Colorado, Massachusetts, Pennsylvania and New Jersey, although none have, as yet, been enacted. Other state bills appear to be imminent. Two federal bills were introduced in 1979--the "Truth in Testing Act of 1979," known as the Gibbons Bill or H.R. 3564, and the "Educational Testing Act of 1979," known as the Weiss Bill, or H.R. 4949. The former would cover achievement and occupational tests as well as admissions tests, but would not require total disclosure; the latter would be limited to admissions tests but would require total disclosure.

All but two of the bills introduced apply to postsecondary education admissions testing only. They do not apply to standardized achievement tests used in public elementary and secondary schools, nor to personality, diagnostic, or minimal competency exams. An exception is the Massachusetts bill, which requires total disclosure of its competency tests. With the exception of the Gibbons Bill, these bills would not apply to occupational testing, civil service or licensing examinations. The New Jersey bill, however, would apply to all tests "developed by a test agency for the purpose of selection, placement, classification, graduation or any other bonafide reason concerning pupils in elementary and secondary, postsecondary or professional schools."

The arguments surrounding test disclosure legislation are compounded by disagreements about the role and power of testing companies and the quality of standardized tests used primarily for predicting student performance. Table 1 summarizes those arguments which deal with the issue of test disclosure.2

Issue 4: What is the meaning of test bias?

Perhaps the most difficult social, educational, technical, and legal issue facing educators in general, and measurement specialists in particular, is the issue of test bias. Bias is such an important issue because it arises from our aspirations to achieve two highly valued goals. First, we have emerged from the 1970s with an ever growing awareness of the wide variety of cultures in our society and a desire to accommodate them. Second, we face the always present challenge of conducting good quality (fair and useful) assessment in our schools. These goals give rise to the need for testing methods that take into account cultural and linguistic differences in students.

Meeting both priorities is a difficult challenge because we often lack the combination of cultural or linguistic knowledge and test development skills required to do the job. The equation is complex indeed. On one hand we have an examinee who brings to the test a language and set of cultural experiences that may represent any of hundreds of cultures. And, on the other hand, we have a test prepared by test makers (teachers or test publishers) who must make certain assumptions about language and cultural patterns in order to prepare test items. Claims are often made that tests are based on the language and culture of white, middle-class, suburban children and are inherently unfair to students who experience other cultural settings. Claims of ethnic, cultural, socio-economic and sex bias are widespread.

2The information in the table is taken from Searching for the Truth in "Truth in Testing" Legislation: A Background Report. Much of the above material has been abstracted from that report; those readers who wish to pursue the issues outlined are encouraged to obtain a copy of this publication. The report is available from ECS, 1860 Lincoln Street, Denver, Colorado 80295. The cost is $6.50 per copy.



TABLE 1

Debates For and Against Test Disclosure Legislation

Pro-Legislation Sentiments

Grade inflation and test misuse have combined to give tests too much influence in admissions decisions.

A commitment to "truth in lending," "truth in advertising," sunshine laws and consumerism should extend to an area as important as admissions testing.

Legislation will promote greater accuracy and validity of tests.

Legislation will encourage use of multiple criteria in the selection process.

The admissions test industry is not accountable to anyone.

Students can learn about tests and test strategy from examining test questions.

Security need not be an issue; new measurement technology could enable testers to eliminate the problem.

Development costs would not increase as much as testers suggest.

Items now available only to expensive coaching schools would be available to everyone, benefiting poor students.

There are many solutions to the comparability problem; the laws do not adversely affect comparability measurement.

The fairness issue takes precedence over technical matters.

Disclosure will help admissions officers as well as students.


Anti-Legislation Sentiments

Higher education's need for students has lessened the importance of admissions test scores.

Test publishers and higher education institutions already provide ample information and protection; analogies to consumer movements are misleading.

There are several competing public interests at stake; critics have not established an overriding need for legislation.

Legislation calling for full disclosure will lower the quality of tests.

Most institutions already use multiple criteria, and test agencies encourage the practice.

The industry is accountable to the psychometric profession, market forces, and the academic community.

Federal legislation would constitute a dangerous, if not unconstitutional, federal incursion into education.

Legislation interferes with the First Amendment right of colleges to determine whom they want to teach.



On one hand, we have an examinee who brings to the test a language and set of cultural experiences that may represent any of hundreds of cultures. On the other hand, we have a test prepared by test makers (teachers or test publishers) who must make certain assumptions about language and cultural patterns in order to prepare test items. Claims are often made that tests are based on the language and culture of white, middle-class, suburban children and are inherently unfair to students who experience other cultural settings. Claims of ethnic, cultural, socio-economic and sex bias are widespread.

Currently, test publishers and educational researchers are devoting considerable effort to clarifying the definitions of and reasons for test bias, and to determining how to deal with its existence. For instance, in 1980 a National Symposium of Educational Research sponsored by Johns Hopkins University was devoted to the topic of test item bias methodology.

DEFINITIONS

Although no single technically correct definition of test bias exists, one which repeatedly appears in the writings of researchers and publishers is that a test is biased if individuals from different groups who are equally able do not have equal probabilities of success. For example, on an achievement test, if students in one racial group score consistently lower than students from another group, and consistently lower than would be expected from their observed classroom performance, the test may be said to be biased against that group. Similarly, on a test used to select students for college admission, if students from one racial group score consistently lower than students from another group, but the performance of the two groups of students in the college program is comparable, the test may be said to be biased against the lower scoring group.
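This definition can be made concrete with a small worked example. The sketch below, in Python, compares the average test scores of two groups after first matching students on an outside indicator of ability (here, an observed classroom performance level). All group labels and scores are hypothetical figures invented for illustration, not data from any actual test.

# Minimal sketch of the "equal ability, equal probability of success"
# definition of test bias. Each hypothetical record is
# (criterion level from observed classroom performance, group, test score).
records = [
    ("high", "A", 82), ("high", "A", 85), ("high", "B", 71), ("high", "B", 74),
    ("avg",  "A", 65), ("avg",  "A", 63), ("avg",  "B", 55), ("avg",  "B", 57),
]

def mean(xs):
    return sum(xs) / len(xs)

for level in ("high", "avg"):
    a = [s for lv, g, s in records if lv == level and g == "A"]
    b = [s for lv, g, s in records if lv == level and g == "B"]
    # Students judged equally able by the outside criterion should score
    # about the same on an unbiased test; a persistent gap is suspect.
    print(level, "ability: group A mean =", mean(a), "group B mean =", mean(b))

Under the definition above, equally able students should score about the same regardless of group; a gap that persists within every matched ability level is the pattern that would support a claim of bias.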

Several other definitions have been suggested. For example, one definition is that a test is biased if the different groups tested do not achieve the same average score on each item of the test. Another definition holds that a test is biased if two groups do not achieve similar total test scores; this definition allows for differences in performance on different items. These definitions assume that the groups are alike in the knowledge and skills measured and that any differences in performance are due to unfair items. These definitions have given rise to many public complaints of unfairness. However, it is critical to keep in mind that, given our history of discriminatory educational practices, differences in performance may be caused by factors other than biased test items.

Another definition does not require that groups have the same ability or skill, but does require that differences hold true for all test items. That is, if differences are not uniform, it is assumed that the test items are measuring different things in the various groups.

Other kinds of bias are not inherent in the test but, rather, relate to how a test is used. For example, bias could be shown to occur if a test were used to make a selection decision simply because the test is correlated with a third variable that is relevant to and predictive of job performance, even though the test itself has not been established as relevant to job performance. The use of a test could be biased if it assessed only one prerequisite skill and ignored equally predictive and important skills for which the pattern of group performance was noticeably different.


APPROACHES TO REDUCING TEST BIAS

It is important to point out that there is no clear-cut "solution" to the problem of test bias. No "culture-free" test has yet been devised, nor is the state of the art such that one can be developed. The best that can be done is for test makers to make vigorous efforts to continuously screen tests for potential bias, and for test users to be sure that test results are used fairly in all cases.

One approach commonly used to avoid test bias is to have a panel of persons broadly representative of the various racial, ethnic and sexual groups that might be taking the test review the test questions. This helps ensure that test questions are not biased and do not reflect only the experiences or culture of a particular group. This procedure should be undertaken not only when a test is first written, but periodically thereafter, so that changes in our culture do not make some questions obsolete for some groups.

Another approach is to carefully examine the performance of various groups on the test as a whole as well as on individual questions. In this way, unusual variations in performance among the groups can be pinpointed, and the test questions reexamined in an effort to detect any characteristics or wording that would seem to make them biased toward a particular group. For publishers to conduct these studies, school districts must be willing to provide the demographic data necessary to perform the analyses.
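A crude version of this item-level screening can be sketched in a few lines of Python. Established differential-item-functioning procedures (the Mantel-Haenszel statistic, for example) condition on examinee ability before comparing groups; this simplified sketch only flags items whose group gap departs sharply from the test-wide gap, and all of the response data are hypothetical.

# Rows are students, columns are items scored 1 (correct) or 0 (wrong).
group_a = [[1, 1, 0, 1], [1, 0, 1, 1], [1, 1, 1, 0]]
group_b = [[1, 0, 0, 1], [0, 1, 0, 1], [1, 0, 0, 0]]

def p_values(matrix):
    # Proportion of students answering each item correctly
    n = len(matrix)
    return [sum(row[i] for row in matrix) / n for i in range(len(matrix[0]))]

pa, pb = p_values(group_a), p_values(group_b)
overall_gap = sum(pa) / len(pa) - sum(pb) / len(pb)
for i, (x, y) in enumerate(zip(pa, pb)):
    # An item whose gap departs sharply from the test-wide gap deserves
    # a second look at its wording and content.
    flag = "  <- review wording" if abs((x - y) - overall_gap) > 0.25 else ""
    print(f"item {i + 1}: p(A) = {x:.2f}, p(B) = {y:.2f}{flag}")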

Given the large number of languages and cultures in some educational environments, this process of careful test review and development will require significant time, money and patience.

Issue 5: What are the legal issues related to IQ testing?

People have disagreed, and will probably continue to disagree, about whether or how "intelligence" can be accurately and systematically measured. Some argue that evidence of intelligence can be reduced to a set of tasks which can be systematically measured through some form of performance or paper-and-pencil test. Others argue that traits such as common sense, wit, creativity, resourcefulness, ambition, and sensitivity are all important dimensions of intelligence and can never be adequately quantified in a test score.

IQ tests have historically been used to attempt to assess a child's aptitude for performance in school. These tests are designed to assess skills that are perceived to be prerequisites to learning, such as verbal reasoning and spatial perception. Thus, high scores on the tests are often used to place children in classes for the gifted. Conversely, low scores are often used to place children in special education classes for the mentally retarded. The most commonly used individually administered IQ tests, the Stanford-Binet and the Wechsler Intelligence Scale for Children (WISC), are forms of "performance tests." Children are given a set of tasks to perform and are judged on the speed and accuracy with which they perform them. One important assumption behind the tests is that "intelligence" is distributed in society along a normal curve. This means that a small number of people in the society will be very bright or very dull, and the majority will cluster around a point defined as average intelligence.
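The normal-curve assumption can be stated in concrete numbers. Modern IQ scales are conventionally scored with a mean of 100 and a standard deviation of 15; those two figures are the usual convention, not values taken from this booklet, and the short Python sketch below simply shows what the assumption implies about how scores cluster.

from math import erf, sqrt

def share_below(x, mean=100.0, sd=15.0):
    # Proportion of a normal distribution falling below score x
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

print("between 85 and 115:", share_below(115) - share_below(85))  # about 0.68
print("between 70 and 130:", share_below(130) - share_below(70))  # about 0.95
print("below 70:", share_below(70))  # roughly 0.02; the low tail at issue in EMR placement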

Since the way in which IQ test scores are used has significant consequences for children (e.g., placement in classes for the retarded), legal challenges have focused both on the nature of the tests and the ways in which the results are used. The most significant legal precedents in IQ testing come from a 1979 Federal District Court decision in a California case (Larry P. v. Riles, No. C71-2270 RFP, N.D. Cal. Decision 10/16/79) and a 1980 Federal District Court decision in an Illinois case (Parents in Action on Special Education v. Hannon, No. 74C3586, N.D. Ill. Decision 7/7/80).

The Larry P. v. Riles decision held that California school officials unlawfully discriminated against black children by using racially and culturally biased tests to classify and place them in classes for the educable mentally retarded (EMR). Judge Robert F. Peckham provides the following summary of his 131-page opinion:

"This court finds in favor of plaintiffs, the class of black children who have been or in the future will be wrongly placed or maintained in special classes for the educable mentally retarded, on plaintiffs' statutory and state and federal constitutional claims. In violation of Title VI of the Civil Rights Act of 1964, the Rehabilitation Act of 1973, and the Education for All Handicapped Children Act of 1975, defendants have utilized standardized intelligence tests that are racially and culturally biased, have a discriminatory impact against black children, and have not been validated for the purpose of essentially permanent placements of black children into educationally dead-end, isolated, and stigmatizing classes for the so-called educable mentally retarded. Further, these federal laws have been violated by defendants' general use of placement mechanisms that, taken together, have not been validated and result in a large over-representation of black children in the special E.M.R. classes.


"Defendants' conduct additionallyhas violated both state and federalconstititional guarantees of the equalprotection of the laws. Theunjustified toleration ofdisproportionate enrollments of blackchildren in E.M.S. classes, and theuse of placement mechanisms,particularly the I.Q. tests, thatperpetuate those disproportions,provide a sufficient basis for therelief under the CaliforniaConstitution. And under the federalConstitution, especially asinterpreted by the Ninth Circuit Courtof Appeals, it appears that the sameresult is dictated.

"Moreover, there is another basisfor the federal constititionalruling. Defendants' conduct, inconnection with the history of I.Q.testing and special education inCalifornia, reveals an unlawfulsegregation intent. This intent wasnot necessarily to hurt blackchildren, but it was manifested, interalia, in the use of unvalidated andracially and culturally biasedplacement criteria. This intent,consistent only with an impermissibleand unsupportable assumption of higherincidence of mental retardation amongblacks, cannot be allowed in the faceof the constitutional prohibition ofracial discrimination."

Relief granted to plaintiffs included an injunction against defendants' use of standardized intelligence tests for EMR identification or placement without court approval and an order that defendants monitor and eliminate disproportionate EMR placement of black children. The court also ordered the reevaluation, without the use of such tests, of all black children who had been placed in EMR classes, as well as supplemental education for all children found to have been misclassified.

The trigger for the Larry P. v. Riles court's legal scrutiny of IQ tests and test bias was the disproportionate number of black children placed in EMR classes as a result of IQ tests and the serious injury of EMR placement to misclassified children. The court found that the EMR classes were "conceived of as 'dead-end classes'" for children incapable of learning the regular curriculum. Children in these classes tended to fall further and further behind children in regular classes, since they were provided with instruction that deemphasized academic skills in favor of adjustment. Disproportionate numbers of black children had been placed in California's EMR classes. For example, the evidence showed that in the 20 districts accounting for 80 percent of the enrollment of black children in 1976-77, black students comprised about 27.5 percent of the student population and 62 percent of the EMR population. This disproportion cannot be explained by chance, since "there is less than one in a million chance that the overenrollment of black children and the underenrollment of nonblack children in the EMR classes in 1976-77 would have resulted under a color-blind system of placement."
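Claims like "less than one in a million" rest on calculations of the following kind. The sketch below uses a normal approximation to the binomial distribution; the figure of 1,000 total EMR placements is hypothetical, chosen only to illustrate the arithmetic, since the opinion's exact counts are not reproduced in this booklet.

from math import erf, sqrt

# Hypothetical totals: black students are 27.5% of enrollment, there are
# 1,000 EMR placements, and 620 of those placements are black children.
p, n, observed = 0.275, 1000, 620
mean, sd = n * p, sqrt(n * p * (1 - p))  # binomial mean and standard deviation
z = (observed - mean) / sd
tail = 0.5 * (1 - erf(z / sqrt(2)))  # normal approximation to P(count >= observed)
# With a disparity this large the tail probability underflows to zero in
# floating point, far below the one-in-a-million threshold.
print(f"z = {z:.1f}; chance under color-blind placement ~ {tail:.2e}")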

Although California law required IQ test scores to be "substantiated by" other evidence, such as adaptive behavior (the ability to engage in social activities and perform everyday tasks), the court found that the "magic of numbers" was strong and that the available data suggested very strongly that the IQ scores were a pervasive influence in the placement process. The entire placement process often revolved around the demonstration of IQ.

In an introductory discussion of intelligence tests subtitled "The Impossibility of Measuring Intelligence," Judge Peckham noted that the expert testimony overwhelmingly rejected the concept that IQ was an objective measure of innate, fixed intelligence:

"Defendants' expert witnesses,even those closely affiliated with thecompanies that devise and distributethe standardized intelligence tests,agreed, with one exception, that wecannot truly define, much lessmeasure, intelligence--I.Q. tests,like other ability tests, essentiallymeasure achievement in skills coveredby the examinations. The fact thatIQ tests are developed according tothe plausible but unproven assumptionthat intelligence is distributed inthe population in accordance with anormal statistical curve--cautions usto look very carefully at what thetests do measure and exactly how theywere validated for determining mentalretardation."

Noting that the disparities in EMR placement of black children are also reflected historically in black performance in general on standardized intelligence tests, Judge Peckham examined three arguments used to explain the disparity in IQ scores: the genetic argument, the socio-economic argument, and cultural bias. Judge Peckham rejected the genetic argument because defendants were unwilling to admit any reliance on it for policy-making purposes and because the rather weak evidence in support of this explanation tends to rest on the disparities in the IQ scores, which overlooks possible bias in the tests themselves. Judge Peckham also rejected the socio-economic argument. Testimony and studies showed that the relatively low scores of black children do not result from mental disease attributable to the physical conditions of poverty. School performance, however, does vary somewhat according to socio-economic status.

On the other hand, Judge Peckham found the plaintiffs' evidence of racial and cultural bias in the IQ tests more persuasive. "The first important inferential evidence is that the tests were never designed to eliminate cultural biases against black children; it was assumed, in effect, that black children were less 'intelligent' than whites." He later noted: "The tests had been adjusted, for example, to eliminate differences in the average scores between the sexes, but a comparable effort was not made and has never been made for black and white children."

The court also found that Wechsler's admission in 1944 (that the WISC's standardization was based upon white subjects only and that those norms cannot be used for the nonwhite population of the United States) applies with equal force to other standardized tests. These problems were not solved by the restandardization of the Stanford-Binet and WISC-R intelligence tests. The court went on to review a number of indicators that point to the existence of a cultural bias against black children: vocabulary and other linguistic differences, obviously biased items, and more subtle kinds of bias involved in measuring knowledge of white culture. With only one exception, there was general agreement by all sides on the inevitable effect of cultural differences on IQ scores. Put succinctly by Professor Asa Hilliard, black people have a "cultural heritage that represents an experience pool which is never used" or tested by the standardized IQ tests.

In analyzing the requirements of federal statutory law, the Larry P. v. Riles case set legal standards for validation of IQ tests used for EMR placement. Reviewing Title VI of the Civil Rights Act of 1964, the Rehabilitation Act of 1973, and the Education for All Handicapped Children Act of 1975 (EHA), and related case law, Judge Peckham concluded that the approach used in Title VII employment test cases was generally appropriate for allocating the burden of proof for "validation" in the Larry P. v. Riles case. Under this procedure, tests shown to have a discriminatory impact cannot be utilized unless the employer is able to show that any given requirement has a manifest relationship to the employment in question. Judge Peckham noted, however, that the notion of predicting "job performance" cannot be effectively translated into an educational context, given the differing purposes of employers and schools:

"Compulsory attendance ofeducational institutions is requiredby the state, and the schools aresupposed to take children fromdifferent backgrounds and teach themthe skills necessary for adaptationand success in our society. Thispoints out a fundamental differencebetween the use of tests in employmentand education, at least in the earlyyears of schooling. If tests canpredict that a person is going to be apoor employee, theemployer canlegitimately deny that person a job,but if tests suggest that a youngchild is probably going to be a poorstudent, the school cannot on thatbasis alone deny that child theopportunity to improve and develop theacademic skills necessary to successin our society. Assignment to E.M.R.classes denies that opportunitythrough relegation to a markedlyinferior, essentially dead-end, track."

Given this important distinction and federal regulations under EHA and the Rehabilitation Act requiring that tests and other evaluation materials be "validated for the specific purpose for which they are used," Judge Peckham replaced the predictive validity required in employment cases with an alternative kind of validation:

"We are not concerned now withpredictions of performance, but ratherwhether the tests are validated withrespect to the characteristicsconsistent with E.M.R. status and

20 eJ

Page 25: DOCUMENT RESUME ED 196 935 booklet is intended to to used in conjunction dith workshops and seminars conducted by measurement specialists using the training methods described in Training

placement in E.M.R. classes. E.M.R.classes exist 'for people whose mentalcapabilities make it impossible forthem to profit from the regulareducational program.' 'Mentalretardation' is the touchstone, andretardation must make it 'impossible'to profit from the regular classes,even with remedial instruction.Defendents have the burden of showingvalidation of intelligence tests withrespect to these characteristics."

In Parents in Action on Special Education v. Hannon, the presiding judge, Judge Grady, focused sharply on whether the IQ tests in question (WISC, WISC-R, and Stanford-Binet) are, in themselves, racially biased, and whether use of the tests as a part of the statute-mandated criteria for placement in classes for the "educable mentally handicapped" is racially discriminatory. In summary, the opinion concluded that:

(1) Only one item on the Stanford-Binet and a total of eight items on the WISC and WISC-R are culturally biased against black children, or at least sufficiently suspect that their use is inappropriate. These few items do not render the tests unfair and would not significantly affect the score of an individual taking the test.

(2) When used in conjunction with other statute-mandated criteria for determining an appropriate educational program for a child, these tests do not discriminate against black children in the Chicago schools.

In contrast to the Larry P. v. Riles decision, Judge Grady never reached the question of appropriate legal standards for evaluating compliance with federal law. Instead, Grady presented an exhaustive, item-by-item analysis of questions included in the three tests, found an insignificant number to be biased, and refused to enjoin Chicago's use of the tests as a part of the placement process.

The opinions in each of these cases are readable and informative. Readers interested in more detail and background on the opinions are encouraged to obtain and review copies of the opinions from the respective District Courts.

It is difficult to predict what will follow in the wake of these two opinions. While Judge Peckham in Larry P. v. Riles accepted the contention that IQ tests were biased, Judge Grady in Parents in Action v. Hannon rejected this allegation. Undoubtedly, further litigation will follow. The California Department of Education has already announced plans to appeal Larry P. v. Riles.

It is likely that the legal controversy over use of traditional IQ tests will spur research efforts to develop so-called "non-discriminatory" assessment batteries whose results will more accurately reflect the potential of minority children. One example of such a battery is the "System of Multicultural Pluralistic Assessment," known as SOMPA. SOMPA was developed by a sociologist at the University of California, Riverside, and is designed to provide a far broader picture of a child's potential based on a careful examination of the child's social and cultural background and experiences. It is unlikely that "alternative" IQ measures which are acceptable to critics of IQ tests will be developed and validated quickly.

Issue 6: What are the educational and legal issues surrounding minimum competency testing?

The fundamental purpose behind minimum competency testing is to determine whether students have acquired sufficient proficiency in certain basic and/or life skills to cope with the adult world. Two types of tests exist: tests that measure the basic academic skills of reading, writing and computation, and tests measuring "life skills" on topics such as consumer awareness, health, citizenship, balancing a checkbook or applying for a bank loan.

In some states, the same test is given statewide, whereas in other states each district designs and administers its own test based on locally determined competence.

A 1979 study sponsored by the National Institute of Education investigated 31 state and 20 local district competency testing programs in the United States. An executive summary of that study states:

"Sixteen of the 31 state-levelprograms were mandated by the StateBoard of Education, and 15 wereinitiated by the state legislature.Two of the legislated mandates callfor temporary programs; one StateBoard-initiated program and onelegislated program permit voluntaryparticipation of local schooldistricts. Two other states emphasizethe competency-based instructionalaspects of their programs rather thanthe testing components.

"Of the 20 local programs studied,five developed in states withoutstatewide requirements for minimumcompetency testing. Of the remaining15 districts, eight began institutingminimum competency testing programsprior to state mandates, while sevendistricts implemented programs inresponse to such mandates.

"The majority of programs, bothstate and local, were developed in thetwo to three years since 1976, but theage of programs ranged from 18 yearsto less than one year with ongoingpilot-testing. Fourteen stateprograms have been fully implemented,while 17 are being phased in. Forexample, many state programs areintroducing new graduationrequirements or curriculum changesover a period of years and hence,these programs will not be "in place"until some time in the future. Bycomparison, 13 of the 20 localprograms have already been fully

99

implemented, while seven programs arephasing in mandated changes.

"Programs in only four states havehad litigation associated with them inany way--Delaware, Florida, Maryland,and North Carolina--and the majorityof this activity has occurred inFlorida.

"With respect to goals andpurposes, 14 states cited certifi-cation of basic skills competencyprior to high school graduation as amajor purpose, and two states reportedusing competency achievement as onecriterion for grade-to-grade promotionas a reason for implementing a minimumcompetency testing program. The mostfrequently cited purpose forinstituting such a program was toidentify students in need ofremediation; 19 states reported thispurpose. Curriculum improvement wasmentioned by 10 states as a majorprogram goal. By comparison, 16 localdistricts reported certification ofbasic skills as one reason fordeveloping a minimum competencytesting program; four districts citedthe use of test results, along withother information, to determinegrade-to-grade promotion as a majorpurpose of the program. Elevenprograms reported purposes related toproviding remediation and sevendistricts mention curriculum change asa major purpose behind programimplementation.

"Reading and mathematics werecompetency areas assessed in all stateand local programs. Twenty-seven ofthe state programs assessed skills inlanguage arts and/or writing, while 15local districts assess these sameskills. Skills in other subjectareas, such as speaking, listening,consumer economics, science,government, and history, are assessedin only a few programs. Almost all ofthe tests administered in both stateand local programs consist primarilyof multiple-choice items, and awriting sample is the most frequently

Page 27: DOCUMENT RESUME ED 196 935 booklet is intended to to used in conjunction dith workshops and seminars conducted by measurement specialists using the training methods described in Training

selected non-multiple-choiceassessment."3

LEGAL ISSUES

Many of the legal issues involved in competency testing are inextricably linked to issues of test quality and the quality of educational programs designed to support competency testing. For example, the nature and quality of a competency test may trigger legal challenge, but test quality is and should be in itself an educational issue. Similarly, ensuring quality and effectiveness in basic and remedial instructional programs is one of the central missions of education. Nevertheless, in examining minimum competency programs, courts are likely to closely examine these instructional activities. While it seems impossible to clearly disentangle "legal" from "educational" issues in minimum competency testing, it is useful to review the issues courts have examined to date.

The distinction between using a competency test only as a diagnostic tool to identify student weaknesses in basic skills and tying high school graduation to successful performance on the test is crucial in examining the legal implications of minimum competency testing. The legality of a testing program will usually depend more on how the test results are used than on the nature of the test itself. For example, as McClung points out in a legal review of competency testing:

3Gorth, W.P., and Perkins, M.R., A Study of Minimum Competency Testing Programs: Final Summary and Analysis Report. Amherst, MA: National Evaluation Systems, Inc., December 1979.

"Using the test results as theprimary basis for any decision thatwill cause serious harm to a studentraises the initial legal questions.The trigger for legal analysis is thisinjury. Assuming there is injury, thefollowing questions arise: Who isresponsible for that injury and doesthat person or agency have sufficientjustification for causing that injury?

"If there is no injury, then thereis no legal problem. Competency testscan be used in many ways that cause noinjury to a student. For example,competency tests could be used simplyto determine the general level ofstudent performance in basic skills ona statewide or district level; toidentify basic skill areas in aninstruction program that need moreemphasis; or to diagnose areas inwhich an individual student needsspecific help. In such cases, thereis usually no injury and no legalproblem.

"On the other hand, competencytests can be used to make decisionsabout individual students that havepotential for grave injury. Forexample, competency tests can be usedfor tracking, grade promotion, ordenial of a regular high schooldiploma. Diploma denial, as mandatedin Florida and California, probablycauses the greatest injury to anindividual student, and thereforeraises the most serious legalquestions (p. 657-658)."4

Minimum competency testing requirements that incorporate some sanction upon students for failing to pass the tests run the greatest risk of legal challenge. These legal challenges are most likely to be raised if competency testing programs touch on any of the following issues:

4McClung, M.S., "Competency testingprograms: Legal and educationalissues," Fordham Law Review, 47,1979, 651-711.



Potential for racial and linguistic discrimination

Adequacy of advance notice and phase-in periods prior to the initial use of the test as a graduation requirement

Psychometric validity or reliability of the tests

Match between the instructional program and the test

The degree to which remedial instruction may create or reinforce tracking

1. Potential for racial and linguistic discrimination. Briefly stated, some states and many local school districts have in the past been found to have discriminated against racial and linguistic minority students in violation of the equal protection clause of the U.S. Constitution and Title VI of the Civil Rights Act of 1964. Examples of such states and districts include those that have been held by courts to have operated "dual school systems" for blacks and whites and have been ordered to desegregate, and those that have been found not to be providing adequate bilingual instruction in accord with the U.S. Supreme Court's ruling in Lau v. Nichols. In states or districts which have been subject to or are vulnerable to such findings, the effect of minimum competency testing requirements may be to reinforce the effects of prior discrimination. That is, the minimum competency testing sanction could pile one injury (diploma denial) on top of another (prior denial of equal educational opportunity).

2. Adequacy of advance notice and phase-in periods prior to the initial use of the test as a graduation requirement. Legal concerns for fairness and due process will require extensive notice of minimum competency testing requirements to students and parents. For example, the first class of students subject to a minimum competency testing requirement might not know that passing a competency test will be a condition for acquiring a diploma. The school district, in fact, would have explicitly approved students' progress by promoting them each year even though many of them lacked basic skill proficiencies. It is also likely that many, if not most, of those students failing the test might have studied differently, and teachers taught differently, had they received advance notice of the requirement.

Procedures for notifying students vary from school to school. In most districts students are first given general notice of the proficiency requirement for a diploma and then at a later date notified of the specific performance objectives to be measured by the proficiency test. Students, parents and teachers should be given notice of both performance objectives and assessment procedures as soon after their adoption as possible.

Traditional notions of due process require adequate prior notice of any rule that could cause irreparable harm to a person's educational or occupational prospects. Notification of requirements after completing most of one's educational program may be viewed as both unfair and inadequate, especially if the minimum competency test is designed to measure knowledge and skills not previously taught in the district's classrooms.

3. Psychometric validity or reliability of minimum competency tests. All tests ought to meet reasonable professional psychometric standards of validity and reliability. Simply stated, validity refers to whether or not a test measures what it purports to measure, and reliability refers to whether or not the test measures student performance accurately from one test administration to another. The most widely accepted professional test development standards are the Standards for Educational and Psychological Tests, published by the American Psychological Association. It is likely that minimum competency tests will be subjected to careful scrutiny against such benchmarks as the Standards.
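The reliability half of this definition lends itself to a worked example. One common estimate, test-retest reliability, is just the correlation between scores from two administrations of the same test. The Python sketch below computes it for eight hypothetical students; all scores are invented for illustration.

def pearson(xs, ys):
    # Product-moment correlation between paired score lists
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

first_administration  = [55, 60, 62, 70, 71, 75, 80, 88]
second_administration = [58, 59, 65, 68, 74, 73, 82, 90]  # same students, later date
print("test-retest reliability estimate:",
      round(pearson(first_administration, second_administration), 2))

A value near 1.0 means the test ranks students consistently from one administration to the next; the Standards cited above address how high a value a given use of a test demands.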

4. Match between the instructional program and the test. Most persons would agree that fairness requires that a school's curriculum and instruction be matched to the competencies measured by a test. In other words, the test would be unfair if it attempted to measure what the school did not teach. This concept should be considered in terms of both curricular validity and instructional validity.

Curricular validity is a measure of how well test items match the objectives of the curriculum. An analysis of curricular validity would require comparison of the test objectives with the school's stated course objectives. This becomes important, for example, if the curriculum is not specifically designed to teach functional competency and the use of a test covering functional competency is considered. It might be unfair to deny students their diplomas because they did not learn these functional competencies. In such a situation, failure on the minimum competency test might indicate that the school did not offer an appropriate curriculum.

A minimum competency test should also have what may be called instructional validity: Even if the curricular objectives of the school correspond to those of the competency test, there might be a discrepancy between the stated objectives of the school and what is actually being taught in the classroom. Instructional validity obviously does not require prior exposure of the student to the exact questions asked on the test, but it does require exposure to the kind of knowledge and skills that would enable a student to answer the test questions.

It is important to note that content validity does not ensure either curricular or instructional validity. They are related, but distinguishable, concepts. Content validity is a measure of how well test items represent the body of skills and knowledge that the test purports to measure, but it is not necessarily a measure of how well the test items represent either a school's curricular objectives or its instruction. Instructional validity should be the central concern, because content and curricular validity mean very little if the test items are not representative of instruction actually received by the student.
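These validities can be pictured as overlap checks among three lists: what the test covers, what the written curriculum promises, and what was actually taught. In the Python sketch below, all three lists are hypothetical stand-ins for a district's real documents.

test_objectives = {"add fractions", "read a schedule", "write a letter"}
curriculum_objectives = {"add fractions", "write a letter", "read a map"}
instruction_delivered = {"add fractions", "read a map"}  # what classrooms actually covered

def coverage(test_items, source):
    # Fraction of the test's objectives found in the comparison list
    return len(test_items & source) / len(test_items)

print("curricular validity:   ", round(coverage(test_objectives, curriculum_objectives), 2))
print("instructional validity:", round(coverage(test_objectives, instruction_delivered), 2))
# A test can match the written curriculum reasonably well and still miss
# what was actually taught; that gap is what instructional validity catches.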

5. The degree to which remedial instruction may create or reinforce tracking. Most minimum competency testing programs implicitly or explicitly require remedial instruction for students found to be deficient in basic skills. In districts subject to findings of prior racial or linguistic discrimination as described above, one effect of minimum competency testing requirements may be to inappropriately channel or "track" disproportionate numbers of minority students into remedial programs on the basis of their test results. This could have the effect of "resegregating" students into remedial programs in direct contradiction to prior orders to desegregate school systems.

THE DEBRA P. v. TURLINGTON DECISION

To date, the only major legal challenge to competency testing was mounted in Florida. In Debra P. v. Turlington, a group of black student plaintiffs sued the state in Federal District Court to have the state's competency testing program ruled unconstitutional. Plaintiffs challenged the test on each of the grounds mentioned above.


In July 1979, the court held that Florida's competency testing program did not give all students adequate notice of the inclusion of the competency test as a graduation requirement, and that the competency testing program carried forward the effects of prior racial discrimination in violation of the due process and equal protection clauses of the Fourteenth Amendment of the U.S. Constitution, Title VI of the Civil Rights Act of 1964, and the Equal Educational Opportunities Act of 1974. As a remedy, the court enjoined Florida from using the test as a diploma requirement for four years, until the 1982-83 school year. The court did not, however, deny use of the competency test during this four-year period for assessing the effects of instruction.

Although the court found psychometric deficiencies in Florida's test, it did not find these deficiencies to be unconstitutional. The court did not address in any depth the issue of the correlation between the test and the instructional program.

Issue 7: Are tests being used to evaluate teachers in schools?

Tests are being used to evaluate teachers in a variety of ways. But tests are never used as the sole criterion of teacher evaluation, because of the complexity of the learning process. Since many factors influence learning, some under teacher control and some not, teacher evaluation must be done very carefully.

The types of test scores that can play a role in teacher evaluation are the achievement test scores of students, test scores of licensing examinations, and the scores of tests used in the teacher selection and hiring processes.

The evaluation of teachers by using the achievement test scores of the students they teach is a very delicate process. If a group of students who have previously shown patterns of growth in test scores do not grow over an extended period of time, and this phenomenon is apparent in the test scores of all or nearly all students in the group with the same teacher, then those test scores can be combined with other information about the teacher as part of the teacher evaluation process. However, if test scores of students are to be used in this way, they must be used very carefully and with full awareness of the potential difficulties with this evaluation strategy.
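A minimal sketch of this growth-pattern logic appears below, with the cautions of the next three paragraphs very much in force. All numbers are hypothetical, and a flag here would only prompt a closer look alongside other evidence about the teacher, never a judgment by itself.

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical year-to-year score gains (grade-equivalent units) for one
# teacher's students, against a typical district-wide gain.
class_growth = [1.5, 0.2, 0.8, 0.4, 0.3, 0.6]
district_growth_mean = 1.0

gap = district_growth_mean - mean(class_growth)
if gap > 0.3 and all(g < district_growth_mean for g in class_growth):
    print("growth unusually low across all students; review with other evidence")
else:
    # Here one student grew well above the district mean, so the pattern
    # is not uniform and the scores alone suggest nothing.
    print("no consistent pattern across the class")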

The first difficulty is that factors apart from the school experience can greatly influence student achievement. Since teachers have no control over many of these factors, they cannot be held accountable for them. For example, characteristics such as the child's ability to learn and the child's motivation are not totally within the teacher's control. The student's home environment also exerts great influence on learning. In fact, some research suggests that some non-school factors may far outweigh school factors in determining achievement. When these factors begin to interact with the various characteristics of the school learning environment, it becomes difficult to sort out the component of learning that is influenced by the teacher and the components that are influenced by non-school conditions.

The second difficulty with using student test scores to evaluate teacher performance is the complexity of the desired end product. In school, teachers endeavor to help the child gain knowledge and skills in many academic areas, some common to all students, some unique to an individual student. In addition, teachers attempt to develop values, attitudes and interpersonal skills that will benefit a student in society. Given all of these desired traits, along with the complexity and uniqueness of each individual student, it becomes impossible to define the characteristics of the "desired" end product to evaluate.

Even when it is possible to define the citizen we want our schools to produce, we have great difficulty reflecting many of the important characteristics in reliable and valid test scores. Though we can use tests to document some of the basic achievement areas, the focus of these tests is very broad and general and may not reflect the important educational objectives in a given school district, building, or classroom. Furthermore, other desired outcomes, such as attitudes, values and interpersonal skills, are inherently complex and not easily measured in an objective way, in school settings or otherwise.

The third potential difficulty with using student test scores for teacher evaluation is that learning does not take place at a steady and predictable rate. Even if we could define and measure the end product of schooling and control most of the factors that influence that product, we could not assume that every child would gain new knowledge and skills at the same pace. Some would learn faster than others. Some would grow slowly, then spurt ahead--all according to the nature of human development. This fact must be taken into account in evaluating teacher performance via student test scores.

State licensing examinations are also used as a form of teacher evaluation. Though most states issue licenses on the basis of the completion of specified college courses or degrees, some also include an examination as part of the credentialing process.

In the field of education, tests have been in use for decades for certifying teacher competence. The State of South Carolina, for example, has used the National Teacher Examinations (NTE) to certify teachers since 1945. The Education Commission of the States5 has developed an excellent summary of the current status of such testing.

The National Teacher Examinations, which are published and administered by the Educational Testing Service, include examinations covering academic preparation in professional education and general education (writing, science, math, social studies, literature) as well as academic preparation in 26 subject-field specializations. The tests typically focus on the recall of factual information, with some use of items requiring higher-order mental operations as well.

In the fall of 1977, four states required or recommended use of NTE results for initial certification purposes. These states were Mississippi, North Carolina, South Carolina and West Virginia. Louisiana was added to this list in 1978. In addition to these five states, at least 23 states used the NTE for special purposes, ranging from obtaining statewide data for teacher education studies (Alabama) to validating credits earned at nonaccredited institutions (California, Delaware). In June 1978, the Florida Legislature passed a bill requiring, in part, a test of teaching competency and subject matter mastery for initial certification. Working steadily over a period of four to five years, the Georgia State Department of Education developed test instruments for a "Performance Based Teacher Certification" program, and first administered the test in November 1978. In early 1979, hearings were held in North Carolina on plans for a "Quality Assurance Program for Professional Personnel" in which testing for teaching competencies and subject matter mastery plays a major role in the certification process. The program was adopted in the fall of 1979.

5Vlaanderen, R. "Trends in competency-based teacher certification." Denver, CO: Education Commission of the States, March 1980.

In 1979, several state legislatures introduced bills embodying the testing concept in teacher certification. In Arkansas, a bill was passed in record time, while similar bills in Colorado, Kansas, Arizona, Missouri and Vermont died in committee. Bills were introduced in Alabama, Iowa and Oklahoma in 1980 and again, in a special session, in Arizona. State Board action has mandated testing in Alabama and Tennessee.

Test scores are also used, in some instances, when several teachers are being considered for a limited number of teaching positions. The employers may use a test as part of the selection process. In this case, all teachers may be certified, but another test might be used to determine knowledge of subject matter and/or ability to perform in a certain educational environment. As in the other instances, test scores should never be the only criterion considered in the selection process. But they can be a valuable selection aid when used carefully with other performance information.


Annotated Bibliography

Test Purposes and Users

Anderson, B.L., Stiggins, R.J., and Hiscox, S.B. Guidelines for selecting basic skills and life skills tests. Portland, OR: Northwest Regional Educational Laboratory, 1980.

This short guide, designed for teachers and administrators, discusses test purposes and characteristics to consider when selecting tests. Lists of currently available basic skills and life skills tests are provided, along with the names and addresses of test publishers.

Brown, F.G. Guidelines for test use: A commentary on the Standards for Educational and Psychological Tests. Washington, D.C.: National Council on Measurement in Education, 1980.

This book is designed for teachers, counselors, school psychologists, administrators, parents and others concerned with educational measurement. It is a nontechnical explanation of the Standards.

Burrill, L.E. How a standardized achievement test is built. Test Service Notebook 125. New York, NY: The Psychological Corporation.

The steps described are typical of the way tests are built by many major test publishers. Other short articles on related topics are available from The Psychological Corporation, New York, NY 10017.

Feder, B. The complete guide to taking tests. Englewood Cliffs, NJ: Prentice-Hall, Inc., 1979.

This book is written for test takers who want to take some of the mystery out of testing.

Parents and testing. Washington, D.C.: National Education Association, 1979.

This guide provides parents with information on how they should and can be involved with schools' testing programs. It also gives the NEA position on student testing.

Rebell, M.A., and Block, A.R. Competence assessment and the courts: An overview of the state of the law. Boston, MA: McBer, 1980.

This study looks at the implications of legal cases for a wide variety of educational testing situations, including certification, IQ tests, ability tracking, and graduate school admissions tests.


Teachers and testing. Washington, D.C.: National Education Association, 1979.

Teachers are provided with general information on how and why tests are used, as well as their strengths and weaknesses. The NEA resolutions relating to testing issues are given.

Achievement Test Score Decline

Munday, L. Declining admissions test scores. ACT Research Report No. 71. Iowa City, IA: The American College Testing Program, 1976.

Several indices of declining academic achievement are summarized. However, the principal focus is on declining ACT Assessment Program test scores. Correlates of score decline are identified and potential explanations are explored.

Harnischfeger, A., and Wiley, D.E. Achievement test score decline: Do we need to worry? Monograph of CEMREL, Inc., 3120 59th Street, St. Louis, MO 63139, 1976.

This 160-page monograph reviews several potential explanations for declining academic achievement test scores. Data are presented in association with the potential explanations, and conclusions are drawn regarding each explanation. An excellent summary of conclusions is presented.

College Entrance Examination Board. On further examination. New York, NY: 1977.

This monograph reports the results of the deliberations of the CEEB Advisory Panel on the Scholastic Aptitude Test Score Decline. Potential explanations related to school and nonschool factors are examined and accepted or rejected as viable. Conclusions are presented regarding multiple causes.

Truth in Testing

Brown, R. Searching for the truth in "truth in testing" legislation: A background report. Denver, CO: Education Commission of the States, 1980.

This is a readable summary of the background and current issues in truth in testing. It also summarizes relevant pending federal and state legislation.

Brown, R. Searching for the truth about truth in testing. Compact, Winter 1980, 7-11.

This article is a much abbreviated summary of the issues presented in the background report listed above.


Nairn, A., and Associates. The reign of ETS: The corporation that makes up minds. Washington, D.C., 1980.

The Nairn report on ETS was sponsored by Ralph Nader and offers a strong indictment of many of ETS' practices.

Educational Testing Service. Test scores and family income. Princeton, NJ, February 1980.

Educational Testing Service. Test use and validity. Princeton, NJ, February 1980.

The two ETS reports were developed in response to the Nairn report.

Cultural Bias

Burrill, L.E., and Wilson, R. Fairness and the matter of bias. Test Service Notebook 36. New York, NY: The Psychological Corporation, 1980.

This article succinctly covers major issues in racial bias, item bias, and bias in selection and prediction.

Burrill, L.E. Statistical evidence of potential bias in items and tests assessing current educational status. Paper presented at the Fourteenth Annual Southeastern Conference on Measurement in Education, 1975.

This paper describes various definitions and interpretations of bias and provides a useful reference list.

Shepard, L., Camilli, G., and Averill, M. Comparison of six procedures for detecting test item bias using internal and external ability criteria. Paper presented at the National Council on Measurement in Education Annual Meeting, Boston, 1980.

This paper not only provides a thorough comparison of procedures for detecting test item bias, but also contains an extensive reference list to the literature on test item bias.

IQ Testing

Larry P. v. Riles, No. C71-2270 RFP, N.D. Cal. Decision 10/16/79.

Readers who are interested in pursuing the issues raised in the Larry P. decision are urged to obtain a transcript of the decision and read it in its entirety. The decision is readable, to the point, and appropriate for a lay reader.


Parents in Action on Special Education v. Hannon, No. 74C3586, N.D. Ill. Decision 7/7/80.

This transcript of the Parents in Action case provides a detailed item-by-item analysis of the IQ tests in question.

Notes on Larry P. Footnotes (Newsletter of the Law and Education Center, Education Commission of the States, Denver, CO), Vol. 1, No. 2, Spring 1980.

This newsletter presents a short, readable analysis of the Larry P. v. Riles case.

Minimum Competency Testing

Bunda, M.A., and Sanders, J.R. (Eds.) Practices and problems in competency-based measurement. Washington, D.C.: National Council on Measurement in Education, 1979.

This 144-page book provides articles on the key issues in competency-based testing.

Debra P. v. Turlington. Footnotes (Newsletter of the Law and Education Center, Education Commission of the States, Denver, CO), Vol. 1, No. 1, November 1979.

This newsletter provides a short, readable review of the key issues in the Debra P. v. Turlington case.

Gorth, W.P., and Perkins, M.R. A study of minimum competency testing programs: Final summary and analysis report. Amherst, MA: National Evaluation Systems, 1979.

This report summarizes the current status of the implementation of minimum competency testing across the country.

McClung, M.S. Competency testing programs: Legal and educational issues. Fordham Law Review, 1979, 47, 651-711.

This article is an exhaustive review of legal issues which incorporates potential implications of the Debra P. v. Turlington decision.

Shoemaker, J.S. Minimum competency testing: Implications for instruction. Washington, D.C.: National Institute of Education, January 1979.

This paper presents a discussion of design considerations in the development of minimum competency testing programs that will maximize the utility of the program for instructional uses.


Rosewater, A. Minimum competency testing programs and handicapped students: Perspectives on policy and practice. Washington, D.C.: George Washington University Institute for Educational Leadership, 1979.

This paper presents a review of policy and practical problems involved in implementing minimum competency testing programs for the handicapped.

Teacher Testing and Evaluation

The Psychological Corporation. Summaries of court decisions on employment testing, 1968-1977. New York, NY, 1978.

This book summarizes court decisions on employment testing in both the private and public sectors. It is not limited to educational personnel.

Vlaanderen, R. Trends in competency-based teacher certification. Denver, CO: Education Commission of the States, March 1980.

This paper presents a summary of the current status of teacher competency testing.


Appendix A

A Glossary of Measurement Terms

The following glossary is used with the permission of The Psychological Corporation, New York, N.Y. 10017.

Similar glossaries may be obtained from other major test publishers.


Test Service Notebook 13

A Glossary of Measurement Terms
BLYTHE C. MITCHELL, Consultant, Test Department

This glossary of terms used in educational and psychological measurement is primarily for persons with limited training in measurement, rather than for the specialist. The terms defined are the more common or basic ones such as occur in test manuals and educational journals. In the definitions, certain technicalities and niceties of usage have been sacrificed for the sake of brevity and, it is hoped, clarity.

The definitions are based on the usage of the various terms as given in the current textbooks in educational and psychological measurement and statistics, and in certain specialized dictionaries. Where there is not complete uniformity among writers in the measurement field with respect to the meaning of a term, either these variations are noted or the definition offered is the one that the writer judges to represent the "best" usage.

academic aptitude. The combination of native and acquired abilities that are needed for school learning; likelihood of success in mastering academic work, as estimated from measures of the necessary abilities. (Also called scholastic aptitude, school learning ability, academic potential.)

achievement test. A test that measures the extent to which a person has "achieved" something, acquired certain information, or mastered certain skills, usually as a result of planned instruction or training.

age norms. Originally, values representing typical or average performance for persons of various age groups; most current usage refers to sets of complete score interpretive data for appropriate successive age groups. Such norms are generally used in the interpretation of mental ability test scores.

alternate-form reliability. The closeness of correspondence, or correlation, between results on alternate (i.e., equivalent or parallel) forms of a test; thus, a measure of the extent to which the two forms are consistent or reliable in measuring whatever they do measure. The time interval between the two testings must be relatively short so that the examinees themselves are unchanged in the ability being measured. See RELIABILITY, RELIABILITY COEFFICIENT.

anecdotal record. A written description of an incident in an individual's behavior that is reported objectively and is considered significant for the understanding of the individual.

aptitude. A combination of abilities and other characteristics, whether native or acquired, that are indicative of an individual's ability to learn or to develop proficiency in some particular area if appropriate education or training is provided. Aptitude tests include those of general academic ability (commonly called mental ability or intelligence tests); those of special abilities, such as verbal, numerical, mechanical, or musical; tests assessing "readiness" for learning; and prognostic tests, which measure both ability and previous learning and are used to predict future performance, usually in a specific field, such as foreign language, shorthand, or nursing.

Some would define "aptitude" in a more comprehensive sense. Thus, "musical aptitude" would refer to the combination not only of physical and mental characteristics but also of motivational factors, interest, and conceivably other characteristics, which are conducive to acquiring proficiency in the musical field.

arithmetic mean. A kind of average usually referred to as the mean. It is obtained by dividing the sum of a set of scores by their number.

average. A general term applied to the various measures of central tendency. The three most widely used averages are the arithmetic mean (mean), the median, and the mode. When the term "average" is used without designation as to type, the most likely assumption is that it is the arithmetic mean.
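For readers who want to see these three averages side by side, here is a brief illustration in Python (ours, not part of the original glossary), using a small set of made-up scores:

    from statistics import mean, median, mode

    # A small, hypothetical set of raw test scores.
    scores = [72, 85, 85, 90, 64, 78, 85, 90, 70]

    print(mean(scores))    # arithmetic mean: sum of the scores / number of scores
    print(median(scores))  # middle score when the scores are ranked
    print(mode(scores))    # most frequently occurring score (85 here)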

battery. A group of several tests standardized on the same sample population so that results on the several tests are comparable. (Sometimes loosely applied to any group of tests administered together, even though not standardized on the same subjects.) The most common test batteries are those of school achievement, which include subtests in the separate learning areas.

bivariate chart (bivariate distribution). A diagram in which a tally mark is made to show the scores of one individual on two variables. The intersections of lines determined by the horizontal and vertical scales form cells in which the tallies are placed. Such a plot provides frequencies for the two distributions, and portrays the relation between the two variables as a basis for computation of the product-moment correlation coefficient.



ceiling. The upper limit of ability that can be measured by a test. When an individual makes a score which is at or near the highest possible score, it is said that the test has too low a "ceiling" for him; he should be given a higher level of the test.

central tendency. A measure of central tendency provides a single most typical score as representative of a group of scores; the "trend" of a group of measures as indicated by some type of average, usually the mean or the median.

coefficient of correlation. A measure of the degree of relationship or "going-togetherness" between two sets of measures for the same group of individuals. The correlation coefficient most frequently used in test development and educational research is that known as the Pearson or product-moment r. Unless otherwise specified, "correlation" usually refers to this coefficient, but rank, biserial, tetrachoric, and other methods are used in special situations. Correlation coefficients range from .00, denoting a complete absence of relationship, to +1.00 and to -1.00, indicating perfect positive or perfect negative correspondence, respectively. See CORRELATION.
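The product-moment r itself is straightforward to compute from deviations about the two means. A minimal sketch in Python (ours, with fabricated scores):

    import math

    def pearson_r(x, y):
        """Product-moment correlation between two equal-length score lists."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        # Sum of products of deviations from the two means...
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        # ...divided by the product of the two deviation magnitudes.
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy / math.sqrt(sxx * syy)

    # Hypothetical IQ and reading scores for five students.
    iq      = [95, 100, 105, 110, 120]
    reading = [40, 44, 50, 52, 61]
    print(round(pearson_r(iq, reading), 2))  # near +1: strong positive relation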

composite score. A score which combines several scores, usually by addition; often different weights are applied to the contributing scores to increase or decrease their importance in the composite. Most commonly, such scores are used for predictive purposes and the several weights are derived through multiple regression procedures.

concurrent validity. See VALIDITY (2).

construct validity. See VALIDITY (3).

content validity. See VALIDITY (1).

correction for guessing (correction for chance). A reduction in score for wrong answers, sometimes applied in scoring true-false or multiple-choice questions. Such scoring formulas (R - W for tests with 2-option response, R - 1/2 W for 3 options, R - 1/3 W for 4, etc.) are intended to discourage guessing and to yield more accurate rankings of examinees in terms of their true knowledge. They are used much less today than in the early days of testing.
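The general rule behind these formulas is rights minus wrongs divided by one less than the number of answer choices, so that blind guessing gains nothing on the average. A small sketch (ours):

    def corrected_score(rights, wrongs, options):
        """Formula score R - W/(k-1): the expected gain from blind guessing is zero."""
        return rights - wrongs / (options - 1)

    # 60 right, 20 wrong on a four-option multiple-choice test:
    print(corrected_score(60, 20, 4))  # 60 - 20/3 = 53.33...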

correlation. Relationship or "going-togetherness" between two sets of scores or measures; tendency of one score to vary concomitantly with the other, as the tendency of students of high IQ to be above average in reading ability. The existence of a strong relationship, i.e., a high correlation, between two variables does not necessarily indicate that one has any causal influence on the other. See COEFFICIENT OF CORRELATION.

criterion. A standard by which a test may be judged or evaluated; a set of scores, ratings, etc., that a test is designed to measure, to predict, or to correlate with. See VALIDITY.

criterion-referenced (content-referenced) test. Terms often used to describe tests designed to provide information on the specific knowledge or skills possessed by a student. Such tests usually cover relatively small units of content and are closely related to instruction. Their scores have meaning in terms of what the student knows or can do, rather than in their relation to the scores made by some external reference group.


criterion-related validity. See VALIDITY (2).

culture-fair test. So-called culture-fair tests attempt to provide an equal opportunity for success by persons of all cultures and life experiences. Their content must therefore be limited to that which is equally common to all cultures, or to material that is entirely unfamiliar and novel for all persons whatever their cultural background. See CULTURE-FREE TEST.

culture-free test. A test that is free of the impact of all cultural experiences; therefore, a measure reflecting only hereditary abilities. Since culture permeates all of man's environmental contacts, the construction of such a test would seem to be an impossibility. Cultural "bias" is not eliminated by the use of non-language or so-called performance tests, although it may be reduced in some instances. In terms of most of the purposes for which tests are used, the validity (value) of a "culture-free" test is questioned; a test designed to be equally applicable to all cultures may be of little or no practical value in any.

curricular validity. See VALIDITY (1).

decile. Any one of the nine points (scores) that divide a distribution into ten parts, each containing one-tenth of all the scores or cases; every tenth percentile. The first decile is the 10th percentile, the eighth decile the 80th percentile, etc.

deviation. The amount by which a score differs from some reference value, such as the mean, the norm, or the score on some other test.

deviation IQ (DIQ). An age-based index of general mental ability. It is based upon the difference or deviation between a person's score and the typical or average score for persons of his chronological age. Deviation IQs from most current scholastic aptitude measures are standard scores with a mean of 100 and a standard deviation of 16 for each defined age group.

diagnostic test. A test used to "diagnose" or analyze; that is, to locate an individual's specific areas of weakness or strength, to determine the nature of his weaknesses or deficiencies, and, wherever possible, to suggest their cause. Such a test yields measures of the components or subparts of some larger body of information or skill. Diagnostic achievement tests are most commonly prepared for the skill subjects.

difficulty value. An index which indicates the percent of some specified group, such as students of a given age or grade, who answer a test item correctly.

discriminating power. The ability of a test item to differentiate between persons possessing much or little of some trait.

discrimination index. An index which indicates the discriminating power of a test item. The most commonly used index is derived from the number passing the item in the highest 27 percent of the group (on total score) and the number passing in the lowest 27 percent.
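A brief sketch (ours, with fabricated data) of both indexes just defined, the difficulty value and the upper-lower discrimination index:

    def difficulty(item_correct):
        """Difficulty value: percent of the group answering the item correctly."""
        return 100 * sum(item_correct) / len(item_correct)

    def discrimination(item_correct, total_scores, fraction=0.27):
        """Proportion passing the item in the top 27 percent of the group
        (on total score) minus the proportion passing in the bottom 27 percent."""
        n = round(fraction * len(total_scores))
        # Rank examinees by total score; item_correct[i] is 1 if examinee i
        # answered this item correctly, else 0.
        order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
        low, high = order[:n], order[-n:]
        p_high = sum(item_correct[i] for i in high) / n
        p_low = sum(item_correct[i] for i in low) / n
        return p_high - p_low

    # Hypothetical results for ten examinees on one item and on the whole test.
    item   = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
    totals = [55, 32, 70, 61, 28, 49, 66, 35, 58, 73]
    print(difficulty(item))              # 70.0 percent answered correctly
    print(discrimination(item, totals))  # positive: the item favors high scorers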

distractor. Any incorrect choice (option) in a test item.

distribution (frequency distribution). A tabulation of the scores (or other attributes) of a group of individuals to show the number (frequency) of each score, or of those within the range of each interval.



equivalent form. Any of two or more forms of a test that are closely parallel with respect to the nature of the content and the number and difficulty of the items included, and that will yield very similar average scores and measures of variability for a given group. (Also referred to as alternate, comparable, or parallel form.)

error of measurement. See STANDARD ERROR OF MEASUREMENT.

expectancy table ("expected" achievement). A term with two common usages, related but with some difference:

(1) A table or other device for showing the relation between scores on a predictive test and some related outcome. The outcome, or criterion status, for individuals at each level of predictive score may be expressed as (a) an average on the outcome variable, (b) the percent of cases at successive levels, or (c) the probability of reaching given performance levels. Such tables are commonly used in making predictions of educational or job success.

(2) A table or chart providing for an interpretation of a student's obtained score on an achievement test with the score which would be "expected" for those at his grade level and with his level of scholastic aptitude. Such "expectancies" are based upon actual data from administration of the specified achievement and scholastic aptitude tests to the same student population. The term "anticipated" is also used to denote achievement as differentiated by level of "intellectual status."

extrapolation. In general, any process of estimating values of a variable beyond the range of available data. As applied to test norms, the process of extending a norm line into grade or age levels not tested in the standardization program, in order to permit interpretation of extreme scores. Since this extension is usually done graphically, considerable judgment is involved. Extrapolated values are thus to some extent arbitrary; for this and other reasons, they have limited meaning.

f. A symbol denoting the frequency of a given score or of the scores within an interval grouping.

face validity. See VALIDITY (1).

factor. In mental measurement, a hypothetical trait, ability, or component of ability that underlies and influences performance on two or more tests and hence causes scores on the tests to be correlated. The term "factor" strictly refers to a theoretical variable, derived by a process of factor analysis from a table of intercorrelations among tests. However, it is also used to denote the psychological interpretation given to the variable, i.e., the mental trait assumed to be represented by the variable, as verbal ability, numerical ability, etc.

factor analysis. Any of several methods of analyzing the intercorrelations among a set of variables such as test scores. Factor analysis attempts to account for the interrelationships in terms of some underlying "factors," preferably fewer in number than the original variables, and it reveals how much of the variation in each of the original measures arises from, or is associated with, each of the hypothetical factors. Factor analysis has contributed to an understanding of the organization or components of intelligence, aptitudes, and personality; and it has pointed the way to the development of "purer" tests of the several components.


forced-choice item. Broadly, any multiple-choice item in which the examinee is required to select one or more of the given choices. The term is most often used to denote a special type of multiple-choice item employed in personality tests in which the options are (1) of equal "preference value," i.e., chosen equally often by a typical group, and are (2) such that one of the options discriminates between persons high and low on the factor that this option measures, while the other options measure other factors. Thus, in the Gordon Personal Profile, each of four options represents one of the four personality traits measured by the Profile, and the examinee must select both the option which describes him most and the one which describes him least.

frequency distribution. See DISTRIBUTION.

g. Denotes general intellectual ability; a one-dimensional measure of "mind," as described by the British psychologist Spearman. A test of "g" serves as a general-purpose test of mental ability.

grade equivalent (GE). The grade level for which a given score is the real or estimated average. Grade-equivalent interpretation, most appropriate for elementary level achievement tests, expresses obtained scores in terms of grade and month of grade, assuming a 10-month school year (e.g., 5.7). Since such tests are usually standardized at only one (or two) point(s) within each grade, grade equivalents between points for which there are data-based scores must be "estimated" by interpolation. See EXTRAPOLATION, INTERPOLATION.

grade norms. Norms based upon the performance of pupils of given grade placement. See GRADE EQUIVALENT, NORMS, PERCENTILE RANK, STANINE.

group test. A test that may be administered to a number of individuals at the same time by one examiner.

individual test. A test that can be administered to only one person at a time, because of the nature of the test and/or the maturity level of the examinees.

intelligence quotient (IQ). Originally, an index of brightness expressed as the ratio of a person's mental age to his chronological age, MA/CA, multiplied by 100 to eliminate the decimal. (More precisely, and particularly for adult ages, at which mental growth is assumed to have ceased, the ratio of mental age to the mental age normal for chronological age.) This quotient IQ has been gradually replaced by the deviation IQ concept.

It is sometimes desired to give additional meaning to IQs by the use of verbal descriptions for the ranges in which they fall. Since the IQ scale is a continuous one, there can be no inflexible line of demarcation between such successive category labels as very superior, superior, above average, average, below average, etc.; any verbal classification system is therefore an arbitrary one. There appears to be, however, rather common use of the term average or normal to describe IQs from 90-109 inclusive.

An IQ is more definitely "interpreted" by noting the normal percent of IQs within a range which includes the IQ, and/or



by indicating its percentile rank or stanine in the total national norming sample. Column 2 of Table 1 shows the normal distribution of IQs for M = 100 and S.D. = 16, showing percentages within successive 10-point intervals. (For IQs whose S.D. is greater than 16, the percentages for the extreme IQ ranges will be larger, and those for IQs near the mean will be smaller, than those shown in the table.) Table 1 indicates that 47 percent, approximately one-half of "all" persons, have IQs in the 20-point range of 90 through 109; an IQ of 140 or above would be considered as extremely high, since fewer than one percent (0.6) of the total population reach this level, and fewer than one percent have IQs below 60. From the cumulative percents given in Column 3, it is noted that 3.1 percent have IQs below 70, usually considered the mentally retarded category. This column may be used to indicate the percentile rank (PR) of certain IQs. Thus an IQ of 119 has a PR of 89, since 89.4 percent of IQs are 119 or below; an IQ of 79 has a PR of 10.6, or 11. See DEVIATION IQ, MENTAL AGE.

Table 1. Normal Distribution of IQs with Mean of 100 and Standard Deviation of 16

    (1) IQ Range       (2) Percent of Persons     (3) Cumulative Percent
    140 and above              0.6                       100.0
    130-139                    2.5                        99.4
    120-129                    7.5                        96.9
    110-119                   16.0                        89.4
    100-109                   23.4 } 46.8                 73.4
     90- 99                   23.4 }                      50.0
     80- 89                   16.0                        26.6
     70- 79                    7.5                        10.6
     60- 69                    2.5                         3.1
    Below 60                   0.6                         0.6
    Total                    100.0
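The percentages in Table 1 can be reproduced, to rounding, directly from the normal curve. A short sketch (ours) using Python's standard library, treating the boundaries as the table's 10-point groupings do:

    from statistics import NormalDist

    iq = NormalDist(mu=100, sigma=16)  # deviation IQs as in Table 1

    # Percent of persons in the combined 90-109 range (compare 46.8 in Column 2):
    print(round(100 * (iq.cdf(110) - iq.cdf(90)), 1))

    # Cumulative percent through the 110-119 interval, i.e., the approximate
    # percentile rank of an IQ of 119 (compare 89.4 in Column 3):
    print(round(100 * iq.cdf(120), 1))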

internal consistency. Degree of relationship among the items of a test; consistency in content sampling. See SPLIT-HALF RELIABILITY COEFFICIENT.

interpolation. In general, any process of estimating intermediate values between two known points. As applied to test norms, it refers to the procedure used in assigning interpretive values (e.g., grade equivalents) to scores between the successive average scores actually obtained in the standardization process. Also, in reading norm tables it is necessary at times to interpolate to obtain a norm value for a score between two scores given in the table; e.g., in the table shown here, a percentile rank of 83 (from 81 + 1/3 of 6) would be assigned, by interpolation, to a score of 46; a score of 50 would correspond to a percentile rank of 94 (obtained as 87 + 2/3 of 10).

    Score   Percentile Rank
     51          97
     48          87
     45          81
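The norm-table example above amounts to straight-line interpolation between two tabled points. A minimal sketch (ours) that reproduces both figures from the entry:

    def interpolate(score, lo_score, lo_pr, hi_score, hi_pr):
        """Straight-line interpolation between two tabled (score, rank) points."""
        fraction = (score - lo_score) / (hi_score - lo_score)
        return lo_pr + fraction * (hi_pr - lo_pr)

    # The norm-table example: scores 45, 48, 51 have percentile ranks 81, 87, 97.
    print(round(interpolate(46, 45, 81, 48, 87)))  # 83, as in the text
    print(round(interpolate(50, 48, 87, 51, 97)))  # 94, as in the text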

inventory. A questionnaire or check list, usually in the form of a self-report, designed to elicit non-intellective information about an individual. Not tests in the usual sense, inventories are most often concerned with personality traits, interests, attitudes, problems, motivation, etc. See PERSONALITY TEST.

inventory test. An achievement test that attempts to cover rather thoroughly some relatively small unit of specific instruction or training. An inventory test, as the name suggests, is in the nature of a "stock-taking" of an individual's knowledge or skill, and is often administered prior to instruction.

item. A single question or exercise in a test.

item analysis. The process of evaluating single test items in respect to certain characteristics. It usually involves determining the difficulty value and the discriminating power of the item, and often its correlation with some external criterion.

Kuder-Richardson formula(s). Formulas for estimating the reliability of a test that are based on inter-item consistency and require only a single administration of the test. The one most used, formula 20, requires information based on the number of items in the test, the standard deviation of the total score, and the proportion of examinees passing each item. The Kuder-Richardson formulas are not appropriate for use with speeded tests.
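For the curious reader, here is a rough sketch (ours, with a fabricated right/wrong response matrix) of formula 20 as described above:

    def kuder_richardson_20(responses):
        """KR-20 from a list of examinees' item-response lists (1 = right, 0 = wrong)."""
        n_items = len(responses[0])
        totals = [sum(person) for person in responses]
        mean_total = sum(totals) / len(totals)
        # Variance of the total scores.
        variance = sum((t - mean_total) ** 2 for t in totals) / len(totals)
        # Sum over items of p(1 - p), where p is the proportion passing the item.
        sum_pq = 0.0
        for i in range(n_items):
            p = sum(person[i] for person in responses) / len(responses)
            sum_pq += p * (1 - p)
        return (n_items / (n_items - 1)) * (1 - sum_pq / variance)

    responses = [
        [1, 1, 1, 0, 1],
        [1, 0, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 1, 0, 0],
        [1, 1, 0, 1, 1],
    ]
    print(round(kuder_richardson_20(responses), 2))  # about 0.65 for these data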

mastery test. A test designed to determine whether a pupil has mastered a given unit of instruction or a single knowledge or skill; a test giving information on what a pupil knows, rather than on how his performance relates to that of some norm-reference group. Such tests are used in computer-assisted instruction, where their results are referred to as content- or criterion-referenced information.

mean (M). See ARITHMETIC MEAN.

median (Md). The middle score in a distribution or set of ranked scores; the point (score) that divides the group into two equal parts; the 50th percentile. Half of the scores are below the median and half above it, except when the median itself is one of the obtained scores.

mental age (MA). The age for which a given score on a mental ability test is average or normal. If the average score made by an unselected group of children 6 years, 10 months of age is 55, then a child making a score of 55 is said to have a mental age of 6-10. Since the mental age unit shrinks with increasing (chronological) age, MAs do not have a uniform interpretation throughout all ages. They are therefore most appropriately used at the early age levels where mental growth is relatively rapid.

modal-age norms. Achievement test norms that are based on the performance of pupils of normal age for their respective grades. Norms derived from such age-restricted groups are free from the distorting influence of the scores of underage and overage pupils.

mode. The score or value that occurs most frequently in a distribution.

multiple-choice item. A test item in which the examinee's task is to choose the correct or best answer from several given answers or options.

N. The symbol commonly used to represent the number of cases in a group.



non-language test. See NON-VERBAL TEST.

non-verbal test. A test that does not require the use of words in the item or in the response to it. (Oral directions may be included in the formulation of the task.) A test cannot, however, be classified as non-verbal simply because it does not require reading on the part of the examinee. The use of non-verbal tasks cannot completely eliminate the effect of culture.

norm line. A smooth curve drawn to best fit (1) the plotted mean or median scores of successive age or grade groups, or (2) the successive percentile points for a single group.

normal distribution. A distribution of scores or measures that in graphic form has a distinctive bell-shaped appearance. Figures 1 and 2 show graphs of such a distribution, known as a normal, normal probability, or Gaussian curve. (The difference in shape is due to the different variability of the two distributions.) In such a normal distribution, scores or measures are distributed symmetrically about the mean, with as many cases up to various distances above the mean as down to equal distances below it. Cases are concentrated near the mean and decrease in frequency, according to a precise mathematical equation, the farther one departs from the mean. Mean and median are identical. The assumption that mental and psychological characteristics are distributed normally has been very useful in test development work.

norms. Statistics that supply a frame of reference by which meaning may be given to obtained test scores. Norms are based upon the actual performance of pupils of various grades or ages in the standardization group for the test. Since they represent average or typical performance, they should not be regarded as standards or as universally desirable levels of attainment. The most common types of norms are deviation IQ, percentile rank, grade equivalent, and stanine. Reference groups are usually those of specified age or grade.

objective test. A test made up of items for which correct responses may be set up in advance; scores are unaffected by the opinion or judgment of the scorer. Objective keys provide for scoring by clerks or by machine. Such a test is contrasted with a "subjective" test, such as the usual essay examination, to which different persons may assign different scores, ratings, or grades.

omnibus test. A test (1) in which items measuring a variety of mental operations are all combined into a single sequence rather than being grouped together by type of operation, and (2) from which only a single score is derived, rather than separate scores for each operation or function. Omnibus tests make for simplicity of administration, since one set of directions and one overall time limit usually suffice. The Elementary, Intermediate, and Advanced tests in the Otis-Lennon Mental Ability Test series are omnibus-type tests, as contrasted with the Kuhlmann-Anderson Measure of Academic Potential, in which the items measuring similar operations occur together, each with its own set of directions. In a spiral-omnibus test, the easiest items of each type are presented first, followed by the same succession of item types at a higher difficulty level, and so on in a rising spiral.

percentile (P). A point (score) in a distribution at or below which fall the percent of cases indicated by the percentile. Thus a score coinciding with the 35th percentile (P35) is regarded as equaling or surpassing that of 35 percent of the persons in the group, and such that 65 percent of the performances exceed this score. "Percentile" has nothing to do with the percent of correct answers an examinee makes on a test.

percentile band. An interpretation of a test score which takes account of the measurement error that is involved. The range of such bands, most useful in portraying significant differences in battery profiles, is usually from one standard error of measurement below the obtained score to one standard error of measurement above it.

percentile rank (PR). The expression of an obtained test score in terms of its position within a group of 100 scores; the percentile rank of a score is the percent of scores equal to or lower than the given score in its own or in some external reference group.
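A minimal sketch (ours, with a fabricated norm group) of the percentile-rank computation as defined above:

    def percentile_rank(score, group_scores):
        """Percent of scores in the reference group equal to or lower than the score."""
        at_or_below = sum(1 for s in group_scores if s <= score)
        return 100 * at_or_below / len(group_scores)

    # A hypothetical norm group of 20 raw scores:
    norm_group = [12, 15, 18, 20, 21, 23, 24, 25, 27, 28,
                  30, 31, 33, 34, 36, 38, 40, 42, 45, 48]
    print(percentile_rank(34, norm_group))  # 70.0: 14 of the 20 scores are at or below 34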

performance test. A test involving some motor or manual response on the examinee's part, generally a manipulation of concrete equipment or materials. Usually not a paper-and-pencil test.

(1) A "performance" test of mental ability is one in whichthe role of language is excluded or minimized, and ability isassessed by what the examinee does rather than by what hesays (or writes). Mazes, form boards, picture completion, andother types of items may be used. Examples include certainStanford-Binet tasks, the Performance Scale of Wechsler Intel-ligence Scale for Children, Arthur Point Scale of PerformanceTests, Raven's Progressive Matrices.

(2) "Performance" tests include measures of mechanicalor manipulative ability where the task itself coincides withthe objective of the measurement, as in the Bennett Hand-Tool Dexterity Test.

(3) The term "performance" is also used to denote a testthat is actually a work-sample; in this sense it may includepaper-and-pencil tests, as, for example, a test in bookkeeping,in shorthand, or in proofreading, where no materials other thanpaper and pencil may be required, and where the test responseis identical with the behavior about which information isdesired. SRA Typing Skills is such a test.

The use of the term "performance" to describe a type of test is not very precise and there are certain "gray areas." Perhaps one should think of "performance" tests as those on which the obtained differences among individuals may not be ascribed to differences in ability to use verbal symbols.

personality test. A test intended to measure one or more of the non-intellective aspects of an individual's mental or psychological make-up; an instrument designed to obtain information on the affective characteristics of an individual (emotional, motivational, attitudinal, etc.) as distinguished from his abilities. Personality tests include (1) the so-called personality and adjustment inventories (e.g., Bernreuter Personality Inventory, Bell Adjustment Inventory, Edwards Personal Preference Schedule) which seek to measure a person's status



on such traits as dominance, sociability, introversion, etc., by means of self-descriptive responses to a series of questions; (2) rating scales which call for rating, by one's self or another, the extent to which a subject possesses certain traits; and (3) opinion or attitude inventories (e.g., Allport-Vernon-Lindzey Study of Values, Minnesota Teacher Attitude Inventory). Some writers also classify interest, problem, and belief inventories as personality tests (e.g., Kuder Preference Record, Mooney Problem Check List). See PROJECTIVE TECHNIQUE.

power test. A test intended to measure level of performance unaffected by speed of response; hence one in which there is either no time limit or a very generous one. Items are usually arranged in order of increasing difficulty.

practice effect. The influence of previous experience with a test on a later administration of the same or a similar test; usually an increased familiarity with the directions, kinds of questions, etc. Practice effect is greatest when the interval between testings is short, when the content of the two tests is identical or very similar, and when the initial test-taking represents a relatively novel experience for the subjects.

predictive validity. See VALIDITY (2).

product-moment coefficient (r). Also known as the Pearson r.See COEFFICIENT OF CORRELATION.

profile. A graphic representation of the results on several tests, for either an individual or a group, when the results have been expressed in some uniform or comparable terms (standard scores, percentile ranks, grade equivalents, etc.). The profile method of presentation permits identification of areas of strength or weakness.

prognosis (prognostic) test. A test used to predict future success in a specific subject or field, as the Pimsleur Language Aptitude Battery.

projective technique (projective method). A method of personality study in which the subject responds as he chooses to a series of ambiguous stimuli such as ink blots, pictures, unfinished sentences, etc. It is assumed that under this free-response condition the subject "projects" manifestations of personality characteristics and organization that can, by suitable methods, be scored and interpreted to yield a description of his basic personality structure. The Rorschach (ink blot) Technique, the Murray Thematic Apperception Test, and the Machover Draw-a-Person Test are commonly used projective methods.

quartile. One of three points that divide the cases in a distribution into four equal groups. The lower quartile (Q1), or 25th percentile, sets off the lowest fourth of the group; the middle quartile (Q2) is the same as the 50th percentile, or median, and divides the second fourth of cases from the third; and the third quartile (Q3), or 75th percentile, sets off the top fourth.

r. See COEFFICIENT OF CORRELATION.

random sample. A sample of the members of some total population drawn in such a way that every member of the population has an equal chance of being included; that is, in a way that precludes the operation of bias or "selection." The purpose in using a sample free of bias is, of course, the requirement that the cases used be representative of the total population if findings for the sample are to be generalized to that population. In a stratified random sample, the drawing of cases is controlled in such a way that those chosen are "representative" also of specified subgroups of the total population. See REPRESENTATIVE SAMPLE.

range. For some specified group, the difference between the highest and the lowest obtained score on a test; thus a very rough measure of spread or variability, since it is based upon only two extreme scores. Range is also used in reference to the possible spread of measurement a test provides, which in most instances is the number of items in the test.

raw score. The first quantitative result obtained in scoring a test. Usually the number of right answers, number right minus some fraction of number wrong, time required for performance, number of errors, or similar direct, unconverted, uninterpreted measure.

readiness test. A test that measures the extent to which an individual has achieved a degree of maturity or acquired certain skills or information needed for successfully undertaking some new learning activity. Thus a reading readiness test indicates whether a child has reached a developmental stage where he may profitably begin formal reading instruction. Readiness tests are classified as prognostic tests.

recall item. A type of item that requires the examinee to supply the correct answer from his own memory or recollection, as contrasted with a recognition item, in which he need only identify the correct answer.

"Columbus discovered America in the year ____." is a recall (or completion) item. See RECOGNITION ITEM.

recognition item. An item which requires the examinee to recognize or select the correct answer from among two or more given answers (options).

Columbus discovered America in (a) 1425 (b) 1492 (c) 1520 (d) 1546

is a recognition item.

regression effect. Tendency of a predicted score to be nearer to the mean of its distribution than the score from which it is predicted is to its mean. Because of the effects of regression, students making extremely high or extremely low scores on a test tend to make less extreme scores, i.e., closer to the mean, on a second administration of the same test or on some predicted measure.

reliability. The extent to which a test is consistent in measuring whatever it does measure; dependability, stability, trustworthiness, relative freedom from errors of measurement. Reliability is usually expressed by some form of reliability coefficient or by the standard error of measurement derived from it.

reliability coefficient. The coefficient of correlation between two forms of a test, between scores on two administrations of the same test, or between halves of a test, properly corrected. The three measure somewhat different aspects of reliability, but all are properly spoken of as reliability coefficients. See ALTERNATE-FORM RELIABILITY, SPLIT-HALF RELIABILITY COEFFICIENT, TEST-RETEST RELIABILITY COEFFICIENT, KUDER-RICHARDSON FORMULA(S).



representative sample. A sample that corresponds to or matches the population of which it is a sample with respect to characteristics important for the purposes under investigation. In an achievement test norm sample, such significant aspects might be the proportion of cases of each sex, from various types of schools, different geographical areas, the several socioeconomic levels, etc.

scholastic aptitude. See ACADEMIC APTITUDE.

skewed distribution. A distribution that departs from symmetry or balance around the mean, i.e., from normality. Scores pile up at one end and trail off at the other.

Spearman-Brown formula. A formula giving the relationship between the reliability of a test and its length. The formula permits estimation of the reliability of a test lengthened or shortened by any multiple, from the known reliability of a given test. Its most common application is the estimation of the reliability of an entire test from the correlation between its two halves. See SPLIT-HALF RELIABILITY COEFFICIENT.

split-half reliability coefficient. A coefficient of reliability obtained by correlating scores on one half of a test with scores on the other half, and applying the Spearman-Brown formula to adjust for the doubled length of the total test. Generally, but not necessarily, the two halves consist of the odd-numbered and the even-numbered items. Split-half reliability coefficients are sometimes referred to as measures of the internal consistency of a test; they involve content sampling only, not stability over time. This type of reliability coefficient is inappropriate for tests in which speed is an important component.
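To make the last two entries concrete, the following sketch (ours, with fabricated odd-item and even-item half scores) correlates the two halves and then applies the Spearman-Brown formula for a test of doubled length:

    import math

    def pearson_r(x, y):
        """Product-moment correlation between two equal-length score lists."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy / math.sqrt(sxx * syy)

    def spearman_brown(r_half, k=2):
        """Estimated reliability of a test k times as long as the one yielding r_half."""
        return k * r_half / (1 + (k - 1) * r_half)

    # Hypothetical odd-item and even-item half scores for eight examinees:
    odd  = [14, 18, 11, 20, 16, 9, 22, 13]
    even = [15, 17, 12, 19, 14, 10, 21, 14]

    r_halves = pearson_r(odd, even)
    print(round(spearman_brown(r_halves), 2))  # split-half reliability estimate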

standard deviation (S.D.). A measure of the variability or dispersion of a distribution of scores. The more the scores cluster around the mean, the smaller the standard deviation. For a normal distribution, approximately two-thirds (68.3 percent) of the scores are within the range from one S.D. below the mean to one S.D. above the mean. Computation of the S.D. is based upon the square of the deviation of each score from the mean. The S.D. is sometimes called "sigma" and is represented by the symbol σ. (See Figure 1.)

[Figure 1. The normal curve, showing relations among standard deviation distances from the mean (34.1 percent of cases fall within one S.D. on either side of the mean, 13.6 percent in the next S.D. band), the percentage of cases between these points, percentile ranks, and IQs from tests with an S.D. of 16.]
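A quick sketch (ours, with made-up scores) of the computation described in the standard deviation entry: the variance is the average squared deviation from the mean, and the S.D. is its square root:

    from math import sqrt

    scores = [60, 64, 70, 72, 75, 78, 81, 85, 88, 92]  # hypothetical raw scores

    mean = sum(scores) / len(scores)
    # Variance: average of the squared deviations from the mean.
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    sd = sqrt(variance)  # the standard deviation ("sigma")

    print(round(mean, 1), round(variance, 1), round(sd, 1))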


standard error (S.E.). A statistic providing an estimate of the possible magnitude of "error" present in some obtained measure, whether (1) an individual score or (2) some group measure, as a mean or a correlation coefficient.

(1) standard error of measurement (S.E. Meas.): As applied to a single obtained score, the amount by which the score may differ from the hypothetical true score due to errors of measurement. The larger the S.E. Meas., the less reliable the score. The S.E. Meas. is an amount such that in about two-thirds of the cases the obtained score would not differ by more than one S.E. Meas. from the true score. (Theoretically, then, it can be said that the chances are 2:1 that the actual score is within a band extending from true score minus 1 S.E. Meas. to true score plus 1 S.E. Meas.; but since the true score can never be known, actual practice must reverse the true-obtained relation for an interpretation.) Other probabilities are noted under (2) below. See TRUE SCORE.

(2) standard error: When applied to group averages, standard deviations, correlation coefficients, etc., the S.E. provides an estimate of the "error" which may be involved. The group's size and the S.D. are the factors on which these standard errors are based. The same probability interpretation as for S.E. Meas. is made for the S.E.s of group measures, i.e., 2:1 (2 out of 3) for the 1 S.E. range, 19:1 (95 out of 100) for a 2 S.E. range, 99:1 (99 out of 100) for a 2.6 S.E. range.
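One common way of estimating the S.E. Meas. from a test's standard deviation and reliability coefficient is S.D. times the square root of (1 minus reliability); that formula is not given in this glossary but is standard in measurement texts. A sketch (ours, with hypothetical figures):

    from math import sqrt

    def standard_error_of_measurement(sd, reliability):
        """Common estimate: S.E. Meas. = S.D. * sqrt(1 - reliability)."""
        return sd * sqrt(1 - reliability)

    sem = standard_error_of_measurement(sd=10, reliability=0.91)  # hypothetical values
    obtained = 54
    # About 2:1 odds that the true score lies within one S.E. Meas. of the
    # obtained score (the reversed interpretation described above).
    print(round(sem, 1), (round(obtained - sem, 1), round(obtained + sem, 1)))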

standard score. A general term referring to any of a variety of "transformed" scores, in terms of which raw scores may be expressed for reasons of convenience, comparability, ease of interpretation, etc. The simplest type of standard score, known as a z-score, is an expression of the deviation of a score from the mean score of the group in relation to the standard deviation of the scores of the group. Thus:

standard score (z) = (raw score (X) - mean (M)) / standard deviation (S.D.)

Adjustments may be made in this ratio so that a system of standard scores having any desired mean and standard deviation may be set up. The use of such standard scores does not affect the relative standing of the individuals in the group or change the shape of the original distribution. T-scores have a M of 50 and an S.D. of 10. Deviation IQs are standard scores with a M of 100 and some chosen S.D., most often 16; thus a raw score that is 1 S.D. above the M of its distribution would convert to a standard score (deviation IQ) of 100 + 16 = 116. (See Figure 1.)

Standard scores are useful in expressing the raw scores of two forms of a test in comparable terms in instances where tryouts have shown that the two forms are not identical in difficulty; also, successive levels of a test may be linked to form a continuous standard-score scale, making across-battery comparisons possible.
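A short sketch (ours) of the z-score formula above and the two derived scales mentioned, T-scores and deviation IQs:

    def z_score(raw, mean, sd):
        """Deviation from the group mean in standard deviation units."""
        return (raw - mean) / sd

    def t_score(z):
        return 50 + 10 * z      # T-scores: M = 50, S.D. = 10

    def deviation_iq(z, sd=16):
        return 100 + sd * z     # deviation IQs: M = 100, chosen S.D. (often 16)

    z = z_score(raw=68, mean=60, sd=8)     # a raw score one S.D. above the mean
    print(z, t_score(z), deviation_iq(z))  # 1.0  60.0  116.0, as in the entry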

standardized test (standard test). A test designed to provide a systematic sample of individual performance, administered according to prescribed directions, scored in conformance with definite rules, and interpreted in reference to certain normative information. Some would further restrict the usage of the term "standardized" to those tests for which the items have been chosen on the basis of experimental evaluation, and for which data on reliability and validity are provided. Others would add "commercially published" and/or "for general use."



stanine. One of the steps in a nine-point scale of standard scores. The stanine (short for standard-nine) scale has values from 1 to 9, with a mean of 5 and a standard deviation of 2. Each stanine (except 1 and 9) is 1/2 S.D. in width, with the middle (average) stanine of 5 extending from 1/4 S.D. below to 1/4 S.D. above the mean. (See Figure 2.)

[Figure 2. Stanines and the normal curve. Each stanine (except 1 and 9) is one-half S.D. in width. Percent of scores in stanines 1 through 9: 4, 7, 12, 17, 20, 17, 12, 7, 4; approximate ranges of percentile ranks: below 5, 5-11, 12-23, 24-40, 41-60, 61-77, 78-89, 90-95, above 95.]
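Under the half-S.D. bands shown in Figure 2, a z-score converts to a stanine by rescaling to a mean of 5 and an S.D. of 2, rounding, and capping at 1 and 9. A small sketch (ours):

    def stanine(z):
        """Convert a z-score to a stanine. Rescaling to mean 5 and S.D. 2 and
        rounding reproduces the half-S.D. bands: stanine 5 spans -0.25 to +0.25
        S.D., each neighboring stanine is 0.5 S.D. wide, capped at 1 and 9."""
        s = round(5 + 2 * z)
        return max(1, min(9, s))

    for z in (-2.3, -0.9, 0.0, 0.3, 1.8):
        print(z, stanine(z))  # 1, 3, 5, 6, 9 for these z-scores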

survey test. A test that measures general achievement in a given area, usually with the connotation that the test is intended to assess group status, rather than to yield precise measures of individual performance.

t. A critical ratio expressing the relationship of some measure (mean, correlation coefficient, difference, etc.) to its standard error. The size of this ratio is an indication of the significance of the measure. If t is as large as 1.96, significance at the .05 level is indicated; if as large as 2.58, at the .01 level. These levels indicate 95 or 99 chances out of 100, respectively.

taxonomy. An embodiment of the principles of classification; a survey, usually in outline form, such as a presentation of the objectives of education.

test-retest reliability coefficient. A type of reliability coefficient obtained by administering the same test a second time, after a short interval, and correlating the two sets of scores. "Same test" was originally understood to mean identical content, i.e., the same form; currently, however, the term "test-retest" is also used to describe the administration of different forms of the same test, in which case this reliability coefficient becomes the same as the alternate-form coefficient. In either case, (1) fluctuations over time and in testing situation, and (2) any effect of the first test upon the second are involved. When the time interval between the two testings is considerable, as several months, a test-retest reliability coefficient reflects not only the consistency of measurement provided by the test, but also the stability of the examinee trait being measured.

true score. A score entirely free of error; hence, a hypothetical value that can never be obtained by testing, which always involves some measurement error. A "true" score may be thought of as the average score from an infinite number of measurements from the same or exactly equivalent tests, assuming no practice effect or change in the examinee during the testings. The standard deviation of this infinite number of "samplings" is known as the standard error of measurement.

validity. The extent to which a test does the job for which it is used. This definition is more satisfactory than the traditional "extent to which a test measures what it is supposed to measure," since the validity of a test is always specific to the purposes for which the test is used. The term validity, then, has different connotations for various types of tests and, thus, a different kind of validity evidence is appropriate for each.

(1) content, curricular validity. For achievement tests, validity is the extent to which the content of the test represents a balanced and adequate sampling of the outcomes (knowledge, skills, etc.) of the course or instructional program it is intended to cover. It is best evidenced by a comparison of the test content with courses of study, instructional materials, and statements of educational goals; and often by analysis of the processes required in making correct responses to the items. Face validity, referring to an observation of what a test appears to measure, is a non-technical type of evidence; apparent relevancy is, however, quite desirable.

(2) criterion-related validity. The extent to which scores on the test are in agreement with (concurrent validity) or predict (predictive validity) some given criterion measure. Predictive validity refers to the accuracy with which an aptitude, prognostic, or readiness test indicates future learning success in some area, as evidenced by correlations between scores on the test and future criterion measures of such success (e.g., the relation of score on an academic aptitude test administered in high school to grade point average over four years of college). In concurrent validity, no significant time interval elapses between administration of the test being validated and of the criterion measure. Such validity might be evidenced by concurrent measures of academic ability and of achievement, by the relation of a new test to one generally accepted as or known to be valid, or by the correlation between scores on a test and criterion measures which are valid but are less objective and more time-consuming to obtain than a test score would be.

(3) construct validity. The extent to which a test measures some relatively abstract psychological trait or construct; applicable in evaluating the validity of tests that have been constructed on the basis of an analysis (often factor analysis) of the nature of the trait and its manifestations. Tests of personality, verbal ability, mechanical aptitude, critical thinking, etc., are validated in terms of their construct and the relation of their scores to pertinent external data.

variability. The spread or dispersion of test scores, best indicated by their standard deviation.

variance. For a distribution, the average of the squared deviations from the mean; thus the square of the standard deviation.

TEST SERVICE NOTEBOOKS are issued from time to time as a professional service of The Psychological Corporation. Inquiries, comments, or requests for additional copies may be addressed to the office nearest you. Write: Advisory Services, The Psychological Corporation, New York, NY 10017; Chicago, IL 60648; San Francisco, CA 94109; Atlanta, GA 30309; Dallas, TX 75235.



Appendix B

Summary of Common Test Scores



SCORES FREQUENTLY ASSOCIATED WITH NORM REFERENCED TESTS

PERCENTILE RANK

Definition: The percentile rank establishes a student's standing relative to a norm group in terms of the percentage of students who scored at or below his or her raw score. For example, a student who scored at the 98th percentile achieved a raw score which was higher than the raw scores of 98 percent of the norm group who took the same test under the same conditions.

Major advantages:
1. Percentiles show the relative standing of individuals compared to a normative group.
2. They are familiar to most public school personnel, though probably not the general public.
3. Percentiles are relatively easily explained.

Major disadvantages:
1. Percentiles are frequently confused with the percent of the total number of test items answered correctly.
2. Since the percentile scale does not have equal units of measurement, percentiles should not be used in the computation of group statistics.


GRADE EQUIVALENT

Definition: The grade equivalent score indicates the performance of a student on a particular test relative to the median performance of students at a given grade level and month; e.g., a fifth grader who receives a grade equivalent score of 8.2 on a reading test achieved the same raw score performance as the typical eighth grader in the second month of eighth grade would be expected to achieve on the same fifth-grade test.

Major advantages:
1. It appears easy to communicate the standing of an individual student relative to a grade level (most people believe they understand what is meant by grade equivalent scores).

Major disadvantages:
1. Grade equivalents are easily misunderstood and misinterpreted.
2. Achievement expressed in grade equivalent score units cannot be meaningfully compared in several instances:
   a. Grade equivalent scores cannot be meaningfully compared for the same student (or group of students) over time.
   b. Grade equivalent scores cannot be meaningfully compared for the same student (or group of students) across subject matter areas.
   c. Grade equivalent scores cannot be meaningfully compared for the same student (or group of students) across different tests.
3. Many grade equivalent scores are statistical projections (interpolations or extrapolations). In the later grades it is not uncommon to find grade equivalent scores of two or three grade levels above or below the student's actual grade level, but these scores are of doubtful accuracy.
4. The grade equivalent scale is not composed of equal-sized units. Having equal-sized units would imply that the underlying difference between any two adjacent scores is the same throughout the scale.



STANDARD SCORES

Definition: Standard scores are derived from raw scores, but express the results of a test on the same numerical scale regardless of grade level, subject area, or test employed.

Major advantages:
1. Since the mean and standard deviation of the standard score scales are pre-specified, a student's standard score immediately communicates two important facts about his or her performance on that test:
   a. Whether the student's score is above or below the mean.
   b. How far above or below the mean, in standard deviation units, his or her performance is.
2. The constant numerical scale of standard scores facilitates comparisons:
   a. Across students taking the same test.
   b. Across subject matter areas for the same student.
3. Standard scores are derived in a way that maintains the equal-interval property in their units which is absent in percentile and grade equivalent scores. Therefore, summary statistics may be meaningfully interpreted when calculated on standard scores.

Major disadvantages:
1. The most useful interpretation of standard scores requires some knowledge of statistics (i.e., mean and standard deviation) and hence may not be appropriate for audiences who have not been exposed to these concepts (e.g., parents, the news media).
2. Given the variety of standard scores available, there may be potential confusion in expressing the same test performance with so many different numerical values.
3. The conversion of raw scores to standard scores may either maintain the shape of the distribution observed, or may transform the distribution to another, more interpretively convenient shape (e.g., the normal distribution); and the procedures employed in specifying the conversion process may not be immediately obvious.

NORMAL CURVE EQUIVALENTS (NCEs)

Definition: A standard score system having 99 equal intervals; the average corresponds to the 50th percentile, and the 1st and 99th NCEs correspond to the 1st and 99th percentiles. Range: generally 1-99, but can be higher and lower.

Major advantages:
1. Same as standard score systems.
2. They permit aggregation of data from a wide variety of tests.

Major disadvantages:
1. They are relatively new.
2. They depend upon standard scores or percentiles.
3. Not all test publishers use them.
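NCEs are commonly computed by converting a percentile rank to its normal-curve z value and rescaling with a mean of 50 and an S.D. of about 21.06, the constant that makes NCEs 1 and 99 coincide with the 1st and 99th percentiles. That constant comes from standard measurement practice, not from this booklet. A sketch (ours):

    from statistics import NormalDist

    def nce_from_percentile(pr):
        """NCE = 50 + 21.06 * z, where z is the normal deviate of the percentile.
        The S.D. of 21.06 is an assumption from common practice, chosen so that
        NCEs 1 and 99 line up with the 1st and 99th percentiles."""
        z = NormalDist().inv_cdf(pr / 100)
        return 50 + 21.06 * z

    for pr in (1, 25, 50, 75, 99):
        print(pr, round(nce_from_percentile(pr), 1))  # 1.0, 35.8, 50.0, 64.2, 99.0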



EXPANDED SCALE SCORES

Definition: Expanded scale scores are a type of standard score whose scale is designed to extend across grade levels and whose mean increases progressively as the grade level increases.

Major advantages:
1. Expanded scores facilitate longitudinal comparisons of an individual across grade levels.
2. Expanded scale scores provide the vehicle for expressing a performance obtained at one grade level to the norm group of another. This is useful when the appropriate level of a test to be administered to a student is judged to be other than that of his or her grade level (i.e., functional level testing).
3. Since they were designed as equal interval, their scores may be mathematically manipulated (e.g., averaged).

Major disadvantages:
1. Different test publishers use different terms to refer to their expanded scale scores (e.g., growth scale values, achievement development scale scores, standard score, scale score) and this may be confusing when considering results from different tests.
2. Different tests use different ranges and standard deviations in deriving their expanded scale scores. Thus, results from different tests expressed in expanded scale score units cannot be readily compared.
3. The statistical properties of expanded scale scores are often not as uniform as theoretically desired.

STANINES

Definition: Stanines are a standard score scale consisting of nine values with a mean of five and a standard deviation of two. If the distribution of scores is normal, each stanine includes a known proportion of the scores in the distribution.

Major advantages:
1. As in all standard scores, stanines have the same meaning across different tests, different grade levels, and different content areas.
2. Stanines consist of only nine possible scores and thus may be easier to communicate to audiences not familiar with measurement terminology. Verbal labels may be given to each stanine value to facilitate interpretation.

Major disadvantages:
1. Since some of the stanines encompass a wide range of scores, their use in reporting can be insensitive to differences between students' performance that are more apparent from the use of other test scores.


SCORES FREQUENTLY ASSOCIATED WITH OBJECTIVE REFERENCED TESTS

RAW SCORE

Definition: The number of items on a test or subtest answered correctly by the student.

Major advantages:
1. Virtually no statistical or measurement expertise is needed to calculate raw scores.
2. Raw scores are the necessary first step in expressing test performance in any of a number of other ways (e.g., standard scores, percentiles).

Major disadvantages:
1. By themselves, raw scores offer no indication as to how a student who has mastered the skills represented on the test "should" perform (i.e., criterion referenced) or how other students at the same grade level have performed (i.e., norm referenced).
2. No notion of test difficulty or expected performance is contained in this score.

PERCENT CORRECT

Definition: The proportion of the total number of items answered correctly by the student.

Major advantages:
1. Very little statistical or measurement expertise is required to understand this expression of test performance.
2. If the content area is sufficiently represented by the items on the test, the percent correct provides an expression of the proportion of the subject matter mastered by the student.

Major disadvantages:
1. Unless accompanied by a standard for mastery or information as to how a student's peers have performed on the test, misinterpretations may arise.

OBJECTIVE MASTERY SCORE

Definition: When a standard for mastery has been applied to a set of items for a specific objective, a student's performance in terms of that objective is expressed as having mastery or non-mastery of the objective.

Major advantages:
1. The objective mastery score compares the student's performance on that objective to a judged standard of what he or she should know of the skills required to master it. This score can be very useful in diagnosing a student's specific strengths and weaknesses.
2. When the subject matter requires a successive accumulation of skills (e.g., elementary math), objective mastery scores may be extremely useful in monitoring the progress of students in specific skill areas.

Major disadvantages:
1. Objective mastery scores are difficult to compare across different tests. Items designed to measure the same objective may differ in difficulty or have different standards for mastery on different tests.
2. If a purpose in testing is to differentiate among students, objective mastery scores do not present a very useful index. Different raw scores above or below the mastery level are viewed as the same: either mastery or non-mastery.



Recommended