TESTING FOR LANGUAGE TEACHERS
ARTHUR HUGHES
Mohammad Pazhouhesh
Khayyam University of Mashhad
Farhangian University of Mashhad, Beheshti Campus
TEACHING AND TESTING
BACKWASH
The effect of testing on teaching and learning. If the test is important, it can dominate all teaching and learning activities. It can be harmful or beneficial.
Harmful backwash: the test content and testing techniques are at variance with the objectives of the course.
Beneficial backwash: the test has an immediate positive effect on teaching.
All measures of mental ability are necessarily indirect, incomplete, imprecise, subjective, and relative.
To minimize the effects of these limitations:
A. Provide clear theoretical definitions of the abilities we want to measure.
B. Specify precisely the conditions, or operations, that we will follow in eliciting and observing performance.
C. Quantify the observations so as to ensure our measurement scales have the properties we require.
GENERAL TYPES OF TESTS
Proficiency tests
Achievement tests
Diagnostic tests
Placement tests
Selection tests
Competition tests
Aptitude tests: language aptitude tests, vocational aptitude tests
KINDS OF TESTS AND TESTING
PROFICIENCY TESTS
measure language ability regardless of any previous training.
are not based on the content/objectives of language courses.
require a specification of what candidates must be able to do to be considered proficient.
Proficiency: having sufficient command of the language.
Used for a particular purpose, such as: a translator in the United Nations; a student seeking admission to American/British universities.
Used for a general purpose, such as: general proficiency tests (FCE, CPE, TOEFL, IELTS).
ACHIEVEMENT TESTS
are directly related to language courses.
are used to determine whether students have achieved the objectives of the course.
Kinds of achievement tests: 1. Final achievement tests 2. Progress achievement tests
Final achievement tests are administered at the end of a course of study. Their content is related to the course concerned.
Syllabus-content approach: should the test be based directly on a detailed course syllabus?
Disadvantage: if the syllabus is badly designed, the results of the test could be misleading.
Course-objectives approach: should the test be based on course objectives?
Advantages:
compelling course designers to be explicit about course objectives
making it possible for the test to show how far objectives have been achieved
compelling the course designers to choose a syllabus which is consistent with the course objectives
working against poor teaching practice
promoting a more beneficial backwash effect
PROGRESS ACHIEVEMENT TESTS
measure the progress of students. One way to measure progress is to administer final achievement tests repeatedly.
Disadvantage: the low scores obtained in the early stages are discouraging.
The alternative is to establish a series of well-defined short-term objectives. These should make a clear progression towards the final achievement test based on course objectives.
Pop quizzes make a rough check on students’ progress and keep students on their toes.
DIAGNOSTIC TESTS
• are used to identify learners’ strengths and weaknesses
• are intended to ascertain what learning still needs to take place
• can tell us that someone is particularly weak in, say, speaking as opposed to reading in a language
• Proficiency tests may prove adequate for this purpose
• Teachers may even need to analyze samples of a person’s performance in writing or speaking in order to create profiles of the student’s ability in certain categories
PLACEMENT TESTS
are intended to place students at the stage of the teaching program most appropriate to their abilities.
are used to assign students to classes at different levels.
are constructed for particular situations.
depend on the identification of the key features at different levels of teaching.
APTITUDE TESTS
indicate an individual’s facility for acquiring specific skills and learning.
are used to measure aptitude for learning and to predict future performance.
DIRECT VS. INDIRECT TESTING
Direct tests require the candidate to perform precisely the skill that we wish to measure.
If we want to know how well candidates can write compositions, we get them to write compositions.
If we want to know how well they pronounce a language, we ask them to speak.
The tasks, and the texts used, should be as authentic as possible.
Direct testing is easier to carry out for measuring the productive skills.
Attractions of direct testing:
1. It is straightforward to create the conditions to elicit the required behaviors.
2. Assessment and interpretation are straightforward.
3. It has a helpful backwash effect.
SEMI-DIRECT TESTING
e.g., speaking tests where candidates respond to tape-recorded stimuli, their own responses being recorded and later scored.
INDIRECT TESTING measures the abilities that underlie the skills tested.
EXAMPLE: one section of the TOEFL serves as an indirect measure of writing ability: “At first the old woman seemed unwilling to accept anything that was offered her by my friend and I.”
The main appeal of indirect testing: testing a large number of elements in one test; giving it to a large number of students; correcting it objectively.
The main problem of indirect testing: the relationship between performance in the test and actual performance of the skills being tested is weak in strength and uncertain in nature.
DISCRETE POINT VS. INTEGRATIVE TESTING
A. DISCRETE POINT TESTING
refers to the testing of one element at a time, item by item.
might take the form of a series of items, each testing a particular grammatical structure.
is a testing approach which cuts up language skills and components into smaller parts and then tests them one by one.
is an atomistic approach to language teaching and learning.
B. INTEGRATIVE TESTING
requires the candidate to combine many language elements in the completion of a task: writing a composition, taking notes while listening to a lecture, taking a dictation, completing a cloze passage.
Unlike discrete point tests, integrative tests tend to be direct, although some integrative methods, such as the cloze procedure, are indirect. Diagnostic tests of grammar tend to be discrete point.
NORM-REFERENCED VS. CRITERION-REFERENCED TESTING
A. NORM-REFERENCED TESTING (NRT) relates one candidate’s performance to that of other candidates. We are not told directly what the student is capable of doing in the language.
B. CRITERION-REFERENCED TESTING (CRT) provides direct information about what a candidate can actually do in the language.
OBJECTIVE VS. SUBJECTIVE TESTING
A. OBJECTIVE TESTING: no judgment is required on the part of the scorer (multiple-choice tests).
B. SUBJECTIVE TESTING: judgment is called for on the part of the scorer (composition).
There are different degrees of subjectivity in testing. Scoring a composition is more subjective than scoring short-answer items.
Objectivity in scoring brings greater reliability to testing.
Scoring rubrics can increase the reliability of subjective tests such as composition.
COMPUTER ADAPTIVE TESTING
an efficient way of collecting information on testees’ ability.
Items of average difficulty are presented initially. Those who respond correctly are presented with a more difficult item; those who respond incorrectly are presented with an easier item. The computer adapts the items to the testees’ level.
There is no real need for strong candidates to attempt easy items, and no need for weak candidates to attempt difficult items.
Oral interviews are typically a form of adaptive testing.
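The adaptive procedure can be sketched in a few lines of Python. The five-level item bank, the fixed test length, and the one-step difficulty moves are invented for illustration; operational CAT systems estimate ability with item response theory models rather than this simple rule:

```python
def run_adaptive_test(item_bank, answers, start_level=3):
    """item_bank: dict mapping difficulty level (1-5) to a list of items.
    answers: callable(item) -> True if the testee responds correctly."""
    level = start_level               # begin with an item of average difficulty
    administered = []
    for _ in range(5):                # fixed test length for this sketch
        item = item_bank[level].pop(0)
        correct = answers(item)
        administered.append((item, correct))
        # correct response -> harder next item; incorrect -> easier next item
        level = min(5, level + 1) if correct else max(1, level - 1)
    return administered
```

Simulating a testee who can handle items below level 4 shows the test oscillating around that level, which is exactly the information the tester wants.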
COMMUNICATIVE LANGUAGE TESTING
measures the ability to take part in acts of communication, including reading and listening.
It is assumed that it is usually communicative ability that we want to test.
CHAPTER 4 : VALIDITY
VALIDITY: Definition. A test is valid if it measures accurately what it is intended to measure.
Types of validity: construct, content, criterion-related, face.
1. CONSTRUCT VALIDITY
the degree to which a test measures what it claims, or purports, to be measuring.
Construct: an attribute, ability, or skill that happens in the human brain and is defined by established theories. Intelligence, motivation, anxiety, proficiency, and fear are all examples of constructs. They exist in theory and have been observed to exist in practice. Constructs exist in the human brain and are not directly observable.
There are two types of construct validity: convergent and discriminant validity. Construct validity is established by looking at numerous studies that use the test being evaluated.
2. CONTENT VALIDITY
The test content is a representative sample of the language skills being tested. The test is content valid if it includes a proper sample.
Importance of content validity:
the greater a test’s content validity, the more likely its construct validity
a test without content validity is likely to have a harmful backwash effect, since areas that are not tested are likely to become ignored in teaching and learning
3. CRITERION-RELATED VALIDITY
The degree to which results on the test agree with those provided by an independent criterion.
Kinds of criterion-related validity:
A. Concurrent validity is established when the test and the criterion are administered at the same time.
B. Predictive validity concerns the degree to which a test can predict candidates’ future performance.
VALIDITY COEFFICIENT
A mathematical measure of similarity to show the degree of validity.
Perfect validity will result in a coefficient of 1.00; total lack of validity results in a coefficient of 0.00.
What counts as satisfactory validity depends on the test’s purpose and importance; a coefficient of 0.70 might be considered low if the test is important.
VALIDITY IN SCORING
A reading test may call for short written responses. If the scoring of these responses takes into account spelling and grammar, then it is not valid in scoring.
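In practice the validity coefficient is a correlation coefficient between test scores and criterion scores. A minimal sketch (the two score lists and the “teachers’ ratings” criterion are invented for illustration):

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Scores on the test and on an independent criterion (e.g. teachers' ratings)
test_scores = [55, 60, 62, 70, 75, 80, 88, 90]
criterion_scores = [50, 58, 65, 68, 74, 79, 85, 92]

validity_coefficient = pearson(test_scores, criterion_scores)
```

A coefficient near 1.00 indicates close agreement with the criterion; near 0.00, no agreement at all.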
4. FACE VALIDITY
The way the test looks to the examinees, test administrators, educators, and the like.
If you want to test students’ pronunciation but you do not ask them to speak, your test lacks face validity.
If your test contains items or materials which are not acceptable to candidates, teachers, educators, etc., your test lacks face validity.
HOW TO MAKE TESTS MORE VALID?
Write explicit specifications for the test, which include all the constructs to be measured.
Make sure that you include a representative sample of the content.
Use direct testing.
Make sure the scoring is valid.
Make the test reliable.
CHAPTER 5 : RELIABILITY
RELIABILITY refers to the stability or consistency of scores: nearly the same scores for the same individuals in two sessions. Multiple-choice tests have a high coefficient of reliability. Look at the tables on p. 37.
RELIABILITY COEFFICIENT
The ideal coefficient is 1.00; total lack of reliability is 0.00. Satisfactory reliability depends on the purpose and importance of the test.
Vocabulary, structure, and reading tests: .90 - .99
Auditory comprehension tests: .80 - .89
Oral production tests: .70 - .79
HOW TO ESTIMATE RELIABILITY? The way in which the reliability coefficient is arrived at:
Test-Retest Method: the same students take the same test twice, and the two sets of scores are then compared.
Drawbacks of this method: if the second administration comes too soon, the students will remember the test, and their scores will be affected; if the interval is too long, the students will forget or improve, and that will affect the scores.
The Alternate Forms Method: two equivalent forms are used, but the problem is that such forms are usually not available.
The Split-Half Method: the most common method of obtaining reliability. The subjects take the test one time, but each subject is given two scores, one for each half of the test. The two sets of scores are then used to obtain the reliability coefficient.
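The split-half procedure can be sketched as follows. The 0/1 item scores are invented, an odd/even split of the items is assumed, and the Spearman-Brown correction compensates for the fact that the two halves are only half-length tests:

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def split_half_reliability(item_scores):
    """item_scores: one list of 0/1 item responses per testee."""
    odd = [sum(row[0::2]) for row in item_scores]   # score on odd-numbered items
    even = [sum(row[1::2]) for row in item_scores]  # score on even-numbered items
    r_half = pearson(odd, even)
    # Spearman-Brown correction: estimate reliability of the full-length test
    return 2 * r_half / (1 + r_half)
```

Correlating the two half scores and correcting upward is what makes a single administration sufficient, which is why this method is so widely used.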
THE STANDARD ERROR OF MEASUREMENT AND THE TRUE SCORE
All test scores are estimates; all tests contain some degree of error. You have to use a statistic known as the standard error of measurement to estimate the limits within which an obtained score is likely to diverge from the true score.
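The standard error of measurement is commonly estimated from the test’s standard deviation and its reliability coefficient as SEM = SD × √(1 − r). A minimal sketch with invented figures:

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability coefficient)."""
    return sd * sqrt(1 - reliability)

# With SD = 10 and reliability = 0.91, SEM = 3.0: we can be about 68%
# confident that the true score lies within 3 points of the obtained score.
sem = standard_error_of_measurement(10, 0.91)
```

Note how the SEM shrinks as reliability rises: a perfectly reliable test (r = 1.00) would have no measurement error at all.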
SCORER RELIABILITY
Consistency of scoring: nearly the same score for the same test. In other words, comparing the scores given by two or more scorers for the same students. In composition tests, scores usually fluctuate; in multiple-choice tests, agreement is nearly perfect. If the scoring of a test is not reliable, then the test results cannot be reliable either.
HOW TO MAKE TESTS MORE RELIABLE?
1) Take enough samples of behavior. The more items you have on a test, the more reliable the test will be.
Considerations to be taken into account when adding extra items:
additional items should be independent of each other and of existing items
each additional item should represent a fresh start for the candidate
tests should be neither too long nor too short
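The gain from adding comparable items can be estimated with the Spearman-Brown prophecy formula, a standard psychometric result (not specific to this book): r_new = k·r / (1 + (k − 1)·r), where k is the factor by which the test is lengthened.

```python
def predicted_reliability(r, k):
    """Spearman-Brown prophecy: reliability after lengthening a test
    by a factor k with items comparable to the existing ones."""
    return k * r / (1 + (k - 1) * r)

# Doubling a test whose reliability is 0.60 predicts a reliability of 0.75
```

The formula also shows diminishing returns, which supports the advice that tests should be neither too long nor too short.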
2) Exclude items which do not discriminate well between weaker and stronger students. Items on which strong students and weak students perform with a similar degree of success contribute little to the reliability of a test. Items that are too easy or too difficult should be excluded. A small number of easy items may be kept at the beginning of a test to give candidates confidence and reduce the stress they feel.
3) Do not allow candidates too much freedom. Giving candidates a choice of questions has a negative effect on reliability; in general, candidates should not be given a choice.
4) Write unambiguous items.
5) Provide clear and explicit instructions.
6) Ensure that tests are well laid out and perfectly legible.
7) Make candidates familiar with the format and testing techniques.
8) Provide uniform and non-distracting conditions of administration.
9) Use items that permit scoring which is as objective as possible.
10) Provide a detailed scoring key.
11) Train scorers.
12) All scorers should follow the same criteria for scoring.
13) Identify candidates by number, not name.
14) Employ multiple, independent scoring.
RELIABILITY AND VALIDITY
A valid test must be reliable; however, a reliable test may not be valid at all. Increasing the reliability of a test may come at the expense of validity. There will always be some tension between reliability and validity; the tester has to balance gains in one against losses in the other.
CHAPTER 6 : ACHIEVING BENEFICIAL BACKWASH
Test the abilities whose development you want to encourage.
Beware of reasons for not testing particular abilities: for example, problems of scoring (objectivity in the case of multiple-choice items, subjective scoring in the case of subjective tests) and the expense involved in terms of time and money.
Determine the points that should be tested and give them sufficient weight in relation to the other abilities.
How to achieve beneficial backwash:
I. Sample widely and unpredictably.
II. Use direct testing.
III. Make testing criterion-referenced.
IV. Base achievement tests on objectives.
V. Ensure the test is known and understood by students and teachers.
VI. Where necessary, provide assistance to teachers.
VII. Count the cost.
CHAPTER 7: STAGES OF TEST DEVELOPMENT
1) Make a full and clear statement of the testing ‘problem’
2) Write complete specifications for the test
3) Write and moderate items
4) Try the items on native speakers
5) Try the items on non-native speakers
6) Analyze the results of the trial and make necessary changes
7) Calibrate scales
8) Validate
9) Write handbooks for test takers, test users, and staff
10) Train any necessary staff (interviewers, raters, etc.)
1) Stating the problem
The questions to be answered in order to state the problem:
i) What kind of test is it to be?
ii) What is its precise purpose?
iii) What abilities are to be tested?
iv) How detailed must the results be?
v) How accurate must the results be?
vi) How important is backwash?
vii) What constraints are set by unavailability of expertise, facilities, and time?
2) Writing specifications for the test
i) Determining content: specifying instructional objectives; preparing a table of specifications; determining the number of items.
ii) Necessary operations by the test developer. Specification of:
text types (letters, forms, academic essays)
addressees of texts
length of text(s)
topics (familiar/unfamiliar)
readability
structural and vocabulary range
dialect, accent, style
speed of processing (words to be read per minute, rate of speech)
iii) Structure, timing, medium/channel, and techniques: test structure (test sections: grammar, vocabulary, reading); number of items; number of passages; medium/channel (tape, paper and pencil, ...); timing; techniques.
iv) Critical levels of performance
v) Scoring procedures: subjective or objective
3) Writing and moderating items
i) Sampling (based on the content)
ii) Writing items
iii) Moderating (reviewing) items
4) Pretesting
Informal trial of items on native speakers; trialing items on non-native speakers (pretesting).
6) Item analysis (analysis of the results): reliability, level of difficulty, discrimination index, distractors, clarity of instructions and items, timing.
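Two of these statistics, the facility value (level of difficulty) and the discrimination index, can be sketched as follows; the 0/1 response data and the top/bottom-third grouping are illustrative assumptions:

```python
def facility_value(item_responses):
    """Proportion of testees answering the item correctly (0/1 responses)."""
    return sum(item_responses) / len(item_responses)

def discrimination_index(item_responses, total_scores, group_fraction=1 / 3):
    """Proportion correct in the top-scoring group minus proportion correct
    in the bottom-scoring group, with groups ranked by total test score."""
    n = max(1, round(len(total_scores) * group_fraction))
    ranked = sorted(zip(total_scores, item_responses), key=lambda pair: pair[0])
    bottom = [item for _, item in ranked[:n]]
    top = [item for _, item in ranked[-n:]]
    return sum(top) / n - sum(bottom) / n
```

An item answered correctly mostly by strong candidates gets a high discrimination index; an item that everyone (or no one) gets right contributes little to reliability and is a candidate for exclusion.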
7) Calibration of scales: for testing speaking and writing, a team of experts looks at samples of performance and assigns each of them to a point on the relevant scale.
8) Validation: essential for proficiency tests and repeatedly used tests.
9) Writing handbooks for test takers, test users, and staff: essential for proficiency tests and repeatedly used tests.
10) Training staff: essential for proficiency tests and repeatedly used tests.
See pp. 66 – 72 for examples of test development