Language Testing in Theory and Practice
Aprendizaje y Enseñanza de la Lengua Extranjera I
Alfonsi Arcos Rus Página 1
Some warm-up expressions:
Image-in-the-mirror syndrome: teachers' fear of giving exams, because their
teaching is reflected in the test results; a test is feedback on the teacher too.
Evaluation: it takes into account everything in a teaching programme.
Assessment: it takes into account all the competences (all the learner's skills and
knowledge).
Testing: it takes into account the formal knowledge of what the students have
learnt at a particular time.
The main purpose of testing is to obtain feedback on what the students are learning.
Mode: the most common value in a group of numbers.
Mean: the central tendency of a collection of numbers, taken as the sum of
the numbers divided by the size of the collection.
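As a quick illustration, both statistics can be computed directly; a minimal Python sketch with hypothetical candidate scores:

```python
from statistics import mean, mode

# Hypothetical scores obtained by a group of candidates on a test.
scores = [5, 7, 7, 8, 6, 7, 9, 4]

print(mode(scores))  # most common value -> 7
print(mean(scores))  # sum divided by count -> 53 / 8 = 6.625
```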
Evaluation > Assessment > Testing (from broadest to narrowest in scope).
Analytic Scoring: a procedure in which evaluators rate or score the separate parts or traits (dimensions) of an examinee's product or process first, then sum these part scores to obtain a total score. A piece of writing, for example, may be rated separately on ideas and content, organization, voice, choice of words, sentence structure, and use of English mechanics; these separate ratings are then combined into an overall assessment. It is the more objective procedure.
Holistic Scoring: the word "holistic" means looking at the whole rather than at the parts; holistic scoring evaluates essays as complete units rather than as collections of constituent elements. It has three aims: valid, quick and reliable evaluation of student essays. It reflects the general impression you get, and it is better for beginners.
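The summation step of analytic scoring can be sketched in a few lines of Python; the trait names and the 1-5 scale below are illustrative, not a fixed rubric:

```python
# Analytic scoring: rate each trait of a piece of writing separately
# (here on an illustrative 1-5 scale), then sum the part scores.
trait_scores = {
    "ideas_and_content": 4,
    "organization": 3,
    "voice": 4,
    "word_choice": 3,
    "sentence_structure": 5,
    "mechanics": 4,
}

total = sum(trait_scores.values())
print(total)  # overall analytic score -> 23
```

A holistic scorer would instead assign a single overall band directly, from a general impression of the whole essay.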
1. Kinds of Tests:
1.1. Purpose:
1.1.1. Proficiency: a test that tries to measure the level of the student in a
given language, regardless of any previous experience.
1.1.2. Diagnostic: used to diagnose particular language problems, to check
on students' progress in learning particular elements of the course, and
to discover the students' weaknesses. They help learners know whether
they need further teaching. The exam is taken individually, but to
obtain a useful result we need to analyse in depth the results of a large
group, because that way we can see whether many of them have difficulties
or whether they have achieved the knowledge taught.
1.1.3. Achievement: a test intended to evaluate the learners' knowledge
according to a given syllabus. There are two types:
Final achievement: it covers the whole course syllabus.
Progress achievement: it covers the different parts of the
course syllabus.
1.1.4. Aptitude: it takes place before the foreign language course to check the
strengths and weaknesses of students, taking into account factors like
intelligence, age, motivation, memory, phonological sensitivity and
sensitivity to grammatical patterns.
1.1.5. Placement: it gathers information about students' abilities in order to
place them in the class most adequate for their level; that is, it helps to
place students at the stage of the teaching programme most appropriate
to their abilities.
1.2. Frame of Reference:
1.2.1. Norm-referenced: it relates one learner's performance to that of the
other students. Teachers are not told directly what the students are
capable of doing with the language.
1.2.2. Criterion-referenced: students are classified according to whether they
are able to perform certain tasks; what they can actually do in the
language is measured.
1.3. Scoring procedure:
1.3.1. Discrete-point vs. integrative: discrete-point (DP) testing tests one
particular element at a time, whereas integrative testing requires the
candidate to combine many language elements in the completion of a task.
Discrete tests favour objective scoring (for example, of grammatical
structures) and are always indirect. Integrative tests favour subjective
scoring (writing a composition, dictation, making notes while listening
to a lecture) and tend to be direct.
1.3.2. Objective vs. subjective: this distinction concerns the way of questioning.
Objective tests contain closed questions or multiple-choice questions;
they do not require any judgment on the part of the scorer.
Subjective tests contain open questions and need judgment from the
scorer. There are different levels of subjectivity: for example, the
scoring of a composition may be considered more subjective than the
scoring of short answers in response to a reading passage.
1.4. Content:
1.4.1. Direct testing: it is as authentic as possible; candidates have to perform
the very skill to be measured, as in writing, where candidates have to
write a composition. A test is said to be direct when it tests the skill we
wish to measure. It is used to test the productive skills, such as writing
or speaking, but it is not so good for measuring reading or listening,
because understanding is hard to test directly. There is likely to be a
helpful washback effect.
1.4.2. Indirect testing: it measures the abilities which underlie the skills we
are interested in.
Eg.:
• Testing writing: candidates identify mistakes in a text, or underline
matching words in a sample text.
• Testing pronunciation: candidates use pen and paper to match similar
sounds, without speaking.
2. Testing requirements:
2.1. Content validity: The test must be coherent with the teaching itself. You have
to test the contents you have taught in class. For example, if you have taught the
uses of the verb to be, your exam will have to deal with exercises focused on
the verb to be and it has to contain a proper sample of the relevant structures.
In order to know if a test has content validity, we need a specification of the
skills and structures it is meant to cover. The greater a test’s content validity,
the more likely it is to be an accurate measure of what it is supposed to
measure.
2.2. Construct validity: A test is said to have construct validity if it measures just
the ability it is supposed to measure, according to the methodology you have
used when teaching. The word construct refers to any underlying ability which
is hypothesized in a theory of language ability. For example, the ability to read
involves a number of sub-abilities (such as guessing the meaning of unknown
words from the context) and we have to take into account all of them.
2.3. Concurrent or empirical validity: It is established when the test and the
criterion are administered at about the same time. This validity is obtained as a
result of comparing the results of the test with the results of some criterion
measure such as:
An existing test, known or believed to be valid and given at the same
time.
The teacher's ratings or any other such form of independent
assessment given at the same time.
The subsequent performance of testees on a certain task measured by
some valid test.
The teacher’s ratings or any other such form of independent
assessment given later.
2.4. Face validity: a test is said to have face validity when it looks like a test. It is
advisable to show the test to colleagues in order to obtain different viewpoints
on it; this can also reveal mistakes we might not otherwise notice, such as
absurdities or ambiguities. Some language tests may lack face validity
depending on the country in which they are used. Students' motivation is
better maintained if the test has face validity.
Note: of the four criterion measures listed under concurrent validity, the first
two (given at the same time) gauge the test's concurrent validity, while the
last two (given later) gauge its predictive validity.
2.5. Reliability: reliability is a necessary characteristic of any good test: all tests
must first be reliable as measuring instruments. If the test is administered to
the same candidates on different occasions and produces different results,
the test is said to be unreliable; this is commonly known as test/re-test
reliability. If the test is marked by different examiners and they give similar
marks, this is known as mark/re-mark reliability. In order to be reliable, a
test must be consistent in its measurements:
The larger the sample of items, the greater the probability that the test
as a whole is reliable.
One way of checking a test's reliability is to administer the same test
to different groups or at different times, especially for tests of oral
production and listening comprehension.
The instructions must be clear.
One of the most important factors affecting reliability is the scoring of
the test. Objective tests overcome this problem, but reliability in
subjective tests is much harder to demonstrate. To improve it, multiple
marking or rating scales are used, and the candidates' skills are
measured separately: listening, speaking, writing, reading.
2.6. Practicality/Administration: a test must be practicable, which means that it
must be fairly straightforward to administer. We have to take into account the
time spent on administering the test, reading the test instructions, collecting
the answer sheets, etc. Another important factor is where candidates write
their answers: on the test sheet itself or on a separate sheet of paper. A
separate answer sheet may be preferable when testing a large group of
candidates. We must also bear in mind the equipment and facilities of the
centre where the test will take place; e.g. there is no sense in recording voices
or dialogues on tape if there is no cassette player in the centre.
2.7. Discrimination: an important feature of a test is its capacity to discriminate
among the different candidates and to reflect the differences in the performance
of the individuals in the group. The results of the test are examined to determine
the extent to which it discriminates between individuals who are different;
candidates must obtain different scores, otherwise there is no discrimination.
Discrimination is useful for finding out which elements of the teaching syllabus
have been mastered and which have not, and it can also be used to assess
relative abilities and locate areas of difficulty. Full precision is really difficult,
indeed impossible, to obtain; this lack of precision is known as the test's
margin of error.
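One common way to quantify how well a single item discriminates is to compare the proportion of high scorers and low scorers who answered it correctly; a sketch with hypothetical data:

```python
# Item discrimination index for one test item (hypothetical data).
# 1 = answered correctly, 0 = answered incorrectly.
upper_group = [1, 1, 1, 0, 1]  # candidates with the highest total scores
lower_group = [0, 1, 0, 0, 1]  # candidates with the lowest total scores

# Difference of proportions: positive values mean the item separates
# stronger candidates from weaker ones; values near 0 mean no discrimination.
d = sum(upper_group) / len(upper_group) - sum(lower_group) / len(lower_group)
print(d)
```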
3. Testing Methods:
3.1. Multiple-Choice Tests: multiple choice is a form of assessment in which
respondents are asked to select the best possible answer (or answers) from a
list of choices.
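Because each item has one predetermined key, multiple-choice tests can be scored mechanically; a minimal sketch with a hypothetical answer key:

```python
# Objective scoring of a multiple-choice paper against an answer key.
answer_key = {1: "b", 2: "d", 3: "a", 4: "c"}  # hypothetical key
candidate = {1: "b", 2: "c", 3: "a", 4: "c"}   # one candidate's answers

score = sum(1 for item, key in answer_key.items()
            if candidate.get(item) == key)
print(f"{score}/{len(answer_key)}")  # -> 3/4
```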
3.2. Cloze Procedure: Cloze procedure is a technique in which words are deleted
from a passage according to a word-count formula or various other criteria. The
passage is presented to students, who insert words as they read to complete and
construct meaning from the text. This procedure can be used as a diagnostic
reading assessment technique.
3.2.1. Traditional Cloze: Words are deleted at regular intervals, typically
every seventh or eighth word. The more frequent the deletions, the more
difficult the test.
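The deletion rule is mechanical enough to automate; a sketch of a traditional cloze generator in Python (the passage and the interval n=7 are illustrative):

```python
# Traditional cloze: replace every n-th word of a passage with a blank.
def make_cloze(text, n=7, blank="____"):
    words = text.split()
    return " ".join(blank if i % n == 0 else word
                    for i, word in enumerate(words, start=1))

passage = ("The quick brown fox jumps over the lazy dog while the "
           "patient cat watches quietly from the garden wall")
print(make_cloze(passage))
```

Lowering n deletes words more often and so produces a harder test.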
3.2.2. Modified Cloze: words are deleted selectively, depending on what the
testers want to test. All the deletions may test the same language point
(e.g. past forms) or they may test different but specific language points
that the testers are concerned with.
3.2.3. Multiple-Choice Cloze: An easier version of either traditional or
modified cloze where the learner is offered choices from which to select
his/her answers.
3.2.4. Authentic Cloze: A version of traditional cloze in which the tester
simply cuts a number of letters off from the beginning or end of each line,
making the text look as if it has been clipped in a photocopier. Again, the
more letters that are cut off, the more difficult the test.
3.2.5. C-Testing: The most recent variation in which the second half of every
second word is deleted (words with odd numbers of letters have the extra
letter supplied). Giving the first half of the word makes the test easier, but
deleting every second word restores the difficulty.
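The C-test rule (delete the second half of every second word, keeping the extra letter when the word length is odd) can likewise be sketched; the sample sentence is illustrative:

```python
import math

# C-test: truncate every second word to its first half; words with an
# odd number of letters keep the extra letter (first ceil(len/2) letters).
def make_c_test(text):
    out = []
    for i, word in enumerate(text.split(), start=1):
        if i % 2 == 0:
            keep = math.ceil(len(word) / 2)
            out.append(word[:keep] + "_" * (len(word) - keep))
        else:
            out.append(word)
    return " ".join(out)

print(make_c_test("language testing should measure real ability"))
# -> language test___ should meas___ real abil___
```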
3.3. Guessing from context: candidates have to guess the meaning of unknown
words from the context.
4. Designing classroom tests: construction of a classroom test:
4.1. Planning stage: (see the slides)
4.1.1. Learners’ need of a test.
4.1.2. Specification and sampling.
4.1.3. Construct and content validity, reliability, practicality.
4.2. Development stage:
4.2.1. Item construction (Input + Method).
Compilation of inputs or texts.
Methods.
Channel.
Strategy.
Levels of difficulty.
Avoidance of overlapping, tricky questions and ambiguity.
4.2.2. Instructions.
4.2.3. Design layout/ format.
4.2.4. Consideration of scoring.
4.3. Control and operational stage:
4.3.1. Administration of the test.
4.3.2. Performance of statistical tests.
4.3.3. Washback (pedagogical effects).
A test will influence teaching and learning.
A test will influence what/how teachers teach and what/how learners
learn.
A test will influence the RATE and SEQUENCE of teaching/
learning.
A test will influence the DEGREE and DEPTH of teaching/learning.
A test will influence the ATTITUDES to the content, method, etc. of
teaching/ learning.
Tests that have important consequences will have washback; those
that don't, won't.
Tests will have washback on ALL teachers and learners.
4.3.4. Presentation of test to students (correction).
5. Approaches to language testing:
Spolsky (1975) identified three periods of language testing: the pre-scientific, the
psychometric-structuralist and the psycholinguistic-sociolinguistic.
The pre-scientific period: testing in the pre-scientific era did
not rely on linguistic theory, and reliability was considered less important
than the production of a test that "felt fair".
The psychometric-structuralist period: the name was intended
to reflect the joint contribution of the structural linguists, who identified
the elements of language they wanted to test, and the psychometricians,
who produced objective and reliable methods of testing the candidates'
control of those elements.
The psycholinguistic-sociolinguistic period: by the 1970s discrete-point
testing was no longer felt to provide a sufficient measure of
language ability, and testing moved into the psycholinguistic-sociolinguistic
era, with the advent of global integrative testing. Oller
(1979, cited in Weir 1990) argued that global integrative testing, such as
cloze tests, which required candidates to insert suitable words into gaps
in a text, and dictation, provided a closer measure of the ability to
combine language skills as they are combined in actual language use
than discrete-point testing did.
The Communicative Period: The fact that discrete point and
integrative testing only provided a measure of the candidate’s
competence rather than measuring the candidate’s performance brought
about the need for communicative language testing (Weir 1990). Before
we look at the features which distinguish this form of testing, we will
outline the models of communicative competence on which it is based.
According to Spolsky (1989:140), “Language tests involve measuring a
subject's knowledge of, and proficiency in, the use of a language. A
theory of communicative competence is a theory of the nature of such
knowledge and proficiency. One cannot develop sound language tests
without a method of defining what it means to know a language, for
until you have decided what you are measuring, you cannot claim to
have measured it”. The main implication this model had for
communicative language testing was that since there was a theoretical
distinction between competence and performance, the learner had to be
tested not only on his/her knowledge of language, but also on his/her
ability to put it to use in a communicative situation (Canale and Swain,
1980).
For Shohamy, Donitsa-Schmidt, and Ferman (1996), washback is "the connections between
testing and learning" (p. 298); to Gates (1995), it is "the influence of testing on teaching and
learning" (p. 101); and for Messick (1996) washback is "the extent to which the introduction
and use of a test influences language teachers and learners to do things they would not
otherwise do that promote or inhibit language learning" (p. 241). Clearly, then, washback is,
roughly speaking, the effect of testing on the teaching and learning processes. An example that
often comes up is the effect of the Japanese university entrance examinations on high school
language teaching and learning.
Washback, whether it is positive or negative, can be a potential boon or threat to language
teaching curriculum (broadly defined) because, through washback, a test can steer a curriculum
in one direction or another (in terms of teaching, course content, course characteristics, and/or
class time) either with or against the better judgment of the administrators, teachers, students,
parents, etc.
Thinking about washback can also lead us to think about the consequential basis for test
validity in terms of the social consequences of test use and the values implications of
test interpretations.
(The following section is Javi's contribution, so thank you, Javi.)
6. Testing grammar and vocabulary
6.1. Grammar
Why test grammar?
Necessary --> the skeleton of the language.
Lack of grammatical ability --> an obstacle to performance in the skills.
The format of grammar tests is familiar.
A large number of items can be administered and scored in a short period of time.
There is good cause to include a grammatical component in the achievement,
placement, and diagnostic tests of teaching institutions.
Recommendations
Make items sound as natural as possible
Contextualize items
Be clear about what each item is testing and award points for that only.
Tests
Multiple choice, error correction, rearrangement items, completion items,
transformation items, items involving the changing of words, "broken
sentence" items, pairing and matching, etc.
6.2. Vocabulary
Why test vocabulary?
Knowledge of vocabulary is essential to the development and demonstration
of linguistic skills.
Recommendations
Contextualize.
Decide whether to test active production or passive recognition.
Lexical items can be selected from the syllabus, the textbook, reading
material, or the students' free writing.
Techniques
Recognition: multiple choice, definitions, gap-filling, matching, word
formation, synonyms, sets (association of words).
Production: pictures, definitions, gap-filling, sets (association of words),
synonyms, completion items.
7. Testing skills
7.1. Listening
Recommendations
A listening test usually involves a spoken stimulus.
Materials should be natural and authentic.
Recording quality should be good.
Avoid putting pressure on candidates.
Memorization should be avoided.
Make minimal demands on productive skills.
Texts should be shorter and the questions easier than in reading tests.
Techniques
Multiple choice, short answer, completion, information transfer (labeling
pictures, completing forms, showing routes on a map...), note-taking,
dictation.
7.2. Speaking
Recommendations
Make the test as long as feasible and plan it carefully.
Give candidates as many fresh starts as needed.
Interviewers should be selected and trained (with a second tester
present during the interview).
Set only tasks and topics that candidates can be expected to handle.
The interview should be carried out in a quiet room.
Put candidates at their ease.
Collect enough relevant information.
Don't talk too much.
Formats
Interview, interaction with peers, response to tape recordings.
Elicitation techniques
Questions and requests for information, pictures, role-play, interpreting,
discussion.
Not recommended --> prepared dialogue or reading aloud.
Marking
See page 1: holistic vs. analytic marking.
Note: Neil McLaren says that we should bear in mind the following aspects when evaluating oral
production: word order, vocabulary, pronunciation, fluency, appropriateness of expression, tone,
accuracy, capacity to reason, initiative in asking for information, accent, range of expressions,
flexibility, size.
7.3. Reading
Levels of comprehension
Literal, Inferential, evaluative
Types of texts to use
Textbook, novel, magazine, newspaper, academic journal, letter,
timetable, poem..
Recommendations
Choose texts of appropriate length.
For acceptable reliability, include as many passages as possible.
Choose texts which will interest candidates.
Avoid questions that candidates could answer from general knowledge alone.
Don't choose culturally laden texts.
Don't use texts which candidates have already read.
Don't ask students to write too much.
Formats
Multiple choice, short answer, information transfer with the help of visuals,
completion exercises, identifying the order of events, identifying referents,
guessing meaning from context.
7.4. Writing
Recommendations
Instructions should not be too long.
Set as many tasks as feasible.
Test only writing ability.
Restrict candidates by defining the tasks well.
Avoid over-correction and negative marking.
Look for strengths as well as weaknesses.
Types of exercises
Controlled writing (you set one topic only), guided writing (you give
ideas on what to write about), and free writing.
Marking
Analytic vs. holistic (see page 1).