Language Testing in Theory and Practice
Aprendizaje y Enseñanza de la Lengua Extranjera I
Alfonsi Arcos Rus Página 1
Some warm-up expressions:
Image-in-the-mirror syndrome: teachers' fear of giving exams, because their
teaching is reflected in the test results; a test is feedback on the teacher too.
Evaluation: it takes into account everything in a teaching programme.
Assessment: it takes into account all the competences (all the learner's skills and
knowledge).
Testing: it takes into account the formal knowledge of what the students have
learnt at a particular time.
The main purpose of testing is to obtain feedback on what the students are learning.
Mode: the most common value in a group of numbers.
Mean: the central tendency of a collection of numbers, taken as the sum of
the numbers divided by the size of the collection.
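As a quick illustration, both statistics can be computed directly; a minimal Python sketch with hypothetical candidate scores:

```python
from statistics import mean, mode

# Hypothetical scores obtained by a group of candidates on a test.
scores = [5, 7, 7, 8, 6, 7, 9, 4]

print(mode(scores))  # most common value -> 7
print(mean(scores))  # sum divided by count -> 53 / 8 = 6.625
```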
Evaluation > Assessment > Testing (from broadest to narrowest in scope).
Analytic Scoring: a procedure in which evaluators rate or score the separate parts or traits (dimensions) of an examinee's product or process first, then sum these part scores to obtain a total score. A piece of writing, for example, may be rated separately on ideas and content, organization, voice, choice of words, sentence structure, and use of English mechanics; these separate ratings are then combined into an overall assessment. It is the more objective procedure.
Holistic Scoring: the word "holistic" means looking at the whole rather than at the parts; holistic scoring evaluates essays as complete units rather than as collections of constituent elements. It has three aims: valid, quick and reliable evaluation of student essays. It reflects the general impression you get, and it is better for beginners.
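The summation step of analytic scoring can be sketched in a few lines of Python; the trait names and the 1-5 scale below are illustrative, not a fixed rubric:

```python
# Analytic scoring: rate each trait of a piece of writing separately
# (here on an illustrative 1-5 scale), then sum the part scores.
trait_scores = {
    "ideas_and_content": 4,
    "organization": 3,
    "voice": 4,
    "word_choice": 3,
    "sentence_structure": 5,
    "mechanics": 4,
}

total = sum(trait_scores.values())
print(total)  # overall analytic score -> 23
```

A holistic scorer would instead assign a single overall band directly, from a general impression of the whole essay.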
1. Kinds of Tests:
1.1. Purpose:
1.1.1. Proficiency: a test that tries to measure the level of the student in a
given language, regardless of any previous experience.
1.1.2. Diagnostic: used to diagnose particular language problems, to check
on students' progress in learning particular elements of the course, and
to discover the students' weaknesses. They help learners know whether
they need further teaching. The exam is taken individually, but to
obtain a useful result we need to analyse in depth the results of a large
group, because that way we can see whether many of them have difficulties
or whether they have achieved the knowledge taught.
1.1.3. Achievement: a test intended to evaluate the learners' knowledge
according to a given syllabus. There are two types:
Final achievement: it covers the whole course syllabus.
Progress achievement: it covers the different parts of the
course syllabus.
1.1.4. Aptitude: it takes place before the foreign language course to check the
strengths and weaknesses of students, taking into account factors like
intelligence, age, motivation, memory, phonological sensitivity and
sensitivity to grammatical patterns.
1.1.5. Placement: it gathers information about students' abilities in order to
place them in the class most adequate for their level; that is, it helps to
place students at the stage of the teaching programme most appropriate
to their abilities.
1.2. Frame of Reference:
1.2.1. Norm-referenced: it relates one learner's performance to that of the
other students. Teachers are not told directly what the students are
capable of doing with the language.
1.2.2. Criterion-referenced: students are classified according to whether they
are able to perform certain tasks; what they can actually do in the
language is measured.
1.3. Scoring procedure:
1.3.1. Discrete-point vs. integrative: discrete-point (DP) testing tests one
particular element at a time, whereas integrative testing requires the
candidate to combine many language elements in the completion of a task.
Discrete tests favour objective scoring (for example, of grammatical
structures) and are always indirect. Integrative tests favour subjective
scoring (writing a composition, dictation, making notes while listening
to a lecture) and tend to be direct.
1.3.2. Objective vs. subjective: this distinction concerns the way of questioning.
Objective tests contain closed questions or multiple-choice questions;
they do not require any judgment on the part of the scorer.
Subjective tests contain open questions and need judgment from the
scorer. There are different levels of subjectivity: for example, the
scoring of a composition may be considered more subjective than the
scoring of short answers in response to a reading passage.
1.4. Content:
1.4.1. Direct testing: it is as authentic as possible; candidates have to perform
the very skill to be measured, as in writing, where candidates have to
write a composition. A test is said to be direct when it tests the skill we
wish to measure. It is used to test the productive skills, such as writing
or speaking, but it is not so good for measuring reading or listening,
because understanding is hard to test directly. There is likely to be a
helpful washback effect.
1.4.2. Indirect testing: it measures the abilities which underlie the skills we
are interested in.
Eg.:
• Testing writing: candidates identify mistakes in a text, or underline
matching words in a sample text.
• Testing pronunciation: candidates use pen and paper to match similar
sounds, without speaking.
2. Testing requirements:
2.1. Content validity: The test must be coherent with the teaching itself. You have
to test the contents you have taught in class. For example, if you have taught the
uses of the verb to be, your exam will have to deal with exercises focused on
the verb to be and it has to contain a proper sample of the relevant structures.
In order to know if a test has content validity, we need a specification of the
skills and structures it is meant to cover. The greater a test’s content validity,
the more likely it is to be an accurate measure of what it is supposed to
measure.
2.2. Construct validity: A test is said to have construct validity if it measures just
the ability it is supposed to measure, according to the methodology you have
used when teaching. The word construct refers to any underlying ability which
is hypothesized in a theory of language ability. For example, the ability to read
involves a number of sub-abilities (such as guessing the meaning of unknown
words from the context) and we have to take into account all of them.
2.3. Concurrent or empirical validity: It is established when the test and the
criterion are administered at about the same time. This validity is obtained as a
result of comparing the results of the test with the results of some criterion
measure such as:
An existing test, known or believed to be valid and given at the same
time.
The teacher's ratings or any other such form of independent
assessment given at the same time.
The subsequent performance of testees on a certain task measured by
some valid test.
The teacher’s ratings or any other such form of independent
assessment given later.
2.4. Face validity: a test is said to have face validity when it looks like a test. It is
advisable to show the test to colleagues in order to obtain different viewpoints
on it; this can also reveal mistakes we might not otherwise notice, such as
absurdities or ambiguities. Some language tests may lack face validity
depending on the country in which they are used. Students' motivation is
better maintained if the test has face validity.
Note: of the four criterion measures listed under concurrent validity, the first
two (given at the same time) gauge the test's concurrent validity, while the
last two (given later) gauge its predictive validity.
2.5. Reliability: reliability is a necessary characteristic of any good test: all tests
must first be reliable as measuring instruments. If the test is administered to
the same candidates on different occasions and produces different results,
the test is said to be unreliable; this is commonly known as test/re-test
reliability. If the test is marked by different examiners and they give similar
marks, this is known as mark/re-mark reliability. In order to be reliable, a
test must be consistent in its measurements:
The larger the sample of items, the greater the probability that the test
as a whole is reliable.
One way of checking a test's reliability is to administer the same test
to different groups or at different times, especially for tests of oral
production and listening comprehension.
The instructions must be clear.
One of the most important factors affecting reliability is the scoring of
the test. Objective tests overcome this problem, but reliability in
subjective tests is much harder to demonstrate. To improve it, multiple
marking or rating scales are used, and the candidates' skills are
measured separately: listening, speaking, writing, reading.
2.6. Practicality/Administration: a test must be practicable, which means that it
must be fairly straightforward to administer. We have to take into account the
time spent on administering the test, reading the test instructions, collecting
the answer sheets, etc. Another important factor is where candidates write
their answers: on the test sheet itself or on a separate sheet of paper. A
separate answer sheet may be preferable when testing a large group of
candidates. We must also bear in mind the equipment and facilities of the
centre where the test will take place; e.g. there is no sense in recording voices
or dialogues on tape if there is no cassette player in the centre.
2.7. Discrimination: an important feature of a test is its capacity to discriminate
among the different candidates and to reflect the differences in the performance
of the individuals in the group. The results of the test are examined to determine
the extent to which it discriminates between individuals who are different;
candidates must obtain different scores, otherwise there is no discrimination.
Discrimination is useful for finding out which elements of the teaching syllabus
have been mastered and which have not, and it can also be used to assess
relative abilities and locate areas of difficulty. Full precision is really difficult,
indeed impossible, to obtain; this lack of precision is known as the test's
margin of error.
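One common way to quantify how well a single item discriminates is to compare the proportion of high scorers and low scorers who answered it correctly; a sketch with hypothetical data:

```python
# Item discrimination index for one test item (hypothetical data).
# 1 = answered correctly, 0 = answered incorrectly.
upper_group = [1, 1, 1, 0, 1]  # candidates with the highest total scores
lower_group = [0, 1, 0, 0, 1]  # candidates with the lowest total scores

# Difference of proportions: positive values mean the item separates
# stronger candidates from weaker ones; values near 0 mean no discrimination.
d = sum(upper_group) / len(upper_group) - sum(lower_group) / len(lower_group)
print(d)
```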
3. Testing Methods:
3.1. Multiple-Choice Tests: multiple choice is a form of assessment in which
respondents are asked to select the best possible answer (or answers) from a
list of choices.
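Because each item has one predetermined key, multiple-choice tests can be scored mechanically; a minimal sketch with a hypothetical answer key:

```python
# Objective scoring of a multiple-choice paper against an answer key.
answer_key = {1: "b", 2: "d", 3: "a", 4: "c"}  # hypothetical key
candidate = {1: "b", 2: "c", 3: "a", 4: "c"}   # one candidate's answers

score = sum(1 for item, key in answer_key.items()
            if candidate.get(item) == key)
print(f"{score}/{len(answer_key)}")  # -> 3/4
```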
3.2. Cloze Procedure: Cloze procedure is a technique in which words are deleted
from a passage according to a word-count formula or various other criteria. The
passage is presented to students, who insert words as they read to complete and
construct meaning from the text. This procedure can be used as a diagnostic
reading assessment technique.
3.2.1. Traditional Cloze: Words are deleted at regular intervals, typically
every seventh or eighth word. The more frequent the deletions, the more
difficult the test.
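The deletion rule is mechanical enough to automate; a sketch of a traditional cloze generator in Python (the passage and the interval n=7 are illustrative):

```python
# Traditional cloze: replace every n-th word of a passage with a blank.
def make_cloze(text, n=7, blank="____"):
    words = text.split()
    return " ".join(blank if i % n == 0 else word
                    for i, word in enumerate(words, start=1))

passage = ("The quick brown fox jumps over the lazy dog while the "
           "patient cat watches quietly from the garden wall")
print(make_cloze(passage))
```

Lowering n deletes words more often and so produces a harder test.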
3.2.2. Modified Cloze: words are deleted selectively, depending on what the
testers want to test. All the deletions may test the same language point
(e.g. past forms) or they may test different but specific language points
that the testers are concerned with.
3.2.3. Multiple-Choice Cloze: An easier version of either traditional or
modified cloze where the learner is offered choices from which to select
his/her answers.
3.2.4. Authentic Cloze: A version of traditional cloze in which the tester
simply cuts a number of letters off from the beginning or end of each line,
making the text look as if it has been clipped in a photocopier. Again, the
more letters that are cut off, the more difficult the test.
3.2.5. C-Testing: The most recent variation in which the second half of every
second word is deleted (words with odd numbers of letters have the extra
letter supplied). Giving the first half of the word makes the test easier, but
deleting every second word restores the difficulty.
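The C-test rule (delete the second half of every second word, keeping the extra letter when the word length is odd) can likewise be sketched; the sample sentence is illustrative:

```python
import math

# C-test: truncate every second word to its first half; words with an
# odd number of letters keep the extra letter (first ceil(len/2) letters).
def make_c_test(text):
    out = []
    for i, word in enumerate(text.split(), start=1):
        if i % 2 == 0:
            keep = math.ceil(len(word) / 2)
            out.append(word[:keep] + "_" * (len(word) - keep))
        else:
            out.append(word)
    return " ".join(out)

print(make_c_test("language testing should measure real ability"))
# -> language test___ should meas___ real abil___
```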
3.3. Guessing from context: candidates have to guess the meaning of unknown
words from the context.
4. Designing classroom tests: construction of a classroom test:
4.1. Planning stage: (see the slides)
4.1.1. Learners’ need of a test.
4.1.2. Specification and sampling.
4.1.3. Construct and content validity, reliability, practicality.
4.2. Development stage:
4.2.1. Item construction (Input + Method).
Compilation of inputs or texts.
Methods.
Channel.
Strategy.
Levels of difficulty.
Avoidance of overlapping, tricky questions and ambiguity.
4.2.2. Instructions.
4.2.3. Design layout/ format.
4.2.4. Consideration of scoring.
4.3. Control and operational stage:
4.3.1. Administration of the test.
4.3.2. Performance of statistical tests.
4.3.3. Washback (pedagogical effects).
A test will influence teaching and learning.
A test will influence what/how teachers teach and what/how learners
learn.
A test will influence the RATE and SEQUENCE of teaching/
learning.
A test will influence the DEGREE and DEPTH of teaching/learning.
A test will influence the ATTITUDES to the content, method, etc. of
teaching/ learning.
Tests that have important consequences will have washback; those
that don't, won't.
Tests will have washback on ALL teachers and learners.
4.3.4. Presentation of test to students (correction).
5. Approaches to language testing:
Spolsky (1975) identified three periods of language testing: the pre-scientific, the
psychometric-structuralist and the psycholinguistic-sociolinguistic.
The pre-scientific period: testing in the pre-scientific era did
not rely on linguistic theory, and reliability was considered less important
than the production of a test that "felt fair".
The psychometric-structuralist period: the name was intended
to reflect the joint contribution of the structural linguists, who identified
the elements of language they wanted to test, and the psychometricians,
who produced objective and reliable methods of testing the candidates'
control of those elements.
The psycholinguistic-sociolinguistic period: by the 1970s discrete-point
testing was no longer felt to provide a sufficient measure of
language ability, and testing moved into the psycholinguistic-sociolinguistic
era, with the advent of global integrative testing. Oller
(1979, cited in Weir 1990) argued that global integrative testing, such as
cloze tests, which required candidates to insert suitable words into gaps
in a text, and dictation, provided a closer measure of the ability to
combine language skills as they are combined in actual language use
than discrete-point testing did.
The Communicative Period: The fact that discrete point and
integrative testing only provided a measure of the candidate’s
competence rather than measuring the candidate’s performance brought
about the need for communicative language testing (Weir 1990). Before
we look at the features which distinguish this form of testing, we will
outline the models of communicative competence on which it is based.
According to Spolsky (1989:140), “Language tests involve measuring a
subject's knowledge of, and proficiency in, the use of a language. A
theory of communicative competence is a theory of the nature of such
knowledge and proficiency. One cannot develop sound language tests
without a method of defining what it means to know a language, for
until you have decided what you are measuring, you cannot claim to
have measured it”. The main implication this model had for
communicative language testing was that since there was a theoretical
distinction between competence and performance, the learner had to be
tested not only on his/her knowledge of language, but also on his/her
ability to put it to use in a communicative situation (Canale and Swain,
1980).
For Shohamy, Donitsa-Schmidt, and Ferman (1996), washback is "the connections between
testing and learning" (p. 298); to Gates (1995), it is "the influence of testing on teaching and
learning" (p. 101); and for Messick (1996) washback is "the extent to which the introduction
and use of a test influences language teachers and learners to do things they would not
otherwise do that promote or inhibit language learning" (p. 241). Clearly, then, washback is,
roughly speaking, the effect of testing on the teaching and learning processes. An example that
often comes up is the effect of the Japanese university entrance examinations on high school
language teaching and learning.
Washback, whether it is positive or negative, can be a potential boon or threat to language
teaching curriculum (broadly defined) because, through washback, a test can steer a curriculum
in one direction or another (in terms of teaching, course content, course characteristics, and/or
class time) either with or against the better judgment of the administrators, teachers, students,
parents, etc.
Thinking about washback can also lead us to think about the consequential basis for test
validity in terms of the social consequences of test use and the values implications of
test interpretations.
(The following section is Javi's contribution, so thank you, Javi.)
6. Testing grammar and vocabulary
6.1. Grammar
Why test grammar?
Necessary --> the skeleton of the language.
Lack of grammatical ability --> an obstacle to performance in the skills.
The format of grammar tests is familiar.
A large number of items can be administered and scored in a short period of time.
There is good cause to include a grammatical component in the achievement,
placement, and diagnostic tests of teaching institutions.
Recommendations
Make items sound as natural as possible
Contextualize items
Be clear about what each item is testing and award points for that only.
Tests
Multiple choice, error correction, rearrangement items, completion items,
transformation items, items involving the changing of words, "broken
sentence" items, pairing and matching, etc.
6.2. Vocabulary
Why test vocabulary?
Knowledge of vocabulary is essential to the development and demonstration
of linguistic skills.
Recommendations
Contextualize.
Decide whether to test active production or passive recognition.
Lexical items can be selected from the syllabus, the textbook, reading
material, or the students' free writing.
Techniques
Recognition: multiple choice, definitions, gap-filling, matching, word
formation, synonyms, sets (association of words).
Production: pictures, definitions, gap-filling, sets (association of words),
synonyms, completion items.
7. Testing skills
7.1. Listening
Recommendations
A listening test usually involves a spoken stimulus.
Materials should be natural and authentic.
Recording quality should be good.
Avoid putting pressure on candidates.
Memorization should be avoided.
Make minimal demands on productive skills.
Texts should be shorter and the questions easier than in reading tests.
Techniques
Multiple choice, short answer, completion, information transfer (labeling
pictures, completing forms, showing routes on a map...), note-taking,
dictation.
7.2. Speaking
Recommendations
Make the test as long as feasible and plan it carefully.
Give candidates as many fresh starts as needed.
Interviewers should be selected and trained (with a second tester
present during the interview).
Set only tasks and topics that candidates can be expected to handle.
The interview should be carried out in a quiet room.
Put candidates at their ease.
Collect enough relevant information.
Don't talk too much.
Formats
Interview, interaction with peers, response to tape recordings.
Elicitation techniques
Questions and requests for information, pictures, role-play, interpreting,
discussion.
Not recommended --> prepared dialogue or reading aloud.
Marking
See page 1: holistic vs. analytic marking.
Note: Neil McLaren says that we should bear in mind the following aspects when evaluating oral
production: word order, vocabulary, pronunciation, fluency, appropriateness of expression, tone,
accuracy, capacity to reason, initiative in asking for information, accent, range of expressions,
flexibility, size.
7.3. Reading
Levels of comprehension
Literal, Inferential, evaluative
Types of texts to use
Textbook, novel, magazine, newspaper, academic journal, letter,
timetable, poem..
Recommendations
Choose texts of appropriate length.
For acceptable reliability, include as many passages as possible.
Choose texts which will interest candidates.
Avoid questions that candidates could answer from general knowledge alone.
Don't choose culturally laden texts.
Don't use texts which candidates have already read.
Don't ask students to write too much.
Formats
Multiple choice, short answer, information transfer with the help of visuals,
completion exercises, identifying the order of events, identifying referents,
guessing meaning from context.
7.4. Writing
Recommendations
Instructions should not be too long.
Set as many tasks as feasible.
Test only writing ability.
Restrict candidates by defining the tasks well.
Avoid over-correction and negative marking.
Look for strengths as well as weaknesses.
Types of exercises
Controlled writing (you set one topic only), guided writing (you give
ideas on what to write about), and free writing.
Marking
Analytic vs. holistic (see page 1).