
Joint Committee on Intercollegiate Examinations

Examiners’ Manual


This manual has been produced on behalf of the Joint Committee on Intercollegiate Examinations by the Centre for Medical Education, University of Dundee, Tay Park House, 484 Perth Road, Dundee DD2 1LR, Scotland UK.

© Centre for Medical Education

ISBN-10: 1-871749-08-5; ISBN-13: 978-1-871749-08-3

Authors

Margery H Davis
Professor of Medical Education, Director of the Centre for Medical Education, University of Dundee, Scotland, UK

Gominda Ponnamperuma
Lecturer in Medical Education, Faculty of Medicine, University of Colombo, Sri Lanka

Sean McAleer
Senior Lecturer in Medical Education, Centre for Medical Education, University of Dundee, Scotland, UK

David Rowley
Professor of Orthopaedic Surgery, University of Dundee, Scotland, UK. Director of Education, Royal College of Surgeons of Edinburgh, Scotland, UK

Desktop Publishing
Lynn E Bell


Contents

Foreword .. 5
Introduction .. 6
About this manual .. 7
Abbreviations .. 8

I: Assessment background
Overview of assessment .. 9
General principles for developing an assessment system .. 11
Assessment blueprinting .. 15
Miller’s pyramid .. 16
Bloom’s taxonomy .. 17
Utility of an assessment instrument .. 18

II: Assessment instruments – the background theory
SBA (Single Best Answer)
SBAs: assessing at ‘knows’ level .. 19
The utility of SBAs .. 20
EMI (Extended Matching Items)
EMIs: assessing at ‘knows how’ level .. 21
The utility of EMIs .. 22
The structured, standardised oral examination
The structured, standardised oral exam .. 23
The utility of the structured, standardised oral exam .. 24
The structured clinical examination
The structured clinical exam: assessing at ‘shows how’ level .. 25
The utility of the structured clinical exam .. 26

III: Assessing the trainee: how to do it
How to do it: SBAs
Blueprinting SBAs .. 27
Developing SBAs at different levels .. 28
How to do it: EMIs
Blueprinting EMIs .. 32
Developing EMIs .. 33
How to do it: The structured, standardised oral examination
Blueprinting the structured, standardised oral exam .. 35
Developing structured, standardised oral exam questions .. 36
Implementing the structured, standardised oral exam: the exam format .. 37
Marking the structured, standardised oral exam .. 38


How to do it: The structured clinical examination
Blueprinting structured clinical exams .. 39
Developing structured clinical exam questions .. 40
Implementing the structured clinical exam: the exam format .. 41
Marking the structured clinical exam .. 42
General issues related to how to do it: item banking .. 43

IV: Standard setting
Standards: an overview .. 45
Conjunctive and compensatory standards .. 46
Different methods of standard setting .. 47
Modified Angoff method of standard setting .. 49

V: Guidance notes for examiners
Key points for examiners .. 53

VI: Glossary .. 55


The Intercollegiate Specialty Fellowship Examination occupies a key role in surgical training. For the candidate, a successful outcome signifies possession of the knowledge and skills, and the ability to use these at the standard of a Consultant in the National Health Service in the UK; surmounting this major career threshold brings immense personal satisfaction. For patients and the public, success in the examination must carry assurance that the surgeon has indeed achieved this nationally recognised high quality of clinical performance. Conversely, failure carries a high price for the candidate and for society.

It is, therefore, imperative that the conduct of the examination accords with the highest standards, in which a major factor is the contribution made by the specialty examiners. There have already been substantial improvements in recent years in the approach to selection and training of examiners and in guiding their approach during examinations. This new manual will result in further enhancement. Candidates, irrespective of the outcome, will be able to feel even more sure that their assessment has been extremely rigorous, fair and valid. The public can have confidence that every effort is being made to ensure that those responsible for certification of surgical specialists are discharging their duties at the highest level. Finally, examiners will find it helpful in bearing what is, I know from personal experience, a very substantial sense of responsibility.

It is a pleasure, on behalf of the Senate of Surgery and its constituent Colleges and Specialty Associations, to congratulate the authors of the manual and recommend it with enthusiasm to current and future examiners – and perhaps it will also be found interesting by future candidates.

Professor Sir G M Teasdale
Chairman, Senate of Surgery, and
President, Royal College of Physicians and Surgeons of Glasgow

Foreword
A message from the Chairman of the Senate of Surgery


Introduction

To be appointed as an examiner for the Intercollegiate Specialty Boards should rightly be seen as a real privilege. Members of the panels of examiners in the nine surgical disciplines are, however, faced with some challenging responsibilities. Some of the practices used in our surgical examinations have been marked by traditions endured by successive generations of trainees. As practice has evolved, so elements of style have changed. As clinicians we have been rather less than sensitive to modern medical education thinking. Just as we would never think of ignoring evidence in our clinical practice, so it is vital that we are aware of the latest evidence in the realm of assessment and apply that in as practical a fashion as possible to bring our examination system up to date.

It is clear that we also need to cope with a culture of enhanced accountability and responsibility. We need to provide an examination system which is set at the appropriate standard and which is fair, reproducible and valid. While such an end result will be difficult to deliver, the Boards have set out on a major plan to re-develop the intercollegiate specialty examinations and bring them into step with best educational practice, while at the same time seeking to retain the distinctive elements which the profession and the public recognise as being so important. In so doing, it is clear that the range of principles and standards set out by the Postgraduate Medical Education and Training Board can and will be achieved.

This manual has been prepared to allow every exam panel member access to the evidence behind the new examination structure and to outline how the examinations will work in practical terms. It has been written by a team who are not only expert in assessment methods in medical education but who also have an in-depth understanding of the intercollegiate examination system. It will be evident that the ability to examine, whether in the construction of a written test item or in the conduct of a dynamic exchange based on a scenario or a real-life clinical problem, involves a series of skills which require careful thought, preparation and development. The section on crafting written items, both single best answer [SBA] items and extended matching questions [EMQ], ably illustrates both the numerous pitfalls and the difficulties of constructing good questions which allow the exploration of the advanced thought processes which apply to the solving of clinical problems. The conduct of the oral and clinical sections of the examination has undergone significant change. It has been necessary to introduce proper blueprinting, structure and careful standard setting. The value and importance of the clinical examination has been emphasised by examiners and lay representatives alike, but the traditional methods of sampling the curriculum and marking with defined accuracy have been improved. The importance of probing the higher cognitive processes of candidates has been emphasised. There are excellent sections in the manual which explain and provide practical examples of these and other related issues.

Assessment in surgical training will involve more formal workplace based judgements in the future. It will be clear that one purpose of the intercollegiate examinations is to provide a measure of national quality assurance for those completing surgical training programmes in the UK. Others who take these examinations can test themselves at the same level. In every branch of surgery it makes sense to rely on the judgment of senior specialists who are not only recognised experts in their clinical field but who are also trained examiners. This is a trustworthy and sensible policy for the future. Inevitably as the analysis of our examination system continues there will be the opportunity for the assessment methods to evolve further and improve the quality of the exercise. The manual will therefore be a dynamic document and has been designed to accommodate further developments in due course.

Mr David Galloway MD FRCS
Chairman, Joint Committee on Intercollegiate Examinations

An introduction from the Chairman of the JCIE


This manual has been designed specifically for examiners in the ISB examinations and was developed to help you carry out your role as an examiner.

There are six sections.

1. Some of the basic assessment principles underpinning the assessment system.

2. What is known about the exam formats used in the assessment system.

3. How the exams are developed.

4. How the pass/fail mark is determined for exams.

5. Documentation to explain ISB exam policy and procedures to be followed by examiners.

6. A glossary of educational and assessment terminology to help you with unfamiliar jargon.

Additional background papers will be sent to you from time to time when changes are made to the exams.

We welcome your comments on this publication regarding any additional material you would find useful, aspects that may require clarification or other suggestions/comments you may wish to make.

We hope the manual is helpful and that you enjoy reading it.

About this manual


CCST Certificate of completion of specialist training

CEX Clinical evaluation exercise

CST Certificate of specialist training

EMI Extended matching item (a type of MCQ)

GMC General Medical Council

ISB Intercollegiate Specialty Board

JCIE Joint Committee on Intercollegiate Examinations

MCQ Multiple choice question

MEQ Modified essay question

OSCE Objective structured clinical examination

PACES Practical assessment of clinical examination skills

PMETB Postgraduate Medical Education and Training Board

RITA Record of in-training assessment

SAMSS Structured assessment of minor surgical skills

SBA Single best answer

Abbreviations


There are six questions that need to be answered before developing an assessment scheme.

1. Why?  2. When?  3. What?  4. How?  5. By whom?  6. Where?

1. Why? There are different reasons why assessments are held: to certify attainment of the course outcomes (end-of-course assessment); to validate membership of professional bodies; to inform career selection; to demonstrate readiness to progress to the next stage of training, having met the required standard; to identify weak trainees; to award qualifications; to give feedback to trainees, trainers and other stakeholders; to determine the pre-knowledge of the learners (pre-test); to rank the learners to award prizes; as part of quality assurance; to map trainee progress; to assure the public that the trainee is ready for unsupervised practice; to drive learning; to identify deficiencies in the curriculum.

The main purposes of the examinations conducted by the Intercollegiate Specialty Boards are to assess: knowledge and judgment, clinical diagnosis, management and treatment planning, and to help validate professionalism. They provide a measure of quality assurance for the training programmes and form one of the criteria for admission to Fellowship of the Colleges.

2. When? Assessment can take place at any point during training; i.e. at the end of training (summative assessment or exit exam), middle of the course, throughout the course (continuous assessment), beginning of the course (pre-test).

The Intercollegiate Specialty Board exams are intended to indicate when the trainee is ready for unsupervised practice and should be held towards the end of training. The RITA process should indicate when the trainee is ready to sit the ISB exam.

3. What? “Assessment is a moral activity. What we choose to assess... shows quite starkly what we value” (Knight, 1995).

What is assessed needs to be synchronized with what has been taught and learnt; i.e. the curriculum. The curriculum should comprise knowledge, skills, attitudes and professionalism. The trend is to use curriculum outcomes as the framework for planning the curriculum content and assessment.

4. How? No one examination will provide the examiners with all the information they need about the candidates. An examiner’s toolkit is needed with multiple exams in the kit.

Overview of assessment
A description of the six important questions that need to be addressed before developing an assessment system


5. By whom? Assessment is a matter of expert judgement. Experts in the context of the ISB exams are the consultant surgeons who are experienced both in their own field of surgical practice, and in training and assessing surgical trainees.

6. Where? Traditionally exams were held in College premises and in the wards. Now, there is increasing emphasis on what the trainee does in the wards in real life. This has highlighted work-based assessment.

Knight, P. (ed) (1995). Assessment for learning in higher education. Kogan Page, London.

Take home message: All examiners should know the answers to these six questions before taking part in an exam.


General principles for developing an assessment system
An outline of the guiding principles when developing an assessment system

PMETB developed nine principles to guide the developers of postgraduate assessment (Southgate & Grant, 2004).

Principle 1 The assessment system must be fit for a range of purposes.

• The overall purpose of the assessment system must be documented and in the public domain.

• The purposes of each and all components of the assessment system must be specified and available to the trainees, educators, employers, professional bodies including the regulatory bodies, and the public.

• The sequence of assessments must match the progression through the career pathway.

• Individual assessments within the system should add unique information and build on previous assessments.

The examples of purposes for assessment listed below are presented in the order that a trainee might encounter them. Position in the list does not imply importance,¹ nor is the list comprehensive:

1. to inform career selection and choice

2. to confirm suitability of choice at an early stage of chosen career path

3. to demonstrate readiness to progress to the next stage of training having met the required standard

4. to provide feedback to the trainee about progress and learning needs

5. to support trainees to progress at their own pace by measuring progress in achieving competencies for chosen career path

6. to identify trainees who should change direction or leave medicine

7. to enable the trainee to collect all necessary evidence for revalidation

8. to assure the public that the trainee is ready for unsupervised practice

9. to provide evidence for the award of a CCST

10. to drive learning

11. to gain membership or fellowship of a medical Royal College or specialist association/society.

Principle 2 The content of the assessment will be based on curricula for postgraduate training which themselves are referenced to all of the areas of Good Medical Practice.

• The programme will be based on an overall specification of the content for all component assessments, in order to avoid overlap or gaps.

¹ Sometimes purposes for assessment are not made explicit. Two examples are included here which can cause difficulties if all parties are not informed of the intention to use outcomes for these purposes.

1. to monitor the effectiveness of the training programmes
2. to identify the best people for posts where there is competition


² The blueprint for an assessment specifies the content from which the assessment sample is drawn. In postgraduate medical training it usually comprises a matrix with one dimension broadly based on Good Medical Practice, and the other on the clinical problems that a trainee should be able to manage, at that stage of training.

• Assessments will together systematically sample the entire content, appropriate to the stage of training, with reference to the common and important clinical problems that the trainee will encounter in the workplace and to the wider base of knowledge, skills and attitudes that doctors require.

• The blueprint² from which assessments in the workplace or national examinations are drawn will be available to trainees and educators in addition to assessors/examiners.

Principle 3 The methods used within the programme will be selected in the light of the purpose and content of that component of the assessment framework.

• Methods will be chosen on the basis of validity, reliability, feasibility, cost effectiveness, opportunities for feedback, and impact on learning.

• The rationale for the choice of each assessment method will be documented and evidence-based.

Large scale competence tests (e.g. MRCP, MRCGP, MRCPsych).

• Approaches to the development and piloting of test items/clinical skills assessments for national tests of competence will be documented and available for external quality assurance. Studies to establish the validity of new methods will be undertaken.

• Systematic data collection will support the routine reporting of the reliability of tests of competence in high stakes pass/fail examinations. These statistics will be in the public domain.

Work-based assessments (e.g. direct observation of consulting, 360 degree assessment, and case based discussions).

• Must be subject to reliability and validity measures.

• Evidence must be collected and documented systematically.

• Evidence must be judged against pre-determined published criteria.

• The weight placed on different sources of evidence must be determined by the blueprint and the quality of the evidence.

• The synthesis of the evidence and the process of judging it must be made explicit.

Methods for workplace based assessment

For example:
• Systematic observation of clinical practice
  - Direct observation
  - Video
• Judgements of multiple assessors
• Consulting with simulated patients
• Case record review: including OPD letters
• Case based discussions
• Oral presentations
• 360 degree peer assessment
• Patient surveys
• Audit projects
• Critical incident review


³ Sometimes it is appropriate to provide no feedback other than the test result. If this is a policy decision then reasons should be stated.

Principle 4 The methods used to set standards for classification of trainees’ performance/competence must be transparent and in the public domain.

• Standards in tests of competence such as national Royal College examinations, will be set using recognised methods based on test content and the judgments of competent assessors.

• Where the purpose of the test is to provide a pass/fail decision, information from the performance of reference groups of peers should inform, but not determine, the standard.

• The precision of the pass/fail decision must be reported on the basis of data about the test. The purpose of the test must determine how the error around the pass/fail level affects decisions about borderline candidates.

• Reasons for choosing either pass/fail or rank ordering should be described.

• Standards for determining successful completion of training to CST level should be explicit.

Principle 5 Assessments must provide relevant feedback.

• The policy and process for providing feedback to trainees following assessments must be documented and in the public domain.³

• The form of feedback must match the purpose of the assessment.

• Outcomes from assessments must be used to provide feedback on the effectiveness of education and training where consent from all interested parties has been given.

Principle 6 Assessors/examiners will be recruited against criteria for performing the tasks they undertake.

• The roles of assessors/examiners will be specifi ed and used as the basis for recruitment and appointment.

• Assessors or examiners must demonstrate their ability to undertake the role.

• Assessors/examiners should only assess in areas where they have competence.

• The relevant professional experience of assessors should be greater than that of candidates being assessed.

• Equality and diversity training will be a core component of any assessor/examiner training programme.

Principle 7 There will be Lay input in the development of assessment.

• Lay opinion will be sought in relation to appropriate aspects of the development, implementation and use of assessments for classification of candidates.

• Lay people may act as assessors/examiners for areas of competence they are capable of assessing.


Principle 8 Documentation will be standardised and accessible nationally.

• Documentation will record the results and consequences of assessments and the trainee’s progress through the assessment system.

• Information will be recorded in a form that allows disclosure and appropriate access, within the confines of data protection.

• Uniform documentation will be suitable not only for recording progress through the assessment system but also for submission for purposes of registration and performance review.

• Documentation should provide evidence for revalidation and compliance with Good Medical Practice (GMC, 2001).

• Documentation should be transferable and accessible as the trainee moves location.

• Documentation should be comprehensive and accessible both to the trainee and to those responsible for training.

Principle 9 There will be resources sufficient to support assessment.

• Resources will be made available for the proper training of assessors.

• Resources and expertise will be made available to develop and implement appropriate assessment methods.

• Resources will support the assessment of trainees at national and local levels.

• Appropriate infrastructure at national, deanery and Trust levels will support assessment.

General Medical Council (2001). Good Medical Practice. GMC, London.
Southgate, L. & Grant, J. (2004). Principles for an assessment system for postgraduate medical training. Postgraduate Medical Education and Training Board (PMETB), UK. Available at: http://www.pmetb.org.uk/media/pdf/5/e/principles_1.pdf (accessed on 26 January 2006)

Take home message: These principles may be used to develop a system of postgraduate assessment in the UK context or to quality assure/validate an existing assessment system.


The assessment blueprint (Newble et al. 1994) is a tool used to plan exams. The blueprint confirms that the exam tests a representative sample of all the appropriate curriculum outcomes and a representative sample of all the curriculum content. The assessment blueprint is a grid that plots the curriculum/assessment outcomes in columns against the curriculum content in rows.

Here is an example of an assessment blueprint for one of the general surgical exams.

The blueprint shows groups of conditions in the syllabus that will be included in the exam (rows) and the range of outcomes that are being assessed (columns).

The complex nature of assessment in the healthcare professions and the need for high validity and reliability, make the assessment blueprint an essential tool for examination planning (Crossley et al., 2002). The blueprint is a way of establishing the content validity of the examination (Schuwirth & van der Vleuten, 2004). The appropriateness of the sample ensures content validity, while the adequacy of the size of the sample ensures reliability.

[Blueprint grid (example): the rows list groups of conditions for each content area, e.g. Upper GI conditions broken down into: Tumours: benign; Tumours: malignant; Congenital; Acquired; Traumatic; Autoimmune; Hormonal; Neurological; Infective; Inflammatory; Vascular; Metabolic; Degenerative. The columns list the outcomes being assessed: ethical surgical decision making; presentations; preop assessment; investigations; consent; medicolegal; non-operative management; anatomy; physiology/anaesthetic; critical care; diathermy/sutures; sterilisation; infection control/bacteriology; radiology/nuclear medicine; biochemistry; haematological; immunological; pathology/genetics; pharmacology; surgical approaches; surgical procedures; pre-operation problems; management of complications; breaking bad news/palliative care; postop management; evidence based medicine; trials; audits; multidisciplinary team.]

Assessment blueprinting
A procedure to ensure that the assessment samples all the curriculum outcomes and content

Take home message: A tick in a box in the assessment blueprint indicates that there is an assessment item related to the outcome and content represented by that box. Assessment developers should ensure that each row and each column of the blueprint has at least one tick. The number of ticks per row and column depends on the importance attached to individual content and outcomes.
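To make the coverage check concrete, here is a minimal sketch (not an official ISB procedure) of how a blueprint could be represented as a grid and audited so that every row and column carries at least one tick. The row and column names are taken from the general surgery example above; the tick positions are invented purely for illustration.

# A hypothetical blueprint audit: rows are content areas, columns are outcomes.
# Tick positions below are illustrative only, not a real ISB blueprint.

rows = ["Tumours: benign", "Tumours: malignant", "Congenital", "Traumatic"]
columns = ["preop assessment", "investigations", "consent", "surgical procedures"]

# Each tick marks a (content row, outcome column) pair that has at least one item.
ticks = {
    ("Tumours: benign", "investigations"),
    ("Tumours: malignant", "surgical procedures"),
    ("Congenital", "preop assessment"),
}

def uncovered(rows, columns, ticks):
    """Return the rows and columns that have no tick at all."""
    empty_rows = [r for r in rows if not any((r, c) in ticks for c in columns)]
    empty_cols = [c for c in columns if not any((r, c) in ticks for r in rows)]
    return empty_rows, empty_cols

empty_rows, empty_cols = uncovered(rows, columns, ticks)
print("Rows with no assessment item:", empty_rows)      # ['Traumatic']
print("Columns with no assessment item:", empty_cols)   # ['consent']

Any row or column reported here would need additional items before the paper meets the take home message above.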

Crossley, J., Humphris, G. & Jolly, B. (2002). Assessing health professionals. Medical Education, 36, pp. 800-804.
Newble, D., Dawson, B., Dauphinee, D., Gordon, P., Macdonald, M., Swanson, D., Mulholland, H., Thomson, A. & van der Vleuten, C. (1994). Guidelines for assessing clinical competence. Teaching and Learning in Medicine, 6 (3), pp. 213-220.
Schuwirth, L.W.T. & van der Vleuten, C.P.M. (2004). Changing education, changing assessment, changing research? Medical Education, 38, pp. 805-812.

An assessment blueprint from general surgery


Miller (1990) introduced an important framework that can be presented as four tiers/levels of a pyramid to categorise the different levels at which trainees need to be assessed in postgraduate education.

Miller’s pyramid of assessment (after Miller, 1990)

Miller emphasised that all four levels – knows, knows how, shows how and does – need to be assessed to obtain a comprehensive understanding of a trainee’s ability.

Knows Knowledge or information that the candidate has learned

Knows how Application of knowledge to medically relevant situations

Shows how Simulated demonstration of skills in an examination situation

Does Behaviour in real-life situations

All of the above levels relate to the roles of a doctor. This is why Miller’s pyramid has been so useful in organising/understanding different assessment methods and their place in an assessment system (Schuwirth & van der Vleuten, 2004).

A few examples of the assessments at each level are:

Knows One-from-five MCQs (SBAs), structured essay questions

Knows how EMIs, patient management problems

Shows how OSCE, long case, short case, PACES

Does Rating scales, tutor reports
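As a quick reference, the level-to-instrument examples above can be captured in a simple lookup table. The minimal sketch below is a convenience only, using the examples given in this manual; it is not an exhaustive or prescriptive mapping.

# Miller's pyramid levels mapped to example assessment formats
# (examples as listed in this manual; a convenience lookup, not a rule).
MILLER_EXAMPLES = {
    "knows": ["one-from-five MCQs (SBAs)", "structured essay questions"],
    "knows how": ["EMIs", "patient management problems"],
    "shows how": ["OSCE", "long case", "short case", "PACES"],
    "does": ["rating scales", "tutor reports"],
}

def formats_for(level: str) -> list:
    """Look up example assessment formats for a Miller level."""
    return MILLER_EXAMPLES.get(level.lower(), [])

print(formats_for("shows how"))  # ['OSCE', 'long case', 'short case', 'PACES']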

Miller’s pyramid
A framework that identifies the different levels at which trainees need to be assessed

Miller, G. (1990). The assessment of clinical skills/competence/performance. Academic Medicine, 65 (Suppl.), pp. S63-S67.
Schuwirth, L.W.T. & van der Vleuten, C.P.M. (2004). Changing education, changing assessment, changing research? Medical Education, 38, pp. 805-812.

[Figure: Miller’s pyramid. The four levels from base to apex are Knows, Knows how, Shows how and Does; moving up the pyramid, the emphasis shifts from cognition to behaviour, i.e. towards greater professional authenticity.]

Take home message: A postgraduate assessment system should be able to assess trainees at all four levels.


Bloom’s taxonomy
A classification system that helps frame questions at different cognitive levels

Bloom and colleagues (1956) described six categories in the cognitive domain. These are:

1. Knowledge recall  2. Comprehension or understanding  3. Application  4. Analysis  5. Synthesis  6. Evaluation.

Bloom’s taxonomy is a hierarchical classification, with the lowest cognitive level being ‘knowledge recall’ and the highest, ‘evaluation of knowledge’. The lower levels can be attained with superficial learning, such as memorisation, but the upper levels involve higher order thinking and can only be attained by deep learning. How we pose exam questions determines the cognitive level that we are testing.

Below are the six cognitive levels and some key verbs that can be used in questions pitched at each level.

Using all six levels can be unnecessarily complex, so they can be telescoped as follows.

A. Recall-comprehension of knowledge; i.e. reproducing and understanding

B. Application-analysis; i.e. making use of knowledge

C. Synthesis-evaluation; i.e. doing different things with knowledge and making use of judgement

Bloom’s taxonomy and Miller’s pyramid
While the relationship is complex, it is sometimes useful to think of the bottom level of Miller’s pyramid (‘knows’) as equating to ‘knowledge and understanding’ in the telescoped version of Bloom’s taxonomy (level A). Application, analysis, synthesis and evaluation (Bloom’s levels 3 to 6, or B and C in the telescoped version) fall into the ‘knows how’ level of Miller’s pyramid.

Knowledge: define, list, recall, name
Comprehension: explain, describe, express, locate, review
Application: interpret, apply, employ, use, organise
Analysis: distinguish, analyse, differentiate, compare, contrast, categorise
Synthesis: plan, compose, design, formulate, construct, create, set up, manage, prepare
Evaluation: judge, appraise, evaluate, rate, value, revise, score, select, choose, assess, estimate
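One practical use of these verb lists is as a rough screen on draft questions: the leading verb of a question stem gives a first indication of the cognitive level being targeted. The sketch below is purely illustrative; the verb lists are those above, but the screening rule is an assumption for the example, not an ISB procedure.

# Rough screen: infer the Bloom level a draft question appears to target
# from its leading verb. Illustrative only; verb lists are from the table above.

BLOOM_VERBS = {
    "Knowledge": {"define", "list", "recall", "name"},
    "Comprehension": {"explain", "describe", "express", "locate", "review"},
    "Application": {"interpret", "apply", "employ", "use", "organise"},
    "Analysis": {"distinguish", "analyse", "differentiate", "compare", "contrast", "categorise"},
    "Synthesis": {"plan", "compose", "design", "formulate", "construct", "create", "manage", "prepare"},
    "Evaluation": {"judge", "appraise", "evaluate", "rate", "value", "revise", "score", "select", "choose", "assess", "estimate"},
}

def apparent_level(question: str) -> str:
    """Return the Bloom level suggested by the question's first word, if any."""
    first_word = question.strip().split()[0].lower().rstrip(":,")
    for level, verbs in BLOOM_VERBS.items():
        if first_word in verbs:
            return level
    return "unclassified"

print(apparent_level("List the branches of the coeliac axis."))        # Knowledge
print(apparent_level("Compare open and laparoscopic repair options.")) # Analysis

A question whose stem opens with a recall verb may still probe higher levels, so such a screen only flags items for human review; it does not replace examiner judgement.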

Bloom, B.S., Englehart, M.D., Furst, E.J., Hill, W.H. & Krathwohl, D.R. (1956). A taxonomy of educational objectives: handbook I: cognitive domain. David McKay, New York.

Take home message: Best assessment practice is to assess the trainee at all six of Bloom’s levels.


Utility of an assessment instrument

Van der Vleuten (1996) in a landmark article produced a model for the utility of an assessment instrument.

We have slightly modified his model and will use it to analyse different assessment instruments.

Utility = (R) x (V) x (A) x (E) x (C) x (P)

R reliability. Can the exam results of a given candidate, in a given context be reproduced? To what extent can we trust the results?

V validity. Does the assessment assess what it purports to assess?

A acceptability. How comfortable are the different stakeholders (candidates, examiners, Intercollegiate Specialty Boards, public, National Health Service) with the examination system?

E educational impact. Does the exam drive the trainees towards educationally and professionally valuable training?

C cost/cost effectiveness. Is the expenditure – in terms of money, time and manpower – to develop, run and sustain the examination process, worthwhile in relation to what is learned about the candidate?

P practicability. How ‘doable or workable’ is the assessment instrument, given the circumstances? Are there sufficient resources to mount the exam?

Key points about this model:

1. Some of the factors are, by nature, unquantifiable; e.g. validity, acceptability, educational impact. Therefore, the model should be used only in the context of understanding and explaining the interplay of these factors.

2. The utility depends on the context and the importance of the exam. A one-off, high stakes clinical examination must have high reliability. In a clinical assessment system that has multiple exams, with different examiners over a long period of time, it is acceptable for some of the exams to be less reliable than others.

3. The utility is expressed as the product of all the factors. If any one factor becomes zero, the utility of that assessment instrument will be zero, as the sketch below illustrates. This underscores the importance of the contribution of all the factors.
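A minimal numerical sketch of the multiplicative model follows. The factor scores are invented for illustration (the manual treats these factors qualitatively, so the numbers carry no official meaning); the point is simply that a single zero factor drives the whole product to zero.

from math import prod

# Illustrative factor scores on a 0-1 scale (invented values, not ISB ratings):
# reliability, validity, acceptability, educational impact, cost-effectiveness, practicability.
factors = {"R": 0.8, "V": 0.9, "A": 0.7, "E": 0.6, "C": 0.8, "P": 0.9}

def utility(scores: dict) -> float:
    """Utility = R x V x A x E x C x P (the multiplicative model used in this manual)."""
    return prod(scores.values())

print(round(utility(factors), 3))   # 0.218 (every factor contributes)

factors["A"] = 0.0                  # an exam none of the stakeholders will accept...
print(utility(factors))             # 0.0: the utility collapses to zero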

A model for analysing the interplay of different factors that contribute to the utility of an assessment instrument

Van der Vleuten, C.P.M. (1996). The assessment of professional competence: developments, research and practical implications. Advances in Health Sciences Education, 1, 1, pp. 41-67.

Take home message: The utility of an assessment instrument has to be considered before it is used. What has to be borne in mind is that assessment utility is always a trade-off among these six factors.


The level of ‘knows’ in Miller’s pyramid (Miller, 1990) assesses the trainees’ knowledge. This level can be adequately assessed by written tests.

Written tests are of two types (Hambleton, 1996).

1. Selected response tests – the trainee chooses an answer from a number of possible responses/options. Examples include true-false items, MCQs (i.e. one-from-five) and EMIs. Single best answer (SBA) is the name the ISB has decided to use for what are usually called one-from-five MCQs.

2. Constructed response tests – the trainee formulates an answer in response to the question as opposed to selecting the appropriate answer from a set of options. Examples include short answer (free response) questions, essay questions.

The advantages of SBAs:

SBAs can assess a wide sample of curriculum content within a relatively short period of time. This leads to high reliability and improved validity.

They are a highly standardised form of assessment where all the trainees are assessed with the same questions. It is a fair assessment in that all the trainees sit the same exam.

They are easy to administer and mark.

SBA marking is mostly automated and hence examiner subjectivity is removed from the assessment process.

The main disadvantages of SBAs:

The trainee’s reasons for selecting a particular option/response cannot be assessed.

Although a wide sample of assessment material can be assessed, the assessment does not provide an opportunity for an in-depth assessment of the content.

Constructing good SBAs needs considerable examiner training (Case & Swanson, 2001).

SBAs: assessing at ‘knows’ level
The advantages and disadvantages of the SBA question format – the one-from-five MCQ

Case, S.M. & Swanson, D.B. (2001). Constructing written test questions for basic and clinical sciences. 3rd edn. National Board of Medical Examiners (NBME), Philadelphia, USA. Available at: http://www.nbme.org/PDF/ItemWriting_2003/2003IWGwhole.pdf (accessed on 26 January 2006).
Hambleton, R.K. (1996). Advances in assessment models, methods and practices, in: Berliner, D.C. & Calfee, R.C. (Eds.) Handbook of educational psychology. Simon & Schuster Macmillan, New York. pp. 899-925.
Miller, G. (1990). The assessment of clinical skills/competence/performance. Academic Medicine, 65 (Suppl.), pp. S63-S67.

Take home message: There are different tests for assessing knowledge. SBAs are a reliable way of assessing knowledge cheaply and easily.



The utility model introduced on page 18 is used to explore the features of SBAs.

Reliability: high (desirable)
The SBA results are highly reliable as almost identical scores can be obtained if the same student or a student with similar ability is given the same set of SBAs, irrespective of who marks the questions.

Validity: high for knowledge recall (desirable)
If the purpose of the exam is to test factual recall of knowledge, then a SBA test can be valid. SBAs can also be used to test application of knowledge and higher order thinking, although the construction of such SBAs is difficult and requires training. SBAs can thus be used to assess the bottom two levels of Miller’s pyramid: ‘knows’ and ‘knows how’.

Acceptability: high (desirable)
SBAs have been used extensively in medical education during the past few decades. Both trainees and examiners have come to accept them. Constructing good SBAs, however, is difficult. As a result, many exams contain badly constructed SBAs, which are difficult to answer and may be responsible for the unpopularity of this type of assessment.

Educational impact: (moderately desirable)
Properly constructed SBAs will drive the learner towards learning important information. However, SBAs developed to test trivial knowledge will lead to rote learning. Fragmentation of knowledge is another criticism.

Cost: low (desirable)
The cost of administering a SBA test and maintaining a SBA bank is low, once the start-up costs have been absorbed. The initial construction of the SBAs by a group of examiners may be costly in terms of examiner time. The initial set-up costs for item banking and marking software are high.

Practicability: high (desirable)
SBAs are easy to administer in a standard examination hall or as a computer-based assessment.
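Where the text describes SBA results as highly reliable, reliability is normally estimated from candidates’ item-level scores. The sketch below computes KR-20, a standard internal-consistency coefficient for dichotomously scored items; it is offered as a generic illustration with made-up data, not as the ISB’s own analysis method.

# KR-20 internal-consistency estimate for dichotomously scored SBA items.
# The response matrix below is invented purely for illustration.

def kr20(scores):
    """scores: list of candidates, each a list of 0/1 item scores."""
    n_items = len(scores[0])
    n_cand = len(scores)
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n_cand
    var_total = sum((t - mean_total) ** 2 for t in totals) / (n_cand - 1)
    pq_sum = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in scores) / n_cand   # proportion correct on item i
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)

scores = [            # 6 candidates x 5 items (0 = wrong, 1 = correct)
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
]
print(round(kr20(scores), 2))   # about 0.67 for this toy data set

In a real paper the coefficient rises as more items (wider sampling) are added, which is the mechanism behind the high reliability claimed for SBA papers.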

The utility of SBAs

Take home message: SBAs have high utility for assessing knowledge.

An analysis of the utility of SBAs


The level of ‘knows how’ in Miller’s pyramid (Miller, 1990) assesses the trainee’s ability to apply knowledge to medically relevant situations.

EMIs, one-from-five SBAs, problem sets and patient management problems are some of the assessment instruments that can be used to assess this level.

EMIs offer one of the best assessment formats to assess the ‘knows how’ level (Case & Swanson, 2001), although they can also be used to assess at the ‘knows’ level of Miller’s pyramid.
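For orientation, an EMI is usually built from a theme, a single option list shared by several clinical vignettes (stems), and a lead-in instruction, and it is marked automatically against a key. The sketch below shows one way such an item could be represented and scored; the theme, options, vignettes and keys are invented for illustration and are not drawn from any ISB question bank.

# A hypothetical EMI represented as data, with automated marking.
# All content is invented for illustration only.

emi = {
    "theme": "Causes of postoperative fever",
    "lead_in": "For each patient below, select the single most likely cause from the option list.",
    "options": ["A. Atelectasis", "B. Wound infection", "C. Urinary tract infection",
                "D. Deep vein thrombosis", "E. Anastomotic leak", "F. Drug reaction",
                "G. Transfusion reaction", "H. Intra-abdominal abscess"],
    "stems": [
        {"vignette": "Day 1 after laparotomy, mild pyrexia, reduced air entry at both bases.", "key": "A"},
        {"vignette": "Day 6 after colectomy, swinging pyrexia and localised abdominal tenderness.", "key": "H"},
    ],
}

def mark_emi(emi, answers):
    """answers: list of option letters, one per stem; returns the number correct."""
    return sum(1 for stem, given in zip(emi["stems"], answers) if given == stem["key"])

print(mark_emi(emi, ["A", "E"]))  # 1 of 2 correct

Because one long option list serves every vignette, adding stems broadens sampling cheaply while the large number of options keeps random guessing unrewarding.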

The main advantages of EMIs:

EMIs provide a convenient method for assessing application of knowledge.

They can assess a range of curriculum content in a relatively short time.

They are high in reliability.

All the trainees are assessed identically; hence examiner subjectivity is removed.

EMIs are easy to administer, score, mark and store.

EMIs are designed to assess clinical diagnostic skills and thinking processes relevant to clinical practice.

Due to the large number of options, the likelihood of randomly guessing the correct answer is reduced.

The main disadvantages of EMIs:

There is less opportunity for the examiner to assess the trainee in depth on a given content area.

Designing EMIs is time consuming (Jolly & Grant, 1997).

Examiner training is necessary to construct acceptable EMIs (Case & Swanson, 2001).

Consensus on the correct/preferred answer may be difficult to achieve, especially in ‘choice of management’ questions (Jolly & Grant, 1997).

EMIs: assessing at ‘knows how’ level

The advantages and disadvantages of EMIs

Case, S.M. & Swanson, D.B. (2001). Constructing written test questions for basic and clinical sciences. 3rd edn. National Board of Medical Examiners (NBME), Philadelphia, USA. Available at: http://www.nbme.org/PDF/ItemWriting_2003/2003IWGwhole.pdf (accessed on 26 January 2006).
Jolly, B. & Grant, J. (eds) (1997). Multiple choice questions – extended matching items. Part 3, Ch. 2, in: The Good Assessment Guide. Joint Centre for Education in Medicine, London.
Miller, G. (1990). The assessment of clinical skills/competence/performance. Academic Medicine, 65 (Suppl.), pp. S63-S67.

Take home message: EMIs have been specifically designed to test application of medical knowledge.



The utility of EMIs

The utility model introduced on page 18 is used to explore the features of EMIs.

Reliability: high (desirable)
As EMIs are a form of SBAs, the reliability is likely to be high.

Validity: high for application of knowledge (desirable)
The EMI is one of the best tools to assess clinical reasoning and application of knowledge.

Acceptability: high (desirable)
Though the acceptability data is sparse, EMIs, as with SBAs, should be acceptable to both trainees and examiners, as an efficient and fair assessment.

Educational impact: high (desirable)
The use of EMIs should emphasise the importance of clinical judgement and reasoning for the trainees. EMIs are less likely to fragment knowledge as they are mostly based on clinical vignettes.

Cost: low (desirable)
Similar to SBAs.

Practicability: high (desirable)
Similar to SBAs.

Take home message: EMIs have high utility for assessing application of knowledge, clinical reasoning and judgement.

An analysis of the utility of EMIs


The structured, standardised oral exam

The traditional viva voce has numerous validity and reliability drawbacks. If, however, the oral exam questions are: based on patient care (i.e. clinical scenarios); pitched to elucidate the trainee’s higher order thinking (i.e. utilisation of knowledge for decision making, interpretation, judgement); and focused on the trainees’ quality of responses (i.e. the level of difficulty of the questions answered; the clarity, focus and confidence displayed in answering; and the degree of prompting needed), such exams can be used to assess how the trainee works through a clinical problem.

Best practice recommends: structuring the oral exam on clinical scenarios; having multiple orals; ensuring that an adequate number of examiners assess a single candidate; using rating rubrics with descriptors for scoring candidate answers; and recruiting trained examiners (Davis & Karunathilake, 2005).
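To illustrate how several of these recommendations fit together, here is a minimal sketch of scoring a candidate across multiple scenario-based orals, each marked independently by more than one examiner against a descriptor-anchored rubric. The rubric wording, the 1-4 scale, the scenario names and the scores are invented for illustration; they are not the ISB rubric or marking scheme.

# Hypothetical descriptor-anchored rubric and aggregation across multiple orals
# and multiple examiners. All descriptors and scores are invented for illustration.

RUBRIC = {
    4: "Answers difficult questions clearly and confidently with no prompting",
    3: "Sound answers with occasional prompting",
    2: "Superficial answers, frequent prompting needed",
    1: "Unable to work through the clinical problem",
}

# scores[oral][examiner] -> rubric score awarded for that scenario-based oral
scores = {
    "oral 1 (trauma scenario)": {"examiner A": 3, "examiner B": 4},
    "oral 2 (elective scenario)": {"examiner C": 3, "examiner D": 3},
    "oral 3 (critical care scenario)": {"examiner E": 2, "examiner F": 3},
}

def candidate_average(scores):
    """Average the examiner scores within each oral, then average across orals."""
    per_oral = [sum(ex.values()) / len(ex) for ex in scores.values()]
    return sum(per_oral) / len(per_oral)

print(round(candidate_average(scores), 2))  # 3.0

Spreading the judgement across several scenarios and several examiners, rather than one long conversation with one pair, is what lifts the reliability of the structured oral above the traditional viva.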

The advantages of the structured, standardised oral exam:

It is a face-to-face exam. It can thus be used to assess aspects of trainees that other exams may fail to assess, such as quality of responses.

It is a flexible exam; i.e. the examiner can choose from a pool of predetermined questions to ask an easier or more difficult question, depending on the candidate’s response to the earlier question.

Oral exams can be used to assess the candidate’s cognitive abilities related to clinical practice, such as problem solving and decision making.

Oral exams may capture certain important examinee traits which other exams may fail to assess; e.g. fitness-to-practise, worthiness for recognition as senior clinicians, professionalism.

The disadvantages of the structured, standardised oral exam:

Meticulous planning is required to ensure the exam is structured according to the examination blueprint.

Oral exams require a large number of examiners to maintain reliability.

The examiners should be pre-trained to apply the same standards to each candidate using pre-validated rating scale descriptors.

The organisation and administration of an oral exam is costly and time consuming.

It has been shown that oral exams may be biased against some candidates; e.g. certain ethnic groups.

Oral exams tend to assess certain candidate attributes which are not intended to be assessed; e.g. examinee style of speaking.

Oral exams can be threatening and stressful to the candidate.

An updated version of the viva voce

Take home message: With thoughtful planning, the structured, standardised oral exam can be developed into a more valid and reliable form of assessment than the traditional viva.

Davis, M.H. & Karunathilake, I. (2005). The place of the oral examination in today’s assessment systems. Medical Teacher, 27(4), pp. 294-297.



The utility model introduced on page 18 is used to explore the features of the structured, standardised oral exam.

Reliability: can be high, but only if proper guidelines are followed
Studies need to be carried out to calculate reliability using generalisability theory (Streiner & Norman, 2003).

Validity: high (desirable)
If the clinical scenarios for the oral exam are chosen based on an assessment blueprint, and if the oral questions are properly constructed to assess ‘higher order thinking’, the validity of the exam should be high.

Acceptability: equivocal
The acceptability of the structured, standardised oral exam among ISB examiners has not been studied, but examiners may resent the loss of autonomy that results from the imposed structure. Among viva voce candidates high stress levels have been reported (Jayawickramarajah, 1985; Arndt et al., 1986).

Educational impact: high
If pre-validated clinical scenarios and pre-validated questions probing higher order thinking skills are used, the exam can focus the trainees on sound clinical practice.

Cost: moderate
Running an oral examination is costly in terms of planning, examiner time and infrastructure, but is less costly than exams that additionally require the presence of patients.

Practicability: difficult but workable
Though considerable resources are needed to conduct a valid and reliable structured, standardised oral examination, the difficulties are surmountable.

The utility of the structured, standardised oral examination
An analysis of the utility of the structured, standardised oral examination

Arndt, C.B., Guly, U.M. & McManus, I.C. (1986). Pre-clinical anxiety: the stress associated with viva voce examination. Medical Education, 20, pp. 274-280.
Jayawickramarajah, P.T. (1985). Oral examinations in medical education. Medical Education, 19, pp. 290-293.
Streiner, D.L. & Norman, G.R. (2003). Generalisability theory. Ch. 9 in: Health measurement scales: a practical guide to their development and use. 3rd edn. Oxford University Press, Oxford.

Take home message: The structured, standardised oral examination is an updated version of the traditional viva voce designed to increase its utility.


Assessment at the ‘shows how’ level takes place under examination conditions, using real, standardised or simulated patients. The most commonly used assessment at this level is the OSCE or one of its many variations; e.g. SAMSS - Structured Assessment of Minor Surgical Skills. Other assessment instruments that can be used are: the long case; the short case; and the structured clinical.

The long case
The trainee spends a fixed period of time with a real patient to take a clinical history and carry out the physical examination. Different candidates see different patients, which may introduce unfairness into the exam related to the level of difficulty presented by individual patients. The examiners later assess the trainee with oral questions based on the patient that s/he has examined. Usually, different examiners assess different trainees and the examination may be unstructured; therefore, the reliability may not be high.

The short case
In the short case one or two examiners directly observe the trainee carrying out a particular task; e.g. eliciting a physical sign or examining an organ system. However, not all candidates may be assessed with the same patients, again introducing an element of unfairness.

The structured clinical examination
To overcome the difficulties associated with reliability and fairness in relation to the long and short case, the structured clinical examination is being introduced.

It comprises 12 short cases that are grouped in bays and lasts one hour. At each bay, the trainee has to carry out a specific skill/task and a pair of examiners observes and scores each trainee’s clinical skills, using a global rating scale. The assessment material may vary from a short clinical scenario, radiographs or investigation reports to a real or simulated patient for eliciting clinical signs.

Some sub-specialties additionally include a long case. Two examiners observe the candidate carrying out a comprehensive history taking and physical examination and then assess the candidate on history taking, physical examination, patient management, decision making and other aspects depending on the surgical specialty. Long cases are not currently used in General Surgery, Cardio-thoracic Surgery, Otolaryngology and Urology examinations.

Since all the trainees are assessed using the same assessment material, under identical conditions (e.g. same or similar standardised patients, and marked by similar/same examiners using the same rating scale), the structured clinical exam is likely to achieve higher reliability than the traditional long and short cases.
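As a practical illustration of the format described above, the sketch below lays out a simple rotation in which every candidate passes through the same 12 bays within the hour, each bay staffed by a fixed pair of examiners. The five-minute bay length and the cyclic rotation are assumptions made for the example (the manual specifies only 12 short cases in one hour), not the published ISB timetable.

# Hypothetical rotation through 12 bays in one hour (assumed 5 minutes per bay).
# Bay count comes from the text; timings and the rotation scheme are illustrative.

BAYS = [f"Bay {i + 1}" for i in range(12)]       # each bay = one short case with a pair of examiners
CANDIDATES = ["Cand A", "Cand B", "Cand C"]      # up to 12 candidates can rotate at once
BAY_MINUTES = 5                                  # assumption: 12 x 5 min = 60 min

def rotation(candidates, bays):
    """For each time slot, list which candidate is at which bay (a simple cyclic shift)."""
    timetable = []
    for slot in range(len(bays)):
        start = slot * BAY_MINUTES
        assignments = {c: bays[(i + slot) % len(bays)] for i, c in enumerate(candidates)}
        timetable.append((f"{start:02d}-{start + BAY_MINUTES:02d} min", assignments))
    return timetable

for time, who_is_where in rotation(CANDIDATES, BAYS)[:3]:   # show the first three slots
    print(time, who_is_where)

Because every candidate works through the same bays, marked by the same examiner pairs on the same rating scale, the assessment material is identical for all, which is the basis of the reliability argument above.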

The structured clinical exam: assessing at ‘shows how’ level
An assessment instrument that can be used to test competence

Take home message: The structured clinical exam has the advantage of wide sampling of the curriculum with the examiners directly observing the candidate.


Reliability: high (desirable)Due to its structured exam format (i.e. all candidates go through equivalent exam forms and are assessed by many examiners, directly observing the candidate, on a wide range of clinical tasks, using a global rating scale anchored by descriptors), the structured clinical exam is likely to have greater reliability than the long and short case formats.

Validity: high for assessing competence of a skill (desirable)If used to test trainee competence in a range of curriculum outcomes, the structured clinical examination is a valid assessment tool. However, since it is conducted under simulated examination conditions, it does not provide valid information on the trainee’s ability to perform the skill in real-life situations.

Acceptability: high (desirable)Mainly due to its structured nature that allows every trainee to be tested under the same conditions, the structured clinical is likely to be highly acceptable.

Educational impact: high (desirable)The structured clinical assessment requires the examiners to directly observe the trainee carrying out clinical skills. Thus, this examination will drive the trainees towards perfecting clinical skills. Since the exam material represents a wide sample of the curriculum and a wide range of skills can be assessed, trainees cannot risk ignoring certain clinical skills.

Cost: high, but worth it (moderately desirable)
The resources needed to conduct a structured clinical exam are: human resources (e.g. administrative staff, examiners and real or standardised patients); time for planning, setting up and piloting the exam; financial resources; physical resources (e.g. assessment material such as radiographs, manikins); and space (i.e. an empty out-patient department or ward). The cost, however, is justified by the returns in terms of a reliable and valid assessment.

Practicability: sustainable (desirable)
Organising the examination involves considerable effort: selecting patients; giving them the necessary instructions; arranging the exam bays with all the equipment set up properly; distributing scoring sheets and collecting them at the end of the exam; briefing the required number of trainees for each run of the exam; planning for contingencies (e.g. one of the examiners and/or patients not turning up); and piloting the exam before administration. The exam may have to be conducted at two or more sites at the same time to cope with candidate numbers. Studies on exams with a similar structure (e.g. OSCE) have shown, however, that the logistics are achievable (Reznick et al., 1992).

The utility of the structured clinical examination
An analysis of the utility of the structured clinical examination

The structured clinical exam is likely to be a reliable, valid, acceptable, cost-effective and a practicable way to assess clinical ability. Its educational impact is desirable as it emphasises the importance of clinical abilities.

take home message

Reznick, R., Smee, S., Rothman, A., Chalmers, A., Swanson, D., Dufresne, L., Lacombe, G., Baumber, J., Poldre, P., Levasseur, L., et al. (1992). An objective structured clinical examination for the licentiate: report of the pilot project of the Medical Council of Canada. Academic Medicine, 67 (8), pp. 487-494.


How to do it: Blueprinting SBAs
Choosing assessment material to develop SBAs

SBAs are specifically used by the Intercollegiate Specialty Board (ISB) as a tool for testing basic science knowledge; i.e. at the level of 'knows' in Miller's pyramid.

Therefore, the columns of the SBA assessment blueprint should comprise basic sciences; e.g. anatomy, physiology, biochemistry, haematology, immunology, pathology/genetics, pharmacology and other relevant basic sciences.

The rows should comprise the curriculum material covering the basic sciences.

Example: Paediatric Surgery MCQ Exam Blueprint

Rows – System / Age / Condition / Problem (includes systematic and problem-based types): Head and neck; Respiratory / thoracic; Hepatobiliary; Kidneys and bladder; Genitalia; Skin and subcutaneous; Musculoskeletal (incl. abdominal wall); Miscellaneous; Multisystem (incl. trauma); Ethics and legal issues; Instrumental / sutures.

Columns – basic sciences: Anatomy; Embryology; Genetics; Physiology; Biochemistry; Pathology; Haematology; Microbiology; Imaging; Epidemiology; Biophysics.

The assessment blueprint for SBAs should be a grid meshing basic sciences (columns) versus curriculum content (rows).

take home message
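For examiners who find it helpful to see the grid idea concretely, here is a minimal sketch (not part of the ISB process) of a blueprint held as a simple Python structure, with a check for cells that have no questions yet. The row and column names are just a small selection from the example above.

# Minimal sketch of an SBA blueprint grid: rows = curriculum content, columns = basic sciences.
# Names are illustrative only; counts would come from the real item bank.

rows = ["Head and neck", "Respiratory / thoracic", "Hepatobiliary"]
columns = ["Anatomy", "Physiology", "Pathology"]

# blueprint[row][column] = number of SBAs written for that cell
blueprint = {row: {col: 0 for col in columns} for row in rows}

def add_item(row: str, col: str) -> None:
    """Record one SBA written against a blueprint cell."""
    blueprint[row][col] += 1

def uncovered() -> list[tuple[str, str]]:
    """Return the cells that still have no SBA, so gaps in sampling are visible."""
    return [(r, c) for r in rows for c in columns if blueprint[r][c] == 0]

add_item("Head and neck", "Anatomy")
add_item("Hepatobiliary", "Pathology")
print(uncovered())   # every remaining (row, column) pair still needs at least one SBA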


How to do it: Developing SBAs at different levels

A guide to developing SBAs using Miller's pyramid and Bloom's taxonomy

SBAs and Miller's pyramid
Whilst not designed for this purpose, SBAs can be used to assess at the level of 'knows'. Such a SBA consists of a:

• lead-in

• list of five options containing a correct answer and four incorrect, but seemingly possible answers. These incorrect answers are called 'distractors'.

Components of a SBA testing at the level of 'knows' in Miller's pyramid
Lead-in: Which is the most common variety of hernia through the abdominal wall?
Options:
A. Femoral (distractor)
B. Incisional (distractor)
C. Inguinal (correct answer)
D. Umbilical (distractor)
E. Ventral (distractor)
Source: adapted from 'Multiple choice questions on lecture notes on general surgery' (Fleming & Stokes, 1980).

The SBA can also be used to assess at the level of 'knows how'. Such SBAs will consist of a:

• stem/clinical scenario

• lead-in

• list of five options, comprising one correct answer and four 'distractors' (Case & Swanson, 2001).

Components of a SBA testing at the level of 'knows how' in Miller's pyramid
Stem/case: During an operation, the arterial PCO2 and pH of an anaesthetised patient are monitored. The patient is being ventilated by a mechanical respirator, and the initial values are normal (PCO2 = 40 mm Hg; pH = 7.42).
Lead-in: If the ventilation is decreased, which of the following is most likely to occur?
Options:
A. arterial PCO2 decrease; pH decrease (distractor)
B. arterial PCO2 decrease; pH increase (distractor)
C. arterial PCO2 decrease; pH no change (distractor)
D. arterial PCO2 increase; pH increase (distractor)
E. arterial PCO2 increase; pH decrease (correct answer)
Source: adapted from 'Constructing written test questions for the basic and clinical sciences' (Case & Swanson, 2001).


Shapes of SBA: a SBA for assessing at the level of 'knows' comprises a lead-in followed by options A-E; a SBA for assessing at the level of 'knows how' comprises a stem, then a lead-in, then options A-E.

SBAs and Bloom’s taxonomy: keywords to use at each levelThe following guide provides keywords that can be used to construct SBAs at different levels in the Bloom’s taxonomy: (A) recall-comprehension; (B) application-analysis; and (C) synthesis evaluation (page 17).

The key words to use at Bloom’s level A:

• Recall: reproduces previously learned material by recalling facts, terms, basic concepts and answers.

Keywords: who, what, when, omit, where, which, find, how, define, label, show, spell, list, match, name, tell, recall.

e.g. What are the steps in……? List the causes of……...

• Comprehension: demonstrating understanding of facts and ideas by organising, translating, interpreting, giving descriptions and stating main ideas.

Keywords: demonstrate, interpret, explain, extend, illustrate, outline, relate, rephrase, translate, summarize, discuss, describe, locate, review, express.

e.g. Discuss the causes of………. Explain the pathophysiology of………

An example of a level A SBA: In the shoulder, which of the following forms part of the rotator cuff of muscles?
A. Deltoid
B. Pectoralis minor
C. Subclavius
D. Teres major
E. Teres minor

The key words to use at Bloom’s level B:

• Application: solving problems by applying acquired knowledge, facts, techniques and rules in a different way.

Keywords: apply, interview, make use of, employ, organise, experiment with, utilise, model, identify, recognise, solve, adopt.

e.g. Provide a differential diagnosis. What are the causes relevant to this particular case?

• Analysis: examining and breaking information into parts by identifying motives or causes; making comparisons and finding evidence to support generalisations.

Keywords: analyse, categorise, classify, compare, contrast, discover, dissect, divide, examine, inspect, simplify, take part in, test for, distinguish, make distinctions, extract theme(s), note relationships, motives or functions, make assumptions, draw conclusions, differentiate.

e.g. How will your differential diagnosis be altered in the light of investigation findings?


An example of a level B SBA: A 60-year-old man has become progressively less able to abduct his shoulder. Passive movements and active rotation are possible. What muscle is most likely to have a tear?
A. Deltoid
B. Infraspinatus
C. Subscapularis
D. Supraspinatus
E. Teres minor

The key words to use at Bloom’s level C

• Synthesis: compiling information together in a different way by combining elements in a new pattern or proposing alternative solutions.

Keywords: build, choose, combine, compose, construct, create, design, develop, formulate, imagine, invent, make up, originate, plan, modify, change, improve, adapt, minimise, theorise, elaborate, improve, change, setup, prepare.

e.g. What will be your plan of management?

• Evaluation: presenting and defining opinions by making judgments about information, validity of ideas or quality of work based on a set of criteria.

Keywords: assess, award, choose, select, conclude, criticise, decide, defend, determine, dispute, evaluate, judge, justify, measure, mark, rate, recommend, rule on, appraise, prioritise, provide opinion, support, state the importance, set criteria, prove, disprove, influence, perceive, value, estimate, score, influence, deduct.

e.g. Justify your management of this patient.

An example of a level C SBA: An active 40-year-old man has a short history of pain and loss of full shoulder function due to a 4 cm longitudinal rotator cuff tear visualised by MRI scan. He is keen to return to work as soon as possible.
What is the most appropriate management?
A. Arthroscopic rotator cuff repair
B. Bankart repair
C. Open acromioplasty
D. Open rotator cuff repair
E. Physiotherapy and shoulder exercises

The table below provides some dos and don'ts when constructing SBAs.

Bloom, B.S., Englehart, M.D., Furst, E.J., Hill, W.H. & Krathwohl, D.R. (1956). A taxonomy of educational objectives: handbook I: cognitive domain. David McKay, New York.
Case, S.M. & Swanson, D.B. (2001). Constructing written test questions for basic and clinical sciences. 3rd edn. National Board of Medical Examiners (NBME), Philadelphia, USA. Available at: http://www.nbme.org/PDF/ItemWriting_2003/2003IWGwhole.pdf (accessed on 26 January 2006).
Fleming, P.R. & Stokes, J.F. (1980). Multiple choice questions on lecture notes on general surgery. 2nd edn. Blackwell Scientific Publications, Oxford, UK.

When planning the exam, decide what proportion of SBAs will be used at each cognitive level.

take home message


SBA writing – Dos and Don'ts

Before writing
• Select outcome(s) or important concept(s) that the SBA should assess
• Identify the cognitive level at which the SBA should be pitched; e.g. factual recall, application of knowledge or evaluation
• Decide on the topic and content area
• Do not assess trivial, insignificant facts

Writing the stem
• Select a common clinical case
• Include as much information as required to arrive at the correct answer; i.e. a long stem (in contrast to the options, which should be short)

Writing the lead-in
• Clearly indicate how to answer the SBA
• Write in the question format
• Refer back to the topic and content area when constructing the lead-in
• Try to present a task to the candidate; e.g. what is the diagnosis?
• Avoid phrases (e.g. 'Regarding epilepsy:'); use questions instead
• Avoid technical item flaws, such as: absolute terms (e.g. always, never); frequency terms (e.g. rarely); 'Which of the following statements is correct?' (this type of lead-in may lead to heterogeneous options); negative questions

Writing the options
• Develop an option list with only one clear answer
• Construct distractors that are clearly incorrect, but plausible
• Write options that are short and uncomplicated
• Write homogeneous options; i.e. like needs to be compared with like; e.g. all options being clinical signs
• List in a logical order; e.g. alphabetical order
• Position the correct option at different places in the option list for different SBAs
• Construct options of similar length
• Use coherent, consistent terminology; e.g. terms such as 'pathognomonic', 'typical' or 'recognised feature', if used, should be defined in the instructions to candidates
• Avoid technical item flaws related to test-wiseness: grammatical cues; logical cues; absolute terms; a long correct answer; word repeats; convergence strategy
• Avoid technical item flaws related to irrelevant difficulty: inconsistent numerical data; vague terms (e.g. may); overlapping options; double options (e.g. do X and Y); language not parallel to the other options; 'none of the above/all of the above'; an answer that is 'hinged' to another SBA
Examples of the above flaws are given in Case & Swanson (2001), pages 19-26.

Checking the stem and lead-in
• Give enough information to answer the SBA, without/before reading the options
• Be clear, precise and simple
• Do not synthesise for the candidate; i.e. give details of the patient's complaint in simple language
• Avoid technical item flaws, such as: a word in the stem repeated in the option(s); tricky/complicated stems; clues to the answer in the stem
• Do not include the question (task for the candidate) in the stem
• Do not create a 'test within a test'; e.g. do not require the candidate to interpret the investigation results to answer a SBA on patient management – data interpretation and patient management should be tested separately

After writing: checks
1 Does the SBA assess an important concept?
2 Does the SBA test factual recall of knowledge, application or evaluation?
3 Can the SBA be answered by reading only the stem and lead-in?
4 Are all options homogeneous?
5 Is the SBA (stem, lead-in and options) devoid of technical item flaws?
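Many of the 'don'ts' above are judgements only an examiner can make, but a few of the technical flaws can be screened for mechanically. The sketch below is a minimal, illustrative filter (not an ISB tool, and the thresholds are assumptions): it flags absolute terms, 'all/none of the above' options, a conspicuously long correct answer and a non-alphabetical option list.

# Minimal sketch: screening a drafted SBA for a few mechanically detectable flaws.
# Illustrative only - most flaws in the table still need an examiner's judgement.

ABSOLUTE_TERMS = ("always", "never", "all of the above", "none of the above")

def screen_sba(stem: str, lead_in: str, options: list[str], correct_index: int) -> list[str]:
    flags = []
    text = (stem + " " + lead_in).lower()
    if any(term in text for term in ABSOLUTE_TERMS[:2]):
        flags.append("absolute term in stem/lead-in")
    if any(opt.lower() in ABSOLUTE_TERMS[2:] for opt in options):
        flags.append("'all/none of the above' used as an option")
    if options != sorted(options, key=str.lower):
        flags.append("options not in alphabetical order")
    longest = max(options, key=len)
    average_length = sum(map(len, options)) / len(options)
    if options[correct_index] == longest and len(longest) > 1.5 * average_length:
        flags.append("correct answer conspicuously longer than the distractors")
    return flags

options = ["Femoral", "Incisional", "Inguinal", "Umbilical", "Ventral"]
print(screen_sba("", "Which is the most common variety of hernia through the abdominal wall?",
                 options, correct_index=2))   # -> [] (no flags raised)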


How to do it: Blueprinting EMIs
Choosing assessment material to develop EMIs

EMIs are used by the ISB exams to assess clinical aspects at the level of 'knows how' in Miller's pyramid and Bloom's levels 3 to 6 (application, analysis, synthesis and evaluation).

They can, however, also be used to test factual recall of knowledge and understanding (the first two levels of Bloom's taxonomy) and the 'knows' level of Miller's pyramid.

The columns of the assessment blueprint are outcomes related to the clinical sciences, such as patient management and investigation. The rows of the blueprint comprise the topics in the curriculum.

EMI blueprints for each sub-specialty will be circulated as they become available.

The assessment blueprint for EMIs should be a grid meshing clinical sciences (columns) versus curriculum content (rows).

take home message


How to do it: Developing EMIs
A guide to developing EMIs to test at the 'knows how' level of Miller's pyramid

An EMI consists of a:

1 theme

2 long list of options

3 lead-in statement

4 number of (at least two) item stems or case scenarios.

Components of an EMI
Theme: Abdominal pain
Option list:
A. Abdominal aneurysm
B. Appendicitis
C. Bowel obstruction
D. Cholecystitis
E. Colon cancer
F. Constipation
G. Diverticulitis
H. Ectopic pregnancy – ruptured
I. Endometriosis
J. Hernia
K. Kidney stone
L. Mesenteric adenitis
M. Mesenteric artery thrombosis
N. Ovarian cyst – ruptured
O. Pancreatitis
P. Pelvic inflammatory disease
Q. Peptic ulcer disease
R. Perforated peptic ulcer
S. Pyelonephritis
T. Torsion
Lead-in: For each patient with abdominal pain, select the most likely diagnosis.
Stems / scenarios:
1. A 25-year-old woman has sudden onset of persistent right lower abdominal pain that is increasing in severity. She has nausea without vomiting. She had a normal bowel movement just before onset of pain. Examination shows exquisite deep tenderness to palpation in right lower abdomen with guarding but no rebound; bowel sounds are present. Pelvic examination shows a 7-cm, exquisitely tender right-sided mass. Haematocrit is 32%. Leukocyte count is 18,999/mm3. Serum amylase activity is within normal limits. Test of the stool for occult blood is negative.
2. An 84-year-old man in a nursing home has increasing poorly localised lower abdominal pain recurring every 3-4 hours over the past 3 days. He has no nausea or vomiting; the last bowel movement was not recorded. Examination shows a soft abdomen with a palpable, slightly tender, lower left abdominal mass. Haematocrit is 28%. Leukocyte count is 10,000/mm3. Serum amylase activity is within normal limits. Test of the stool for occult blood is positive.

Source: adapted from 'Constructing written test questions for the basic and clinical sciences' (Case & Swanson, 2001).


Most of the guidelines for constructing one-from-five SBAs (page 31) are applicable to EMIs as well. The following, however, need to be emphasised.

1 Decide on the theme, lead-in and options, and finally develop the stems; i.e. clinical vignettes.

2 Stems without lead-ins or with non-specific lead-ins should NOT be developed; e.g. 'match the following with the most suitable option' (this is flawed).

3 By reading the stem and lead-in alone (i.e. without looking at the options) the candidate should be able to answer the question.

4 Options should be homogeneous; heterogeneous options reduce the choice and increase the probability of guessing the correct answer (page 31).

5 The stems should be long and options should be short (but the option list is long).

6 Options should be arranged in alphabetical order, so that the trainees will find it easy to locate the answer within a long option list.

7 Since there are several stems and a long, homogeneous list of options, it is important to guard against 'cueing' (one stem providing the answer to another) and 'hinging' (the answer to one stem needs to be known to answer another).

The shape of an EMI: a theme, followed by a long option list (A, B, C, ...), a lead-in, and a number of stems/case scenarios (1, 2, 3, ...).

Case, S.M. & Swanson, D.B. (2001). Constructing written test questions for basic and clinical sciences. 3rd edn. National Board of Medical Examiners (NBME), Philadelphia, USA. Available at http://www.nbme.org/PDF/ItemWriting_2003/2003IWGwhole.pdf (accessed on 26 January 2006).

EMIs are an efficient form of assessment, if many questions need to be developed within a particular curriculum area. Relevance to practice is ensured by the scenario or stem.

take home message


How to do it: Blueprinting structured, standardised oral examination

Choosing assessment content that samples the whole curriculum

The oral examination blueprint should be a matrix meshing professional capability/patient care, knowledge and judgement, and quality of response (columns) with the curriculum topic areas (rows). Each row and each column should be represented with suitable oral questions.

take home message

In a series of workshops, ISB examiners identified nine aspects that they assess in the oral exam:

1. Personal qualities; e.g. behaviour, attitudes, personality, honesty, integrity, demeanour

2. Communication skills

3. Professionalism

4. Surgical experience and ability to integrate competencies

5. Organisation and logical, step-wise sequencing of the thought process; ability to focus on the answer quickly

6. Ability to justify an answer with evidence from the literature

7. Clinical reasoning, decision making skills and prioritisation

8. Adaptability to stress and ability to handle stress

9. Ability to deal with ‘grey areas’ in practice and complex issues that may not have been assessed by the other assessments.

This has been collapsed to:

• Professional capability/patient care
• Knowledge and judgement
• Quality of response.

Trainee characteristic – Description

Professional capability/patient care: Personal qualities; Professionalism; Surgical experience; Ability to integrate competencies

Knowledge and judgement: Organisation and logical step-wise sequencing of thought process; Ability to justify an answer with evidence from the literature; Ability to deal with 'grey areas' in practice and complex issues that may not have been assessed by the other assessments; Clinical reasoning, decision making and prioritisation

Quality of response: Personal qualities; Communication; Organisation and logical step-wise sequencing of thought process; Ability to focus on the answer quickly; Adaptability to stress and the ability to handle stress


How to do it: Developing structured, standardised oral exam questions
Constructing oral exam questions at different levels in Bloom's taxonomy

Once satisfactorily blueprinted, the oral exam questions need to be developed to represent each selected cell in the blueprint matrix. The steps in developing oral exam questions are:

1. Identify a suitable topic that represents a selected cell in the assessment blueprint. Ideal topics are those that cannot be assessed in a hands-on setting at the clinical exam; e.g. medical emergencies, critical conditions, acute illnesses. Identify all the topics to cover the entire assessment blueprint and distribute these topics among different examiner groups.

2. Each examiner group should develop clinical scenario(s) within their allocated topic(s). The clinical scenario should have just sufficient information to generate a few questions; i.e. the scenario should neither be too lengthy with redundant data nor too short with insufficient data. The clinical scenario should represent a realistic situation. 'Props' in the form of photographic material, radiographs, etc. are ideal to support the scenario.

3. Develop at least one question each in the ‘introductory’, ‘competence’ and ‘advanced’ question categories for each scenario.

4. Identify appropriate, acceptable and unacceptable answers to the questions. The clinical scenario, the questions and the answers for each topic should be covered within approximately five minutes.

5. The examiner groups exchange their clinical scenarios, oral questions and specimen answers to validate the assessment material of other groups. It is particularly helpful if the examiner groups try to answer each question of another group, before reading the specimen answer, and then compare their answer with the specimen answer.

The oral examination questions should be: developed on topics reflecting the assessment blueprint; based on clinical scenarios drawn from actual practice; guided by the hierarchical stages in Bloom's taxonomy; and developed through a group process (this group process must cover both construction and validation of assessment material).

take home message


How to do it: Implementing the structured, standardised oral exam
The key steps in organising and implementing the structured, standardised oral examination effectively

The conduct of the oral exam (the candidate's pathway) is summarised below.

The structured, standardised oral exam format
• 0-30 minutes – Examiner Pair 1 (Examiners 1 and 2): Topics 1-6; both examiners mark each topic independently (Scores 1-12)
• 31-60 minutes – Examiner Pair 2 (Examiners 3 and 4): Topics 7-12 (Scores 13-24)
• 61-90 minutes – Examiner Pair 3 (Examiners 5 and 6): Topics 13-18 (Scores 25-36)

Within one-and-a-half hours (90 minutes), six examiners can independently assess each trainee on a total of 18 topics, with each topic represented by a clinical scenario, and generate 36 test scores, which should provide a valid and reliable measure of the candidate's ability in terms of professionalism/patient care, knowledge and judgement, and quality of response.

take home message
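The arithmetic behind this format can be made explicit with a minimal sketch (illustrative only; the pair, topic and timing labels follow the summary above): three examiner pairs, six topics per pair, and two independent marks per topic give 36 scores per candidate.

# Minimal sketch of the candidate's pathway through the structured, standardised oral exam:
# three examiner pairs, six topics per pair, both examiners in a pair scoring independently.

examiner_pairs = {
    "Pair 1 (Examiners 1 & 2, 0-30 min)":  [f"Topic {n}" for n in range(1, 7)],
    "Pair 2 (Examiners 3 & 4, 31-60 min)": [f"Topic {n}" for n in range(7, 13)],
    "Pair 3 (Examiners 5 & 6, 61-90 min)": [f"Topic {n}" for n in range(13, 19)],
}

score_sheet = []          # one entry per (pair, topic, examiner) = one independent score
for pair, topics in examiner_pairs.items():
    for topic in topics:
        for examiner in ("first examiner", "second examiner"):
            score_sheet.append((pair, topic, examiner))

print(len(score_sheet))   # 36 independent scores per candidate in 90 minutes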


How to do it: Marking the structured, standardised oral exam
The descriptors for the rating scale used by ISB oral examiners

The table below shows the rating rubrics/descriptors for the existing 4-8 rating scale that the ISB examiners developed at a series of workshops.

Descriptors for the ISB structured, standardised oral exam rating scale
(Q = questions; A = answers; P = prompting)

Overall Professional Capability/Patient Care (personal qualities, professionalism and ethics, surgical experience, adaptability to stress, ability to deal with grey areas)
4 – The candidate demonstrated incompetence in the diagnosis and clinical management of patients to a level which caused serious concerns to the examiner
5 – The candidate failed to demonstrate competence in the diagnosis and clinical management of patients
6 – The candidate demonstrated competence in the diagnosis and clinical management of patients
7 – The candidate demonstrated confidence and competence in the diagnosis and clinical management of patients
8 – The candidate demonstrated confidence and competence in the diagnosis and clinical management of patients to a level which would inspire confidence in the patient

Knowledge and judgement (knowledge, ability to justify, clinical reasoning)
4 – Did not get beyond default questions; failed in most/all competencies; poor basic knowledge/judgment/understanding to a level of concern; serious lack of knowledge
5 – Demonstrated a lack of understanding; difficulty in prioritising; gaps in knowledge; poor deductive skills; poor higher order thinking; significant errors; struggled to apply knowledge/judgment/management; variable performance
6 – Good knowledge and judgment of common problems; important points mentioned; instills confidence; no major errors
7 – Ability to prioritise; coped with difficult topics/problems; good decision making/provided supporting evidence; reached a good level of higher order thinking; strong interpretation/judgment but didn't quote the literature
8 – At ease with higher order thinking; flawless knowledge plus insight and judgment; good understanding/knowledge/management/prioritisation of complex issues; had an understanding of the breadth and depth of the topic, and quoted from literature; high flyer; strong interpretation/judgment

Quality of response (communication skills, organisation and logical thought process)
4 – Q: does not get beyond default questions; A: disorganised/confused/inconsistent answers, lacking insight; P: unpersuadable – prompts do not work
5 – Q: frequent use of default questions; A: confused/disorganised answers, hesitant and indecisive; P: required frequent prompting
6 – Q: copes with competence questions; A: methodical approach to answers, has insight; P: requires minimal prompting
7 – Q: goes beyond the competence questions; A: logical answers and provided good supporting reasons for answers; P: fluent responses without prompting, but some prompting on literature
8 – Q: stretches examiners – answers questions at advanced level; A: confident, clear, logical and focused answers; P: no prompting necessary

Descriptors for the existing ISB oral exam rating scale have been identified to improve inter-rater reliability.

take home message


How to do it: Blueprinting structured clinical exams
Choosing clinical conditions that appropriately sample the whole curriculum, to test important clinical aspects of the specialty

As with other exams, the structured clinical exam is blueprinted, selecting a wide and representative range of conditions from the curriculum to test important clinical aspects of each specialty. The blueprint for each specialty has yet to be identified and will be circulated when available.

The structured clinical exam blueprint should mesh the clinical conditions (rows) with the important clinical aspects (columns) that each specialty wishes to test.

take home message


How to do it: Developing structured clinical exam questions
Designing an assessment of different outcomes in clinical surgery at the 'shows how' level

Pairs of examiners will be asked to test specified clinical abilities in relation to the clinical conditions in their bay. As with the structured, standardised oral exam, questions relating to higher order thinking, clinical judgement and evidence-based decision making should be included.

The abilities to be tested in each specialty have yet to be identified and will be circulated as they become available.

The structured clinical exam and the structured, standardised oral exam should complement each other. They can be considered as one exam, with the structured clinical exam based on patients or clinical material and the structured, standardised oral exam based on clinical scenarios, supported where appropriate by clinical material.

take home message


How to do it: Implementing the structured clinical exam: the exam format
The key steps in organising and implementing the structured clinical exam effectively and efficiently

The format of the structured clinical exam is summarised below.

The structured clinical exam format
• The candidate works through 12 short cases (Cases 1-12), each lasting five minutes (0-5 minutes through to 56-60 minutes), arranged in two bays.
• Bay 1 – Examiner Pair 1 (Examiners 1 and 2) observes the candidate on six short cases; each examiner scores each case independently.
• Bay 2 – Examiner Pair 2 (Examiners 3 and 4) observes the candidate on the remaining six short cases and scores in the same way.
• Across the 12 cases this generates 24 independent scores (Scores 1-24).

Within one hour (60 minutes), four examiners working in pairs can independently assess each trainee on a total of 12 short cases. This exam format will generate 24 short case test scores. The reliability of the results of this exam is yet to be calculated.

take home message


How to do it: Marking the structured clinical exam
The descriptors for the rating scale used by ISB clinical examiners

The table below shows the rating rubrics/descriptors for the existing 4-8 rating scale that the ISB examiners developed at a series of workshops.

Descriptors for the ISB structured clinical exam rating scale
(Q = questions; A = answers; P = prompting)

Overall Professional Capability/Patient Care (personal qualities, professionalism and ethics, surgical experience, adaptability to stress, ability to deal with grey areas)
4 – The candidate demonstrated incompetence in the diagnosis and clinical management of patients to a level which caused serious concerns to the examiner
5 – The candidate failed to demonstrate competence in the diagnosis and clinical management of patients
6 – The candidate demonstrated competence in the diagnosis and clinical management of patients
7 – The candidate demonstrated confidence and competence in the diagnosis and clinical management of patients
8 – The candidate demonstrated confidence and competence in the diagnosis and clinical management of patients to a level which would inspire confidence in the patient

Knowledge and judgement (knowledge, ability to justify, clinical reasoning)
4 – Did not get beyond default questions; failed in most/all competencies; poor basic knowledge/judgment/understanding to a level of concern; serious lack of knowledge
5 – Demonstrated a lack of understanding; difficulty in prioritising; gaps in knowledge; poor deductive skills; poor higher order thinking; significant errors; struggled to apply knowledge/judgment/management; variable performance
6 – Good knowledge and judgment of common problems; important points mentioned; instills confidence; no major errors
7 – Ability to prioritise; coped with difficult topics/problems; good decision making/provided supporting evidence; reached a good level of higher order thinking; strong interpretation/judgment but didn't quote the literature
8 – At ease with higher order thinking; flawless knowledge plus insight and judgment; good understanding/knowledge/management/prioritisation of complex issues; had an understanding of the breadth and depth of the topic, and quoted from literature; high flyer; strong interpretation/judgment

Quality of response (communication skills, organisation and logical thought process)
4 – Q: does not get beyond default questions; A: disorganised/confused/inconsistent answers, lacking insight; P: unpersuadable – prompts do not work
5 – Q: frequent use of default questions; A: confused/disorganised answers, hesitant and indecisive; P: required frequent prompting
6 – Q: copes with competence questions; A: methodical approach to answers, has insight; P: requires minimal prompting
7 – Q: goes beyond the competence questions; A: logical answers and provided good supporting reasons for answers; P: fluent responses without prompting, but some prompting on literature
8 – Q: stretches examiners – answers questions at advanced level; A: confident, clear, logical and focused answers; P: no prompting necessary

'Bedside manner' (applicable to clinicals with patients)
4 – Abrupt/brusque manner; arrogant; inappropriate attitude/behaviour; no empathy; rough handling of patients; totally inappropriate examination of opposite sex
5 – Does not listen – patronising; no introduction; unsympathetic; unobservant of body language
6 – Appropriate exam of opposite sex; considerate handling; observes patient expression; respects all; responds to patient; treats 'all' patients appropriately
7 – Gains patient confidence quickly; good awareness of patient's reaction; puts patient at ease quickly
8 – Acts/talks at patient's level; instills confidence; patient rapport very good

It is important that the full range of scores is used as indicated by the descriptors.

Descriptors for the existing ISB clinical exam rating scale have been identified to improve inter-rater reliability.

take home message


General issues related to how to do it: item banking
Safe keeping and storage of assessment items for future use

The SBAs/EMIs are subjected to quality assurance procedures in terms of:

• item analysis; i.e. difficulty level and discrimination index (Case & Swanson, 2001 – chapter 8, pages 107-110) – a minimal calculation sketch follows this list

• psychometric properties; i.e. internal consistency; e.g. Kuder-Richardson 20 and the generalisability co-efficient (Streiner & Norman, 2003 – chapter 9, pages 153-171)

• examiner comments

• trainee feedback.
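For orientation, the sketch below shows one common way of calculating the two item-analysis statistics mentioned in the first bullet: the difficulty (facility) index as the proportion of candidates answering the item correctly, and a simple discrimination index as the difference in that proportion between the top and bottom scoring groups. The 27% grouping fraction and the data are illustrative assumptions, not ISB policy.

# Minimal sketch of item analysis for one SBA/EMI item.
# responses: 1 = candidate answered this item correctly, 0 = incorrectly.
# totals: the same candidates' total test scores, used to form top/bottom groups.

def difficulty_index(responses: list[int]) -> float:
    """Proportion of candidates answering the item correctly (facility value)."""
    return sum(responses) / len(responses)

def discrimination_index(responses: list[int], totals: list[int], fraction: float = 0.27) -> float:
    """Difference in item facility between the top and bottom scoring groups."""
    n = max(1, round(len(responses) * fraction))
    ranked = sorted(zip(totals, responses), key=lambda pair: pair[0])
    bottom = [r for _, r in ranked[:n]]
    top = [r for _, r in ranked[-n:]]
    return sum(top) / n - sum(bottom) / n

item = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]              # illustrative responses to one item
totals = [78, 65, 40, 70, 35, 60, 82, 42, 55, 74]  # illustrative total scores
print(difficulty_index(item))              # 0.7
print(discrimination_index(item, totals))  # positive value = better candidates do better on this item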

The included questions or items then need to be banked or stored in a suitable format to facilitate easy retrieval for re-use/updating or discarding in the light of fresh information. The item bank should also facilitate the easy addition of new test items; i.e. SBAs, EMIs.

The SBAs/EMIs are best stored in an electronic format with a backup copy, secured by strictly limited password access. The commercial “Speedwell” system has been selected to store the ISB exam item bank. Having a hard copy of the entire item bank under lock and key is highly recommended.

It is likely that the ISB item bank will eventually contain approximately 1500 SBAs for each subspecialty. Retrieving items from the bank under these circumstances could present difficulties without a comprehensive coding system. The coding system developed by ISB is shown below.


Case, S.M. & Swanson, D.B. (2001). Constructing written test questions for basic and clinical sciences. 3rd edn. National Board of Medical Examiners (NBME), Philadelphia, USA. Available at: http://www.nbme.org/PDF/ItemWriting_2003/2003IWGwhole.pdf (accessed on 26 January 2006).
Streiner, D.L. & Norman, G.R. (2003). Generalisability theory. Ch. 9 in: Health measurement scales: a practical guide to their development and use. 3rd edn. Oxford University Press, Oxford.

Items stored in the bank require comprehensive coding to aid retrieval of questions to fit the exam blueprint. The same coding system is employed for SBAs and EMIs.

take home message


Format for coding and banking SBAs / EMIs

Administrative fields: Question number; Office Use; Specialty; Question Originator; Question Contributors; Quality Assuror; Question Name

Good Medical Practice: Assessment; Audit; Communication; Cover; Education; Emergencies; Investigations; Laws; Limits; Managing resources; Record keeping; Relationships; Respect; Teaching; Treatment

Disease Aspects: Aetiology; Clinical presentation; Epidemiology; Natural history; Pathology; Prevention

Context of Medical Practice: Age; Co-morbidity; Equipment; Ethical; Gender; Legal; Occupation; Personnel; Prenatal; Race; Resources; Social circumstances; Where

The Healthy Patient in the Natural World and Society: Acoustics; Anatomy; Biochemistry; Business; Chemistry; Child development; Embryology; Genetics; Geography; Gerontology; Glues and adhesives; Governance; Haematology; Histology; Immunology; Law; Management; Materials; Microbiology; NHS; Optics; Pharmacology; Physics; Physiology; Private health sector; Psychology; Public health; Reproduction; Screening; Sociology; Statistics

Disease – Types: Congenital; Degenerations; Endocrine; Genetic; Iatrogenic; Idiopathic; Immune disease; Infections; Inflammations; Metabolic; Non accidental injury; Poisonings; Substance abuse; Tumours; Trauma; Vascular

Actions: Advise; Audiological; Audit; Biochemical; Calculate; Diagnose; Endoscopy; Examination; Haematological; History; Imaging; Measurement; Microbiological; Operate; Patient transfer; Physiological; Non operative treatment; Prescribing; Resuscitation; Team/communication; Treatment

Keywords

Coding Guidance
Normally the boxes 'Question Originator', 'Question Contributors', 'Quality Assuror' and 'Good Medical Practice' should be completed. If they are not, the question will be returned and not entered into the bank.

Any combination can be marked in the shaded title boxes and more than one option can be chosen. Key words would normally include such things as an anatomical site, anatomical system, important disease, operation or technique, for example. Each specialty can choose their own key words and the coding office will try and keep the specialty informed of their own key words. The question bank can search under any combination of ticked boxes and key words.
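To illustrate how such coding supports retrieval, here is a minimal sketch of coded item records and a search across ticked boxes and key words. It is not the Speedwell system; the field names follow the form above and the example records are invented.

# Minimal sketch of coded item-bank records and a search by ticked boxes / key words.
# Field names mirror the coding form; the records themselves are invented examples.

items = [
    {
        "question_number": "GS-0001",
        "specialty": "General Surgery",
        "good_medical_practice": {"Assessment", "Treatment"},
        "disease_types": {"Infections"},
        "actions": {"Diagnose", "Operate"},
        "keywords": {"inguinal hernia", "abdominal wall"},
    },
    {
        "question_number": "GS-0002",
        "specialty": "General Surgery",
        "good_medical_practice": {"Emergencies"},
        "disease_types": {"Trauma"},
        "actions": {"Resuscitation"},
        "keywords": {"splenic injury"},
    },
]

def search(items, **criteria):
    """Return items whose ticked boxes/key words include every requested value."""
    hits = []
    for item in items:
        if all(wanted <= item.get(field, set()) for field, wanted in criteria.items()):
            hits.append(item["question_number"])
    return hits

print(search(items, actions={"Operate"}, keywords={"inguinal hernia"}))  # ['GS-0001']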


Standards: an overview
An analysis of the rationale and concepts of standards

Standard setting is the process of deciding who passes and who fails an exam.

There are two types of standards.

1. Absolute or criterion-referenced standards: A trainee passes the assessment when he/she has achieved the level of competence (i.e. standard) that has been identified by the examiners. If all the trainees have achieved the desired competence level there will be no failures. Similarly, if no trainees achieve the set competence level, nobody will pass.

2. Relative or norm-referenced standards: This involves ranking the candidates. A fixed percentage of trainees (e.g. the top 60%), as determined by the examiners, pass the assessment, irrespective of the level of competence they have shown at the assessment. There is no pre-set pass mark or exam score. The implication is that in a 'bad cohort' some non-competent trainees may pass the exam and in a 'good cohort' some competent trainees may fail the exam.

It is not uncommon to have combinations of criterion and norm referenced standards; for example, to rank the trainees who have achieved the standard.
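The practical difference between the two types of standard can be seen in a minimal sketch (the scores, the fixed pass mark and the 'top 60%' rule are invented for illustration): a criterion-referenced decision compares each trainee with a fixed standard, whereas a norm-referenced decision passes a fixed proportion of the cohort whatever their absolute scores.

# Minimal sketch contrasting criterion-referenced and norm-referenced pass decisions.
# Scores, the 60% pass mark and the 'top 60% pass' rule are illustrative assumptions.

scores = {"A": 72, "B": 58, "C": 65, "D": 49, "E": 61}

def criterion_referenced(scores: dict[str, int], pass_mark: float) -> set[str]:
    """Everyone at or above the fixed standard passes, however many that is."""
    return {name for name, score in scores.items() if score >= pass_mark}

def norm_referenced(scores: dict[str, int], top_fraction: float) -> set[str]:
    """A fixed proportion of the cohort passes, whatever their absolute scores."""
    n_pass = round(len(scores) * top_fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_pass])

print(criterion_referenced(scores, pass_mark=60))   # {'A', 'C', 'E'}
print(norm_referenced(scores, top_fraction=0.6))    # the top 3 of 5 pass, regardless of the 60 cut-off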

From an educational standpoint, norm-referenced standards have the following weaknesses (Friedman Ben-David, 2000).

• Standards are not content related; i.e. mastery of curriculum content may not be achieved.

• The standard or pass mark is not known in advance.

• Diagnostic feedback related to the trainee competence/performance is unclear.

Whichever approach to standards is applied, it must be fit-for-purpose; be based on informed judgement; demonstrate due diligence; be supported by research; and be easily explained and implemented (Norcini, 2005).

Friedman Ben David, M. (2000). AMEE Guide No. 18: Standard setting in student assessment. Medical Teacher, 22 (2), pp. 120-130.
Norcini, J. (2005). Standard setting. In: Dent, J.A. & Harden, R.M. (eds) A practical guide for medical teachers. Elsevier Churchill Livingstone, London.

Criterion-referenced standards provide information on trainee competence, while norm-referenced standards help select the best trainee(s).

take home message


Conjunctive and compensatory standards
A comparison of two ways of applying criterion referencing

The decision whether to apply conjunctive or compensatory standards needs to be taken before the assessment is designed.

Conjunctive standards can be difficult to apply within one test and can lead to multiple fails. The application of compensatory standards is more realistic, particularly where the reliability of the exam results is low/unknown.

take home message

There are two ways of applying criterion referencing.

a) Conjunctive standards: To pass the whole assessment, the trainee needs to score more than the set score (i.e. the standard) for each test component.

b) Compensatory standards: The trainees can score low in one assessment component, but compensate for this poor score by scoring highly in another component of the same assessment and pass the overall assessment. Only the total mark is considered when deciding the pass/fail score.

The table below highlights some of the differences between conjunctive and compensatory standards.

A comparison between conjunctive and compensatory standards

Conjunctive standards: The trainee needs to achieve competence in (i.e. pass) all parts of the assessment, individually.
Compensatory standards: The trainee needs only to pass the overall assessment; i.e. they can compensate for a low mark in one part by scoring highly in another.

Conjunctive standards: Should be adopted only if the individual test items (assessment parts) are high in reliability; i.e. conjunctive standards for unreliable test parts will result in unreasonable failures.
Compensatory standards: Can be adopted even if individual test parts are low in reliability; the assessment as a whole having acceptable reliability is sufficient.

Conjunctive standards: Should be adopted if the individual assessment parts assess unrelated curriculum content or assess different competencies/constructs.
Compensatory standards: Should be adopted if the assessment components correlate well with each other; i.e. compensation will not result in loss of assessment information about the trainee.

Conjunctive standards: Provide clear diagnostic feedback to the trainee; i.e. the feedback indicates the trainee's weak areas.
Compensatory standards: Diagnostic feedback to the trainee may be unclear.

Conjunctive standards: Must consider ways of dealing with multiple failures (e.g. options for re-sitting the exam), before adopting conjunctive standards.
Compensatory standards: Re-sitting considerations are not a must. However, the consequences of compensating across test parts (i.e. the possible information loss) need to be considered.
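A minimal sketch of the two decision rules (the component names, scores and pass marks are invented): a conjunctive rule requires every component to be passed individually, while a compensatory rule only requires the combined total to reach the overall standard.

# Minimal sketch of conjunctive versus compensatory pass/fail decisions.
# Component names, scores and pass marks are illustrative assumptions.

component_scores = {"SBA paper": 55, "EMI paper": 72, "Oral": 68}
component_pass_marks = {"SBA paper": 60, "EMI paper": 60, "Oral": 60}

def conjunctive_pass(scores, pass_marks) -> bool:
    """Pass only if every component reaches its own standard."""
    return all(scores[c] >= pass_marks[c] for c in pass_marks)

def compensatory_pass(scores, overall_pass_mark: float) -> bool:
    """Pass if the combined total reaches the overall standard; strong parts offset weak ones."""
    return sum(scores.values()) >= overall_pass_mark

print(conjunctive_pass(component_scores, component_pass_marks))   # False - fails on the SBA paper
print(compensatory_pass(component_scores, overall_pass_mark=180)) # True - 195 in total compensates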


Different methods of standard setting
An outline of various methods available to decide the pass/fail score

There are two basic standard setting categories: those that focus on the test and those that focus on the examinee.

1. Test-centred standard setting

The examiners focus on individual assessment items to decide how the hypothetical borderline candidate or the ‘just passing’ trainee will fare, before arriving at the pass mark. The examiners identify the borderline or just passing competence level by:

• using examiner experience to estimate the probability of a borderline candidate passing each test item; e.g. Angoff (1971) method

• using both examiner experience and previous exam results, to determine the probability of a borderline candidate passing each test item; e.g. modified Angoff method (Friedman Ben-David, 2000)

• categorising the test items into a number of categories and then estimating the proportion of test items a borderline candidate will answer correctly in each category; e.g. Ebel's (1972) method. In the modified Ebel's method the categories are identified as 'essential', 'important' and 'indicated' (Case & Swanson, 2001)

• estimating the lowest and the highest acceptable score, and the least and the highest failure rate for each test item; e.g. Hofstee's (1973) method

• identifying the number of options remaining in a MCQ/SBA after removing the distractors that the examiners think the borderline trainee will recognise as incorrect, and calculating the pass mark as one over this number; e.g. Nedelsky's (1954) method (the arithmetic is sketched after this list)

• deciding whether a just passing candidate will answer each test item correctly as determined by multiple stakeholder panels (i.e. not only examiners); e.g. Jaeger's (1982) method.
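As flagged in the Nedelsky bullet above, the arithmetic of that method can be sketched in a few lines (the option counts are invented): each SBA contributes one divided by the number of options the borderline candidate cannot eliminate, and the exam pass mark is the sum of these values.

# Minimal sketch of Nedelsky's method for a set of five-option SBAs.
# 'remaining_options' = options left after removing the distractors a borderline
# candidate would recognise as incorrect (the counts here are invented).

remaining_options = [2, 3, 2, 4, 5]   # one entry per SBA in the paper

item_expectations = [1 / n for n in remaining_options]   # chance of the borderline candidate scoring each item
pass_mark = sum(item_expectations)                       # expected score of the borderline candidate

print([round(p, 2) for p in item_expectations])  # [0.5, 0.33, 0.5, 0.25, 0.2]
print(round(pass_mark, 2))                       # 1.78 out of 5, i.e. about 36%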


Standard setting calls for experienced examiner judgement.

take home message

Angoff, W.H. (1971). Scales, norms, and equivalent scores, in: Thorndike, R.L. (Ed.) Educational Measurement. 2nd edn, American Council on Education, Washington DC. pp. 508-600.
Case, S.M. & Swanson, D.B. (2001). Constructing written test questions for basic and clinical sciences. 3rd edn. National Board of Medical Examiners (NBME), Philadelphia, USA. Available at: http://www.nbme.org/PDF/ItemWriting_2003/2003IWGwhole.pdf (accessed on 26 January 2006).
Ebel, R.L. (1972). Essentials of Educational Measurement. Prentice-Hall, Englewood Cliffs, New Jersey.
Friedman Ben David, M. (2000). AMEE Guide No. 18: Standard setting in student assessment. Medical Teacher, 22 (2), pp. 120-130.
Hofstee, W.K.B. (1973). Een alternatief voor normhandhaving bij toetsen. Nederlands Tijdschrift voor de Psychologie, 28, pp. 215-227.
Jaeger, R.M. (1982). An interactive structures judgment process for establishing standards on competency test: theory and application. Educational Evaluation and Policy Analysis, 4, pp. 461-476.
Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, 14, pp. 3-19.
Norcini, J.J. (2003). Setting standards on educational tests. Medical Education, 37, pp. 464-469.
Smee, S.M. & Blackmore, D.E. (2001). Setting standards for an OSCE. Medical Education, 35, pp. 1009-1010.

2. Examinee-centred standard setting

The examiners use some other, non-hypothetical reference score achieved by the same/similar group of trainees, as the pass mark. The examiners identify this reference score by:

• considering the previous assessment results

• comparing the candidate score with some other score, such as a global mark awarded to the same candidate at the same exam; e.g. borderline-group method (Smee & Blackmore, 2001) – a minimal calculation sketch follows this list

• considering the ability of either all or a random sample of candidates, and then grouping them into ‘pass’ and ‘fail’ groups; e.g. contrasting group method (Norcini, 2003)

• using their own experience.
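As flagged in the borderline-group bullet above, the calculation can be sketched briefly (all figures invented): each candidate receives an exam score plus a global judgement, and the pass mark is taken as the mean score of those judged borderline.

# Minimal sketch of the borderline-group method.
# Each candidate has an exam score plus a global judgement from the examiners;
# the pass mark is the mean score of those judged 'borderline'. Data are invented.

candidates = [
    {"score": 74, "global": "clear pass"},
    {"score": 58, "global": "borderline"},
    {"score": 62, "global": "borderline"},
    {"score": 45, "global": "clear fail"},
    {"score": 60, "global": "borderline"},
]

borderline_scores = [c["score"] for c in candidates if c["global"] == "borderline"]
pass_mark = sum(borderline_scores) / len(borderline_scores)

print(pass_mark)   # 60.0 - candidates scoring at or above this would pass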


Modified Angoff method of standard setting
The application of the modified Angoff method to decide the pass/fail mark of a SBA or EMI exam

Angoff (1971) developed a method to decide the pass mark for a multi-component (i.e. multi-item) assessment; e.g. the SBA question paper. In the modified Angoff method, the examiners are additionally supplied with the actual item scores (achievement in individual SBA/EMI by similar candidates), to facilitate the process of determining the probability of a 'just passing' trainee answering a given SBA correctly.

The steps of the standard setting process, when applied to the SBA question paper are:

1. The examiners visualise and discuss the characteristics of a ‘just passing’ candidate/trainee until a consensus is reached.

2. The examiners individually consider each test item (i.e. SBA question), one at a time.

3. Each examiner estimates the probability of a ‘just passing’ candidate/trainee answering the SBA correctly. The probability lies within the range of 0 to 1, where 0 = no probability of the ‘just passing’ candidate getting the right answer and 1 = all ‘just passing’ candidates should achieve the correct answer.

4. Add up the individual probability estimates for all the SBAs per examiner to obtain the total probability per examiner.

5. Add the probabilities of all the examiners to arrive at a total.

6. Divide the total in step 5 by the number of examiners to arrive at the average pass mark for the whole assessment.

7. Examiners, then, look at actual previous examination results.

8. Examiners revise, if necessary, their individual item (SBA) probabilities in the light of the previous exam results and re-calculate their own pass mark.

9. Re-calculate the average pass mark for the whole assessment, by repeating steps 4 to 6.

10. Convert the final average pass mark to a percentage. This percentage is the pass/fail cut score or standard for the exam.

A worked example of the above standard setting process is below.

Application of the modified Angoff method to a 10-question SBA using five examiners

SBA no. SBA 1 SBA 2 SBA 3 SBA 4 SBA 5 SBA 6 SBA 7 SBA 8 SBA 9 SBA 10 Total

Examiner 1 0.20 0.65 0.51 0.70 0.40 0.72 0.32 0.56 0.62 0.55 5.23

Examiner 2 0.15 0.58 0.45 0.75 0.38 0.75 0.35 0.54 0.65 0.53 5.13

Examiner 3 0.18 0.59 0.48 0.77 0.45 0.69 0.40 0.51 0.59 0.58 5.24

Examiner 4 0.25 0.63 0.55 0.80 0.48 0.78 0.39 0.58 0.68 0.49 5.63

Examiner 5 0.19 0.66 0.52 0.79 0.38 0.82 0.33 0.60 0.66 0.48 5.43

Total pass mark for all five examiners for 10-item exam 26.66

Average pass mark per examiner for 10-item exam 5.33 (out of 10)

Pass/fail standard as a percentage cut score 53%
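The calculation in steps 4-6 and 10 can also be written as a short sketch that reproduces the worked example above; if examiners revise their estimates after seeing previous results (steps 7-9), the same calculation is simply repeated with the updated figures.

# Minimal sketch of the modified Angoff calculation, using the probability
# estimates from the worked example above (5 examiners, 10 SBAs).

estimates = [  # one row per examiner, one probability per SBA
    [0.20, 0.65, 0.51, 0.70, 0.40, 0.72, 0.32, 0.56, 0.62, 0.55],
    [0.15, 0.58, 0.45, 0.75, 0.38, 0.75, 0.35, 0.54, 0.65, 0.53],
    [0.18, 0.59, 0.48, 0.77, 0.45, 0.69, 0.40, 0.51, 0.59, 0.58],
    [0.25, 0.63, 0.55, 0.80, 0.48, 0.78, 0.39, 0.58, 0.68, 0.49],
    [0.19, 0.66, 0.52, 0.79, 0.38, 0.82, 0.33, 0.60, 0.66, 0.48],
]

totals = [sum(row) for row in estimates]          # step 4: total probability per examiner
grand_total = sum(totals)                         # step 5: 26.66
average_pass_mark = grand_total / len(estimates)  # step 6: 5.33 out of 10
cut_score = 100 * average_pass_mark / len(estimates[0])  # step 10: percentage standard

print([round(t, 2) for t in totals])                       # [5.23, 5.13, 5.24, 5.63, 5.43]
print(round(average_pass_mark, 2), round(cut_score), "%")  # 5.33 53 %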


Some authorities add one standard error of measurement (SEM) to this pass/fail score in high stakes exams to ensure that candidates around the borderline, whose abilities might be doubtful, do not pass the exam. This measure is added for patient protection.
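The SEM itself is not defined in this manual; a commonly used formula (an assumption here, not ISB policy) is SEM = SD × √(1 − reliability), where SD is the standard deviation of candidates' scores and reliability is, for example, an internal-consistency coefficient. The sketch below applies the adjustment with invented figures.

# Minimal sketch of adding one standard error of measurement (SEM) to the cut score.
# The SEM formula (SD * sqrt(1 - reliability)) is a standard psychometric result;
# the SD, reliability and cut score figures below are invented for illustration.

from math import sqrt

cut_score = 53.0      # percentage standard from the modified Angoff exercise
sd = 8.0              # standard deviation of candidates' percentage scores (assumed)
reliability = 0.90    # e.g. internal-consistency coefficient of the paper (assumed)

sem = sd * sqrt(1 - reliability)
adjusted_cut_score = cut_score + sem   # one SEM added for patient protection

print(round(sem, 2), round(adjusted_cut_score, 2))   # 2.53 55.53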

The steps in the modified Angoff process of standard setting

Examiners discuss and identify the characteristics of the ‘just passing’ trainee

Examiners consider each test item (i.e. each SBA), one at a time

Each examiner estimates the probability of ‘just passing’ trainee passing each SBA

Individual examiner probability estimates are added up = total probability per examiner

Total probabilities of all examiners are added up = sum of total probabilities

Sum of total probabilities is divided by the number of examiners = average pass mark per examiner

Examiners are provided with the scores of past assessments

Examiners discuss and if necessary change their initial probability estimates

Average pass mark per examiner is re-calculated = final average pass mark

The final average pass mark is converted to a percentage. This is the pass/fail standard or cut point.


Advantages

• Studies have shown that this method, when applied properly, produces reliable judgements that are sensitive to differences in content and in candidate ability (Norcini, 2005).
• The modified Angoff process evaluates the difficulty of each test item and sets the pass mark accordingly. Hence, it is a criterion-based method of standard setting.
• The difficulty of visualising the hypothetical 'just passing' candidate/trainee, a criticism attributed to the classical Angoff method (Smee & Blackmore, 2001), has been reduced to an extent in the modified Angoff procedure by introducing past assessment results to inform examiner judgements.

Disadvantages

• Time consuming: the examiners have extra work in estimating the probabilities.
• A substantial number of experienced examiners (10 to 12) is needed for the modified Angoff method to work satisfactorily (Zieky, 2001; Kaufman et al., 2001).
• Visualising the hypothetical 'just passing' candidate is difficult.

Angoff, W.H. (1971). Scales, norms and equivalent scores, in: R.L. Thorndike (Ed.) Educational Measurement. American Council on Education, Washington DC.
Kaufman, D.M., Mann, K.V., Muijtjens, A.M.M. & van der Vleuten, C.P.M. (2001). A comparison of standard setting procedures for an OSCE in undergraduate medical education. Academic Medicine, 75, pp. 267-271.
Norcini, J. (2005). Standard setting. In: Dent, J.A. & Harden, R.M. (eds) A practical guide for medical teachers. Elsevier Churchill Livingstone, London.
Smee, S.M. & Blackmore, D.E. (2001). Setting standards for an OSCE. Medical Education, 35, pp. 1009-1010.
Zieky, M.J. (2001). So much has changed. How the setting of cut-scores has evolved since the 1980s, in: Cizek, G.J. (Ed.) Setting Performance Standards: Concepts, Methods, and Perspectives. Lawrence Erlbaum Associates, Mahwah, New Jersey. pp. 19-52.

Take home message: The modified Angoff approach can be used to judge competence in exams of different levels of difficulty.

The ISB examiners' code of conduct and guidelines: key points for examiners

With reference to the ISB Examiners' Training Course, the following code of conduct and guidelines must be observed.

Do not examine a candidate if any of the following applies

• He/she is known to you on a personal basis
• He/she is currently working with you, or has worked for you recently
• He/she is someone with whom you have had difficulties in the past

Introduction to the candidate

• Courtesy
• Settling question – it is helpful to remind the candidate which oral he/she is about to be examined on, in order to give him/her time to settle

Structuring the orals/clinicals

• Prepare questions in advance for calibration at the pre-examination meeting
• Start with a reasonable introductory question
• Questions should be pitched at a higher cognitive level, calling for answers that allow the examiner to assess the candidate's judgement, evaluation, synthesis (integration) and application of knowledge

Questions

• Questions should be set at the appropriate level (i.e. the knowledge of a new consultant)
• If there is a candidate topic sheet, then avoid questions on topics already covered

Courtesy and encouragement

• Excessive stress damages performance
• Courtesy and encouragement reduce stress
• Orals are not a test of a candidate's ability to stand up under fire
• Treat all candidates the same
• Marks must be based on performance only

Harassment

• Harassment should be identified and stopped by the co-examiner
• A good robust argument may be terrifying for a candidate
• Try not to respond to inappropriate behaviour by the candidate

Feedback

Give feedback where appropriate, such as:

• 'OK, now let's move on to….'
• Avoid remarks such as 'well done', 'excellent', 'perfect'
• Avoid being poker-faced


Time

Every second that the examiner talks leaves less time for the candidate to show whether he/she is competent.

• Candidate talks
• Examiner listens
• Clear questions aid this process

Tutorials

• Prepare clear and unambiguous questions, with a default question if the candidate is unable to answer
• Avoid tutorials

Hammering on

• If a candidate can’t answer, do not ‘hammer on’ – lead on to another subject

Props: slides / x-rays / pictures / charts / surgical instruments

• If using props, then check them out prior to the oral with your co-examiner to ensure that they are clear and unambiguous
• Laminated photographs are preferable to laptops

Advice to examiners when commenting on examination performance

Examiners should refer to the marking descriptors and use these as a basis for comments. Please ensure that any comments on examination mark sheets are legible, intelligible and appropriate.

• Comments must be capable of being understood
• Comments must be factually correct
• Where professional and academic judgement is being expressed, it must show evidence of reflection on the candidate's response
• All comments must be phrased in a professional manner
• No comment should be made that the examiner would not be prepared to make to the candidate in person

And finally, most importantly… note taking

This is a must to safeguard you and your co-examiner.

Take home message: This code of conduct must be used by all examiners.

Glossary
Definition of terms customised for the purposes of ISB exams

Appraisal. A process in which "the supervising consultant or the educational supervisor provides, through constructive and regular dialogue, feedback on performance and assistance in career progression" (Jolly & Grant, 1997 - page 11). Appraisal, though not part of assessment, may be informed by some assessments that are applied throughout the year.

Assessment. The activity of measuring the mastery of curriculum content, using pre-defined criteria, and passing a judgment by assigning a value (i.e. a grade or numerical value) to such mastery. In general terms, it is "a process for obtaining information that is used for making decisions about students" (Nitko, 1983 – page 4).

Assessment blueprinting. A procedure to ensure that the curriculum has been sampled appropriately. Assessment blueprinting is carried out by preparing a grid with curriculum outcomes in columns and curriculum content in rows. A decision is made regarding the proportion of the exam devoted to individual content and individual outcomes. The blueprint is then used to plan the exam.
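As an illustration only, the short Python sketch below lays out a miniature blueprint grid and works out the share of the paper devoted to each outcome and each content area; the content areas, outcomes and item numbers are invented for the example, not taken from any ISB blueprint.

    # Miniature assessment blueprint: content areas in rows, outcomes in columns,
    # cell values = number of items planned (all figures invented for illustration).
    blueprint = {
        "Trauma":                  (6, 6, 3),
        "Elective orthopaedics":   (5, 5, 2),
        "Paediatric orthopaedics": (3, 3, 2),
    }
    outcomes = ("Diagnosis", "Management", "Applied basic science")

    total_items = sum(sum(row) for row in blueprint.values())
    print(f"Total items planned: {total_items}")

    # Column totals: items per outcome across all content areas
    for outcome, column in zip(outcomes, zip(*blueprint.values())):
        print(f"{outcome}: {sum(column)} items ({100 * sum(column) / total_items:.0f}% of the paper)")

    # Row totals: items per content area
    for content, row in blueprint.items():
        print(f"{content}: {sum(row)} items ({100 * sum(row) / total_items:.0f}% of the paper)")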

Cognition. An intellectual process by which knowledge, understanding and higher order thinking are developed in the mind. Webster's dictionary (Gove, 1976 – page 440) defines cognition as "the act or process of knowing in the broadest sense; an intellectual process, by which knowledge is gained about perceptions and ideas".

Constructed response questions. A category of written assessment that requires the trainee to formulate and document an appropriate answer, rather than selecting an answer from a pre-prepared list of options; e.g. essay questions, modified essay questions.

Criterion-referenced assessment. An assessment that measures the trainee’s mastery of curriculum content by comparing the trainee’s ability with an acknowledged, established level of ability or standard.

Descriptor. A brief, accurate, specific and focused description of the level of candidate ability denoted by a point on a rating scale, e.g. please see page 46.

Distractor. An incorrect option (i.e. response choice) in a multiple choice question option list; e.g. an SBA question has one correct response/option and four distractors.

Examination. An assessment instrument that may form a component of a larger assessment process.

Extended matching item (EMI) examination. A form of multiple choice assessment organised into question sets, with each set containing a long option list that is shared by several clinical scenarios or ‘stems’. For each stem the trainee/candidate has to choose the correct answer from the options list. Thus, an extended matching item set contains: a theme; a long option list; a lead-in; and two or more stems.

Evaluation. Though in the US the terms ‘evaluation’ and ‘assessment’ are synonymous, in the UK evaluation means the process of assessing an educational system (e.g. curriculum evaluation), of which assessment is only one constituent.

Formative assessment. Assessment that is carried out mainly to give feedback: to the trainee, so that he/she can improve; or to others (e.g. teachers, accrediting bodies, examiners, educational institutions) about the training.

Generalisability theory (G study). A statistical procedure to find out the relative contribution of all the possible systematic sources of error affecting the reliability of a given test. The G-coefficient is a value between 1 and 0. The G study provides estimates of as many of the possible contributors to variability as is logistically feasible, considering the various uses to which the test may be put.

Generalisability theory (D study). A procedure that uses the information from the G study to show how different combinations of variability contributors will affect the reliability (G-coefficient) of a given test. This is done by statistical modelling of different test formats. Thus, using the D study, decisions can be taken as to how many assessment items/questions need to be included in an assessment to minimise the effect of systematic sources of error for a particular purpose.
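As an illustration only (the manual does not prescribe a particular design), the Python sketch below shows a D study projection for the simplest fully crossed candidate-by-item design, assuming relative (rank-order) decisions. The two variance components are invented figures standing in for the output of a G study.

    # Illustrative D study for a candidate-by-item (p x i) design, relative decisions.
    # Variance components are hypothetical; in practice they come from the G study.
    var_person = 0.040       # candidate-to-candidate (true score) variance, assumed
    var_person_item = 0.160  # candidate-by-item interaction plus residual error, assumed

    def g_coefficient(n_items: int) -> float:
        """Projected G-coefficient if the test contained n_items items."""
        return var_person / (var_person + var_person_item / n_items)

    for n in (10, 25, 50, 100, 200):
        print(f"{n:>4} items: G = {g_coefficient(n):.2f}")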

Learning contract. An agreement that sets out: a review of previous achievement; dialogue or negotiation by the trainee with the supervisor; learning objectives and timetable; and the record of the outcomes of the contract.

Learning outcomes. A set of broad goals or meta-competencies that the learners need to achieve at the end of the course/learning programme. For example history taking, physical examination and differential diagnosis are the competencies that comprise the outcome ‘clinical skills’.

Modified essay questions. A written assessment, usually with a clinical scenario followed by a number of questions, testing the trainees' ability in clinical reasoning and application of basic science knowledge.

Multiple choice questions (MCQs). A generic term used for a group of written assessments, where the candidate selects a response from a pre-prepared set of options.

Multiple choice question (MCQ): one-from-five format (called Single Best Answer by ISB exams). A type of written assessment of the selected response variety in which each item contains a question or statement, which is called the lead-in, and five options. The trainee/candidate has to choose the most suitable option as the answer to the question or statement. The incorrect options are called distractors. In some instances the question or statement is preceded by a clinical case scenario or a vignette that is relevant to practice, called the 'stem' or 'case'.

Norm-referenced assessment. An assessment that ranks the trainee’s ability by comparing it with the ability of other trainees who sit the assessment.

Objective structured clinical examination (OSCE). An assessment framework that can be used to assess trainee competence in clinical skills, in a 'snapshot' simulated situation.

Patient management problems. A written or computer-based assessment that attempts to assess the problem solving skills of the trainee. A clinical problem is followed by a set of questions focusing on the key features of that clinical case; i.e. the key feature approach.

Performance assessment. An assessment conducted under normal, real-life conditions in which the trainee works; e.g. work-based assessment.

Portfolio. A collection of trainee work, which provides evidence of the achievement of knowledge, skills, attitudes and professional growth through a process of self-refl ection over a period of time.

Portfolio assessment. An assessment framework that includes various assessment tools to measure the trainee's achievement of identified learning outcomes.

Professionalism. A set of key values or a code of conduct espoused, either explicitly or implicitly, by a professional body to guide the behaviour of its members. For example, the American Board of Internal Medicine (ABIM) identifies the characteristics of professionalism as: altruism, accountability, duty, excellence, honour and integrity, and respect for others (Robins et al., 2002).

Quality assurance. A system of procedures, checks or audits that evaluates and monitors the work and the products of an institute, and proposes corrective measures, if necessary, to ensure that the outcomes are met as anticipated.

Rating scale. A scale used to measure trainee ability. In the best designed rating scales, the points on the rating scale are ‘anchored’ with descriptors, describing the trainee characteristics that are indicated by each point.

Record of in-training assessment (RITA). A method of postgraduate, workplace assessment for specialist registrars. It contains, for example, records of assessments undergone during training and cycles of learning contracts with the supervisor/trainer.

Reliability. The precision with which a part or whole of the assessment result can be reproduced, usually expressed as a reliability coefficient (r), which should be a value between 1 and 0. The assessment can be reproduced totally (100%) if r = 1. High reliability (a coefficient of more than 0.8 for high stakes exams) is a vital attribute, especially for a competitive, high-stakes assessment, to ensure that the assessment is fair to all; i.e. the trainees are judged by the same standards/criteria.

Test-retest reliability. The reproducibility of the result when the same test is administered to the same or similar trainee on two occasions.

Inter-rater reliability. The consistency of two or more raters when assessing a trainee or a similar cohort of trainees.

Internal consistency. The similarity or the correlation among different parts of a test (eg MCQ exam or paper) or a questionnaire; i.e. the homogeneity of the test components.

Intra-rater reliability. The consistency of the ratings by the same examiner when assessing the same or similar candidates, at the same or similar exams, on different occasions.
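As an illustration of internal consistency (not an ISB procedure), the Python sketch below computes Cronbach's alpha, one commonly reported index, from a small, made-up candidate-by-item score matrix (1 = correct, 0 = incorrect).

    # Cronbach's alpha from a candidate-by-item score matrix (made-up data).
    def cronbach_alpha(scores):
        """scores: one list of item scores per candidate."""
        n_items = len(scores[0])

        def variance(values):  # population variance
            mean = sum(values) / len(values)
            return sum((v - mean) ** 2 for v in values) / len(values)

        item_variances = [variance([cand[i] for cand in scores]) for i in range(n_items)]
        total_variance = variance([sum(cand) for cand in scores])
        return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)

    scores = [
        [1, 1, 0, 1, 1],
        [1, 0, 0, 1, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 0, 1, 0],
        [1, 1, 0, 1, 1],
        [0, 1, 0, 0, 0],
    ]
    print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")   # about 0.76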

Rubric. A point on a rating scale representing a discrete level of candidate ability. An assessment rubric is more specific than a grade and gives more accurate, focused and itemised information about the trainee/candidate ability, whereas a grade is a global rating/rank. Hence, an important characteristic of a rubric is that it is defined by a descriptor, explaining the level of ability to which the rating scale point relates.

Selected response questions. A form of written assessment that requires the trainee to select the most suitable answer from a list of options; e.g. MCQ, EMI.

Short answer questions. Questions that require the trainee to construct and write down an appropriate answer briefly; i.e. a word, a sentence, a paragraph or two to three paragraphs.

Simulated patients. Actors or members of the lay public, who are trained to reproduce clinical histories and certain physical signs during assessments such as the OSCE.

Single best answer (SBA) questions. The term used by the ISB exams for one-from-five MCQs.

Skill. An organised, psycho-motor activity that can be learnt and developed by practice.

Standard setting. The process of establishing a cut point for passing or failing candidates at a summative assessment.

Standardised patients. Real patients who have been trained to reliably reproduce clinical histories and physical signs.

Structured standardised orals. An oral exam based on predetermined clinical scenarios, carefully selected and developed by experienced clinician examiners to represent the outcomes assessed, and marked by trained examiners using structured, pre-validated rating scale rubrics with anchored descriptors, to ensure that all the examinees receive a similar form of assessment, in terms of content, outcomes, item difficulty and examiner leniency. An important characteristic is that the structured, standardised oral exam uses many examiners and many clinical scenarios on which to base its assessment judgement.

Summative assessment. Assessment carried out primarily to either pass or fail a trainee/candidate.

Test. A set of questions or exercises measuring knowledge, skills, attitudes and/or professionalism. Alternatively, a test can be any standardised procedure for measuring sensitivity, memory, intelligence or aptitude; eg IQ test. Though the noun ‘test’ can be used interchangeably with ‘examination’, the verb ‘test’ is to discover the worth of something by trial or to improve the quality of something by trial. Both should be the hallmarks of a good assessment instrument.

Three hundred and sixty degree (360°) assessment. A method of performance assessment, which incorporates the input from a number of stakeholders; e.g. senior doctors, peers, patients, junior doctors, co-workers. Usually questionnaires or rating scales are the assessment tools used. This assessment is particularly useful in assessing the trainees' communication skills, integrity, team work, leadership, etc., which cannot be readily assessed by more formal assessment methods.

True-false item examination. A written examination in the selected response category with a series of statements that the candidate/trainee has to identify either as ‘true’ or ‘false’.

Validation. A process that a test/examination or a questionnaire is put through (i.e. piloted), before it is actually used, to ensure that it is fit-for-purpose.

Validity. The degree to which an assessment tests what it purports to assess. There are several types of validity.

Face validity. The ability of the assessment to convince the stakeholders that the assessment is fair; i.e. how it is perceived by everyone involved.

Content validity. The degree to which the assessment has sampled the curriculum material and learning outcomes appropriately. Assessment blueprinting is a common procedure used to establish this form of validity.

Criterion validity. This has two parts – concurrent validity and predictive validity.
Concurrent validity measures how well the assessment is comparable to or correlates with an assessment that is considered to be the gold standard in that sphere of assessment.
Predictive validity is a measure of the accuracy with which the assessment result can foretell the future performance of the trainee.

Construct validity. The ability of an assessment to indirectly measure an innate, underlying attribute of the trainee; e.g. intelligence, clinical reasoning skills, problem solving skill, teamwork.

Viva voce. A traditional form of oral exam, where one or more examiners fire random questions at the candidate in a face-to-face interview or discussion. Each candidate may receive a different exam with regard to the examiners, assessment content, assessment outcomes, item difficulty and examiner leniency (Davis & Karunathilake, 2005).

Davis, M.H. & Karunathilake, I. (2005). The place of the oral examination in today's assessment systems. Medical Teacher, 27(4), pp. 294-7.
Gove, P.B. (ed) (1976). Webster's third new international dictionary. G. & C. Merriam Company, Massachusetts.
Jolly, B. & Grant, J. (eds) (1997). Definitions: appraisal and assessment. Part 2, Ch. 2, in: The good assessment guide. Joint Centre for Education in Medicine, London.
Nitko, A.J. (1996). Distinctions among assessments, tests, measurements, and evaluations. Part 1, Ch. 1, in: Educational assessment of students. 2nd edn. Prentice-Hall, Inc., New Jersey.
Robins, L.S., Braddock, C.H. 3rd & Fryer-Edwards, K.A. (2002). Using the American Board of Internal Medicine's "Elements of Professionalism" for undergraduate ethics education. Academic Medicine, 77(6), pp. 523-31.
