8/9/2019 IRR Flight School Instructors
1/61
INTER-RATER RELIABILITY OF FLIGHT SCHOOL INSTRUCTORS:
A FOUNDATIONAL STUDY
By
Matthew Vail Smith
An Applied Project Presented in Partial Fulfillment of the Requirements for the Degree
Master of Science in Technology
ARIZONA STATE UNIVERSITY
December 2007
© 2007 Matthew Vail Smith
All Rights Reserved
INTER-RATER RELIABILITY OF FLIGHT SCHOOL INSTRUCTORS:
A FOUNDATIONAL STUDY
by
Matthew Vail Smith
has been approved
December 2007
Graduate Supervisory Committee:
Mary Niemczyk, Chair
William McCurry
ACCEPTED BY THE GRADUATE COLLEGE
ACKNOWLEDGMENTS
I would like to acknowledge the help of several people:
Dr. Joel Hutchinson, who helped me to overcome the mental blocks I struggled with.
Lisa Cahill, ASU Polytechnic Writing Center, for her constructive criticism and helpful
suggestions.
Professors Merrill Karp and Jim Anderson for introducing me to the PCATD and
explaining the possibilities it offered.
Greg and David, the Lab Assistants who taught me how to use the PCATD.
The student volunteers who flew the sample flights.
The four flight instructors who took time out of their busy schedules to watch three hours
of footage.
Committee member Dr. William McCurry for his guidance and suggestions.
And very special thanks to my committee chair, Dr. Mary Niemczyk, without whose
unwavering faith and support, I could never have accomplished this project and
graduated.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER
1 INTRODUCTION
    Statement of Purpose
    Scope
    Assumptions
    Limitations
    Equipment Used
    Chapter Summary
2 LITERATURE REVIEW
    Background
    Cohen's Kappa
    Inter-rater reliability in Sports
    Inter-rater reliability in Psychology
    Inter-rater reliability in Health Care
    Inter-rater reliability in Education
    Chapter Summary
3 METHOD
    Flight Pattern
    Pilot Participants
    Rater Participants
    Scoring Rubric
    Flying the Pattern
    Experiment Execution
    Chapter Summary
4 RESULTS
    Raw Scores
    Contingency Tables
    Summary of Results
5 CONCLUSIONS AND RECOMMENDATIONS
    Recurrent Training
    Scoring Rubric Improvements
    Technical Improvements
    Recommendations for Further Research
    Commercial Application of this Study
    Summary
REFERENCES
APPENDIX
A Scoring Rubric
B Briefing and Script
C Instructor Instructions
D Score Sheet
LIST OF TABLES
Table
1. Example of the inter-rater reliability Contingency Table used in this Experiment
2. Raters' Raw Scores
3. Rater 1 versus Rater 2
4. Rater 1 versus Rater 3
5. Rater 1 versus Rater 4
6. Rater 2 versus Rater 3
7. Rater 2 versus Rater 4
8. Rater 3 versus Rater 4
9. Summary of Results
LIST OF FIGURES
Figure
1. Contingency Table Highlighting Agreement Cells
2. Contingency Table with Chance-Corrected Agreement
3. Pattern D
CHAPTER 1
INTRODUCTION
Several educational institutions exist to train students to become professional
pilots. As part of the regular curriculum, students must attend ground school and engage
in the required number of flight training hours. Ground school, the written exams issued
by the Federal Aviation Administration (FAA), and the required flight syllabi are all
standardized. However, training is not identical from school to school, even when it is
fully compliant with FAA regulations. Even in a flight school that has very exacting
standards, training may be different under different instructors for any number of reasons
such as the instructor's abilities and interests. Some pilots dislike instructing and only do
it to build hours and to put experience on a resume. Others do it because they enjoy
sharing their love of flying with others. All instructors regardless of their personal
characteristics must do one thing: evaluate student performance. And yet, because of
their personal characteristics, instructors perceive student performances differently from
one another. The reasons for differences in instructor perception of student performance
can be systematic or arbitrary, conscious or subconscious, innocuous or malicious; one
simply cannot catalog another's motives, but one can see the result of the instructors'
perceptions: difference.
When scoring a student pilot, there is the student pilot's performance, which is
objective, and the instructor pilot's perception of that performance, which is subjective.
In the best of circumstances, the performance and the recorded perception of that
performance share a high degree of similarity. That is, the instructor ought always to
record a score that accurately and precisely reflects the student's performance. However,
this is not always the case. Some perceptions of performance are too forgiving, while
others are overly critical. In other words, the same student pilot can receive a passing
score from an overly forgiving instructor and a failing score from an overly critical
instructor for an identical or near-identical performance, leaving the student confused or
frustrated. The problem, then, is ensuring that all student pilots receive standardized scores
that reflect the student pilot's performance with a high degree of reliability.
Students, as well as the other stakeholders of flight schools, must be sure that the
scoring system is such that the scores are a meaningful indicator of the student's
performance rather than an arbitrary indicator of the instructor's perception.
Furthermore, the scores should be consistent from one instructor to another. This
problem can be examined with an inter-rater reliability study. Inter-rater reliability is
"used to assess the degree to which different raters/observers give consistent estimates of
the same phenomenon" (Trochim, 2001, p. 96). This investigation, then, seeks to offer
any flight school a method to determine the inter-rater reliability of its instructor pilots.
Chapter 1 introduces the problem and sets the parameters of this investigation.
Chapter 2, the literature review, examines pertinent inter-rater reliability literature dealing
with both the statistical theory and application of inter-rater reliability. The literature
reviewed in this study does not come from aviation sources because, after an exhaustive
search of reputable science journal databases, the researcher could not find aviation inter-
rater reliability studies. Instead, the literature reviewed comes from other fields such as
sports, psychology, health care and education, where inter-rater reliability studies are
used extensively. Many lessons learned from these fields may be applied to aviation,
especially in the sub-fields of aviation human factors and flight training/pilot education.
Chapter 3 discusses the methodology used to plan, design and execute the project and to
analyze the data. Chapter 4 examines the results. Chapter 5 discusses two possible ways
to improve inter-rater reliability at the flight school, suggests technical improvements
for executing the project, offers a commercial application, makes suggestions for further
research and summarizes the project.
Statement of Purpose
The purpose of this investigation, then, is to determine the reliability of rating
student pilot performance between instructor pilots. In order to accomplish this task, this
investigation:
• defines inter-rater reliability and discusses its application to pilot training;
• reviews literature regarding inter-rater reliability;
• describes the method (experiment) that was used to assess inter-rater reliability;
• analyzes the data as collected from the performed experiment using Cohen's kappa coefficient;
• discusses the results;
• makes recommendations for corrective action;
• suggests a commercial application for this research; and
• suggests areas for further research.
Scope
The scope of this investigation is a foundational study in which the rating
performances of a cross-section of instructors are analyzed to determine inter-rater
reliability. Four instructor pilots were asked to watch the flight performances of ten
students flying the same instrument flight pattern as recorded on a DVD. The testing of
the raters took place throughout the course of a single afternoon in a controlled
environment, under the supervision of the researcher.
Assumptions
This investigation assumes that there may be a difference between the raters in
terms of their evaluation of student performance that is worth examining and that the
traditional methods for determining inter-rater reliability, such as the kappa coefficient,
are sound. Furthermore, it assumes that the principles of inter-rater reliability are
transferable from one field to another.
Limitations
This investigation has a few limitations. First, this study does not (indeed,
cannot) presume to act as a predictive model. It measures what exists now, but cannot
definitively state that raters will evaluate in this way or that. This study does not consider
questions of gender, racial or other forms of favoritism or bias because bias is an error
that causes a rater to be unreliable. This study does not seek to answer why the raters are
reliable or not, but only to establish a repeatable method for determining inter-rater
reliability. Therefore, this study does not claim to be exhaustive. It is a foundational
study that seeks only to show that inter-rater reliability studies can be adapted from other
fields and made useful for aviation research, and it uses the instructors of the flight school
as test subjects.
It cannot be over-emphasized that this study investigates neither the student pilots
nor their performance. The student pilots and their performance are only means to the
end of examining inter-rater reliability. Whether a student pilot is a good pilot or a poor
pilot is entirely moot. This study investigates how reliably the raters rate the flight
performances, not the flight performances or the students who flew them.
Finally, there were budgetary limitations. This study was funded entirely by the
researcher. Much of the equipment used, as listed below, belonged to the flight school.
However, the researcher paid for the video camera, accessories and the computer used to
transfer the footage to DVD from personal funds.
Equipment Used
The following equipment was used to complete this project:
• Elite brand Personal Computer Aircraft Training Device (PCATD);
• a computer projector and a movie screen;
• a Sony DCR-HC28 video camera, used to record the flight instruments (computer simulated instrument panel); the camera was equipped with a FireWire output in order to transfer the recorded footage to the hard drive of a computer;
• an iMac personal computer with iMovie HD and iDVD, used to organize the recorded footage and create DVDs for the raters (instructor pilots) to view; and
• a PC, projector and movie screen for showing the DVDs to the raters.
Chapter Summary
In order to ensure that students are scored fairly and consistently, flight schools
must consider the inter-rater reliability of their instructor pilots. This study describes the
method for testing inter-rater reliability of flight school instructors that the researcher
developed and discusses the research on which this method is based.
CHAPTER 2
LITERATURE REVIEW
This chapter is a review of literature related to inter-rater reliability. The chapter
begins by establishing the background of inter-rater reliability: explaining what inter-rater
reliability is and discussing a coefficient used to measure inter-rater reliability. The
coefficient discussed, kappa, is the one used to analyze the data in this study. The rest of
the chapter focuses on how, in the absence of inter-rater reliability studies in aviation,
inter-rater reliability studies have been used in other fields, such as sports, psychology,
health care and education.
Background
Inter-rater reliability measures the extent of agreement between two or more
individual raters. Inter-rater reliability is used to measure the consistency of a scoring or
rating system, and those who use it (DeVellis, 2005; Trochim, 2001). Since this study
seeks to establish the inter-rater reliability of instructor pilots, it is helpful to have some
background on inter-rater reliability and how it has been used.
In his 2005 entry into the Encyclopedia of Social Measurement, Robert F.
DeVellis managed to pack extensive information into a few short pages. DeVellis reports
that there are two influences at work in the process of measuring scores: (1) the true
score of the object, person, event, or other phenomenon being measured, and (2) error
(i.e. everything other than the true score of the phenomenon of interest) (p. 315). In
Chapter One, Introduction, true score was referred to as objective performance. Error can
be influenced by the instructor's perception. Or, rather, the instructor's perception is
susceptible to error, thus the disconnect between the true score (objective performance)
and the recorded score (instructor's perception). Error is simply a phenomenon to be
dealt with through statistical processes and analysis. This investigation seeks to measure
rater error. It does not study what errors are, why errors exist, or the moral implications
of error.
The purpose of the kappa statistic is to account for and eliminate agreement by
chance (chance being a type of error) so that the researcher can get a clearer idea of
how much agreement there really is between raters. The coefficient, then, distinguishes
between purposeful agreement and accidental agreement. In a reliability formula, the
quantified true score is the numerator, while the total of the quantified true score and
the quantified possible error is the denominator. Thus, whatever reliability coefficient is
used, it is "the ratio of variability ascribable to the true score relative to the total
variability of the obtained score" (DeVellis, 2005). Or, in the terms chosen for this
investigation, it is the ratio of the pilot's objective performance to the instructor's recorded
perception of that performance. In this study, it is assumed that any disconnect in the
relationship between the pilot's performance (true score) and the instructor's recorded
perception (obtained score) is due to the raters, not the pilot.
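Stated symbolically, and assuming the classical view that an obtained score is the sum of a true score and an independent error component (an assumption introduced here for illustration; the symbols are not DeVellis's own notation), the ratio can be sketched as:

```latex
\[
\text{reliability}
  = \frac{\sigma^{2}_{\text{true}}}{\sigma^{2}_{\text{obtained}}}
  = \frac{\sigma^{2}_{\text{true}}}{\sigma^{2}_{\text{true}} + \sigma^{2}_{\text{error}}}
\]
```

Under this reading, the ratio approaches 1 as rater error shrinks and approaches 0 as rater error dominates the recorded scores.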
The way to find this coefficient, then, is to measure rater against rater rather than
pilot against rater. Each rater observed the exact same flight performances. Therefore,
the raters ought to record identical scores. In practice they may or may not. This is why
one performs an inter-rater reliability study, to discover these discrepancies between true
score and obtained score, should discrepancy (error) exist.
Cohen's Kappa
In the late 1950s and throughout the 1960s, Jacob Cohen conducted seminal
research focusing on inter-rater reliability. Cohen proposed a coefficient represented by
the Greek letter kappa (κ) as the standard coefficient for inter-rater reliability, with .70
being considered reliable. This is not merely a 70% agreement, because agreement can
happen by chance. Instead, kappa accommodates the expected frequency of ratings, thus
eliminating mere chance agreement (Cohen, 1960; Gwet, 2002b).
Cohen's original article, "A coefficient of agreement for nominal scales," which
appeared in the journal Educational and Psychological Measurement, explains the
kappa coefficient and raises three points that are foundational to inter-rater reliability:
1. The units are independent.
2. The categories of the nominal scale are independent, mutually exclusive, and
exhaustive.
3. The judges operate independently. (Cohen, 1960, p. 37)
Dr. Kalim Gwet's paper explaining Cohen's kappa gave additional information
not presented in Cohen's article, such as explaining how to use Cohen's kappa step by
step. Gwet's work gave much inspiration to this investigation and the methodology he
describes has been adapted for use in this project. What follows is a brief paraphrasing of
the methodology provided by Cohen, as explained by Gwet (2002b).
Two raters observe three species of turtles. They are told to identify the species to
which each turtle belongs (y, r or c). Thirty-six turtles are observed and the raters tally
their judgments in a three-by-three table. (Three, because y, r and c.)
If Rater 1 claims Y and Rater 2 claims R, then the tally goes in the box that
corresponds with Y/R: first column, second row. If both raters claim R, then the tally
goes into the R/R box in the middle of the table: second column, second row. And so on.
The row and column tallies were then totaled in order to ensure that the correct number of
observations, 36, was recorded. The total number of agreements is calculated by
summing the values of the diagonal cells of the table: a = 9 + 8 + 6 = 23 (Gwet, 2002b).
Figure 1 shows Gwet's contingency table. The cells showing agreement (Y/Y, R/R and
C/C) are shaded.
Figure 1. Contingency Table Highlighting Agreement Cells (Gwet, 2002b)
(Agreement cells, shaded in the original, are shown in brackets.)

                          Rater 1
                     Y      R      C     Row totals
Rater 2   Y         [9]     3      1        13
          R          4     [8]     2        14
          C          2      1     [6]        9
Column totals       15     12      9        23
Out of the thirty-six turtles observed, the raters agreed on 23 decisions, thus
making the agreement level 64%. That is not good enough because some of the
agreements may have been mere chance agreements.
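The tally of raw agreement can be sketched in a few lines of Python. This is only an illustration using the cell counts from Figure 1; the function and variable names are hypothetical and do not come from Gwet's paper.

```python
# Cell counts from Figure 1 (Gwet, 2002b).
# Rows are Rater 2's calls and columns are Rater 1's calls, in the order Y, R, C.
table = [
    [9, 3, 1],
    [4, 8, 2],
    [2, 1, 6],
]


def observed_agreement(table):
    """Return (number of agreements, total observations) for a square table."""
    agreements = sum(table[i][i] for i in range(len(table)))  # diagonal cells
    total = sum(sum(row) for row in table)                     # every tally
    return agreements, total


a, n = observed_agreement(table)
percent_agreement = a / n
print(a, n, round(percent_agreement * 100))  # 23 36 64
```

The raw figure of 64% says nothing yet about how many of those 23 agreements would have occurred by chance alone.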
In order to account for chance agreement, one must compute the expected
frequency (ef) of each agreement cell by dividing the product of its row and column
totals by the number of samples (N). Figure 2 shows the result: where the observed
agreement was a = 9 + 8 + 6 = 23, the expected frequency is ef = 5.42 + 4.67 + 2.25 =
12.34. This is the agreement expected by chance.
Figure 2. Contingency Table with Chance-Corrected Agreement (Gwet, 2002b)

                          Rater 1
                     Y           R           C        Row totals
Rater 2   Y       9 (5.42)       3           1            13
          R          4        8 (4.67)       2            14
          C          2           1        6 (2.25)         9
Column totals       15          12           9        23 (12.34)
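The expected frequencies in Figure 2 can be reproduced with a short Python sketch. The helper name is hypothetical; the only inputs are the cell counts from the figure. Note that summing the rounded cell values gives the 12.34 shown in the figure, whereas the unrounded sum is closer to 12.33.

```python
# Cell counts from Figure 2 (Gwet, 2002b); rows = Rater 2, columns = Rater 1.
table = [
    [9, 3, 1],
    [4, 8, 2],
    [2, 1, 6],
]


def expected_diagonal(table):
    """Expected frequency of each agreement cell: row total * column total / N."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [row_totals[i] * col_totals[i] / n for i in range(len(table))]


ef_cells = expected_diagonal(table)
rounded_cells = [round(value, 2) for value in ef_cells]
print(rounded_cells)                 # the three expected agreement frequencies
print(round(sum(rounded_cells), 2))  # sum of the rounded cells
```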
To find kappa, then, one divides the difference of a minus ef by the difference
of N (number of samples) minus ef (sum of expected frequency). That is:
κ = (a - ef) / (N - ef) = (23 - 12.34) / (36 - 12.34) = .45
Kappa is evaluated next. As was stated above, a kappa of .70 or greater is
considered satisfactory; less than .70 is not. This example has a kappa of .45, denoting
rather weak inter-rater reliability.
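Putting the steps together, the whole calculation can be sketched as a single Python function. This is a minimal illustration of the formula as described above, with hypothetical names, not code taken from Cohen or Gwet.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square contingency table of two raters' tallies."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    size = len(table)
    a = sum(table[i][i] for i in range(size))                          # observed agreements
    ef = sum(row_totals[i] * col_totals[i] / n for i in range(size))   # chance agreements
    return (a - ef) / (n - ef)


# Gwet's turtle example; rows = Rater 2, columns = Rater 1, in the order Y, R, C.
turtles = [
    [9, 3, 1],
    [4, 8, 2],
    [2, 1, 6],
]

kappa = cohens_kappa(turtles)
print(round(kappa, 2))  # 0.45
```

The function is undefined when N equals ef, which happens only if both raters place every sample in the same single category.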
In this case, Gwet's recommendation was to retrain the raters to recognize the
species better. Specifically, the raters had trouble with two species in particular, thus
Gwet recommended that the raters focus on correctly discriminating between these two types
of turtles, in order to improve inter-rater reliability (Gwet, 2002b).
Gwet's explanation of Cohen's kappa showed two raters with thirty-six samples
of three species. The current inter-rater reliability study has four raters judging ten flight
performance samples on a scale of 1 to 5. Chapter 3, Method, will discuss the application
of Cohen's kappa to this project.
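Because Cohen's kappa compares two raters at a time, a design with four raters yields six rater pairs. The following Python sketch shows one way such a pairwise analysis might be organized. The scores are invented purely for illustration and are not the data collected in this study; all names are hypothetical.

```python
from itertools import combinations

CATEGORIES = [1, 2, 3, 4, 5]  # the 1-to-5 scoring scale

# Hypothetical scores for ten flights; NOT the study's data.
scores = {
    "Rater 1": [3, 4, 2, 5, 3, 1, 4, 2, 3, 5],
    "Rater 2": [3, 4, 2, 4, 3, 1, 4, 3, 3, 5],
    "Rater 3": [2, 4, 2, 5, 3, 2, 4, 2, 3, 4],
    "Rater 4": [3, 3, 2, 5, 4, 1, 4, 2, 3, 5],
}


def contingency(first, second, categories=CATEGORIES):
    """Tally paired scores: columns follow the first rater, rows the second."""
    index = {category: i for i, category in enumerate(categories)}
    table = [[0] * len(categories) for _ in categories]
    for s1, s2 in zip(first, second):
        table[index[s2]][index[s1]] += 1
    return table


def cohens_kappa(table):
    """Cohen's kappa: (a - ef) / (N - ef) for a square contingency table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    a = sum(table[i][i] for i in range(len(table)))
    ef = sum(r * c / n for r, c in zip(row_totals, col_totals))
    return (a - ef) / (n - ef)


pairs = list(combinations(scores, 2))  # six pairs for four raters
kappas = {
    (r1, r2): cohens_kappa(contingency(scores[r1], scores[r2]))
    for r1, r2 in pairs
}

for (r1, r2), value in kappas.items():
    print(f"{r1} versus {r2}: kappa = {value:.2f}")
```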
Remarkably enough, Gwet's article explaining Cohen's kappa coefficient was
later followed by a second article on why kappa is insufficient (Gwet, 2002a). However
interesting Gwet's argument is regarding kappa's insufficiency and his alternative
coefficient's merits, the researcher did not find Gwet's alternative method in literature
other than his own, whereas the researcher found Cohen's kappa coefficient used
extensively. Therefore, Gwet's criticism of kappa is mentioned here only to make the
reader aware that there are other means (other coefficients) of determining inter-rater
reliability. This study uses Cohen's kappa, since it is widely accepted, while Gwet's new
coefficient is not.
Inter-rater reliability in Sports
Flying and sports are related activities in that they are both simultaneously
physical and mental, or psychomotor, to denote the inseparability between the physical
and mental aspects. Being physical acts, they can be measured. And being measurable,
they can be used in an inter-rater reliability study.
One such study, "Development of an Instrument to Assess Jump-Shooting Form in
Basketball" (Lindeman, Libkuman, King, & Kruse, 2000), examined the physical form
and movements of a jump shot. Basketball coaches have written books that discuss what
proper shooting form is, and the study used that information to create an instrument for
assessing jump-shots. Four raters then viewed video tapes of 32 shooters and rated the
shooters' form and movement according to the instrument developed. The conclusion
was that the instrument may help discern a correlation between the shooter's form and the
shooter's success rate.
The jump shot study shows the validity of an inter-rater reliability study when
observing psychomotor activity. By analogy, then, an inter-rater reliability study is likely
valid when observing flight performances, because it, too, observes psychomotor activity.
Inter-rater reliability in Psychology
Inter-rater reliability studies are often used in psychology to determine if scales
and other methods of measuring patient behavior are reliable means of assessment. These
studies have been used to assess rating scales and assessment methods related to sleep
disorders (Ferri, Bruni, Miano, Smerieri, Spruyt & Terzano, 2005), mental capacity
(Raymont, Buchanan, David, Hayward, Wessley & Hotopf, 2006), agoraphobia
(Schmidt, Salas, Bernert & Schatschneider, 2005), delusions (Bell, Halligan & Ellis,
2006; Meyers, English, Gabriele, Peasley-Milkus, Heo, Flint, et al., 2006), social
dysfunction in schizophrenia and related illnesses (Monroe-Blum, Collins, McCleary, &
Nuttall, 1996), and other means of rating psychological disorders (Drake, Haddock,
Terrier, Bentall & Lewis, 2007).
Using inter-rater reliability studies to validate psychological testing is not limited
to the United States. It is used in China (Leung & Tsang, 2006), Korea (Joo, Joo, Hong,
Hwang, Maeng, Han, et al., 2004), Japan (Kaneda, Ohmoria & Fujii, 2001), in the Arabic
language (Kadri, Agoub, El Gnaoui, Mchichi Alami, Hergueta & Moussaoui, 2005),
Turkey (Tural, Fidaner, Alkin & Bandelow, 2002), Greece (Papavasiliou, Rapidi, Rizou,
Petrapoulou & Tzavara, 2007; Kolaitas, Korpa, Kolvin & Tsiantis, 2003), and France
(Thuile, Even, Friedman & Guelfi, 2005). In all of these articles, scales or other methods
of assessment were tested, or foreign language translations of English language scales
and methods of assessment were tested and validated using inter-rater reliability studies.
It seems, then, that inter-rater reliability studies serve a very useful purpose in
determining the validity of scoring or rating rubrics. Thus, one can surmise that an inter-
rater reliability study may be very useful to a flight school that needs to measure the
reliability of its raters and scoring rubrics.
Inter-rater reliability in Health Care
Training health care practitioners also has parallels to training pilots. Both health
care practice and the practice of flying require both mental aptitude and the physical
skills to carry out their mentally-driven tasks. This fact is true for the entire gamut of
health care practitioners from nurses to surgeons and the gamut of pilots from the simple
sport (ultra-light) pilot to captains of 747s. All of the individuals in these vast and
diverse groups require a level of mental and physical harmony that demands high-level
training. This training regimen is ready-made for inter-rater reliability studies.
Research regarding nursing in triage units verified that live experiments may be
more reliable than paper-based exercises. "Triage tool inter-rater reliability: a
comparison of live versus paper case scenarios" (Worster, Sardo, Fernandes, Eva, &
Upadhy, 2007) shows that the kappa was acceptable in both live and paper cases;
however, the correlation in live cases was much higher (.90 live, versus .76 on paper).
Therefore it seems that it is better to test inter-rater reliability of instructor pilots with a
live flight scenario rather than a paper-based scenario.
Paper-based scenarios would have been easy enough to create for the
instructor/raters being investigated, but as this triage nursing study makes clear, live is
more desirable because it is more reliable. The researcher did not conduct this present
investigation live due to physical constraints of aircraft and budgetary constraints.
Instead, the performances that the raters observed were captured on video for viewing at
another place and time, which is consistent with other studies reviewed in this chapter.
Bann, Davis, Moorthy, Munz, Hernandez, Khan, Datta, and Darzi (2005) studied
11 surgical trainees and put them through a 15-minute, six-station rotation of basic
surgical tasks. Each trainee performed the six-station rotation on five separate occasions
for a total of 90 minutes of observation. All of the trainees' performances were video
recorded for later review. The six tasks each had criteria determining what makes a
trainee competent or not at that task. For example, in the suturing task, trainees were
rated on the time taken and total number of movements used to complete the task
(Bann, et al., 2005). The trainees were further rated on the quality of the suture, based on
the squareness and orientation of the knots. The authors emphasized that their measuring
instrument was able to discern both quantity and quality of work.
The researchers used the Spearman correlation coefficient (rho) in their statistical
analysis, which is used to examine correlations between sittings (Bann, Davis,
Moorthy, Munz, Hernandez, Khan, Datta, & Darzi, 2005). (Since neither the pilots nor
the raters sit for their part of the study more than once, there will not be any improvement
to measure. Therefore, rho is not necessary to this study.) On the other hand, the
researchers used Cronbach's alpha coefficient to test a number of internal consistency
measures; these included the inter-rater reliability of video assessment and intra-task
reliability (Bann, et al., 2005). The result of this experiment was that video assessment is
indeed a reliable means of assessing performance. Yet another study concluded that inter-rater
reliability of videotaped cases was excellent, having a coefficient of .93 (Hulsman,
Mollema, Oort, Hoos & de Haes, 2006).
In a rare example, James D. Michelson, MD drew a direct parallel between
medicine and aviation. Moreover, Michelson specifically cites the usefulness and
ubiquity of simulator training in aviation, and suggests that more and better simulators be
developed in the training of orthopedic surgeons. Michelson cites other studies that
suggest good, but not perfect, correlation (Michelson, 2006) and later suggests that
simulator-based competency standards be developed, predicting that they will likely come
built into the software packages of off-the-shelf simulators in the future. One benefit of using
simulators is that they are asynchronous. That is, a resident doctor need not have a
supervisor present during training if using a simulator. Furthermore, the data collected
during the simulation can be reviewed by more than one supervisor or rater
independently, meeting Cohens third requirement that raters perform their duties
independently (Cohen, 1960).
Inter-rater reliability studies are not used solely in the training of health care
professionals, but also to verify the rubrics for various cases such as rating the
effectiveness of out-of-hospital CPR (Rittenberger, Martin, Kelly, Roth, Hostler, &
Callaway, 2006) and for rating the severity of rosacea (Bamford, Gessert, & Renier,
2004). The authors of the rosacea article admitted that when the scale ranged from 1 to
10, the inter-rater reliability coefficient indicated unreliable rating. But when the scale
was reduced to a range from 1 to 5, the inter-rater reliability coefficient was much
greater, indicating reliability.
Inter-rater reliability is also used a great deal in physiotherapy. Holey and
Watson (1995) provided a stark example of why kappa, rather than mere percentage of
agreement, is necessary when performing an inter-rater reliability study. In some
cases the percentage of agreement was 100%, while the kappa coefficient, which
accounts for chance agreement, was 0.01, a value indicating agreement scarcely better than chance.
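The gap between raw agreement and kappa can be illustrated with a minimal sketch. The counts below are hypothetical (they are not Holey and Watson's data), and the function name is an assumption made for illustration. Two raters each mark 95 of 100 cases as "present," so nearly all of their agreement is what chance alone would produce.

```python
def agreement_and_kappa(table):
    """Return (proportion of agreement, Cohen's kappa) for a square table of counts.

    table[i][j] is the number of cases that Rater A placed in category i
    and Rater B placed in category j.
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    # Observed agreement: share of cases on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(table[i]) for i in range(k)]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance, from the raters' marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical counts: rows are Rater A (present, absent), columns are Rater B.
hypothetical_table = [[90, 5],
                      [5, 0]]

observed, kappa = agreement_and_kappa(hypothetical_table)
print(f"agreement = {observed:.2f}, kappa = {kappa:.2f}")
```

With these hypothetical counts the raters agree on 90% of cases, yet kappa is about -.05, because agreement of about 90.5% would be expected by chance alone.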
Kappa has also been found useful in determining inter-rater reliability in other
studies. A study conducted by Kolt, Brewer, Pizzari, Schoo, & Garrett (2007) combined
two inter-rater reliability studies, one in which six physiotherapists and physiotherapy
students examined videotaped cases, and another that compared two live clinical sessions. The
results were that the inter-rater reliability of the first study was very high (κ = .87 to .93)
and the reliability of the second study varied from very good to good (κ = .76 to .89 and .63 to
.76). Dionne, Bybee, & Tomaka (2006) used kappa to establish moderate reliability (κ =
.55) in a study using 20 patients and 54 trained clinicians. Fifty-four is the greatest
number of raters seen in the entire literature review.
Inter-rater reliability in Education
Laura D. and William L. Goodwin wrote "An analysis of statistical techniques
used in the Journal of Educational Psychology, 1979-1983" (1985) in order to discern the
most popularly used statistical methods in educational psychology. The Journal of
Educational Psychology (JEP) is a long-established, peer-reviewed journal. Therefore, it
is understood that the statistical methods used by its contributors are useful and
appropriate for anyone doing research in a field related to educational psychology,
including this investigation.
From 1979 to 1983, 40 out of 92 reliability studies in the JEP were inter-rater
reliability studies. Inter-rater reliability studies comprised nearly half of the studies,
by far the greatest percentage. Considering how commonly researchers use inter-rater
reliability studies to establish or verify reliability in an educational setting, the Goodwins'
article indicates that performing an inter-rater reliability study at flight schools, which are
rightly considered educational institutions, is a legitimate pursuit.
A common use of inter-rater reliability studies in education is the assessment of writing. The
question of what constitutes good or bad writing cannot be answered with an inter-rater
reliability study. Instead, much like the rubrics used to rate medical observations or the
jump-shot as discussed previously, the rubrics for scoring essays must be created first by
an expert or group of collaborating experts who know what good writing is. Qualitative
characteristics must be sorted and presented in such a way that raters can quantify their
observations and opinions of the writing samples. Lee (2004) noticed that, given a
holistic scoring rubric, raters scored computer-based writing samples provided by English
as a Second Language (ESL) students far more reliably than when using paper-based (that
is to say, handwritten) writing samples. The holistic rubric included several criteria that
accounted not only for the quality of content, but also quality of expression, as
determined by the writing experts. Lee suggests that the raters may need to learn how not
to discriminate against messy handwriting, and that correcting that bias may help to make
the scores that the raters awarded more reliable.
Penny, Johnson and Gordon (2000) introduced the idea of augmenting a holistic
rubric with benchmark writing samples. Writing, like many other human activities, is
performed on a continuum. That is, one cannot easily discern discrete moments, but
must rather observe ability over the passage of time. Assigning an integer to rate a
performance (that is, shifting from a qualitative to a quantitative measuring system)
requires a snapshot, or a discrete variable. In many cases, this means assigning a rating
from 1 to 5. Inter-rater reliability studies show whether the quality of writing (or
whatever act is being rated) is being accurately translated into a quantity, which can then be
measured. Introducing benchmark papers helped those charged with assessing writing
samples to more accurately rate the quality of writing because each integer had an
exemplar to which the raters could refer. Thus, the inter-rater reliability was increased,
and may also have led to a greater external validity.
Chapter Summary
Inter-rater reliability literature is plentiful and offers researchers several methods
and many examples of how to design and execute inter-rater reliability studies. The
articles featured in this study were chosen because the fields of study all involved training
and featured psycho-motor skills that are analogous with and transferable to evaluating
pilot training.
CHAPTER 3
METHOD
This investigation was designed to assess inter-rater reliability between instructor
pilots when observing flights performed by student pilots. This study included
videotaping the performance of student pilots flying an industry-standard instrument
flight rules (IFR) pattern. The researcher transferred the footage to DVDs. Four
instructor pilots reviewed DVDs of the flight performance footage and scored the student
pilots' performances on a scale of 1 to 5. The researcher then analyzed the scores using
Cohen's kappa coefficient. The resulting coefficients are discussed in Chapter Four,
Results.
Flight Pattern
In The Pilot's Manual: Instrument Flying (Kirshner, 1990) there are several flight
patterns to choose from. The pattern used for this investigation is referred to as Pattern
D. It was chosen because it is long enough to give the raters something substantial to
score, yet not so time-consuming as to prove burdensome. An illustration of the pattern
appears in Figure 3.
Pilot Participants
Student pilots enrolled in a flight program at a four-year research university
participated by flying the aforementioned flight pattern using a PCATD. The researcher
explained to the students that they were being videotaped for the purpose of investigating
inter-rater reliability. They were assured that these scores, good or bad, would not figure
into their course average. Their identities were protected by preventing any
distinguishing features from being recorded on video. Also, the order in which the flight
performances were viewed was different from the order in which they were recorded. Thus, the
flight flown first on the day of recording might actually have been
the last flight viewed by the raters. The researcher did not collect or record any
demographic data about the student pilot participants in order to abide by the limitations
as discussed in Chapter One, Introduction.
Rater Participants
The rater-participants were selected from the pool of instructor pilots at the flight
school. All instructor pilots were offered a chance to participate and the researcher
enlisted the help of four volunteers. These instructor pilots watched and scored the
flights that are contained on the DVDs. They are the raters, whose reliability this study
investigates. Just as the student pilots who flew the pattern were assured that their
participation would not affect their scores in school, the raters were assured of their
anonymity and that their performance in this study would not impact their employment at
the flight school. Also just as with the student pilot participants, the researcher did not
collect or record any demographic data about the rater participants in order to abide by
the limitations as discussed in Chapter One, Introduction.
Figure 3 Pattern D (Kirshner, 1990)
Scoring Rubric
In order to measure inter-rater reliability, there must be an established scoring
rubric. The flight school at which this study was performed already has a scoring
rubric (an explanation of how scores are determined), which was used in this
investigation. The reader is referred to the scoring rubric in Appendix A, which explains
what the scores represent.
As stated in Chapter Two, Literature Review, there is a difference between quality
and quantity, yet in studies such as this and those in the social sciences, medical science,
and education, researchers must change qualitative performance into quantitative data in
order to perform statistical analysis. One cannot average words or put words into a
formula. Thus, words (qualities) must be transformed into numbers (quantities). There is
no analyzing "poor," "good," or "great," but one can analyze scores of 1, 3, or 5. This is
precisely the reason for this inter-rater reliability study: to determine if the student pilot
flight performance is being accurately transformed into a quantitative score according to
the scoring rubric.
Flying the Pattern
Before the student pilots sat at the PCATD, the researcher briefed them. The
pattern is rather complex, and the researcher gave oral instructions where necessary,
depending on the skill of the student pilot. As stated in Chapter One, Introduction, this study is
not investigating the student pilots. Therefore, the student pilots' ability to perform the
flight pattern well or poorly is immaterial. What this study investigates is whether the
raters agree about the student pilots' performance. Therefore, helping a lesser-skilled
student pilot complete the pattern does not affect the inter-rater reliability. The raters
were entirely unaware of which students referred to the pattern and which students
performed the pattern from memory. After the flight was finished, there was a
debriefing. See the script in Appendix B.
Experiment Execution
After the flight patterns were recorded, it was time to test the raters. The raters
viewed the DVDs in a controlled environment so as not to influence or be influenced by
other raters, just as specified by Cohen (1960). The raters were then asked to score the
student pilots' performances according to the scoring rubric. After the raters scored the
student pilots' performances, the researcher analyzed the data.
Cohen's coefficient kappa is derived using only two raters. Several studies cited
in Chapter Two, Literature Review, used only two raters, some four, some more. After a
very thorough search, the researcher could not find any research that suggests an optimal
number of raters for inter-rater reliability studies. In this study, there are four raters
because the researcher looked for an even number of raters, as most of the other studies
had, and four instructor pilots made themselves available for testing purposes.
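Because kappa compares only two raters at a time, four raters must be analyzed as a set of unordered pairs, which is why six contingency tables are needed. The following is a minimal sketch of that enumeration; the variable names are assumptions for illustration.

```python
from itertools import combinations

raters = [1, 2, 3, 4]  # the four instructor pilots, identified by number

# Unordered pairs: (1, 2) and (2, 1) count as the same pairing.
rater_pairs = list(combinations(raters, 2))

for first, second in rater_pairs:
    print(f"Rater {first} vs. Rater {second}")
print(len(rater_pairs), "contingency tables needed")
```

Using combinations rather than permutations is what keeps any pair from being replicated; four raters produce six pairings.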
The researcher used six contingency tables similar to the tables described in
Chapter Two, Literature Review, but adapted the tables to present the resultant
information in conformity with the APA style manual. (Gwet's contingency tables do not
conform to the APA manual.) The example table (Table 1) shows hypothetical Rater X
versus hypothetical Rater Y. The numbers 1 through 5 indicate the scores which raters
can give to student pilots. A score of 1 represents an unsatisfactory performance; a 2,
marginal; a 3, good; a 4, very good; and a 5, excellent, as described in Appendix A. Pairs
of scores from the ten flights (A through J) were tallied in the table according to the rules
as described by Gwet (2002b) in Chapter Two, Literature Review. That is, if Rater 1
gives a score of 3 and Rater 2 gives a score of 3, then one point will be tallied in
cell (3, 3). (In Table 1 below, the zeros are placeholders, as this is only an
example. N is 10 because the number of samples is already known.)
Table 1
Example of the Inter-rater Reliability Contingency Table Used in this Experiment
                  Rater X
Rater Y Score     1    2    3    4    5    Row Totals    a        ef
1                 0    0    0    0    0    0             0        0
2                 0    0    0    0    0    0             0        0
3                 0    0    0    0    0    0             0        0
4                 0    0    0    0    0    0             0        0
5                 0    0    0    0    0    0             0        0
Column Totals:    0    0    0    0    0    N = 10        Σa = 0   Σef = 0
The tables account for each possible pairing of raters without replicating pairs.
After each table is tallied according to Cohen's kappa method, the resultant
coefficients will then be analyzed to determine the inter-rater reliability of the instructor
pilots in comparison with each other. The row totals and the column totals should each add up to 10, which
is N, the only constant in the equation. Column a is the number of agreements. This
number is simply the cells showing agreement (e.g., 1, 1; 2, 2, etc.) brought over to a
single column. Column ef is the expected frequency. (The method to derive the ef was
discussed earlier.) At the bottom of column a and column ef are the sum of a (Σa) and the
sum of ef (Σef). In the next chapter, these tables will have beneath them the kappa
equation worked out, resulting in the kappa coefficient.
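The tallying rule and the a and ef columns can be sketched in Python. This is a minimal illustration, not the researcher's own procedure: the function name and the scores for Rater X and Rater Y are hypothetical, and the expected frequency for each agreement cell is assumed to be the row total times the column total divided by N, which is the standard calculation for Cohen's kappa.

```python
def kappa_from_scores(rater_x, rater_y, categories=(1, 2, 3, 4, 5)):
    """Tally paired scores and return (N, sum_a, sum_ef, kappa)."""
    n = len(rater_x)
    # cell[(y, x)] counts flights scored y by Rater Y (row) and x by Rater X (column).
    cell = {(y, x): 0 for y in categories for x in categories}
    for x, y in zip(rater_x, rater_y):
        cell[(y, x)] += 1

    sum_a = 0      # total of column a: flights on which the raters agreed
    sum_ef = 0.0   # total of column ef: agreements expected by chance
    for c in categories:
        row_total = sum(cell[(c, x)] for x in categories)
        col_total = sum(cell[(y, c)] for y in categories)
        sum_a += cell[(c, c)]
        sum_ef += row_total * col_total / n

    # Undefined (division by zero) if both raters give every flight the same score.
    kappa = (sum_a - sum_ef) / (n - sum_ef)
    return n, sum_a, sum_ef, kappa


# Hypothetical scores for ten flights, for illustration only.
rater_x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
rater_y = [1, 2, 3, 4, 5, 1, 2, 3, 5, 4]

n, sum_a, sum_ef, kappa = kappa_from_scores(rater_x, rater_y)
print(f"N = {n}, sum of a = {sum_a}, sum of ef = {sum_ef:.1f}, kappa = {kappa:.2f}")
```

With these hypothetical scores the raters agree on eight of ten flights, the expected frequencies sum to 2, and kappa is (8 - 2) ÷ (10 - 2) = .75.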
Chapter Summary
In summary, the methodology is as follows. The researcher enlisted student pilots
as volunteers to fly Instrument Pattern D using the Elite PCATD. A video camera
recorded the image of the simulated instrument panel on the movie screen during the
flights. After recording the student pilots' flights, the researcher transferred the footage
onto DVDs for easier viewing. Each flight was assigned a letter, A through J. The
researcher then enlisted the help of four instructor pilots to be the rater participants. The
instructor pilots watched and scored the flights in a controlled environment. When the raters
finished their task, the researcher collected their score sheets and placed the scores into
the contingency tables. The researcher then took the pertinent numbers from the tables
(those that indicate agreement) and put them into the kappa formula.
If the coefficient, kappa, is .70 or greater, the rater pairs can be said to exhibit
greater reliability; if less than .70, then the rater pairs may be said to exhibit lesser
reliability. The next chapter will discuss the results of this experiment.
CHAPTER 4
RESULTS
The experiment was conducted in a classroom equipped with a PC, projector and
movie screen. The four raters sat in the same room, but were seated far apart to prevent
communication between raters. They were given instructions and a score sheet
(Appendices C and D, respectively) and were briefed by the researcher about how to
behave during the test (i.e., no talking, gesturing, or using other means of communicating
during flights, no talking about the flights during break times, etc.). It took three hours to
watch all of the flights, including two short restroom breaks and one longer break time
during which the researcher switched from the first to the second DVD.
Raw Scores
The raters watched the flights and marked the scores on the score sheet that was
provided. The researcher collected the score sheets; the raw scores appear in Table 2
below.
Table 2
Raters' Raw Scores
         Sample Flight
Rater    A    B    C    D    E    F    G    H    I    J
1        4    5    2    1    4    3    2    3    1    5
2        4    5    1    1    4    4    2    4    1    3
3        3    3    1    1    3    4    1    4    1    2
4        3    5    1    1    3    3    2    4    1    4
At first glance, these scores appear to show good agreement, especially in sample
flights C, D, G, H and I. A brief examination of the raw scores also reveals that Rater 1
evenly distributed the scores and was the only rater to do so. Raters 2 and 4 had very similar
results, with the only disagreements being between scores of 3 and 4. Rater 3 gave the most
scores of 1, and gave no scores of 5. However, to properly analyze the data for inter-rater
reliability, these raw scores must be tallied in the contingency tables.
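These first-glance observations can be rechecked directly from the raw scores in Table 2. The following is a minimal sketch; the variable names are assumptions for illustration, and the spread of a flight is taken here to mean the highest score minus the lowest score it received.

```python
from collections import Counter

flights = "ABCDEFGHIJ"

# Raw scores from Table 2, keyed by rater number, in flight order A through J.
raw_scores = {
    1: [4, 5, 2, 1, 4, 3, 2, 3, 1, 5],
    2: [4, 5, 1, 1, 4, 4, 2, 4, 1, 3],
    3: [3, 3, 1, 1, 3, 4, 1, 4, 1, 2],
    4: [3, 5, 1, 1, 3, 3, 2, 4, 1, 4],
}

# Spread of scores per flight: 0 means all four raters gave the same score.
spread = {
    flight: max(scores) - min(scores)
    for flight, scores in zip(flights, zip(*raw_scores.values()))
}
unanimous = [flight for flight, s in spread.items() if s == 0]

# How often each rater used each score from 1 to 5.
distribution = {rater: Counter(scores) for rater, scores in raw_scores.items()}

print("Spread per flight:", spread)
print("Unanimous flights:", unanimous)
for rater, counts in distribution.items():
    print(f"Rater {rater}:", [counts[score] for score in range(1, 6)])
```

A spread of 0 or 1 marks the flights on which the raters were closest, while the per-rater counts show how each rater distributed scores across the scale.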
Contingency Tables
To analyze the data, the researcher created a series of contingency tables as
illustrated in Table 1. Tables 3 through 8 below are the contingency tables that were
used to sort and analyze the data. These tables were adapted from Gwet (2002b) in order
to conform to APA standards and to show data without the redundancy of tables as in
Gwet (2002b). Beneath each contingency table is the mathematical work used to derive
the kappa coefficient.
Table 3
Rater 1 versus Rater 2
                  Rater 1
Rater 2 Score     1    2    3    4    5    Row Totals    a        ef
1                 2    1    0    0    0    3             2        .6
2                 0    1    0    0    0    1             1        .2
3                 0    0    0    0    1    1             0        .2
4                 0    0    2    2    0    4             2        .8
5                 0    0    0    0    1    1             1        .2
Column Totals:    2    2    2    2    2    N = 10        Σa = 6   Σef = 2
Given: N = 10, Σa = 6, Σef = 2
κ = (Σa - Σef) ÷ (N - Σef) = (6 - 2) ÷ (10 - 2) = 4 ÷ 8 = .50
Table 4
Rater 1 versus Rater 3
                  Rater 1
Rater 3 Score     1    2    3    4    5    Row Totals    a        ef
1                 2    2    0    0    0    4             2        .8
2                 0    0    0    0    1    1             0        .2
3                 0    0    0    2    1    3             0        .6
4                 0    0    2    0    0    2             0        .4
5                 0    0    0    0    0    0             0        0
Column Totals:    2    2    2    2    2    N = 10        Σa = 2   Σef = 2
Given: N = 10, Σa = 2, Σef = 2
κ = (Σa - Σef) ÷ (N - Σef) = (2 - 2) ÷ (10 - 2) = 0 ÷ 8 = 0
Table 5
Rater 1 versus Rater 4
                  Rater 1
Rater 4 Score     1    2    3    4    5    Row Totals    a        ef
1                 2    1    0    0    0    3             2        .6
2                 0    1    0    0    0    1             1        .2
3                 0    0    0    0    1    1             0        .2
4                 0    0    2    2    0    4             2        .8
5                 0    0    0    0    1    1             1        .2
Column Totals:    2    2    2    2    2    N = 10        Σa = 6   Σef = 2
Given: N = 10, Σa = 6, Σef = 2
κ = (Σa - Σef) ÷ (N - Σef) = (6 - 2) ÷ (10 - 2) = 4 ÷ 8 = .50
Table 6
Rater 2 versus Rater 3
                  Rater 2
Rater 3 Score     1    2    3    4    5    Row Totals    a        ef
1                 3    1    0    0    0    4             3        1.2
2                 0    0    1    0    0    1             0        .1
3                 0    0    0    2    1    3             0        .1
4                 0    0    0    2    0    2             2        .8
5                 0    0    0    0    0    0             0        0
Column Totals:    3    1    1    4    1    N = 10        Σa = 5   Σef = 2.2
Given: N = 10, Σa = 5, Σef = 2.2
κ = (Σa - Σef) ÷ (N - Σef) = (5 - 2.2) ÷ (10 - 2.2) = 2.8 ÷ 7.8 = .38
Table 7
Rater 2 versus Rater 4
                  Rater 2
Rater 4 Score     1    2    3    4    5    Row Totals    a        ef
1                 3    0    0    0    0    3             3        .9
2                 0    1    0    0    0    1             1        .2
3                 0    0    0    3    0    3             0        .3
4                 0    0    1    1    0    2             1        .8
5                 0    0    0    0    1    1             1        .2
Column Totals:    3    1    1    4    1    N = 10        Σa = 6   Σef = 2.4
Given: N = 10, Σa = 6, Σef = 2.4
κ = (Σa - Σef) ÷ (N - Σef) = (6 - 2.4) ÷ (10 - 2.4) = 3.6 ÷ 7.6 = .47
Table 8
Rater 3 versus Rater 4
                  Rater 3
Rater 4 Score     1    2    3    4    5    Row Totals    a        ef
1                 3    0    0    0    0    3             3        1.2
2                 1    0    0    0    0    1             0        .1
3                 0    0    2    1    0    3             2        .9
4                 0    1    0    1    0    2             1        .6
5                 0    0    1    0    0    1             0        .1
Column Totals:    4    1    3    3    1    N = 10        Σa = 6   Σef = 2.9
Given: N = 10, Σa = 6, Σef = 2.9
κ = (Σa - Σef) ÷ (N - Σef) = (6 - 2.9) ÷ (10 - 2.9) = 3.1 ÷ 7.1 = .44
Summary of Results
The scores have been tallied and the kappa for each rater pair calculated. As
stated previously throughout this study, the minimum desirable kappa coefficient is .70.
The results in this study were markedly lower.
Table 9
Summary of Results
Rater Pair Kappa
Rater 1 vs. Rater 2 .50
Rater 1 vs. Rater 3 .00
Rater 1 vs. Rater 4 .50
Rater 2 vs. Rater 3 .38
Rater 2 vs. Rater 4 .47
Rater 3 vs. Rater 4 .44
Average .38
The best kappa was .50, and the worst, 0. The average kappa coefficient was .38, just
over half of the desired .70.
Although all of the rater pairings in this study fell far below .70, one rater, Rater
3, seemed the least reliable of the four. The three pairings in which Rater 3 was involved
were the least reliable, one of which had a kappa of 0, entirely unreliable. Rater 1, with
whom Rater 3 shared the kappa of 0, enjoyed the two highest reliability scores, .50, with
Raters 2 and 4.
Each rater was paired three times. When each rater's three pairings were
averaged, Rater 1 scored a .33, Rater 2, .45, Rater 3, .27, and Rater 4, .47. However,
with Rater 3 removed from the averages, so that each rater was paired only twice, Rater 1's
average rose to .50, Rater 2's to .48, and Rater 4's to .48. Among Raters 1, 2 and 4, the
scores are extremely similar (pair 1 & 2, .50; pair 1 & 4, .50; and pair 2 & 4, .47). Thus it
seems that removing Rater 3 improved the inter-rater reliability in this study. Without
Rater 3 the overall average reliability increased from .38 to .49. This is still well below
.70, but much better.
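The per-rater and overall averages can be recomputed from the kappa values reported in Table 9. The following is a minimal sketch for rechecking that arithmetic; the function name and data layout are assumptions for illustration.

```python
from statistics import mean

# Kappa values as reported in Table 9, keyed by rater pair.
kappas = {
    (1, 2): 0.50, (1, 3): 0.00, (1, 4): 0.50,
    (2, 3): 0.38, (2, 4): 0.47, (3, 4): 0.44,
}


def rater_averages(pair_kappas, exclude=None):
    """Average each rater's kappas over the pairs that rater appears in,
    skipping any pair that involves the excluded rater."""
    kept = {pair: k for pair, k in pair_kappas.items() if exclude not in pair}
    raters = sorted({r for pair in kept for r in pair})
    return {r: mean(k for pair, k in kept.items() if r in pair) for r in raters}


all_raters = rater_averages(kappas)
without_rater_3 = rater_averages(kappas, exclude=3)

print("Per-rater averages:", all_raters)
print("Per-rater averages without Rater 3:", without_rater_3)
print("Overall average:", mean(kappas.values()))
print("Overall average without Rater 3:",
      mean(k for pair, k in kappas.items() if 3 not in pair))
```

Passing a different rater number as the exclude argument shows how much any one rater influences the averages.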
The next chapter will discuss two methods to improve inter-rater reliability at the
flight school and recommendations for improving the execution of the study and further
research. The next chapter also includes a commercial application of this study.
CHAPTER 5
DISCUSSION
The resultant coefficients are such that the study did not yield good inter-rater
reliability. There must be some way to improve inter-rater reliability at the flight school.
Two suggestions are to engage in extensive recurrent training and to improve the scoring
rubric. There are also some ways to improve the technical aspects of the study and to do
further research. Finally, the researcher proposes a commercial application for this inter-
rater reliability study.
Recurrent Training
The previous chapter described the raw scores and the resultant kappa coefficients
for the four raters. These scores show low inter-rater reliability which may indicate the
need for recurrent training, which may help the flight school reinforce the scoring
criteria. In the case of Rater 3, more training would be required than for Raters 1, 2 and
4. In Sample B, while Raters 1, 2 and 4 agreed upon a score of 5, Rater 3 awarded a score
of 3. In Sample G, where all others gave a score of 2, Rater 3 gave a 1. And in Sample J,
where there was no agreement among any raters, Rater 3 gave the low score of 2. After
examining the raw scores, it is evident that the most common disagreement was between
the scores 3 and 4. It may be that Raters 1, 2 and 4 need to review the standards to help
them differentiate between performances that rate a 3 rather than a 4, while Rater 3 needs
a greater amount of training to align that rater's expectations of student performance with
flight school standards.
It may also be helpful to start training instructor pilots how to interpret the
standards used to score student pilot performance first using simple maneuvers and
working their way up to complex patterns, just as the students themselves must work
their way up from simple maneuvers to complex patterns. This recurrent training may be
of little use unless the standards are better defined through an improved scoring rubric.
Scoring Rubric Improvements
It could be that the scoring rubric needs improving. Referring again to Appendix
A, there is a disconnect between the description of the quality of performance and
quantifiable data. For example, "An Excellent (5) grade will be issued when a student's
performance far exceeds and is well above the completion standards." Unfortunately,
there is little to define exactly what makes a performance "far exceed" or fall "well above" the
completion standards. The same can be said for scores 4, 3, 2, and 1. The definitions
of the scores are too broad.
The scoring sheet (Appendix D) offered the rater the completion standards from
the lesson in which Pattern D is taught. The altitude standard asks only that a student
pilot remain within plus or minus 200 feet of the starting altitude. This standard is very
broadly defined and leaves too much open to interpretation by individual instructor pilots
and hence affects inter-rater reliability. An example of how to fine-tune the altitude
standards could include the following scores:
a score of 5 should require that the student remain within plus or minus 50 feet;
a 4, plus or minus 100 feet;
a 3, plus or minus 150 feet;
a 2, plus or minus 200 feet; and
a 1 indicates that the student violated the 200-foot limit in either direction, and
therefore is unsatisfactory.
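The proposed altitude bands can be expressed as a simple scoring function. This is a minimal sketch under two assumptions that the proposal itself does not settle: the score is based on the largest altitude deviation observed during the flight, and each band includes its upper boundary. The function name is hypothetical.

```python
def altitude_score(max_deviation_ft):
    """Map the largest altitude deviation (in feet, either direction) to a 1-5 score."""
    deviation = abs(max_deviation_ft)
    if deviation <= 50:
        return 5
    if deviation <= 100:
        return 4
    if deviation <= 150:
        return 3
    if deviation <= 200:
        return 2
    return 1  # exceeded the 200-foot limit: unsatisfactory


# Hypothetical deviations from the starting altitude, for illustration only.
for deviation in (40, -120, 200, 260):
    print(f"{deviation:+} ft -> score {altitude_score(deviation)}")
```

Because every deviation maps to exactly one score, two raters who record the same deviation cannot disagree on the altitude score, which is the kind of precision the proposal aims for.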
The other standards, heading, bank angle and airspeed, could also be redefined to
more precisely indicate how skilled the student is, rather than leaving a broad range that
is susceptible to loose interpretation. Perhaps fine-tuning the standards and
retraining the instructor pilots in these newer, more precisely defined standards
would help to improve inter-rater reliability. Fine-tuning these standards may require
further research.
Technical Improvements
Although the researcher is confident in the methodology, there can be
improvements made to how the experiment is executed on a technical level. This project
was the researchers first attempt to record video footage from a PCATD and then
transfer that footage to DVD. While the footage was usable, the quality could be
improved by recording the footage directly from the PCATD rather than through intermediate
media. The footage had to travel through several steps: from the PCATD to the
projector, to the screen, to the video camera, to the iMac, to the iMovie HD application,
to the iDVD application, to actual DVDs. The transfer from the camera to the digital movie
applications iMovie HD and iDVD is not problematic because there is no noticeable
degradation of footage from one digital source to another. Thus, removing the projector,
movie screen, and video camera from the chain would likely produce higher quality
images, making the footage easier to watch clearly. Since the raters all watched the same
footage, the footage quality does not affect the inter-rater reliability. It would only affect
inter-rater reliability if some raters watched one set of footage, and other raters watched
an improved version of the footage.
In summary, the technical execution of the project could be improved simply by
learning how to use all of the features of the iMovie HD and iDVD applications to their
fullest extent. There are other high-end software applications for video editing such as
Final Cut that should also be considered, provided the future researcher has the budget
to make these technological upgrades.
Recommendations for Further Research
With the technical improvement recommendations out of the way, this is an
opportunity to discuss the future for which this project is the foundation. As stated in
Chapter One, Introduction, this project was a foundational study, meant to lay the
groundwork and establish a method to study inter-rater reliability at flight schools that
can be used at any flight school that has the resources to carry out the experiment.
The first recommendation is to expand the number of samples, the number of
raters, or both. This researcher would also encourage a future researcher to test other
means of measuring inter-rater reliability. Chapter Two, Literature Review, cited studies
which used alpha and rho. In the interest of finding the best analytical method, alpha,
rho, and other coefficients should be tested along with the increase in samples and raters
until an agreed upon method is derived.
The second recommendation is to choose different patterns. One suggestion is to
begin testing particular maneuvers such as shallow, medium and steep turns, ascending
and descending turns, or constant airspeed climbs. These are just examples, and a future
researcher could experiment with particular maneuvers rather than entire patterns. At the
same time, one could also consider choosing from a catalog of other instrument patterns,
more or less challenging than Pattern D.
Recommendations one and two do not cast doubt on the methodology of this
study. Adding more raters might lead to more agreement, but it might also lead to more
disagreement. Likewise, adding more sample flights may or may not cause lesser or
greater reliability. What must be avoided at all costs is designing a study that is
structured to create agreement. Testing particular maneuvers rather than patterns is not
necessarily better, because performing maneuvers is just one part of flight training. The goal
of flight training is not to make a pilot proficient at performing maneuvers, but to give a pilot
such a depth of understanding and technical ability that he or she can take the
maneuvers learned through the years of training and spontaneously serialize or combine
discrete maneuvers into an organic flight that has unity from takeoff to landing.
However, more samples, raters, other patterns, and other statistical
methods all deserve to be tested for the sake of expanding our body of knowledge and for
perfecting a method that one day could become "tried and true." In short, researchers
must trust in the scientific method to continually develop better means of testing and
never rest contented with existing research.
Upon doing further research, fine-tuning the standards and processing instructors
through updated training, one may find that the method can be adapted for commercial
use.
Commercial Application of this Study
Upon testing and re-testing this experiment such that the results can be replicated
and are consistent, and the method deemed valid by a panel of experts in related fields,
this study can be developed into an instructor training program that may be created for
the commercial market and sold to flight schools.
Following some of the recommendations above, perhaps the instructor training
program could begin by evaluating maneuvers and testing reliability. Upon reaching a
kappa of .70 or greater, the instructor trainee can move on to the next phase, learning how
to evaluate simple patterns, then on to learning how to evaluate complex
patterns, and finally to learning how to reliably rate IFR check rides. The training need not happen
only using a PCATD. The method and training system must be such that as the training
progresses, the footage from the PCATD is replaced by footage from a full simulator, and
the full simulator eventually replaced by footage from an actual aircraft, because the
instructors and their students will experience training in all three media.
Future researchers who wish to apply this project to a commercial application
must establish baseline flights, just as Penny, et al. (2000) established benchmark essays for
scoring writing samples. For example, a future researcher may find that a particular
flight has been viewed by raters and they have consensus that the flight is a 3. A
researcher for a commercial developer or flight school must build up a catalog of baseline
flights that have all been tested and create a test in which the established baseline scores
are entered into the contingency table as Rater 1, while the rater currently being tested
becomes Rater 2. Thus, the future researcher or tester will place the New Rater versus
the baseline scores. A kappa of .70 or greater shows that the New Rater can score flights
reliably, while a kappa less than .70 will indicate that the New Rater needs further
instruction before being allowed to rate actual flights. The result may be that flight
schools can effectively and economically screen potential flight instructors or maintain
standards with current instructors.
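The proposed screening test can be sketched as follows. This is a minimal illustration under stated assumptions, not a finished product: the baseline scores and the New Rater's scores are hypothetical, the function names are invented for this example, and kappa is computed in the standard way from agreements and chance-expected agreements.

```python
from collections import Counter


def cohen_kappa(baseline, candidate):
    """Cohen's kappa between the baseline scores and a candidate rater's scores."""
    n = len(baseline)
    agreements = sum(b == c for b, c in zip(baseline, candidate))
    base_counts = Counter(baseline)
    cand_counts = Counter(candidate)
    # Agreements expected by chance: for each score, the product of how often
    # the baseline and the candidate used it, divided by the number of flights.
    expected = sum(base_counts[s] * cand_counts[s] for s in base_counts) / n
    return (agreements - expected) / (n - expected)


def screen_new_rater(baseline, candidate, threshold=0.70):
    """Return (kappa, passed), where passed means kappa met the threshold."""
    kappa = cohen_kappa(baseline, candidate)
    return kappa, kappa >= threshold


# Hypothetical consensus scores for ten baseline flights, and a New Rater's scores.
baseline_scores = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
new_rater_scores = [1, 2, 3, 4, 5, 1, 2, 3, 4, 4]

kappa, passed = screen_new_rater(baseline_scores, new_rater_scores)
print(f"kappa = {kappa:.3f}, ready to rate actual flights: {passed}")
```

In this hypothetical case the New Rater matches the baseline on nine of ten flights, giving a kappa of .875, which clears the .70 threshold.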
Summary
The search for valid, reliable, feasible, and fair assessments of cognitive and
human performance is, in many ways, at the very heart of educational measurement
(Penny, et al., 2000). In a very real way, instructor pilots are educators, and their
evaluations of student performance are educational measurements. The researcher sought
to find research in scientific and educational journals that would help to lay the
foundation of inter-rater reliability studies in flight training. To that end, four flight
school instructors (raters) were tested according to a methodology inspired by the
literature reviewed, with statistical analysis based upon Cohen's kappa coefficient. This
coefficient is commonly used in inter-rater reliability studies in several fields, including
education, social science, psychology, medicine, and even sports. It is used quite often in
training situations. In this study, kappa was applied to flight training, specifically to testing
instructor pilots for inter-rater reliability. Ultimately, the study indicated that inter-rater
reliability was low, with an average kappa of .38, well below the desired .70.
Nevertheless, this study was successful in that it showed a usable method for testing
inter-rater reliability in flight training and provides the basis for further research and
commercial development.
REFERENCES
Bamford, J.T.M., Gessert, C.E., & Renier, C.M. (2004). Measurement of the severity of
rosacea. [Electronic version]. Journal of the American Academy of Dermatology, 51(5), 697-703.
Bann, S., Davis, I.M., Moorthy, K., Munz, Y., Hernandez, J., Khan, M., Datta, V., &
Darzi, A. (2005). The Reliability of multiple objective measures of surgery and the role of
human performance. [Electronic version]. The American Journal of Surgery, 189, 747-752.
Bell, V., Halligan, P.W., & Ellis, H.D. (2006). Diagnosing Delusions: A review of inter-rater
reliability. [Electronic version]. Schizophrenia Research, 86, 76-79.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and
Psychological Measurement, 20(1), 37-46.
DeVellis, R.F. (2005). Inter-Rater Reliability. [Electronic version]. In Encyclopedia of
Social Measurement (Vol. 2, pp. 317-322). New York: Elsevier Inc.
Dionne, C.P., Bybee, R.F., & Tomaka, J. (2006). Inter-rater reliability of McKenzie
assessment in patients with neck pain. [Electronic version]. Physiotherapy, 92,
75-82.
Drake, R., Haddock, G., Terrier, N., Bentall, R., & Lewis, S. (2007). The Psychotic
Symptom Rating Scales (PSYRATS): Their usefulness and properties in first episode
psychosis. [Electronic version]. Schizophrenia Research, 89, 119-122.
Ferri, R., Bruni, O., Miano, S., Smerieri, A., Spruyt, K., & Terzano, M. (2005). Inter-rater
reliability of sleep cyclic alternating pattern (CAP) scoring and validation of a new
computer-assisted CAP scoring method. [Electronic version]. Clinical Neurophysiology,
116, 696-707.
Goodwin, L.D., & Goodwin, W.L. (1985). An Analysis of Statistical Techniques Used in
the Journal of Educational Psychology, 1979-1983. [Electronic version]. Educational
Psychologist, 20(1), 13-21.
Gwet, K. (2002a). Kappa statistic is not satisfactory for assessing the extent of agreement
between raters. Retrieved December 15, 2006, from
http://www.stataxis.com/files/articles/kappa_statistic_is_not_satisfactory.pdf.
Gwet, K. (2002b). Cohen's Kappa. Retrieved December 15, 2006, from
http://www-class.unl.edu/psycrs/handcomp/hckappa.pdf.
Holey, L.A., & Watson, M.J. (1995). Inter-rater reliability of connective tissue zones
recognition. [Electronic version]. Physiotherapy, 61(7), 369-372.
Hulsman, R.L., Mollema, E.D., Oort, F.J., Hoos, A.M., & de Haes, J.C.J.M. (2006). Using
standardized video cases for assessment of medical communication skills:
Reliability of an objective structured video examination by computer. [Electronic
version]. Patient Education and Counseling, 60, 24-31.
Joo, E.-J., Joo, Y.-H., Hong, J.-P., Hwang, S., Maeng, S.-J., Han, J.-H., Yang, B.-H., Lee,
Y.-S., & Kim, Y.-S. (2004). Korean Version of the Diagnostic Interview for
Genetic Studies: Validity and Reliability. [Electronic version]. Comprehensive
Psychiatry, 45(3), 225-229.
Kadri, N., Agoub, M., El Gnaoui, S., Mchichi Alami, Kh., Hergueta, T., & Moussaoui, D.
(2005). Moroccan colloquial Arabic version of the Mini International
Neuropsychiatric Interview (MINI): qualitative and quantitative validation.
[Electronic version]. European Psychiatry, 20, 193-195.
Kaneda, Y., Ohmoria, T., & Fujii, A. (2001). The serotonin syndrome: investigation
using the Japanese version of the Serotonin Syndrome Scale. [Electronic version].
Psychiatry Research, 105, 135-142.
Kirshner, W.K. (1990). The Pilot's Manual: Instrument Flying (4th ed.). Ames, IA: Iowa
State Press.
Kolaitas, J., Korpa, T., Kolvin, I., & Tsiantis, J. (2003). Letter to the Editor. [Electronic
version]. European Psychiatry, 18, 374-375.
Kolt, G.S., Brewer, B.W., Pizzari, T., Schoo, A.M.M., & Garrett, N. (2006). The Sport
Injury Rehabilitation Adherence Scale: a reliable scale for use in clinical
physiotherapy. [Electronic version]. Physiotherapy, 93(1), 17-22.
Lee, H.K. (2004). A comparative study of ESL writers' performance in a paper-based and
a computer-delivered writing test. [Electronic version]. Assessing Writing, 9, 4-26.
Leung, T.K.S., & Tsang, H.W.H. (2006). Chinese version of the Assessment of
Interpersonal Problem Solving Skills. [Electronic version]. Psychiatry Research,
143, 189-197.
Lindeman, B., Libkuman, T., King, D., & Kruse, B. (2000). Development of an
Instrument to Assess Jump-Shooting Form in Basketball. [Electronic version].
Journal of Sports Behavior, 23(4), 335-348.
Meyers, B.S., English, J., Gabriele, M., Peasley-Miklus, C., Heo, M., Flint, A.J., Mulsant,
B.H., & Rothschild, A.J. (2006). A Delusion Assessment Scale for Psychotic
major Depression: Reliability, Validity, and Utility. Biological Psychiatry, 60,
136-1342.
Michelson, J.D. (2006). Simulation in Orthopaedic Education: An Overview of Theory
and Practice. [Electronic version]. The Journal of Bone & Joint Surgery, 88-A(6),
1405-1411.
Monroe-Blum, H., Collins, E., McCleary, L., & Nuttall, S. (1996). The social dysfunction
index (SDI) for patients with schizophrenia and related disorders. [Electronic
version]. Schizophrenia Research, 20, 211-219.
Papavasilou, A.S., Rapidi, C.A., Rizou, C., Petrapoulou, K., & Tzavara, Ch. (2006).
Reliability of Greek version Gross Motor Function Classification System.
[Electronic version]. Brain & Development, 29, 79-82.
Penny, J., Johnson, R.L., & Gordon, B. (2000). The effect of rating augmentation on
inter-rater reliability: an empirical study of a holistic rubric. [Electronic version].
Assessing Writing, 7, 143-164.
Raymont, V., Buchanan, A., David, A.S., Hayward, P., Wessley, S., & Hotopf, M.
(2006). The inter-rater reliability of mental capacity assessments. [Electronic
version]. Law and Psychiatry, 30, 112-117.
Rittenberger, J.C., Martin, J.R., Kelly, L.J., Roth, R.N., Hostler, D., & Callaway, C.W.
(2006). Inter-rater reliability for witnessed collapse and presence of bystander
CPR. [Electronic version]. Resuscitation, 70, 410-415.
Schmidt, N.B., Salas, D., Bernert, R., & Schatschneider, C. (2005). Diagnosing
agoraphobia in the context of panic disorder: examining the effect of the DSM-IV
criteria on diagnostic decision-making. [Electronic version]. Behavior Research
and Therapy, 43, 1219-1229.
Thuile, J., Even, C., Friedman, S., & Guelfi, J.-D. (2005). Inter-rater reliability of the
French version of the core index for melancholia. [Electronic version]. Journal
of Affective Disorders, 88, 193-208.
Trochim, W.M.K. (2001). The Research Methods Knowledge Base (2nd ed.). Mason, OH:
Thomson.
Tural, U., Fidaner, H., Alkin, T., & Bandelow, B. (2002). Assessing the severity of panic
disorder and agoraphobia: Validity, reliability and objectivity of the Turkish
translation of the Panic and Agoraphobia Scale (P & A). [Electronic version].
Journal of Anxiety Disorders, 16, 331-340.
Worster, A., Sardo, A.A., Fernandes, C.M.B., Eva, K., & Upadhy, S. (2007). Triage tool
inter-rater reliability: a comparison of live versus paper case scenarios.
[Electronic version]. Journal of Emergency Nursing, 33(4), 319-323.
APPENDIX A
SCORING RUBRIC
APPENDIX B
BRIEFING AND SCRIPT
Brief
Thank you for participating in this inter-rater reliability study. You are not being
tested. Your upcoming flight will be scored by instructors for research purposes only.
Your performance here today will not have any effect on your scores in school. Your
name is not being recorded. Even I, the researcher, am not keeping a record of your
name or any information about you.
During this flight you will be asked to fly Pattern D from The Pilot's Manual:
Instrument Flying. Whether you have a passing or thorough knowledge of this flight
pattern is not important. I will talk you through the flight, if necessary. I will not keep
track of the time for you. I will, however, give you ample time before the next maneuver.
Remember, you are not the one being tested. This flight is being used to test your
instructors. Even though your performance is not being tested, I ask that you still try
your best just as you would in a real plane with a real instructor pilot.
Do you have any questions?
Instructions
This flight will begin with you already airborne. You are flying at 6000 feet,
straight and level, heading 360, at 130 knots cruising speed. The flight will end with you
airborne as well. Do you have any questions before we begin?
1. Begin now. Keep the aircraft straight and level for one minute.
2. At the one minute mark, turn left to heading 315.
3. When you come to heading 315, fly straight and level for one minute.
4. Turn right, 180 degrees to heading 135.
5. When you reach heading 135, fly straight and level for 30 seconds.
6. Turn right 45 degrees to heading 180.
7. When you reach heading 180, fly straight and level for 2 minutes.
8. At the two-minute mark, turn right thirty degrees to heading 210.
9. Fly straight and level for 45 seconds.
10. Turn left 210 degrees to heading 360.
11. When you reach heading 360, fly straight and level for 2 minutes.
12. Turn right 180 degrees to heading 180.
13. Fly straight and level for 2 minutes.
14. Turn right 180 degrees to heading 360. Fly straight and level for 2 minutes.
15. You have finished the flight. Please stop.
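The heading changes in the script above can be checked with a short sketch. This is an
illustration only, not part of the original study materials: the function turn and the list
legs are hypothetical names, and the 45-degree size of the first turn is inferred from the
headings (360 to 315), since the script does not state it.

```python
def turn(heading, direction, degrees):
    """Return the new compass heading (1-360) after a left or right turn."""
    delta = degrees if direction == "right" else -degrees
    new_heading = (heading + delta) % 360
    # Compass convention: north is reported as 360 rather than 0.
    return 360 if new_heading == 0 else new_heading


# (direction, degrees turned, heading named in the script) for each turn.
legs = [
    ("left", 45, 315),    # step 2 (turn size inferred, see lead-in)
    ("right", 180, 135),  # step 4
    ("right", 45, 180),   # step 6
    ("right", 30, 210),   # step 8
    ("left", 210, 360),   # step 10
    ("right", 180, 180),  # step 12
    ("right", 180, 360),  # step 14
]

heading = 360  # starting heading given in the instructions
for direction, degrees, scripted in legs:
    heading = turn(heading, direction, degrees)
    assert heading == scripted, f"script says {scripted}, arithmetic gives {heading}"

final_heading = heading
print(f"All turns are consistent; final heading is {final_heading}.")
```

Each scripted heading matches the arithmetic, and the pattern ends on the same heading on
which it began.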
Debrief
Thank you for flying this pattern. Your flight is one of many that will be used to
help us test the reliability of the instructor pilots. Although a recording of your flight has
been made, no information about you has been kept, and thus no information about you
can or will be shared.
Do you have any questions before you go?
APPENDIX C
INSTRUCTIONS TO RATERS
Brief
Thank you for being kind enough to participate in this study. We are soon going
to watch DVDs containing 10 sample flights. Before we watch these flights, I must lay
out some ground rules:
1. We will watch each flight only once.
2. Score each flight at the end of the flight. Do not wait until all flights are over to
score them all. Take each flight as it is.
3. You have been given a copy of Pattern D and the scoring rubric, which you may
refer to throughout this process. On the score sheet, there is also a brief summary
of the standards and of the scoring rubric.
4. You must not communicate with each other while watching the flights. This
includes talking, nodding, winking, gesturing, making faces, etc.
5. We will take short breaks after every two videos, and a long break at the end of
the first DVD.
6. You may talk during the break times, but you must refrain from talking about the
flights. Please keep the conversation to unrelated topics.
7. At the end of the viewing, after I have collected your score sheets, we may then
discuss any flights. You will not have the ability to change your scores.
Do you have any questions before we begin?
APPENDIX D
SCORING SHEET
Standards:
Altitude: +/- 200 feet
Heading: +/- 15 degrees
Bank angle: +/- 10 degrees
Airspeed: +/- 15 KIAS
Grading Scale:
5 Excellent
4 Very Good
3 Good
2 Marginal
1 Unsatisfactory
Note: Standards are taken directly from a lesson pertinent to Pattern D. The grading scale
is the same as described in the scoring rubric.
Give a score of 1-5 for each sample.
A B C D E F G H I J