As the call for accountability in higher education sets the tone for campus-wide learning outcomes
assessment, questions about how to conduct meaningful, reliable, and valid assessments to help students
learn have gained increased prominence. The chapter by Yen and Hynes presents the unique
contribution of a heuristic rubrics cube for authentic assessment validation. Just as Bloom’s taxonomy
maps the dimensions of cognitive abilities, the Yen and Hynes cube cohesively assembles three dimensions (the cognitive, behavioral, and affective taxonomies; the stakes of an assessment; and reliability and validity) for mapping and organizing efforts in authentic assessment. The chapter includes a
broad discussion of rubric development. As an explication of the heuristic rubrics cube, the authors examine six key studies to show how different types of reliability and validity were estimated for low-, medium-, and high-stakes assessment decisions.
Authentic Assessment Validation: A Heuristic Rubrics Cube
Jion Liou Yen, Lewis University
Kevin Hynes, Midwestern University
Authentic assessment entails judging student learning by measuring performance
according to real-life-skills criteria. This chapter focuses on the validation of authentic
assessments because empirically-based, authentic assessment-validation studies are sparsely
reported in the higher education literature. At the same time, many of the concepts addressed in
this chapter are more broadly applicable to assessment tasks that may not be termed “authentic
assessment” and are valuable from this broader assessment perspective as well. Accordingly, the
chapter introduces a heuristic, rubrics cube which can serve as a tool for educators to
conceptualize or map their authentic and other assessment activities and decisions on the
following three dimensions: type and level of taxonomy, level of assessment decision, and types
of validation methods.
What is a Rubric?
The call for accountability in higher education has set the tone for campus-wide
assessment. As the concept of assessment gains prominence on campuses, so do questions about
how to conduct meaningful, reliable, and valid assessments. Authentic assessment has been
credited by many as a meaningful approach for student learning assessment (Aitken and Pungur,
2010; Banta et al., 2009; Eder, 2001; Goodman et al., 2008; Mueller, 2010; Spicuzza and
Cunningham, 2003). When applied to authentic assessment, a rubric guides evaluation of
student work against specific criteria from which a score is generated to quantify student
performance. According to Walvoord (2004), “A rubric articulates in writing the various criteria and standards that a faculty member uses to evaluate student work” (p. 19).
There are two types of rubrics—holistic and analytic. Holistic rubrics assess the overall
quality of a performance or product and can vary in degree of complexity from simple to
complex. For example, a simple holistic rubric for judging the quality of student writing is described by Moskal and Leydens (2000) as involving four categories ranging from “inadequate” to “needs improvement” to “adequate” to “meets expectations for a first draft of a professional report”.
Each category contains a few additional phrases to describe the category more fully. On the
other hand, an example of a complex rubric is presented by Suskie (2009), who likewise describes a four-category holistic rubric for judging students’ ballet performance but utilizes up to 15 explanatory phrases. So, even though a rubric may employ multiple categories for judging
student learning, it remains a holistic rubric if it provides only one overall assessment of the
quality of student learning.
The primary difference between a holistic and an analytic rubric is that the latter breaks
out performance or product into several individual components and judges each part separately
on a scale that includes descriptors. Thus, an analytic rubric resembles a matrix comprised of
two axes—dimensions (usually referred to as criteria) and level of performance (as specified by
rating scales and descriptors). The descriptors are more important than the values assigned to
them because scaling may vary across constituent groups. Keeping descriptors constant would
allow cross-group comparison (Hatfield, personal communication, 2010). Although it takes time
to develop clearly-defined and unambiguous descriptors, they are essential in communicating
performance expectations in rubrics and for facilitating the scoring process (Suskie, 2009;
Walvoord, 2004).
To further elucidate the differences between holistic and analytic rubrics, consider how
figure skating performance might be judged utilizing a holistic rubric versus how it is judged
utilizing an analytic rubric. If one were to employ a holistic rubric to judge the 2010 Olympics
men’s figure skating based solely on technical ability, one might award a gold medal to Russian
skater Evgeni Plushenko because he performed a “quad” whereas American skater Evan Lysacek
did not. On the other hand, if one were to use an analytic rubric to judge each part of the performance separately in a matrix comprising a variety of specific jumps, with bonus points awarded for jumps later in the program, then one might award the gold medal to Lysacek rather than Plushenko. So, it is
possible that outcomes may be judged differently depending upon the type of rubric employed.
Rubrics are often designed by a group of teachers, faculty members, and/or assessment
representatives to measure underlying unobservable concepts via observable traits. Thus, a
rubric is a scaled rating designed to quantify levels of learner performance. Rubrics provide
scoring standards to focus and guide authentic assessment activities. Although rubrics are
described as objective and consistent scoring guides, rubrics are also criticized for the lack of
evidence of reliability and validity. One way to rectify this situation is to conceptualize
gathering evidence of rubric reliability and validity as part of an assessment loop.
Rubric Assessment Loop
Figure 1 depicts three steps involved in the iterative, continuous quality improvement
process of assessment as applied to rubrics. An assessment practitioner may formulate a rubric
to assess an authentic learning task, then conduct studies to validate learning outcomes, and then
make an assessment decision that either closes the assessment loop or leads to a subsequent
round of rubric revision, rubric validation, and assessment decision. The focus of the assessment
decision may range from the classroom level to the program level to the university level.
Figure 1. Rubric Assessment Loop.
While individuals involved in assessment have likely seen “feedback loop” figures reminiscent
of Figure 1, these figures need to be translated into a practitioner-friendly form that will take
educators to the next level conceptually. Accordingly, this chapter introduces a heuristic, rubrics
cube to facilitate this task.
Rubrics Cube
Figure 2. Rubrics cube.

Figure 2 illustrates the heuristic, rubrics cube where height is represented by three levels of assessment stakes (assessment decisions), width is represented by two methodological
approaches for developing an evidentiary basis for the validity of the assessment decisions, and
depth is represented by three learning taxonomies. The cell entries on the face of the cube
represented in Figure 2 are the authors’ estimates of the likely correspondence between the
reliability and validity estimation methods minimally acceptable for each corresponding
assessment stakes level. Ideally, the cell entries would capture a more fluid interplay between
the types of methods utilized to estimate reliability and validity and the evidentiary argument
supporting the assessment decision.
As Geisinger, Shaw, and McCormick (this volume) note, the concept of validity as
modeled by classical test theory, generalizability theory, and multi-faceted Rasch Measurement
(MFRM) is being re-conceptualized (Kane, 1992; Kane, 1994; Kane, 2006; Smith and
Kulikowich, 2004; Stemler, 2004) into a more unified approach to gathering and assembling an
evidentiary basis or argument that supports the validity of the assessment decision. The current
authors add that as part of this validity re-conceptualization, it is important to recognize the
resource limitations sometimes facing educators and to strive to create “educator-friendly
validation environments” that will aid educators in their task of validating assessment decisions.
Moving on to a discussion of the assessment-stakes dimension of Figure 2, the authors
note that compliance with the demands posed by such external agents as accreditors and
licensure/certification boards has created an assessment continuum. One end of this continuum
is characterized by low-stakes assessment decisions such as grade-related assignments.
Historically, many of the traditional learning assessment decisions are represented here. The
other end of the assessment-stakes continuum is characterized by high-stakes assessment
decisions. Licensure/certification-driven assessment decisions are represented here, as would be the
use of portfolios in licensure/certification decisions. Exactly where accreditation falls on the
assessment-stakes continuum may vary by institution. Because academic institutions typically
need to be accredited in order to demonstrate the quality and value of their education, the authors
place accreditation on the high stakes end of the assessment-stakes dimension of Figure 2 for the
reasons described next.
While regional accreditors making assessment decisions may not currently demand the
reliability and validity evidence depicted in the high-stakes methods cells of Figure 2, the federal
emphasis on outcome measures, as advocated in the Spellings Commission’s report on the future
of U.S. higher education (U.S. Department of Education, 2006), suggests accrediting agencies
and academic institutions may be pressured increasingly to provide evidence of reliability and
validity to substantiate assessment decisions. Indeed, discipline-specific accrediting
organizations such as the Accreditation Council for Business Schools and Programs, the
Commission on Collegiate Nursing Education, the National Council for Accreditation of Teacher
Education, and many others have promoted learning-outcomes based assessment for some time.
Because many institutions wishing to gather the reliability and validity evidence suggested by
Figure 2 may lack the resources needed to conduct high-stakes assessment activities, the authors
see the need to promote initiatives at many levels that lead to educator-friendly assessment
environments. For example, it is reasonable for academic institutions participating in
commercially-based learning outcomes testing programs to expect that the test developer provide
transparent evidence substantiating the reliability and validity of the commercial examination.
Freed from the task of estimating the reliability and validity of the commercial examination,
local educators can concentrate their valuable assessment resources and efforts on
triangulating/correlating the commercial examination scores with scores on other local measures
of learning progress. An atmosphere of open, collegial collaboration will likely be necessary to
create such educator-friendly assessment environments.
Returning to Figure 2, some of the frustration with assessment on college campuses may
stem from the fact that methods acceptable for use in low-stakes assessment contexts are
different from those needed in high-stakes contexts. Whereas low-stakes assessment can often
satisfy constituents by demonstrating good-faith efforts to establish reliability and validity, high-
stakes assessment requires that evidence of reliability and validity be shown (Wilkerson and
Lang, 2003). Unfortunately, the situation can arise where programs engaged in low-stakes
assessment activities resist the more rigorous methods they encounter as they become a part of
high-stakes assessment. For example, the assessment of critical thinking may be considered a
low-stakes assessment situation for faculty members and students when conducted as part of a
course on writing in which the critical thinking score represents a small portion of the overall
grade. However, in cases where a university has made students’ attainment of critical thinking
skills part of its mission statement, then assessing students’ critical thinking skills becomes a
high-stakes assessment for the university as it gathers evidence to be used for accreditation and
university decision-making. Similarly, while students can be satisfied in creating low-stakes
portfolios to showcase their work, they may resist high-stakes portfolio assessment efforts
designed to provide an alternative to standardized testing.
Succinctly stated, high-stakes assessment involves real-life contexts where the learner’s
behavior/performance has critical consequences (e.g., license to practice, academic accreditation,
etc.). In contrast, in low-stakes assessment contexts, the learner’s performance has minimal
consequences (e.g., pass-fail a quiz, a formative evaluation, etc.). Thus, from a stakeholder
perspective, the primary assessment function in a low-stakes context is to assess the learner’s
progress. On the other hand, from a stakeholder perspective, the primary assessment function in
a high-stakes context is to assess that the mission-critical learning outcome warrants, for
example, licensure to practice (from the stakeholder perspective of the learner) and accreditation
(from the stakeholder perspective of an institution). Lastly, a moderate-stakes context falls
between these two end points. Thus, from a stakeholder perspective, the primary assessment
function in a moderate-stakes context is to assess that moderating outcomes (e.g., work-study
experiences, internship experiences, volunteering, graduation, etc.) are experienced and
successfully accomplished.
For the low-stakes assessment level, at a minimum, a content validity argument needs to
be established. In the authors’ judgment, content, construct, and concurrent validity would typically need to be demonstrated for a validity argument at the medium-stakes assessment level, while predictive validity need not be demonstrated. In order to demonstrate a validity argument at the
high-stakes assessment level, all types of validation methods such as content validity, construct
validity (i.e., convergent and discriminant), and criterion-related validity (i.e., concurrent validity
and predictive validity) need to be demonstrated (see Thorndike and Hagen, 1977 for definitional
explanations).
With regard to inter-rater reliability estimation, for the low-stakes assessment level,
percentage of rater agreement (consensus) needs to be demonstrated. For the medium- and high-
stakes assessment levels, both consensus and consistency need to be demonstrated. Needless to
say, there are a variety of non-statistical and statistical approaches to estimate these types of
reliability and validity. For those having the software and resources to do so, the MFRM
approach can be an efficient means for establishing an evidentiary base for estimating the
reliability and validity of rubrics. The important consideration in validation is that it is the
interpretation of rubric scores upon which validation arguments are made.
Before examining the interplay of learning taxonomies with assessment decisions, a few
additional comments on methods may be helpful. First, the reliability methods portrayed in the
rubrics cube ignore such reliability estimates as internal consistency of items, test-retest, parallel
forms, and split-half. Instead, the chapter focuses on inter-rater reliability estimates because it is
critical to establish inter-rater reliability estimates whenever human judgments comprise the
basis of a rubrics score. In creating the methods dimension of the rubrics cube, the authors
adapted the Stemler (2004) distinction between consensus, consistency, and measurement
approaches to estimating inter-rater reliability. Figure 2 indicates that at a low-assessment-stakes
level, consensus is an appropriate method for estimating inter-rater reliability. Estimating the
degree of consensus between judges can be as simple as calculating the percent of agreement
between pairs of judges or as intricate as collaborative quality filtering which assigns greater
weights to more accurate judges (Traupman and Wilensky, 2004). Figure 2 also shows that as
the assessment stakes level moves to the “moderate” and “high” levels, then both consistency
and consensus are relevant methods for estimating inter-rater reliability. Among the more
popular approaches, consistency can be estimated via an intraclass correlation derived utilizing
an analysis of variance approach. In this analysis of variance, judges/raters form the between-rater factor, and the within-ratee source of variation is a function of both between-judge/rater variation and residual variation (Winer, 1971, pp. 283-289). Consistency can also be estimated utilizing an item response theory (IRT) approach (see Osterlind & Wang, this volume; Linacre, 2003). Because reliability is a necessary, but not sufficient, condition for validity, it is listed first
on the methods dimension.
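To make these two families of estimates concrete, the following sketch (in Python, using a small, fictional set of rubric scores rather than data from any study cited here) shows one common way to compute a consensus estimate, percent exact agreement between pairs of judges, and a consistency estimate, the Shrout and Fleiss ICC(2,1) intraclass correlation derived from ANOVA mean squares:

import numpy as np

# Fictional ratings: 8 student artifacts (rows) scored by 3 judges (columns)
# on a 4-point holistic rubric.
scores = np.array([
    [3, 3, 4],
    [2, 2, 2],
    [4, 4, 3],
    [1, 2, 1],
    [3, 3, 3],
    [2, 3, 2],
    [4, 4, 4],
    [1, 1, 2],
], dtype=float)
n_ratees, n_raters = scores.shape

# Consensus: percent exact agreement for each pair of judges.
for i in range(n_raters):
    for j in range(i + 1, n_raters):
        agreement = np.mean(scores[:, i] == scores[:, j]) * 100
        print(f"Judges {i + 1} and {j + 1}: {agreement:.0f}% exact agreement")

# Consistency: two-way random-effects intraclass correlation, ICC(2,1),
# built from the ANOVA mean squares (Shrout & Fleiss formulation).
grand = scores.mean()
ss_rows = n_raters * np.sum((scores.mean(axis=1) - grand) ** 2)   # ratees
ss_cols = n_ratees * np.sum((scores.mean(axis=0) - grand) ** 2)   # judges
ss_error = np.sum((scores - grand) ** 2) - ss_rows - ss_cols
ms_rows = ss_rows / (n_ratees - 1)
ms_cols = ss_cols / (n_raters - 1)
ms_error = ss_error / ((n_ratees - 1) * (n_raters - 1))
icc_2_1 = (ms_rows - ms_error) / (
    ms_rows + (n_raters - 1) * ms_error + n_raters * (ms_cols - ms_error) / n_ratees
)
print(f"ICC(2,1) consistency estimate: {icc_2_1:.2f}")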
Having discussed the contingency of the various types of validation methods on the
assessment-stakes-level dimension of the rubrics cube, a few general comments on validation
methods as they relate to rubrics are warranted here. Detailed definitions and discussions of the
types of validity are available in The Standards for Educational and Psychological Testing
(American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). In order for others to benefit from and replicate the
validation of rubrics described in the assessment literature, it is essential that the rubric scaling
utilized be accurately described. It is also essential that researchers describe the way in which
content validity was maximized (rational, empirical, Delphi technique, job analysis, etc.).
Likewise, researchers’ discussions of efforts to establish construct validity should describe any
evidence of positive correlations with theoretically-related (and not simply convenient)
constructs (i.e., convergent validity), negative or non-significant correlations with unrelated
constructs (i.e., discriminant validity), or results of multivariate approaches (factor analysis,
canonical correlations, discriminant analysis, multivariate analysis of variance, etc.) as support
for construct validity. Central to the concept of predictive validity is that rubric scores gathered
prior to a theoretically relevant, desired, end-state criterion correlate significantly and therefore
predict the desired end state criterion. At the high-stakes assessment level, assembling
information regarding all of these types of validity will enable stakeholders to make an
evidentiary validity argument for the assessment decision.
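As a simple illustration of how such correlational evidence might be assembled, the sketch below simulates fictional rubric scores and three comparison measures; the variable names and simulated relationships are illustrative assumptions, not results from any study discussed in this chapter:

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 120  # fictional cohort size

# Fictional rubric totals and comparison measures (illustrative only).
rubric = rng.normal(75, 10, n)
related_measure = rubric * 0.6 + rng.normal(0, 8, n)    # theoretically related construct
unrelated_measure = rng.normal(50, 10, n)               # theoretically unrelated construct
later_criterion = rubric * 0.4 + rng.normal(0, 12, n)   # criterion observed later in time

for label, other in [("convergent", related_measure),
                     ("discriminant", unrelated_measure),
                     ("predictive", later_criterion)]:
    r, p = pearsonr(rubric, other)
    print(f"{label:>12} validity evidence: r = {r:.2f}, p = {p:.3f}")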
With regard to predictive validity, the authors note that it is possible to statistically
correct for restriction in the range of ability (Hynes & Givner, 1981; Wiberg & Sundstrom, 2009;
also see the Geisinger et al., this volume). For example, if the adequacy of subsequent work
performance of candidates passing a licensure examination (i.e., restricted group) were rated by
supervisors and these ratings were then correlated with achieved licensure examination scores,
this correlation could be corrected for restriction in range, yielding a better estimate of the true
correlation between performance ratings and licensure examination scores for the entire,
unrestricted group (pass candidates and fail candidates).
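One widely used correction for direct restriction of range on the predictor is the classic Thorndike Case II formula; the sketch below implements it with fictional numbers and is offered as a general illustration rather than the specific procedure used in the studies cited above:

import math

def correct_for_range_restriction(r_restricted, sd_restricted, sd_unrestricted):
    # Thorndike Case II correction for direct range restriction on the predictor;
    # returns an estimate of the correlation in the unrestricted group.
    ratio = sd_unrestricted / sd_restricted
    return (r_restricted * ratio) / math.sqrt(
        1 - r_restricted ** 2 + (r_restricted ** 2) * ratio ** 2
    )

# Fictional example: r = .25 between licensure scores and supervisor ratings
# among passing candidates only, where the licensure score SD is 6 among
# passers but 10 among all candidates.
print(correct_for_range_restriction(0.25, sd_restricted=6, sd_unrestricted=10))  # ~0.40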
Assessment Decisions (Assessment Stakes Level) by Taxonomy
The levels of each of the learning taxonomies as they relate to rubric criteria and assessment
decisions are explicated through fictional examples in the figures that follow. Figure 3 represents an assessment-decision-focused slice of the three-dimensional heuristic, rubrics cube for Bloom’s (1956) cognitive learning taxonomy involving evaluation/critical thinking (CT). Five rubric criteria are crossed with three assessment decisions in a fictional, moderate-stakes authentic assessment adapted from Smith and Kulikowich (2004).

Cognitive Taxonomy: Evaluation/Critical Thinking (CT)

Rubric Criteria*     Assessment Decisions
                     Remedial CT Training    Targeted CT Training    Advanced CT Training
Planning             -                       -                       +
Comparing            -                       -                       +
Observing            -                       +                       +
Contrasting          -                       +                       +
Classifying          -                       +                       +

*Criteria were adapted from Smith & Kulikowich (2004)
Figure 3. Rubrics cube applied to medium-stakes decisions, cognitive taxonomy.
The pluses and minuses in Figure 3 represent ‘Yes’ - ‘No’ analytic assessment decisions
regarding students’ mastery of critical thinking criteria. For example, the assessment decisions
for students who received minuses in all five CT criteria would be to take remedial CT training.
On the other hand, students who received a majority of pluses would need to take targeted CT
training until they master all criteria. Lastly, students who earned all pluses are ready to move
on to the advanced CT training.
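The decision logic of Figure 3 can be expressed as a short rule; in the sketch below, the mapping from plus/minus patterns to training decisions is an illustrative reading of the figure rather than a rule prescribed by the chapter:

def ct_training_decision(criterion_results):
    # criterion_results maps each of the five CT criteria to True (+) or False (-).
    pluses = sum(1 for mastered in criterion_results.values() if mastered)
    if pluses == len(criterion_results):
        return "Advanced CT training"
    if pluses == 0:
        return "Remedial CT training"
    return "Targeted CT training"  # partial mastery; retrain the remaining criteria

student = {"planning": False, "comparing": False,
           "observing": True, "contrasting": True, "classifying": True}
print(ct_training_decision(student))  # -> Targeted CT training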
Figure 4 represents an assessment-decision-focused slice of the three-dimensional heuristic, rubrics cube for the behavioral/psychomotor taxonomy level (operation), with two rubric criteria crossed with three assessment decisions in a fictional, authentic assessment of pilot performance adapted from a study by Mulqueen et al. (2000).

Behavioral/Psychomotor Taxonomy: Operating Equipment/Work Performance (WP)

Rubric Criteria*     Assessment Decisions
                     Repeat Simulation    Supervised Practicum    Solo Flight
Technical            1 2 3 4              1 2 3 4                 1 2 3 4
Team Work            1 2 3 4              1 2 3 4                 1 2 3 4
Decision rule**      Total < 5            Total >= 5 & < 8        Total = 8

*Criteria were adapted from Mulqueen et al. (2000)
**Decision rule represents a cutoff score for each assessment decision.
Figure 4. Rubrics cube applied to high-stakes decisions, behavioral/psychomotor taxonomy.

The 4-point scale in Figure 4
represents analytic assessment decisions regarding students’ ability to apply knowledge on the two criteria. For example, an assessment decision could be made whereby students need to repeat
simulation training if they scored a total of less than ‘5’. Similar assessment-decision logic could
apply to the supervised practicum and solo flight where advancement is based on meeting or
surpassing specified cutoff scores.
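The cutoff rule in Figure 4 translates directly into code; the sketch below simply restates the fictional decision rule (each criterion rated 1 to 4, with the total compared against the cutoffs):

def flight_training_decision(technical, team_work):
    # Each criterion is rated on a 1-4 scale, so the total ranges from 2 to 8.
    total = technical + team_work
    if total < 5:
        return "Repeat simulation"
    if total < 8:
        return "Supervised practicum"
    return "Solo flight"  # total == 8

print(flight_training_decision(technical=3, team_work=3))  # -> Supervised practicum
print(flight_training_decision(technical=4, team_work=4))  # -> Solo flight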
Figure 5 represents an assessment-decision-focused slice of the three-dimensional
heuristic, rubrics cube for the valuing level of Krathwohl’s (1964) affective taxonomy.
Affective Taxonomy: Value/Attitudes Towards Community (VATC)

Rubric Criteria      Assessment Decisions
Reflection           Self-Assessment Question: Describe how your community service has
                     affected your values and attitudes toward the community and decide
                     whether additional community service would be beneficial.

Figure 5. Rubrics cube applied to a low-stakes decision, affective taxonomy.
Figure 5 involves authentic assessment because students are involved in a real-life context and
are asked to self assess their community service experiences using holistically-judged reflection.
For some students, their self-assessment decision would be to engage in additional community
service to increase their valuing of community to the level they deem appropriate.
Assessment Stakes Level (Assessment Decisions) by Methods
Thus far, in discussing Figure 2, the learning taxonomy dimension has been discussed as
it relates to the assessment-stakes level dimension and the assessment decisions, but the methods
dimension has not yet been addressed. It is critical to note that valid assessment decisions can be
made only if the reliability and validity of the rubrics (i.e., methods dimension) involved have
been established. Without an evidentiary validity argument, the decisions may be called into
question, or in a worst-case, high-stakes assessment scenario, challenged “in a court of law”
(Wilkerson and Lang, 2003, p.3).
Central to the heuristic value of the rubrics cube is the notion that the methods employed
to estimate inter-rater reliability and validity are contingent on the assessment stakes level. If the
assessment-stakes level is low, then the authors suggest only consensus reliability (percentage of
agreement between raters) and content validity need to be established. On the other hand, as
represented by the authors in Figure 2, if the assessment-stakes level is high, then consensus and consistency inter-rater reliability as well as content, construct, and criterion-related (concurrent and predictive) validity need to be demonstrated. The realities of the medium-stakes assessment level will determine how much
evidence may be gathered in support of the assessment decision. Because the establishment of
adequate reliability estimates is a necessary condition for the establishment of validity (and,
indeed, the size of a validity coefficient is limited by the size of the reliability estimate), the
estimations of validity and reliability are both included in the methods dimension of the rubrics
cube. Currently, in the higher education literature, it can be difficult to locate exemplars where
evidentiary validity arguments have been made for rubrics. But this does not justify describing
the task of validating rubrics as an insurmountable one, because some exemplars are beginning
to surface in the higher education literature. With this in mind, Figure 6 presents six key studies
highlighting the types of inter-rater reliability and validity demonstrated for low, medium, and
high-stakes assessment decisions.
                                          Reliability           Content         Construct       Criterion-related
                                          (Consensus /          Validity        Validity        Validity
                                          Consistency)                          (Convergent /   (Concurrent /
                                                                                Discriminant)   Predictive)

Low Stakes Assessment
Study 1—Research quality                  Consistency           Yes             No              No
(Bresciani et al., 2009)

Medium Stakes Assessment
Study 2—Delphi method                     Consensus /           Yes             No              No
(Allen & Knight, 2009)                    Agreement
Study 3—Comparing generalizability &      Consistency           Not described   Yes             No
multifaceted Rasch models
(Smith & Kulikowich, 2004)

High Stakes Assessment
Study 4—Multifaceted Rasch model study    Consistency           Not described   Not described   No
of rater reliability
(Mulqueen et al., 2000)
Study 5—Performance assessment            Consistency           Yes             Yes             Yes (Concurrent)
(Pecheone & Chung, 2006)
Study 6—Minimum competency exam           Not described         Yes             Yes             Yes (Concurrent /
(Goodman et al., 2008)                                                                          Predictive)

Figure 6. Key studies highlighting stakes assessment dimension/decisions by methods.
The studies summarized in the above table are next discussed in greater detail.
Low-Stakes Assessment. Study 1—Faced with the need to judge the quality of research presentations by 204 undergraduate, master’s, and doctoral students across multiple disciplines, Bresciani et al. (2009) addressed what may be termed a low-stakes assessment task because
consequences to the student presenters were never specified and appeared to be nil. Based on an
extensive review of internal rubrics as well as rubrics for the review of publication submissions,
a multi-disciplinary team of 20 faculty members devised a rubric comprised of four content area
constructs (organization, originality, significance, and discussion/summary) and a fifth construct
for presentation delivery that were applied by judges with no formal training in the use of the
rubrics. Thus, two taxonomies (cognitive and behavioral) were involved, and the study therefore dealt with all cells of the lowest layer of the rubrics cube portrayed in Figure 2 with the exception of those associated with the affective taxonomy. In developing their five-construct rubric, the
researchers employed a 5-point Likert scale with unambiguous, construct-specific descriptors
labeling each point of the scale deemed equally applicable across multiple disciplines. Inter-rater
reliability/internal consistency was estimated using intraclass correlations which were computed
for each of 40 multiple-presenter sessions based on an inter-rater data structure where 3 to 7
judges rated 6 to 8 student presentations per session. Bresciani et al. (2009) reported moderately
high intraclass correlations. It could be argued that Bresciani et al. (2009) made good-faith efforts
to establish the content validity of their rubrics. Although the researchers did not report any
construct validity or criterion-related validity results, the validity argument they presented was
adequate for their low-stakes assessment situation.
Medium-Stakes Assessment. Study 2— In a medium-stakes assessment context, Allen
and Knight (2009) presented a step-by-step, iterative process for designing and validating a
writing assessment rubric. The study involved assessment activities concentrated on Bloom’s
cognitive taxonomy as displayed in Figure 2. It was considered a medium-stakes assessment
because the decisions had some effects on learners’ knowledge and skills but were not
determinants of critical decisions. In this study, faculty and professionals worked in
collaboration to develop agreed-upon criteria that were in accordance with intended learning
outcomes and professional competence. Using content-related evidence resulting from baseline
data, the rubric was further refined and expanded. The Delphi method, a qualitative research
methodology which extracts knowledge from a panel of experts (Cyphert & Gant, 1970), was
adopted by these authors as a means for developing group consensus to assign scoring weights
for each category in the rubric. Accordingly, the method was implemented with separate groups
of professionals and faculty in order to reach consensus on the weights of each rubric category so
as to improve the construct validity of the rubric.
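As a rough illustration of how such a consensus round might be tallied, the sketch below aggregates fictional panel proposals for category weights and applies a simple interquartile-range convergence check; the categories, weights, and convergence rule are illustrative assumptions, not Allen and Knight’s actual procedure:

import numpy as np

# Fictional weighting round: five panelists propose percentage weights for
# four hypothetical rubric categories.
proposals = {
    "organization": [30, 25, 35, 30, 25],
    "evidence":     [30, 35, 25, 30, 30],
    "style":        [20, 20, 20, 15, 25],
    "mechanics":    [20, 20, 20, 25, 20],
}

for category, weights in proposals.items():
    median = np.median(weights)
    iqr = np.percentile(weights, 75) - np.percentile(weights, 25)
    # A small interquartile range is one common signal that the panel has
    # converged; otherwise the medians are fed back for another round.
    status = "consensus" if iqr <= 5 else "another round needed"
    print(f"{category:>12}: median weight {median:.0f}%, IQR {iqr:.0f} ({status})")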
The study then utilized two-way Analysis of Variance (factorial ANOVA) techniques to
identify the sources of variability in rubric scores. Student writing samples were first grouped
into two piles based on quality of writing. Faculty and professionals used the scoring rubric to
grade writing samples that were randomly selected from each pile. Significant differences
between the average scores for the two piles of writing samples indicated that the rubric
differentiated the quality of student writing, while non-significant differences between average
scores given by faculty and professionals indicated that there was rater agreement in scoring.
The same data was reanalyzed using an ANOVA to measure rater agreement within faculty
members. Discrepancies in scores were discussed throughout the scoring process. The authors
concluded that smaller estimated variance in rubric scores existed in higher quality writing
samples. They also raised concerns over various interpretations of rubric categories that had
affected rating consistency. It was suggested that descriptors for each level of performance
needed to be added to each category in the rubric to help clarify the scoring category and guide
the scoring process. The evidentiary validity argument presented by Allen and Knight (2009)
could be improved by the inclusion of concurrent validity estimates.
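The sketch below shows the general form of such a two-way ANOVA using simulated data; the cell sizes, score values, and model specification are illustrative assumptions rather than a reproduction of Allen and Knight’s analysis:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)

# Fictional data: writing samples pre-sorted into a higher- and a lower-quality
# pile, each scored with the rubric by faculty and by professionals.
rows = []
for pile, base in [("higher", 16), ("lower", 10)]:
    for rater_group in ["faculty", "professional"]:
        for _ in range(15):  # 15 scored samples per cell
            rows.append({"pile": pile,
                         "rater_group": rater_group,
                         "score": base + rng.normal(0, 2)})
df = pd.DataFrame(rows)

# A significant pile effect suggests the rubric differentiates writing quality;
# a non-significant rater_group effect suggests faculty and professionals score
# comparably.
model = smf.ols("score ~ C(pile) * C(rater_group)", data=df).fit()
print(anova_lm(model, typ=2))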
Study 3—Two approaches, Generalizability theory (G-theory; see the Webb, Shavelson
& Steedle chapter, this volume) and Multi-Faceted- Rasch Measurement (MFRM), were used to
explore psychometric properties of student responses collected from a simulated assessment
activity involving the cognitive taxonomy (problem solving skills) and medium-stakes
assessment decision (students would be selected as the coach of a fictional kickball team). Five
questions that measured the complex problem-solving skills of observing, classifying,
comparing, contrasting, and planning were given to 44 students and scored by two judges at two
points in time with a three-month interval (Smith and Kulikowich, 2004). The authors first
adopted G-theory to explicitly quantify multiple sources of errors and the effect of errors on the
ranking order of subjects. In their first-stage generalizability analysis, estimates of the variability of four facets (subject, item, judge, and occasion) were obtained from the fully crossed, random-effects model, and these estimates were used to compute sample generalizability (G) coefficients. The second stage of analysis illustrated how different components of
measurement errors could be reduced in repeated studies for the purpose of obtaining optimal
generalizability for making decisions.
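For readers new to generalizability analysis, the sketch below estimates variance components and a relative G coefficient for a deliberately simplified one-facet, persons-by-raters design; it is not a reproduction of the authors’ four-facet analysis, and the scores are fictional:

import numpy as np

# Fictional persons-by-raters matrix (10 students, 2 judges); a one-facet
# simplification, not the four-facet design analyzed in the study.
scores = np.array([[4, 5], [2, 3], [5, 5], [3, 3], [1, 2],
                   [4, 4], [2, 2], [5, 4], [3, 4], [1, 1]], dtype=float)
n_p, n_r = scores.shape

grand = scores.mean()
ss_p = n_r * np.sum((scores.mean(axis=1) - grand) ** 2)
ss_r = n_p * np.sum((scores.mean(axis=0) - grand) ** 2)
ss_pr = np.sum((scores - grand) ** 2) - ss_p - ss_r
ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

# Expected-mean-square solutions for the variance components.
var_pr = ms_pr                        # person-by-rater interaction (plus error)
var_p = max((ms_p - ms_pr) / n_r, 0)  # person (universe-score) variance
var_r = max((ms_r - ms_pr) / n_p, 0)  # rater variance

# Relative G coefficient for decisions based on the mean of k raters, with a
# small decision (D) study showing how it changes as raters are added.
for k in (1, 2, 4):
    g_relative = var_p / (var_p + var_pr / k)
    print(f"Relative G coefficient with {k} rater(s): {g_relative:.2f}")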
These authors continued their study by employing MFRM analysis with the same data to
further estimate variance accounted for by differences in subjects, judges, items, and occasions.
Since MFRM is an extension of the basic Rasch model for which unidimensionality and local
independence are two assumptions, Smith and Kulikowich (2004) noted that if these
assumptions are met, the data should “provide a precise and generalizable measure of
performance” (p. 627). Using associated fit statistics—infit and outfit—Smith and Kulikowich (2004) investigated the extent to which the data fit the model (model-data fit) and identified the degree of consistency for each element within each facet. FACETS (Linacre, 1988), a computer
program, was used to calculate reliability of separation and chi-square statistics in providing the
information on individual elements (subjects, items, raters, occasions) within each facet (subject,
item, rater, occasion). The FACETS results indicated that planning was the most difficult item,
followed by comparing, observing, contrasting, and classifying. They further reported that
complex problem-solving skills were different from student to student. In particular, differences
in perceptions on the items were found for five students. The near-zero reliability of separation (0.00) and the non-significant associated chi-square (p = .2) indicated that judges were very consistent
in “their ratings and overall severity level, and their influence on the estimation of person
measures is minimal” (p. 635). Consistency between occasions also provided acceptable
calibration fit statistic values of 0.5 to 1.5 for the mean square, reliability of separation (0.00),
and chi-square statistics (.05). Because judges consistently rated items in accordance with item
difficulty, this supported the underlying construct of the complex problem solving skills. Also,
the Rasch model essentially establishes construct validity because any items not fitting the model
constitute instances of multidimensionality and are subject to modification or deletion.
Therefore, Smith and Kulikowich (2004) demonstrated construct validity and, indeed, indicated
that the findings may provide additional information on the hierarchy of the complex problem
solving skills. However, the evidentiary validity argument presented by Smith and Kulikowich
(2004) could be improved by describing content validity estimates and by presenting some
concurrent validity estimates.
High-Stakes Assessment. Study 4—In contrast to study 1 which provided raters with no
formal training, study 4 by Mulqueen et al. (2000) not only provided rater training but also
measured the effectiveness of the training in addition to other factors (i.e., ratee ability, task
performance difficulty, and rater severity/leniency/bias) utilizing an MFRM approach. Mulqueen et al. (2000) addressed what may be termed a high-stakes assessment task involving pilot
training (job simulation) with trainee consequences being flight certification or additional
training. Based on a series of three videotaped aircrew scenarios, airline pilot instructors were
trained to rate the teamwork and technical ability of each member of a two person crew utilizing
a four-point Likert scale with each point described by a one-word, assessment-decision-focused
descriptor (i.e., repeat-debrief-standard-excellent). Although the events being rated were only
described in general terms (teamwork and technical ability), the researchers indicated
performance was being rated and presumably involved the cognitive and behavioral taxonomies
and therefore dealt with all cells of the highest layer of the rubrics cube portrayed in Figure 2
with the exception of those associated with the affective taxonomy. The researchers did not
describe how they developed their rubric and provided no rational argument for the content or construct validity of the training program. However, Mulqueen et al. (2000) reported that the training program
displayed “separation reliability” because the three crews videotaped were judged to be low,
moderate, and high in their ability. The multi-faceted Rasch analysis identified raters who were
too lenient or too harsh. Mulqueen et al. (2000) noted that while the multi-faceted Rasch model has many advantages, the startup involves cumbersome data preparation and programming by an individual trained in multi-faceted Rasch modeling. Because Mulqueen et al. (2000) did not present an
evidentiary argument that included content, construct, and criterion-related validity estimation,
additional work is needed in order to meet the stringent validity standards associated with a high-
stakes assessment decision.
Study 5—Pecheone and Chung (2006) reported a study that proposed meeting state
credentialing mandates with utilization of authentic assessment. The study addressed a high-
stakes assessment context across all three taxonomies (cognitive, behavioral, and affective) and
therefore dealt with the top layer of the rubrics cube portrayed in Figure 2. In a pilot effort to
promote an alternative assessment to state credentialing, the Performance Assessment for
California Teachers (PACT) was created by a coalition of California colleges and universities.
Because this is a high-stakes assessment, PACT was required by the state to report reliability and
validity estimates for the measures. Pecheone and Chung (2006) defined validity as referring
“…to the appropriateness, meaningfulness, and usefulness of evidence that is used to support the
decisions involved in granting an initial license to prospective teachers” (p. 28). Thus, a series of
studies were conducted in order to collect evidence of content, construct, and concurrent validity
along with rater consistency on PACT scores from pilot samples since 2002. Using expert
judgment from teacher educators, the content representation and job-relatedness of the Teaching
Event (TE) elements (e.g. rubrics) was examined and validated. Factor analysis was performed
to determine whether or not clusters of interrelated elements loaded into hypothesized TE
categories. The results from separate factor analyses conducted in two pilot years supported the
underlying TE construct categories. To further substantiate the construct validity of PACT,
Pecheone and Chung (2007) conducted correlation analyses between mean task scores and
documented the results in a PACT Technical Report (2007). Pecheone and Chung (2006)
concluded significant correlations between mean scores across PACT tasks demonstrated that
“…scorers can differentiate their judgments of teacher competence across scoring tasks and that
there is reasonable cohesiveness across the dimensions of teaching” (p.32).
Pecheone and Chung (2006) conducted an additional two sets of studies to examine the
external validity of PACT scores, namely, their ability to validly differentiate candidates who
met minimum teaching performance standards from those who did not. In the first study, the
authors found strong agreement between TE analytic-rubric scores and holistic ratings of
candidate performance used to grant preliminary teaching credentials. The second study
reported 90% agreement between TE scores and candidate competency as evaluated by their
faculty and supervisors. Consensus estimates, obtained by calculating percent agreement
between raters, resulted in 90% to 91% level of agreement within one point (on a 1-4 point scale)
across a two-year preliminary study. In the second pilot year, the authors also investigated inter-
rater reliability for each task and the full year by using the Spearman Brown Prophecy reliability
statistics and the standard error of scoring (SES) to quantify the amount of variation associated
with raters. The inter-rater reliability for the 2003-04 year was 0.88 and across TE tasks was in
the range of 0.65 to 0.75. In addition, PACT adopted an evidence-based three-stage standard
setting model to determine cut-off scores for granting teaching credentials. Using a consensus-based process, the passing standards have been continuously reviewed and revised and were
adopted in 2007. Although Pecheone and Chung (2006; 2007) proposed a predictive validation
study to see if TE scores predicted candidate performance in a real-job context, the analysis had
yet to be implemented since PACT was not approved as a high-stakes assessment in the state of
California. To summarize, Pecheone and Chung (2006; 2007) presented a convincing
evidentiary validity argument that would be strengthened by the inclusion of the proposed
predictive validity study.
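Two of the statistics mentioned above are easy to compute directly; the sketch below illustrates “agreement within one point” and the Spearman-Brown prophecy projection with fictional ratings that are not drawn from the PACT data:

import numpy as np

# Fictional ratings from two scorers on a 1-4 rubric (not PACT data).
rater_a = np.array([3, 2, 4, 3, 1, 2, 4, 3, 2, 3])
rater_b = np.array([3, 3, 4, 2, 1, 2, 3, 3, 2, 4])

within_one = np.mean(np.abs(rater_a - rater_b) <= 1) * 100
print(f"Agreement within one point: {within_one:.0f}%")

def spearman_brown(r_single, k):
    # Projected reliability when the number of raters (or items) is multiplied by k.
    return k * r_single / (1 + (k - 1) * r_single)

print(f"Projected reliability with 2 raters: {spearman_brown(0.65, 2):.2f}")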
Study 6—In a teacher-education study involving 150 teacher candidates, Goodman et al.
(2008) also addressed a high-stakes assessment context across all three taxonomies displayed in
Figure 2 as was done in Study 5. Next, we consider the methods these researchers employed in
developing two sets of performance-based rubrics. One 20-item rubric designed to assess
professional attributes had a maximum score of 100. Although student performance was judged
by faculty members and master school-based teacher educators (SBTE), no inter-rater reliability
estimates were reported. Nor was an internal consistency reliability estimate reported in the
study. The authors reported significant concurrent validity estimated by a Pearson correlation (r
= .39) between the professional attributes rubric and portfolio scores. The authors also reported
significant predictive validity estimated by a Pearson correlation (r = .34 and r = .25) between
the professional attributes rubric and two teacher certification examinations.
The second rubric, designed to assess student teaching performance via a portfolio, had a
maximum score of 300 based on a student presentation rated by a cluster coordinator. Although Goodman et al. (2008) referred to the domains being assessed, they did not describe the
methods utilized in creating the rubric criteria for these domains, nor did they provide descriptors
for the scoring. As noted above, the authors reported significant concurrent validity estimated by
a Pearson correlation between the portfolio scores and the professional attributes rubric scores (r = .39). Goodman et al. (2008) also reported significant predictive validity estimated by a
Pearson correlation between the portfolio scores and one teacher certification examination (r =
.27). Thus, Goodman et al. (2008) presented a sound evidentiary validity argument that would be strengthened by the inclusion of reliability estimates, which in turn should improve the validity evidence.
Discussion and Summary
The issue of rubric score generalizability is one that warrants discussion. Unlike
psychological test development efforts intended to be universally applicable (such as tests
designed to measure needs or motivations), rubrics upon initial consideration appear to be
domain specific and often are drawn from small, convenience samples. If indeed, the rubrics are
domain specific, then perhaps the heuristic rubrics cube introduced in this chapter can also serve
a categorizing/coding role as was done in the discussion of the studies summarized in Figure 6.
In other words, the rubrics cube itself may serve as a rubric for judging the evidence being
generated by rubric validation studies. Through systematic recording of rubric findings
according to the rubrics cube three dimensions, generalizable findings may eventually be
deduced from the patterns of evidence that emerge from taxonomic-domain-specific findings.
Studies that directly assess the variability associated with a task facet (or a rubric facet) may be
particularly helpful in increasing the generalizability of findings.
The issue of the small, convenience samples serving as the basis of rubric validation
efforts can be addressed from a variety of approaches. For one thing, meta-studies of rubric
validation will eventually be feasible as the field progresses. Also, the findings of state-wide initiatives (such as study 5, Pecheone and Chung, 2006, and study 6, Goodman et al., 2008) begin to increase the scope of the inference space that is generalizable. A suggested role for
research sponsored by the Department of Education would be to issue requests for proposals to systematically study a high-stakes assessment “setting” facet (such as university systems, states, accreditation agency jurisdictions, etc.) for critical areas such as authentic teacher
certification efforts. It would be reasonable for consortia of colleges of education to conduct
rubric validation studies in the high-stakes assessment area of authentic teacher certification
measures. Accreditation agencies could foster these efforts by serving as a dissemination vehicle
for promising rubrics. Such efforts would help foster an educator-friendly validation
environment and would supplement efforts by institutional research and assessment offices.
The issue of fairness in the high-stakes assessment context was raised in study 5 and study 6; the authors of those studies are to be commended for testing the effects of race/ethnicity on performance scores.
The authors emphasize that in accordance with the Standards for Educational and Psychological
Testing (AERA, APA, and NCME, 1999), valid assessment rubrics should apply equally across
gender and ethnicity subgroups. Rubrics displaying gender/ethnicity subgroup differences need
to be revised per Figure 1 until they apply equally across these subgroups.
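A first, minimal check of such subgroup differences can be as simple as comparing mean rubric scores across groups; the sketch below uses simulated scores and a Welch t-test as an illustration, recognizing that a full fairness analysis would go well beyond this:

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)

# Fictional rubric totals for two demographic subgroups (illustrative only).
group_a = rng.normal(80, 8, 60)
group_b = rng.normal(79, 8, 55)

t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False)  # Welch t-test
print(f"Welch t = {t_stat:.2f}, p = {p_value:.3f}")
# A significant subgroup difference would flag the rubric for revision per
# Figure 1; a fuller fairness analysis would also examine differential item
# functioning and other sources of bias.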
In summary, learning goals are not formulated and rubrics are not utilized in theoretical
vacuums. Indeed, one way of making sense of rubrics is to acknowledge that they are
methodological tools framed by one of the assessment-stakes levels depicted in the heuristic
rubrics cube (Figure 2). Using the rubrics cube as a conceptual organizer, this chapter
systematically cited examples from the literature and compared/categorized how researchers
utilized and validated rubrics. Lastly, from a utilitarian-assessment perspective, a rubric will
have served its purpose well if accreditors (in the case of high-stakes educational assessments) or
other stakeholders accept the evidentiary argument of valid learning gathered by the rubric as the
basis for a positive assessment decision. Otherwise, the rubric will not have served its purpose
and would need to be revised as part of a continuous quality improvement process.
REFERENCES
Aitken, N., & Pungur, L. (2010). Authentic assessment. Retrieved January 11, 2010, from http://education.alberta.ca/apps/aisi/literature/pdfs/Authentic_Assessment_UofAb_UofL.PDF
Allen, S., & Knight, J. (2009). A method for collaboratively developing and validating a rubric. International Journal for the Scholarship of Teaching and Learning, 3(2).
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, D.C.: AERA.
Banta, T. W., Griffin, M., Flateby, T. L., & Kahn, S. (2009). Three promising alternatives for assessing college students' knowledge and skills. Retrieved January 3, 2010, from http://learningoutcomesassessment.org/documents/AlternativesforAssessment.pdf
Bloom, B. S. (Ed.). (1956). Taxonomy of educational objectives: Handbook 1, Cognitive domain. New York: Longman.
Bresciani, M. J., Oakleaf, M., Kolkhorst, F., Nebeker, C., Barlow, J., Duncan, K., & Hickmott, J. (2009). Examining design and inter-rater reliability of a rubric measuring research quality across multiple disciplines. Practical Assessment, Research & Evaluation, 14(12).
Cyphert, F. R., & Gant, W. L. (1970). The Delphi technique: A tool for collecting opinions in teacher education. Journal of Teacher Education, 31, 417-425.
Eder, D. J. (2001). Accredited programs and authentic assessment. In Palomba, C. A., & Banta,
T. W. (Eds.), Assessing student competence in accredited disciplines: pioneering approaches to
assessment in higher education (pp. 199-216). Sterling, Virginia: Stylus Publishing, LLC.
Geisinger, K.F., Shaw, L.F., & McCormick, C. (this volume). The validation of tests in higher
education.
Goodman, G., Arbona, C., & de Rameriz, R. D. (2008). High-stakes, minimum-competency exams: How competent are they for evaluating teacher competence? Journal of Teacher Education, 59(1), 24-39.
Hatfield, S. (personal communication, March 10, 2010).
Hynes, K., & Givner, N. (1981). Restriction of range effects on the New MCAT's predictive validity. Journal of Medical Education, 56, 352-353.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527-535.
Kane, M. (1994). Validating the performance standards associated with passing scores. Review
of Educational Research, 64 (3), 425-461.
Kane, M. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17-64). Washington, DC: American Council on Education/Praeger.
Krathwohl, D.R., Bloom, B.S., and Masia, B.B. (1964). Taxonomy of educational objectives:
Handbook II: Affective domain. New York: David McKay Co.
Linacre, J. M. (1988). FACETS. Chicago: MESA Press.
Linacre, J. M. (2003). A user's guide to Winsteps Rasch-model computer programs. Chicago.
Moskal, B. M., & Leydens, J. A. (2000). Scoring rubric development: Validity and reliability. Practical Assessment, Research & Evaluation, 7(10). Retrieved December 17, 2009, from http://PAREonline.net/getvn.asp?v=7&n=10
Mueller, J. (2010). Authentic assessment toolbox. Retrieved December 22, 2009, from http://jonathan.mueller.faculty.noctrl.edu/toolbox/rubrics.htm
Mulqueen, C., Baker, D., & Dismukes, R. K. (2000). Using multifacet Rasch analysis to examine the effectiveness of rater training. Retrieved February 8, 2010, from http://www.airteams.org/publications/rater_training/multifacet_rasch.pdf
Osterlind, S. J. & Wang, Z. (this volume). Item response theory in measurement, assessment, and
evaluation for higher education.
Pecheone, R. L. and Chung, R. R. (2006). Evidence in teacher education: the performance
assessment for California teachers (PACT). Journal of Teacher Education, 57(1), 22-36.
Pecheone, R. L. and Chung, R. R. (2007). Technical report of the performance assessment for
California teachers (PACT): summary of validity and reliability studies for the 2003-04 pilot
year. Retrieved March 24, 2010 from
http://www.pacttpa.org/_files/Publications_and_Presentations/PACT_Technical_Report_March0
7.pdf
Smith, E. V., Jr., & Kulikowich, J. M. (2004). An application of generalizability theory and
many-facet Rasch measurement using a complex problem-solving skills assessment.
Educational and Psychological Measurement, 64(4), 617-639.
Spicuzza, F. J. and Cunningham, M. L. (2003). Validating recognition and production measures
for the bachelor of science in social work. In Banta, T. W. (Ed.), Portfolio assessment: uses,
cases, scoring, and impact. San Francisco: Jossey-Bass.
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4). Retrieved February 16, 2010, from http://PAREonline.net/getvn.asp?v=9&n=4
Suskie, L. (2009). Assessing student learning. San Francisco: Jossey-Bass.
Thorndike, R. L., & Hagen, E. P. (1977). Measurement and evaluation in psychology and education. New York: Wiley.
Traupman, J., & Wilensky, R. (2004). Collaborative quality filtering: Establishing consensus or recovering ground truth? Retrieved March 22, 2010, from http://maya.cs.depaul.edu/webkdd04/final/traupman.pdf
U.S. Department of Education (2006). A test of leadership: Charting the future of U.S. higher education. Washington, D.C.
Walvoord, B.E. (2004). Assessment clear and simple: a practical guide for institutions,
departments, and general education. San Francisco: Jossey-Bass.
Webb, N. M., Shavelson, R., & Steedle, J. (this volume). Generalizability theory in assessment contexts.
Wiberg, M. and Sundstrom, A. (2009). A comparison of two approaches to correction of
restriction of range in correlation analysis. Retrieved March 22, 2010 from
http://pareonline.net/pdf/v14n5.pdf
Wilkerson, J. R., and Lang, W. S. (2003) Portfolios, the pied piper of teacher certification
assessments: legal and psychometric issues. Retrieved January 3, 2010 from
http://epaa.asu.edu/ojs/article/viewFile/273/399
Winer, B. J. (1971). Statistical principles in experimental design (2nd ed.). New York: McGraw-Hill.
Jion Liou Yen is Associate Vice President of Institutional Research and Planning at Lewis
University in Romeoville, Illinois. She provides leadership in university-wide assessment of
student learning and institutional effectiveness as a member of the Assessment Office. Her
research interests focus on college student access and persistence, outcomes- and evidence-based
assessment of student learning and program evaluation.
Kevin Hynes directs the Office of Institutional Research and Educational Assessment for
Midwestern University with campuses in Downers Grove, Illinois and Glendale, Arizona. His
research interests include educational outcomes assessment and health personnel distribution.