RUNNING HEAD: Implementation of Embedded Formative Assessments
ON THE FIDELITY OF IMPLEMENTING EMBEDDED FORMATIVE
ASSESSMENTS AND ITS RELATION TO STUDENT LEARNING
Erin Marie Furtak
Max Planck Institute for Human Development
Maria Araceli Ruiz-Primo
University of Colorado at Denver and Health Sciences Center
Jonathan T. Shemwell
Stanford University
Carlos C. Ayala1
California State University, Sonoma
Paul Brandon
University of Hawaii
Richard J. Shavelson
Stanford University
Miki Tomita
University of Hawaii and Stanford University
Yue Yin
University of Hawaii
1 Order of authors at this point is alphabetical
Abstract
Given the current emphasis in the field of educational research on conducting high-quality
experimental studies (Shavelson & Towne, 2002), of which the present study is a case in point, it is
becoming increasingly important for researchers to accompany their studies with evaluations of
the extent to which their experimental treatments are realized in classrooms; that is, to perform
studies of the fidelity of implementation of the experimental treatments. This paper compares the
form and extent of the treatment actually delivered by six middle school physical science teachers,
who participated in an experimental study of the effects of embedded formative assessment on
student learning, to the observed learning gains of their students. The study used videotaped lessons of
the treatment as its primary data source, and coded them according to each teacher’s enactment of
critical aspects of the treatment’s intended structure and processes defined by the embedded
assessments’ designers. While the codes for critical aspects of treatment structures dealt with the
timing and sequencing of the embedded assessments, codes for critical aspects of the treatment
processes included asking students to provide explanations, encouraging argumentation, and
supporting ideas with evidence. The findings on fidelity of teacher implementation were then
compared to students’ learning as measured by 38 common pre-posttest multiple-choice
achievement test items (see Yin et al., this issue).
Results indicated that teachers varied in their enactment of the experimental treatment
across almost all of the critical aspects examined (Yin et al., this issue). When teachers’
enactment of formative assessment, regardless of treatment group, was compared to the results of
student learning, there was a 0.71 correlation between treatment processes and student learning.
The results suggested that the nonsignificant results of the overall study may have been due, at
least in part, to the failure of many of the experimental-group teachers to implement the
formative-assessment treatment as intended by the study’s designers.
Introduction
Teachers in American public schools have a wide range of academic specializations and
abilities, take a variety of paths to certification, and bring their own backgrounds, beliefs, and
experiences into their classrooms (Richardson, 1996; Zumwalt & Craig, 2005). Furthermore, the
context of every classroom varies with the student population, the conditions of the school, the
community, and the district and state in which a teacher works. This means that providing any
curriculum to six different teachers across multiple schools and states will result, to a certain
degree, in six different variations upon what may have been intended in the development of the
curriculum. A similar contention can be made with respect to instructional treatments. In the
case of our project, the effectiveness of the embedded formative assessments depended not only
on the quality of the assessment prompts as they were designed (Furtak & Ruiz-Primo, 2007),
but also on how they were implemented. This means that in order to draw valid conclusions
relating to the potential of formative assessments to improve students’ learning, it is critical to
know whether teachers implemented the formative assessments as intended by their designers.
We therefore explored in this study what teachers actually did with the embedded
formative assessments, called Reflective Lessons in the experiment. In the absence of this
information, it would be difficult to determine whether the results observed in the project (see
Yin et al., this volume) can be attributed to an absence of a formative-assessment treatment
effect, a poor conceptualization of formative assessments in this study, or to an implementation
that not only varied between teachers, but also strayed considerably from what had been intended
by the assessment designers. By looking at implementation, we move beyond the mere design of
the instructional treatment to compare the form and extent of the treatment teachers actually
delivered to the observed learning gains of their students. More importantly, we identify
shortcomings in the design and implementation of embedded assessments that must be overcome
if the assessments are to be effective instruments for learning in the classroom.
In this paper, we provide one possible model for examining fidelity of treatment
implementation. Using this model as an analytic lens, we focus on the following research
questions:
(1) Were the critical characteristics of the embedded assessments implemented by the
teachers as envisioned by the Assessment Development Team and as described in the
Teacher’s Guide to the Reflective Lessons?
(2) Was implementation fidelity related to students’ learning?
In what follows we provide the framework we used to approach the study of the
implementation of the embedded formative assessments. We then describe the methodological
characteristics of the study. Finally, we present the results and the lessons learned from them.
An Approach for Measuring Fidelity of Implementation
Fidelity of implementation is generally considered to be a way of determining the
alignment between the implementation of a treatment and its original design. However, there is
no clear consensus on what exactly constitutes fidelity of implementation, and empirical
evidence on the relationship between fidelity of implementation and program success is limited
(see Dane & Schneider, 1998; Dusenbury, Brannigan, Falco, & Hansen, 2003; Ruiz-Primo,
2005). In this paper, we intend to contribute evidence that links fidelity of implementation to
treatment effectiveness in the service of better understanding the results of our project.
The process of designing, implementing, and measuring a treatment can be divided into
three categories for analysis: the intended, the enacted, and the achieved effectiveness of the
treatment (McKnight, Crosswhite, Dossey, Kifer, Swafford, Travers, & Cooney, 1987; Ruiz-
Primo, 2005).1 In this paper, we attempt to connect these three facets. To do so, we focus on two
aspects of fidelity of implementation as defined by Dane & Schneider (1998) that deal with the
extent to which the enacted curriculum matches the intended. The first aspect is adherence, or
the extent to which specified components of a program, curriculum, or treatment are delivered as
prescribed; and the second is quality of delivery, or the extent to which teachers approach a
theoretical ideal in terms of prescribed processes of a program, curriculum, or treatment (Figure
1).
--------------------------------Insert Figure 1 About Here--------------------------------
Measuring fidelity of implementation must begin by identifying the critical treatment
characteristics that are supposed to achieve its effects (Bauman, Stein, & Ireys, 1991; Ruiz-
Primo, 2005). Clear specification of what the treatment entails is necessary to ensure that the
active ingredients critical to its success are being delivered (Moncher & Prinz, 1991). We
identified the critical characteristics based on adherence and quality of delivery. While adherence
refers to the implementation of the structure of the treatment, quality describes the
implementation’s fidelity to the process of the treatment (Mowbray, Holter, Teague, & Bybee,
2003; Ruiz-Primo, 2005). Therefore, to evaluate the extent to which the teachers in the
experimental group adhered to the Reflective Lessons as intended, we focused on the structure
and the delivery process of the Reflective Lessons as defined in the intended treatment. Finally,
we used the results from the comparison of the intended and enacted treatments to guide our
interpretation of the achieved curriculum.
Defining Critical Aspects of the Embedded Assessments
As described in other papers in this issue (e.g., Ayala, this issue), the embedded formative
assessments are formal prompts inserted into the curriculum that are designed to help teachers
check student understanding at key points during instruction and reflect on the next steps needed
to move students forward in their learning. However, it is important to go beyond defining the
goals of the embedded assessments (e.g., reduce the gap between where students are and where
they should be) or describing general requirements for their administration (e.g., formative
assessments require three days to implement) so that we may define the critical operational
features or aspects of the embedded assessments. Defining the critical aspects requires a careful
analysis of the treatment at hand; in this case, a careful analysis of the envisioned or intended
structure and process of the embedded assessments and their implementation.
Adherence: Treatment Structure
In order to measure teachers’ adherence to the Reflective Lessons as the Assessment
Development Team intended, we first determined the critical aspects of the treatment using the
FAST Teacher’s Guide to the Reflective Lessons (2003, referred to as Guide). This Guide was
the primary source of information about the treatment and was used to design and carry out the
summer training program with the experimental teachers.
Two types of embedded formative assessments were used in the study (see Ayala et al.,
this issue). They varied based on the type of assessment prompt used and the structure of
implementation. Reflective Lessons Type I consisted of four formative assessment prompts
(Graph, Predict-Observe-Explain, Short-Answer, and Predict-Observe) used to assess students’
conceptions and/or mental models around why things sink and float and to support students in
fashioning increasingly coherent and evidence-based explanations of the phenomena. Reflective
Lessons Type II employed a concept map as a formative assessment prompt and focused on
checking students’ progress in understanding key concepts in the unit. Each type of Reflective
Lesson had a different structure of implementation, and thus presents different critical program
components to be measured.
In the Reflective Lesson Type I, formative assessment prompts were designed to build on
each other; therefore, it was expected that all assessment prompts would be implemented in the
sequence prescribed. Furthermore, teachers were to intersperse discussions within the sequence
of written prompts. In the case of the Predict-Observe-Explain, teachers were provided with
three possible sequences they might use to mix written work, class discussion, and teacher
demonstrations.
Based on this information about the structure of the Reflective Lessons Type I, we
considered three aspects of the treatment as critical to their effectiveness: (1) implementation of
each assessment prompt, (2) sequence in which they were to be implemented, and (3) placement
of discussions between written prompts. We also identified a fourth component, the amount of
time teachers would take to implement the prompts, not as being critical to the effectiveness of
the Reflective Lessons, but as an important piece of information about the feasibility of using the
embedded assessments in a reasonable amount of time. The Assessment Development Team
envisioned the Reflective Lessons Type I to be carried out across two to three 45-minute class
periods, although the exact amount of time teachers used was not discussed at the minute level in
the introductory workshop. Figure 2 illustrates the implementation structure and critical
components of the Reflective Lessons Type I.
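The first two critical aspects above (implementation of each prompt and sequence of implementation) lend themselves to a simple mechanical check. The following is a minimal sketch, not the procedure used in the study: it assumes that the prescribed order is the order in which the four prompts are listed in the text, and the names PRESCRIBED_ORDER and adherence_summary are hypothetical.

```python
from typing import Dict, List

# Assumption: the prescribed sequence follows the order in which the four
# Type I prompts are listed in the text. This is illustrative only.
PRESCRIBED_ORDER = ["Graph", "Predict-Observe-Explain", "Short-Answer", "Predict-Observe"]


def adherence_summary(observed: List[str]) -> Dict[str, object]:
    """Summarize adherence for one enactment of a Reflective Lesson Type I.

    `observed` is the list of prompts seen on videotape, in the order taught.
    """
    # Aspect 1: which prescribed prompts never appear in the observed record?
    missing = [p for p in PRESCRIBED_ORDER if p not in observed]
    # Aspect 2: do the prompts that were enacted follow the prescribed order?
    positions = [PRESCRIBED_ORDER.index(p) for p in observed if p in PRESCRIBED_ORDER]
    in_sequence = positions == sorted(positions)
    return {"all_prompts": not missing, "missing": missing, "in_sequence": in_sequence}
```

Under these assumptions, a teacher who skipped one prompt but kept the remaining prompts in order would be flagged as incomplete on the first aspect and adherent on the second, which keeps the two aspects separate.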
--------------------------------Insert Figure 2 About Here--------------------------------
The Guide also suggested an order of activities for the Predict-Observe-Explain assessment. We
treated this order as part of sequencing, which we expanded to include not only the sequence between
prompts, but also the sequence within prompts. Since teachers were given three options for the
sequence of activities in carrying out the Predict-Observe-Explain assessment, we viewed this
aspect as non-critical. Each sequence involved the students recording their predictions and
reasons, the teacher collecting, clustering, and posting those predictions and reasons, and the
teacher asking students to write their observations and explanations. The sequences differed in the
placement of discussions and in the point at which the students were asked to write their
observations and explanations.
In the Reflective Lesson Type II, the Guide specified that students should create concept
maps as individuals, and then combine their best ideas into a small-group concept map. The
Reflective Lesson Type II was intended to be carried out in one class period. Therefore, three of
the four aspects from Reflective Lessons Type I also were identified for this type of Reflective
Lesson: (1) implementation of prompts (i.e., individual and group concept maps), (2)
implementation in sequence, and (4) timing. Teachers were given one possible sequence for
implementing the Concept Maps: to begin by training students in the procedure for making the
map, then to have students make a map working individually, then a map in a small group, and
finally to construct a concept map as a class.
Quality of Delivery: Treatment Processes
The evaluation of the quality of the treatment processes of the Reflective Lessons focused
on the teaching strategies (as they are called in the Guide) conceived by the Assessment
Development Team and the summer teacher training. These strategies were developed to be
consistent with the models of formative assessment in scientific inquiry settings. In this paper,
we define the critical treatment processes of our Reflective Lessons in terms of the two major
formative assessment processes they embodied: making students’ thinking visible and
advancing students in their learning. We divided the first process, making students’ thinking
visible, into two critical aspects: (1) eliciting (publicly) student conceptions about sinking and
floating, and (2) tracking and clustering these conceptions in relation to each other and to our
target learning trajectory. Advancing students’ learning of the program content included three
more critical aspects: (3) helping students provide reasons for their explanations; (4)
encouraging argumentation; and (5) helping students base their claims on evidence collected
from in-class investigations. These processes are described in more detail below.
Since formative assessment assumes that teachers’ instructional actions must be based
upon what students currently know (e.g., National Research Council, 2001a), a fundamental
element of its enactment is eliciting and making public students’ conceptions. In our project,
teachers were provided with lists of strategies to help elicit students’ ideas, including asking
students to come to a consensus at their table, facilitating student presentations, taking votes, and
simple questioning in whole-class, small group, and individual settings.
Since students can quickly produce a wide range of conflicting or redundant ideas in
scientific inquiry settings, teachers can monitor students’ ideas by recording or making them
visible in some way. The Guide specifically asked teachers to track students’ conceptions and
present them in a visual manner, such as writing students’ ideas on the board, tallying votes for
predictions, or recording ideas on pieces of paper that could be moved around and compared. In
addition, teachers were specifically asked to cluster students’ conceptions, consolidating similar
ideas and summarizing them into central ideas. For instance, a teacher might collapse ideas like
‘flat things float more easily’ and ‘boats might be heavy, but they still float’ into a more general
statement such as ‘shape matters.’
Another important element to formative assessment teaching strategies is for students to
communicate their ideas to each other, and to provide reasons, evidence and explanations for
their ideas (Black & Wiliam, 1998). The teacher’s role therefore is to promote reasoning, by
asking students to provide explanations and justifications, probing for deeper meaning, and
comparing/contrasting student ideas (Ruiz-Primo & Furtak, 2006, 2007). A focus of the training
program and the Guide was to train teachers to push students to clarify their ideas and provide
evidence and reasoning for them.
In the context of formative assessment, argumentation can serve the function of self- and
peer-assessment, where students listen to the ideas of others, consider supporting evidence, and
progress to higher levels of understanding (Sadler, 1989). Arguing scientific ideas is also
fundamental to the practice of scientific inquiry, both in the classroom and in the field of science
(e.g., AAAS, 1990; 1993; Newton & Osborne, 1999; Osborne, Erduran, & Simon, 2004).
Therefore, in the Guide and training, teachers were encouraged (and were given opportunities to
practice during the summer training) to promote student-to-student discussions and debate rather
than having students merely respond to questions posed by the teacher. This argumentation was intended to
provide students immediate feedback about their conceptions as they reflected on how evidence
could be used to support their claims.
Finally, and aligned with the previous category, the training and Guide emphasized that
teachers should encourage students to provide evidence for their ideas, so that this evidence
might be evaluated and used to revise knowledge claims. Evidence-based reasoning is a
cornerstone of effective formative assessment practice in the context of scientific inquiry
(National Research Council, 1996; 2001b; Duschl, 2001). To capture the scientific inquiry nature
of this instructional transaction, we also created a component named student use of evidence-
based reasoning, which recorded whether or not students were citing evidence from the investigation
they completed in class, and whether or not this evidence was then used to refine, develop, and
support universal explanations for sinking and floating. Table 1 provides a summary of treatment
structures and quality of delivery used as the analytic framework to determine the enacted
treatment for the implementation study.
--------------------------------Insert Table 1 About Here--------------------------------
Method
The purpose of this study is to determine the extent to which each of the six teachers in the
experimental group implemented the treatment, the Reflective Lessons, as intended, and whether
any differential quality of implementation can be related to the effectiveness of the formative
assessments to improve students’ learning. This section provides information about the six
Experimental Group teachers and their classes, and about the data collection and analysis procedures.
Participants
The six teachers who were randomly assigned to the Experimental Group represent
various backgrounds and levels of experience; more information about each of the six teachers is
provided in Table 2 (see also Shavelson et al., this issue, for additional information).
--------------------------------Insert Table 2 About Here--------------------------------
As Table 2 shows, not all of the teachers in the study had post-secondary background in
science, and several were close to the beginning of their teaching careers. The curriculum was
implemented in 6th grade in some districts, and 7th grade in others. Class sizes ranged from 20 to
31, and class-period length also varied.
Sources of Information
Intended Treatment. The Guide served as the major source of information about the
intended treatment, as it reflected both the intentions of the Assessment Development Team, as
well as the summer training carried out with the experimental group teachers (see Ayala et al.,
this issue).
Enacted Treatment. Evidence about how the Reflective Lessons were implemented came
from classroom videotapes of each teacher’s focus class (for information on how focus classes
were selected, see Yin et al., this issue).
Achieved Treatment. Students’ responses to the 38 items which appeared on the pre- and
post-achievement test were used as evidence to link the quality of the treatment implementation
to the effectiveness of the treatment (see Yin et al., this issue).
Data Collection
Videotapes. Since teachers were distributed across the country, it was not feasible to have
Assessment Development Team staff videotape the implementation of the Reflective Lessons.
Rather, teachers were shown how to set up and videotape lessons during their training; these
videotapes served as the primary data source for this study. More specifically, as part of the
summer training workshop, all teachers were shown a videotape explaining how to place the
camera in the classroom and how to familiarize students with the equipment. Each teacher was
then given a Canon ZR60 Digital Video Camcorder, a tripod, a lapel microphone with pocket
transmitter, and an ample supply of mini-DV tapes and batteries for operating the cameras.
Teachers then practiced setting up and operating the camera under the guidance of members of
the Stanford Education Assessment Laboratory (SEAL). Each teacher then placed the
camera at the back of their classroom so that the teacher and some students could be observed.
Teachers videotaped each day that they taught investigations from the first 12 investigations of
FAST in which the assessments were embedded, as well as all of the Reflective Lessons, which
were embedded after lessons 4, 6/7, 10/11 (see Ayala et al., this issue, for more information
about the design and placement of the Reflective Lesson prompts).
Each week, teachers submitted their videotapes in pre-addressed, stamped envelopes.
SEAL staff then logged the videotapes according to date and lesson taught into a database,
numbering each tape sequentially so that they could be easily kept in order. An outside
contractor transcribed the videotapes and converted them to RealPlayer files so that the videos
could be copied and viewed.
Each videotape marked by the teachers as containing a Reflective Lesson was separated
for analysis. Since teachers took different amounts of time to teach the sinking and floating unit
in which the assessments were embedded, the total number of Reflective Lesson videotapes
collected from each teacher varied greatly, from 8 to 16 videotapes of 424 to 825 minutes’ duration.
While most of the teachers submitted videotapes of most of the prompts, Becca did not submit
videotapes of four prompts - more than one-fourth of the Reflective Lessons. The other
exception was Andy, who did not submit any videotapes of Concept Maps. Of the tapes received,
about 5% had poor or no sound and were not further analyzed.
Pre- and Posttests of Student Learning. The pre-post achievement test included
proximal and distal multiple-choice and short-answer items addressing the concepts of mass,
volume, density, relative density and sinking and floating, and graph interpretation. Thus student
performance on the pre- and posttest can be identified as a measure of various aspects of
students’ conceptual understanding. More information on the validity and reliability of the 38
common pre-posttest multiple-choice achievement test items can be found in Shavelson et al.’s
paper in this volume.
Pretests were administered to students at the beginning of the students’ science course.
In most cases, this occurred in the fall (since Carol’s science class did not start until January,
pretests were administered to her students after the winter break). Posttests were then
administered as closely as possible to the last day of instruction.
Data Analysis
Since the first two sources of data, the Guide and videotapes, were both coded according
to a framework aligned with the goals of the project, information about that coding system will
be presented first. Then, the method used to analyze the Guide and videotapes will be discussed
in some detail.
Fidelity of Implementation Coding System. We designed a coding protocol to capture in
the most direct way possible each teacher’s alignment with the intended treatment, that is, with the
critical aspects falling under the headings treatment structures and quality of delivery that were listed
in Table 1. These critical aspects were operationalized into one or more categories intended to
provide measures that capture the extent to which each teacher enacted the treatment as intended
(Table 3). The first two codes focus on the class organization suggested in the Guide to promote
certain processes. For example, the more time spent on teacher talk/task setting, the less time was
available for other important processes (e.g., argumentation). The next three codes focus on
events that were expected to be observed in an appropriate treatment process.
--------------------------------Insert Table 3 About Here--------------------------------
Intended Treatment: Coding the Teacher Guide. The codes above were used, when
relevant, to identify what might be called the “ideal implementation profile,” i.e., the Assessment
Development Team’s vision of treatment implementation. Codes were applied to relevant pieces
of the Guide; for example, applying the ‘Focus of Instruction’ codes to the treatment
implementation sequences described above. In other cases, information from the Guide was
translated directly into the ideal profile of implementation. For example, the Guide’s suggestion
that Reflective Lessons should be implemented in 2-3 days was used to interpret the length of
time that teachers actually took to implement the lessons.
Enacted Treatment: Coding the Videotapes. The ‘Fidelity of Implementation’ coding
system was then applied to the videotapes of the Reflective Lessons collected by the teachers.
One-minute time intervals served as the unit of analysis. Since we wanted to capture all relevant
strategies used by each teacher, and to be able to make statements about the duration of different
strategies and elements of the Reflective Lessons, coding the videotapes minute-by-minute
provided a broad overview of how each teacher used his or her time to implement the lessons.
To begin, all videotapes submitted by each teacher were reviewed, and those that were
labeled as containing Reflective Lesson material were verified and designated for analysis. Then
a transcript of each video was segmented by minute. In some cases, the process of segmenting the
videotapes revealed that some of the lessons were not properly marked; that is, these lessons
contained other activities not related to the Reflective Lesson implementation. These videotapes
were then removed from the sample.
The coding strategy was then applied to each time interval once, with only a single code
permitted for each category. The “Focus of Instruction” code was applied first and determined
which subsequent codes would be applied or designated as “not applicable”; for instance, the
“Student Task” code was only applied when students were working individually or in small
groups, and other codes were only applied during whole-class discussions. Table 4 illustrates
how codes were applied to four minutes of videotape; coding categories are shown in gray if they
were not applied to a particular minute of videotape.
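The gating rule described above, in which the "Focus of Instruction" code determines which later categories apply to a given minute, can be sketched as a small data structure. This is an illustrative sketch only: the label strings, the MinuteCode record, and the code_minute function are hypothetical, and the actual categories are those defined in Table 3.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels for the "Focus of Instruction" category.
WHOLE_CLASS = "whole-class discussion"
SMALL_GROUP = "small group"
INDIVIDUAL = "individual"
TEACHER_TALK = "teacher talk/task setting"


@dataclass
class MinuteCode:
    """Codes for a single one-minute interval; None means 'not applicable'."""
    minute: int
    focus: str
    student_task: Optional[str] = None
    discussion_code: Optional[str] = None


def code_minute(minute: int, focus: str,
                student_task: Optional[str] = None,
                discussion_code: Optional[str] = None) -> MinuteCode:
    """Apply the focus code first, then blank out categories it rules out."""
    # "Student Task" applies only to individual or small-group work.
    if focus not in (SMALL_GROUP, INDIVIDUAL):
        student_task = None
    # Discussion-related codes apply only during whole-class discussion.
    if focus != WHOLE_CLASS:
        discussion_code = None
    return MinuteCode(minute, focus, student_task, discussion_code)
```

Storing one record per minute with at most one code per category mirrors the single-code-per-category rule, and the None values correspond to the cells shown in gray in Table 4.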
--------------------------------Insert Table 4 About Here--------------------------------
Three raters, the first three authors, participated in training in coding the videotapes.
Once satisfactory levels of agreement were reached in reliability training analyses, each rater
independently coded 22% of the videotaped lessons. Information about the coders’ agreement is presented
below; all statistics exceed satisfactory threshold levels and are presented in Table 5.
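For readers unfamiliar with agreement statistics for minute-by-minute codes, the following minimal sketch shows two common choices, simple percent agreement and Cohen's kappa for two raters. The text does not state which statistics appear in Table 5, so the choice of statistics and the function names here are assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence


def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Proportion of intervals to which two raters assigned the same code."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("rater sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two raters corrected for chance."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the raters' marginal proportions per code.
    categories = count_a.keys() | count_b.keys()
    p_expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if p_expected == 1.0:
        # Both raters used one identical code throughout.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)
```

Kappa is usually reported alongside raw agreement because some codes are far more frequent than others in classroom video, and raw agreement alone can look high by chance.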
--------------------------------Insert Table 5 About Here--------------------------------
Once agreement was established, the remaining videos were divided between two of the
three raters and coded independently.
Achieved Treatment: Analysis of Student Learning. We tested differences between the
six groups before and after instruction. We also provide evidence about the relationship
between the quality of the implementation and student learning.
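A relationship of this kind is commonly summarized with a correlation coefficient. The following is a minimal sketch of Pearson's r, assuming one implementation score and one mean learning gain per teacher. The text does not specify which coefficient was computed, and the function name pearson_r and its inputs are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two paired sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)
```

With only six teachers, any such coefficient rests on six paired values, so it is best read as descriptive rather than as a precise estimate.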
Results
The goals of the study were to determine whether the teachers implemented the critical
aspects of the embedded assessments as prescribed by the Guide and to link the quality of
implementation to the effectiveness of the formative assessments to improve student learning.
This section is organized to directly address these goals. First, information regarding the
Intended and Enacted Treatment will be presented; then the results of student learning – the
Achieved Treatment – will be presented. Finally, we will attempt to link the Intended, Enacted,
and Achieved Treatments together to guide interpretation of the Romance Project findings.
Intended and Enacted Treatment
For the purpose of clarity, results for the Intended and Enacted Treatment will be
presented first for adherence to treatment structure, and second for quality of delivery of
treatment processes. Each sub-section will be subsequently presented according to the aspects of
fidelity of implementation identified earlier in the paper.
Measuring Adherence to Treatment Structure
(1) Implementation of All Prompts. Our expectation was that teachers would implement
all of the prompts that made up the Reflective Lessons Types I and II. This expectation was
reiterated in the summer teacher training and in the Guide. Unfortunately, since teachers were
responsible for submitting videotapes themselves, we cannot know if we received videotapes of
all the lessons that were taught. Given that limitation, we still attempted to determine if we had
videotape evidence from all teachers enacting all Reflective Lesson prompts. Diana was the only
teacher who submitted videotapes from every prompt, and Aden, Carol, and Robert were missing
only one. Andy and Becca were missing three and four lessons, respectively.
(2) Sequence within and between prompts. The Guide supplied model sequences for the
Reflective Lessons Type I and II overall, as well as sequences for the implementation of the
Predict-Observe-Explain assessments in particular. While we emphasized the importance of
implementing all prompts in sequence, teachers were provided with several options for the
implementation of the Predict-Observe-Explain assessment, which blended multiple teaching
strategies and levels of grouping.
With respect to within-prompt sequences, we coded the three suggested Predict-Observe-
Explain sequences according to grouping level and strategies. Half the teachers experimented
with multiple suggested implementation sequences. The data also show that two of the teachers –
Carol and Diana – did not use the suggested sequences at all; instead, these two teachers blended
small group work within the sequences provided by the Guide.
Turning to the between-prompts level of analysis, we found that while five of the six
teachers did not implement all prompts, all six teachers always implemented the Type I
Reflective Lessons in the sequence described by the Guide. In contrast, the teachers were less
likely to follow the sequence guidelines for the Reflective Lessons Type II. A closer look at the
data indicates that teachers with sequences categorized as ‘other’ often skipped some elements of
the concept map Reflective Lessons, for example, making a group map or a class map. In other
cases, teachers varied the focus of instruction (e.g., whole-class discussion, small group) more than
suggested by the sequences in the teacher guide (e.g., proceeding in order from individual work
to small group work to whole-class discussion). Results of these within- and between-prompt
sequence analyses are shown in Table 6.
--------------------------------Insert Table 6 About Here--------------------------------
(3) Placement of discussions. The Guide instructed teachers to initiate and promote
scientific discussions during the Reflective Lessons. During the summer training, teachers were
given opportunities to hold discussions during each type of embedded assessment prompt. Figure
3 illustrates the relative percentages of time that each teacher devoted to whole-class discussion,
as well as other foci of instruction. Teachers spent an average of 27% of Reflective Lesson class
time in whole-class discussion (Min. = 20%, Max. = 35%).
--------------------------------Insert Figure 3 About Here--------------------------------
Looking more closely at when these discussions happened, we focused on the discussions
that fit within the Reflective Lesson Type I sequence after the Short Answer and Predict-
Observe-Explain prompts. A discussion was considered to have ‘taken place’ if there was at least
one minute of discussion after the Predict-Observe-Explain task was completed, or if more than
one minute of discussion took place after a period of independent student work following the Short
Answer prompt. This slightly more stringent standard was applied to the Short Answer than to the
Predict-Observe-Explain prompt since the teachers were encouraged strongly to have discussions
of longer duration at this point in the unit. A comparison of where teachers placed discussions is
provided in Table 7.
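The decision rule for whether a discussion had ‘taken place’ can be stated compactly. The following is a minimal sketch, assuming that the coding of each lesson yields the number of whole-class discussion minutes observed after the prompt; the function name and the prompt labels are hypothetical and are not part of the coding protocol.

```python
def discussion_took_place(prompt_type: str, discussion_minutes: float) -> bool:
    """Apply the placement rule described in the text (names are illustrative)."""
    if prompt_type == "predict_observe_explain":
        # At least one minute of discussion after the task was completed.
        return discussion_minutes >= 1
    if prompt_type == "short_answer":
        # Slightly more stringent: more than one minute of discussion
        # after the period of independent student work.
        return discussion_minutes > 1
    raise ValueError(f"unknown prompt type: {prompt_type}")


# Exactly one minute counts for Predict-Observe-Explain but not for Short Answer.
print(discussion_took_place("predict_observe_explain", 1))
print(discussion_took_place("short_answer", 1))
```

The only difference between the two branches is the strictness of the comparison, which mirrors the slightly more stringent standard applied to the Short Answer prompt.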
--------------------------------Insert Table 7 About Here--------------------------------
Among the six teachers, none held a discussion following every Predict-Observe-Explain
prompt. Results of the Short Answer prompt indicated that while Carol and Diana never missed a
discussion, Robert, Andy, and Becca skipped discussions once, and Aden twice. The fact that
many of these Short Answer whole-class discussions were missed is important, since this was the
place during which the treatment processes, to be discussed below, were expected to take place.
(4) Timing within and between prompts. According to the Guide, Reflective Lessons
Type I were to be implemented in two to three class sessions, assuming a 45-minute class period.
Reflective Lessons Type II were to take one class session to implement. According to the
sequence of implementation described in the Guide, conservative estimates can be made of how
long each prompt might take to implement. Table 8 combines these estimates with each teacher’s
average duration of implementation for each of the prompt types.
--------------------------------Insert Table 8 About Here--------------------------------
Taken together, the Reflective Lessons were expected to add between 8 and 11 class
periods to the sinking and floating Unit. Analysis of the videotapes submitted by teachers (see Footnote 2)
indicated that, on average, teachers actually spent more time on the Reflective Lessons than even
the more liberal 11-lesson estimate provided in the Guide (see Table 9; mean number of
lessons = 13, S.D. = 2.83).
--------------------------------Insert Table 9 About Here--------------------------------
We initially worried that the addition of the Reflective Lessons would make for longer
units overall for the teachers. These concerns were confirmed; in fact, the actual number of
calendar days (including weekends and holidays) spanned by Investigations 1-12 and the Reflective Lessons
varied greatly between classrooms (Mean = 139.17 days, S.D. = 55.43). Although these
calculations incorporate many days in which class was not held, the extent to which the unit and
Reflective Lessons were spread out across multiple months is still remarkable (e.g., Becca’s 242
days vs. Carol’s 87 days).
Measuring Quality of Delivery of Treatment Processes
Comparing the implementation of treatment processes to what was intended is less
straightforward than doing so with treatment structures. There was no definitive sequence given
to teachers as to how to conduct whole-class discussions; rather, the Guide contained many
suggested teaching strategies, particular questions to ask, and types of feedback to provide.
Therefore, in this section, we do not present an ‘ideal’ implementation profile. Rather, we
discuss the critical aspects as described in the Guide. Results are presented in the form of
proportions since each teacher used different amounts of time to conduct whole-class
discussions; to account for this discrepancy, the number of minutes (N) is provided in each
analysis. In some cases, the N for a particular teacher varies slightly between codes because
some minutes for some codes were dropped from the analysis due to inaudible statements by the
teacher and/or students.
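To illustrate how such proportions and their accompanying N can be derived from minute-by-minute codes, the following is a minimal sketch. It assumes one code per discussion minute and uses `None` to mark a minute dropped as inaudible; the function name and the sample codes are hypothetical and are not study data.

```python
from typing import Optional, Sequence, Tuple


def proportion_of_minutes(
    codes: Sequence[Optional[str]], target: str
) -> Tuple[float, int]:
    """Return (percentage of codable minutes carrying `target`, N codable minutes).

    A value of None marks a minute dropped because speech was inaudible.
    """
    codable = [c for c in codes if c is not None]
    n = len(codable)
    if n == 0:
        raise ValueError("no codable minutes")
    share = 100.0 * sum(c == target for c in codable) / n
    return share, n


# Hypothetical coding of five discussion minutes for one teacher.
minutes = ["voting", "collecting", None, "collecting", "presentation"]
pct, n = proportion_of_minutes(minutes, "collecting")
print(f"collecting: {pct:.1f}% of N = {n} minutes")
```

Because inaudible minutes are excluded before the percentage is taken, N can differ slightly from code to code for the same teacher, as noted above.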
(5) Eliciting student conceptions. The Guide emphasized that, in order to use the
reflective lessons effectively, teachers needed to collect and organize student ideas into coherent
groups. We classified their strategies for doing so as holding class presentations, asking students
to vote to indicate their predictions or reasoning, or simply asking students to share their ideas
(collecting). Figure 4 shows the frequency with which each teacher used these strategies within
whole-class discussions.
--------------------------------Insert Figure 4 About Here--------------------------------
Of the three elicitation strategies, teachers spent by far the most time collecting students’
ideas. This allocation of time to collecting ideas might be expected, since the code
served as a kind of ‘grab bag’ for the different ways that teachers asked students to make their
thinking explicit. The more specialized strategies of class presentations and voting revealed
important differences among the teachers’ implementations of the formative-assessment
treatment. Only two teachers used class presentations regularly as a strategy to make students’
thinking explicit (Diana = 35.7%, Becca = 27.5%), whereas four teachers asked students to share
their ideas through voting, for varying amounts of time (mean = 6.25%, Min. = 0%, Max. =
13.7%).
(6) Tracking and clustering student conceptions. In order to keep a visual record of what
students were thinking, teachers were also asked to track students’ ideas by displaying them in
some manner and then grouping those that had common features. Once these ideas were
displayed, the Guide asked teachers to cluster similar ideas together so that students could
compare them with each other. In fact, the Guide described clustering as essential to using the
Reflective Lessons effectively. Figure 5 shows the frequency with which these strategies were
used in the classrooms.
--------------------------------Insert Figure 5 About Here--------------------------------
All six teachers usually displayed student ideas in some way by using the whiteboard/
chalkboard, an overhead projector, poster paper, sticky notes, or smaller pieces of paper that
were attached to the board with magnets or tape. However, once these ideas were displayed, only
four of the teachers made an attempt to cluster the ideas together in any way (mean percentage of
minutes across teachers = 5%, Min.=0%, Max.=13.2%).
(7) Asking students to provide reasons for their explanations. Since an important purpose of
the Reflective Lessons was to help students advance in their understanding of density and
buoyancy, teachers needed to ask students to provide explanations and to ask follow-up questions to get
at the conceptions that lay beneath their predictions, claims, and other statements. Figure 6
presents the frequencies with which teachers used these two strategies during whole-class
discussions.
--------------------------------Insert Figure 6 About Here--------------------------------
The results shown in Figure 6 indicate that most of the time, teachers were not asking
students ‘Why’-types of questions that were intended to get at underlying reasoning; the majority
of time in five of the six classrooms involved no instances of teachers asking their students to
provide explanations (mean percentage of minutes across teachers=65.7%, Min.=45.5%,
Max. = 82.4%). In addition, while all teachers did occasionally ask follow-up questions, only two
teachers – Aden and Diana – did so for more than 10% of the time.
(8) Students argue ideas and evidence. The Guide explicitly states that teachers should
encourage argumentation through discussions and debate. An important piece of the idea of
‘debate’ is that students respond to each other’s ideas as opposed to responding only to their
teacher’s statements and questions. Figure 7 illustrates how often students were involved in this
kind of debate.
--------------------------------Insert Figure 7 About Here--------------------------------
For the vast majority of whole-class discussion time in all six classrooms, students were
speaking to the teacher, not each other (Figure 7; mean = 82.6%, Min. = 73.1%, Max. = 91.2%). In
Becca’s class, students never spoke directly to each other, and in Aden’s and Andy’s classes, this
form of argumentation occurred less than five percent of the time. Only in Carol’s class was a
somewhat larger portion (20%) of whole-class discussion time spent with students addressing
each other.
(9) Students provide evidence for their claims. Students should, according to the
Guide, support their claims with empirical evidence – that is, provide data from systematic
observations conducted in class that relate to the conceptions underlying their explanations. The
Guide describes different qualities of evidence, and possible sources. Nevertheless, we simply
looked to see if students, in each classroom, provided any evidence when making statements
during whole-class discussions. Table 10 presents the results of this analysis.
--------------------------------Insert Table 10 About Here--------------------------------
On average, students supported their claims with evidence about 25% of the time;
however, there was considerable variation among teachers (S.D. = 9.3, Min. = 11.8%,
Max. = 39.4%).
Learned Treatment
To examine the relationship between fidelity of Reflective Lesson implementation and student learning,
we focused on student performance on the multiple-choice pre-posttest. While other measures of
student learning were assessed as part of the Romance Project (see Yin et al., this issue), the 38-
item multiple-choice achievement test was the only assessment administered before and after the
treatment, and as such was the only assessment that could provide measures of what students
actually learned during the course of the treatment. First, we determined whether or not
students’ average achievement differed by teacher on the pretest. A one-way ANOVA indicated,
as expected given the different demographic profiles of the participating schools, a significant
mean difference among the six groups at the beginning of the study (F(5,139) = 2.99, p = .014).
Tukey’s HSD test indicated that the only significant difference was between Becca’s and
Robert’s students; this difference was not surprising given that the mean score of
Becca’s students was the lowest observed and Robert’s the highest among the six classes (see
Table 11).
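For readers less familiar with the test, the sketch below shows how a one-way ANOVA F ratio and its degrees of freedom are obtained: with k groups and N students in total, the degrees of freedom are (k − 1, N − k). The scores are made-up values for three small groups, not the pretest data, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def one_way_anova(groups: Sequence[Sequence[float]]) -> Tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical pretest scores for three classes (illustration only).
f_ratio, df_b, df_w = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f"F({df_b},{df_w}) = {f_ratio:.2f}")
```

The sketch stops at the F ratio; obtaining the p value and the Tukey post-hoc comparisons would require a statistics package.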
--------------------------------Insert Table 11 About Here--------------------------------
Next, we focused on differences between students in classes at the end of the study. We
used an analysis of covariance, controlling for pre-test. Results indicated a significant difference
in the post-test scores among experimental-group teachers (F(5,120) = 8.72, p < .001). Post-hoc
comparisons indicated significant mean differences between Carol’s students (highest adjusted
mean score) and those of all the other teachers except Andy (p = .057).
Robert’s students (lowest adjusted mean score) scored significantly lower than Andy’s, Carol’s,
and Diana’s students, but not significantly lower than Aden’s or Becca’s students.
Comparing Enacted and Learned Curricula
In this section we take the final step in connecting the Intended, Enacted, and Learned
Treatment in our examination of the relationship between implementation fidelity and student
achievement. Since we did not have videotapes for all lessons for all teachers, there is little
validity in making comparisons between enacted treatment structures and student learning.
However, since analyses of treatment processes were based on proportions and not total data, we
were able to develop a ranking for each of the critical processes based on the extent to which
each teacher’s enacted treatment aligned with the intended treatment as defined by the Guide.
The rankings for the Quality of Delivery are based upon teachers’ congruence with the
Treatment Processes that guided our analysis; that is, the higher the percentage of time those
processes were implemented, the higher the teacher’s ranking. For the first aspect, eliciting
student conceptions, no ranking was given since none of the three strategies coded was viewed
as being more critical than the others. The second aspect, tracking and clustering students’ ideas,
did have a critical aspect identified by the Guide: clustering students’ ideas. For the third
aspect, asking students to provide reasons for their explanations, we viewed both asking for
explanations and asking follow-up questions to be critical; therefore, the teachers who had higher
proportions of both of these codes received higher rankings. The fourth aspect, encouraging
argumentation, was simpler; teachers whose students talked to each other more were rated more
highly. Finally, the fifth aspect, students providing evidence for their claims, also translated readily into a ranking;
teachers whose students cited evidence more often received higher rankings.
These results and average teacher rankings are shown in Table 12 alongside the pre-
posttest gain scores. The means for the enacted treatment rankings and learned pre-posttest gain
scores were then used to produce overall rankings for the teachers on the enacted and learned
treatments.
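The ranking procedure can be sketched as follows. This is an illustration under stated assumptions rather than the exact computation behind Table 12: the teacher labels and percentages are hypothetical, and the handling of ties (tied teachers share the mean of the ranks they occupy) is an assumption, since the treatment of ties is not described here.

```python
from typing import Dict, List


def rank_descending(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank 1 = highest score; tied scores share the mean of their ranks."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over any run of teachers tied with the one at position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        mean_rank = (i + 1 + j + 1) / 2
        for name in ordered[i:j + 1]:
            ranks[name] = mean_rank
        i = j + 1
    return ranks


def mean_ranks(aspects: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each teacher's rank across the ranked aspects."""
    per_aspect = [rank_descending(a) for a in aspects]
    return {t: sum(r[t] for r in per_aspect) / len(per_aspect) for t in aspects[0]}


# Hypothetical percentages of discussion minutes for three teachers.
clustering = {"T1": 13.0, "T2": 0.0, "T3": 5.0}
evidence = {"T1": 30.0, "T2": 12.0, "T3": 39.0}
avg = mean_ranks([clustering, evidence])
# A lower mean rank is better, so negate before ranking in descending order.
overall = rank_descending({t: -r for t, r in avg.items()})
print(avg, overall)
```

Averaging ranks rather than raw percentages keeps any single aspect with a wide range from dominating the overall ordering.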
--------------------------------Insert Table 12 About Here--------------------------------
The relationship between each teacher’s learned and enacted treatment ranking is shown
in Figure 8.
----------------------------------Insert Figure 8 About Here----------------------------------
Spearman’s rank-order correlation indicates a relationship between the rankings of the
enacted and learned treatments; however, the correlation, while fairly high in magnitude, is not
significant, likely due to the small sample size (ρ = 0.714, p > 0.05). The comparison between
the rankings does reveal three groupings of teachers: first, Carol, who ranked first on both the
enacted and learned treatment; second, Aden, Diana, and Andy, who had intermediate ranks on
both measures; and finally, Robert and Becca, whose rankings were lowest for both enacted and
learned treatment. This relationship, while by no means causal, at least indicates that there was
some similarity between the ranking of teachers’ enacted quality of delivery and learned
treatment.
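For six untied ranks, a coefficient of 0.714 corresponds to a sum of squared rank differences of 10, since 1 − 6(10)/(6 × 35) ≈ 0.714. The sketch below reproduces that arithmetic and adds an exact permutation p value. The two rank lists are illustrative values chosen only to give a sum of squared differences of 10; they are not the Table 12 rankings, and the permutation test is one possible way to assess significance, not necessarily the one used in this study.

```python
from itertools import permutations
from typing import Sequence


def spearman_rho(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Spearman's rho for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def exact_two_sided_p(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Share of all orderings of rank_b whose |rho| is at least the observed |rho|."""
    observed = abs(spearman_rho(rank_a, rank_b))
    rhos = [abs(spearman_rho(rank_a, perm)) for perm in permutations(rank_b)]
    # Small tolerance guards against floating-point noise in the comparison.
    return sum(r >= observed - 1e-12 for r in rhos) / len(rhos)


# Illustrative ranks for six teachers (not the study's rankings).
enacted = [1, 2, 3, 4, 5, 6]
learned = [1, 4, 3, 2, 6, 5]
rho = spearman_rho(enacted, learned)
p = exact_two_sided_p(enacted, learned)
print(f"rho = {rho:.3f}, exact two-sided p = {p:.3f}")
```

With only six teachers there are just 720 possible orderings, which is why even a fairly large coefficient can fail to reach conventional significance.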
Discussion
The goal of this paper was to provide information about the quality of implementation of
the formative-assessment treatment, the embedded assessments, with the purpose of gaining
some understanding about the results reported by Yin et al. (this issue)—large teacher effects and
no treatment effect. We argued that in the absence of information about implementation, it would
be difficult to determine whether Yin et al.’s results could be attributed to a poor
conceptualization of formative assessments or to inadequate implementation.
Results indicated that adherence to the treatment structure varied by type of embedded
assessment. Higher levels of adherence to the structure were observed in Reflective Lessons
Type I than Type II. It seems that both the Guide and the training emphasized Type I more than
Type II Reflective Lessons. It might be that the importance of the information gained about
students’ level of understanding through concept maps was not emphasized enough in the Guide,
in the training, or by the researchers. Teachers devoted much more time to the discussion of
Reflective Lessons Type I.
Results also indicated that teachers’ quality of delivery departed even more from the
envisioned implementation than did their adherence to the treatment structures. Some critical aspects were
implemented across teachers (e.g., whole-class discussions), while some others (e.g., clustering
students’ conceptions or asking for students’ explanations) were almost totally ignored by most.
Whole-class conversations and collecting information from students, alone, are not what
makes for a good implementation. Clustering students’ conceptions and asking for
explanations, elaborations, and supporting evidence are the most fundamental characteristics and
intentions of the embedded assessments.
The quality of delivery seemed to be more consistent with teachers’ rankings on the gain
scores as opposed to the adherence rankings. This result supports the contention that simply
giving students the embedded assessments in sequence may not be enough to help students learn;
the quality of delivery of the critical teaching strategies is an essential element in helping
students learn.
While we cannot conclude that the variation among the teachers’ implementations of the
treatment led to the differences in student learning reported by Yin et al. (this issue), the results
at least suggest a correlation between the consistency of treatment enactment with the project’s
intention and student learning.
Implementing quality formal formative assessment requires a careful consideration of
design and practical issues not discussed with enough depth in the literature. Formative
assessment tasks should be designed not only to make students’ conceptions explicit but also to
do it in an efficient and effective way so the embedded assessments do not take a lot of time to
be administered and focus attention on the critical issues that the teacher and the students need to
pay attention to. How much do we know about the types of assessment tasks that best reflect
students’ conceptions? Few attempts have been made to learn about assessment tasks in the
context of classroom assessment (but see Furtak & Ruiz-Primo, 2007). We believe that the data
collected in this study indicate the need to develop embedded assessments that are not only easy
to implement but also gather students’ conceptions in a way that clearly reveals what steps need to be
taken to close the formative assessment loop (“how to get there,” the third step in the three-step
cycle proposed by Ramaprasad in 1983).
The results of this study suggest that the Assessment Development Team did not place a
clear enough emphasis on what it considered the critical aspects of the treatment. It is
possible that the complex structure of the Reflective Lessons made them difficult to learn to
teach in one week, and as a result, the critical aspects that were implemented more – the
treatment structures – may have been easier for the teachers to learn than the treatment processes.
The study also raises important questions about the feasibility of formative assessment in
general. For instance, embedded formative assessments may be too restrictive and
time-consuming for teachers. In addition, some of the most critical aspects of implementation – e.g.,
pushing students to support their claims with reasons and evidence, encouraging students to
argue with each other – are difficult for any teacher, especially those with limited teaching
experience and weak backgrounds in science. One possible solution may be to avoid focusing on
the structure of formative assessments and instead work with teachers to improve their ability to
lead whole-class discussions that truly engage learners in sharing and arguing their ideas.
In hindsight, we believe that we should have put less effort into presenting teachers with
many possible teaching strategies, and more effort into identifying what we believed were the
most important strategies to help students learn. Furthermore, a model to guide teachers’ quality
of delivery may have helped explicate our instructions to “Cluster student ideas” and “argue and
debate,” although the effectiveness of this model would be entirely dependent upon teachers’
willingness to use it – as we found with the materials we already provided them. A possible
model might be for teachers to gather student ideas, display them, and then cluster and discuss
them, seeking supporting evidence.
Perhaps the most important lesson we learned in the process of completing the Romance
Project is that, despite our best intentions to design a treatment and instruments to measure it, the
project’s ideas were only as good as our ability to help teachers enact them in the
classroom.
References
American Association for the Advancement of Science. (1990). Science for All Americans. New
York: Oxford University Press.
American Association for the Advancement of Science. (1993). Benchmarks for Science
Literacy. New York: Oxford University Press.
Ayala, C. (2007). This issue.
Bauman, L. J., Stein, R. E. K., & Ireys, H. T. (1991). Reinventing fidelity: The transfer of social
technology among settings. American Journal of Community Psychology, 19(4), 619-639.
Black, P., & Wiliam, D. (1998). Assessment and Classroom Learning. Assessment in Education,
5(1), 7-74.
Dusenbury, L., Brannigan, R., Falco, M., & Hansen, W. B. (2003). A review of research on
fidelity of implementation: Implications for drug abuse prevention in school settings. Health
Education Research: Theory and Practice, 18(2), 237-256.
Duschl, R. A. (2003). Assessment of Inquiry. In J. M. Atkin & J. Coffey (Eds.), Everyday
Assessment in the Science Classroom (pp. 41-59). Arlington, VA: NSTA Press.
Furtak, E. M., & Ruiz-Primo, M. A. (2007). Studying the Effectiveness of Four Types of
Formative Assessment Prompts in Providing Information about Students' Understanding in
Writing and Discussions. Paper presented at the American Educational Research Association
Annual Meeting, Chicago, IL.
McKnight, C. C., Crosswhite, F. J., Dossey, J. A., Kifer, E., Swafford, J. O., Travers, K. T., &
Cooney, T. J. (1987). The underachieving curriculum: Assessing U.S. school mathematics
from an international perspective. Champaign, IL: Stipes Publishing.
Moncher, F. J., & Prinz, R. (1991). Treatment fidelity in outcome studies. Clinical Psychology
Review, 11, 247-266.
Mowbray, C., Holter, M. C., Teague, G. B., & Bybee, D. (2003). Fidelity criteria: Development,
measurement, and validation. American Journal of Evaluation, 24(3), 315-340.
National Research Council. (1996). National Science Education Standards. Washington, D.C.:
National Academy Press.
National Research Council. (2001a). Classroom Assessment and the National Science Education
Standards. Washington, D.C.: National Academy Press.
National Research Council. (2001b). Inquiry and the National Science Education Standards.
Washington, D.C.: National Academy Press.
Newton, P., & Osborne, J. (1999). The Place of Argumentation in the Pedagogy of School
Science. International Journal of Science Education, 21(5), 553-576.
Osborne, J., Erduran, S., & Simon, S. (2004). Enhancing the Quality of Argumentation in School
Science. Journal of Research in Science Teaching, 41(10), 994-1020.
Richardson, V. (1996). The Role of Attitudes and Beliefs in Learning to Teach. In J. Sikula
(Ed.), Handbook of Research on Teacher Education (2nd ed., pp. 102-118). New York:
MacMillan.
Ruiz-Primo, M. A. (2005). A multi-method and multi-source approach for studying fidelity of
implementation. CSE: Technical Report 677. Los Angeles, CA: Center for Research on
Evaluation, Standards, and Student Testing/ University of California, Los Angeles.
Ruiz-Primo, M. A., & Furtak, E. M. (2006). Informal Formative Assessment and Scientific
Inquiry: Exploring Teachers' Practices and Student Learning. Educational Assessment,
11(3&4), 237-263.
Ruiz-Primo, M. A., & Furtak, E. M. (2007). Exploring Teachers' Informal Formative Assessment
Practices and Students' Understanding in the Context of Scientific Inquiry. Journal of
Research in Science Teaching, 44(1), 57-84.
Sadler, D. R. (1989). Formative Assessment and the Design of Instructional Systems.
Instructional Science, 18, 119-144.
Stanford Education Assessment Laboratory. (2003). Teacher's Guide to the Reflective Lessons.
Unpublished manuscript.
Yin, Y. (2007). This issue….
Zumwalt, K., & Craig, E. (2005). Teachers' Characteristics: Research on the Indicators of
Quality. In M. Cochran-Smith & K. M. Zeichner (Eds.), Studying Teacher Education: The
Report of the AERA Panel on Research and Teacher Education (pp. 157-260). Mahwah, New
Jersey: Lawrence Erlbaum Associates.
Footnotes
1. We acknowledge that this approach has been used in studying curricula rather than
treatments, but we believe the strategy equally applies to the study of treatments.
2. Since the estimates of timing are based upon the videotapes submitted by teachers, we
acknowledge that, beyond videotape we know was lost due to poor audio quality, these figures
may be an underestimate of the time teachers spent.
Table 1
Adherence and Quality of Delivery of the Reflective Lessons
Intended treatment / Critical and non-critical aspects
Adherence: Treatment Structures
(1) Implementation of all prompts
(2) Sequence within and between prompts
(3) Placement of discussions
(4) Timing within and between prompts
Quality of Delivery: Treatment Processes
(1) Eliciting student conceptions
(2) Tracking and clustering student conceptions
(3) Asking students to provide reasons for their explanations
(4) Students argue ideas and evidence
(5) Students provide evidence for their claims
Table 2.
General Information about Experimental Group Teachers and Their Classes
Aden Andy Becca Carol Diana Robert
Gender Male Male Female Female Female Male
Ethnicity White (not Hispanic origin) White (not Hispanic origin) Asian White (not Hispanic origin) White (not Hispanic origin) White (not Hispanic origin)
Highest Degree Earned MA A or BS B Ed MA or MS MA BA
Major in Science Yes No No No No Yes
Minor in Science Yes No No No No Yes
Teacher Credential Residency Certification K-8th, - Science & English Elementary Education Secondary General Science, Elementary Science Continuing K-12 Pre K-6 State in Science, Diverse Areas
Years of Teaching 2 5 18 23 3 14
Years of Teaching Science 2 5 17 10 1 14
Years teaching 6th/7th grade 2 4 12 10 1 3
Grade Level Taught 7th 7 7th 6th 6th 7th
Science Sessions Length (minutes) 55 53 54 40 40 55
Class Size at Pretest 29 31 22 20 25 27
Table 3.
Description of the Categories and Codes Used in Coding Protocol
Critical aspects / Category / Codes: Description
Treatment Structures
Focus of Instruction
  Teacher talk/task setting: Teacher speaks to class without engaging students in discussions, sets Reflective Lesson task, or carries out demonstration
  Individual: Students work individually
  Small Group Talk: Students work together in pairs or small groups
  Whole Class Discussions: Teacher and students engage in discussions, or teacher works with students’ ideas
Student Task
  Table consensus/survey: Students at each table come to consensus regarding issue or confusion, or conception; or students collect from their groups all conceptions about the question of interest
  Peer review: Students review work of their peers using teacher-provided answer sheet
  Self-review: Students review own work using teacher-provided answer sheet
Treatment Processes
Making students’ thinking explicit
  Collecting: Teacher asks students to share their observations, predictions, hypotheses, evidence, examples, definitions, procedures or to answer simple, yes-no, or fill-in-the-blank questions.
  Class Presentation: Teacher asks students to report in front of class, either individually or in groups.
  Voting: Teacher asks students to raise hands to indicate their prediction/explanation/conception; tally captured in some way on board
Tracking student conceptions
  Displaying students’ ideas (no clustering): Students’ ideas displayed on board/overhead/papers/posters/etc. without explicitly organizing or grouping. If ideas are displayed and the discussion about that idea continues into the next minute, continue coding the idea as displayed until a new idea is discussed
  Clustering students’ ideas: Teacher actively clusters, categorizes, funnels, or groups students’ ideas, concepts, procedures, or terms (teacher arranges ideas into groups or categories; can only happen after teacher has collected at least one conception).
Promoting Reasoning
  Asking for students’ explanations, reasoning, or justification: Teacher asks ‘why’ questions which are initial or follow-up queries to elicit students’ thinking and make students’ reasoning explicit. This also includes teacher asking students to provide evidence for claims they have made
  Asking students to elaborate; differentiating, comparing, or contrasting student ideas/responses: After asking a ‘why’ question and receiving a student response, teacher asks one or more follow-up questions (independent of their quality), or teacher explicitly compares/contrasts student ideas, promotes disagreement, invites disagreement from students, or in some other way calls attention to differences in student ideas (differentiates)
Argumentation
  Students respond to questions or statements introduced by the teacher: Interval contains exchange of ideas between teacher and student/s, but no exchange of ideas between students; only applies to verbal exchanges
  Students respond to other students’ questions or statements, either directly or mediated by the teacher: Interval contains exchange of ideas or argumentation (e.g., the process of assembling claims, evidence, and reasoning) between students
Role of Evidence
  Students cite data or evidence from class: Students explicitly reference data or evidence collected in class, including graphs, data points or data tables, POE’s and PO’s, or observations
Table 4.
Example of Four Coded Minutes
Columns: Speaker/Dialogue; Coding Category; Codes Applied

Minute 1
Speaker/Dialogue: Teacher: Friday, we left off with you predicting which straw would sink the furthest, which would have the greatest depth of sinking, and which straw would have the least depth of sinking. So I want you to have that page in front of you. Make sure you have that page in front of you. What I’m going to do now is take a quick survey. So everyone look very carefully at your paper. Look at the part of the table that says greatest depth of sinking and look to see which straw you circled,
Coding Category: Focus of Instruction; Student Task; Making Students’ Thinking Explicit; Tracking Student Conceptions; Promoting Reasoning; Argumentation; Role of Evidence
Codes Applied: Teacher Talk/Task Setting

Minute 2
Speaker/Dialogue: Teacher: look to see which straw number you circled, and then put your head down. You should know what page it is. Look to see on the table which straw you predicted would sink the greatest, keep it in your head and put your head down. Raise your hand if you thought that straw number one would sink the furthest? Your head should be down, Nate. Raise your hand if you think that straw number two would sink the furthest? Raise your hand if you thought straw number three would sink the furthest. (Teacher writes tally on board after each vote).
Coding Category: Focus of Instruction; Student Task; Making Students’ Thinking Explicit; Tracking Student Conceptions; Promoting Reasoning; Argumentation; Role of Evidence
Codes Applied: Whole Class Discussions; Voting; Displaying students’ ideas; Reasoning not promoted; No argumentation; No evidence cited

Minute 3
Speaker/Dialogue: Teacher: And raise your hand if you thought straw number four would sink the furthest. Okay, raise your heads. These are the results. Zero people thought straw number one would sink the furthest, one person thought straw number two would, one person thought straw number three would, and 20 people thought that straw number four would. Now, we have some interesting things going on here. For those of you who thought that straw number four would sink the furthest, what were some reasons for that?
Coding Category: Focus of Instruction; Student Task; Making Students’ Thinking Explicit; Tracking Student Conceptions; Promoting Reasoning; Argumentation; Role of Evidence
Codes Applied: Whole Class Discussions; Voting; Displaying students’ ideas; Asking for students’ explanations, reasoning, or justification; Students respond to questions/statements posed by the teacher; No evidence cited

Minute 4
Speaker/Dialogue:
Teacher: I’ll show them to you quickly to refresh your memory. We had different amounts of sand in all the same sized straws. What did you think?
Student: Because it weighed more.
Teacher: Okay, weighed more. Hm, I heard a lot of gasps. Because you probably felt me cringing. It’s okay, Steven. And then I heard Steven go oh. Do you want to change that word?
Student: More mass.
Teacher: Oh, thank you. I’m going to cross that out so Mrs. Jones can live another day and not have a heart attack. No, I think I know my own last name, guys. He’s saying that straw four massed more. How do you know that?
Student: Because it had the most sand in it.
Coding Category: Focus of Instruction; Student Task; Making Students’ Thinking Explicit; Tracking Student Conceptions; Promoting Reasoning; Argumentation; Role of Evidence
Codes Applied: Whole Class Discussions; Collecting student responses; Displaying students’ ideas; Asking students to elaborate; Students respond to questions/statements posed by the teacher; Students cite data/evidence from class
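
As an illustration of how minute-by-minute codes such as those in Table 4 can be turned into the percentages of time reported by teacher, the following minimal sketch tallies the Focus of Instruction code for the four example minutes. The dictionary layout and the helper name `percent_of_minutes` are assumptions made for this illustration and are not taken from the study's analysis procedures.

```python
from collections import Counter

# Focus of Instruction code for each one-minute interval, transcribed from Table 4.
focus_by_minute = {
    1: "Teacher Talk/Task Setting",
    2: "Whole Class Discussions",
    3: "Whole Class Discussions",
    4: "Whole Class Discussions",
}


def percent_of_minutes(codes_by_minute):
    """Return {code: percentage of coded minutes}.

    Hypothetical helper for illustration; each minute carries exactly one code.
    """
    counts = Counter(codes_by_minute.values())
    total = len(codes_by_minute)
    return {code: 100.0 * n / total for code, n in counts.items()}


shares = percent_of_minutes(focus_by_minute)
for code, pct in shares.items():
    print(f"{code}: {pct:.0f}% of {len(focus_by_minute)} coded minutes")
```

Applied to a teacher's full set of coded minutes rather than four, the same tally would yield one percentage per code within a category.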
Table 5.
Results of Consistency of Scorers’ Analyses
                      Agreement                                           Reliability
                      Percent Direct  Kappa                               Interrater Coefficient
Coding Category       Agreement       Raters 1-2  Raters 1-3  Raters 2-3  Raters 1-2  Raters 1-3  Raters 2-3
Focus of Instruction  98              0.909       0.952       0.951       0.997       0.999       0.999
Student Task          98              0.881       0.946       0.927       0.913       0.952       0.951
Thinking Explicit     99              0.908       0.929       0.934       0.937       0.960       0.960
Tracking              98              0.892       0.910       0.914       0.958       0.960       0.961
Reasoning             96              0.786       0.845       0.836       0.986       0.960       0.960
Argumentation         97              0.859       0.915       0.892       0.930       0.950       0.960
Evidence              96              0.861       0.908       0.817       0.870       0.950       0.965
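
For readers unfamiliar with the two agreement statistics in Table 5, the sketch below shows how percent direct agreement and a kappa coefficient can be computed for one pair of raters. It assumes Cohen's kappa for two raters, which this section does not explicitly confirm, and the two code sequences are hypothetical and are not data from the study.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Percentage of intervals on which two raters assigned the same code."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(codes_a)
    p_observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Chance agreement from each rater's marginal code frequencies.
    p_chance = sum(freq_a[c] * freq_b[c] for c in set(codes_a) | set(codes_b)) / (n * n)
    if p_chance == 1:
        return 1.0  # both raters used one identical code throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical codes for ten one-minute intervals (NOT data from the study).
# WCD = whole class discussion, TT = teacher talk, SG = small group work.
rater_1 = ["WCD", "WCD", "TT", "TT", "WCD", "SG", "SG", "WCD", "TT", "WCD"]
rater_2 = ["WCD", "WCD", "TT", "WCD", "WCD", "SG", "SG", "WCD", "TT", "WCD"]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
print(f"Percent direct agreement: {agreement:.0f}")
print(f"Kappa: {kappa:.3f}")
```

In this hypothetical case the raters agree on nine of ten intervals, yet kappa is lower than the raw agreement because some agreement is expected by chance; the same pattern is visible in Table 5, where kappa values fall below the percent direct agreement.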
Table 6.
Frequency of the Reflective Lessons that Followed the Sequence Suggested by Type
         Within-Prompt               Between-Prompt
         Predict-Observe-Explain     RL Type II
Teacher  Followed Sequence   Other   Followed Sequence   Other
Aden     3                   0       1                   1
Andy     2                   1       0                   0
Becca    2                   1       0                   2
Carol    0                   3       0                   2
Diana    0                   3       0                   1
Robert   3                   0       1                   1
Table 7.
Frequency of Discussions After Predict-Observe-Explain and Short Answer Prompts
         After POE                          After SA
Teacher  Held   Not Held   Total Possible   Held   Not Held   Total Possible
Aden     1      2          3                1      2          3
Andy     2      1          3                2      1          3
Becca    0      2          2                2      1          3
Carol    1      2          3                3      0          3
Diana    2      1          3                3      0          3
Robert   2      1          3                2      1          3
Table 8.
Mean Duration of Each Reflective Lesson Element, by Teacher
         Graph               Predict-Observe-    Short Answer        Predict-Observe     Concept Map
                             Explain
Teacher  N   Mean    SD      N   Mean    SD      N   Mean    SD      N   Mean    SD      N   Mean    SD
Aden     3   30.00   7.21    3   34.33   2.08    2   11.00   1.41    3   11.67   7.51    2   49.00   7.07
Andy     2   14.00   1.41    3   42.00   8.54    3   24.67   16.44   3   26.33   12.01   a   a       a
Becca    2   47.50   14.85   2   21.50   7.78    3   36.00   2.65    1   31.00   b       2   61.50   41.72
Carol    3   30.00   15.72   3   43.33   27.50   3   45.33   20.53   3   20.33   6.66    1   42.00   b
Diana    2   32.00   12.73   3   23.67   20.23   3   31.33   16.56   3   16.33   3.79    2   84.50   9.19
Robert   3   24.00   12.29   3   32.33   7.02    3   16.67   10.79   2   13.50   2.12    2   68.00   22.63
a No videotapes submitted
b Only one lesson submitted, so standard deviation (SD) not calculated
Table 9.
Number of Class Periods per Reflective Lesson, by Teacher
         Reflective Lesson             Total # of
         4     6     7     10    11    class periods
Guide    2-3   1     2-3   2-3   1     8-11
Aden     3     1     3     4     1     12
Andy     3     0     2     3     0     8
Becca    3     2     4     2     2     13
Carol    6     2     5     3     0     16
Diana    3     2     3     3     3     14
Robert   4     2     4     3     2     15
Mean     3.67  1.5   3.5   3     1.33  13
Note: ‘0’ in the table above indicates that we do not have video data for the teachers on this lesson; it is possible that RL elements were implemented and not taped.
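
The totals and means in Table 9 follow directly from the per-lesson counts. The short sketch below recomputes them as a consistency check; the variable names are chosen for this illustration only, and a ‘0’ entry is summed as zero even though, as noted above, it may reflect missing video rather than a lesson that was not taught.

```python
# Class periods per Reflective Lesson (lessons 4, 6, 7, 10, 11), from Table 9.
lessons = [4, 6, 7, 10, 11]
class_periods = {
    "Aden":   [3, 1, 3, 4, 1],
    "Andy":   [3, 0, 2, 3, 0],
    "Becca":  [3, 2, 4, 2, 2],
    "Carol":  [6, 2, 5, 3, 0],
    "Diana":  [3, 2, 3, 3, 3],
    "Robert": [4, 2, 4, 3, 2],
}

# Row totals: class periods summed across the five Reflective Lessons.
totals = {teacher: sum(counts) for teacher, counts in class_periods.items()}

# Column means: average number of class periods per lesson across teachers.
lesson_means = [
    round(sum(column) / len(column), 2)
    for column in zip(*class_periods.values())
]

# Mean of the row totals across the six teachers.
mean_total = sum(totals.values()) / len(totals)

for teacher, total in totals.items():
    print(f"{teacher}: {total} class periods")
print("Mean per lesson:", dict(zip(lessons, lesson_means)))
print("Mean total:", mean_total)
```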
Table 10.
Percentage of Whole-Class Discussion Minutes with Students Providing Evidence
Teacher  Evidence Provided
Aden     39.4
Andy     16.9
Becca    17.6
Carol    25.8
Diana    28
Robert   11.8
Mean     23.3
Table 11.
Pre- and Posttest Multiple-Choice Mean and Standard Deviation by Teacher
         Pre-Test (Max = 35)   Post-Test (Max = 43)
Teacher  n    Mean    SD       n    Mean    SD
Aden     29   14.55   5.08     28   21.78   7.19
Andy     26   13.11   3.68     22   22.90   5.9
Becca    20   12.25   2.75     16   20.25   5.73
Carol    19   13.15   4.48     17   28.11   6.27
Diana    25   15.52   3.60     23   24.78   4.49
Robert   26   16.00   4.57     23   20.47   6.88
Table 12.
Teacher Rankings Across Critical Aspects and Pre-Posttest Gain Scores
         Enacted                                       Learned
         Quality of Delivery                           Pre-Posttest Gain Scores
Teacher  1     2     3     4     5     Mean   Rank     Mean    SD     Rank
Aden     n/a   5     1     4     1     2.8    2        7.42    5.62   4
Andy     n/a   2     3     5     5     3.8    4        10.33   5.32   2
Becca    n/a   3     5     6     4     4.5    6        7.81    5.75   5
Carol    n/a   1     5     1     3     2.5    1        15.17   4.27   1
Diana    n/a   4     2     3     2     2.8    3        9.17    4.18   3
Robert   n/a   5     4     2     6     4.3    5        4.81    5.83   6
Note: Rank of 1 indicates teacher was most consistent with critical aspect, 6 indicates teacher was least consistent
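
One way to summarize the relationship between the two Rank columns of Table 12 is a rank correlation. The sketch below computes Spearman's rho from the tabled ranks using the standard no-ties formula. The choice of Spearman's coefficient is an assumption made here for illustration; this section does not state which coefficient, if any, the authors computed, and with only six teachers any such value should be read cautiously.

```python
def spearman_rho(ranks_x, ranks_y):
    """Spearman rank correlation for two rankings without ties.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the
    difference between paired ranks.
    """
    n = len(ranks_x)
    sum_d_squared = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Ranks transcribed from Table 12 (1 = most consistent / largest gain).
teachers = ["Aden", "Andy", "Becca", "Carol", "Diana", "Robert"]
enacted_rank = [2, 4, 6, 1, 3, 5]
learned_rank = [4, 2, 5, 1, 3, 6]

rho = spearman_rho(enacted_rank, learned_rank)
print(f"Spearman rho between enacted and learned ranks: {rho:.3f}")
```

With the tabled ranks the squared rank differences sum to 10, which gives a rho of roughly 0.71.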
Figure Captions
Figure 1. Relationship between intended, enacted, and achieved curricula, and the aspects of
fidelity of implementation, Adherence and Quality of Delivery, and student learning.
Figure 2. Treatment structure and critical aspects of Reflective Lesson Type I
Figure 3. Percentage of time spent on each focus of instruction by teacher (number of one-
minute units in parentheses).
Figure 4. Percentages of eliciting strategies and total discussion time in minutes by teacher
across all Reflective Lessons
Figure 5. Percentages of tracking and clustering and total discussion time by teacher
Figure 6. Percentages of strategies to promote reasoning and total discussion time by teacher
Figure 7. Percentages of argumentation levels and total discussion time in minutes by teacher
Figure 8. Correlation between each teacher’s enacted and learned treatment rank
Intended              Enacted               Achieved
Treatment Structure   Adherence             Student Learning
Treatment Processes   Quality of Delivery
Figure 1
Figure 2
Figure 3
[Chart. Vertical axis: Percentage of Time (Minutes), 0% to 100%. Teachers: Aden (N=348), Andy (N=307), Becca (N=400), Carol (N=459), Diana (N=447), Robert (N=382). Legend: Teacher Talk/Task Setting, Individual Work, Small Group Work, Whole Class Discussion.]
Figure 4
[Chart. Vertical axis: Percentage of Time (Minutes), 0% to 100%. Teachers: Aden (N=66), Andy (N=85), Becca (N=91), Carol (N=120), Diana (N=157), Robert (N=102). Legend: None, Collecting, Class Presentation, Voting.]
Figure 5
[Chart. Vertical axis: Percentage of Time (Minutes), 0% to 100%. Teachers: Aden (N=66), Andy (N=85), Becca (N=91), Carol (N=121), Diana (N=151), Robert (N=102). Legend: None, Displaying, Clustering.]
Figure 6
[Chart. Vertical axis: Percentage of Time (Minutes), 0% to 100%. Teachers: Aden (N=66), Andy (N=84), Becca (N=91), Carol (N=120), Diana (N=150), Robert (N=93). Legend: None, Asking for Explanations, Asking Follow-up Questions.]
Figure 7
[Chart. Vertical axis: Percentage of Time (Minutes), 0% to 100%. Teachers: Aden (N=66), Andy (N=84), Becca (N=91), Carol (N=120), Diana (N=150), Robert (N=93). Legend: None, Ss respond to teacher questions, Ss respond to each other.]
Figure 8
[Scatter plot. Horizontal axis: Enacted Treatment Rank (0 to 6). Vertical axis: Learned Treatment Rank (0 to 6). Points labeled by teacher: Carol, Aden, Diana, Andy, Robert, Becca.]