Using Classroom Artifacts to Measure Instructional Practice in Middle School
Mathematics: A Two-State Field Test
Hilda Borko, Suzanne Arnold, Beth Dorman, Karin Kuffner (CU-Boulder)
Brian Stecher, Mary Lou Gilbert, Alice Wood (RAND Corporation)
CRESST Conference 2004, September 10, 2004
Artifact Packages for Characterizing Instructional Practice:
A Validation Study
Goal: an instrument to capture instructional practice reliably and efficiently
Rationale for artifact packages
Richer descriptions than surveys
Fewer resource demands than case studies
Validation study to investigate reliability and validity
The Scoop Metaphor
“What is it like to learn mathematics in your classroom?”
A Scoop of Classroom Material
One way that scientists study unfamiliar territory (e.g., freshwater wetlands, Earth’s crust) is to scoop up all the material they find in one place and take it to the laboratory for careful examination. Analysis of a typical Scoop of material can tell a great deal about the area from which it was taken.
We would like to do something similar in classrooms, i.e., scoop up a typical week’s worth of material and use it to learn about the class from which it was taken. The artifacts would include assignments, homework, tests, projects, problem solving activities, and anything else that is part of instruction during the week.
The Scoop Notebook
“Scoop” a typical week’s worth of instructional materials
Variety of methods for capturing instructional practice
Daily calendar
Instructional materials
Samples of student work
Photographs
Teacher Reflections
Methods
Participants
36 middle school mathematics teachers
Teachers from Colorado (23) and California (13)
Variety of curricula, traditional to reform
Data from 30 teachers used in reliability and validity analyses
Data collection
Scoop Notebook completed by teacher (5 days of instruction)
Researcher observation and ratings (2-3 days)
Audiotape of instruction (8 teachers, 2-3 days)
Scoring Guide
11 Dimensions of Classroom Practice
Collaborative Grouping
Structure of Lessons
Multiple Representations
Use of Mathematical Tools
Cognitive Depth
Discourse Community
Explanation & Justification
Problem Solving
Assessment
Connections & Applications
Overall
(Notebook Completeness)
(Confidence)
Rating Observations and Notebooks
Five-point rating scale
Scoring Guide with descriptions and examples for each dimension at three levels: high (5), medium (3), low (1)
Scoring Guide Example: Problem Solving
Overall Description: Extent to which instructional activities enable students to identify, apply and adapt a variety of strategies to solve problems. Extent to which problems that students solve are complex and allow for multiple solutions. [NOTE: this dimension focuses more on the nature of the activity/task than the enactment. To receive a high rating, problems should not be routine or algorithmic; they should consistently require novel, challenging, and/or creative thinking.]
High: Students work on problems that are complex, integrate a variety of mathematical topics, and draw upon previously learned skills. Problems lend themselves to multiple solution strategies and have multiple possible solutions. Problem solving is an integral part of the class’ mathematical activity, and students are regularly asked to formulate problems as well as solve them.
Example: During a unit on measurement, students regularly solve problems such as: “Estimate the length of your family’s car. If you lined this car up bumper to bumper with other cars of the same size, about how many car lengths would equal the length of a blue whale?” After solving the problem on their own, students compare their solutions and discuss their solution strategies. The teacher reinforces the idea that there are many different strategies for solving the problem and a variety of answers because the students used different estimates of car length to solve the problem.
Ratings of Instructional Practice
Notebook Only: Contents of Scoop Notebook
Gold Standard: Observations and contents of Scoop Notebook
Notebook + Discourse: Transcripts of audio-taped classroom lessons and contents of Scoop Notebook
Reliability Research Questions
Do raters agree on the scores they assign to the dimensions of classroom practice, based on the Scoop Notebook?
Is agreement among raters higher for some dimensions than others?
Is agreement among raters higher for some teachers than others?
Agreement Among Raters: Calculation Procedures
Three raters per notebook; pairs of ratings compared
1-2-3: three pairs (1,2), (1,3), & (2,3)
Exact agreement = 0%
Within 1 rating point = 67%
4-4-1: three pairs (4,4), (4,1), (4,1)
Exact agreement = 33%
Within 1 rating point = 33%
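The pairwise calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the study's own analysis code; the function name pairwise_agreement and its tolerance parameter are assumptions introduced here.

```python
from itertools import combinations


def pairwise_agreement(ratings, tolerance=0):
    """Share of rater pairs whose ratings differ by at most `tolerance` points.

    Hypothetical helper: tolerance=0 gives exact agreement,
    tolerance=1 gives agreement within 1 rating point.
    """
    pairs = list(combinations(ratings, 2))  # three raters -> three pairs
    agreeing = sum(1 for a, b in pairs if abs(a - b) <= tolerance)
    return agreeing / len(pairs)


# The two worked examples from the slide.
for ratings in [(1, 2, 3), (4, 4, 1)]:
    exact = pairwise_agreement(ratings, tolerance=0)
    within_one = pairwise_agreement(ratings, tolerance=1)
    print(ratings, f"exact={exact:.0%}", f"within 1={within_one:.0%}")
```

For the ratings 1-2-3 this prints exact=0% and within 1=67%; for 4-4-1 it prints 33% for both, matching the figures above.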
Agreement by Dimension
Average ratings across teachers close to 3.0 for all dimensions
Relatively high levels of agreement for all dimensions
Exact agreement ranged from 21.1% to 44.3%
Agreement within 1 point ranged from 70.1% to 82.3%
Agreement fairly consistent across dimensions
Agreement by Teacher
Wide range of values
Average notebook ratings (1.55 to 4.21)
Exact agreement: 12.0% to 60.5%
Agreement within 1: 57.5% to 97.0%
No apparent relationship to:
Average notebook rating (traditional versus reform practices)
Notebook completeness
Rater confidence
Validity Research Questions
1. Do ratings based only on the Scoop Notebook agree with ratings based on the Scoop Notebook and classroom observations (“Gold Standard” ratings)? Is agreement higher for some dimensions than others? Is agreement higher for some teachers than others?
2. Are there differences in the ratings of Colorado teachers and California teachers?
3. Do ratings based on the Scoop Notebook and transcripts of classroom lessons agree with Gold Standard ratings?
Methods Similar to the Reliability Analysis
Comparisons between average Notebook Only rating (averaged across 3 raters) and Gold Standard rating
Two levels of agreement (on 5-point scale)
Within 0.33
Within 0.67
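The comparison above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the Gold Standard rating is assumed to be a single whole number, the thresholds 0.33 and 0.67 are read as 1/3 and 2/3 (the average of three whole-number ratings always falls on a multiple of 1/3), and the function names and sample ratings are hypothetical.

```python
from fractions import Fraction


def notebook_vs_gold(notebook_ratings, gold_rating):
    """Absolute gap between the mean Notebook Only rating and the Gold Standard rating.

    Fractions are used so that gaps of exactly 1/3 or 2/3 are not
    misclassified by floating-point rounding.
    """
    mean = Fraction(sum(notebook_ratings), len(notebook_ratings))
    return abs(mean - Fraction(gold_rating))


def agreement_level(gap):
    """Tightest band the gap falls into (assumed thresholds: 1/3 and 2/3)."""
    if gap <= Fraction(1, 3):
        return "within 0.33"
    if gap <= Fraction(2, 3):
        return "within 0.67"
    return "beyond 0.67"


# Hypothetical ratings for one teacher on one dimension.
gap = notebook_vs_gold((3, 3, 4), gold_rating=3)
print(float(gap), agreement_level(gap))
```

Note that a case counted as within 0.33 also counts toward the within-0.67 percentage; the helper reports only the tightest band.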
Agreement by Dimension
Moderately high levels of agreement for all dimensions
Agreement within 0.33 ranged from 30.0% to 53.3% across the 11 dimensions
Agreement within 0.67 ranged from 43.3% to 66.7%
Differences in agreement among dimensions make sense
Structure of Lessons “easy” to rate
Mathematical Discourse and Assessment more “difficult” to rate
Agreement by Teacher
Pattern similar to reliability data
Large differences among teachers in levels of agreement
Agreement within 0.33 ranged from 9.09% to 81.8%
Agreement within 0.67 ranged from 9.09% to 90.0%
Level of agreement is not related to:
Average notebook rating
Notebook completeness
Rater confidence
Notebooks Detect Known Differences in Curriculum
Average ratings differed for teachers using traditional vs. reform-based curricula
Notebook ratings: 3.42 vs. 2.59
Gold standard ratings: 3.47 vs. 2.30
Differences between ratings varied by dimension and match known differences in the curricula
Ratings most alike on Structure of Lessons and Assessments
Ratings most different on Cognitive Depth, Discourse Community, etc.
Validity Analyses with Classroom Transcripts
How do the ratings based on the Scoop Notebook and transcripts of classroom lessons compare to Gold Standard ratings?
To what extent does analysis of classroom discourse provide additional insights about instructional practices?
Discourse Plus Scoop Notebook vs. Gold Standard
Exact agreement occurred in 45.4% of cases
Range across dimensions: 14.3% (Grouping) to 71.4% (Structure of Lessons)
Agreement within 1.0 point occurred in 92.2% of cases.
Agreement within 1 was 100% for 7 of 11 dimensions
In general, relatively high levels of agreement
Qualitative Analysis
On which dimensions does discourse provide more information and insights than the Scoop Notebook alone?
Mathematical Discourse Community
Explanation/Justification
Cognitive Depth
Connections/Applications
Assessment
Additional Insights: Mathematical Discourse Community
How teacher solicits, explores, & attends to student thinking
How teacher models & emphasizes use of mathematical language
Student-to-student communication
Common classroom discourse patterns (e.g., IRE; more open ended)
Conclusions: Feasibility of the Approach
Teachers were interested, supportive, and cooperative
Teachers were able to follow artifact collection instructions well
Notebooks returned in timely manner
Student work represented a broad range of curriculum and instructional activities
Photographs and reflections were descriptive
Conclusions: Reliability and Validity
Agreement among raters is reasonably high for all dimensions and very high for some
Agreement between Notebook Only ratings and Gold Standard ratings is moderately high for all dimensions
Some dimensions and teaching practices present greater challenges than others for artifact-based tools such as the Scoop Notebook
Raters reported struggling with some dimensions (e.g., Mathematical Discourse Community) more than others
Information about classroom discourse provides additional insights about some dimensions
Disagreements among raters may be greater when there are inconsistencies in the data
Implications and Future Directions
Scoop Notebook is useful for describing instructional practice in broad terms
Results do not support use of the Scoop Notebook to make judgments about individual teachers
Additional research needed to answer questions such as:
Why are some classrooms and teachers more difficult to rate than others?
Are there systematic differences among individual raters?
Possible future uses of the Scoop Notebook
Tool for professional development
Trace changes in teachers over time or across different instructional units