Ronald Carriveau, Ph.D. and Richard Herrington, Ph.D., University of North Texas
http://www.unt.edu/rss/rich/IUPUI/
Page 1: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Ronald Carriveau, Ph.D. and Richard Herrington, Ph.D., University of North Texas

http://www.unt.edu/rss/rich/IUPUI/

Page 2: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Outcomes

An introduction to the UNT Student Evaluation of Teaching Effectiveness (SETE) survey.

Steps for developing a teaching evaluation survey.

An online handbook of Questions and Answers regarding survey development and implementation.

An introduction to the validity study process and scale score development (with online access).

Page 3: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

The Value of Teacher Evaluations

Clicker survey on participants’ opinions

Page 4: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

1. How many years have you been teaching or working at the college or university level?

A. 1 to 4 years
B. 5 to 10 years
C. More than 10 years

Page 5: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

2. During your career, how often have you been evaluated by students as a job requirement?

A. Seldom or never
B. Often
C. Almost always

Page 6: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

3. To what degree do you consider student evaluations of teachers to be high stakes?

A. Low
B. Medium
C. High

Page 7: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

4. How important are teacher evaluations for improving instruction?

A. Not important
B. Somewhat important
C. Very important

Page 8: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

5. How valid are student evaluations as a measure of teaching effectiveness?

A. Not valid
B. Moderately valid
C. Highly valid

Page 9: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

6. How much of an influence is the instructor’s enthusiasm and style on student ratings?

A. Small influence
B. Moderate influence
C. Big influence

Page 10: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

7. To what degree are you involved with teacher evaluation at your institution?

A. Small
B. Moderate
C. High

Page 11: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Why Student Evaluation of Teacher Effectiveness

Growing concern for accountability and self-accountability

Need to evaluate instructors in terms of their instruction

Used in decisions about pay, promotion, and tenure.

Provide feedback to faculty on their instruction

Improve teaching/instruction

Improve student learning

Page 12: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Primary Challenge

How do we get from the raw scores on a teacher effectiveness survey to validated scale scores that are psychometrically and legally defensible?

Need to overcome the lack of validity inherent in using raw score means and medians.

Page 13: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Development Challenges

Need for a clearly defined purpose and what exactly to measure

Transparency and inclusion to get faculty buy-in

Conduct a comprehensive item selection process

Conduct faculty and student focus groups

Develop a defensible item response scale

Page 14: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Committee's charge from the Provost
Need a common scale across all courses for individual effectiveness scores and for making comparisons

Committee's recommended purpose
Measure 'teaching effectiveness' versus 'course effectiveness'

HB 2504 – Transparency
Texas law requires posting of teacher evaluation scores

Page 15: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Transparency and Inclusion for Faculty Buy-in

Committee makeup: professors, lecturers, Faculty Senators, staff, and a student

Faculty Senate Executive Board and Chair/President: get the support of the Chair/President first

Faculty Senate: make the initial presentation and give updates. A challenge is that the Senate may resent that the directive came from the Provost (i.e. the administration) rather than originating with Senate input and approval.

Page 16: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

The item selection process vs. writing the items

Four-step process starting with 3,000-plus items

Initial screening – measurement intent and clarity

Rubric Scoring #1 – teaching vs course effectiveness

Rubric Scoring #2 - match items to effectiveness factors

Faculty and Student validation via Focus Groups and Surveys – importance, observable, student judgment

Page 17: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Faculty And Student Focus Groups

Stratified random sampling to select faculty across all colleges by proportion of enrollment (i.e. representativeness).

Standardized focus group questions and consent form for IRB
Necessary for the focus groups, field test, and research

Qualitative analysis
Analyze all focus group comments, before and after the field test

Developmental field test
Small group of students with post-administration analysis and discussion

Page 18: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Faculty and Student Agreement On Items

Clicker Slides

Page 19: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

My instructor is knowledgeable about the course content.

A. Something the student can judge
B. Not something the student can judge

Page 20: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

The book my instructor chose for the course was appropriate.

A. Something the student can judge
B. Not something the student can judge

Page 21: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

My instructor used appropriate instructional materials.

A. Something the student can judge
B. Not something the student can judge

Page 22: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Developing the Item Response Scale
Research based

Multiple references, one primary reference as a guide.

Ronald A. Berk, Thirteen Strategies to Measure College Teaching: A Consumer’s Guide to Rating Scale Construction and Assessment.

A four-point agreement scale

Be prepared for challenges:
Anchor points on the scale
No middle point
No NA option
Positive/negative statements

1 = Strongly Disagree, 2 = Disagree, 3 = Agree, 4 = Strongly Agree

Page 23: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

What are the factors of teaching effectiveness that are measured?

In addition to the overall construct of teaching effectiveness, three specific factors are measured. The twelve items on the SETE were chosen from the final pool of 28 usable items, and the four items chosen for each factor best represent the overall factor of teaching effectiveness. (SETE rating instrument: not to be used without legal permission. Copyright © 2010 by University of North Texas.)

Factor 1: Organization and Explanation of Materials
1. My instructor explains difficult material clearly.
2. My instructor communicates at a level that I can understand.
3. My instructor makes requirements clear.
4. My instructor identifies relationships between and among topics.

Factor 2: Learning Environment
1. My instructor establishes a climate of respect.
2. My instructor is available to me on matters pertaining to the course.
3. My instructor respects diverse talents.
4. My instructor creates an atmosphere in which ideas can be exchanged freely.

Factor 3: Self-Regulated Learning
1. My instructor gives assignments that are stimulating to me.
2. My instructor encourages me to develop new viewpoints.
3. My instructor arouses my curiosity.
4. My instructor stimulates my creativity.

Page 24: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Challenges and Questions From Faculty and Senate

What is the purpose of the SETE?

Facilitate student evaluation of their instructors, allowing university-wide comparison in key areas (charge from the Provost). It should not be the only measure.

Provide a measure of teaching effectiveness as perceived by students.

Provide scores for a particular instructor that can be used for self evaluation and improvement and for measuring improvement over time.

Provide scale scores that can be aggregated into group scores for use by administrators.

Provide teacher evaluation information for UNT.

Satisfy the requirements of House Bill 2504, which calls for transparency in reporting and posting to the web.

Page 25: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Challenges and Questions Cont. How were the SETE items selected?

It was determined that every effort should be made to find existing surveys and published lists of survey items and to evaluate them for usefulness versus writing new items.

Validity evidence to support selection decisions was collected through seven faculty focus groups, four student focus groups, faculty and student interviews, results from a survey sent to faculty, surveys sent to students, and an item tryout field test, plus scoring rubric results from the committee members’ evaluation of items.

The item selection process started with a pool of approximately 3,000 survey items, including all current UNT department surveys and published item lists in use at more than 100 universities. After an initial screening for measurement intent and clarity, this large pool was narrowed to 1,488 items. Evaluating these items with rating scales reduced this number to 788, and a second evaluation matching items to specific effectiveness elements reduced the number to 346.

Page 26: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Challenges and Questions cont.

How were the SETE items selected? (cont.)

Using specific scoring criteria to qualify items for inclusion, committee members reduced the number of items to 51. These 51 items were then presented to students in a developmental field test, and a final draft selection of 38 items was made based on the field test results and faculty focus groups.

A final review was conducted using the criteria of student viewpoint, student observability, statement measurability, conformity to the research elements, duplication, and universality in terms of class size and of online and in-class administration. The result of this process was the final survey item pool of 28 statements.

Over 400 people were involved in the process (after the validity study started, over 550 people had been involved).

Page 27: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Challenges and Questions From Faculty and Senate

How were the final 12 survey items chosen?

The second phase of the SETE development included three teams made up of faculty and staff who specialized in or had experience with assessment development and psychometrics. Team A conducted the validity study and psychometrics; Team B administered spring pilot tests of the SETE items and conducted follow-up faculty and student focus groups; Team C developed open-ended response items.

The final 28 SETE items were pilot tested using stratified sampling across the University. The pilot test was administered at the end of the spring 2009 semester, and a validity study team was assembled to analyze the data, validate the model fit, conduct item reduction studies, and develop a scoring methodology. The result of the psychometric work was the 12-item survey that was administered across the university in the fall of 2009.

Page 28: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Challenges and Questions Cont.

Why was a four-point scale used for the SETE?

Research shows that beyond five points there are diminishing returns in terms of reliability. Additionally, information may be lost if the scale exceeds the respondent's ability to discriminate among the anchor points. A 28-item survey with a 4-point scale can yield high reliability coefficients.

It was determined that four anchor points were appropriate using a response scale of 1) Strongly Disagree, 2) Disagree, 3) Agree, and 4) Strongly agree.

Page 29: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Challenges and Questions Cont.

Why is there no midpoint position on the scale (i.e. neutral, uncertain, or undecided)?

Information is lost when a midpoint position is included in a set of bi-polar (i.e. both positive and negative) anchors that are intended to measure the degree (intensity) of a respondent’s opinion.

The neutral midpoint is also problematic because it lowers the mean for a teacher who receives high scores while providing no compensation for a teacher who receives low scores. From a measurement viewpoint, nothing is gained from a neutral response.

Berk (2006) states that, “For rating scales used to measure teaching effectiveness, it is recommended that the midpoint position be omitted and an even-numbered scale be used, such as 4 or 6 points.”

Page 30: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Challenges and Questions Cont.

Why is there no NA (not applicable) choice?

The use of NA was avoided because the teacher effectiveness scale will be used for a class level analysis, and every time a student chooses NA, that student’s scale score will be different because one or more of the items will not be part of the score. This is a major problem in terms of measurement, analysis, and validity.

The committee recognized that there are class conditions across the university (even on a teacher-effectiveness-only scale) that might require an NA option, so it followed recommended procedures for identifying which items might require an NA. Faculty and students were each asked to identify the items that they felt could not be observed by students across all classes and thus would require an NA. Identified items were eliminated.

Page 31: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Challenges and Questions Cont.

Why is there no item reversal (negative and positive items) to address response set bias?

This type of bias is referred to as acquiescence, the tendency to agree or give positive responses regardless of the content of the items (similar to the halo effect). A strategy used to minimize the effect of this survey-taking behavior is to word half of the statements positively and the other half negatively (in random order). However, this method does not eliminate the bias itself; it simply balances positive and negative wording so that the net effect of the bias on the total score cancels out.

Berk (2006) recommends that reversals may be appropriate for some scales, but not for teacher effectiveness scales because the positive/negative reversals can be confusing and result in increased response time and response errors. The SETE effectiveness scale is designed to rate the teacher’s positive behaviors, not negative ones.

Page 32: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Challenges and Questions Cont.

What is the applicability of SETE items to courses delivered online?

Application of the SETE items to online courses was a major consideration of the committee. Expertise in delivering online instruction was well represented in the committee. Additionally, input was gathered from faculty and student groups. Several online courses were included in the SETE field test in order to do a comparison of online versus not-online student responses.

The structural equation modeling used to confirm the structure of the student responses included online courses. Faculty and student review groups were convened at the beginning of the fall 2009 semester to confirm final recommendations regarding the usefulness of SETE survey items for online courses.

Page 33: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Challenges and Questions Cont.

What do my SETE scores mean? How should they be interpreted?

Your SETE scores are a measure of your students' perception of your teaching effectiveness. The scores are placed on a common scale across the University. In other words, all individual scores are on the same scale, so a score of, for example, 600 for a teacher of a course in one department or college has the same meaning in terms of teaching effectiveness as a score of 600 for a teacher of a course in a different department or college. To help with score interpretation, the following factor descriptions of effectiveness are provided on the individual teacher reports.

Page 34: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Challenges and Questions cont.

How should the scores be interpreted?

Each of the three effectiveness factors has its own unique scale, and thus each teacher gets a separate scale score for each factor. The overall construct of Teacher Effectiveness also has its own scale score, and thus is not simply the average of the factor scores.

Score band            Organization and Explanation   Learning Environment   Self-Regulated Learning   Overall Effectiveness
Highly Effective      710 – 981                      659 – 972              747 – 998                 702 – 998
Effective             438 – 709                      347 – 658              495 – 746                 406 – 701
Somewhat Effective    167 – 437                      35 – 346               243 – 494                 111 – 405

Page 35: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Challenges and Questions Cont.

What do my factor scores mean?

A high score means that you were perceived as having a high degree of what is shown in the description below. A low score means a low degree of what is shown.

Factor 1: Organization and Explanation of Materials This score reflects the student’s perception of how well the instructor:

makes the course requirements and student learning outcomes clear to the students; gives assignments, activities, and materials that are helpful and that contribute to understanding the subject; explains difficult material clearly; shows the relationships among topics and new concepts; and evaluates student work in ways that are helpful to learning.

Factor 2: Learning Environment This score reflects the student’s perception of how well the instructor:

establishes a climate of mutual respect and encouragement; motivates students to work and engage in learning; is available and encouraging; is skillful in actively engaging students in learning; and provides useful feedback.

Factor 3: Self-regulated Learning This score reflects the student’s perception of how well the instructor

guides and encourages self-directed learning in which the student is encouraged: to be open to the viewpoints of others; to develop new viewpoints; to connect course topics to a wider understanding of the subject; and to contribute to the learning process.

Page 36: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Challenges and Questions cont. Is my overall SETE score the average of my three factor scores?

No. The overall construct of Teacher Effectiveness has its own scale score, and thus is not simply the average of the three factor scores. A measurement model with appropriate external control variables is used in determining how items should be weighted when calculating individual scale scores for each factor.  This estimation process provides a reasonably fair and unbiased estimate of the individual scale scores as well as providing a high degree of reliability and generalizability to the scale scores.

Page 37: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Challenges and Questions cont.

The response rate for my class was very low. Does this make my SETE score invalid?

No. The scoring methodology used allows response rates as low as one or two students to be placed on the SETE scale. A weighted average methodology is used, similar to what is used for the GRE and similar tests, so that a valid point estimate can be calculated for these lower response rates in terms of class size and scale scores obtained. This lower-response-rate estimate may not be as interpretable (i.e. error free) as one based on a higher response rate, but it is psychometrically sound and usable, and it will be necessary when looking at scores over time.

Page 38: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Challenges and Questions cont.

What if there is a dramatic spike downward in my scale scores for a particular semester?

A dramatic spike upward or downward for a particular semester, relative to scores from several semesters, can be a concern when looking at continuous improvement. To address this, starting with the Fall 2010 administration, prediction methodologies will be applied that use information from previous semesters to smooth the scores across semesters, so that a more fair and reasonable interpretation of effectiveness scores can be made for purposes of continuous improvement.

Page 39: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Structure Of The SETE (Rich)

Two aspects of the theoretical model are of interest:
The behavioral domains
The inter-item relationships of the item domains

A particular factor analysis model is used to represent our theoretical SETE model:
A bi-factor model – a general effectiveness (G) factor and three sub-domain factors that are relatively independent in relation to G

Page 40: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Theoretical Model

Higher order construct is Teaching Effectiveness

Teaching effectiveness is modeled as three sub-domains:
Organization and explanation of materials
Learning environment
Self-regulated learning

Page 41: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

[Path diagram: the twelve SETE item descriptors (explains difficult material clearly; communicates at a level that I can understand; makes requirements clear; identifies relationships between and among topics; establishes a climate of respect; is available to me on matters pertaining to the course; respects diverse talents; creates an atmosphere of free exchange of ideas; gives assignments that are stimulating to me; encourages me to develop new viewpoints; arouses my curiosity; stimulates my creativity) load on a General Teaching factor and on three specific factors: Organization & Explanation, Learning Environment, and Self-Regulated Learning.]

SETE Bifactor Latent Variable Model: One General Factor and Three Specific Content Factors
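
For readers who want to see what a bifactor specification of this kind looks like in practice, here is a minimal sketch in R using the lavaan package. The item names (i1 through i12) and the data frame sete are hypothetical placeholders; the slides do not show the actual model syntax or estimation settings used at UNT (the authors note using Mplus and R).

    # Minimal bifactor CFA sketch (lavaan). Item names and the 'sete' data frame
    # are hypothetical placeholders, not the production SETE code.
    library(lavaan)

    bifactor_model <- '
      # general teaching effectiveness factor loads on all 12 items
      G        =~ i1 + i2 + i3 + i4 + i5 + i6 + i7 + i8 + i9 + i10 + i11 + i12
      # three specific content factors
      organize =~ i1 + i2 + i3 + i4        # organization & explanation
      environ  =~ i5 + i6 + i7 + i8        # learning environment
      selfreg  =~ i9 + i10 + i11 + i12     # self-regulated learning
    '

    fit <- cfa(bifactor_model,
               data       = sete,
               ordered    = paste0("i", 1:12),  # 4-point items treated as ordinal (polychoric)
               orthogonal = TRUE)               # specific factors independent of G
    summary(fit, fit.measures = TRUE, standardized = TRUE)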

Page 42: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Measurement Model Estimation, Survey Sampling Issues, and Validation

OVERVIEW
Measurement Issues
Factor Analysis
Short Form Item Selection
Survey Sampling Design
Case Weighting: Using Inverse Probability Weighting
External Control Variables
Contextual Effects and Multi-level ANOVA
Scale Score Development
Missing Values
Future Refinements for Implementation
Software and Data Processing
Questions

Page 43: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Domain of SETE – the behavioral objectives of teaching effectiveness
Domain – the elements to which a variable is limited
Domain score – the true score on an infinitely long set of items (i.e. the mean on all possible items)
Item domain – the universe of all possible items under consideration

Reliability
A reliability coefficient is the square of the correlation between the observed score and the domain score (varies between 0 and 1)
The reliability of the SETE varies between .90 and .95
Percentage of true score variance out of the total variance (true score variance plus error variance)

Construct Validity – the extent to which an item (score) measures the attribute it was designed to measure
Factor analysis is a primary methodological tool for examining the internal structure of items and provides evidence of validity
Homogeneous items within factors provide evidence of validity
Extensive conceptual/semantic analysis also lends evidence of validity (i.e. student and faculty focus group evaluations of the items)

Measurement Issues

Page 44: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Generalizability – the degree to which observed scores generalize to domain scores
A generalizability coefficient measures the generalizability of the observed items to a corresponding infinite behavioral domain

Omega coefficient (varies between 0 and 1)
The Omega coefficient is a measure of construct validity
Omega is both a reliability coefficient and a generalizability coefficient
The Omega coefficient of the 12-item SETE varies between .90 and .95

Measurement Invariance – the domain(s), as represented by the internal structure of the items, do not vary across sub-samples of the population
Sizes of factor loadings do not vary substantially across sub-samples (e.g. demographic variables: department, student major, gender, etc.)
Factor loading patterns do not vary substantially across sub-samples

Predictive Validity – the observed score accurately predicts other outcome measures of related interest (e.g. current semester rating predicts next semester rating with high accuracy)

Measurement Issues (cont.)
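
A brief sketch of how an omega coefficient of this kind can be estimated in R with the psych package; the data frame sete_items is a hypothetical placeholder, and the slides do not state which software routine was actually used.

    # Sketch: McDonald's omega for the 12 SETE items (psych package).
    # 'sete_items' (responses to the 12 items) is a hypothetical placeholder.
    library(psych)

    om <- omega(sete_items,
                nfactors = 3,     # three specific factors plus a general factor
                poly     = TRUE)  # polychoric correlations for the 4-point items

    om$omega_h     # omega hierarchical: saturation of the general factor (G)
    om$omega.tot   # omega total: overall reliability/generalizability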

Page 45: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Factor analysis is often referred to as both a measurement error modeling method, and a latent variable modeling method

Latent score is defined as a true score which is an unobservable outcome (i.e. factor scores)

Observed score = True score + Error (O = T + E)
Observed score variance = True score variance + Error variance
(In path diagrams, squares represent observed variables; circles represent unobserved latent variables)

Inter-correlation of multiple item scores contributes to our estimates of true scores and true score variance

In classical measurement theory, error is inferred: E = O – T

The ordinality of items suggests the use of a “polychoric correlation” matrix rather than a “Pearson correlation” matrix; both were used

Factor Analysis

Page 46: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Graphic schematic of a simple three item factor

Factor Analysis (cont.)

Page 47: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

The factor structure of the SETE is representative of more than 100,000 responses collected over three terms, and is currently representative of two different institutions: the University of North Texas and Texas Woman's University

Confirmatory Factor Analysis (CFA) fit results
RMSE < .04 (less than .05 is considered excellent)
GOF > .97 (greater than .95 is considered excellent)

Factor Analysis (cont.)
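
As one illustration, fit indices of this kind can be extracted in R from a fitted lavaan model with fitMeasures(); the cutoffs in the comment mirror the rules of thumb quoted on the slide, and the object fit refers to the hypothetical bifactor sketch shown earlier.

    # Sketch: extracting CFA fit indices (lavaan); 'fit' is the hypothetical
    # bifactor model object from the earlier sketch.
    library(lavaan)

    fm <- fitMeasures(fit, c("rmsea", "cfi", "tli", "srmr"))
    print(fm)

    # Paralleling the slide's criteria (an RMSEA-type index below .05 and a
    # goodness-of-fit index above .95 are commonly treated as excellent):
    if (fm["rmsea"] < .05 && fm["cfi"] > .95) {
      message("Fit meets the 'excellent' rules of thumb.")
    }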

Page 48: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Short form is used whenever the administration of the long form is problematic (e.g. fatigue)

The development of short forms (e.g. the 12-item version of the SETE) is a common practice

The developer of the short form should demonstrate the relative equivalence between the long form (e.g. the 28-item version of the SETE) and the short form in terms of reliability and generalizability (the 12- and 28-item SETE both have reliability and generalizability greater than .90)

The "iterative" use of factor analysis (e.g. three factors with 4 items per factor) in the selection of short forms can be problematic (e.g. fitting, removing item(s), refitting, etc.)
Each successive fit ignores all possible/potential multivariate relationships between items
This produces sub-optimal subsets of items

Short Form Item Selection

Page 49: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

The 12-item SETE was selected using a heuristic optimization method called Ant Colony Optimization (ACO)

Optimal subsets were selected from the final list of 28 items such that certain properties were maximized or minimized:
Maximized: large item loadings (correlation between observed score and latent score, minus measurement error)
Maximized: large correlations of factor scores with other external measures (e.g. "I would recommend this course to other people")
Maximized: large model fit indices (e.g. goodness of fit: 0-1)
Minimized: small RMSE (distance between model and data)

ACO selection is automated and produces near-optimal results; it evaluates possible configurations of items and selects those configurations that optimize important psychometric criteria

Multiple subsets (parallel forms) can be obtained automatically as well; these multiple subsets can be rank ordered in terms of fit

Short Form Item Selection (cont.)
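
The ACO procedure itself is involved (the Leite, Huang, & Marcoulides reference gives the full algorithm, and R implementations exist, e.g. in the ShortForm package), but the evaluation step it repeats can be sketched simply: draw a candidate 12-item subset, fit it, and score the fit. The sketch below uses random subsets purely to illustrate that step; a real ACO replaces the random draw with pheromone-weighted sampling. The item pool 'pool' (28 item names) and the data frame 'sete28' are hypothetical placeholders.

    # Illustrative sketch only: scoring candidate 12-item subsets by CFA fit.
    # A full ACO replaces the random draws with pheromone-weighted item sampling.
    library(lavaan)

    score_subset <- function(items, data) {
      model <- paste("G =~", paste(items, collapse = " + "))
      fit   <- try(cfa(model, data = data, ordered = items), silent = TRUE)
      if (inherits(fit, "try-error")) return(-Inf)
      fm <- fitMeasures(fit, c("cfi", "rmsea"))
      unname(fm["cfi"] - fm["rmsea"])   # reward good fit, penalize misfit
    }

    set.seed(1)
    candidates <- replicate(50, sample(pool, 12), simplify = FALSE)  # 'pool': 28 item names
    scores     <- vapply(candidates, score_subset, numeric(1), data = sete28)
    best_items <- candidates[[which.max(scores)]]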

Page 50: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Delivering course surveys through the web presents sampling issues

Non-random responses produce response bias
Existing "clusters" in the student population produce data that bias model estimates of teaching effectiveness

One approach is to post-stratify on “cluster” variables thereby reducing this confounding influence

The predominant approach in the survey sampling literature is to use case weighting and re-sampling methods to estimate and reduce this bias (e.g. inverse probability weighting)

Survey Sampling Design

Page 51: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Survey sampling methodology provides algorithms that are useful in estimating non-responder bias in survey samples

Inverse probability weighting (IPW) uses background variables (external control variables) to estimate "probability response classes"
e.g. background demographic information: department, college, major, gender, class size, grade assigned, grade expected, etc.

The inverse of these probabilities down-weights high-probability response classes and gives more weight to low-probability response classes

The effect of IPW is to reduce the relationship of the background variables with the principal outcome measure of interest (teaching effectiveness)

IPW also reduces the effect of bias on the model being estimated
e.g. relatively unbiased item loadings in the factor analysis

Case Weighting: Using Inverse Probability Weighting (IPW)
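
A minimal sketch of the IPW idea in R, assuming a simulated class roster: a logistic regression predicts each student's probability of responding from background variables, and responders are then weighted by the inverse of that probability. The variable names are illustrative, not the actual UNT variable set.

    # Sketch: inverse probability weighting for survey non-response.
    # The roster and variable names are simulated/illustrative.
    set.seed(42)
    roster <- data.frame(
      class_size     = sample(c(25, 60, 150), 500, replace = TRUE),
      expected_grade = sample(c("A", "B", "C"), 500, replace = TRUE),
      responded      = rbinom(500, 1, 0.55)
    )

    # 1. Model each student's probability of responding from background variables.
    resp_model <- glm(responded ~ class_size + expected_grade,
                      data = roster, family = binomial)
    roster$p_respond <- fitted(resp_model)

    # 2. Weight responders by the inverse of that probability: over-represented
    #    (high-probability) response classes are down-weighted, under-represented
    #    (low-probability) classes are up-weighted.
    responders <- subset(roster, responded == 1)
    responders$ipw <- 1 / responders$p_respond

    # The weights can then be used in weighted analyses, e.g.:
    # library(survey); svydesign(ids = ~1, weights = ~ipw, data = responders)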

Page 52: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

The 12-item SETE modeling process looked at course-level, faculty-level, and student-level background variables
9 background measures were selected from a total of 17 variables
Prior to this, the 17 variables were pre-selected from 30 variables based on relevance

Course level: course size, in-class vs. internet, instruction type, time course held

Faculty level: status (e.g. lecturer, assistant professor, full professor), age, number of years employed at the institution, gender of instructor

Student level: anticipated grade, actual grade assigned, mean GPA, academic level, current course load, student's gender, total credits earned, pre-requisites present (yes/no)

Department and student major were handled by using "multi-level ANOVA" methods (contextual modeling) – more on this later

External Control Variables (Background Variables)

Page 53: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Selecting the "best" external control variables is important, since non-relevant variables increase error variability and reduce the predictive validity of the SETE items

Bayesian Model Averaging (BMA) was used to select the best subset of variables from the 17 for predicting general teaching effectiveness (G)

The BMA model selection strategy can select models that have relatively better prediction accuracy compared to models with smaller posterior probabilities

External Control Variables (cont.)
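
One way to run a BMA variable selection of this kind in R is the BMA package's bicreg() function; the sketch below uses simulated placeholders for the background variables and the G scores, since the slides do not show the actual call.

    # Sketch: Bayesian Model Averaging over candidate control variables (BMA package).
    # X and g are simulated placeholders for the background variables and G scores.
    library(BMA)
    set.seed(1)
    X <- data.frame(matrix(rnorm(200 * 6), ncol = 6))  # 6 candidate background variables
    g <- rnorm(200)                                    # general teaching effectiveness scores

    bma_fit <- bicreg(x = X, y = g)
    summary(bma_fit)   # posterior model probabilities and per-variable inclusion probabilities

    # Variables with high posterior inclusion probability would be retained
    # as external control variables.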

Page 54: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

The most relevant background variables accounted for about 9% of the total variance in general teaching effectiveness (G): 6 student-level variables, 2 course-level variables, and 1 faculty-level variable

A relative importance metric was generated for the 9 variables
This metric decomposes the variance accounted for in G (9%) into non-overlapping components of variance
Relative importance allows a rank ordering of the importance of the variables

IPW with these 9 external control variables reduced the variance accounted for in G from 9% to about 2-3%

External Control Variables (cont.)
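
A variance decomposition of this kind can be computed in R with the relaimpo package, which implements the estimators in the Grömping (2007) article cited in the references; the data below are simulated placeholders, not the UNT variables.

    # Sketch: relative importance (non-overlapping R^2 shares) via relaimpo.
    library(relaimpo)
    set.seed(2)
    dat <- data.frame(G                 = rnorm(300),
                      class_size        = rnorm(300),
                      anticipated_grade = rnorm(300),
                      course_load       = rnorm(300))
    fit <- lm(G ~ class_size + anticipated_grade + course_load, data = dat)

    # "lmg" decomposes the model R^2 into one non-overlapping share per predictor,
    # which can then be rank ordered by importance.
    calc.relimp(fit, type = "lmg", rela = TRUE)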

Page 55: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

The academic literature on background variables influencing student ratings of faculty indicates that anticipated grade and class size are two of the largest influences on outcome ratings
Larger classes are associated with lower effectiveness ratings
A lower anticipated grade is associated with lower effectiveness ratings

Our preliminary results support these findings from the literature

IPW can reduce the influence of background variable bias on rating outcomes
The 12-item SETE (without IPW) already has a relatively small amount of bias with regard to the background variables investigated (9%)
This bias was further reduced to a negligible amount (2-3%)
Department and student major have non-significant effects on the mean effectiveness rating (G)

We attribute these small effect sizes to the care with which the original 28 items were selected

External Control Variables (cont.)

Page 56: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Context effects refer to the differential influence of “level” specific variables (contextual variables) on outcome measures of interest

Examples:
Students are nested within courses
Students can be nested within student majors
Courses are nested within departments
Departmental influences on teaching may vary across departments
Students in the same course will respond more "similarly" (as compared to students in other courses) since they are exposed to the same instructor
Different courses may have different course sizes
The same course can vary in size across semesters

The nested structure of response units creates what is known as "within-class correlation" (intra-class correlation)
Within-class correlation may not bias mean estimates, but will likely create large bias for confidence intervals (confidence intervals too narrow)
The predictive accuracy of course means will also be lower

Contextual effects can be modeled with "multi-level" ANOVA methods

Contextual Effects: Department and Student Major

Page 57: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Analysis of variance (ANOVA) models between-group and within-group variation

Ideally, we would expect between-department variation in SETE ratings to be small compared to within-department variation in SETE ratings
This creates within-course correlation (intra-class correlation, also known as intra-class reliability)
Courses are nested within departments and vary in course size
This contributes to differences in response consistency across courses (i.e. differences in reliability across courses)

A strategy in dealing with varying reliabilities and varying course sizes is to base course ratings on a weighted average

Multi-level ANOVA

Page 58: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

This weighted course average is based on:

weighted_course_mean = r * course_mean + (1 – r) * pop_mean
  where r is the course reliability,
  course_mean is the mean of the course, and
  pop_mean is the mean of all course means

Course averages with low reliabilities are moved toward the population mean
Course averages with high reliabilities do not move as much toward the population mean
Reliabilities are a function of course size and response consistency

Weighted averages calculated in this manner are sometimes referred to as "Hierarchical Bayes" or "Empirical Bayes" estimates
Hierarchical Bayes estimates "borrow strength" across groups (pooling information across groups) to estimate group means
Hierarchical Bayes estimates are very good at reducing error (i.e. prediction error)

Multi-level ANOVA
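
Written out as a small R function, the reliability-weighted course mean above behaves as an empirical Bayes style shrinkage estimator; the reliabilities and means below are made-up examples.

    # Sketch: reliability-weighted (shrinkage) course mean from the formula above.
    weighted_course_mean <- function(course_mean, r, pop_mean) {
      # r = course reliability in [0, 1]; low-reliability courses shrink toward pop_mean
      r * course_mean + (1 - r) * pop_mean
    }

    pop_mean <- 3.2
    weighted_course_mean(course_mean = 3.8, r = 0.90, pop_mean)  # reliable course: 3.74
    weighted_course_mean(course_mean = 3.8, r = 0.40, pop_mean)  # noisy course: shrinks to 3.44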

Page 59: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Scale Score Development

Page 60: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Scale Score Development (cont.)

Page 61: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Accounting for non-response (missing values) is important for reducing bias in model estimates (e.g. means, factor loadings)

Simple (but inadequate) methods for dealing with missing values include: removing records with missing data, and mean substitution

Better methods exist that take into account the multivariate patterns in the complete and missing data when making a “data imputation” (e.g. maximum likelihood, multiple imputation)

Missing values in the SETE data are imputed using "k-nearest neighbor imputation" within a course
Nearest neighbors are records that have similar complete data patterns
Within a course, the average of the k nearest neighbors' complete data is used to impute the value for a variable that is missing its value
k-nearest neighbor imputation assumes missing at random (MAR), i.e. missingness depends only on the observed data; it is able to take advantage of multivariate relationships in the complete data
The drawback of k-nearest neighbors is that it does not include a component to model random variation; consequently, uncertainty in the imputed value is underestimated

Missing Values

Page 62: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Data within a course:

v1  v2  v3  v4   v5  v6
 3   3   4   3    4   4   -|
 3   3   4   3    4   4    |-  4 nearest
 3   2   4   4    4   4    |   neighbors
 3   2   4   4    4   4   -|

 3   2   4   ?    4   4   before imputation
 3   2   4   3.5  4   4   after imputation (3.5 = imputed value)

Missing Values (cont.)
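
A hand-rolled sketch of the within-course k-nearest-neighbor imputation illustrated above; the Euclidean distance and k = 4 are illustrative assumptions rather than the exact production routine.

    # Sketch: within-course k-nearest-neighbor imputation (illustrative assumptions).
    impute_knn <- function(x, k = 4) {
      for (i in which(rowSums(is.na(x)) > 0)) {
        miss_cols <- which(is.na(x[i, ]))
        obs_cols  <- which(!is.na(x[i, ]))
        # distance from the incomplete row to every other row, on the observed columns
        d  <- apply(x[-i, obs_cols, drop = FALSE], 1,
                    function(row) sqrt(sum((row - x[i, obs_cols])^2)))
        nn <- order(d)[seq_len(k)]                         # k most similar rows
        donors <- x[-i, , drop = FALSE][nn, , drop = FALSE]
        # impute with the mean of the nearest neighbors' values on the missing columns
        x[i, miss_cols] <- colMeans(donors[, miss_cols, drop = FALSE], na.rm = TRUE)
      }
      x
    }

    course <- rbind(c(3, 3, 4, 3,  4, 4),
                    c(3, 3, 4, 3,  4, 4),
                    c(3, 2, 4, 4,  4, 4),
                    c(3, 2, 4, 4,  4, 4),
                    c(3, 2, 4, NA, 4, 4))
    impute_knn(course)   # the NA becomes 3.5, the mean of the four neighbors' values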

Page 63: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Future implementations of SETE will focus on using SETE scores to predict teaching effectiveness one time period ahead (e.g. following semester) – dynamic measurement model

Current course ratings become a prior distribution of ratings used in predicting the following semester ratings

The current course ratings are calculated as a weighted average of the previous semester’s ratings (prior) and the current semester’s ratings (data) to produce a posterior estimate of the current semester’s rating

This posterior distribution of ratings is essentially a weighted average that lies between the previous semester rating and the current semester rating

This posterior distribution of ratings become the prior distribution for the following semester

This estimation procedure creates a moving average across semesters (a weighted average)

This moving average reduces unwanted variability across semesters due to fluctuating course size and also minimizes the effect of outlying semester ratings

Further Refinements for Implementation
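
A compact sketch of the prior-to-posterior averaging across semesters described above; the weight w and the example ratings are illustrative assumptions, since the actual SETE weights come from the measurement model.

    # Sketch: weighted prior/posterior updating of ratings across semesters.
    update_rating <- function(prior, current, w = 0.5) {
      # posterior = weighted average of last semester's estimate (prior)
      # and this semester's observed rating (data)
      w * prior + (1 - w) * current
    }

    semester_ratings <- c(620, 640, 480, 650)  # hypothetical scale scores; 480 is an outlying term
    smoothed <- Reduce(update_rating, semester_ratings, accumulate = TRUE)
    round(smoothed)   # 620 630 555 602: the dip is damped relative to the raw 480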

Page 64: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Two software systems were used in developing the SETE: Mplus and R

Mplus is commercial software available at http://www.statmodel.com

R is a free, open-source software system available at http://www.r-project.org/

Software and Data Processing

Page 65: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Any Questions?

References:

Test Theory: A Unified Treatment (McDonald, 1999)
Bayesian Analysis for the Social Sciences (Jackman, 2010)
Model Assisted Survey Sampling (Särndal et al., 1992)
Introduction to Variance Estimation, 2nd ed. (Wolter, 1997)
Forecasting with Exponential Smoothing (Hyndman, Koehler, Ord, & Snyder, 2008)
Relative importance: Grömping, U. (2007). Estimators of Relative Importance in Linear Regression Based on Variance Decomposition. The American Statistician, 61, 139-147.
Bayesian Model Averaging: http://www.stat.washington.edu/raftery/Research/bma.html
ACO: Leite, Huang, & Marcoulides (2008). Item Selection for the Development of Short Forms of Scales Using an Ant Colony Optimization Algorithm. Multivariate Behavioral Research, 43(3), 411-431.

Slide download: http://www.unt.edu/rss/rich/IUPUI/

Page 66: Ronald Carriveau, Ph.D. Richard Herrington, Ph.D.University of North Texas

Contact Information

MEETING THE CHALLENGES OF DEVELOPING A TEACHING EFFECTIVENESS INSTRUMENT THAT MEASURES COURSES ACROSS THE CAMPUS ON A COMMON SCALE

Slide download available at: https://www.unt.edu/rss/rich/IUPUI/
Ron: [email protected]
Richard: [email protected]

Notes:_________________________________________________________________________

