Developing Measurement Models for Complex Scenario-Based Assessment Tasks - Daisy Wise Rutstein, Geneva Haertel

Abstract

Purpose: This paper discusses how an Evidence-Centered Design (ECD) approach to assessment development (Mislevy & Riconscente, 2006) can aid in the identification of appropriate measurement models for complexly structured performance tasks. In an ECD approach, the connections between the student model (what we want to say about the student), the evidence model (which includes the measurement model), and the task model (the features of the task that provide evidence) are made explicit. Developing these connections aids in the identification of appropriate measurement models.

Theoretical Perspective: Recent developments in assessment have seen a shift from traditional standardized testing to include technology-enhanced, scenario-based assessment tasks (Quellmalz & Pellegrino, 2009). These assessments present opportunities in which students can engage in complex tasks such as designing investigations and manipulating representations of real-world tools. These tasks also present challenges to the assessment designer in determining how they should be scored. Traditional scoring methods such as IRT are often not appropriate, because the presentation of an overarching scenario and the measurement of multiple constructs violate the assumptions of local independence of items and unidimensionality. In addition, advances in technology have led to new and innovative item types, and one task might present several types of items that are not scored in the same way. This paper presents some of the issues that arise in developing the measurement model for scoring these complex scenario-based assessment tasks.

Methods: This paper presents three measurement models that can be applied. One is an extension of IRT that allows multiple dimensions to be measured, the MRCML model (Briggs & Wilson, 2003). Another is the family of diagnostic classification models (DCMs) (Henson, Templin, & Willse, 2009). The third is Bayesian Networks (Almond, DiBello, Moulder, & Zapata-Rivera, 2007). All of these models allow multiple dimensions to be incorporated into the measurement model and provide benefits when it comes to scoring. The paper provides background on each model and discusses the benefits and drawbacks of each based on the literature. It also demonstrates the application of these models to an example scenario-based assessment task.

Results: While no single measurement model applies in all situations, identifying the evidence needed to make valid inferences about the student, and the evidence that can be accumulated from the task, provides information that can be used to identify the appropriate measurement model for that task.

Significance: As more complex tasks are developed and used, identifying a measurement model that can best leverage the information provided by the task will aid assessment designers in their use and development of these types of tasks.

Part of the assessment development process is the development of a measurement model. The measurement model determines how the assessment task is scored and how scores provide information about the student. In a traditional assessment, points are assigned to each item, each item is scored for each student, and then either the percent correct is calculated or student ability levels are estimated using an item response theory (IRT) model. Cut scores can be used to group students and to provide feedback on the students’ ability levels.
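As a rough illustration of the traditional approach described above, the following Python sketch computes a percent-correct score and maps it to a performance level with cut scores. The item scores, the cut scores of 50 and 80, and the level labels are hypothetical values chosen for this example; they do not come from any assessment discussed in this paper.

```python
from bisect import bisect_right


def percent_correct(item_scores, max_points):
    """Return the percentage of available points that a student earned."""
    return 100.0 * sum(item_scores) / sum(max_points)


def performance_level(percent, cut_scores, labels):
    """Map a percentage to a label using ascending cut scores.

    A student at or above a cut score is placed in the higher group.
    """
    return labels[bisect_right(cut_scores, percent)]


# Hypothetical data: five items worth one point each.
scores = [1, 0, 1, 1, 1]
max_points = [1, 1, 1, 1, 1]

# Assumed cut scores and labels, for illustration only.
CUTS = [50.0, 80.0]
LABELS = ["below basic", "basic", "proficient"]

pct = percent_correct(scores, max_points)
level = performance_level(pct, CUTS, LABELS)
print(pct, level)
```

The single total score is what makes this approach depend on one underlying ability: every item contributes to the same sum regardless of what it measures.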

This method assumes that there is one underlying ability being measured and that the items on the assessment are independent. However, in complex scenario-based assessments these assumptions are often violated (Levy, 2013). The assessment developer should determine the effects of these violations and identify an appropriate measurement model to mitigate them.

This paper discusses how using an evidence-centered design approach can aid in the identification of an appropriate measurement model. In addition, three alternative measurement models are presented: MRCML, Bayesian Networks, and Cognitive Diagnostic Models (CDMs).

Evidence-Centered Design

Evidence-Centered Design (ECD) is a framework that makes explicit, and provides tools for, building assessment arguments (Mislevy & Riconscente, 2006; Mislevy, Steinberg, & Almond, 2003). ECD views assessment as an argument from imperfect evidence. It aims to make explicit the claims (the inferences that one intends to make based on scores) and the nature of the evidence that supports those claims (Mislevy, Haertel, Cheng, Ructtinger, et al., 2013).

The ECD framework includes the specification of three models: the student model (what inferences we want to make about the student), the evidence model (how evidence collected from the student provides information for the student model), and the task model (how the task can be structured to allow students to provide evidence for use in the evidence model). Figure 1 shows the relationship among these models. The task model provides information on how the items are presented to the students and the type of responses expected from the students. The evidence model shows how the responses are scored to obtain observable variables about the students (represented by the blue boxes in the evidence model) and then how these variables are aggregated to provide information for the student model(s). The student model defines the student model variables, which represent the inferences that are made about the students, and the relationships among these inferences.

Figure 1: Relationship of the ECD student, evidence, and task models

By following the ECD principles, these models are defined in the initial stage of assessment design and are refined throughout the development process. This process involves frequent iteration among the ECD models so that their final representations are aligned to the fully developed assessment task. This is important because it highlights the fact that the evidence model is not something that is developed after the assessment has been fully developed, but is instead an integral part of the assessment design process. (See Mislevy & Riconscente, 2006, for more information on ECD.)

Traditional measurement models

While classical test theory (Crocker & Algina, 1986) and IRT (Hambleton & Swaminathan, 1985) use different statistics to model characteristics of students and items, they both assume there is one main construct of interest, which corresponds to a student model with a single variable (Levy, 2013). In IRT this variable is the ability level of the student and is referred to as the student’s θ. Assessments are then developed such that the items on the assessment are assumed to be conditionally independent. This means that the probability of answering any one question correctly depends only on a student’s ability and not on any other factor (see Figure 2).

Figure 2: Representation of a unidimensional IRT measurement model
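To make the conditional independence assumption concrete, the sketch below uses the Rasch (one-parameter logistic) model as one example of a unidimensional IRT model. The choice of the Rasch model, the item difficulties, and the response pattern are assumptions made for illustration only. Under local independence, the likelihood of a whole response pattern given θ is simply the product of the individual item probabilities.

```python
import math


def rasch_probability(theta, difficulty):
    """Probability of a correct response under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def pattern_likelihood(theta, difficulties, responses):
    """Likelihood of a 0/1 response pattern given theta.

    The product form is valid only if the items are locally independent,
    that is, if responses are unrelated once theta is taken into account.
    """
    likelihood = 1.0
    for b, x in zip(difficulties, responses):
        p = rasch_probability(theta, b)
        likelihood *= p if x == 1 else (1.0 - p)
    return likelihood


# Hypothetical item difficulties and one student's responses.
difficulties = [-1.0, 0.0, 1.0]
responses = [1, 1, 0]

like = pattern_likelihood(0.0, difficulties, responses)
print(like)
```

If items within a scenario depend on one another, the product in `pattern_likelihood` no longer describes the data, which is one way to see why violations of local independence matter.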

With traditional methods, item rubrics are developed that provide a score for each item. These scores can be dichotomous or polytomous, depending on the complexity of the items. The scores are then combined (using statistics such as percent correct, or IRT models) to arrive at an overall ability estimate for each student on that construct.

A violation of the assumption of unidimensionality or local independence affects the validity of the inferences drawn from the assessment task (Hambleton & Swaminathan, 1985). If the assessment is measuring multiple constructs (and therefore is not unidimensional), but the method assumes only one construct is being measured, then the ability estimate for the student could be influenced by a construct that is not being modeled. For example, if a task is assumed to be measuring a student’s ability on science inquiry practices but the assessment also measures science content, then a student who is weak on the science content might display poor performance even if they are strong on inquiry skills. Similarly, if items in the assessment are dependent on each other, then poor performance on later items could be due to their dependence on earlier items, and therefore overall results could underestimate a student’s true ability.

Characterization of Scenario-based Assessment Tasks

Scenario-based, technology-enhanced assessment tasks are increasingly being developed and implemented for both formative and summative purposes in K-16 education (Quellmalz & Pellegrino, 2009). These assessments present opportunities in which students can engage in complex tasks such as designing scientific investigations and manipulating representations of real-world tools. While technology can be used to increase engagement with the task and to measure concepts that are difficult to measure in a paper/pencil format, scenario-based assessments do not have to be technology based.

In a scenario-based assessment, the tasks measure constructs by presenting items in a highly contextualized situation. For example, a commonly used approach to contextualizing science assessments is to present students with a scientific phenomenon to be explained and tools to support that investigation. The purpose of including this context is to provide students with a real-world scenario which not only supports their engagement with the task but also provides one or more purposes for performing the task (instead of just answering discrete questions for the sake of the assessment). These contexts often involve several skills, such as practice skills crossed with content skills. For example, a task that has the student construct a graph and relate the information in the graph to a scientific phenomenon relies on the student’s knowledge of graphing conventions, knowledge of the scientific phenomenon, and ability to interpret and explain the data presented in the graph. The complexity of the scenario depends in part on the number of constructs being measured and the number of goals the student must address to complete the task. Since more than one construct is being measured in such assessment tasks, the assumption of unidimensionality is often violated.

In addition, the use of a scenario-based context may require students to make connections across items, which could violate the assumption of local independence. For example, items within a scenario might require that students choose a hypothesis and independent and dependent variables, and then explain their choices. The explanation that the student provides will depend on the design choices they made.

Identification of observable variables

The first step in the development of the evidence model is to determine the observable variables. This step involves determining how the work products that the students produce will be evaluated. Typically this process is referred to in ECD as evidence identification and includes specification of the rules or rubrics that will be applied to the work products (Levy, 2013). This evaluation process results in the assignment of values to the observable variables. The next step is to identify the measurement models. This section discusses how ECD can help determine the observable variables. The next sections present different measurement models that can be used to aggregate the observable variables to provide information on the student model variables.

One of the tasks in the ECD process is to determine the alignment between the task model and the student model. This starts by clearly defining the student model variables that make up a particular student model. For example, we could design an assessment in science where two scores are to be generated: one that represents a student’s ability to perform scientific inquiry and a second that represents a student’s ability within a particular science content domain such as Biology. In this example there would be two student model variables, one associated with each score.

Particular knowledge, skills, and attributes are then associated with the construct represented by the student model variable. For a given assessment there can be multiple student model variables, each representing a different constellation of knowledge, skills, and attributes. Items are developed that map to the particular knowledge, skills, and attributes. This mapping involves specifying the products of the student work (work products) that allow the assessment designer to make inferences about the knowledge, skills, and attributes of the student. For example, if the student model variable is a student’s ability to perform scientific inquiry, then one skill associated with this student model variable is a student’s ability to generate a testable hypothesis. The hypothesis is the work product that the student generated. It provides evidence of the student’s ability to generate a hypothesis, which in turn provides evidence for the student’s ability to perform scientific inquiry. A student model variable most often includes several knowledge, skills, and attributes, and therefore the assessment will include multiple work products associated with that student model variable. Figure 3 illustrates a student model with several student model variables.

Figure 3: Example of a student model. The student model consists of the list of student model variables and the connections between these variables.

Once work products are identified, scoring rules need to be generated. Part of the ECD process is not only to define the type of work products but also to define the qualities of the work products that will be scored. It is from the specification of these qualities that rubrics can be developed. In our hypothesis example, one quality of interest is “how testable is the hypothesis”. Another quality may be “how well did the student relate the hypothesis to the scientific phenomenon being observed”. It is around these qualities that rubrics and scoring guides can be developed. Applying a rubric to the work product will produce observable variables.

An ECD process encourages the specification of work products and observable variables to be done prior to or during item development. Specifying these products and variables early allows the developer to reflect on what they need to score and how the scores are aligned to the student model variables. This alignment can be used to support the validity of the assessment.

In complex scenario-based assessments this is particularly important because individual items might be aligned to multiple student model variables. For example, if a student is required to explain the hypothesis he or she generated, then one quality of this explanation is the format of the hypothesis, which provides evidence of the student’s ability to perform scientific inquiry, while another quality of the explanation concerns the science content and provides evidence of the student’s ability in the specific content area. Recognizing that different qualities of the explanation contribute to different student model variables heightens the designer’s awareness that multiple scores should be produced for this one item.
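One way to picture a single item that yields several scores is the small Python sketch below. The quality names, the student model variable names, and the rubric scores are hypothetical, and the simple summation stands in for whatever measurement model is eventually chosen; it is not meant as a recommended scoring method.

```python
from collections import defaultdict

# Hypothetical rubric: each scored quality of the explanation yields one
# observable variable, and each observable is aligned to one student
# model variable (SMV).
RUBRIC = {
    "hypothesis_format": "scientific_inquiry",
    "content_accuracy": "biology_content",
}


def aggregate_by_smv(observables, rubric):
    """Sum observable scores within each student model variable."""
    totals = defaultdict(int)
    for quality, score in observables.items():
        totals[rubric[quality]] += score
    return dict(totals)


# Hypothetical rubric scores for one student's explanation.
explanation_scores = {"hypothesis_format": 2, "content_accuracy": 1}

totals = aggregate_by_smv(explanation_scores, RUBRIC)
print(totals)
```

Keeping the alignment in an explicit table such as `RUBRIC` mirrors the ECD practice of documenting which quality of a work product informs which student model variable.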

The identification of the work products becomes more intricate when it comes to technology-enhanced items, as the type of data that can be collected about the student differs from that collected in a paper/pencil assessment. For example, data about the order in which a student performs certain actions or the amount of time a student spends on an individual item can be collected in a technology-enhanced item. It is important to ensure that the data collected are aligned to the student model variables.

MRCML

An extension to the traditional IRT model is the Multidimensional Random Coefficient Multinomial Logit (MRCML) model (Briggs & Wilson, 2003). This model can be used when scores are required for multiple dimensions. These dimensions are represented by our student model variables, so if there are multiple student model variables, the assessment should be scored along multiple dimensions. The MRCML model produces ability estimates for each student on each of the dimensions (student model variables) and also takes into account the correlation among the dimensions when obtaining the estimates (Briggs & Wilson, 2003).

Observable variables need to be created before using the MRCML model. Each observable variable will align to one and only one of the student model variables (or dimensions). The observable variables can be dichotomous or polytomous. The MRCML model is set up to make it clear which observable variables are aligned to which student model variables. The model also indicates that the student model variables are related to each other (see Figure 4).

Figure 4: Example model for MRCML with four student model variables

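Ability estimates are calculated using the following formula, which gives the probability of a response in category k of item i:

$$P(X_{ik} = 1; A, B, \xi \mid \theta) = \frac{\exp(b_{ik}\theta + a'_{ik}\xi)}{\sum_{k=1}^{K_i} \exp(b_{ik}\theta + a'_{ik}\xi)}$$

where A is a design matrix, B is a scoring matrix, ξ is a vector that specifies item and category parameters, and θ is the vector of student abilities on the dimensions. The vectors b_ik and a_ik are the entries of B and A that correspond to category k of item i. Items are indexed by i, and each item has K_i + 1 possible response categories (Briggs & Wilson, 2003). (See Wang, Wilson, & Adams, 1997, for more details.)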
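The following Python sketch shows how category probabilities of this multinomial logit form could be computed for one item. It is a minimal illustration and not an estimation routine: the function name, the example matrices, and the parameter values are hypothetical, and the sketch normalizes over every category row supplied for the item, including an all-zero reference category, which is an assumption of this example.

```python
import math


def mrcml_category_probabilities(B_i, A_i, theta, xi):
    """Category probabilities for one item under a multinomial logit form.

    B_i: one scoring row per category (length = number of dimensions).
    A_i: one design row per category (length = len(xi)).
    theta: ability vector, one entry per dimension.
    xi: item and category parameter vector.
    """
    exponents = [
        sum(b * t for b, t in zip(b_k, theta))
        + sum(a * x for a, x in zip(a_k, xi))
        for b_k, a_k in zip(B_i, A_i)
    ]
    shift = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - shift) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical dichotomous item that loads only on the first of two
# dimensions. Row 0 is the reference category (all zeros); row 1 scores
# one point on dimension 1 and subtracts the item parameter.
B_item = [[0, 0], [1, 0]]
A_item = [[0], [-1]]
theta = [0.5, -1.0]   # assumed abilities on the two dimensions
xi = [0.5]            # assumed item parameter

probs = mrcml_category_probabilities(B_item, A_item, theta, xi)
print(probs)
```

With these particular matrices the item behaves like a Rasch item on dimension 1, so an ability equal to the item parameter gives equal probabilities for the two categories.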

A benefit of using this model is that it takes into account the dependencies among the multiple dimensions. As an example, assume that there are four dimensions of interest: SMV1, SMV2, SMV3, and SMV4. Each of these dimensions is associated with multiple observable variables. One possible way to generate a score for each of the dimensions is to sum the observable variables that are associated with each dimension. However, this does not take into account the relationship among the different dimensions. For example, if Student A has raw scores on the SMVs of 32, 20, 35, and 24 and Student B has raw scores on the SMVs of 32, 12, 30, and 20, then based on their raw scores the students would have the same ability on the first dimension (SMV1). However, using the MRCML model, Student A would have a higher ability estimate on SMV1 because Student A’s scores on the other dimensions are higher.

The MRCML model does not address violations of local independence of items, as it assumes that the observable variables are locally independent. One way to address this issue is with the concept of item bundling (Kennedy, 2005). In item bundling, scores for items or observable variables that are dependent on each other are “bundled” together to generate an overall score. With this method the observable variables that are used to generate the ability estimates for the SMVs are locally independent.
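A minimal sketch of item bundling is given below. The observable names and the bundle definition are hypothetical, and summing the member scores is only one possible bundling rule, assumed here for simplicity; a real application would define the bundle score according to the scoring design.

```python
def bundle_scores(observables, bundles):
    """Replace each group of dependent observables with one bundle score.

    observables: mapping of observable name to score.
    bundles: mapping of bundle name to the list of member observables.
    Observables that belong to no bundle are passed through unchanged.
    """
    bundled = {}
    used = set()
    for bundle_name, members in bundles.items():
        # Assumed rule: the bundle score is the sum of its members.
        bundled[bundle_name] = sum(observables[m] for m in members)
        used.update(members)
    for name, score in observables.items():
        if name not in used:
            bundled[name] = score
    return bundled


# Hypothetical scores: the explanation depends on the earlier design
# choices, so the three design-related observables form one bundle.
observables = {"hypothesis": 1, "variables": 1, "explanation": 0, "graph": 1}
bundles = {"design_bundle": ["hypothesis", "variables", "explanation"]}

result = bundle_scores(observables, bundles)
print(result)
```

The bundle becomes a single polytomous observable, so the dependence among its members is absorbed inside the bundle rather than left between separately modeled observables.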

The MRCML model can be used when it is believed that there is a correlation among the student model variables and it is relatively straightforward to generate observable variables (through item bundling) that are locally independent. During the ECD process, the relationships between the observable variables and the student model variables, as well as the relationships among the student model variables, are specified, which makes the structure of the MRCML model clear.

Cognitive Diagnostic Models

Another type of measurement model is a cognitive diagnostic model (CDM), which has its roots in latent variable modeling. In this model, the latent variables represent the attributes required by the assessment (Rupp & Templin, 2008). In a CDM, student model variables are referred to as attributes. Observable variables are aligned to one or more of these attributes. Probabilities that a student has each attribute are calculated based on these observable variables. The number of latent variables or attributes in a CDM may vary, but there should be more than one attribute of interest. This model can accommodate a hierarchical structure among the attributes. The CDM is used to provide evidence about the set of attributes obtained by the student.

In a CDM, the alignment of the observable variables with the attributes is referred to as a loading structure. CDMs often have a complex loading structure (Rupp & Templin, 2008) since items may depend on a combination of attributes. The loading structure is represented in a Q-matrix, which indicates for every item which attributes it requires. For example, an assessment could be created that has 8 items designed to measure 4 attributes. In the example in Table 1, items that measure a particular attribute also require the previous attribute. Another example (see Table 2) has different combinations of items and attributes. Either of these types of loading structures can be handled with CDMs, along with many other types. The Q-matrix can help not only with the analysis of the assessment, as it is clear which attributes the items are designed to measure, but also with the creation of items for the assessment, as this type of information makes some of the requirements for each item clear. The process of specifying the CDM is the same as specifying the relationship between the observable variables and the student model variables in an ECD process.

Table 1: An example Q-matrix for an exam with 8 items depending on 4 attributes, where each attribute requires the previous attribute

              Attributes
Item Number   1   2   3   4
1             1   0   0   0
2             1   0   0   0
3             1   1   0   0
4             1   1   0   0
5             1   1   1   0
6             1   1   1   0
7             1   1   1   1
8             1   1   1   1

Table 2: An example Q-matrix for an exam with 8 items depending on 4 attributes, where attributes do not have a specific relationship to each other

              Attributes
Item Number   1   2   3   4
1             1   1   0   0
2             1   0   0   0
3             1   0   1   0
4             0   1   1   0
5             1   1   1   0
6             0   0   1   0
7             0   0   1   1
8             0   1   0   1

The general model for a CDM is as follows (Rupp, Templin, & Henson, 2010):

$$P(X_r = x_r) = \sum_{c=1}^{C} \nu_c \prod_{i=1}^{I} \pi_{ic}^{x_{ir}} (1 - \pi_{ic})^{1 - x_{ir}}$$

where x_r is the vector of response data for person r (responses are assumed to be binary), ν_c is the probability of being in class c, and π_ic is the probability of a correct response for item i given the student is in class c. Different CDMs provide different parameterizations for calculating π_ic (Rupp, Templin, & Henson, 2010).
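The general model can be read as a mixture over latent classes. The Python sketch below evaluates the probability of one binary response pattern under that reading; the two classes, the class probabilities, and the item probabilities are hypothetical values chosen for illustration, and the function name is not taken from any CDM software.

```python
from itertools import product


def response_pattern_probability(x_r, class_probs, pi):
    """Probability of a binary response pattern under a latent class mixture.

    x_r: list of 0/1 responses, one per item.
    class_probs: class_probs[c] is the probability of being in class c.
    pi: pi[i][c] is the probability of a correct response to item i
        for a student in class c.
    """
    total = 0.0
    for c, nu_c in enumerate(class_probs):
        term = nu_c
        for i, x in enumerate(x_r):
            term *= pi[i][c] if x == 1 else (1.0 - pi[i][c])
        total += term
    return total


# Hypothetical setup: two latent classes and two items.
class_probs = [0.5, 0.5]
pi = [
    [0.2, 0.8],  # item 1: correct-response probability in class 0, class 1
    [0.2, 0.8],  # item 2
]

p_both_correct = response_pattern_probability([1, 1], class_probs, pi)

# The probabilities of all possible response patterns sum to one.
p_all = sum(
    response_pattern_probability(list(x), class_probs, pi)
    for x in product([0, 1], repeat=2)
)
print(p_both_correct, p_all)
```

In a CDM each latent class corresponds to a profile of attributes, so a specific model differs from this sketch mainly in how it derives the entries of `pi` from the Q-matrix and the attribute profile.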

CDMs differ in several ways. These include the type of observable variables that can be modeled (polytomous or dichotomous), the type of attributes included in the model (polytomous or dichotomous), and the relationship among the different attributes. The relationship among different attributes can be modeled in either a compensatory manner (having one attribute makes up for lacking a second attribute) or a non-compensatory manner (Von Davier, 2008). Some CDMs may be more appropriate for certain types of assessments. The decision of which CDM to use should be based on a theoretical perspective on the relationship among the attributes.
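The difference between compensatory and non-compensatory relationships can be sketched with two simple response rules. The conjunctive rule below follows the style of the DINA model and the disjunctive rule follows the style of the DINO model; these two models are chosen here only for illustration and are not the parameterizations used in this paper. The slip and guess values, the attribute profile, and the Q-matrix row are hypothetical.

```python
def has_all_required(profile, q_row):
    """True if the student has every attribute the item requires."""
    return all(a == 1 for a, q in zip(profile, q_row) if q == 1)


def has_any_required(profile, q_row):
    """True if the student has at least one attribute the item requires."""
    return any(a == 1 for a, q in zip(profile, q_row) if q == 1)


def correct_probability(profile, q_row, slip, guess, compensatory):
    """Probability of a correct response under a simple slip/guess rule.

    compensatory=False: all required attributes are needed (conjunctive).
    compensatory=True: any one required attribute is enough (disjunctive).
    """
    if compensatory:
        able = has_any_required(profile, q_row)
    else:
        able = has_all_required(profile, q_row)
    return 1.0 - slip if able else guess


# Hypothetical item requiring attributes 1 and 2, and a student who has
# attribute 1 only.
q_row = [1, 1, 0, 0]
profile = [1, 0, 0, 0]

p_noncomp = correct_probability(profile, q_row, slip=0.1, guess=0.2,
                                compensatory=False)
p_comp = correct_probability(profile, q_row, slip=0.1, guess=0.2,
                             compensatory=True)
print(p_noncomp, p_comp)
```

For this student the two rules disagree: under the non-compensatory rule the missing attribute leaves only the guessing probability, while under the compensatory rule the one attribute the student has is treated as sufficient.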

The CDM does not take into account relationships among items. As with the MRCML model, relationships among items can be taken into account through the use of item bundling. CDMs are useful when there is a pre-determined relationship among the attributes (or student model variables) to be modeled. During the ECD process, the relationship among the student model variables can be specified, as well as the loading structure for the CDM.

Bayesian Inference Networks

One type of model that some consider a CDM is a Bayesian Inference Network (BIN). BINs differ from other CDMs in that they are a framework rather than a specific model. Because of that, BINs are more flexible than other cognitive diagnostic models. However, choosing a BIN brings more decisions about how the assessment is modeled.

A BIN is a graphical representation of the relationships between variables. It is based on a finite acyclic directed graph (Almond, DiBello, Moulder, & Zapata-Rivera, 2007). In a BIN the vertices are thought of as categorical variables with values representing states. A given examinee is modeled as if he/she is in one state, represented by one possible value of the categorical variable. Edges in the graph represent a probabilistic dependency, so the edge (V1, V2) implies that the probabilities associated with the states of V2 differ depending on the state of V1. Put another way, the probability of V2 is conditionally dependent on V1. For the edge (V1, V2), V1 is referred to as the parent node and V2 is called the child node. Nodes in a BIN may have no parents, one parent, or multiple parents. The probability distribution associated with each node is conditionally dependent on all of its parent nodes.

A BIN is considered to be built when all of the probability distributions for the variables

have been determined (Mislevy, 2002). The joint product of the conditional probabilities of all

variables given their parents (interpreted to include marginal distributions for variables that have

no parents) is a joint probability distribution for the full set of variables. At this point an

assessment designer may enter any information that is known about the examinee and the

probabilities for each of the values of the variables will be updated for all nodes in the BIN.
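
The factorization described above can be written compactly. The notation here is an assumption introduced for illustration and is not taken from the paper: X_1, ..., X_n denote the variables (nodes) of the BIN and pa(X_i) denotes the set of parents of X_i, which is empty for a node with no parents.

```latex
P(X_1, \ldots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{pa}(X_i)\bigr),
\qquad \text{e.g. for the single edge } (V_1, V_2):\quad
P(V_1, V_2) = P(V_1)\, P(V_2 \mid V_1).
```

Entering known information about an examinee then amounts to conditioning this joint distribution on the observed values.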

In a very simple example, a BIN can be constructed to represent the relationship between

the weather and whether or not I take an umbrella with me to work. For this example there are

two variables. Variable A is the weather and for this example it can take on the values of sunny,

rainy, cloudy, and snowy. The other variable is the variable for taking an umbrella with me and it

can take on the values yes or no. The graph for this is represented in Figure 5. Notice in the graph that

the umbrella variable is dependent on the weather variable (made clear by the arrow pointing

from the weather variable to the umbrella variable). This arrow indicates that whether or not I

take an umbrella is dependent on the weather. It would be a very different statement if the arrow

pointed the other way. Using that direction, the BIN would indicate that whether or not I take an

umbrella has some influence on the weather.

Figure 5: BIN for the relationship between two variables. In this case the probability of an umbrella is dependent on the weather. Shown are the starting probabilities when neither value is known.

Each variable has its own probability table. For the weather variable this is the

probability of each type of weather occurring (see Table 3). For the umbrella variable this is the

conditional probability given the type of weather (see Table 4). While these data are hypothetical, in

general these probabilities would come from theory or they would be derived from real data.

Table 3: Probability of a given type of weather

Weather   Prob
Sunny     25%
Rainy     25%
Cloudy    25%
Snowy     25%

Table 4: Conditional probability of taking an umbrella given the type of weather.

Weather   Umbrella: yes   Umbrella: no
sunny     10%             90%
rainy     90%             10%
cloudy    50%             50%
snowy     20%             80%

In the initial state the type of weather is not known and whether or not I took an umbrella is also not known. The probability for the weather variable is simply the starting probability for this variable (which could be based on knowing the season, a current weather forecast, or simply looking out the window). With the values in Tables 3 and 4, the starting probabilities shown in Figure 5 are 25% for each type of weather, and 42.5% (yes) and 57.5% (no) for the umbrella variable. Probabilities can be updated in either direction. If I took an umbrella, then the probability that it is rainy would increase. Or, if it is sunny, then the probability that I took an umbrella would decrease.
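
The updating described in this example can be worked through numerically. The following is a minimal sketch in Python that uses only the hypothetical probabilities from Tables 3 and 4; the function and variable names are illustrative assumptions and do not come from any BIN software.

```python
# Probabilities from Tables 3 and 4 (the paper's hypothetical example data).
p_weather = {"sunny": 0.25, "rainy": 0.25, "cloudy": 0.25, "snowy": 0.25}
p_umbrella_given_weather = {"sunny": 0.10, "rainy": 0.90, "cloudy": 0.50, "snowy": 0.20}


def marginal_umbrella(prior, likelihood):
    """P(umbrella = yes), obtained by summing over the states of the parent node."""
    return sum(prior[w] * likelihood[w] for w in prior)


def posterior_weather(prior, likelihood, took_umbrella=True):
    """P(weather | umbrella) by Bayes' rule: updating from the child to the parent."""
    joint = {
        w: prior[w] * (likelihood[w] if took_umbrella else 1.0 - likelihood[w])
        for w in prior
    }
    total = sum(joint.values())
    return {w: p / total for w, p in joint.items()}


p_yes = marginal_umbrella(p_weather, p_umbrella_given_weather)
post = posterior_weather(p_weather, p_umbrella_given_weather, took_umbrella=True)

print(f"P(umbrella = yes) = {p_yes:.3f}")
for w, p in post.items():
    print(f"P({w} | umbrella = yes) = {p:.3f}")
```

Under these values the starting probability of taking an umbrella is 0.425, and observing that an umbrella was taken raises the probability of rain from 0.25 to roughly 0.53. Updating in the other direction needs no computation here: if the weather is known to be sunny, the probability of an umbrella is read directly from Table 4 as 0.10.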

In an educational setting a BIN may be constructed to represent the measurement model

of an assessment. Using a traditional measurement model there is one attribute that is being

measured, and each of the items on the assessment is designed to measure an aspect of that

attribute. Figure 6 shows a BIN for a traditional IRT model. The assumption of local

independence is shown in the BIN by having each of the items depend on the attribute without

any direct dependencies among the items.
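
To illustrate what local independence buys in such a BIN, here is a minimal Python sketch with one attribute (states low, medium, high) and four items. The conditional probabilities of a correct response are hypothetical: they do not appear in the paper and were chosen only so that, under a uniform prior on the attribute, the marginal probabilities of a correct response come out near 65%, 50%, 38.3%, and 27%. All names are illustrative assumptions.

```python
from math import prod

states = ["low", "medium", "high"]
prior = {s: 1.0 / 3.0 for s in states}  # uniform starting probabilities (assumed)

# Hypothetical P(item correct | attribute state); not taken from the paper.
p_correct = {
    "Item1": {"low": 0.35, "medium": 0.65, "high": 0.95},
    "Item2": {"low": 0.20, "medium": 0.50, "high": 0.80},
    "Item3": {"low": 0.10, "medium": 0.35, "high": 0.70},
    "Item4": {"low": 0.05, "medium": 0.21, "high": 0.55},
}


def posterior_attribute(responses):
    """P(attribute | responses), where responses maps item name -> True/False.

    Local independence means the likelihood of a response pattern, given the
    attribute state, is the product of the per-item probabilities.
    """
    unnormalized = {}
    for s in states:
        likelihood = prod(
            p_correct[item][s] if correct else 1.0 - p_correct[item][s]
            for item, correct in responses.items()
        )
        unnormalized[s] = prior[s] * likelihood
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}


# Marginal probability of a correct response before any evidence is entered.
marginals = {
    item: sum(prior[s] * table[s] for s in states)
    for item, table in p_correct.items()
}

# Evidence: items 1 and 2 correct, item 3 incorrect, item 4 not yet observed.
post = posterior_attribute({"Item1": True, "Item2": True, "Item3": False})
print(marginals)
print(post)
```

Because the items are connected only through the attribute, entering a response to one item changes the predicted probabilities for the other items solely by way of the updated attribute distribution.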

Figure 6: BIN for an IRT model with four items depending on one attribute

For a more complex assessment, multiple attributes or student model variables could be defined and items could be associated with multiple attributes. In addition, if items are dependent on one another, then arrows can be added to the BIN to represent this dependency, as shown by the arrow between items 2 and 3 in Figure 7. Probability tables can then be set up for each of the items (see Table 5).

[Figure 6 node values. Attribute: low 33.3, medium 33.3, high 33.3. Item 1: Correct 65.0, Incorrect 35.0. Item 2: Correct 50.0, Incorrect 50.0. Item 3: Correct 38.3, Incorrect 61.7. Item 4: Correct 27.0, Incorrect 73.0.]

Figure 7: A BIN where items are aligned to multiple attributes (NOTE ADD Dependency between items 2 and 3)

Table 5: Probability of item responses to item 1 based on the BIN shown in Figure x

Attribute 1   Attribute 2   Item 1: Correct   Item 1: Incorrect
Yes           Yes           0.9               0.1
Yes           No            0.2               0.8
No            Yes           0.2               0.8
No            No            0.2               0.8
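
A conditional probability table such as Table 5 can be used directly in a small calculation. The sketch below, in Python, takes the Item 1 probabilities from Table 5; the prior probabilities for the two attributes (0.5 each, treated as independent) are hypothetical assumptions added for illustration, as are all names.

```python
# P(Item 1 correct | Attribute 1, Attribute 2), taken from Table 5.
# Keys are (attribute_1, attribute_2), with True standing for "Yes".
p_item1_correct = {
    (True, True): 0.9,
    (True, False): 0.2,
    (False, True): 0.2,
    (False, False): 0.2,
}

# Hypothetical prior probabilities that each attribute is "Yes" (assumed,
# not from the paper); the two attributes are treated as independent.
p_attr1 = 0.5
p_attr2 = 0.5


def joint_prior(a1, a2):
    """Prior probability of one combination of attribute states."""
    return (p_attr1 if a1 else 1.0 - p_attr1) * (p_attr2 if a2 else 1.0 - p_attr2)


# Marginal probability of answering Item 1 correctly.
p_correct = sum(joint_prior(a1, a2) * p for (a1, a2), p in p_item1_correct.items())

# Posterior probability that both attributes are "Yes" given a correct answer.
p_both_given_correct = joint_prior(True, True) * p_item1_correct[(True, True)] / p_correct

print(f"P(Item 1 correct) = {p_correct:.3f}")
print(f"P(both attributes | correct) = {p_both_given_correct:.3f}")
```

The table encodes a conjunctive pattern: a correct response is likely only when both attributes are present. Under the assumed priors, a correct response raises the probability that both attributes are present from 0.25 to 0.60.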

BINs are the most flexible of the measurement models discussed in this paper. However,

determining the probabilities for each of the nodes in the BIN, whether from theory or from

empirical findings, takes a fair amount of time. The dependencies among the items and the

attributes must be modeled and initial probabilities must be loaded into the model. [CITE]

Conclusion

With the development of more complex scenario-based assessment tasks there is a need

to determine the appropriate measurement model to use to collect evidence. The evidence model

has two parts. The first is to determine how to score the work products produced by the student to

obtain observable variables. The second is to aggregate scores from these observable

variables to be able to draw inferences about the student model variables of interest.

An ECD approach can aid in the specification of the student model variables, the

identification of the work products, the specifications for creating observable variables, and the

specification of a measurement model to produce inferences about the student model based on

the observable variables. Part of this process involves specifying the relationships among the

observable variables and the student model variables. The assumptions that are made during this

specification can help determine an appropriate measurement model.

Three measurement models, MRCML, CDMs, and BINs, are discussed here. These

models can be used with multiple student model variables, and have methods to deal with the

issue of item dependence. This makes them appropriate to use with complex scenario-based

assessments as these assessments often violate the assumptions of unidimensionality and local

independence of items required by more traditional measurement models. However, these are not

the only options for measurement models and an assessment developer should take time to

ensure that the measurement model they choose is appropriate for the purpose of their

assessment.

REFERENCES

Almond, R. G., DiBello, L. V., Moulder, B., & Zapata-Rivera, J. D. (2007). Modeling diagnostic

assessments with Bayesian networks. Journal of Educational Measurement, 44, 341-359.

Briggs, D. C., & Wilson, M. (2003). An introduction to multidimensional measurement using

Rasch models. Journal of Applied Measurement, 4(1), 87-100.

Hambleton, R.K., & Swaminathan, H. (1985). Item response theory: principles and

applications. Boston, MA: Kluwer Nijhoff Publishing.

Kennedy, C. A. (2005). Constructing PADI measurement models for the BEAR scoring engine.

PADI Technical Report 7. Menlo Park, CA: SRI International.

Levy, R. (2013). Psychometric and evidentiary advances, opportunities, and challenges for

simulation-based assessment. Educational Assessment, 18(3), 182-207.

Mislevy, R. J., Almond, R., DiBello, L., Jenkins, F., Steinberg, L., & Yan, D. (2002). Modeling

conditional probabilities in complex educational assessments. CSE Technical Report

580. Los Angeles: The National Center for Research on Evaluation, Standards, and Student

Testing (CRESST), Center for Studies in Education, UCLA.

Mislevy, R. J., Haertel, G., Cheng, B. H., Ructtinger, L., DeBarger, A., Murray, E., et al. (2013).

A “conditional” sense of fairness in assessment. Educational Research and Evaluation:

An International Journal on Theory and Practices, 19(2-3).

Mislevy, R. J., & Riconscente, M. M. (2006). Evidence-centered assessment design: Layers,

concepts, and terminology. In S. Downing & T. Haladyna (Eds.), Handbook of test

development (pp. 61-90). Mahwah, NJ: Lawrence Erlbaum.

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational

assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3-62.

Quellmalz, E. S., & Pellegrino, J. W. (2009). Technology and testing. Science, 323, 75–79.

Rupp, A. A., & Templin, J. L. (2008). Unique characteristics of diagnostic classification

models: A comprehensive review of the current state-of-the-art. Measurement, 6, 219-262.

Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and

applications. New York: The Guilford Press.

von Davier, M. (2008). A general diagnostic model applied to language testing data. British

Journal of Mathematical and Statistical Psychology, 61.

