Developing Measurement Models for Complex Scenario-Based Assessment Tasks - Daisy Wise Rutstein, Geneva Haertel
Abstract
Purpose: This paper discusses how the use of an Evidence-Centered Design (ECD) (Mislevy & Riconscente, 2006) approach for assessment development can aid in the identification of appropriate measurement models for complexly structured performance tasks. In an ECD approach the connections between the student model (what it is we want to say about the student), the evidence model (which includes the measurement model), and the task model (features of the task that provide evidence) are made explicit. The development of these connections aids in the identification of appropriate measurement models. Theoretical Perspective: Recent developments in assessment have seen a shift from traditional standardized testing to include the use of technology-enhanced scenario-based assessment tasks (Quellmalz & Pellegrino, 2009). These assessments present opportunities in which students can engage in complex tasks such as designing investigations and manipulating representations of real-world tools. The use of these assessment tasks also presents challenges to the assessment designer when determining how the tasks should be scored. Traditional methods of scoring such as IRT are often not appropriate, as the assumptions of local independence of items and unidimensionality are violated by the presentation of an overarching scenario and by the measurement of multiple constructs. In addition, advances in technology have led to new and innovative item types, and one task might present several different types of items that might not be scored in the same way. This paper presents some of the issues that arise in developing the measurement model for scoring these complex scenario-based assessment tasks. Methods: This paper presents three measurement models that can be applied. One measurement model is an extension to IRT that allows for multiple dimensions to be measured, the MRCML model (Briggs & Wilson, 2003).
Another method that will be presented is diagnostic classification models (DCMs) (Henson, Templin, & Willse, 2009). A final measurement model that will be discussed is Bayesian Networks (Almond, DiBello, Moulder, & Zapata-Rivera, 2007). All of these models allow for the incorporation of multiple dimensions into the measurement model and provide benefits when it comes to scoring. The paper will provide background on each of the models, and discuss benefits and drawbacks to each method based on the literature. It will also demonstrate the application of these models to an example of a scenario-based assessment task. Results: While there is not one measurement model that would apply in all situations, the identification of the evidence needed to make valid inferences about the student, and the evidence that can be accumulated from the task can provide information that can be used to identify the appropriate measurement model for that task. Significance: With more and more complex tasks being developed and used, the identification of a measurement model that can best leverage the information provided by the task will aid assessment designers in their use and development of these types of tasks.
Part of the assessment development process includes the development of a measurement
model. The measurement model is used to determine how the assessment task is scored and how
scores provide information about the student. In a traditional assessment points are assigned to
each item, each item is scored for each student and then the percent correct is calculated or
student ability levels are calculated using an item response theory (IRT) model. Cut scores can
be used to group students and to provide feedback on the students’ ability level.
This method assumes there is one underlying ability being measured and that the items on
the assessment are independent. However, in complex scenario-based assessments these
assumptions are often violated (Levy, 2013). The assessment developer should determine
the effects of these violations and identify an appropriate measurement model to mitigate
these effects.
This paper discusses how using an evidence-centered design approach can aid in the
identification of an appropriate measurement model. In addition, three alternative measurement
models are presented: MRCML, Bayesian Networks, and Cognitive Diagnostic Models (CDMs).
Evidence-Centered Design
Evidence-Centered Design (ECD) is a framework that makes explicit, and provides tools for,
building assessment arguments (Mislevy & Riconscente, 2006; Mislevy, Steinberg, & Almond,
2003). ECD views assessment as an argument from imperfect evidence. It aims to make explicit
the claims (the inferences that one intends to make based on scores) and the nature of the
evidence that supports those claims (Mislevy, Haertel, Cheng, Ructtinger, et al., 2013).
The ECD framework includes the specification of three models: the student model (what
inferences we want to make about the student), the evidence model (how evidence collected
from the student provides information for the student model), and the task model (how the task
can be structured to allow students to provide evidence for use in the evidence model). Figure 1
shows the relationship among these models. The task model provides information on how the
items are presented to the students and the type of responses expected from the students. The
evidence model shows how the responses are scored to obtain observable variables (represented
by the blue boxes in the evidence model) about the students and then how these variables are
aggregated to provide information for the student model(s). The student model defines the
student model variables which represent the inferences that are made about the students and the
relationship among these inferences.
Figure 1: Relationship of the ECD student, evidence, and task models
By following the ECD principles, these models are defined in the initial stage of assessment
design and are refined throughout the development process. This process involves frequent
iteration among the ECD models so that the final representations of the ECD models are aligned
to the fully developed assessment task. This is important because it highlights the fact that the
evidence model is not something that is developed after the assessment has been fully developed,
but instead is an integral part of the assessment design process. (See Mislevy & Riconscente,
2006 for more information on ECD.)
Traditional measurement models
While classical test theory (Crocker & Algina, 1986) and IRT (Hambleton &
Swaminathan, 1985) use different statistics to model characteristics of students and items, they
both assume there is one main construct of interest which would correspond to the student model
having a single variable (Levy, 2013). In IRT this variable is the ability level of the student and
is referred to as a student’s θ. Assessments are then developed such that the items on the
assessment are assumed to be conditionally independent. This means that the probability of
answering any one question correctly depends only on a student’s ability and not on any other factor (see
Figure 2).
Figure 2: Representation of a unidimensional IRT measurement model
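The conditional independence assumption can be illustrated with a minimal sketch. It assumes a Rasch (one-parameter) model as a stand-in for "an IRT model"; the paper does not name a specific IRT model, and the item difficulties and responses below are hypothetical.

```python
import math

def rasch_probability(theta, difficulty):
    """Probability of a correct response under the Rasch model (an assumption)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def response_pattern_likelihood(theta, difficulties, responses):
    """Likelihood of a 0/1 response pattern, assuming local independence.

    Given theta, the items are treated as independent, so the joint
    probability is the product of the per-item probabilities.
    """
    likelihood = 1.0
    for difficulty, response in zip(difficulties, responses):
        p = rasch_probability(theta, difficulty)
        likelihood *= p if response == 1 else (1.0 - p)
    return likelihood

# Hypothetical values for illustration only.
difficulties = [-1.0, 0.0, 1.0]
responses = [1, 1, 0]
likelihood = response_pattern_likelihood(0.0, difficulties, responses)
```

If items within a scenario depend on one another, the product in `response_pattern_likelihood` no longer describes the joint probability of the responses, which is the concern raised for scenario-based tasks.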
With traditional methods, item rubrics are developed that provide a score for each item.
These scores could be dichotomous or they could be polytomous depending on the complexity of
the items. The scores are then combined (using statistics such as percent correct, or IRT models)
in order to arrive at an overall ability estimate of the students for that construct.
A violation of the assumption of unidimensionality or local independence affects the
validity of the inferences drawn from the assessment task (Hambleton & Swaminathan, 1985). If
the assessment is measuring multiple constructs (and therefore is not unidimensional), but the
method assumes only one construct is being measured, then the ability estimate for the student
could be influenced by a construct that is not being modeled. For example, if a task is assumed to
be measuring a student’s ability on science inquiry practices but the assessment also measures
science content, then a student who is weak on the science content might display poor
performance even if they are strong on inquiry skills. Similarly, if items in the assessment are
dependent on each other, then poor performance on later items could be due to their dependence
on earlier items and therefore overall results could underestimate a student’s true ability.
Characterization of Scenario-based Assessment Tasks
Increasingly, scenario-based, technology-enhanced assessment tasks are being developed and
implemented for both formative and summative purposes in K-16 education (Quellmalz &
Pellegrino, 2009). These assessments present opportunities in which students can engage in
complex tasks such as designing scientific investigations and manipulating representations of
real-world tools. While technology can be used to increase engagement with the task as well as
to measure concepts that are difficult to measure in a paper/pencil format, scenario-based
assessments do not have to be technology based.
In a scenario-based assessment, the tasks measure constructs by presenting items in a highly
contextualized situation. For example, a commonly used approach to contextualizing science
assessments is to present students with a scientific phenomenon to be explained and tools to
support that investigation. The purpose of including this context is to provide students with a
real-world scenario which not only supports their engagement with the task but also provides one
or more purposes for performing the task (instead of just answering discrete questions for the
sake of the assessment). These contexts often involve several skills, for example by crossing practice
skills with content skills (e.g., a task that has the student construct a graph and relate the
information in the graph to a scientific phenomenon relies on a student’s knowledge of graphing
conventions, knowledge of the scientific phenomenon, as well as their ability to interpret and
explain the data presented in the graph). The complexity of the scenario is in part dependent on
the number of constructs being measured and the number of goals the student must address to
complete the task. Since more than one construct is being measured in such assessment tasks, the
assumption of unidimensionality is often violated.
In addition, the use of a scenario-based context may require students to make connections
across items which could violate the assumption of local independence. For example, items
within a scenario might require that students choose a hypothesis, independent and dependent
variables and then explain their choices. The explanation that the student provides will depend
on the design choices they made.
Identification of observable variables
The first step in the development of the evidence model is to determine the observable
variables. This step involves determining how the work products that the students produce will
be evaluated. Typically this process is referred to in ECD as evidence identification and includes
specification of the rules or rubrics that will be applied to the work products (Levy, 2013). This
evaluation process results in the assignment of values to the observable variables. The next step
is to identify the measurement models. This section discusses how ECD can help determine the
observable variables. The next sections present different measurement models that can be used to
aggregate the observable variables to provide information on the student model variables.
One of the tasks in the ECD process is to determine the alignment between the task
model and the student model. This starts by clearly defining the student model variables that
make up a particular student model. For example, we could design an assessment in science
where two scores are to be generated, one that represents a student’s ability to perform scientific
inquiry and a second score that represents a student’s ability within a particular science content
domain such as Biology. In this example there would be two student model variables, one
associated with each score.
Particular knowledge, skills and attributes are then associated with the construct
represented by the student model variable. For a given assessment there can be multiple student
model variables, each representing a different constellation of knowledge, skills and attributes.
Items are developed that map to the particular knowledge, skills and attributes. This mapping
involves specifying the products of the student work (work products) that allow the assessment
designer to make inferences about the knowledge, skills, and attributes of the student. For
example, if the student model variable is a student’s ability to perform scientific inquiry, then
one skill associated with this student model variable is a student’s ability to generate a testable
hypothesis. The hypothesis is the work product that the student generated which provides
evidence of the student’s ability to generate a hypothesis, which in turn provides evidence for the
student’s ability to perform scientific inquiry. A student model variable most often includes
several knowledge, skills and attributes and therefore the assessment will include multiple work
products associated with that student model variable. Figure 3 illustrates a student
model with several student model variables included.
Figure 3: Example of a student model. The student model consists of the list of student model variables and the connections between these variables.
Once work products are identified, scoring rules need to be generated. Part of the
ECD process is not only to define the type of work products but also to define the qualities of the
work products that will be scored. It is from the specification of these qualities that rubrics can
be developed. In our hypothesis example, one quality of interest is “how testable is the
hypothesis”. Another quality may be “how well did the student relate the hypothesis to the
scientific phenomenon being observed”. It is around these qualities that rubrics and scoring
guides can be developed. Applying a rubric to the work product will produce observable
variables.
An ECD process encourages the specification of work products and observable variables
to be done prior to or during item development. Specifying these products and variables early
allows the developer to reflect on what it is that they need to score and how the scores are
aligned to the student model variables. This alignment can be used to support the validity of the
assessment.
In complex scenario-based assessments this is particularly important because individual
items might be aligned to multiple student model variables. For example, if a student is required
to explain the hypothesis he or she generated, then one quality of this explanation is the format of
the hypothesis which provides evidence of the student’s ability to perform scientific inquiry,
while another quality of the explanation is about the science content and would provide evidence
of the student’s ability in the specific content area. Recognizing that different qualities of the
explanation contribute to different student model variables heightens the designer’s awareness
that multiple scores should be produced for this one item.
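The alignment of one item's qualities to more than one student model variable can be sketched as a small data structure. The item names, quality labels, variable names, and rubric scores below are hypothetical and chosen only to mirror the hypothesis example in the text.

```python
from collections import defaultdict

# Hypothetical rubric scores for one student's work products.
# Each observable variable is tagged with the student model variable (SMV)
# it provides evidence for; all names and scores are illustrative assumptions.
observables = [
    {"item": "explain_hypothesis", "quality": "format of hypothesis",
     "smv": "scientific_inquiry", "score": 2},
    {"item": "explain_hypothesis", "quality": "accuracy of science content",
     "smv": "science_content", "score": 1},
    {"item": "generate_hypothesis", "quality": "testability",
     "smv": "scientific_inquiry", "score": 1},
]

def group_by_smv(observables):
    """Collect observable-variable scores under the SMV they are aligned to."""
    grouped = defaultdict(list)
    for obs in observables:
        grouped[obs["smv"]].append(obs["score"])
    return dict(grouped)

evidence = group_by_smv(observables)
```

Here the single item `explain_hypothesis` contributes one observable variable to each of two student model variables, which is the situation in which multiple scores are produced for one item.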
The identification of the work products becomes more intricate when it comes to
technology-enhanced items as the type of data that can be collected about the student differs
from that collected in a paper/pencil assessment. For example, data about the order in which a
student performs certain actions or the amount of time a student spends on an individual item can
be collected in a technology-enhanced item. It is important to ensure that the data that is
collected is aligned to the student model variables.
MRCML
An extension to the traditional IRT model is the Multidimensional Random Coefficient
Multinomial Logit (MRCML) model (Briggs & Wilson, 2003). This model can be used when
scores are required for multiple dimensions. These dimensions are represented by our student
model variables, so if there are multiple student model variables then this would mean that the
assessment should be scored along multiple dimensions. The MRCML model will produce
ability estimates for each student on each of the dimensions (student model variables) and also
takes into account the correlation among the multiple dimensions when obtaining the estimates
(Briggs & Wilson, 2003).
Observable variables need to be created before using the MRCML model. Each
observable variable will align to one and only one of the student model variables (or
dimensions). The observable variables can be dichotomous or polytomous. The MRCML model
is set up to make it clear which observable variables are aligned to which student model
variables. The model also indicates that the student model variables are related to each other (see
Figure 4).
Figure 4: Example model for MRCML with four student model variables
Ability estimates are calculated using the following formula:
$$P(X_{ik} = 1; A, B, \xi \mid \theta) = \frac{e^{(b_{ik}\theta + a'_{ik}\xi)}}{\sum_{k=1}^{K_i} e^{(b_{ik}\theta + a'_{ik}\xi)}}$$
where A is a design matrix, B is a scoring matrix, and ξ is a vector that specifies item and
category parameters. Items are indexed by i and each item has $K_i + 1$ possible response categories
(Briggs & Wilson, 2003). (See Wang, Wilson, & Adams, 1997, for more details.)
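As a rough numerical sketch of this kind of multinomial logit response probability, the function below normalizes the exponentiated terms over an item's response categories. It makes two assumptions that are not stated in the text: the reference category is included explicitly as a row of zeros, and the rows multiplying θ and ξ are passed in directly rather than extracted from full matrices. All parameter values are hypothetical.

```python
import math

def category_probabilities(theta, xi, b_rows, a_rows):
    """Response-category probabilities for one item.

    theta  : list of ability values, one per dimension
    xi     : list of item and category parameters
    b_rows : one row per category; the row multiplies theta
    a_rows : one row per category; the row multiplies xi
    """
    exponents = []
    for b_ik, a_ik in zip(b_rows, a_rows):
        ability_part = sum(b * t for b, t in zip(b_ik, theta))
        item_part = sum(a * x for a, x in zip(a_ik, xi))
        exponents.append(ability_part + item_part)
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(exponents)
    weights = [math.exp(e - largest) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical dichotomous item that loads on dimension 1 of 2,
# with a difficulty-like parameter of 0.5.
theta = [1.0, -0.3]
xi = [0.5]
b_rows = [[0, 0], [1, 0]]   # category 0 (reference), category 1
a_rows = [[0], [-1]]        # category 1 contributes -0.5 to the exponent
probs = category_probabilities(theta, xi, b_rows, a_rows)
```

With these values the probability of category 1 reduces to a logistic function of θ on dimension 1 minus 0.5, so the second dimension has no direct effect on this item.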
A benefit of using this model is the fact that it takes into account the dependencies among
the multiple dimensions. As an example, assume that there are four dimensions of interest:
SMV1, SMV2, SMV3, SMV4. Each of these dimensions is associated with multiple observable
variables. One possible way to generate a score for each of the dimensions is to sum up the
observable variables that are associated with each dimension. However, this doesn’t take into
account the relationship among the different dimensions. For example, if Student A has raw
scores on the SMVs of 32, 20, 35, 24 and Student B has raw scores on the SMVs of 32, 12, 30,
20 then based on their raw scores the students would have the same ability on the first dimension
(SMV1). However, using the MRCML model, Student A would have a higher ability estimate on SMV1
(assuming the dimensions are positively correlated), since Student A’s scores on the other dimensions are higher.
The MRCML model doesn’t address the issue of violations of local independence of
items, as it assumes that the observable variables are locally independent. One way to address
this issue is with the concept of item bundling (Kennedy, 2005). In item bundling, scores for
items or observable variables that are dependent on each other are “bundled” together to generate
an overall score. With this method the observable variables that are used to generate the ability
estimates for the SMVs are locally independent.
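A minimal sketch of item bundling follows. It assumes that a bundle score is the sum of the dependent item scores, which is only one possible bundling rule; the item names and scores are hypothetical.

```python
def bundle_scores(item_scores, bundles):
    """Replace dependent items with one polytomous bundle score each.

    item_scores : dict mapping item id -> score for one student
    bundles     : dict mapping bundle name -> list of dependent item ids
    Items not named in any bundle are passed through unchanged.
    """
    bundled_items = {item for items in bundles.values() for item in items}
    observables = {
        item: score for item, score in item_scores.items()
        if item not in bundled_items
    }
    for name, items in bundles.items():
        # Summing is one simple bundling rule; other rules are possible.
        observables[name] = sum(item_scores[item] for item in items)
    return observables

# Hypothetical scenario: the explanation depends on the design choices.
scores = {"hypothesis": 1, "variables": 0, "explanation": 1, "graph": 2}
bundles = {"investigation_design": ["hypothesis", "variables", "explanation"]}
observables = bundle_scores(scores, bundles)
```

The three dependent items are replaced by a single observable variable scored 0 to 3, while the unrelated `graph` item is kept as its own observable variable.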
The MRCML model can be used when it is believed that there is a correlation among the
student model variables, and it is relatively straightforward to generate observable variables
(through item bundling) that are locally independent. During the ECD process, the relationships
between the observable variables and the student model variables, as well as the relationships
among the student model variables, are specified, which makes the structure of the MRCML model
clear.
Cognitive Diagnostic Models
Another type of measurement model is a cognitive diagnostic model (CDM) which has
its roots in latent variable modeling. In this model, the latent variables represent the attributes
required by the assessment (Rupp & Templin, 2008). In a CDM, student model variables
are referred to as attributes. Observable variables are aligned to one or more of these attributes.
Probabilities that a student has the attribute are calculated based on these observable variables.
The number of latent variables or attributes in a CDM may vary but there should be more than
one attribute of interest. This model can accommodate a hierarchical structure among the
attributes. The CDM is used to provide evidence about the set of attributes obtained by the
student.
In a CDM, the alignment of the observable variables with the attributes is referred to as a
loading structure. CDMs often have a complex loading structure (Rupp & Templin, 2008) since
items may depend on a combination of attributes. The loading structure is represented in a Q
matrix. This is a matrix that indicates for every item which attributes it requires. For example,
an assessment could be created that has 8 items designed to measure 4 attributes. In the example
in Table 1, items that measure a particular attribute also require the previous attribute, while
another example (see Table 2) has different combinations of items and attributes. Either of these
types of loading structures can be handled with CDMs, along with many other types. The Q
matrix can not only help with the analysis of the assessment, as it is clear which attributes the
items are designed to measure, but also in the creation of items for the assessment, as this type of
information makes some of the requirements for each item clear. The process of specifying the
CDM is the same as specifying the relationship between the observable variables and the student
model variables in an ECD process.
Table 1: An example Q-matrix for an exam with 8 items depending on 4 attributes, where each attribute requires the previous attribute
Item Number   Attribute 1   Attribute 2   Attribute 3   Attribute 4
1             1             0             0             0
2             1             0             0             0
3             1             1             0             0
4             1             1             0             0
5             1             1             1             0
6             1             1             1             0
7             1             1             1             1
8             1             1             1             1
Table 2: An example Q-matrix for an exam with 8 items depending on 4 attributes, where attributes do not have a specific relationship to each other
Item Number   Attribute 1   Attribute 2   Attribute 3   Attribute 4
1             1             1             0             0
2             1             0             0             0
3             1             0             1             0
4             0             1             1             0
5             1             1             1             0
6             0             0             1             0
7             0             0             1             1
8             0             1             0             1
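The two Q-matrices can be written down directly as nested lists, which makes their structural difference easy to check. The sketch below uses the values from Tables 1 and 2; the helper functions and their names are illustrative and not part of any CDM software.

```python
# Rows are items 1-8, columns are attributes 1-4 (values from Tables 1 and 2).
Q_TABLE_1 = [
    [1, 0, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0],
    [1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 1],
]
Q_TABLE_2 = [
    [1, 1, 0, 0], [1, 0, 0, 0], [1, 0, 1, 0], [0, 1, 1, 0],
    [1, 1, 1, 0], [0, 0, 1, 0], [0, 0, 1, 1], [0, 1, 0, 1],
]

def is_linear_hierarchy(q_matrix):
    """True if every item requiring attribute k also requires attribute k-1."""
    for row in q_matrix:
        for k in range(1, len(row)):
            if row[k] == 1 and row[k - 1] == 0:
                return False
    return True

def items_per_attribute(q_matrix):
    """Count how many items load on each attribute (column sums)."""
    return [sum(column) for column in zip(*q_matrix)]
```

Calling `is_linear_hierarchy` on the first matrix returns `True` and on the second returns `False`, and `items_per_attribute` shows how much evidence each attribute receives, which can be useful when planning item development.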
The general model for a CDM is as follows:
$$P(X_r = x_r) = \sum_{c=1}^{C} \nu_c \prod_{i=1}^{I} \pi_{ic}^{x_{ir}} (1 - \pi_{ic})^{1 - x_{ir}}$$
(Rupp, Templin, & Henson, 2010), where $x_r$ is the vector
of response data for person r (responses are assumed to be binary), $\nu_c$ is the probability of being
in class c, and $\pi_{ic}$ is the probability of a correct response for item i given the student is in class c.
Different CDMs provide different parameterizations for calculating $\pi_{ic}$ (Rupp, Templin, &
Henson, 2010).
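The general model can be read as a weighted sum over latent classes of products of item response probabilities. A minimal sketch of that computation follows; the class probabilities and per-class probabilities of a correct response are hypothetical values, not estimates from any data.

```python
def response_pattern_probability(responses, class_probs, correct_probs):
    """Marginal probability of a binary response vector.

    responses     : list of 0/1 responses, one per item
    class_probs   : list of class membership probabilities, one per class
    correct_probs : correct_probs[c][i] is the probability of a correct
                    response to item i for a student in class c
    """
    total = 0.0
    for class_prob, pis in zip(class_probs, correct_probs):
        pattern_prob = 1.0
        for x, pi in zip(responses, pis):
            pattern_prob *= pi if x == 1 else (1.0 - pi)
        total += class_prob * pattern_prob
    return total

# Hypothetical two-class, three-item example.
class_probs = [0.4, 0.6]
correct_probs = [[0.2, 0.2, 0.1],   # class lacking the attribute(s)
                 [0.9, 0.8, 0.7]]   # class holding the attribute(s)
p = response_pattern_probability([1, 1, 0], class_probs, correct_probs)
```

Specific CDMs differ in how the per-class probabilities in `correct_probs` are generated from the attributes and the Q-matrix; this sketch simply takes them as given.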
CDMs differ in several ways. These include the type of observable variables that can be
modeled (polytomous or dichotomous), the type of attributes included in the model (polytomous
or dichotomous), and the relationship among the different attributes. The relationship among
different attributes can be modeled in either a compensatory manner (having one attribute makes up for
lacking a second attribute) or a non-compensatory manner (von Davier, 2008). Some
CDMs may be more appropriate for certain types of assessments. The decision about which
CDM to use should be based on a theoretical perspective on the relationship among the
attributes.
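The difference between the two kinds of relationship can be sketched with two simple item response rules. These rules resemble the conjunctive DINA model and the disjunctive DINO model from the wider CDM literature; neither model is named in this paper, and the slip and guess values are hypothetical.

```python
def mastered_required(attribute_profile, q_row):
    """Flags for whether each attribute required by the item is mastered."""
    return [a == 1 for a, q in zip(attribute_profile, q_row) if q == 1]

def noncompensatory_probability(attribute_profile, q_row, slip, guess):
    """Conjunctive rule: all required attributes are needed (DINA-like)."""
    has_all = all(mastered_required(attribute_profile, q_row))
    return 1.0 - slip if has_all else guess

def compensatory_probability(attribute_profile, q_row, slip, guess):
    """Disjunctive rule: any one required attribute suffices (DINO-like)."""
    has_any = any(mastered_required(attribute_profile, q_row))
    return 1.0 - slip if has_any else guess

# Hypothetical item requiring attributes 1 and 2; the student has only 1.
profile = [1, 0, 0, 0]
q_row = [1, 1, 0, 0]
p_conj = noncompensatory_probability(profile, q_row, slip=0.1, guess=0.2)
p_disj = compensatory_probability(profile, q_row, slip=0.1, guess=0.2)
```

For this student the non-compensatory rule gives the guessing probability, because one required attribute is missing, while the compensatory rule gives the high probability, because holding attribute 1 makes up for lacking attribute 2.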
CDMs do not take into account relationships among items. As with the
MRCML model, relationships among items can be taken into account through the use of item
bundling. CDMs are useful when there is a pre-determined relationship among the attributes (or
student model variables) to be modeled. During the ECD process the relationship among the
student model variables can be specified as well as the loading structure for the CDM.
Bayesian Inference Networks
One type of model that some consider a CDM is a Bayesian Inference Network (BIN).
BINs differ from other CDMs in that they are a framework rather than a specific model.
Because of that, BINs are more flexible than other cognitive diagnostic models. However,
choosing a BIN requires more decisions about how the assessment is modeled.
A BIN is a graphical representation of the relationships between variables. It is based on
a finite acyclic directed graph (Almond, DiBello, Moulder, & Zapata-Rivera, 2007). In a BIN the
vertices are thought of as categorical variables with values representing states. A given
examinee is modeled as if he/she is in one state, represented by one possible value of the
categorical variable. Edges in the graph represent a probabilistic dependency, so the edge (V1,
V2) would imply that the probabilities associated with the states in V2 differ depending on the
state of V1. Or put another way, the probability of V2 is conditionally dependent on V1. For the
edge (V1, V2) V1 is referred to as the parent node, and V2 is called the child node. Nodes in a
BIN may have no parents, one parent or multiple parent nodes. The probability distribution
associated with each node is conditionally dependent on all of its parent nodes.
A BIN is considered to be built when all of the probability distributions for the variables
have been determined (Mislevy et al., 2002). The product of the conditional probabilities of all
variables given their parents (interpreted to include marginal distributions for variables that have
no parents) is a joint probability distribution for the full set of variables. At this point an
assessment designer may enter any information that is known about the examinee and the
probabilities for each of the values of the variables will be updated for all nodes in the BIN.
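The factorization described in this paragraph can be written compactly. The notation $V_1, \ldots, V_n$ for the variables and $\mathrm{pa}(V_j)$ for the set of parent nodes of $V_j$ is introduced here for illustration and does not appear in the original text:

$$P(V_1, \ldots, V_n) = \prod_{j=1}^{n} P\big(V_j \mid \mathrm{pa}(V_j)\big)$$

For a variable with no parents, $P(V_j \mid \mathrm{pa}(V_j))$ is simply its marginal distribution $P(V_j)$. For a network with the single edge (V1, V2), the product reduces to $P(V_1, V_2) = P(V_1)\,P(V_2 \mid V_1)$.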
In a very simple example, a BIN can be constructed to represent the relationship between
the weather and whether or not I take an umbrella with me to work. For this example there are
two variables. Variable A is the weather and for this example it can take on the values of sunny,
rainy, cloudy, and snowy. The other variable is the variable for taking an umbrella with me and it
can take on the values yes or no. The graph for this is represented in Figure 5. Notice in the graph that
the umbrella variable is dependent on the weather variable (made clear by the arrow pointing
from the weather variable to the umbrella variable). This arrow indicates that whether or not I
take an umbrella is dependent on the weather. It would be a very different statement if the arrow
pointed the other way. Using that direction, the BIN would indicate that whether or not I take an
umbrella has some influence on the weather.
Figure 5: BIN for the relationship between two variables. In this case the probability of an umbrella is dependent on the weather. Shown are the starting probabilities when neither value is known.
Each variable has its own probability table. For the weather variable this is the
probability of each type of weather occurring (see Table 3). For the umbrella variable this is the
conditional probability given the type of weather (see Table 4). While this data is hypothetical, in
general these probabilities would come from theory or they would be derived from real data.
Table 3: Probability of a given type of weather
Weather   Prob
Sunny     25%
Rainy     25%
Cloudy    25%
Snowy     25%
Table 4: Conditional probability of taking an umbrella given the type of weather.
Weather   Umbrella: yes   Umbrella: no
sunny     10%             90%
rainy     90%             10%
cloudy    50%             50%
snowy     20%             80%
In the initial state the type of weather is not known and whether or not I took an umbrella
is also not known. The probability for the weather variable is simply the starting probability for
this variable (which could be based on knowing the season, a current weather forecast, or simply
looking out the window). With the values in Tables 3 and 4, each type of weather starts at 25% and
the starting probability that I take an umbrella is 42.5% (57.5% that I do not), as shown in Figure 5.
Probabilities can be updated in either direction. If I took an umbrella,
then the probability that it is rainy would increase. Or, if it is sunny then the probability that I
took an umbrella would decrease.
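The updating described here can be reproduced with a few lines of arithmetic. The sketch below uses the probabilities from Tables 3 and 4 and applies Bayes' rule by hand; the function names are illustrative and it does not use any Bayesian network software.

```python
# Probabilities from Tables 3 and 4.
weather_prior = {"sunny": 0.25, "rainy": 0.25, "cloudy": 0.25, "snowy": 0.25}
umbrella_given_weather = {"sunny": 0.10, "rainy": 0.90,
                          "cloudy": 0.50, "snowy": 0.20}

def umbrella_marginal(prior, likelihood):
    """P(umbrella = yes), summing over the weather states."""
    return sum(prior[w] * likelihood[w] for w in prior)

def weather_posterior(prior, likelihood, took_umbrella):
    """P(weather | umbrella observation) by Bayes' rule."""
    joint = {}
    for w in prior:
        p_obs = likelihood[w] if took_umbrella else 1.0 - likelihood[w]
        joint[w] = prior[w] * p_obs
    total = sum(joint.values())
    return {w: p / total for w, p in joint.items()}

p_yes = umbrella_marginal(weather_prior, umbrella_given_weather)
posterior = weather_posterior(weather_prior, umbrella_given_weather, True)
```

The marginal `p_yes` comes to about 0.425, which agrees with the starting umbrella probability in Figure 5. After observing that an umbrella was taken, the probability of rainy weather rises from 0.25 to roughly 0.53, which illustrates updating against the direction of the arrow.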
In an educational setting a BIN may be constructed to represent the measurement model
of an assessment. Using a traditional measurement model, there is one attribute that is being
measured, and each of the items on the assessment is designed to measure an aspect of that
attribute. Figure 6 shows a BIN for a traditional IRT model. The assumption of local
independence is shown in the BIN by having each of the items depend on the attribute without
any direct dependencies among the items.
Figure 6: BIN for an IRT model with four items depending on one attribute
For a more complex assessment, multiple attributes or student model variables could be
defined and items could be associated with multiple attributes. In addition, if items are dependent
on one another, then arrows can be added to the BIN to represent this dependency, as shown by
the arrow between items 2 and 3 in Figure 7. Probability tables can then be set up for each of the
items (see Table 5).
[Figure 6 node probabilities (%): Attribute: low 33.3, medium 33.3, high 33.3; Item 1: correct 65.0, incorrect 35.0; Item 2: correct 50.0, incorrect 50.0; Item 3: correct 38.3, incorrect 61.7; Item 4: correct 27.0, incorrect 73.0]
Figure 7: A BIN where items are aligned to multiple attributes
Table 5: Probability of item responses to item 1 based on the BIN shown in Figure 7
Attribute 1   Attribute 2   Item 1: Correct   Item 1: Incorrect
Yes           Yes           0.9               0.1
Yes           No            0.2               0.8
No            Yes           0.2               0.8
No            No            0.2               0.8
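A conditional probability table such as Table 5 can be stored as a lookup keyed by the states of the parent nodes. The sketch below reads "Yes" as holding the attribute and assumes, for illustration only, that the two attributes are independent with hypothetical prior probabilities of 0.5 each; the paper does not give these priors.

```python
# Conditional probability of a correct response to Item 1 (from Table 5),
# keyed by (has Attribute 1, has Attribute 2).
p_correct_item1 = {
    (True, True): 0.9,
    (True, False): 0.2,
    (False, True): 0.2,
    (False, False): 0.2,
}

def marginal_p_correct(cpt, p_attr1, p_attr2):
    """P(Item 1 correct), assuming the two attributes are independent."""
    total = 0.0
    for (has1, has2), p_correct in cpt.items():
        weight1 = p_attr1 if has1 else 1.0 - p_attr1
        weight2 = p_attr2 if has2 else 1.0 - p_attr2
        total += weight1 * weight2 * p_correct
    return total

# Hypothetical priors: each attribute is held with probability 0.5.
p_item1 = marginal_p_correct(p_correct_item1, 0.5, 0.5)
```

Under these assumed priors the starting probability of a correct response to Item 1 is about 0.375. The table itself encodes a non-compensatory relationship, since only students with both attributes have a high probability of success.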
BINs are the most flexible of the measurement models discussed in this paper. However,
determining the probabilities for each of the nodes in the BIN, based on theory or empirical
findings, requires a fair amount of time. The dependencies among the items and the
attributes must be modeled and initial probabilities must be loaded into the model. [CITE]
Conclusion
With the development of more complex scenario-based assessment tasks there is a need
to determine the appropriate measurement model to use to collect evidence. The evidence model
has two parts. The first is to determine how to score the work products produced by the student to
obtain observable variables. The second is to aggregate scores from these observable
variables in order to draw inferences about the student model variables of interest.
An ECD approach can aid in the specification of the student model variables, the
identification of the work products, the specifications for creating observable variables, and the
specification of a measurement model to produce inferences about the student model based on
the observable variables. Part of this process involves specifying the relationships among the
observable variables and the student model variables. The assumptions that are made during this
specification can help determine an appropriate measurement model.
Three measurement models, MRCML, CDMs, and BINs, are discussed here. These
models can be used with multiple student model variables, and have methods to deal with the
issue of item dependence. This makes them appropriate to use with complex scenario-based
assessments as these assessments often violate the assumptions of unidimensionality and local
independence of items required by more traditional measurement models. However, these are not
the only options for measurement models and an assessment developer should take time to
ensure that the measurement model they choose is appropriate for the purpose of their
assessment.
REFERENCES
Almond, R. G., DiBello, L. V., Moulder, B., & Zapata-Rivera, J. D. (2007). Modeling diagnostic
assessments with Bayesian networks. Journal of Educational Measurement, 44, 341-359.
Briggs, D. C., & Wilson, M. (2003). An introduction to multidimensional measurement using
Rasch models. Journal of Applied Measurement, 4(1), 87-100.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and
applications. Boston, MA: Kluwer Nijhoff Publishing.
Kennedy, C. A. (2005). Constructing PADI measurement models for the BEAR scoring engine.
PADI Technical Report 7. Menlo Park, CA: SRI International.
Levy, R. (2013). Psychometric and evidentiary advances, opportunities, and challenges for
simulation-based assessment. Educational Assessment, 18(3), 182-207.
Mislevy, R. J., Almond, R., DiBello, L., Jenkins, F., Steinberg, L., & Yan, D. (2002). Modeling
conditional probabilities in complex educational assessments. CSE Technical Report
580. Los Angeles: The National Center for Research on Evaluation, Standards, and Student
Testing (CRESST), Center for the Study of Evaluation, UCLA.
Mislevy, R. J., Haertel, G., Cheng, B. H., Ructtinger, L., DeBarger, A., Murray, E., et al. (2013).
A “conditional” sense of fairness in assessment. Educational Research and Evaluation:
An International Journal on Theory and Practices, 19(2-3).
Mislevy, R. J., & Riconscente, M. M. (2006). Evidence-centered assessment design: Layers,
concepts, and terminology. In S. Downing & T. Haladyna (Eds.), Handbook of test
development (pp. 61-90). Mahwah, NJ: Lawrence Erlbaum.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational
assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3-62.
Quellmalz, E. S., & Pellegrino, J. W. (2009). Technology and testing. Science, 323, 75–79.
Rupp, A. A., & Templin, J. L. (2008). Unique characteristics of diagnostic classification
models: A comprehensive review of the current state-of-the-art. Measurement, 6, 219-262.
Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and
applications. New York: The Guilford Press.
von Davier, M. (2008). A general diagnostic model applied to language testing data. British
Journal of Mathematical and Statistical Psychology, 61.