
The International Journal of Educational and Psychological Assessment, September 2012, Vol. 11(2)

© 2012 Time Taylor Academic Journals, ISSN 2094-0734

An Introduction to Differential Item Functioning

Hossein Karami

University of Tehran, Iran

Abstract

Differential Item Functioning (DIF) has been increasingly applied in fairness studies in psychometric circles. Judicious application of this methodology, however, requires an understanding of the technical complexities involved, and this has become an impediment especially for non-mathematically oriented researchers. This paper is an attempt to bridge the gap. It provides a non-technical introduction to the fundamental concepts involved in DIF analysis. In addition, an introductory-level explanation of a number of the most frequently applied DIF detection techniques is offered. These include Logistic Regression, Mantel-Haenszel, Standardization, Item Response Theory, and the Rasch model. For each method, a number of relevant software packages are also introduced.

Key words: Differential Item Functioning, Validity, Fairness, Bias

Introduction

Differential Item Functioning (DIF) occurs when two groups of equal ability are not equally able to answer an item correctly. In other words, one group does not have an equal chance of getting an item right even though its members have ability levels comparable to those of the other group. If the factor leading to DIF is not part of the construct being tested, then the test is biased.

DIF analysis has been increasingly applied in psychometric circles for detecting bias at the item level (Zumbo, 1999). Language testing researchers have followed suit and have exploited DIF analysis in their fairness studies, conducting a plethora of studies to investigate the existence of bias in their tests. These studies have focused on such factors as gender (e.g., Ryan & Bachman, 1992; Karami, 2011; Takala & Kaftandjieva, 2000), language background (Chen & Henning, 1985; Brown, 1999; Elder, 1996; Kim, 2001; Ryan & Bachman, 1992), and academic background or content knowledge (Alderson & Urquhart, 1985; Hale, 1988; Karami, 2010; Pae, 2004).

Despite the widespread application of DIF analysis in psychometric circles, the inherent complexity of the concepts involved seems to have hampered its wider use among less mathematically oriented researchers. This paper is an attempt to bridge this gap by providing a non-technical introduction to the fundamental concepts in DIF analysis. The paper begins with an overview of the basic concepts involved. Then, a brief overview of the development of fairness studies and DIF analyses during the last century follows. The paper ends with a detailed, though non-technical, explanation of a number of the most widely used DIF detection techniques. For each technique, some of the most widely used software packages are introduced, and a few studies applying the technique are listed. Neither the list of software nor the studies cited are meant to be exhaustive; rather, they are intended to orient the reader.

Differential Item Functioning

Differential Item Functioning (DIF) occurs when examinees with the same ability level but from two different groups have different probabilities of endorsing an item (Clauser & Mazor, 1998). It is synonymous with statistical bias, where one or more parameters of the statistical model are under- or overestimated (Camilli, 2006; Wiberg, 2007). Whenever DIF is present in an item, the source(s) of this variance should be investigated to ensure that it is not a case of bias. Any item flagged as showing DIF is biased if, and only if, the source of variance is irrelevant to the construct being measured by the test. In other words, it is a source of construct-irrelevant variance, and the groups perform differentially on the item because of a grouping factor (Messick, 1989, 1994).

There are at least two groups, i.e. focal and reference groups, in any DIF study. The focal group, a group of minorities for example, is the potentially disadvantaged group. The group which is considered to be potentially advantaged by the test is called the reference group. Note, however, that naming the groups is not always clear-cut; in such cases the labeling is essentially arbitrary.

There are two types of DIF, namely uniform and non-uniform DIF. Uniform DIF occurs when one group performs better than the other at all ability levels. That is, almost all members of a group outperform almost all members of the other group who are at the same ability levels. In the case of non-uniform DIF, members of one group are favored up to a point on the ability scale, and from that point on the relationship is reversed. That is, there is an interaction between grouping and ability level.

As stated earlier, DIF occurs when two groups of the same ability level have different chances of endorsing an item. Thus, a criterion is needed for matching the examinees for ability. The process is called conditioning, and the criterion is dubbed the matching criterion. Matching is of two types: internal and external. In the case of internal matching, the criterion is the observed or latent score of the test itself. For external matching, the observed or latent score of another test is considered as the criterion. External matching can become problematic because in such cases the assumption is that the supplementary test itself is free of bias and that it is testing the same construct as the test of focus (McNamara & Roever, 2006).

DIF is not in itself evidence of bias in the test. It is evidence of bias if, and only if, the factor causing DIF is irrelevant to the construct underlying the test. If that factor is part of the construct, it is called impact rather than bias. The decision as to whether the real source of DIF in an item is part of the construct being gauged is inevitably subjective. Usually, a panel of experts is consulted to give more validity to the interpretations.


The Development of DIF

The origins of bias analysis can be traced back to the early twentieth century (McNamara & Roever, 2006). At the time, researchers were concerned with developing tests that measured 'raw intelligence'. A number of studies conducted at the time, however, showed that the socio-economic status of the test takers was a confounding variable. Thus, researchers aimed to partial out this variance by purging items that functioned differently for examinees with high and low socio-economic status.

In the 1960s, the focus of bias studies shifted from intelligence tests to areas where social equity was a major concern (Angoff, 1993). The role of fairness in tests became highlighted, and a variety of techniques were developed for detecting bias in tests. There was a problem with all of these bias-detection techniques, however: they all required performance on a criterion test. "Criterion measures could not be obtained until tests were in use, however, making test-bias detection procedures inapplicable" (Scheuneman & Bleistein, 1989, p. 256). Consequently, researchers turned to devising a range of item-level detection procedures.

The Golden Rule settlement was a landmark in bias studies because legal issues entered the scene (McNamara & Roever, 2006). In 1976, the Golden Rule Insurance Company filed a suit against the Educational Testing Service and the Illinois Department of Insurance due to an alleged bias against Black test takers in the tests they developed. The case was resolved in favor of the Golden Rule Insurance Company: ETS was considered liable for the tests it developed and was legally bound to make every effort to rule out bias in its tests.

The main point about the settlement was that bias analysis turned out to be a legal issue, as test developing agencies were legally held responsible for the consequences of their tests. The case also highlighted the significance of Samuel Messick's (1980, 1989) work emphasizing the consequential aspects of tests in his validation framework.

A number of researchers (e.g. Linn & Drasgow, 1987) opposed the outcome, emphasizing that simply discarding items showing DIF may render the test invalid by making it less representative of the construct measured by the test. Another reason they put forward was that the items may reflect true differences between the test takers, in which case the test simply mirrors real-world ability differences.

The proponents of the settlement, however, argued that there is no reason to believe that there are ability differences between test takers simply because they are from different test taking groups. Thus, any observed difference in the performance of, say, Black and White examinees is cause for concern and a source of construct-irrelevant variance in Messick's terminology. Since then, a number of techniques have been developed to detect differentially functioning items.

DIF, Validity, and Fairness

The primary concern in test development and test use, as Bachman (1990) suggests, is demonstrating that the interpretations and uses we make of test scores are valid. Moreover, a test needs to be fair for different test takers. In other words, the test should not be biased against particular groups of test takers, e.g. males vs. females, Black vs. White examinees, etc. Examining such an issue requires, at a minimum, a statistical approach to test analysis which can first establish whether the test items are functioning differentially across test taking groups and then detect the sources of this variance (Geranpayeh & Kunnan, 2007). One of the approaches suggested for such purposes is DIF analysis.

Studying the differential performance of different test taking groups is essential in test development and test use. If the sources of DIF are irrelevant to the construct being measured by the test, the DIF is a source of bias and the validity of the test is called into question. The higher the stakes of the test, the more serious the consequences of test use. With high stakes tests, it is incumbent upon test users to ensure that their test is free of bias and that the interpretations made on the basis of test scores are valid.

Test fairness analysis and the search for test bias are closely interwoven. In fact, they are two sides of the same coin: whenever a test is biased, it is not fair, and vice versa. The search for fairness has gained new impetus during the last two decades, mainly due to advances within Critical Language Testing (CLT). The proponents of CLT believe that all uses of language tests are politically motivated. Tests are, as they suggest, means of manipulating society and imposing the will of the system on individuals (see Shohamy, 2001).

DIF analysis provides only a partial answer to fairness issues. It is focused only on the differential performance of two groups on an item. Therefore, whenever no groupings are involved in a test, DIF analysis is not applicable. However, when groupings are involved, the possibility exists that the items favor one group. If this happens, the test may not be fair for the disfavored group. Thus, DIF analysis should be applied in such contexts to address the problem.

DIF Methodology

McNamara and Roever (2006, p. 93) classify methods of DIF detection into four categories:

1. Analyses based on item difficulty, e.g. the transformed item difficulty index (TID) or delta plot.

2. Nonparametric methods, which make use of contingency tables and chi-square statistics.

3. Item-response-theory-based approaches, which include the one-, two-, and three-parameter logistic models.

4. Other approaches. These methods have not been developed primarily for DIF detection but can be utilized for this purpose; they include multifaceted Rasch measurement and generalizability theory.

Despite the diversity of techniques, only a limited number of them appear to be in current use. DIF detection techniques based on difficulty indices are not common. Although they are conceptually simple and their application does not require understanding complicated mathematical formulas, they face certain problems, including the fact that they assume equal discrimination across all items and that there is no matching for ability (McNamara & Roever, 2006). If the first assumption is not met, the results can be misleading (Angoff, 1993; Scheuneman & Bleistein, 1989): an item with a high discrimination level shows large differences between the groups, whereas the differences between the groups will not be significant for an item with low discrimination.

As indicated above, the DIF indices based upon item difficulty are not common, so they will not be discussed here. (For a detailed account of DIF detection methods, both traditional and modern, see Kamata & Vaughn, 2004; Scheuneman & Bleistein, 1989; Wiberg, 2007.) In the next sections, a general discussion of the most frequently used DIF detection methods is presented.

Logistic Regression

Logistic regression, first proposed for DIF detection by Swaminathan and Rogers (1990), is used when we have one or more independent variables, which are usually continuous, and a binary or dichotomous dependent variable (Pampel, 2000; Swaminathan & Rogers, 1990; Zumbo, 1999). In applying logistic regression to DIF detection, one attempts to see whether item performance (a right or wrong answer) can be predicted from total scores alone, from total scores plus group membership, and from total scores, group membership, and the interaction between them. The procedure can be presented formulaically as follows:

\ln\left(\frac{P_{mi}}{1 - P_{mi}}\right) = b_0 + b_1\,tot + b_2\,group + b_3\,(tot \times group)

In the formula, b_0 is the intercept, b_1 tot is the effect of the conditioning variable, which is usually the total score on the test, b_2 group is the grouping variable, and b_3 (tot × group) is the ability-by-grouping interaction effect. If the conditioning variable alone is enough to predict item performance, with relatively small residuals, then no DIF is present. If group membership, b_2 group, adds to the precision of the prediction, uniform DIF is detected; that is, one group performs better than the other across the ability range. Finally, if, in addition to total scores and grouping, the interaction effect, signified by b_3 (tot × group) in the formula, is also needed for a more precise prediction, it is a case of non-uniform DIF (Zumbo, 1999).

Also note that the formula is based on the logit, \ln\left(\frac{P_{mi}}{1 - P_{mi}}\right), where P_{mi} is the probability of a correct answer to item i by person m and 1 - P_{mi} is the probability of a wrong response. In simple terms, it is the natural logarithm of the odds of success.

Identifying DIF through logistic regression is similar to stepwise regression in that successive models are built, entering a new variable at each step to see whether the new model is an improvement over the previous one. As such, logistic regression involves three successive steps:

1. The conditioning variable (the total score) is entered into the model.

2. The grouping variable is added.

3. The interaction term is entered.

As a test of the significance of DIF, the chi-square value of step 1 is subtracted from the chi-square value of step 3; this is an overall index of the significance of DIF. The chi-square value of step 2 can be subtracted from that of step 3 to provide a significance test of non-uniform DIF. In addition, comparing the chi-square values of steps 1 and 2 provides an indicator of uniform DIF.
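A minimal sketch of this three-step procedure in Python, using statsmodels for the logistic fits and scipy for the likelihood-ratio chi-square tests; the variable names (total, group, resp) and the simulated data are illustrative assumptions, not part of the original study:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Illustrative data: total score, group membership (0 = reference, 1 = focal),
# and a 0/1 response to the studied item (with some uniform DIF built in).
n = 400
total = rng.integers(10, 40, size=n)           # matching (conditioning) variable
group = rng.integers(0, 2, size=n)             # grouping variable
true_logit = -4 + 0.2 * total - 0.5 * group
resp = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

def loglik(*cols):
    """Fit a binary logistic model on the given predictors and return its log-likelihood."""
    X = sm.add_constant(np.column_stack(cols))
    return sm.Logit(resp, X).fit(disp=0).llf

ll1 = loglik(total)                            # step 1: total score only
ll2 = loglik(total, group)                     # step 2: + group
ll3 = loglik(total, group, total * group)      # step 3: + interaction

# Likelihood-ratio chi-squares mirroring the step comparisons in the text.
overall = 2 * (ll3 - ll1)      # steps 1 vs 3: overall DIF, df = 2
uniform = 2 * (ll2 - ll1)      # steps 1 vs 2: uniform DIF, df = 1
nonuniform = 2 * (ll3 - ll2)   # steps 2 vs 3: non-uniform DIF, df = 1

print(f"overall DIF:     chi2 = {overall:.2f}, p = {chi2.sf(overall, 2):.4f}")
print(f"uniform DIF:     chi2 = {uniform:.2f}, p = {chi2.sf(uniform, 1):.4f}")
print(f"non-uniform DIF: chi2 = {nonuniform:.2f}, p = {chi2.sf(nonuniform, 1):.4f}")
```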

Zumbo (1999) argued that logistic regression has three main advantages over other DIF detection techniques in that one:

- need not categorize a continuous criterion variable,

- can model both uniform and non-uniform DIF, and

- can generalize the binary logistic regression model for use with ordinal item scores. (p. 23)

Also, Wiberg (2007) noted that logistic regression and the Mantel-Haenszel statistic (explained in a later section) have gained particular attention because they can be utilized for detecting DIF in small samples. For example, Zumbo (1999) pointed out that only about 200 people per group are needed. This is a modest sample size compared to that required by other models such as three-parameter IRT, which requires over 1,000 test takers per group.

McNamara and Roever (2006) also stated that "Logistic regression is useful because it allows modeling of uniform and non-uniform DIF, is nonparametric, can be applied to dichotomous and rated items, and requires less complicated computing than IRT-based analysis" (p. 116).

There are a number of software packages for DIF analysis using logistic regression. LORDIF (Choi, Gibbons, & Crane, 2011) conducts DIF analysis for dichotomous and polytomous items using both ordinal logistic regression and IRT. In addition, SPSS can be used for DIF analysis through both the MH procedure and logistic regression; Zumbo (1999) and Kamata and Vaughn (2004) provide examples of such analyses. Magis, Béland, Tuerlinckx, and De Boeck (2010) have also introduced an R package for DIF detection, called difR, that can apply nine DIF detection techniques, including logistic regression and the MH procedure.

A number of studies have applied logistic regression for DIF detection. Sha'bani (2008) utilized logistic regression to analyze a version of the University of Tehran English Proficiency Test (UTEPT) for the presence of DIF due to gender differences. Kim (2001) conducted a DIF analysis of the polytomously scored speaking items in the SPEAK test (the Speaking Proficiency English Assessment Kit), a test developed by the Educational Testing Service; the participants were divided into East Asian and European groups, and the IRT likelihood ratio test and logistic regression were used to detect the differentially functioning items. Davidson (2004) investigated the comparability of the performances of non-aboriginal and aboriginal students. Lee, Breland, and Muraki (2004) examined the comparability of computer-based testing (CBT) writing prompts in the Test of English as a Foreign Language (TOEFL) for examinees of different native language backgrounds, with a focus on European (German, French, and Spanish) and East Asian (Chinese, Japanese, and Korean) native language groups as reference and focal groups, respectively.

Standardization

The idea here is to compute the difference between the proportions of test takers from the focal and reference groups who answer the item correctly at each score level, with more weight attached to score levels with more test takers (McNamara & Roever, 2006). The procedure can be presented formulaically as (Clauser & Mazor, 1998):

D_{std} = \sum_{s} W_s (P_{fs} - P_{rs})

where W_s is the relative frequency of the group members at score level s (the weight attached to that level), P_{fs} is the proportion of the focal group at score level s correctly responding to the item, and P_{rs} is the proportion of reference group members scoring s who endorse the item.

There are two versions of this technique, based on whether the sign of the difference is taken into account or not: the unsigned proportion difference and the signed proportion difference (Wiberg, 2007). The latter is also referred to as the standardized p-difference and is the more common index. An item is typically flagged as showing DIF if the absolute value of this index is above 0.1.
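A small sketch of the standardized p-difference computation in Python. The score-level counts below are made up for illustration, and the weights are taken from the focal group's score distribution, which is one common convention; that weighting choice is an assumption of the sketch, not something specified above:

```python
import numpy as np

# Per score level s: number of focal/reference examinees and how many of each
# answered the studied item correctly. All numbers are illustrative.
focal_n  = np.array([20, 35, 50, 40, 15])
focal_ok = np.array([ 4, 12, 28, 30, 13])
ref_n    = np.array([25, 40, 55, 45, 20])
ref_ok   = np.array([ 6, 16, 33, 36, 18])

p_f = focal_ok / focal_n            # P_fs: focal proportion correct at level s
p_r = ref_ok / ref_n                # P_rs: reference proportion correct at level s

# W_s: relative-frequency weights, here based on the focal group's distribution
# (an assumed convention for this sketch).
w = focal_n / focal_n.sum()

std_p_dif = np.sum(w * (p_f - p_r))  # D_std = sum_s W_s (P_fs - P_rs)

print(f"standardized p-difference = {std_p_dif:.3f}")
print("flag as DIF" if abs(std_p_dif) > 0.1 else "no flag (|D_std| <= 0.1)")
```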

Despite its conceptual and statistical simplicity, the standardization procedure is not widely used, largely because of the large sample sizes it requires (McNamara & Roever, 2006). Another shortcoming of the procedure is that it has no significance test (Clauser & Mazor, 1998).

One of the most recent software packages introduced for DIF detection through the standardization procedure is EASY-DIF (González et al., 2011), which also applies the Mantel-Haenszel procedure discussed in the next section. In addition, STDIF (Robin, 2001) is a free DOS-based program that computes DIF through the standardization approach. STDIF has a freely available manual (Zenisky, Robin, & Hambleton, 2009). The software and the manual are both available at: http://www.umass.edu/remp/software/STDIF.html.

Zenisky, Hambleton, and Robin (2003) utilized STDIF to apply a two-stage methodology for evaluating DIF in large-scale state assessment data; these researchers were concerned with the merits of iterative approaches to DIF detection. In a later study, Zenisky, Hambleton, and Robin (2004) also applied STDIF to identify gender DIF in a large-scale science assessment. As the authors explain, their methodology was a variant of the standardization technique. Lawrence, Curley, and McHale (1988) also applied the standardization technique to detect differentially functioning items among the reading comprehension and sentence completion items in the verbal section of the Scholastic Aptitude Test (SAT). Freedle and Kostin (1997) conducted an ethnic comparison study using the DIF methodology; they scrutinized a large number of items from the SAT and GRE exams, comparing the performance of Black and White examinees. Gallagher (2004) applied the standardization procedure, logistic regression, and the MH procedure to investigate reading performance differences between African-American and White students taking a nationally normed reading test.

Mantel-Haenszel

The Mantel-Haenszel (MH) procedure was first proposed for DIF analysis by Holland and Thayer (as cited in Kamata & Vaughn, 2004). The basic idea is to calculate the odds of correctly endorsing an item for the reference group relative to the focal group at each level of the matching criterion. If there are large differences, DIF is present.

According to Scheuneman and Bleistein (1989), "The MH estimate is a weighted average of the odds ratios at each of j ability levels" (p. 262). That is, the odds ratios of success are estimated at each ability level and then combined across all ability levels.

Table 1 shows the hypothetical performance of two groups of test takers, a reference group and a focal group, on an item.

Table 1
Hypothetical Performance of Two Groups on an Item

                  Correct   Incorrect   Total
Reference group      14         6         20
Focal group           8        12         20
Total                20        20         40

The first step in calculating the MH statistic is to compute the probabilities of correct and incorrect responses for both groups; the empirical probabilities are shown in Table 2. The second step is to find out how much more likely the members of either group are to answer the item correctly rather than incorrectly. For the reference group, the odds are:

Ω_r = 0.7 / 0.3 = 2.33

Similarly, the odds of giving a correct answer to the item for the focal group are:

Ω_f = 0.4 / 0.6 = 0.67

Table 2
Empirical Probabilities

                  Correct   Incorrect
Reference group     .70        .30
Focal group         .40        .60

Finally, we want to know how much more likely the members of the reference group are to respond correctly than the members of the focal group. To this end, we take the odds ratio:

α = 2.33 / 0.67 ≈ 3.5

Simply put, the odds ratio in the above example shows that members of the reference group are about three and a half times more likely than members of the focal group to endorse the item.

Note, however, that we have calculated the odds ratio for only one ability level. The overall index combines these level-by-level odds ratios into a weighted average across all ability levels. The resulting index is the Mantel-Haenszel odds ratio, denoted α_MH. This index is usually transformed as follows:

β_MH = ln(α_MH)

A negative β_MH indicates DIF in favor of the focal group, whereas a positive β_MH shows DIF favoring the reference group (Wiberg, 2007). Sometimes β_MH is further transformed into:

MH D-DIF = −2.35 ln(α_MH)

A positive value of MH D-DIF indicates that the item is more difficult for the reference group, while a negative value shows that the focal group faces more difficulty with the item.

The Educational Testing Service uses the MH statistic in DIF analysis. Items flagged as showing DIF are further classified into three types (Zieky, 1993) to "avoid identifying items that display practically trivial but statistically significant DIF" (Clauser & Mazor, 1998, p. 39). Items are identified as showing type A DIF if the absolute value of MH D-DIF is smaller than 1.0 or not significantly different from zero. Type C DIF occurs when the absolute value of MH D-DIF is greater than 1.5 and significantly greater than 1.0. All other DIF items are flagged as type B.
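A compact Python sketch of the computation just described: it builds a 2x2 table at each score level, pools them with the standard Mantel-Haenszel common odds ratio estimator, converts the result to MH D-DIF, and applies the size thresholds of the A/B/C classification. The first table reproduces Table 1; the remaining counts are invented, and the significance-test part of the ETS rule is omitted here:

```python
import numpy as np

# One 2x2 table per matched score level s:
# (ref_correct, ref_incorrect, focal_correct, focal_incorrect). Counts are illustrative.
tables = [
    (14,  6,  8, 12),   # the Table 1 data, treated as one score level
    (22,  8, 15, 15),
    (30,  5, 24, 11),
    (18,  2, 16,  4),
]

num = 0.0  # sum over levels of ref_correct * focal_incorrect / n_s
den = 0.0  # sum over levels of ref_incorrect * focal_correct / n_s
for a, b, c, d in tables:
    n_s = a + b + c + d
    num += a * d / n_s
    den += b * c / n_s

alpha_mh = num / den                 # MH common odds ratio
beta_mh = np.log(alpha_mh)           # ln(alpha_MH); > 0 favors the reference group
mh_d_dif = -2.35 * beta_mh           # ETS delta metric; < 0 means harder for the focal group

# Size-based part of the ETS A/B/C classification (significance tests not shown).
abs_d = abs(mh_d_dif)
category = "A" if abs_d < 1.0 else ("C" if abs_d > 1.5 else "B")

print(f"alpha_MH = {alpha_mh:.3f}, beta_MH = {beta_mh:.3f}, MH D-DIF = {mh_d_dif:.3f}")
print(f"size-based category: {category}")
```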

The main software packages for DIF analysis using the MH procedure are DIFAS (Penfield, 2005), EZDIF (Waller, 1998), and, more recently, EASY-DIF (González, Padilla, Hidalgo, Gómez-Benito, & Benítez, 2011). Another relevant program is DICHODIF (Rogers, Swaminathan, & Hambleton, 1993), which can apply both the MH procedure and logistic regression. Also, LERTAP (Nelson, 2000) is an Excel-based classical item analysis package that can do DIF analysis using the MH procedure; its student version is freely available, and the full version is available from http://assess.com/xcart/product.php?productid=235&cat=21&page=1 (see also http://lertap.curtin.edu.au/ for more information). Winsteps (Linacre, 2010) also provides MH-based DIF estimates as part of its output.

Elder (1996) conducted a study to determine whether language background may lead to DIF. She examined the reading and listening subsections of the Australian Language Certificates (ALC), a test given to Australian school-age learners from diverse language backgrounds. Her participants were enrolled in language classes in years 8 and 9, and the languages of focus were Greek, Italian, and Chinese. Elder compared the performance of background speakers (those who spoke the target language plus English at home) with non-background speakers (those who were only exposed to English at home), applying the Mantel-Haenszel procedure to detect DIF. Ryan and Bachman (1992) also utilized the Mantel-Haenszel procedure, comparing the performance of male and female test takers on the FCE and TOEFL tests. Allalouf and Abramzon (2008) investigated the differences between groups from different first language backgrounds, namely Arabic and Russian, using the Mantel-Haenszel procedure. Ockey (2007) applied both IRT and the MH procedure to compare the scores of English language learner (ELL) and non-ELL 8th-grade students on National Assessment of Educational Progress (NAEP) math word problems. For an overview of applications of the Mantel-Haenszel procedure to DIF detection, see Guilera, Gómez-Benito, and Hidalgo (2009).

Item Response Theory

The main difference between IRT-based DIF detection techniques and other methods, including logistic regression and the MH procedure, is that in non-IRT approaches, "examinees are typically matched on an observed variable (such as total test score), and then counts of examinees in the focal and reference groups getting the studied item correct or incorrect are compared" (Clauser & Mazor, 1998, p. 35). That is, the conditioning or matching criterion is the observed score. In IRT-based methods, by contrast, matching is based on the examinees' estimated level on the latent trait, θ.

Figure 1. A typical ICC.

Methods based on item response theory are conceptually elegant though mathematically complicated. The building block of IRT is the item characteristic curve (ICC) (see Baker, 2001; DeMars, 2010; Embretson & Reise, 2000; Hambleton, Swaminathan, & Rogers, 1991). It is a smooth S-shaped curve which depicts the relationship between ability level and the probability of a correct response to the item. As is evident from Figure 1, the probability of a correct response approaches one at the higher end of the ability scale without ever actually reaching one. Similarly, at the lower end of the ability scale, the probability approaches, but never reaches, zero.

IRT uses three features to describe the shape of the ICC: item difficulty, item discrimination, and a guessing factor. Based on how many of these parameters are involved in estimating the relationship between ability and item responses, there are three IRT models, namely the one-, two-, and three-parameter logistic models.

In the one-parameter logistic model and the Rasch model, it is assumed that all items have the same discrimination level. The two-parameter IRT model takes account of item difficulty and item discrimination; however, guessing is assumed to be uniform across ability levels. Finally, the three-parameter model includes a guessing parameter in addition to item difficulty and discrimination.

The models provide a mathematical equation relating the responses to ability levels (Baker, 2001). The equation for the three-parameter model is:

P(\theta) = c + (1 - c)\,\frac{1}{1 + e^{-\alpha(\theta - b)}}

where:

b is the difficulty parameter,
α is the discrimination parameter,
c is the guessing or pseudo-chance parameter, and
θ is the ability level.

The basic idea in detecting DIF through IRT models is that if DIF is present in an item, the ICCs of the item for the reference and focal groups will differ (Thissen, Steinberg, & Wainer, 1993). Where there is no DIF, the item parameters, and hence the ICCs, should be almost the same. The ICCs will differ whenever the item parameters vary from one group to another. Thus, one possible way of detecting DIF through IRT is to compare the item parameters estimated separately in the two groups: if the item parameters are significantly different, DIF is present.
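As a quick illustration of this idea, the sketch below evaluates the three-parameter ICC for one item under two sets of group-specific parameter estimates and reports the largest gap between the curves. The parameter values are invented, and eyeballing the gap is only the informal core of the approach, not a substitute for the formal parameter-comparison tests mentioned above:

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC: P(theta) = c + (1 - c) / (1 + exp(-a(theta - b)))."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item parameter estimates from separate group calibrations.
ref_params = dict(a=1.2, b=-0.3, c=0.20)   # reference group
foc_params = dict(a=1.2, b=0.4, c=0.20)    # focal group: item appears harder

theta = np.linspace(-4, 4, 161)            # ability grid
p_ref = icc_3pl(theta, **ref_params)
p_foc = icc_3pl(theta, **foc_params)

gap = np.abs(p_ref - p_foc)
print(f"max ICC gap = {gap.max():.3f} at theta = {theta[gap.argmax()]:.2f}")
# A sizeable, systematic gap between the two ICCs is the signature of DIF;
# identical parameters would make the curves (and the gap) collapse to zero.
```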

IRT-based DIF can be computed using BILOG-MG (Scientific Software International, 2003) for dichotomously scored items, and PARSCALE (Muraki & Bock, 2002) or MULTILOG (Thissen, 1991) for polytomously scored items. In addition, for small sample sizes, nonparametric IRT can be employed using the TestGraf software (Ramsay, 2001); for an exemplary application of TestGraf to DIF detection, see Laroche, Kim, and Tomiuk (1998). Finally, the IRTDIF program (Kim & Cohen, 1992) can do DIF analysis within the IRT framework.

Pae (2004) undertook a DIF study of examinees with different academic backgrounds sitting the English subtest of the Korean National Entrance Exam for Colleges and Universities. He applied three-parameter IRT through MULTILOG for the DIF analysis; before applying IRT, however, he also ran an initial DIF analysis using the MH procedure to detect suspect items. Geranpayeh and Kunnan (2007) examined the existence of differentially functioning items in the listening section of the Certificate in Advanced English examination for test takers from three different age groups. Uiterwijk and Vallen (2005) investigated the performance of second generation immigrant (SGI) students and native Dutch (ND) students on the Final Test of Primary Education in the Netherlands; both IRT and the Mantel-Haenszel procedure were applied in their study.

The Rasch Model

Although the one-parameter logistic model and the Rasch model are mathematically similar, they were developed independently of each other. In fact, a number of scholars (e.g. Pollitt, 1997) believe that the IRT models are fundamentally different from the Rasch model.

The Rasch model focuses on the probability that person n endorses item i. In modeling this probability, it takes into account only person ability and item difficulty: the probability is a function of the difference between the two. The following formula shows just this:

P(x = 1 \mid \theta, \delta) = f(\theta_n - \delta_i)

where θ_n is person ability and δ_i is item difficulty. The formula simply states that the probability of endorsing the item is a function of the difference between person ability, θ_n, and item difficulty, δ_i. This is possible because item difficulty and person ability are placed on the same scale in the Rasch model. It is also intuitively appealing to conceive of probability in such terms: the Rasch model assumes that any person taking the test possesses a certain amount of the construct gauged by the test and that any item also demands a certain amount of that construct. These values work in opposite directions, so it is the difference between person ability and item difficulty that counts.

Three cases can be considered for any encounter between a person and an item (Wilson, 2005):

1. Person ability and item difficulty are the same, θ_n − δ_i = 0, and the person has an equal probability of endorsing or failing the item; thus the probability is .5.

2. Person ability is greater than item difficulty, θ_n − δ_i > 0, and the person has a greater than .5 probability of endorsing the item.

3. Person ability is lower than item difficulty, θ_n − δ_i < 0, and the probability of giving a correct response to the item is less than .5.

The exact formula for the Rasch model is the following:

\ln\left(\frac{P_{ni}}{1 - P_{ni}}\right) = \theta_n - \delta_i
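A few lines of Python make the three cases above concrete by solving the Rasch formula for the probability itself; the ability and difficulty values are arbitrary illustrations:

```python
import numpy as np

def rasch_prob(theta, delta):
    """P(correct) under the Rasch model: exp(theta - delta) / (1 + exp(theta - delta))."""
    return 1.0 / (1.0 + np.exp(-(theta - delta)))

delta = 0.5                               # item difficulty (logits), illustrative
for theta in (0.5, 1.5, -0.5):            # ability equal to, above, and below difficulty
    print(f"theta = {theta:+.1f}, delta = {delta:+.1f} -> P = {rasch_prob(theta, delta):.2f}")
# Prints P = 0.50, 0.73, and 0.27, matching cases 1, 2, and 3 above.
```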


The Rasch model is intended to provide sample-independent item difficulty estimates. DIF occurs when this invariance does not hold in a particular application of the model (Engelhard, 2009); that is, the difficulty estimates turn out to depend on the sample taking the test. The amount of DIF can be evaluated with the separate calibration t-test approach first proposed by Wright and Stone (1979; see Smith, 2004). The formula is the following:

t = \frac{d_{i2} - d_{i1}}{\sqrt{s_{i1}^2 + s_{i2}^2}}

where d_{i1} is the difficulty of item i in the calibration based on group 1, d_{i2} is the difficulty of item i in the calibration based on group 2, s_{i1} is the standard error of estimate for d_{i1}, and s_{i2} is the standard error of estimate for d_{i2}. Baghaei (2009), Bond and Fox (2007), and Wilson (2005) present excellent introductory-level expositions of the Rasch model.

Among the software for DIF using the Rasch model are ConQuest (Wu, Adams, Wilson, & Haldane, 2007), Winsteps (Linacre, 2010), and Facets (Linacre, 2009).
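Given difficulty estimates and standard errors from two separate calibrations (as produced by any of the programs just mentioned), the separate calibration t-test reduces to a single line of arithmetic. The numbers below are invented for illustration, and the two-tailed p-value uses a normal approximation rather than exact degrees of freedom:

```python
from math import sqrt
from scipy.stats import norm

# Illustrative output of two separate Rasch calibrations for one item:
# difficulty estimate (logits) and its standard error in each group.
d1, se1 = -0.32, 0.11   # calibration based on group 1
d2, se2 = 0.18, 0.12    # calibration based on group 2

t = (d2 - d1) / sqrt(se1**2 + se2**2)   # separate calibration t statistic
p = 2 * norm.sf(abs(t))                 # two-tailed p, normal approximation

print(f"t = {t:.2f}, p = {p:.4f}")
# |t| well above roughly 2 suggests the item's difficulty shifts between groups, i.e. DIF.
```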

Karami (2011) applied the Rasch model to investigate the existence of DIF in UTEPT items for male and female examinees, using Linacre's Winsteps for the DIF analysis. Karami (2010) also exploited Winsteps to examine the UTEPT items for possible DIF for test takers from different academic backgrounds. Elder, McNamara, and Congdon (2003) applied the Rasch model to examine the performance of native and non-native speakers on a test of academic English.

Furthermore, Takala and Kaftandjieva (2000) undertook a study to investigate the presence of DIF in the vocabulary subtest of the Finnish Foreign Language Certificate Examination, "an official, national high-stakes foreign-language examination based on a bill passed by Parliament". To detect DIF, they utilized the One Parameter Logistic Model (OPLM), a modification of the Rasch model in which item discrimination is not fixed at one but is input as a known constant. Pallant and Tennant (2007) also applied the Rasch model to scrutinize the utility of the Hospital Anxiety and Depression Scale (HADS) total score (HADS-14) as a measure of psychological distress.

Conclusion

DIF analysis aims to detect items that differentially favor examinees who are of the same ability level but come from different groups. The technical requirements of this methodology, however, have been an obstacle for non-mathematically oriented researchers. Yet even researchers who do not apply these techniques in their own studies need to be familiar with them in order to fully appreciate published papers that report such analyses.

This paper has attempted to provide a non-technical introduction to the basic principles of DIF analysis. Five DIF detection techniques were explained: Logistic Regression, Mantel-Haenszel, Standardization, Item Response Theory, and the Rasch model. For each technique, a number of the most widely applied software packages, along with some studies applying the technique, were briefly cited. The interested reader may refer to such studies for further information about their application. It is hoped that the exposition offered here will enable researchers to appreciate and enjoy reading studies that report DIF analyses.

References

Alderson, J. C., & Urquhart, A. (1985). The effect of students' academic discipline on their performance on ESP reading tests. Language Testing, 2, 192-204.
Allalouf, A., & Abramzon, A. (2008). Constructing better second language assessments based on differential item functioning analysis. Language Assessment Quarterly, 5, 120-141.
Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-4). Hillsdale, NJ: Lawrence Erlbaum Associates.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Baghaei, P. (2009). Understanding the Rasch model. Mashad: Mashad Islamic Azad University Press.
Baker, F. (2001). The basics of item response theory. ERIC Clearinghouse on Assessment and Evaluation, University of Maryland, College Park, MD.
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences. London: Lawrence Erlbaum.
Brown, J. D. (1999). The relative importance of persons, items, subtests and languages to TOEFL test variance. Language Testing, 16, 217-238.
Camilli, G. (2006). Test fairness. In R. Brennan (Ed.), Educational measurement (pp. 221-256). New York: American Council on Education & Praeger Series on Higher Education.
Chen, Z., & Henning, G. (1985). Linguistic and cultural bias in language proficiency tests. Language Testing, 2(2), 155-163.
Choi, S. W., Gibbons, L. E., & Crane, P. K. (2011). lordif: An R package for detecting differential item functioning using iterative hybrid ordinal logistic regression/item response theory and Monte Carlo simulations. Journal of Statistical Software, 39(8), 1-30.
Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17, 31-44.
Davidson, B. (2004). Comparability of test scores for non-aboriginal and aboriginal students (Doctoral dissertation, University of British Columbia, 2004). UMI ProQuest Digital Dissertations.
DeMars, C. E. (2010). Item response theory. New York: Oxford University Press.
Elder, C. (1996). The effect of language background on "foreign" language test performance: The case of Chinese, Italian, and Modern Greek. Language Learning, 46, 233-282.
Elder, C., McNamara, T. F., & Congdon, P. (2003). Understanding Rasch measurement: Rasch techniques for detecting bias in performance assessments: An example comparing the performance of native and non-native speakers on a test of academic English. Journal of Applied Measurement, 4, 181-197.

Embretson, S. E., & Reise, S. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Engelhard, G. (2009). Using item response theory and model-data fit to conceptualize differential item and person functioning for students with disabilities. Educational and Psychological Measurement, 69, 585-602.
Freedle, R., & Kostin, I. (1997). Predicting black and white differential item functioning in verbal analogy performance. Intelligence, 24, 417-444.
Gallagher, M. (2004). A study of differential item functioning: Its use as a tool for urban educators to analyze reading performance (Unpublished doctoral dissertation, Kent State University). UMI ProQuest Digital Dissertations.
Geranpayeh, A., & Kunnan, A. J. (2007). Differential item functioning in terms of age in the Certificate in Advanced English examination. Language Assessment Quarterly, 4, 190-222.
González, A., Padilla, J. L., Hidalgo, M. D., Gómez-Benito, J., & Benítez, I. (2011). EASY-DIF: Software for analyzing differential item functioning using the Mantel-Haenszel and standardization procedures. Applied Psychological Measurement, 35, 483-484.
Guilera, G., Gómez-Benito, J., & Hidalgo, M. D. (2009). Scientific production on the Mantel-Haenszel procedure as a way of detecting DIF. Psicothema, 21(3), 492-498.
Hale, G. A. (1988). Student major field and text content: Interactive effects on reading comprehension in the Test of English as a Foreign Language. Language Testing, 5, 49-61.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Kamata, A., & Vaughn, B. K. (2004). An introduction to differential item functioning analysis. Learning Disabilities: A Contemporary Journal, 2, 49-69.
Karami, H. (2010). A differential item functioning analysis of a language proficiency test: An investigation of background knowledge bias. Unpublished master's thesis, University of Tehran, Iran.
Karami, H. (2011). Detecting gender bias in a language proficiency test. International Journal of Language Studies, 5, 167-178.
Kim, M. (2001). Detecting DIF across the different language groups in a speaking test. Language Testing, 18, 89-114.
Kim, S.-H., & Cohen, A. S. (1992). IRTDIF: A computer program for IRT differential item functioning analysis. Applied Psychological Measurement, 16, 158.
Laroche, M., Kim, C., & Tomiuk, M. A. (1998). Translation fidelity: An IRT analysis of Likert-type scale items from a culture change measure for Italian-Canadians. Advances in Consumer Research, 25, 240-245.
Lawrence, I. M., Curley, W. E., & McHale, F. J. (1988). Differential item functioning for males and females on SAT verbal reading subscore items (Report No. 88-4). New York: College Entrance Examination Board.

Lee, Y. W., Breland, H., & Muraki, E. (2004). Comparability of TOEFL CBT writing prompts for different native language groups (TOEFL Research Report No. RR-77). Princeton, NJ: Educational Testing Service. Retrieved September 29, 2011, from http://www.ets.org/Media/Research/pdf/RR-04-24.pdf
Linacre, J. M. (2009). FACETS Rasch-model computer program (Version 3.66.0) [Computer software]. Chicago, IL: Winsteps.com.
Linacre, J. M. (2010). Winsteps (Version 3.70.0) [Computer software]. Beaverton, OR: Winsteps.com.
Linn, R. L., & Drasgow, F. (1987). Implications of the Golden Rule settlement for test construction. Educational Measurement: Issues and Practice, 6, 13-17.
Magis, D., Béland, S., Tuerlinckx, F., & De Boeck, P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42, 847-862.
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Malden, MA & Oxford: Blackwell.
Messick, S. (1980). Test validation and the ethics of assessment. American Psychologist, 35, 1012-1027.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (pp. 13-103). New York: American Council on Education & Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.
Muraki, E., & Bock, D. (2002). PARSCALE 4.1 [Computer program]. Chicago: Scientific Software International.
Nelson, L. R. (2000). Item analysis for tests and surveys using Lertap 5. Perth, Western Australia: Curtin University of Technology (www.lertap.curtin.edu.au).
Ockey, G. J. (2007). Investigating the validity of math word problems for English language learners with DIF. Language Assessment Quarterly, 4(2), 149-164.
Pae, T. (2004). DIF for learners with different academic backgrounds. Language Testing, 21, 53-73.
Pallant, J. F., & Tennant, A. (2007). An introduction to the Rasch measurement model: An example using the Hospital Anxiety and Depression Scale (HADS). British Journal of Clinical Psychology, 46, 1-18.
Pampel, F. (2000). Logistic regression: A primer. Thousand Oaks, CA: Sage.
Penfield, R. D. (2005). DIFAS: Differential Item Functioning Analysis System. Applied Psychological Measurement, 29, 150-151.
Pollitt, A. (1997). Rasch measurement in latent trait models. In C. Clapham & D. Corson (Eds.), Encyclopedia of language and education. Volume 7: Language testing and assessment (pp. 243-254). Dordrecht: Kluwer Academic.
Ramsay, J. O. (2001). TestGraf: A program for the graphical analysis of multiple-choice test and questionnaire data [Computer software and manual]. Montreal, Canada: McGill University.
Robin, F. (2001). STDIF: Standardization-DIF analysis program [Computer program]. Amherst, MA: University of Massachusetts, School of Education.

Rogers, H. J., Swaminathan, H., & Hambleton, R. K. (1993). DICHODIF: A FORTRAN program for DIF analysis of dichotomously scored item response data [Computer software]. Amherst: University of Massachusetts.
Roznowski, M., & Reith, J. (1999). Examining the measurement quality of tests containing differentially functioning items: Do biased items result in poor measurement? Educational and Psychological Measurement, 59, 248-269.
Ryan, K., & Bachman, L. (1992). Differential item functioning on two tests of EFL proficiency. Language Testing, 9, 12-29.
Sasaki, M. (1991). A comparison of two methods for detecting differential item functioning in an ESL placement test. Language Testing, 8(2), 95-111.
Scheuneman, J. D., & Bleistein, C. A. (1989). A consumer's guide to statistics for identifying differential item functioning. Applied Measurement in Education, 2, 255-275.
Shohamy, E. (2001). The power of tests: A critical perspective on the uses of language tests. London: Longman/Pearson Education.
Smith, R. (2004). Detecting item bias with the Rasch model. Journal of Applied Measurement, 5(4), 430-449.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370.
Takala, S., & Kaftandjieva, F. (2000). Test fairness: A DIF analysis of an L2 vocabulary test. Language Testing, 17, 323-340.
Thissen, D. (1991). MULTILOG user's guide: Multiple categorical item analysis and test scoring using item response theory (Version 6.0). Chicago: Scientific Software.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Lawrence Erlbaum Associates.
Uiterwijk, H., & Vallen, T. (2005). Linguistic sources of item bias for second generation immigrants in Dutch tests. Language Testing, 22, 211-234.
Waller, N. G. (1998). EZDIF: Detection of uniform and nonuniform differential item functioning with the Mantel-Haenszel and logistic regression procedures. Applied Psychological Measurement, 22, 391.
Wiberg, M. (2007). Measuring and detecting differential item functioning in criterion-referenced licensing tests: A theoretic comparison of methods (Educational Measurement Technical Report No. 2).
Wilson, M. (2005). Constructing measures: An item response modeling approach. London: Lawrence Erlbaum Associates.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Wu, M. L., Adams, R. J., Wilson, M. R., & Haldane, S. A. (2007). ACER ConQuest Version 2: Generalized item response modeling software [Computer program]. Camberwell: Australian Council for Educational Research.

Zenisky, A. L., Hambleton, R. K., & Robin, F. (2003). Detection of differential item functioning in large-scale state assessments: A study evaluating a two-stage approach. Educational and Psychological Measurement, 63(1), 49-62.
Zenisky, A. L., Hambleton, R. K., & Robin, F. (2004). DIF detection and interpretation in large-scale science assessments: Informing item-writing practices. Educational Assessment, 9(1&2), 61-78.
Zenisky, A. L., Robin, F., & Hambleton, R. K. (2009). Differential item functioning analyses with STDIF: User's guide. Amherst, MA: University of Massachusetts, Center for Educational Assessment.
Zieky, M. (1993). Practical questions in the use of DIF statistics in test development. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 337-348). Hillsdale, NJ: Lawrence Erlbaum Associates.
Zumbo, B. D. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, Canada: Directorate of Human Resources Research and Evaluation, Department of National Defense.

About the Author

Hossein Karami ([email protected]) is currently a Ph.D. candidate in TEFL and an instructor at the Faculty of Foreign Languages and Literature, University of Tehran, Iran. His research interests include various aspects of language testing in general, and Differential Item Functioning, validity, and fairness in particular.

