Reliability and Validity
Chong Ho Yu, Ph.D.
Conventional views of reliability (AERA et al., 1985)
Temporal stability: administering the same form of a test on two or more separate occasions
to the same group of examinees (test-retest). On many occasions this approach
is not practical because repeated measurements are likely to change the
examinees. For example, examinees may adapt to the test format and thus tend
to score higher on later tests. Hence, careful implementation of the test-retest
approach is strongly recommended (Yu, 2005).
Form equivalence: administering two different forms of a test, based on the same content, on
one occasion to the same examinees (alternate form). After alternate forms
have been developed, they can be used for different examinees. This is very common
in high-stakes examinations for pre-empting cheating: an examinee who took
Form A earlier cannot share the test items with another student who might
take Form B later, because the two forms have different items.
Internal consistency: the consistency of test scores obtained from a single administration of a test
or survey (Cronbach Alpha, KR-20, split-half). For instance, suppose
respondents are asked to rate statements in an attitude survey about computer
anxiety. One statement is "I feel very negative about computers in general."
Another statement is "I enjoy using computers." People who strongly agree
with the first statement should strongly disagree with the second statement,
and vice versa. If several respondents rate both statements high or both low,
the responses are said to be inconsistent and patternless. The same
principle can be applied to a test: when no pattern is found in the students'
responses, probably the test is too difficult and students are just guessing the answers
randomly.
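To make the computation concrete, below is a minimal sketch of Cronbach Alpha in Python; the rating matrix and the function name are hypothetical, and reverse-keyed items (such as the negatively worded statement above) are assumed to have been recoded before the coefficient is computed:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach Alpha for a respondents-by-items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-point ratings from six respondents on three items
ratings = [[5, 4, 5],
           [4, 4, 4],
           [2, 1, 2],
           [3, 3, 2],
           [5, 5, 4],
           [1, 2, 1]]
print(round(cronbach_alpha(ratings), 3))
```

A coefficient near 1 indicates the consistent response pattern described above, while a coefficient near 0 signals patternless responses.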
Reliability is a necessary but not sufficient condition
for validity. For instance, if the needle of the scale is
five pounds away from zero, I always over-report my
weight by five pounds. Is the measurement consistent?
Yes, but it is consistently wrong! Is the measurement
valid? No! (But if it under-reports my weight by five
pounds, I will consider it a valid measurement)
Performance, portfolio, and responsive evaluations, where the tasks vary
substantially from student to student and where multiple tasks may be
evaluated simultaneously, are attacked for lacking reliability. One of the
difficulties is that there is more than one source of measurement error in
performance assessment. For example, the reliability of a writing-skill test score
is affected by the raters, the mode of discourse, and several other factors
(Parkes, 2000).
Replications as unification: Users may be confused by the diversity of
reliability indices. Nevertheless, different types of reliability measures share a
common thread: what constitutes a replication of a measurement procedure?
(Brennan, 2001). Take internal consistency as an example. This measure is used
because it is convenient to compute the reliability index based upon data
collected on one occasion. However, the ultimate inference should go beyond
a single testing occasion to others (Yu, 2005). In other words, any
procedure for estimating reliability should attempt to mirror a result based
upon full-length replications.
Conventional views of validity (Cronbach, 1971)
Face validity: Face validity simply means validity at face value. As a check
on face validity, test/survey items are sent to teachers to obtain suggestions for
modification. Because of its vagueness and subjectivity, psychometricians
abandoned this concept long ago. However, outside the measurement
arena, face validity has come back in another form. While discussing the
validity of a theory, Lacity and Jansen (1994) define validity as making
common sense, and being persuasive and seeming right to the reader. For
Polkinghorne (1988), validity of a theory refers to results that have the
appearance of truth or reality.
The internal structure of things may not match their appearance, and
professional knowledge often runs counter to common sense. The criteria of validity
in research should go beyond "face," "appearance," and "common sense."
Content validity: draw an inference from test scores to a large domain of
items similar to those on the test. Content validity is concerned
with sample-population representativeness, i.e. the knowledge
and skills covered by the test items should be representative of the larger
domain of knowledge and skills.
For example, computer literacy includes skills in operating systems, word
processing, spreadsheets, databases, graphics, the internet, and many others.
However, it is difficult, if not impossible, to administer a test covering all
aspects of computing. Therefore, only several tasks are sampled from the
population of computer skills.
Content validity is usually established by content experts. Take computer
literacy as an example again. A test of computer literacy should be written or
reviewed by computer science professors because it is assumed that computer
scientists know what is important in their discipline. At first glance,
this approach looks similar to the validation process of face validity, yet
there is a difference. In content validity, evidence is obtained by looking for
agreement in judgments by judges. In short, face validity can be established by
one person, but content validity should be checked by a panel.
However, this approach has some drawbacks. First, experts tend to take their
knowledge for granted and forget how little other people know; it is not
uncommon that some tests written by content experts are extremely difficult.
Second, very often content experts fail to identify the learning objectives of a
subject. Take the following question in a philosophy test as an example:
What is the time period of the philosopher Epicurus?
a. 341-270 BC
b. 331-232 BC
c. 280-207 BC
d. None of the above
This type of question tests the ability to memorize historical facts, not the ability to philosophize. The content expert may argue that historical facts are important for
a student to further understand philosophy. Let's change the subject to computer science and statistics. Look at the following two questions:
When was William Gates III, the founder and CEO of Microsoft, born?
a. 1949
b. 1953
c. 1957
d. None of the above
Which of the following statements is true about ANOVA?
a. It was invented by R. A. Fisher in 1914
b. It was invented by R. A. Fisher in 1920
c. It was invented by Karl Pearson in 1920
d. None of the above
Any computer scientist or statistician would be hard-pressed to accept that the above questions fulfill content validity. As a matter of fact, the memorization
approach is a common practice among instructors.
Further, sampling knowledge from a larger domain of knowledge involves
subjective values. For example, a test regarding art history may include many
questions on oil paintings, but fewer questions on watercolor paintings and
photography, because of the perceived importance of oil paintings in art history.
Content validity is sample-oriented rather than sign-oriented. A behavior is
viewed as a sample when it is a subgroup of the same kind of behaviors. On the
other hand, a behavior is considered a sign when it is an indicator or a proxy of a
construct (Goodenough, 1949). Construct validity and criterion validity, which
will be discussed later, are sign-oriented because both of them point to
behaviors different from those on the test.
Criterion validity: draw an inference from test scores to performance. A high score
on a valid test indicates that the tester has met the performance criteria.
Regression analysis can be applied to establish criterion validity: the test
score is used as the predictor variable and the performance measure as the
criterion variable, and the correlation coefficient between them is called the
validity coefficient.
For instance, scores on a driving test by simulation are the predictor variable
while scores on the road test are the criterion variable. It is hypothesized that if
the tester passes the simulation test, he/she should meet the criterion of being a
safe driver. In other words, if the simulation test scores can predict the road
test scores in a regression model, the simulation test is claimed to have a high
degree of criterion validity.
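As a rough illustration of this logic, the sketch below regresses hypothetical road-test scores on simulated-test scores; all numbers are invented for demonstration:

```python
import numpy as np

# Hypothetical paired scores for ten examinees: simulated driving
# test (predictor) and road test (criterion).
sim_scores  = np.array([62, 70, 75, 80, 84, 88, 90, 93, 96, 99])
road_scores = np.array([58, 66, 72, 79, 80, 90, 87, 94, 95, 98])

# Validity coefficient: correlation between predictor and criterion.
r = np.corrcoef(sim_scores, road_scores)[0, 1]

# Simple linear regression of the criterion on the predictor.
slope, intercept = np.polyfit(sim_scores, road_scores, 1)

print(f"validity coefficient: r = {r:.2f}")
print(f"predicted road score for a sim score of 85: {slope * 85 + intercept:.1f}")
```

A high validity coefficient would support the claim that the simulation test meets the criterion of predicting road-test performance.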
In short, criterion validity is about prediction rather than explanation.
Prediction is concerned with non-causal or mathematical dependence, whereas
explanation pertains to causal or logical dependence. For example, one can
predict the weather based on the height of mercury inside a barometer; thus,
the height of mercury could satisfy criterion validity as a predictor.
However, one cannot explain why the weather changes by the change of
mercury height. Because of this limitation of criterion validity, an evaluator has
to conduct construct validation.
Construct validity: draw an inference from test scores to a psychological construct.
Because it is concerned with abstract and theoretical constructs, this type of
validity is also known as theoretical construct validity.
According to Hunter and Schmidt (1990), construct validity is a
quantitative question rather than a qualitative distinction such as "valid"
or "invalid"; it is a matter of degree. Construct validity can be measured by the
correlation between the intended independent variable (construct) and
the proxy independent variable (indicator, sign) that is actually used.
For example, an evaluator wants to study the relationship between general
cognitive ability and job performance. However, the evaluator may not be able
to administer a cognitive test to every subject. In this case, he can use a proxy
variable such as amount of education as an indirect indicator of cognitive
ability. If he administers a cognitive test to a portion of the subjects and
finds a strong correlation between general cognitive ability and amount of
education, the latter can be used for the larger group because its construct
validity is established.
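A minimal sketch of such proxy validation, with invented numbers, might look like this:

```python
import numpy as np

# Hypothetical subsample: cognitive-ability scores and years of
# education for the subjects who could be tested directly.
cognitive = np.array([95, 102, 110, 118, 121, 128, 134, 140])
education = np.array([10,  12,  12,  14,  16,  16,  18,  20])

# Degree of construct validity of the proxy, expressed as a correlation.
r = np.corrcoef(cognitive, education)[0, 1]
print(f"construct validity of 'years of education' as a proxy: r = {r:.2f}")
```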
Other authors (e.g. Angoff, 1988; Cronbach & Quirk, 1976) argue that construct
validity cannot be expressed in a single coefficient; there is no mathematical
index of construct validity. Rather, the nature of construct validity is qualitative.
There are two types of indicators:
o Reflective indicator: an effect of the construct.
o Formative indicator: a cause of the construct.
When an indicator is expressed in terms of multiple items of an
instrument, factor analysis is used for construct validation, as sketched below.
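As an illustration of that last point, the following simulates responses to six items driven by two latent constructs and fits a two-factor model with scikit-learn; the data, item structure, and resulting loadings are all hypothetical:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Hypothetical responses: 200 examinees, six items. Items 1-3 are
# written to reflect one construct, items 4-6 another.
anxiety  = rng.normal(size=(200, 1))
efficacy = rng.normal(size=(200, 1))
items = np.hstack([
    anxiety  + 0.5 * rng.normal(size=(200, 3)),   # items 1-3
    efficacy + 0.5 * rng.normal(size=(200, 3)),   # items 4-6
])

# Two-factor model: items 1-3 loading on one factor and items 4-6 on
# the other would support the intended two-construct structure.
fa = FactorAnalysis(n_components=2, random_state=0).fit(items)
print(np.round(fa.components_.T, 2))   # items-by-factors loading matrix
```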
Test bias is a major threat to construct validity, and therefore test bias
analyses should be employed to examine the test items (Osterlind, 1983).
The presence of test bias definitely affects the measurement of the
psychological construct. However, the absence of test bias does not guarantee
that the test possesses construct validity. In other words, the absence of test bias
is a necessary, but not a sufficient, condition.
Construct validation as unification: The criterion and content models
tend to be empirically oriented, while the construct model is inclined to be
theoretical. Nevertheless, all models of validity require some form of
interpretation: What is the test measuring? Can it measure what it intends to
measure? In standard scientific inquiry, it is important to formulate an
interpretative (theoretical) framework clearly and then to subject it to empirical
challenges. In this sense, theoretical construct validation is considered to
function as a unified framework for validity (Kane, 2001).
A modified view of reliability (Moss, 1994)
There can be validity without reliability if reliability is defined as consistency
among independent measures.
Reliability is an aspect of construct validity. As assessment becomes less
standardized, distinctions between reliability and validity blur.
In many situations, such as searching for faculty candidates and conferring graduate
degrees, committee members are not trained to agree on a common set of criteria
and standards.
Inconsistency in students' performance across tasks does not invalidate the
assessment. Rather it becomes an empirical puzzle to be solved by searching
for a more comprehensive interpretation.
Initial disagreement (e.g., among students, teachers, and parents in responsive
evaluation) would not invalidate the assessment. Rather it would provide an
impetus for dialog.
Li (2003) argued that the preceding view is incorrect:
Reliability should be defined in terms of classical test
theory: the squared correlation between observed and true scores, or the
proportion of true variance in obtained test scores (see the equation after this list).
Reliability is a unitless measure, and thus it is already model-free or standard-free.
It has been a tradition that multiple factors are introduced into a test to improve
validity, even at the cost of decreasing internal-consistency reliability.
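In classical test theory notation, the first point above can be written as follows (a standard textbook formulation rather than Li's exact wording):

```latex
% Observed score = true score + error; reliability is the squared
% correlation between observed and true scores, i.e. the proportion
% of true-score variance in observed-score variance.
X = T + E, \qquad
\rho_{XX'} = \rho_{XT}^{2}
           = \frac{\sigma_{T}^{2}}{\sigma_{X}^{2}}
           = \frac{\sigma_{T}^{2}}{\sigma_{T}^{2} + \sigma_{E}^{2}}
```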
An extended view of Moss's reliability (Mislevy, 2004)
Inspired by Moss, Mislevy went further to ask whether there can be
reliability without reliability (indices).
Blending psychometrics and hermeneutics, in which a holistic and
integrative approach is used to understand the whole in light of its parts,
Mislevy urged psychometricians to think about what they intend to make
inferences about.
In many cases we don't present just one argument; rather, problem solving
involves arguments or chains of reasoning with massive evidence.
Off-the-shelf inferential machinery (e.g. computing reliability indices) may fail if
we quantify things or tasks that we don't know much about.
Probability-based reasoning applied to more complex assessments grounded in
cognitive psychology is needed.
A radical view of reliability (Thompson, 2003)
Reliability is not a property of the test; rather, it is a property of the data.
Thus, psychometrics is "datametrics."
Tests themselves are not reliable; it is important to explore reliability in
virtually all studies.
Reliability generalization, which can be used in a meta-analysis application
similar to validity generalization, should be implemented to assess variance in
measurement error across studies.
An updated perspective of reliability (Cronbach, 2004)
In a 2004 article, Lee Cronbach, the inventor of Cronbach Alpha as a way of
measuring reliability, reviewed the historical development of Cronbach Alpha. He
asserted, "I no longer regard the formula (of Cronbach Alpha) as the most appropriate
way to examine most data. Over the years, my associates and I developed the complex
generalizability (G) theory" (p. 403). Discussion of G theory is beyond the scope of
this document. Nevertheless, Cronbach did not object to the use of Cronbach Alpha, but he
recommended that researchers take the following into consideration when
employing this approach:
Standard error of measurement: It is the most important piece of information
to report regarding the instrument, not a coefficient (see the formula after this list).
Independence of sampling
Heterogeneity of content
How the measurement will be used: Decide whether future uses of the
instrument are likely to be exclusively for absolute decisions, for differential
decisions, or both.
Number of conditions for the test
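The standard error of measurement in the first item above is commonly expressed in terms of the standard deviation of observed scores and the reliability coefficient:

```latex
% SEM shrinks toward zero as the reliability coefficient approaches 1.
SEM = \sigma_{X}\sqrt{1 - \rho_{XX'}}
```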
A critical view of validity (Pedhazur & Schmelkin, 1991)
Content validity is not a type of validity at all because validity refers to
inferences made about scores, not to an assessment of the content of an
instrument.
The very definition of a construct implies a domain of content. There is no
sharp distinction between test content and test construct.
A modified view of validity (Messick, 1995)
The conventional view (content, criterion, construct) is fragmented and incomplete,
especially because it fails to take into account both evidence of the value implications
of score meaning as a basis for action and the social consequences of score use.
Validity is not a property of the test or assessment, but rather of the meaning of the
test scores.
Content: evidence of content relevance, representativeness, and technical
quality
Substantive: theoretical rationale
Structural: the fidelity of the scoring structure
Generalizability: generalization to the population and across populations
External: applications to multitrait-multimethod comparison
Consequential: bias, fairness, and justice; the social consequences of the
assessment for society
Critics argued that consequences should not be a component of validity because test
developers should not be held responsible for the consequences of misuse;
accountability should lie with the misuser. Messick (1998) counter-argued that the social
consequences of score interpretation include the value implications of the construct
label, which may or may not be commensurate with the construct's trait implications and
need to be addressed in appraising score meaning. While test developers should not be
held accountable for misuse of tests, they should still pay attention to the unanticipated
consequences of legitimate score interpretation.
A different view of reliability and validity (Salvucci, Walter, Conley, Fink, &
Saba, 1997)
Some scholars argue that the traditional view that "reliability is a necessary but not a
sufficient condition of validity" is incorrect. This school of thought conceptualizes
reliability as invariance and validity as unbiasedness. A sample statistic may have an
expected value over samples equal to the population parameter (unbiasedness), but
very high variance due to a small sample size. Conversely, a sample statistic can
have very low sampling variance but an expected value far from the
population parameter (high bias). In this view, a measure can be unreliable (high
variance) but still valid (unbiased).
[Figure: the population parameter (red line) equals the expected value of the sample
statistic (yellow line), but the sampling variance (green) is high: unreliable but valid.]
[Figure: the population parameter (red line) differs from the expected value of the sample
statistic (yellow line), and the sampling variance (green) is low: invalid but reliable.]
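A small simulation, echoing the bathroom-scale example earlier with invented numbers, can illustrate this separation of variance (reliability) from bias (validity):

```python
import numpy as np

rng = np.random.default_rng(42)
true_weight = 150.0   # the population parameter (true value)

# "Unreliable but valid": unbiased readings with high variance.
valid_scale = true_weight + rng.normal(0, 10, size=10_000)

# "Reliable but invalid": very consistent readings, but 5 lbs off.
biased_scale = (true_weight + 5) + rng.normal(0, 0.5, size=10_000)

for name, readings in [("unreliable but valid", valid_scale),
                       ("reliable but invalid", biased_scale)]:
    print(f"{name}: mean = {readings.mean():.1f}, sd = {readings.std():.1f}")
```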
Caution and advice
There is a common misconception that if someone adopts a validated instrument,
he/she does not need to check the reliability and validity with his/her own data.
Imagine this: when I buy a drug that has been approved by the FDA and my friend asks
me whether it heals me, I tell him, "I am taking a drug approved by the FDA and
therefore I don't need to know whether it works for me!" A responsible
evaluator should still check the instrument's reliability and validity with his/her own
subjects and make any modifications if necessary.
Low reliability is less detrimental in a pretest. In a pretest, where
subjects have not been exposed to the treatment and thus are unfamiliar with the subject
matter, low reliability caused by random guessing is expected. One easy way to
overcome this problem is to include "I don't know" among the multiple choices. In an
experimental setting where students' responses will not affect their final grades, the
experimenter should explicitly instruct students to choose "I don't know" instead of
guessing if they really do not know the answer. Low reliability is a signal of high
measurement error, which reflects a gap between what students actually know and
what scores they receive. The choice "I don't know" can help close this gap.
Last Updated: 2008
References
American Educational Research Association, American Psychological Association, &
National Council on Measurement in Education. (1985). Standards for educational
and psychological testing. Washington, DC: Authors.
Angoff, W. H. (1988). Validity: An evolving concept. In H. Wainer & H. I. Braun
(Eds.), Test validity. Hillsdale, NJ: Lawrence Erlbaum.
Brennan, R. (2001). An essay on the history and future of reliability from the
perspective of replications. Journal of Educational Measurement, 38, 295-317.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.). Educational
Measurement (2nd Ed.). Washington, D. C.: American Council on Education.
Cronbach, L. J. (2004). My current thoughts on Coefficient Alpha and successor
procedures. Educational and Psychological Measurement, 64, 391-418.
Cronbach, L. J. & Quirk, T. J. (1976). Test validity. In International Encyclopedia of
Education. New York: McGraw-Hill.
Goodenough, F. L. (1949). Mental testing: Its history, principles, and
applications. New York: Rinehart.
Hunter, J. E.; & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error
and bias in research findings. Newbury Park, CA: Sage Publications.
Kane, M. (2001). Current concerns in validity theory. Journal of Educational
Measurement, 38, 319-342.
Lacity, M.; & Jansen, M. A. (1994). Understanding qualitative data: A framework of
text analysis methods. Journal of Management Information Systems, 11, 137-160.
Li, H. (2003). The resolution of some paradoxes related to reliability and
validity. Journal of Educational and Behavioral Statistics, 28, 89-95.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences
from persons' responses and performances as scientific inquiry into score
meaning. American Psychologist, 50, 741-749.
Messick, S. (1998). Test validity: A matter of consequence. Social Indicators
Research, 45, 35-44.
Mislevy, R. (2004). Can there be reliability without reliability? Journal of Educational
and Behavioral Statistics, 29, 241-244.
Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher,
23, 5-12.
Osterlind, S. J. (1983). Test item bias. Newbury Park: Sage Publications.
Parkes, J. (2000). The relationship between the reliability and cost of performance
assessments. Education Policy Analysis Archives, 8. [On-line] Available
URL: http://epaa.asu.edu/epaa/v8n16/
Pedhazur, E. J.; & Schmelkin, L. P. (1991). Measurement, design, and analysis: An
integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates.
Polkinghorne, D. E. (1988). Narrative knowing and the human sciences. Albany: State
University of New York Press.
Salvucci, S.; Walter, E.; Conley, V.; Fink, S.; & Saba, M. (1997). Measurement error
studies at the National Center for Education Statistics. Washington, DC: U.S.
Department of Education.
Thompson, B. (Ed.) (2003). Score reliability: Contemporary thinking on reliability
issues. Thousand Oaks: Sage.
Yu, C. H. (2005). Test-retest reliability. In K. Kempf-Leonard (Ed.). Encyclopedia of
Social Measurement, Vol. 3 (pp. 777-784). San Diego, CA: Academic Press.
Questions for discussion
Pick one of the following cases and determine whether the test or the assessment is
valid. Apply the concepts of reliability and validity to the situation. These cases may
be remote from your cultural context; you may use your own example.
1. In ancient China, candidates for government office had to take an
examination on literature and moral philosophy rather than on public
administration.
2. Before July 1, 1997, when Hong Kong was a British colony, Hong Kong
doctors, including specialists, who graduated from non-Commonwealth
medical schools had to take a general medical examination covering all general
areas in order to be certified.