Psychometric Properties of Psychological Assessment Measures by Dr. Celeste Fabrie

Psychometric Properties of Psychological Assessment Measures

by

Dr. Celeste Fabrie

Contents

1 Introduction

2 Definition of key concepts

2.1 Different types of norms
2.2 Criterion referenced tests
2.3 Psychological measures
2.4 Reliability and validity of psychological measures

3 Various types of norms

3.1 Mental age scales and grade equivalents
3.2 Percentiles
3.3 Stanines and Sten scales
3.4 Deviation IQ

4 Criterion referenced tests

4.1 Expectancy tables

5 Constructing a psychological measure

5.1 The planning phase
5.2 Format of the items
5.3 Item analysis phase:

- Item difficulty value
- Item discrimination value
- Item-total correlation

6 Reliability of a psychological measure

6.1 Correlation coefficient
6.2 Statistical significance
6.3 Reliability coefficient
6.4 Observed score
6.5 True and error scores
6.6 True variance and error variance in a test score

7 Stability of a test – the advantages and limitations

7.1 Test-retest reliability
7.2 Alternate-form reliability

8 Internal consistency of a test – advantages and limitations

8.1 Split-half reliability
8.2 Kuder-Richardson 20 and Cronbach’s Alpha

9 Validity of a psychological measure

9.1 Content validity
9.2 Criterion-related validity
9.3 Predictive validity
9.4 Concurrent validity

10 Major methods of establishing construct validity

10.1 Developmental changes
10.2 Correlations with other tests
10.3 Factor analysis
10.4 Internal consistency
10.5 Convergent and discriminant validation
10.6 Experimental interventions

11 Summary

12 Conclusion

References

1. Introduction

To understand what is meant by the psychometric properties of psychological assessment measures, it is necessary to consider the two terms separately. In other words, what do we mean when we speak about psychometric properties, and what is implied by the concept of psychological measures?

Psychometrics is the study of mental traits and behavioural characteristics expressed as amounts or scores. Psychometric tests can take the form of intelligence scales, for example, or ratings of attitudes compared against a specific population standard or norm. In short, psychometrics involves the assessment of human data according to known, specific standards applied by the experienced researcher or scientist.

Without psychometric theory, it would be difficult to develop a reliable and valid psychological measure. It would be impossible, for instance, to study human intelligence without comparing a person or group to a normative sample.

Psychological assessment measures, on the other hand, serve five different purposes: classification, diagnosis and treatment planning, self-knowledge, program evaluation, and research (Gregory, 2000, p. 41).

The aim of this essay is to demonstrate the principles of psychometric theory and to show how these principles are integrated when a psychological measure is developed. The following major concepts, including their related values and tests, will be discussed: the different types of norms, criterion-referenced tests, psychological measures, and the reliability and validity of these measures, as well as the advantages and limitations of each test.

2 Definition of key concepts

Our discussion begins with an overview of the following concepts:

2.1 Different types of norms

Norms can be defined as a table of values that shows how an individual’s raw score compares with the performance of others in a particular group. Norms help us to make important standard comparisons, whereby we can judge how much a person’s score deviates from the average of the population or of a representative sample (Rosnow & Rosenthal, 1999, p. 222).

2.2 Criterion referenced tests

These tests are essentially the opposite of norm-referenced tests. Criterion-referenced tests are concerned with the personal achievements of the tested person rather than with comparisons against the performance or abilities of external groups. For example, these tests are ideal for assessing individual educational needs.

2.3 Psychological measures

To recap: psychological measures, or assessment measures, involve quantification techniques. Psychological measurement is a dynamic process, forever changing but retaining its original structure. It can be a sophisticated process with the following characteristics, as mentioned by Gregory (2000, p. 30):

- scores or categories

- behaviour samples

- norms or standards

- standardized procedures

- prediction of nontest behaviour

2.4 Reliability and validity of psychological measures

Test reliability refers to how consistently a test measures whatever it measures. Unfortunately, problems such as “random error” can greatly influence the reliability of an instrument (more about that later in our discussion, under point 6).

Validity often pertains to the contents of a test or measuring instrument. For instance, if a particular personality trait is being measured by a certain test, then the test is expected to measure what it is supposed to measure; otherwise it is invalid.

3 Various types of norms

The term “normed score” on its own means very little to test takers if they do not know what kind of norms they are being tested against. It is up to the tester to explain an individual’s raw score against the background of a representative population group. The following norms will be discussed briefly to demonstrate the importance of understanding a person’s test score on a norm-referenced test.

3.1 Mental age scales and Grade Equivalents

These two types of norms are both referred to as “developmental norms”, but with a slight difference. Mental age scales encourage comparisons between test takers of a similar age. In other words, the performance level of a child who is 10 years of age will be compared to the performance level of other children in the same age group. These age norms are a convenient way of testing children’s developmental characteristics, which change dynamically in comparison to the more stable traits of adults.

Grade equivalents are quite similar to age norms. However, instead of checking the aptitude or ability of a similar age group, grade norms describe the standard of test performance for each grade represented in the normative sample. This means that school performance is measured against a normative sample from the same grade in the school. This is a more convenient method of comparing a child’s academic performance with that of pupils in an equivalent grade (Gregory, 2000, p. 71).

3.2 Percentiles

A percentile is a relative measure that ranges between 0 and 100. It expresses the percentage of scores in the normative sample that fall below a particular raw score. For example, a score at the 50th percentile (also referred to as the “median”) is a typical score: 50% of the sample scores fall below it and 50% fall above it (Rosnow & Rosenthal, 1999, p. 233).
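
The idea of a percentile can be sketched in a few lines of Python. This is only an illustration: the function name percentile_rank and the ten-person norm group are hypothetical, and the sketch assumes the simple "percentage of scores below" convention (other conventions also count half of the tied scores).

```python
def percentile_rank(score, norm_scores):
    """Percentage of scores in the norm group that fall below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)


# Hypothetical raw scores of a ten-person norm group (illustration only).
norm_group = [12, 15, 15, 18, 20, 21, 23, 25, 27, 30]

# Five of the ten scores lie below 21, so 21 sits at the 50th percentile.
print(percentile_rank(21, norm_group))
```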

3.3 Stanines and Sten Scales (standard scores)

Stanines transform test takers’ raw scores onto a 9-point scale (1 to 9), and the 10-unit sten scale, which Canfield (1951) recommended, is a slight variation on the stanine scale. Both scales were useful devices for working with norms in the pre-computer age. Stanines always have a mean of 5 and a standard deviation of approximately 2. Scores are ranked from the lowest to the highest, with the bottom 4% of scores receiving a stanine of 1, and so forth (Gregory, 2000, p. 68).
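
The ranking described above can be sketched as a lookup from percentile rank to stanine. The sketch assumes the conventional stanine bands of 4, 7, 12, 17, 20, 17, 12, 7 and 4 percent (only the bottom 4% is stated in the text), and the names STANINE_CUTOFFS and stanine_from_percentile are hypothetical.

```python
from bisect import bisect_right

# Cumulative percentage reached at the top of stanines 1 to 8, assuming the
# conventional bands 4, 7, 12, 17, 20, 17, 12, 7, 4; the rest is stanine 9.
STANINE_CUTOFFS = [4, 11, 23, 40, 60, 77, 89, 96]


def stanine_from_percentile(percentile_rank):
    """Map a percentile rank (percentage of scores below) to a stanine 1..9.

    A rank exactly on a cutoff is placed in the higher stanine.
    """
    return bisect_right(STANINE_CUTOFFS, percentile_rank) + 1


for rank in (2, 50, 97):
    print(rank, stanine_from_percentile(rank))
```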

3.4 Deviation IQ

This norm is often used to express scores on intelligence tests. However, this kind of scale is often misinterpreted by inexperienced persons wanting an easy “labelling” system to describe a person’s intelligence as above or below a particular marker of 100. It would be foolish to treat the result of one single IQ test as the final word on someone’s intelligence or character. For example, using obsolete tests can either inflate or deflate a participant’s IQ score.
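
As a rough sketch, a deviation IQ can be computed by expressing a raw score as a distance from the norm group mean in standard deviation units and rescaling it around 100. The standard deviation of 15 is an assumption (it is common, but tests differ), and the function name deviation_iq and the norm scores are hypothetical.

```python
from statistics import mean, pstdev


def deviation_iq(raw, norm_scores, iq_mean=100.0, iq_sd=15.0):
    """Rescale a raw score to a deviation IQ (iq_sd=15 is an assumed convention)."""
    z = (raw - mean(norm_scores)) / pstdev(norm_scores)
    return iq_mean + iq_sd * z


# Hypothetical raw scores of a small norm group (illustration only).
norms = [40, 45, 50, 55, 60]

print(deviation_iq(50, norms))            # a score at the norm mean
print(round(deviation_iq(60, norms), 1))  # a score above the norm mean
```

Because the result depends entirely on the norm group, the same raw score yields a different IQ when it is compared against outdated norms, which is the inflation or deflation mentioned above.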

4. Criterion referenced tests

A criterion-referenced test is the opposite of a norm-referenced test. Where the latter measures an individual’s performance against a representative group, the former measures a person’s mastery or nonmastery of skills in a particular content domain, such as a task involving dexterity or memory performance (Gregory, 2000, p. 74). An expectancy table is a good example of such tests.

4.1 Expectancy tables

This kind of table gives a practical overview of the relationship between candidates’ predictor results and a specific criterion. For example, expectancy tables show the relationship between an individual’s test scores and what he or she is able to accomplish later in life, whether that is success in a certain career path or good grades in an upcoming college entrance exam. However, such tables also have limitations, mainly because they are based on the results of large representative groups and so reflect the social or school standards of the time. Such tables, like most norm-based tests, therefore require constant updates or checks in order to keep delivering what they are supposed to, namely reliable and valid results (Gregory, 2000, p. 72).
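
An expectancy table of this kind can be sketched as follows. The data, the band width of 10 points and the function name expectancy_table are all hypothetical; the sketch simply reports, for each band of predictor scores, the proportion of people who later met the criterion.

```python
from collections import defaultdict


def expectancy_table(predictor_scores, succeeded, band_width=10):
    """Map each predictor score band to the proportion who met the criterion."""
    counts = defaultdict(lambda: [0, 0])  # band start -> [successes, total]
    for score, ok in zip(predictor_scores, succeeded):
        band = (score // band_width) * band_width
        counts[band][0] += int(ok)
        counts[band][1] += 1
    return {band: wins / total for band, (wins, total) in sorted(counts.items())}


# Hypothetical entrance test scores and later pass/fail outcomes.
scores = [42, 47, 51, 55, 58, 63, 66, 68, 71, 75]
passed = [False, False, False, True, False, True, True, False, True, True]

table = expectancy_table(scores, passed)
for band, proportion in table.items():
    print(f"{band}-{band + 9}: {proportion:.2f}")
```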

5. Constructing a psychological measure

5.1 The planning phase

This is probably one of the most important steps in constructing a psychological measure. The planning phase involves very careful decision making. An engineer, for example, will plan every step along the way before delivering a proposal for a new railway bridge. In the same way, when planning a psychological measure, the tester will have to consider, for instance, the choice, format and length of the items he or she will include in the measure.

5.2 Format of the items

Test construction is not just a simple matter of throwing any kind of item into a main batch. It is crucial to decide what type of item format is required. There is no value in mixing questionnaire items on a personality trait with items that test physical speed, such as running 500 metres in a certain time. Likewise, it would make no sense to test a preschooler’s performance on an arithmetic test meant for an 8-year-old school child. The length of the measure should also be suitable for the particular group. It would be a waste of time to give someone with a major depressive disorder (and on strong medication) a test that requires 3 hours of heavy concentration.

Test items can come in different formats and styles, such as multiple-choice questions, true-false items, forced-choice items, closed or open-response items and so forth.

It must be remembered, however, that item selection is never a perfect system. Assessment tests always involve some item measurement error. That is why careful consideration is applied to the planning and implementation of item selection, from the beginning to the end stages, to keep measurement error as small as possible.

5.3 Item analysis phase

The item analysis phase involves 3 different types of item statistics:

- item difficulty value

- discrimination value

- item-total correlation

These item statistics help the researcher to choose the most suitable items for the final measure. It is always wise to adapt tests to the group being tested, which means taking into account many different demographic features of the test person, such as age, sex, socio-economic background, educational status and, most importantly, cultural differences.

The item difficulty value is the proportion of a large group of test takers who answer a single test question correctly. If only a very small percentage get the item wrong, then the item is too easy and should be adjusted. The reverse is also true: if almost everyone gets it wrong, the item is too difficult.

The discrimination value shows how well an item discriminates between those who obtain high and low scores on the complete test.

The item-total correlation is a point-biserial correlation (a special case of the Pearson r), which expresses the relationship between 2 variables: the score on a single item and the total test score. The higher the correlation between the item and the total score, the better the item is considered to be with regard to internal consistency. In other words, a good measure should have homogeneous items with a high level of internal consistency.
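
The three item statistics can be sketched on a small matrix of right (1) and wrong (0) answers. Everything here is illustrative: the response data and function names are hypothetical, the discrimination index is assumed to be the difference between the upper and lower halves of the group (other cut-offs, such as the top and bottom 27%, are also used), and the item-total correlation is left uncorrected, so the item itself is included in the total.

```python
from statistics import mean, pstdev


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))


def item_difficulty(responses, item):
    """Proportion of test takers who answered the item correctly."""
    return mean([row[item] for row in responses])


def discrimination_index(responses, item, fraction=0.5):
    """Proportion correct in the high-scoring group minus the low-scoring group."""
    ranked = sorted(responses, key=sum)  # order test takers by total score
    k = max(1, int(len(ranked) * fraction))
    low, high = ranked[:k], ranked[-k:]
    return mean([r[item] for r in high]) - mean([r[item] for r in low])


def item_total_correlation(responses, item):
    """Point-biserial correlation between one item and the total score."""
    return pearson_r([r[item] for r in responses], [sum(r) for r in responses])


# Hypothetical answers: one row per test taker, one column per item.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

for i in range(4):
    print(i, round(item_difficulty(responses, i), 2),
          round(discrimination_index(responses, i), 2),
          round(item_total_correlation(responses, i), 2))
```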

6. Reliability of a psychological measure

Reliability, according to Gregory (2000), “expresses the relative influence of true and error scores on obtained test scores.” To understand what reliability means, imagine a scale weighing a kilo of grapes. The greengrocer weighs the grapes twice in a row, and the second reading is slightly different from the first. In other words, reliability is not an absolute, all-or-nothing property. There will always be a slight inconsistency between the first test and the second test, and such fluctuations between tests are a matter of degree. Repeating a measurement helps the tester to confirm the consistency of scores, but this will not mean much without validity, which will be discussed later in this essay.

6.1 Correlation coefficient

A correlation coefficient r takes values ranging from –1.00 to +1.00. A value of +1.00 indicates a perfect positive linear relationship between 2 sets of test results, and a value of –1.00 a perfect negative one. A zero correlation occurs when 2 variables, such as height and reaction time, have no linear relationship to one another.

To test the reliability of psychological test scores, the same test can be taken twice, namely with a test-retest method. The correlation between the two sets of obtained scores then indicates how much of the variance in obtained scores reflects variance in true scores.
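
A minimal sketch of the Pearson r applied to a test-retest situation is given below. The two score lists are hypothetical and the function name pearson_r is chosen only for this illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores of five people on the first and second administration.
first = [10, 12, 15, 18, 20]
second = [11, 12, 16, 17, 21]

print(round(pearson_r(first, second), 3))
```

A value close to +1.00 here would suggest that people kept roughly the same rank order on both occasions.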

6.2 Statistical significance

Statistical significance goes beyond simply computing a correlation coefficient between 2 variables. The psychometrician is not just interested in a small sample of test persons, but would like to generalize the results to a larger population. The larger the sample, the smaller the correlation that is needed to reach statistical significance. It is therefore better to reduce sampling error by increasing the size of our sample. If a correlation is significant at the .01 level, we know that a correlation this large would arise by chance alone, when there is no real relationship in the population, at most 1 time out of 100, which is a rather good standard.
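
The effect of sample size can be sketched with the usual t statistic for testing whether a correlation differs from zero, t = r * sqrt(n - 2) / sqrt(1 - r^2). This formula is not given in the sources cited here and is brought in as an assumption for illustration; the function name t_for_correlation is hypothetical. The resulting t is compared with a critical value from a t table with n - 2 degrees of freedom.

```python
from math import sqrt


def t_for_correlation(r, n):
    """t statistic for testing a Pearson r against zero with n paired scores."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


# The same modest correlation becomes easier to distinguish from chance
# as the sample grows.
for n in (20, 200):
    print(n, round(t_for_correlation(0.30, n), 2))
```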

6.3 Reliability coefficient

The reliability coefficient is the ratio of true score variance (factors which are consistent) to the total variance of test results. In plain terms, the total variance is the true score variance (the stable attribute which we are testing) plus the error score variance (errors of measurement), and the reliability coefficient is the share of that total which is true score variance.
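
The ratio can be sketched with a small made-up example. In practice true scores are never known, so the true and error scores below are purely hypothetical; they are chosen so that the errors are uncorrelated with the true scores, as classical theory assumes, and each observed score is taken to be the true score plus the error.

```python
from statistics import pvariance


def reliability_coefficient(true_variance, error_variance):
    """Share of the total variance that is true score variance."""
    return true_variance / (true_variance + error_variance)


# Hypothetical true scores and measurement errors for four people.
true_scores = [40, 50, 60, 70]
errors = [2, -2, -2, 2]
observed = [t + e for t, e in zip(true_scores, errors)]

var_true = pvariance(true_scores)
var_error = pvariance(errors)

# With uncorrelated errors, observed variance = true variance + error variance.
print(pvariance(observed), var_true + var_error)
print(round(reliability_coefficient(var_true, var_error), 3))
```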

6.4 Observed score

The observed score, or obtained score, can be drastically altered by random events or measurement errors. It is up to the researcher to reduce as many of these nuisances as possible in order to have a reliable measure. Classical theory (Gregory, 2000, pp. 77-79) stipulates that a negative measurement error makes an obtained score lower than the true score, whereas a positive measurement error makes the obtained score higher than the true score. Either way, a student taking a specific knowledge test may come out better or worse than he or she deserves because of unbalanced item selection or some other measurement error.

6.5 True and error scores

According to classical measurement theory, true and error scores are uncorrelated. True scores are hypothetical; they are never really known. It is the error scores, however, that give test developers headaches. For example, a researcher sets out to test a trait such as nervousness and keeps getting a measurement of confidence. Something is obviously inconsistent in this test measure. It could be that the researcher has chosen incorrect test items based on obsolete tests, or that the persons being tested are not suitable test candidates. The fact is that errors of measurement give false observed scores, and if the same test were repeated, the results would be inconsistent. Test construction should therefore be carefully planned from the beginning in order to prevent such measurement errors from creeping into the results.

6.6 True variance and error variance in a test score

Briefly, true variance reflects homogeneous, internally consistent items, whereas error variance reflects inconsistency. Error variance results from poor content sampling, which affects alternate-form and split-half reliability, as well as from heterogeneity of the traits under observation. A high interitem consistency, on the other hand, indicates homogeneous items with little error variance. For example, if 2 half-tests give 2 different results, we speak of error variance: the two half-tests are inconsistent with one another.

7 Stability of a test – the advantages and limitations

7.1 Test-retest reliability

In this kind of measurement, the same test is administered twice to the same group. The sample group is heterogeneous and representative of the general population. The idea is to correlate the two sets of scores to obtain a measure of reliability. The advantage of such a test is that, if the two sets of scores correlate highly, the second score can be predicted from the results of the first.

There are, however, limitations to such tests. Sources of error variance such as experience, maturation, lengthy time spans between tests, illness and so forth could affect retest reliability (PSY498-8, p. 6).

7.2 Alternate-form reliability

Alternate forms of the same test are administered to the same test persons. This method measures the correlation between the scores on the two forms, which makes it quite similar to test-retest reliability. However, there is a difference between the two. The alternate-form method introduces item-sampling differences (error variance), which can limit reliability. For example, some students may cope very well with the items on the first form but do quite badly on the second because its items are not identical to those on the first. Another limitation is the high cost of producing alternate forms of a test, and the difficulty of constructing truly parallel forms (Gregory, 2000, p. 83).

8. Internal consistency of a test – advantages and limitations

Apart from alternate-form reliability and test-retest reliability, there are other methods of testing items for consistency, for example split-half reliability, the Kuder-Richardson 20 formula and Cronbach’s Alpha.

8.1 Split-half reliability

As the name implies, this kind of test correlates 2 scores obtained from a single test. This is achieved by “splitting” the test into two equivalent halves. It sounds complicated, but it is actually quite an effective measure. If the scores on both halves correlate strongly, then scores on two complete, equivalent forms of the test should in principle also correlate strongly (Gregory, 2000, p. 84). Internal consistency is therefore estimated from only a single administration.

This method has advantages, such as showing how lengthening the test can produce more reliability, and allowing a large behaviour domain to be studied in one session. But there are also limitations as to how one can “split” the items on a single test. One can try dividing the items into even-numbered and odd-numbered items, or separating easy and difficult items. However, this becomes a problem when the test developer has to split drawings or comprehension texts.
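
An odd-even split can be sketched as follows. The response data and function names are hypothetical. The final step uses the Spearman-Brown formula, 2r / (1 + r), to estimate the reliability of the full-length test from the correlation between the halves; that correction is a standard companion to the split-half method but is an addition here, not something spelled out in the text above.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half_reliability(responses):
    """Return (half-test correlation, Spearman-Brown full-length estimate)."""
    odd = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return r_half, 2 * r_half / (1 + r_half)


# Hypothetical right (1) / wrong (0) answers: five people, six items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

r_half, r_full = split_half_reliability(responses)
print(round(r_half, 3), round(r_full, 3))
```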

8.2 Kuder-Richardson 20

We use the Kuder-Richardson 20, or KR20, formula (1937) to find the internal consistency of a single administration of one test, as discussed under the split-half procedure. This formula applies to test items that are scored as 0 for wrong and 1 for right. When items go beyond such right-or-wrong scoring, we use Coefficient Alpha (Cronbach, 1951) instead. This formula is suitable, for example, for attitude scales where test persons must rate their answers as strongly agree, disagree and so forth (Gregory, 2000, p. 86).
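
Both coefficients can be sketched in a few lines. The data sets and function names are hypothetical, and population variances are assumed throughout (some textbooks use sample variances, which changes the values slightly). On items scored 0 or 1 the two formulas give the same result, which is why Coefficient Alpha can be seen as the more general of the two.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Coefficient Alpha for a matrix with one row per person, one column per item."""
    k = len(responses[0])
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def kr20(responses):
    """Kuder-Richardson 20 for items scored 0 (wrong) or 1 (right)."""
    k, n = len(responses[0]), len(responses)
    pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n  # proportion answering correctly
        pq += p * (1 - p)
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - pq / total_var)


# Hypothetical right/wrong answers: five people, six items.
binary = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Hypothetical attitude ratings on a 1-5 scale: four people, three items.
likert = [
    [4, 5, 4],
    [3, 3, 4],
    [2, 2, 1],
    [5, 4, 5],
]

print(round(kr20(binary), 3), round(cronbach_alpha(binary), 3))
print(round(cronbach_alpha(likert), 3))
```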

9 Validity of a psychological measure

Validity can be described as the degree to which a measure does what it is supposed to do. In other words, the psychological measure should show a well-grounded correspondence between the trait being tested and the operational definition of the construct. Furthermore, the measuring instrument must test what it was designed to test, and only that. For example, it is no use designing an instrument as an intelligence scale and then using the same measure to test “running speed” (Blanche & Durrheim, 2002, p. 83). The following validity procedures will be discussed:

- content validity

- criterion-related validity

- predictive validity

- concurrent validity

9.1 Content validity

Content validity is a suitable consideration when testing for traits such as knowledge, as in an examination paper (Blanche et al, 2002, p. 85). The items on such a test are a sample taken from a much larger domain, which could be several text books covering one field or topic. It would be impossible to test an examinee on the entire contents of a particular subject such as engineering, as time is normally limited in such tests.

Content validation sometimes runs into difficulties when abstract traits, such as personality and aptitudes, have to be tested. It is difficult to give an accurate test description of something like racism or morals, as these traits do not fit snugly between the pages of a subject book (Blanche et al, 2002, p. 85).

Face validity is another matter to consider. For instance, how does the test appear to others? Does it look too complicated, or does it have an unprofessional appearance? Face validity needs to be taken seriously if the measure is going to be accepted by persons in authority, particularly from a legal and educational point of view (PSY498-8/102).

9.2 Criterion-related validity

Criterion-related validity is normally established by correlating test scores with a criterion, such as other similar tests or research. For example, a researcher who discovers a new form of “job mobbing” in corporations and industry will compare his or her new findings with previous studies in this field. There are 2 types of criterion-related validity, namely predictive validity and concurrent validity.

9.3 Predictive validity

As the name implies, predictive validity concerns the prediction of future events from existing scores, budgets, educational performance and so forth. For example, future inflation rates can be predicted from present statistics on a country’s economic performance in relation to the rest of the world. Concurrent validation, on the other hand, replaces predictive validation when it comes to making a present diagnosis of a pupil’s immediate performance rather than predicting future events (PSY498-8/102).

9.4 Concurrent validity

This method is more suitable when testing abstract traits in the present, as with someone suffering from an immediate problem of depression. The clinician can judge the patient’s observable behaviour and cognitive performance, and make a suitable diagnosis. It would be difficult, however, to use predictive validity in such a case. One cannot “predict” whether someone who is suffering from a dark mood one day is going to suffer from depression in the future. A positive feature of concurrent validity is that costs are kept to a minimum and results are normally immediate, compared to predictive validity.

10 Major methods of establishing construct validity

Examples of constructs or traits are technical and mechanical knowledge, running speed, frustration, reading and spelling abilities and so forth. How do we measure such constructs? Firstly, the researcher gathers as much data as possible on a particular trait, through observations, interrelationships with other behavioural or cognitive measures and so forth. Establishing construct validity therefore involves both theoretical and empirical methods. Several methods are discussed below.

10.1 Developmental changes

It is common knowledge that developmental changes take place between childhood and adulthood, which means that both behaviour and cognitive abilities also change, perhaps more rapidly in childhood than in later years, when they tend to “stabilize”.

Age differentiation is also shaped by the specific culture, as different cultures have different child-rearing patterns or beliefs. The Piagetian ordinal scales, for example, with their sequential patterning of development or schemas, indicate the gradual development of conceptual skills from early childhood to early adulthood. This is an example of construct validation of ordinal scales over several developmental levels (PSY498-8/102).

10.2 Correlations with other tests

When a new test is correlated with other tests, the correlation should not be so high that the new test becomes redundant. A good result lies between low and “moderately” high, but no higher. At the same time, the correlation must not be too low: it would be ridiculous to compare a new test of a factor of intelligence with a similar test, and then find out later that the new test is actually measuring a personality disorder!

10.3 Factor analysis

This is a particular family of statistical techniques which many researchers adopt to explain relationships between variables or constructs that correlate highly with one another. The method is used to obtain a strictly parsimonious set of data. In other words, factor analysis allows a multitude of major mental abilities, such as comprehension, memory, number recognition and so forth, to be tested and summarized, in contrast to more conservative tests such as the Stanford-Binet tests (Gregory, 2000, p. 23). Factor analysis has one primary goal, and that is to produce a neat, comprehensible set of statistics by reducing too many “untidy” test variables to a more efficient, economical set of common traits.

10.4 Internal consistency

Briefly, internal consistency requires significant item-test correlations, with the items pointing in the same direction as the test as a whole. Another way of testing for internal consistency is to correlate subtest scores with the total score. Take, for instance, certain intelligence test factors such as reading ability, arithmetic, spelling and so forth. All the subscores are added together to give a total test score. Of course, the items must be homogeneous in order to achieve good internal consistency.

10.5 Convergent and Discriminant validation

Convergent validation of a test means that the test correlates highly with other tests or traits that share a common factor. Such tests are normally carried out on a heterogeneous sample to test for convergence. By the same token, a test should not correlate highly with variables that measure something different. For example, a test of vocabulary ability should not correlate highly with a test of arithmetic reasoning.

Discriminant validation is important for personality tests. Discriminant validation is demonstrated when there is little or no correlation between two unrelated variables, such as popularity and intelligence. The correlation would be low or negative, if there is any correlation at all.

The multitrait-multimethod matrix (Campbell & Fiske, 1959) combines the assessment of two or more variables with two or more methods (Gregory, 2000, pp. 110-111). This matrix is a good source of data on discriminant and convergent validity, as well as reliability.

10.6 Experimental interventions

Any experimental intervention a researcher carries out involves “control” of the test situation. This is done in order to “isolate” the treatment factors of interest and remove any unwanted interference that could invalidate results. There are numerous research designs to choose from, such as a standard one-group pretest-posttest design for testing construct validity in a scholastic test.

Then there are other designs, such as the Equivalent Time Series, that are spread out over lengthy time periods (Neumann, 1997, pp. 183-197). Whatever design is chosen, there will always be a certain amount of experimental interference. The researcher seeks solutions to problems, or tries to find a better experimental method to test different hypotheses for present and future generations.

11. Summary

The goal of this essay was to explain what was meant by psychometric

properties of psychological assessment measures. The principles necessary

to psychometric theory were discussed, namely, the different types of norms,

criterion referenced tests, psychological measures and the reliability and

validity of these measures, as well as the advantages and limitations of each

test.

12. Conclusion

Psychometric testing of psychological measurements is an extensive procedure. There are a number of processes involved in assessing human data, and these cannot be carried out in a vacuum. People are human beings who do not remain stable over time, which is why researchers are always testing and retesting their products against the dynamics of man. It is therefore safe to conclude that no test is a complete test. As this essay has demonstrated, there are always advantages and limitations to assessment measures. What works for one test may not necessarily work for another. Sometimes the question is not the degree to which a test measures what it is supposed to measure, but how the test sample relates to the real world of people.

References

Durrheim, K. (2002). Quantitative measurement. In M. T. Blanche (Ed.), Research in Practice (pp. 72-95). Cape Town: UCT Press.

Gregory, R.J. (2000). Psychological Testing (3rd ed.). Illinois: Allyn and Bacon, Inc.

Neumann, W.L. (1997). Social Research Methods (3rd ed.). Needham Heights: Allyn & Bacon.

Rosnow, R.L. & Rosenthal, R. (1999). Beginning Behavioral Research (3rd ed.). New Jersey: Prentice Hall.

Tutorial Letter 102 for PSY498-8. (2003). Psychological Assessment. Pretoria: Unisa Press (sections from pp. 3-24).

Date post:	26-Mar-2015