Comparing CTT and IRT using the Aptitude Test for High School
Adelaida de Perio De La Salle University - Manila
Background Admission to tertiary education requires applicants to
pass the screening process set by each school. Aptitude tests are among the assessment tools schools use to select
potential applicants. They measure one’s fundamental intellectual abilities.
Abstract reasoning is a non-verbal test measuring
one’s ability to identify patterns in a series. Numerical ability, on the other hand, measures one’s
ability to solve mathematical problems. Verbal reasoning measures one’s ability to understand
analogies and covers areas of the English language. Spatial ability measures one’s ability to manipulate
shapes.
Mechanical reasoning measures one’s knowledge of
physical and mechanical principles. Lastly, spelling measures one’s ability to detect errors in grammar, punctuation, and capitalization (Magno & Quano, 2010).
Review of Related Literature Because many studies link aptitude with academic
performance, schools use aptitude tests to predict students’ future performance.
Long-standing key predictors of academic success are students’ abilities, as measured by the SAT or ACT, and high school GPA (Covington, 1992; Lavin, 1965; Willingham, Lewis, Morgan, & Ramist, 1990).
Garavila, Gredler & Margaret (1997) examined the extent
to which college students’ learning strategies, prior achievement, and aptitude predicted course achievement. Analyses showed that each predictor was significantly correlated with achievement. Together, these variables accounted for 45% of the variance in course achievement.
Garcia (1997) found similar results in his study
examining the relations of motivation, attitude, and aptitude to second language achievement. The findings revealed that aptitude (β = .43), motivation (β = .41), and ethnic membership (β = .14) together explained more than 50% of the variance in language achievement.
In secondary education, little has been done to screen
students before they enter high school. This is why some students lack the necessary skills and come unprepared to meet the demands and expectations of high school education. An aptitude test will therefore serve not only as a screening tool but also as a source of information for teachers on the areas in which students have to improve.
Objective The present study therefore aims to compare Classical Test Theory (CTT) and
Item Response Theory (IRT) results in evaluating the Aptitude Test developed for High School in terms of item difficulty and item discrimination.
Review of CTT and IRT The CTT model, also called “True Score Theory,”
espouses the idea that differences in examinees’ responses are due only to variation in the examinees’ ability.
In CTT, item difficulty is indicated by the frequency of correct responses; item discrimination is indicated by the item-total correlation; and the frequency of responses to each option is used to examine distracters (Impara & Plake, 1997).
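As an illustration of the CTT indices described above, the sketch below computes item difficulty as the proportion of correct responses and item discrimination as the item-total correlation. The function names and the toy response matrix are hypothetical; they are not taken from the study's data or from SPSS.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if an item or the totals have no variance.
    return sxy / sqrt(sxx * syy)

def ctt_item_stats(responses):
    """Return (difficulty, discrimination) per item.

    responses: list of rows, one per examinee, each a list of 0/1 item scores.
    difficulty: proportion of examinees answering the item correctly.
    discrimination: correlation between the item score and the total score.
    """
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        stats.append((sum(item) / len(item), pearson(item, totals)))
    return stats

# Hypothetical data: 6 examinees, 4 items (1 = correct, 0 = incorrect).
toy = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

for number, (p, r) in enumerate(ctt_item_stats(toy), start=1):
    print(f"ITEM {number}: difficulty={p:.2f} discrimination={r:.2f}")
```

Under this convention a higher difficulty value means an easier item, and both indices change when the examinee group changes, which is the sample dependence of CTT item statistics.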
Traditionally, CTT has been used as the method of analysis
in evaluating tests, although it has several limitations.
First, the person statistic (the observed score) is item dependent. Second, the item statistics (difficulty level and item discrimination) are examinee dependent. Item Response Theory addresses these major limitations of CTT.
The Rasch model, one form of IRT,
estimates the probability of a correct response to an item as a function of the person’s ability and the difficulty of the item.
In IRT, each item in a test has its own characteristic curve, which describes the probability of answering the item correctly depending on the test taker’s ability (Kaplan & Saccuzzo, 1997).
IRT asserts that the easier the question, the more likely a
student will be able to respond to it correctly, and the more able the student, the more likely he or she will be able to answer the question correctly compared with a student who is less able. The Rasch model is based on the assumption that guessing and item differences in discrimination are negligible (Anastasi & Urbina, 2002).
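The relationship described above can be sketched with the standard dichotomous Rasch formula, P(correct) = exp(θ − b) / (1 + exp(θ − b)), where θ is the person's ability and b is the item's difficulty, both in logits. The function name and the example values are illustrative assumptions, not output from Winsteps.

```python
from math import exp

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model.

    ability and difficulty are on the same logit scale; only their
    difference matters.
    """
    return 1.0 / (1.0 + exp(-(ability - difficulty)))

# Points on the item characteristic curve of a hypothetical item (b = 0.5).
curve = [(theta, rasch_probability(theta, 0.5)) for theta in (-3, -2, -1, 0, 1, 2, 3)]

for theta, p in curve:
    print(f"ability={theta:+d} logits  P(correct)={p:.2f}")
```

When ability equals difficulty the probability is 0.5. Because every item shares the same slope and there is no lower asymptote, the curves differ only in location, which reflects the assumption that discrimination differences and guessing are negligible.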
Method Participants A total of 63 incoming first-year high school students,
both male and female, participated in the study. The participants had finished grade 6 in different elementary schools in Manila and were applying to a Science High School. Their ages ranged from 11 to 13 years.
Instrument The Aptitude Test for High School was developed to
measure fundamental intellectual abilities in abstract reasoning, verbal reasoning, and quantitative reasoning. The instrument consists of a total of 100 multiple-choice items. The AHP consists of 30 items for abstract reasoning, 30 items for numerical reasoning, and 40 items for verbal reasoning.
The psychometric properties of the test include the following
reliability coefficients for each subtest: .70 for abstract reasoning, .77 for numerical reasoning, and .78 for verbal reasoning.
Procedure The test was administered to incoming first-year high
school students at a Science High School in Manila. The AHP was given as one of the assessment tools for selecting the applicants to be accepted into the Science High School. A trained examiner administered the test for one hour.
Data Analysis The data gathered were analyzed in terms of reliability
coefficients, item difficulty, and item discrimination using both CTT and IRT.
For item difficulty and item discrimination under the Rasch model, two samples were tested and compared.
The following computer software was used: SPSS version 16, Microsoft Excel 2007, and Winsteps for the IRT analysis.
Results Reliability Indices Using Classical Test Theory, the reliability coefficients for
abstract reasoning, numerical reasoning, and verbal reasoning were .70, .77, and .78, respectively.
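The slides do not name the reliability coefficient that was computed. As a sketch only, and assuming an internal-consistency index for right/wrong items, the example below implements KR-20, which for dichotomous items is equivalent to Cronbach's alpha. The function name and the toy data are hypothetical.

```python
def kr20(responses):
    """Kuder-Richardson 20 for a matrix of 0/1 item scores.

    responses: list of rows, one per examinee.
    Uses the population variance of total scores, consistent with p * (1 - p)
    being the population variance of each item.
    """
    n_persons = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_persons
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_persons
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)

# Hypothetical data: 6 examinees, 4 items (not the study's responses).
toy_scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

print(f"KR-20 = {kr20(toy_scores):.2f}")
```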
Table 1 Summary of Person and Item Measure for Abstract Reasoning
Persons   Score  Count  Measure  Error  Infit MNSQ  Infit ZSTD  Outfit MNSQ  Outfit ZSTD
Mean      21.8   30     1.33     0.5    1           0.1         0.94         0.1
SD        4      0      0.88     0.11   0.15        0.6         0.31         0.6
Real RMSE 0.51   True SD 0.72   Separation 1.39   Person reliability 0.66
Items     Score  Count  Measure  Error  Infit MNSQ  Infit ZSTD  Outfit MNSQ  Outfit ZSTD
Mean      45.9   63     0        0.36   1           0.1         0.94         0
SD        10.2   0      1.09     0.14   0.11        0.8         0.22         0.9
Real RMSE 0.39   True SD 1.02   Separation 2.65   Item reliability 0.88
Table 2 Summary of Person and Item Measure for Numerical Reasoning
Persons   Score  Count  Measure  Error  Infit MNSQ  Infit ZSTD  Outfit MNSQ  Outfit ZSTD
Mean      19.5   30     0.82     0.47   1           0.1         0.97         0
SD        4.9    0      0.95     0.97   0.16        0.9         0.27         0.9
Real RMSE 0.47   True SD 0.82   Separation 1.74   Person reliability 0.75
Items     Score  Count  Measure  Error  Infit MNSQ  Infit ZSTD  Outfit MNSQ  Outfit ZSTD
Mean      40.9   63     0        0.32   1           0.1         0.97         0
SD        10.5   0      0.93     0.04   0.11        0.9         0.2          1
Real RMSE 0.32   True SD 0.87   Separation 2.74   Item reliability 0.88
Table 3 Summary of Person and Item Measure for Verbal Reasoning
Persons   Score  Count  Measure  Error  Infit MNSQ  Infit ZSTD  Outfit MNSQ  Outfit ZSTD
Mean      21.8   40     0.33     0.4    1.01        0           0.99         0
SD        5.1    0      0.75     0.04   0.19        1           0.43         0.9
Real RMSE 0.4    True SD 0.63   Separation 1.59   Person reliability 0.72
Items     Score  Count  Measure  Error  Infit MNSQ  Infit ZSTD  Outfit MNSQ  Outfit ZSTD
Mean      33.8   62     0        0.34   0.99        0.1         0.99         0.1
SD        14.8   0      1.41     0.12   0.09        0.8         0.29         1
Real RMSE 0.36   True SD 0.87   Separation 3.79   Item reliability 0.94
Table 4 Summary of Item Difficulty for Abstract Reasoning using Two Samples
SAMPLE 1 SAMPLE 2
MEASURE SE MEASURE SE
ITEM 1 -0.27 0.47 -0.02 0.44
ITEM 2 0.49 0.41 0.67 0.4
ITEM 3 0.96 0.39 0.17 0.43
ITEM 4 -2.36 1.03 -0.98 0.56
ITEM 5 0.32 0.42 -0.22 0.46
ITEM 6 -0.51 0.51 -0.44 0.48
ITEM 7 -0.27 0.47 -0.44 0.48
ITEM 8 2.01 0.4 1.9 0.4
ITEM 9 1.7 0.39 1.44 0.39
ITEM 10 -0.27 0.47 -0.02 0.44
ITEM 11 0.8 0.39 0.17 0.43
ITEM 12 1.25 0.38 0.98 0.39
ITEM 13 -0.27 0.47 0.51 0.41
ITEM 14 -0.51 0.51 -2.55 0.03
ITEM 15 -0.79 0.55 -0.44 0.48
ITEM 16 0.96 0.39 -0.22 0.46
ITEM 17 -0.27 0.47 -1.8 0.75
ITEM 18 -1.14 0.62 -0.98 0.56
ITEM 19 -1.6 0.75 -1.8 0.75
ITEM 20 0.32 0.42 0.34 0.42
ITEM 21 -0.79 0.55 -0.44 0.48
ITEM 22 -3.56 1.81 -2.55 1.03
ITEM 23 0.14 0.43 0.34 0.42
ITEM 24 0.32 0.42 2.22 0.41
ITEM 25 -1.14 0.62 0.83 0.39
ITEM 26 1.7 0.39 2.78 0.46
ITEM 27 -0.27 0.47 -0.22 0.46
ITEM 28 0.32 0.42 1.29 0.39
ITEM 29 -0.51 0.51 -0.69 0.51
ITEM 30 -0.27 0.47 0.17 0.43
Table 5 Summary of Item Difficulty for Numerical Reasoning using Two Samples
SAMPLE 1 SAMPLE 2
MEASURE SE MEASURE SE
ITEM 1 -0.91 0.51 -0.61 0.44
ITEM 2 -2.76 1.03 -1.02 0.48
ITEM 3 2.45 0.46 1.47 0.41
ITEM 4 0.76 0.4 -0.8 0.45
ITEM 5 0.92 0.39 0.23 0.39
ITEM 6 -0.45 0.46 -1.26 0.51
ITEM 7 -0.45 0.46 -0.61 0.44
ITEM 8 1.23 0.4 0.99 0.39
ITEM 9 0.28 0.41 -0.08 0.4
ITEM 10 0.45 0.4 0.39 0.39
ITEM 11 -0.67 0.48 -1.02 0.48
ITEM 12 -1.2 0.56 -0.25 0.41
ITEM 13 -1.55 0.63 -1.26 0.51
ITEM 14 0.28 0.41 0.54 0.39
ITEM 15 0.61 0.4 0.23 0.39
ITEM 16 1.38 0.4 1.64 0.42
ITEM 17 -2.76 1.03 -1.26 0.51
ITEM 18 0.61 0.4 -0.61 0.44
ITEM 19 0.28 0.41 0.39 0.39
ITEM 20 -0.45 0.46 -0.42 0.42
ITEM 21 -0.45 0.46 0.84 0.39
ITEM 22 -0.91 0.51 -1.02 0.48
ITEM 23 -0.67 0.48 -0.25 0.41
ITEM 24 -0.06 0.43 -0.25 0.41
ITEM 25 -1.2 0.56 0.23 0.39
ITEM 26 2.06 0.43 1.82 0.43
ITEM 27 0.92 0.39 0.23 0.39
ITEM 28 -0.45 0.46 -0.42 0.42
ITEM 29 1.23 0.46 0.69 0.39
ITEM 30 1.07 0.39 1.47 0.41
Table 6 Summary of Item Difficulty for Verbal Reasoning using Two Samples
SAMPLE 1 SAMPLE 2
MEASURE SE MEASURE SE
ITEM 1 -0.43 0.4 -0.66 0.41
ITEM 2 0.17 0.38 0.69 0.38
ITEM 3 0.77 0.39 1.33 0.42
ITEM 4 -2.67 0.74 -2.53 0.74
ITEM 5 -3.41 1.02 -4.5 1.83
ITEM 6 1.55 0.44 1.7 0.45
ITEM 7 -2.21 0.62 -2.53 0.74
ITEM 8 -0.59 0.41 0.55 0.38
ITEM 9 1.64 0.45 0.55 0.38
ITEM 10 -0.94 0.43 -0.66 0.41
ITEM 11 -1.35 0.47 -2.08 0.62
ITEM 12 -1.13 0.45 -3.28 1.02
ITEM 13 0.77 0.39 0.69 0.38
ITEM 14 0.77 0.39 0.69 0.38
ITEM 15 -0.94 0.43 -2.53 0.74
ITEM 16 -1.58 0.51 -3.28 1.02
ITEM 17 -1.58 0.51 -2.08 0.62
ITEM 18 -1.35 0.47 -1.02 0.45
ITEM 19 1.64 0.45 1.33 0.42
ITEM 20 0.03 0.38 1 0.4
ITEM 21 1.86 0.48 1.16 0.41
ITEM 22 0.93 0.4 0.4 0.38
ITEM 23 -0.12 0.39 -1.02 0.45
ITEM 24 0.93 0.4 1.16 0.41
ITEM 25 1.86 0.48 1.5 0.43
ITEM 26 0.62 0.39 1.7 0.45
ITEM 27 1.64 0.45 1.91 0.48
ITEM 28 0.77 0.39 1.16 0.41
ITEM 29 -0.59 0.41 0.26 0.38
ITEM 30 -0.27 0.39 0.69 0.38
ITEM 31 -0.27 0.39 -0.49 0.4
ITEM 32 1.64 0.45 0.84 0.39
ITEM 33 -0.27 0.39 -0.18 0.39
ITEM 34 0.03 0.38 -0.18 0.39
ITEM 35 -0.43 0.4 0.26 0.38
ITEM 36 2.74 0.63 2.79 0.63
ITEM 37 -0.43 0.43 -0.49 0.4
ITEM 38 0.93 0.4 1.16 0.41
ITEM 39 -0.94 0.43 -1.02 0.45
ITEM 40 0.17 0.38 0.55 0.38
Discussion In terms of reliability, the estimates obtained using
CTT and IRT are moderately high. This suggests a good chance that persons estimated with higher measures actually have higher measures than persons estimated with lower measures.
The item and person separation indices also indicate that the sample can be separated into ability groups and the test items into difficulty groups.
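The separation and reliability values in Tables 1 to 3 are linked. Under the usual Rasch convention, separation G is the ratio of the true SD to the RMSE, and reliability equals G² / (1 + G²). The sketch below applies that formula to the separation values reported in the tables; the function name is hypothetical, and small differences from the reported reliabilities are expected because the inputs are rounded.

```python
def reliability_from_separation(separation):
    """Convert a separation index G into a reliability: G^2 / (1 + G^2)."""
    g_squared = separation ** 2
    return g_squared / (1.0 + g_squared)

# (label, reported separation, reported reliability) from Tables 1 to 3.
reported = [
    ("Abstract, persons", 1.39, 0.66),
    ("Abstract, items", 2.65, 0.88),
    ("Numerical, persons", 1.74, 0.75),
    ("Numerical, items", 2.74, 0.88),
    ("Verbal, persons", 1.59, 0.72),
    ("Verbal, items", 3.79, 0.94),
]

checks = [
    (label, round(reliability_from_separation(g), 2), rel)
    for label, g, rel in reported
]

for label, computed, rel in checks:
    print(f"{label}: computed={computed:.2f} reported={rel:.2f}")
```

A separation near 1.4 to 1.7, as found for persons, corresponds to reliabilities around .66 to .75, while the larger item separations correspond to item reliabilities near .9.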
In terms of item discrimination, the same items were
found to have a poor discrimination index for numerical reasoning and verbal reasoning using CTT and IRT. Therefore, these items should be revised.
For abstract reasoning, however, only 2 of the 5 items considered poor using CTT were also considered poor using IRT. In terms of item difficulty, similar items were identified as difficult under both models.
However, the number of items
considered difficult differs between CTT and IRT. These findings suggest a relative degree of stability across CTT and IRT in terms of item discrimination.
Overall, the results appear consistent across CTT and IRT.
In this study, however, one advantage of
IRT over CTT was evident: the sample-free nature of its results. This means that item parameters are invariant when computed using groups of different abilities.
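The slides do not state how the two samples were compared. One common way to examine invariance, offered here only as a sketch, is to divide the difference between an item's two difficulty estimates by their joint standard error. The function name is hypothetical; the three rows are copied from Table 4.

```python
from math import sqrt

def difficulty_shift(measure_1, se_1, measure_2, se_2):
    """Return (difference, standardized difference) for one item.

    The standardized difference divides the gap between the two sample
    estimates by the joint standard error sqrt(se_1^2 + se_2^2).
    """
    difference = measure_1 - measure_2
    joint_se = sqrt(se_1 ** 2 + se_2 ** 2)
    return difference, difference / joint_se

# (Sample 1 measure, SE, Sample 2 measure, SE) for three abstract reasoning items.
table4_rows = {
    "ITEM 1": (-0.27, 0.47, -0.02, 0.44),
    "ITEM 8": (2.01, 0.40, 1.90, 0.40),
    "ITEM 25": (-1.14, 0.62, 0.83, 0.39),
}

shifts = {item: difficulty_shift(*row) for item, row in table4_rows.items()}

for item, (difference, z) in shifts.items():
    print(f"{item}: difference={difference:+.2f} logits, standardized={z:+.2f}")
```

Items whose standardized difference is small relative to a chosen cutoff (an absolute value of about 2 is a common convention, assumed here rather than taken from the study) behave consistently across the two samples; larger values flag items worth a closer look.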
Thank you!