Development of Mathematics Diagnostic Tests


Development of Mathematics Diagnostic Test for DORSHS

Second Year High School Students Using Item Response Theory

Jeremias C. Ceniza

Donnell C. Cereno

Abstract

The study developed a diagnostic test to measure learning in Intermediate Algebra among second year high school students of Davao Oriental Regional Science High School (DORSHS). The test was evaluated for validity, item discrimination, item difficulty and reliability. Test construction was adapted from the Quantitative Research Methods in Educational Planning module. Content validity was assessed by 3 experts in the content area of secondary mathematics. Primary data for item analysis were gathered through 2 test tryouts: one with 59 third year students for grammatical checking and compatibility benchmarking, and another with 78 second year students for discrimination and difficulty indices and reliability. Analyses were carried out using ConQuest, an IRT modeling software, together with a spreadsheet and the statistical software SPSS. The study used the IRT two-parameter logistic model. The test was found to be valid and highly reliable. The experts affirmed that the instrument covers all content in the learning competencies of Mathematics II and that nothing was left untargeted. The reliability coefficient of the test is 0.84. Item difficulty ranged from -1.37 to 1.35 and discrimination indices from 0.23 to 0.51. This standardized test is a tool that identifies the zone of proximal development and the mastered and least learned content areas in Mathematics II among DORSHS second year students.



The Davao Oriental Regional Science High School (DORSHS) of Region XI offers additional subjects in Math, Science and English to its students as early as first year. Over the five years of its existence, the DORSHS National Achievement Test (NAT) results in Mathematics have been far below the 75% Mean Percentage Score (MPS) target. The researcher became interested in creating a valid and reliable diagnostic test for DORSHS second year students since the school does not have one yet. This instrument would be used to detect the strengths and weaknesses of second year students before they undergo a review program for the NAT. In general, the instrument is intended to help DORSHS improve its Mathematics results through the NAT review.

The study aimed to construct a standardized diagnostic test in Mathematics for second

year students in Davao Oriental Regional Science High School (DORSHS). Specifically, it

sought to answer the following questions:

1. Is the diagnostic test valid?

2. To what extent do the test items illustrate difficulty?

3. To what degree does the test exhibit discrimination level?

4. How reliable is the entire test according to the IRT model?



Conceptual Framework

Figure 1. The Conceptual Paradigm of the Study

Item Response Theory (IRT)

The classical measurement models for constructing tests and interpreting test scores served their purpose well for a long time. Over the past forty years, however, a new test theory was developed that is conceptually more powerful than classical test theory. Because it is based on items rather than test scores, the new approach is known as item response theory (Baker, 2001).

[Figure 1 components. Diagnostic test content areas: learning areas in Mathematics II, test objectives. Test construction and test development: content validity; IRT analysis (item analysis: difficulty indices, discrimination indices; test reliability: KR-20). Diagnostic test results: students' zone of proximal development, least learned and mastered competencies, settings of item thinking level.]



In contrast to the limitations of classical test models, item response theory has many desirable features. These include (a) item characteristics that are not group-dependent, (b) scores describing examinee proficiency that are not test-dependent, (c) a model that is expressed at the item level rather than at the test level, (d) a model that does not require strictly parallel tests for assessing reliability, and (e) a model that provides a measure of precision for each ability score (Hambleton, Swaminathan, & Rogers, 1991). With this advancement in educational and psychological measurement, test tryouts can be administered to groups of various abilities before the test is given to the target examinees. Kim, Cohen, & Park (1995) illustrated further that IRT allows researchers to conduct rigorous tests of measurement equivalence across experimental groups. IRT methods can distinguish item bias from true differences on the attribute measured, which classical test theory (CTT) cannot do.

The point biserial, or point-biserial correlation coefficient (rpb), is one common metric used to assess item quality. The "pt bis", as it is sometimes called, is the correlation between an item score (1/0) and the total score on a test. Positive values are desirable and indicate that the item is good at differentiating between high ability and low ability examinees (Bontempo, 2009).
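The definition above can be illustrated with a minimal Python sketch. It computes the point biserial as an ordinary Pearson correlation between a 0/1 item score and the total score. The function name and the five-examinee data set are hypothetical and are not taken from the study.

```python
import math

def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total test score."""
    n = len(item_scores)
    if n != len(total_scores) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 examinees")
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(item_scores, total_scores))
    var_x = sum((x - mean_x) ** 2 for x in item_scores)
    var_y = sum((y - mean_y) ** 2 for y in total_scores)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable has no variance")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: the item is answered correctly by the three highest scorers.
item = [0, 0, 1, 1, 1]
total = [10, 12, 20, 25, 30]
r_pb = point_biserial(item, total)  # positive: the item separates high from low scorers
```

Reversing the item scores (so that only the low scorers answer correctly) flips the sign of the coefficient, which is the pattern flagged as undesirable in the item review.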

The Zone of Proximal Development, or ZPD, as established by the Russian psychologist Lev Vygotsky, refers to the distance between what a child can do with assistance and what the child can accomplish without assistance (Vygotsky, 1978). In other words, it is the students' learning of a certain competency at the 50% mastery level. From the IRT perspective, the test items that the child has a 50% chance of answering correctly are the items within the child's zone of proximal development. At this point, the child has a 50-50 chance of achieving the learning task without the teacher's assistance.

Assessment

The main purpose of assessment is to improve the learning outcomes of students. In assessing students' learning, a systematic process is followed: the Diagnostic Test, which is an assessment before a learning program; the Formative Test, which is an assessment during a learning program; and the Summative Test, which is an assessment after a learning program (Davies, Arbuckle & Bonneau, 2005).

Conducting a Diagnostic Test

The main purpose of a diagnostic test is to pinpoint specific strengths and weaknesses of the learner in accordance with his or her grade level requirements. Diagnostic tests are scored using true test score criteria, which means that they are not averaged or normed (Educational Diagnostic Prescriptive Services, 2009). Izard (2005) says that scores in a criterion-referenced test are interpreted as the individual performance of each student in the group, in terms of what he or she can or cannot do, rather than by comparing the results with other groups of students.

Stages in Test Construction

Izard (2005), in his module on Quantitative Research Methods in Educational Planning, gave an overview of the steps in test construction. His module describes the different stages of developing a test that could be used by teachers in the classroom and even nationwide as an achievement test. The methods give important details on how a test should be constructed so that it yields results that are valid, fair and reliable.

Validity, Reliability and Usability

Whatever the test is, it should have the qualities of a good measuring instrument. These qualities are validity, reliability, and usability (Calmorin, 2004).

Validity is the extent to which a test measures what it claims to measure. One type of validity is content validity. It refers to the extent to which the test reflects the content represented in curriculum statements and the skills implied by that content.

Reliability is another important characteristic of a good test. This refers to the consistency of a measure. A test is considered reliable if we get the same result repeatedly; that is, each time the test is administered to a subject, the results should be approximately the same (Cherry, 2005). A reliability coefficient within the range of 0.81 to 1.0 indicates high reliability; 0.61 to 0.80 signifies moderate reliability; 0.41 to 0.60, fair reliability; 0.10 to 0.40, slight reliability; and less than 0.10, virtually no reliability (Shrout, 1998).



Methods

Research Locale and Duration

Table 1. Timescale and Resource Requirement for the Test Development

Stages                                                   Time       Resources

I. Planning the Test
   a. Developing Test Item Objectives                    1 week     Basic Education Curriculum (BEC) list of competencies
   b. Constructing Test Grid or Table of Specification   1 week     DepEd Memo on the Official Number of School Days

II. Preparing the Test
   a. Content Analysis                                   1 week     Learning competencies, textbooks
   b. Item Writing and Preparation of Answers Key        1 month    Guidelines on test construction, Mathematics textbooks

III. Review and Testing Process
   a. Item Review I                                      2 weeks    Test construction team, research adviser
   b. Planning of Item Scoring                           2 days     Answers key, spreadsheet software
   c. Production of Trial Tests I                        1 day      Word-processing
   d. Trial Testing I                                    1 day      DORSHS 3rd year students
   e. Item Review II                                     3 days     Researcher
   f. Production of Trial Tests II                       1 day      Word-processing
   g. Trial Testing II                                   1 day      DORSHS 2nd year students
   h. Item Analysis                                      2 weeks    IRT software, encoder

IV. Assembly of Final Test
   a. Revising Test Items                                5 days     IRT concepts and interpretations
   b. Finalization of the Test                           2 days     Data findings and results
   c. IRT Model Fitting                                  1 day      IRT ConQuest modeling software
   d. Identification of the Zone of Proximal Development 1 day      IRT ConQuest modeling software, learning competencies
   e. Test Difficulty Settings                           1 day      IRT results, test grid

Table 1 gives the timescale and resource requirements for test development, patterned after Module 6 of the Quantitative Research Methods in Educational Planning (Izard, 2005). The study was conducted on the Davao Oriental Regional Science High School (DORSHS) campus in Mati City during February of School Year 2010-2011.



Statistical Treatment

This study used an IRT model to describe the data. Item analysis and test reliability (as explained by Brannick, 2006; Baker, 2001; Hambleton, Swaminathan, & Rogers, 1991) were computed with computer software and interpreted as follows:

Item Analysis. Item analysis under the IRT two-parameter model was the last part of the item review and verified how each item performed in the final test tryout. The researcher, together with an IRT expert, did the following:

i. Processing test responses through the IRT model

To determine the difficulty and discrimination indices of the test items, computer software known as ConQuest: Generalised Item Response Modelling Software was used to construct the Item Characteristic Curve (ICC) of each test item. The responses of all test takers for each item were recorded in a spreadsheet, converted into text format through SPSS, and then run through the ConQuest software. The software then generated a summary of statistical results for each item and the corresponding ICC, as shown in Figure 2.

Figure 2. Sample Item Characteristic Curve (ICC)



The ICC, as shown in Figure 2, was the basis of item estimation for the two parameters: item difficulty and item discrimination. The horizontal axis (θ) is the latent ability of the examinees while the vertical axis P(θ) is the probability that an examinee chooses the correct answer. The ability (θ) at which P(θ) is equal to 0.5 is the value of the difficulty parameter. The estimated value describing the steepness of the curve is the value of the item discrimination parameter.
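The reading of the ICC described above can be sketched in Python, assuming the standard two-parameter logistic form P(θ) = 1 / (1 + exp(-a(θ - b))) without the optional 1.7 scaling constant. The function name and the parameter values are illustrative assumptions; they are not output from ConQuest.

```python
import math

def icc_2pl(theta, a, b):
    """Probability of a correct response under a two-parameter logistic model.

    theta: latent ability, a: discrimination (slope), b: difficulty (location).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical item: moderate discrimination, easy difficulty.
a, b = 0.5, -1.0
p_at_b = icc_2pl(b, a, b)          # ability equal to the difficulty gives P = 0.5
p_above = icc_2pl(b + 1.0, a, b)   # higher ability gives a higher probability
p_below = icc_2pl(b - 1.0, a, b)   # lower ability gives a lower probability
```

Under this form the curve passes through 0.5 exactly where θ equals b, and its slope at that point is a/4, so a larger a gives a steeper curve.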

ii. Difficulty and Item Discrimination

With the item difficulty and item discrimination parameter values given by the

ICC, each item of the test was interpreted as the following:

Labels for item discrimination parameter (a) values

Verbal label     Range of values

Negative         less than zero
Zero             0
Low              0.01 to 0.20
Moderate         0.21 to 0.60
High             above 0.60

The discrimination parameters are sometimes called slope parameters. A jumpy curve means that the expected test score responds unevenly to true ability. A flat curve means that the expected score is not very sensitive to differences in true ability. A steeper S-curve (as in Figure 2) means that the expected score is more sensitive to differences in ability. In other words, the test discriminates or distinguishes better between persons of different ability, which explains the term discrimination parameter. Thus, the discrimination parameter describes how well an item can differentiate between examinees having abilities below the item location and those having abilities above the item location.

Labels for item difficulty parameter (b) values

Verbal label     Range of values

Very easy        less than -2.00
Easy             -2.00 to -0.50
Average          -0.49 to 0.49
Difficult        0.50 to 2.00
Very difficult   greater than 2.00

The item difficulty parameter (b) value tells us how easy or how difficult an item is. Under item response theory, an item's difficulty is the point on the ability scale where the probability of a correct response is 0.5. One can find the value of b on the common ability axis at the point for which the predicted probability equals 0.5.
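The two label schemes can be expressed as a small Python sketch. It assumes the difficulty ranges are symmetric about zero (Easy spanning -2.00 to -0.50) and that values falling between the printed two-decimal boundaries go to the adjacent band; both function names are hypothetical.

```python
def discrimination_label(a):
    """Verbal label for an item discrimination (a) value."""
    if a < 0:
        return "Negative"
    if a == 0:
        return "Zero"
    if a <= 0.20:
        return "Low"
    if a <= 0.60:
        return "Moderate"
    return "High"

def difficulty_label(b):
    """Verbal label for an item difficulty (b) value.

    Assumption: the bands are symmetric about zero, and values strictly
    between -0.50 and 0.50 count as Average.
    """
    if b < -2.00:
        return "Very easy"
    if b <= -0.50:
        return "Easy"
    if b < 0.50:
        return "Average"
    if b <= 2.00:
        return "Difficult"
    return "Very difficult"

# A difficulty of -1.08 with a discrimination of 0.34 reads as Easy and Moderate.
labels = (difficulty_label(-1.08), discrimination_label(0.34))
```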

Test reliability. In classical test theory, assessing reliability calls for another test. In IRT, there is local reliability, that is, an amount of information at each point of the underlying continuum. With the IRT model, each item of the test contains information. For the set of parameters associated with each term in a model, ConQuest computes a separation reliability index. This reliability is an index of the equality of the parameters. In the case of dichotomous data like those of the test conducted, the Coefficient Alpha given by ConQuest is equal to KR-20 (Wu, Adams, et al., 2007).
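For dichotomous items, KR-20 can be computed directly from the scored response matrix. The sketch below is a minimal Python version under stated assumptions: it uses the population variance of the total scores (software packages may use a different variance convention), and the function name and the five-examinee data set are hypothetical.

```python
def kr20(responses):
    """Kuder-Richardson Formula 20 for a matrix of 0/1 item scores.

    responses: list of examinee rows, each a list of 0/1 scores, one per item.
    Assumption: the total-score variance is the population variance (divide by n).
    """
    n = len(responses)
    k = len(responses[0])
    if n < 2 or k < 2:
        raise ValueError("need at least 2 examinees and 2 items")
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n  # proportion correct on item j
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / var_total)

# Hypothetical scored responses: 5 examinees by 4 items.
scored = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
reliability = kr20(scored)
```

For this small data set the item variances sum to 0.80 and the total-score variance is 2.0, so the coefficient works out to (4/3)(1 - 0.4) = 0.8.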

Revising Test Items. The selection of the items suited for inclusion to the final output of

the test was determined through the verbal interpretation of each item. This is illustrated by

Table 2 below.



Table 2. Decision Table for Difficulty and Discrimination Indices

Difficulty Level   Discrimination Level   Decision

Easy               Low                    Revise
                   Moderate               Retain
                   High                   Retain

Average            Low                    Revise
                   Moderate               Retain
                   High                   Retain

Difficult          Low                    Revise
                   Moderate               Retain
                   High                   Retain

An item was rejected if any of the following was observed: (a) its discrimination parameter value was negative or zero, and/or (b) its difficulty level was very easy or very difficult.
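The decision table and the rejection rule can be combined into one lookup. The following Python sketch takes the verbal labels as input; the function name is hypothetical, and the handling of unrecognized labels (raising an error) is an added assumption.

```python
def item_decision(difficulty_level, discrimination_level):
    """Return 'Reject', 'Revise' or 'Retain' from the two verbal labels."""
    diff = difficulty_level.strip().lower()
    disc = discrimination_level.strip().lower()
    known_diff = {"very easy", "easy", "average", "difficult", "very difficult"}
    known_disc = {"negative", "zero", "low", "moderate", "high"}
    if diff not in known_diff or disc not in known_disc:
        raise ValueError("unrecognized difficulty or discrimination label")
    # Rejection rule: negative or zero discrimination, or an extreme difficulty.
    if disc in ("negative", "zero") or diff in ("very easy", "very difficult"):
        return "Reject"
    # Remaining cases follow the decision table: low is revised, the rest retained.
    if disc == "low":
        return "Revise"
    return "Retain"

decision = item_decision("Easy", "Low")
```

Checking the rejection rule first means an item with an extreme difficulty is rejected even when its discrimination is moderate or high.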

To preserve the validity of the entire test and to maintain the total number of items, all rejected items were automatically replaced. The revised and replacement items were subjected to IRT item analysis while still conforming to the Test Grid.

Revised and replacement items were re-administered to the target examinees, the second year students, and their test responses were re-run through the IRT software for final verification of whether these items would be retained. Retesting of selected items was done as necessary in preparation for the finalization of the test.

Evaluators of the Constructed Test

A group of experts and experienced persons in the field of test construction was in charge of the development of the test. This group served as consultants on the content validity of the test. Another group, composed of students, took the test for item review and test reliability. These two teams were called the test evaluators.

Test Construction Team. The first set of evaluators was the group of persons who took charge of examining the content validity of the test. This group was made up of experts on test development or individuals with experience in testing or related training. More importantly, these persons were familiar with the content areas in secondary mathematics and held at least a Master of Arts in Mathematics or a Master of Science Teaching in Mathematics.

Try Out Test Examinees. The second set of evaluators was the group of students in Davao Oriental Regional Science High School (DORSHS). Specifically, the team comprised all third year (first group) and all second year (second group) students of the said school. The administration of the test to the first group was vital in making the test structurally ready, while the responses of the second group were used to determine the difficulty and discrimination indices of the test items and the reliability coefficient of the overall test.

Validation of the Diagnostic Test

Validation of the diagnostic test did not utilize a statistical analysis. It relied on matching the test items to the objectives and presenting the whole test to the group of experts in the content areas of secondary mathematics for item review. The team affirmed that the instrument had strong content validity, in that each item represented at least one topic actually taught to the students rather than asking unrelated questions.



Test Administration

The test tryouts were administered after securing permission from the heads of the offices concerned. The conduct of the test was officially approved by the OIC - Schools Division Superintendent. Likewise, the principal of DORSHS posed no objection to proceeding with the testing process. As an ethical consideration, examinees involved in this study who were minors were given proper information about the research.

The first tryout was administered to 59 third year students of DORSHS. Its purpose was to determine the structural readiness of the test and to ensure its compatibility with the examinees' thinking level. The next tryout was administered to 78 second year high school students of DORSHS for interpretation of test results. Specifically, the test was conducted for item analysis purposes.

The researcher gave the test takers certain guidelines for taking the test. Examinees were told to use the answer sheets provided. They were instructed to shade on the answer sheet the letter that corresponds to the best answer for every test question. They were also told to mark X on the previous answer if they decided to change their answer, and that they must have only one answer for each item; otherwise, the item would be marked wrong.



Results and Discussion

Test Validation

The validity of the test was established through revisions of some items as suggested by the test construction team in line with the following test attributes: grammatical structure, proper usage of punctuation, principles of item construction, setting of item difficulty, typographical precision and, more importantly, content criterion.

Difficulty and Discrimination Indices of Test Items

Table 3 shows the result of the item analysis from the final tryout using IRT through a computer program called ConQuest: Generalised Item Response Modelling Software.

Table 3. Difficulty and Discrimination Indices Result of the Final Tryout

Item     Difficulty   Difficulty       Discrimination   Discrimination   Action
Number   Index        Level            Index            Level

1        -1.08        Easy             -0.01            Negative         Reject
2        -1.08        Easy             0.34             Moderate         Retain
3        -1.41        Easy             0.20             Low              Revise
4        0.58         Difficult        0.08             Low              Revise
5        -0.12        Average          0.46             Moderate         Retain
7        0.58         Easy             0.30             Moderate         Retain
8        -1.29        Easy             0.34             Moderate         Retain
9        0.57         Difficult        0.23             Moderate         Retain
12       0.52         Difficult        0.13             Low              Revise
13       -2.53        Very Easy        0.14             Low              Revise
21       0.27         Average          0.34             Moderate         Retain
22       0.33         Average          0.50             Moderate         Retain
23       -0.18        Average          0.27             Moderate         Retain
33       -0.12        Average          0.47             Moderate         Retain
34       0.33         Average          0.17             Low              Revise
38       0.45         Average          0.17             Low              Revise
40       -2.12        Very Easy        0.14             Moderate         Reject
41       -1.37        Easy             0.14             Low              Revise
52       1.27         Difficult        -0.02            Negative         Reject
53       2.19         Very Difficult   0.30             Moderate         Reject
54       0.45         Average          0.28             Moderate         Retain
55       0.45         Average          0.51             Moderate         Retain
59       2.90         Very Difficult   0.42             Moderate         Reject




The IRT discrimination and difficulty parameters for each item suggested that item numbers 1 (negative discrimination), 13 (very easy), 40 (very easy), 52 (negative discrimination), 53 (very difficult) and 59 (very difficult) be rejected. The items identified for revision according to Table 3 were numbers 3, 4, 12, 34, 38, 41, 48 and 51.

Revisions and Replacements of Weak Items

Data generated by ConQuest facilitated the revision or replacement of some items. Items with negative or low discrimination, such as numbers 1, 3, 4, 12, 34, 38, 41, 48 and 52, were treated with the aid of an item statistic called the point biserial. The following table shows the IRT Generalized Item Analysis Result of item number 1. The other items subject to revision had attributes similar to those in this table.

Table 4. IRT Generalized Item Analysis Result for Item No. 1

item:1 (1)
Cases for this item     78       Discrimination   -0.01
Item Threshold(s):      -1.08    Weighted MNSQ    1.14
Item Delta(s):          -1.08
-----------------------------------------------------------------------
Label   Score   Count   % of tot   Pt Bis   t (p)          PV1Avg:1   PV1 SD:1
-----------------------------------------------------------------------
1       1.00    56      71.79      -0.01    -0.06(.952)    -0.00      0.72
2       0.00    2       2.56       -0.02    -0.21(.833)    -0.31      0.05
3       0.00    2       2.56       -0.00    -0.00(.000)    0.11       1.29
4       0.00    18      23.08      0.02     0.14(.885)     0.02       0.64
=======================================================================

As observed in Table 4, item number 1, like the other weak items, had a low or negative discrimination. The point biserial of the correct answer was negative or very close to zero, while some wrong options had positive values where ideally they should be negative. This simply means that the wrong options were attractive as correct answers to examinees with high ability. These options were reviewed and then replaced for the improvement of the item and the ease of the test takers. Other items were also improved by restructuring the manner of questioning to lessen confusion in answering.
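The option-level check described above can be sketched in Python. The sketch flags a keyed answer whose point biserial is not positive and any wrong option whose point biserial is positive. The function name and the returned structure are hypothetical; the input values are the Pt Bis figures reported for item number 1.

```python
def flag_options(pt_bis_by_option, key):
    """Flag a weak key and attractive distractors from option point biserials.

    pt_bis_by_option: dict mapping option label -> point biserial
    key: label of the correct option
    """
    if key not in pt_bis_by_option:
        raise ValueError("key must be one of the option labels")
    # The keyed answer is weak when its point biserial is zero or negative.
    key_weak = pt_bis_by_option[key] <= 0
    # A distractor is attractive to high scorers when its point biserial is positive.
    attractive = sorted(
        option
        for option, value in pt_bis_by_option.items()
        if option != key and value > 0
    )
    return {"key_weak": key_weak, "attractive_distractors": attractive}

# Point biserials reported for item number 1 (option 1 is the keyed answer).
item_1 = {"1": -0.01, "2": -0.02, "3": -0.00, "4": 0.02}
flags = flag_options(item_1, key="1")
```

For item number 1 this marks the key as weak and option 4 as the distractor drawing higher-scoring examinees.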

The IRT Generalized Item Analysis Result also revealed that the Delta values for item numbers 40, 53 and 59 were -2.12, 2.19 and 2.90, respectively. Item 40 was therefore very easy, while items 53 and 59 were very difficult. Thus, the level of questioning in these items was rephrased to fit the students' level.

Finalization of the Test

Table 5 shows the item difficulty and discrimination results after retesting of the revised

and replacement items.

Table 5. Retesting Results on Difficulty and Discrimination Indices

Item     Difficulty   Difficulty   Discrimination   Discrimination   Action
Number   Index        Level        Index            Level















Test Reliability

As observed in the summary results in Table 6 below, the Coefficient Alpha is 0.84. This is the Kuder-Richardson Formula 20 (KR-20) reliability coefficient (Wu et al., 2007).



Table 6. Summary Statistics from Item Analysis Results

-----------------------------------------------------------------------

The following results are scaled to assume that a single response was provided for each item.

N 78

Mean 28.77

Standard Deviation 8.56

Variance 73.30

Skewness 0.93

Kurtosis 1.05

Coefficient Alpha 0.84

=======================================================================

Zone of Proximal Development

The following table maps the items to their corresponding learning areas in Mathematics II in which the students have shown 50% mastery, which is known as the Zone of Proximal Development.



Table 7. Students' Achievement on Different Learning Areas in Mathematics II

Mathematics learning areas within each student group's zone of proximal development

Upper Top Group (UTG)

- use systems of linear equations to solve problems
- represent the solution set of a system of linear inequalities by graphing
- simplify complex rational algebraic expressions
- solve problems involving expressions with exponents
- describe an arithmetic sequence by giving the formula for the nth term
- solve problems involving geometric mean

Upper Middle Group (UMG)

- represent the solution set of a system of linear inequalities by graphing
- find the solution set of a quadratic equation
- identify rational algebraic expression
- perform operations on rational algebraic expressions
- solve rational equations and check for extraneous solutions
- solve problems involving rational algebraic expressions
- solve equations involving variations
- demonstrate understanding of expressions
- rewrite algebraic expressions with zero and negative exponents
- name two rational numbers between which the nth root of x lies
- simplify expressions containing rational exponents using laws of exponents
- solve problems involving radical equations
- solve problems involving arithmetic means
- solve problems involving arithmetic sequences
- describe a geometric sequence given the first few terms
- derive the formula for the sum of the terms of a geometric sequence
- derive the formula for an infinite geometric series
- solve problems involving geometric sequence
- find the solution set of a quadratic equation

Lower Middle Group (LMG)

- perform operations on radical expressions
- define a system of linear equations in two variables
- translate certain situations in real life to linear inequalities
- draw the graph of a linear inequality in two variables
- solve rational equations which can be reduced to quadratic equations
- use quadratic equations to solve problems
- simplify rational algebraic expressions
- identify variation relationships of real life
- represent variation relationships as equations
- evaluate numerical expressions involving integral exponents
- identify expressions which are perfect squares or perfect cubes
- find the square root or cube root of expressions
- rewrite expressions with rational exponents as radical expressions
- simplify the radical expression
- solve radical equations
- list the next few terms of a sequence given several consecutive terms
- derive a mathematical rule for generating the sequence
- given a few terms of an arithmetic sequence, find the common difference
- given two terms of an arithmetic sequence, find the specified nth term
- derive the formula for the sum of the n terms of an arithmetic sequence
- describe a geometric sequence given the first few terms
- find the sum of the terms of a geometric sequence
- define a system of linear inequalities
- describe an arithmetic sequence by giving the first few terms

Lower Bottom Group (LBG)

- solve systems of linear equations in two variables
- distinguish a quadratic equation from a linear equation
- translate verbal expressions into rational algebraic expressions
- rationalize a fraction whose denominator contains square roots
- define the sum of an arithmetic sequence

Table 7 maps the students' scores to their zone of proximal development (ZPD), that is, the learning competencies in which they had a 50% mastery level. Thus, the competencies above a group's ZPD were its least learned learning areas, while those below its ZPD were the competencies the group had mastered.
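The mapping described above can be sketched in Python under stated assumptions: a two-parameter logistic response function, a single ability value standing in for a student group, and a tolerance band around the 50% point. The band width, the function names, and the three items are all hypothetical and are not taken from the study.

```python
import math

def p_correct(theta, a, b):
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def classify_for_group(theta, items, band=0.10):
    """Sort items into 'mastered', 'ZPD' or 'least learned' for one ability level.

    items: dict mapping item name -> (a, b)
    band: assumed half-width around P = 0.5 that still counts as the ZPD
    """
    result = {}
    for name, (a, b) in items.items():
        p = p_correct(theta, a, b)
        if abs(p - 0.5) <= band:
            result[name] = "ZPD"            # roughly a 50-50 chance of success
        elif p > 0.5:
            result[name] = "mastered"       # item lies below the group's ZPD
        else:
            result[name] = "least learned"  # item lies above the group's ZPD
    return result

# Hypothetical items with equal discrimination and increasing difficulty.
items = {"easy_item": (0.5, -2.0), "matched_item": (0.5, 0.0), "hard_item": (0.5, 2.0)}
placement = classify_for_group(theta=0.0, items=items)
```

With an ability of 0.0, the item whose difficulty matches the ability sits at exactly 50%, the easier item falls on the mastered side, and the harder item falls on the least learned side.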

Settings on Thinking Hierarchy

Table 8 shows the thinking level measured by each item as set by the researcher's intuition and as calculated by IRT from the students' responses. The table reveals that, out of 60 items, only 23 had the same difficulty level under the researcher's intuition and under the test response results.



Table 8. Item Difficulty Level (Intuition versus IRT perspective)

Item     By Researcher's   By Test        Item     By Researcher's   By Test
Number   Intuition         Responses      Number   Intuition         Responses

1        1st level         1st level      31       3rd level         1st level
2        2nd level         1st level      32       2nd level         1st level
3        3rd level         1st level      33       2nd level         2nd level
4        2nd level         3rd level      34       3rd level         2nd level
5        1st level         2nd level      35       3rd level         2nd level
6        2nd level         2nd level      36       2nd level         1st level
7        3rd level         3rd level      37       3rd level         2nd level
8        1st level         1st level      38       1st level         2nd level
9        2nd level         3rd level      39       2nd level         1st level
10       2nd level         1st level      40       2nd level         1st level
11       3rd level         1st level      41       3rd level         1st level
12       1st level         3rd level      42       3rd level         2nd level
13       1st level         1st level      43       3rd level         1st level
14       2nd level         2nd level      44       3rd level         2nd level
15       2nd level         3rd level      45       2nd level         2nd level
16       3rd level         3rd level      46       2nd level         3rd level
17       3rd level         3rd level      47       3rd level         3rd level
18       3rd level         3rd level      48       3rd level         3rd level
19       1st level         2nd level      49       3rd level         2nd level
20       2nd level         2nd level      50       1st level         2nd level
21       3rd level         2nd level      51       3rd level         3rd level
22       1st level         2nd level      52       3rd level         3rd level
23       2nd level         2nd level      53       2nd level         3rd level
24       2nd level         2nd level      54       2nd level         2nd level
25       3rd level         3rd level      55       2nd level         2nd level
26       1st level         1st level      56       2nd level         1st level
27       2nd level         1st level      57       2nd level         3rd level
28       2nd level         2nd level      58       1st level         2nd level
29       2nd level         3rd level      59       2nd level         3rd level
30       2nd level         2nd level      60       2nd level         2nd level



Conclusions and Recommendations

Conclusion

In view of the findings of this study, the following conclusions were drawn:

1. The results of this study indicate that the developed diagnostic test is valid. It is an instrument that can measure the desired trait of second year students in Mathematics II.

2. In the final form of the test, item difficulty was within the range of -1.37 to 1.35. The diagnostic test therefore has an average level of difficulty.

3. The discrimination indices were within the range of 0.23 to 0.51. Thus, the test is a moderately discriminating instrument.

4. With a KR-20 coefficient of 0.84, the developed diagnostic test has high reliability under the classification of Shrout (1998).

5. Based on the preceding items, the diagnostic test developed in this study is valid, highly reliable and fair. Hence, this test is standardized and can now be used in assessing the Mathematics learning of second year students in DORSHS.

6. The test identified the least learned competencies, the zone of proximal development and the mastered competencies of DORSHS second year high school students in Mathematics II.

7. The study revealed a considerable difference between the intuitive setting of item difficulty and the actual item difficulty shown by the students' results.

Recommendations

Based on the findings and conclusions of this study, the following are the general

recommendations:



1. The standardized diagnostic test can already be used in assessing the Mathematics learning of second year students in DORSHS prior to any Achievement Test review program.

2. Other research, such as determining the coherence of the test with the second year National Achievement Test (NAT), can be undertaken.

3. Further validation of the test using IRT with second year students of other schools, or with another batch of second year students in DORSHS, is highly encouraged.

4. The test and its findings can also be utilized for any comparative study on item analysis between the Classical Test Method and IRT.

5. The results of this study can be utilized for any possible research on students' assessment.



References

Baker, F. (2001). The Basics of Item Response Theory (2nd ed.). United States of America: ERIC Clearinghouse on Assessment and Evaluation.

Bontempo, B. D. (2009). Measurement Art. The Point-Biserial Correlation Coefficient. Retrieved June 30, 2009, from http://www.mountainmeasurement.com/blog/?p=148.

Brannick, M. (2006). Concepts from IRT that Move Beyond Classical Test Theory. Multiple Regression and Research Methods. Retrieved November 22, 2009, from http://luna.cas.usf.edu/~mbrannic/files/pmet/irt.htm.

Calmorin, L. P. (2004). Educational Research Measurement and Evaluation (3rd ed.). Manila, Philippines: National Book Store, Inc.

Cherry, K. (2009). Reliability - What Is Reliability. Retrieved November 20, 2009, from http://psychology.about.com/od/researchmethods/

Davies, A., Arbuckle, M., & Bonneau, D. (2005). Assessment For Learning: Planning for Professional Development. Retrieved October 14, 2009, from http://electronicportfolios.org/afl/Assessment4learning.pdf.

Hambleton, R., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of Item Response Theory. United States of America: SAGE Publications, Inc.

Izard, J. (2005). Quantitative Research Methods in Educational Planning. Overview of Test Construction. Paris, France: International Institute for Educational Planning/UNESCO.

Kim, S. H., Cohen, A. S., & Park, T. H. (1995). Detection of Differential Item Functioning in Multiple Groups. Journal of Educational Measurement, 32: 261-276.

Educational Diagnostic Prescriptive Services (2009). Educational Diagnostic Prescriptive. Retrieved October 20, 2009, from http://homeschoolcreations.blogspot.com/2009/09/educational.

Shrout, P. E. (1998). Measurement Reliability and Agreement in Psychiatry. Statistics Methods in Medical Results. United States of America: SAGE Publications, Inc.

Vygotsky, L. S. (1978). Mind and society: The development of higher psychological processes. Cambridge, MA: Harvard University Press.
