What is Evaluation?

Evaluation is the process of examining a program or process to determine what's working, what's not, and why.

Evaluation determines the value of programs and acts as a blueprint for judgment and improvement.

(Rossett & Sheldon, 2001)

Types of Evaluations in Instructional Design

Evaluations are normally divided into two broad categories: formative and summative.

Formative

A formative evaluation (sometimes referred to as internal) is a method for judging the worth of a program while the program activities are forming (in progress). This part of the evaluation focuses on the process.

Thus, formative evaluations are basically done on the fly. They permit the designers, learners, and instructors to monitor how well the instructional goals and objectives are being met. Their main purpose is to catch deficiencies so that the proper learning interventions can take place, allowing the learners to master the required skills and knowledge.

Formative evaluation is also useful in analyzing learning materials, student learning and achievements, and teacher effectiveness.... Formative evaluation is primarily a building process which accumulates a series of components of new materials, skills, and problems into an ultimate meaningful whole. - Wally Guyot (1978)

Summative

A summative evaluation (sometimes referred to as external) is a method of judging the worth of a program at the end of the program activities (summation). The focus is on the outcome.

All assessments can be summative (i.e., have the potential to serve a summative function), but only some have the additional capability of serving formative functions. - Scriven (1967)

The various instruments used to collect the data are questionnaires, surveys, interviews, observations, and testing. The model or methodology used to gather the data should be a specified step-by-step procedure. It should be carefully designed and executed to ensure the data is accurate and valid.

Questionnaires are the least expensive procedure for external evaluations and can be used to collect large samples of graduate information. The questionnaires should be trialed (tested) before use to ensure the recipients understand their operation the way the designer intended. When designing questionnaires, keep in mind that the most important feature is the guidance given for their completion. All instructions should be clearly stated... let nothing be taken for granted.

History of the Two Evaluations

Scriven (1967) first suggested a distinction between formative evaluation and summative evaluation when describing two major functions of evaluation. Formative evaluation was intended to foster development and improvement within an ongoing activity (or person, product, program, etc.). Summative evaluation, in contrast, is used to assess whether the results of the object being evaluated (program, intervention, person, etc.) met the stated goals.

Scriven saw the need to distinguish the formative and summative roles of curriculum evaluation. While Scriven preferred summative evaluations — performing a final evaluation of the project or person, he did come to acknowledge Cronbach's merits of formative evaluation — part of the process of curriculum development used to improve the course while it is still fluid (he believed it contributes more to the improvement of education than evaluation used to appraise a product).

Later, Misanchuk (1978) delivered a paper on the need to tighten the definitions in order to get more accurate measurements. The point that seems to cause the greatest disagreement is whether fluid changes should be kept strictly to the prerelease versions (before the material reaches the target population).

Paul Saettler's (1990) history of instructional technology describes the two evaluations (pp. 430-431) in the context of how they were used in developing Sesame Street and The Electric Company by the Children's Television Workshop (CTW). CTW used formative evaluations for identifying and defining program designs that could provide reliable predictors of learning for particular learners. They later used summative evaluations to prove their efforts (to quite good effect, I might add). While Saettler praises CTW for a significant landmark in the technology of instructional design, he warns that it is still tentative and should be seen more as a point of departure than a fixed formula.

Saettler defines the two types of evaluations as: 1) formative is used to refine goals and evolve strategies for achieving goals, while 2) summative is undertaken to test the validity of a theory or determine the impact of an educational practice so that future efforts may be improved or modified.

Thus, using Misanchuk's defining terms will normally achieve more accurate measurements; however, the cost is quite high as it is highly resource intensive, particularly with time because of all the pre-work that has to be performed in the design phase: create, trial, redo, trial, redo, trial, redo, etc.; and all preferably without using the target population.

However, most organizations are demanding shorter design times. Thus the formative part is moved over to other methods, such as rapid prototyping, with testing and evaluation methods used to improve the design as one moves along. This is of course not as accurate, but it is more appropriate for most organizations, as they are not really interested in accurate measurements of the content but rather in the end product: skilled and knowledgeable workers.

Misanchuk's defining terms basically put all the water in a container for accurate measurements, while the typical organization estimates the volume of water running in a stream.

Thus, if you are a vendor, a researcher, or someone who needs highly accurate measurements, you will probably define the two evaluations in the same manner as Misanchuk. If you need to push the training/learning out faster and are not all that worried about highly accurate measurements, then you will define them closer to how most organizations do and Saettler does with the CTW example.

Kirkpatrick's Four Level Evaluation Model

Perhaps the best known evaluation methodology for judging learning processes is Donald Kirkpatrick's Four Level Evaluation Model, first published as a series of articles in 1959 in the Journal of the American Society of Training Directors (now known as T+D Magazine). The series was later compiled and published as an article, Techniques for Evaluating Training Programs, in a book Kirkpatrick edited,

Evaluating Training Programs (1975). However, it was not until his 1994 book, Evaluating Training Programs, was published that the four levels became popular. Nowadays, his four levels remain a cornerstone in the learning industry.

While most people refer to the four criteria for evaluating learning processes as “levels,” Kirkpatrick never used that term; he normally called them “steps” (Craig, 1996). In addition, he did not call it a model, but used words such as “techniques for conducting the evaluation” (Craig, 1996, p. 294).

The four steps of evaluation consist of:

Step 1: Reaction - How well did the learners like the learning process?

Step 2: Learning - What did they learn? (the extent to which the learners gain knowledge and skills)

Step 3: Behavior - What changes in job performance resulted from the learning process? (capability to perform the newly learned skills while on the job)

Step 4: Results - What are the tangible results of the learning process in terms of reduced cost, improved quality, increased production, efficiency, etc.?

Kirkpatrick's concept is quite important, as it makes an excellent planning, evaluating, and troubleshooting tool, especially if we make some slight improvements, as shown below.

Not Just For Training

While some mistakenly assume the four levels are only for training processes, the model can be used for other learning processes. For example, the Human Resource Development (HRD) profession is concerned not only with helping to develop formal learning, such as training, but also with other forms, such as informal learning, development, and education (Nadler, 1984). Their handbook, edited by one of the founders of HRD, Leonard Nadler (1984), uses Kirkpatrick's four levels as one of its main evaluation models.

Kirkpatrick himself wrote, “These objectives [referring to his article] will be related to in-house classroom programs, one of the most common forms of training. Many of the principles and procedures applies to all kinds of training activities, such as performance review, participation in outside programs, programmed instruction, and the reading of selected books” (Craig, 1996, p294).

Improving the Four Levels

Because of its age and all the new technology advances, Kirkpatrick's model is often criticized for being too old and simple. Yet, almost five decades after its introduction, there has not been a viable option to replace it. I believe the reason is that Kirkpatrick basically nailed it, but he did get a few things wrong:

Motivation, Not Reaction

When a learner goes through a learning process, such as an e-learning course, an informal learning episode, or the use of a job performance aid, the learner has to decide whether he or she will pay attention to it. If the goal or task is judged as important and doable, then the learner is normally motivated to engage in it (Markus & Ruvolo, 1990). However, if the task is presented as low-relevance or there is a low probability of success, then a negative effect is generated and motivation for task engagement is low. In addition, research on Reaction evaluations generally shows that they are not a valid measurement of success (see the last section, Criticisms).

This differs from Kirkpatrick (1996), who wrote that reaction was how well the learners liked a particular learning process. However, the less relevant the learning package is to a learner, the more effort has to be put into the design and presentation of the learning package. That is, if it is not relevant to the learner, then the learning package has to hook the learner through slick design, humor, games, etc. This is not to say that design, humor, or games are unimportant; however, their use in a learning package should be to promote or aid the learning process rather than just make it fun. And if a learning package is built on sound purpose and design, then it should support the learners in bridging a performance gap. Hence, they should be motivated to learn; if not, something went dreadfully wrong during the planning and design processes! If you find yourself having to hook the learners through slick design, then you probably need to reevaluate the purpose of your learning processes.

Performance, Not Behavior

As Gilbert (1998) noted, performance is a better objective than behavior because performance has two aspects: behavior being the means and its consequence being the end... and it is the end we are mostly concerned with.

Flipping it into a Better Model

The model is upside down, as it places the two most important items last (results and behavior), which basically imprints the importance of that order in most people's heads. Thus, by flipping it upside down and adding the above changes, we get:

Result - What impact (outcome or result) will improve our business?

Performance - What do the employees have to perform in order to create the desired impact?

Learning - What knowledge, skills, and resources do they need in order to perform? (courses or classrooms are the LAST answer, see Selecting the Instructional Setting)

Motivation - What do they need to perceive in order to learn and perform? (Do they see a need for the desired performance?)

This makes it both a planning and evaluation tool, which can also be used as a troubleshooting heuristic (Chyung, 2008).

Compare Criterion- and Norm-Referenced Tests for MR Students

What are the advantages and disadvantages of norm- and criterion-related assessments and formative and summative evaluations?

"When the cook tastes the soup, that's formative; when the guests taste the soup, that's summative."

Norm- and criterion-related assessments MUST be used in both types of evaluation. Criterion-referenced refers to how our student measures up to some standard set by an outside source.

For example, a criterion might be being able to jump a certain height, or to read a certain set of words, which is probably not realistic in the case of a mentally retarded (MR) student. Maybe one criterion would be that the student be able to name his/her colors by a certain time.

Norm-referenced tests are tests which compare the student being tested with all other students. Norm-referenced tests are used to classify students, to place them. MR students by definition do not test as well as other students of the same age.

The advantage of a norm-referenced test is that it shows us how our student is doing relative to other students across the country. A disadvantage is that they are standardized and do not show small increments of gain. They are useful for placement at the beginning and then again four or six months later, or at the end of the year. This will show growth over the period of time.

Norm-referenced (also called standardized) tests, along with informal observational evaluation, are useful for showing student growth over time. They aren't to be used for grading, though they can be one element in a total grade. One must remember we can't expect great growth, if any, over short periods of time, particularly as shown on a norm-referenced test.

The definition of retarded is "slowed." That means that the growth of our students is slowed, but in most cases, for many things and for most students, not stopped. As a matter of fact, it is unlikely even for a "normal" population of students to show much growth after only a semester's time. These tests are not intended for measuring small increments of gain.

Criterion-related tests are nice because we can see just what our student accomplished. So now, after three months, s/he can recognize 35 more words, or maybe 65 more words than s/he could before. The student can name all the colors. The student is now putting away toys where they belong where before s/he either would not or could not.

What is Authentic Assessment?

Definitions

What Does Authentic Assessment Look Like?

How is Authentic Assessment Similar to/Different from Traditional Assessment?

Traditional Assessment
Authentic Assessment
Authentic Assessment Complements Traditional Assessment
Defining Attributes of Authentic and Traditional Assessment
Teaching to the Test

Alternative Names for Authentic Assessment

Definitions

A form of assessment in which students are asked to perform real-world tasks that demonstrate meaningful application of essential knowledge and skills -- Jon Mueller

"...Engaging and worthy problems or questions of importance, in which students must use knowledge to fashion performances effectively and creatively. The tasks are either replicas of or analogous to the kinds of problems faced by adult citizens and consumers or professionals in the field." -- Grant Wiggins -- (Wiggins, 1993, p. 229).

"Performance assessments call upon the examinee to demonstrate specific skills and competencies, that is, to apply the skills and knowledge they have mastered." -- Richard J. Stiggins -- (Stiggins, 1987, p. 34).

 

What does Authentic Assessment look like?

An authentic assessment usually includes a task for students to perform and a rubric by which their performance on the task will be evaluated.

 

How is Authentic Assessment similar to/different from Traditional Assessment?

The following comparison is somewhat simplistic, but I hope it illuminates the different assumptions of the two approaches to assessment.

Traditional Assessment

By "traditional assessment" (TA) I am referring to the forced-choice measures of multiple-choice tests, fill-in-the-blanks, true-false, matching and the like that have been and remain so common in education.  Students typically select an answer or recall information to complete the assessment. These tests may be standardized or teacher-created.  They may be administered locally or statewide, or internationally.

Behind traditional and authentic assessments is a belief that the primary mission of schools is to help develop productive citizens.  That is the essence of most mission statements I have read.  From this common beginning, the two perspectives on assessment diverge.  Essentially, TA is grounded in educational philosophy that adopts the following reasoning and practice:

1. A school's mission is to develop productive citizens.
2. To be a productive citizen, an individual must possess a certain body of knowledge and skills.
3. Therefore, schools must teach this body of knowledge and skills.
4. To determine if it is successful, the school must then test students to see if they acquired the knowledge and skills.

In the TA model, the curriculum drives assessment.   "The" body of knowledge is determined first.  That knowledge becomes the curriculum that is delivered.  Subsequently, the assessments are developed and administered to determine if acquisition of the curriculum occurred.

Authentic Assessment

In contrast, authentic assessment (AA) springs from the following reasoning and practice:

1. A school's mission is to develop productive citizens.
2. To be a productive citizen, an individual must be capable of performing meaningful tasks in the real world.
3. Therefore, schools must help students become proficient at performing the tasks they will encounter when they graduate.
4. To determine if it is successful, the school must then ask students to perform meaningful tasks that replicate real-world challenges to see if students are capable of doing so.

Thus, in AA, assessment drives the curriculum.  That is, teachers first determine the tasks that students will perform to demonstrate their mastery, and then a curriculum is developed that will enable students to perform those tasks well, which would include the acquisition of essential knowledge and skills.  This has been referred to as planning backwards (e.g., McDonald, 1992).

If I were a golf instructor and I taught the skills required to perform well, I would not assess my students' performance by giving them a multiple choice test.  I would put them out on the golf course and ask them to perform.  Although this is obvious with athletic skills, it is also true for academic subjects.  We can teach students how to do math, do history and do science, not just know them.  Then, to assess what our students had learned, we can ask students to perform tasks that "replicate the challenges" faced by those using mathematics, doing history or conducting scientific investigation.

Authentic Assessment Complements Traditional Assessment

But a teacher does not have to choose between AA and TA. It is likely that some mix of the two will best meet your needs. To use a silly example, if I had to choose a chauffeur from between someone who passed the driving portion of the driver's license test but failed the written portion or someone who failed the driving portion and passed the written portion, I would choose the driver who most directly demonstrated the ability to drive, that is, the one who passed the driving portion of the test. However, I would prefer a driver who passed both portions. I would feel more comfortable knowing that my chauffeur had a good knowledge base about driving (which might best be assessed in a traditional manner) and was able to apply that knowledge in a real context (which could be demonstrated through an authentic assessment).

Defining Attributes of Traditional and Authentic Assessment

Another way that AA is commonly distinguished from TA is in terms of its defining attributes. Of course, TAs as well as AAs vary considerably in the forms they take. But, typically, along the continuums of attributes listed below, TAs fall more towards the left end of each continuum and AAs fall more towards the right end.

 

Traditional --------------------------------------------- Authentic

Selecting a Response ------------------------------------ Performing a Task

Contrived --------------------------------------------------------------- Real-life

Recall/Recognition ------------------------------- Construction/Application

Teacher-structured ------------------------------------- Student-structured

Indirect Evidence -------------------------------------------- Direct Evidence

 

Let me clarify the attributes by elaborating on each in the context of traditional and authentic assessments:

Selecting a Response to Performing a Task: On traditional assessments, students are typically given several choices (e.g., a,b,c or d; true or false; which of these match with those) and asked to select the right answer. In contrast, authentic assessments ask students to demonstrate understanding by performing a more complex task usually representative of more meaningful application.

Contrived to Real-life: It is not very often in life outside of school that we are asked to select from four alternatives to indicate our proficiency at something. Tests offer these contrived means of assessment to increase the number of times you can be asked to demonstrate proficiency in a short period of time. More commonly in life, as in authentic assessments, we are asked to demonstrate proficiency by doing something.

Recall/Recognition of Knowledge to Construction/Application of Knowledge: Well-designed traditional assessments (i.e., tests and quizzes) can effectively determine whether or not students have acquired a body of knowledge. Thus, as mentioned above, tests can serve as a nice complement to authentic assessments in a teacher's assessment portfolio. Furthermore, we are often asked to recall or recognize facts and ideas and propositions in life, so tests are somewhat authentic in that sense. However, the demonstration of recall and recognition on tests is typically much less revealing about what we really know and can do than when we are asked to construct a product or performance out of facts, ideas and propositions. Authentic assessments often ask students to analyze, synthesize and apply what they have learned in a substantial manner, and students create new meaning in the process as well.

Teacher-structured to Student-structured: When completing a traditional assessment, what a student can and will demonstrate has been carefully structured by the person(s) who developed the test. A student's attention will understandably be focused on and limited to what is on the test. In contrast, authentic assessments allow more student choice and construction in determining what is presented as evidence of proficiency. Even when students cannot choose their own topics or formats, there are usually multiple acceptable routes towards constructing a product or performance. Obviously, assessments more carefully controlled by the teachers offer advantages and disadvantages. Similarly, more student-structured tasks have strengths and weaknesses that must be considered when choosing and designing an assessment.

Indirect Evidence to Direct Evidence: Even if a multiple-choice question asks a student to analyze or apply facts to a new situation rather than just recall the facts, and the student selects the correct answer, what do you now know about that student? Did that student get lucky and pick the right answer? What thinking led the student to pick that answer? We really do not know. At best, we can make some inferences about what that student might know and might be able to do with that knowledge. The evidence is very indirect, particularly for claims of meaningful application in complex, real-world situations. Authentic assessments, on the other hand, offer more direct evidence of application and construction of knowledge. As in the golf example above, putting a golf student on the golf course to play provides much more direct evidence of proficiency than giving the student a written test. Can a student effectively critique the arguments someone else has presented (an important skill often required in the real world)? Asking a student to write a critique should provide more direct evidence of that skill than asking the student a series of multiple-choice, analytical questions about a passage, although both assessments may be useful.

Teaching to the Test

These two different approaches to assessment also offer different advice about teaching to the test.  Under the TA model, teachers have been discouraged from teaching to the test.  That is because a test usually assesses a sample of students' knowledge and understanding and assumes that students' performance on the sample is representative of their knowledge of all the relevant material.  If teachers focus primarily on the sample to be tested during instruction, then good performance on that sample does not necessarily reflect knowledge of all the material.   So, teachers hide the test so that the sample is not known beforehand, and teachers are admonished not to teach to the test.

With AA, teachers are encouraged to teach to the test.  Students need to learn how to perform well on meaningful tasks.  To aid students in that process, it is helpful to show them models of good (and not so good) performance.  Furthermore, the student benefits from seeing the task rubric ahead of time as well.  Is this "cheating"?  Will students then just be able to mimic the work of others without truly understanding what they are doing?  Authentic assessments typically do not lend themselves to mimicry.  There is not one correct answer to copy.  So, by knowing what good performance looks like, and by knowing what specific characteristics make up good performance, students can better develop the skills and understanding necessary to perform well on these tasks. (For further discussion of teaching to the test, see Bushweller.)

 

Alternative Names for Authentic Assessment

You can also learn something about what AA is by looking at the other common names for this form of assessment. For example, AA is sometimes referred to as

Performance Assessment (or Performance-based) -- so-called because students are asked to perform meaningful tasks. This is the other most common term for this type of assessment. Some educators distinguish performance assessment from AA by defining performance assessment as performance-based, as Stiggins does above, but with no reference to the authentic nature of the task (e.g., Meyer, 1992). For these educators, authentic assessments are performance assessments using real-world or authentic tasks or contexts. Since we should not typically ask students to perform work that is not authentic in nature, I choose to treat these two terms synonymously.

Alternative Assessment -- so-called because AA is an alternative to traditional assessments.

Direct Assessment -- so-called because AA provides more direct evidence of meaningful application of knowledge and skills. If a student does well on a multiple-choice test we might infer indirectly that the student could apply that knowledge in real-world contexts, but we would be more comfortable making that inference from a direct demonstration of that application such as in the golfing example above.

Types of Tests

Norm-Referenced

Standardized tests compare students' performance to that of a norming or sample group who are in the same grade or are of the same age. Students' performance is communicated in percentile ranks, grade-equivalent scores, normal-curve equivalents, scaled scores, or stanine scores.

Examples: Iowa Tests; SAT; DRP; ACT

Criterion-Referenced

A student's performance is measured against a standard. One form of criterion-referenced assessment is the benchmark, a description of a key task that students are expected to perform.

Examples: DIBELS; Chapter tests; Driver's License Test; FCAT (Florida Comprehensive Assessment Test)

Survey

Survey tests typically provide an overview of general comprehension and word knowledge.

Examples: Interest surveys; KWL; Learning Styles Inventory

Diagnostic Tools

Diagnostic tests assess a number of areas in greater depth.

Examples: Woodcock-Johnson®; BRI; "The Fox in the Box"

Formal Tests

Formal tests may be standardized. They are designed to be given according to a standard set of circumstances, they have time limits, and they have sets of directions which are to be followed exactly.

Examples: SAT; FCAT; ACT

Informal Tests

Informal tests generally do not have a set of standard directions. They have a great deal of flexibility in how they are administered. They are constructed by teachers and have unknown validity and reliability.

Examples: Review games; Quizzes

Static (Summative) Tests

Measures what the student has learned.

Examples: End-of-chapter tests; Final examinations; Standardized state tests

Dynamic (Formative) Tests

Measures the students' grasp of material that is currently being taught. Can also measure readiness. Formative tests help guide and inform instruction and learning.

Examples: Quizzes; Homework; Portfolios

Lawrence Kohlberg's stages of moral development constitute an adaptation of a psychological theory originally conceived by the Swiss psychologist Jean Piaget. Kohlberg began work on this topic while a psychology graduate student at the University of Chicago [1] in 1958, and expanded and developed this theory throughout his life.

The theory holds that moral reasoning, the basis for ethical behavior, has six identifiable developmental stages, each more adequate at responding to moral dilemmas than its predecessor.[2] Kohlberg followed the development of moral judgment far beyond the ages studied earlier by Piaget,[3] who also claimed that logic and morality develop through constructive stages.[2] Expanding on Piaget's work, Kohlberg determined that the process of moral development was principally concerned with justice, and that it continued throughout the individual's lifetime,[4] a notion that spawned dialogue on the philosophical implications of such research.[5][6]

The six stages of moral development are grouped into three levels: pre-conventional morality, conventional morality, and post-conventional morality.

Stages

Kohlberg's six stages can be more generally grouped into three levels of two stages each: pre-conventional, conventional and post-conventional.[7][8][9] Following Piaget's constructivist requirements for a stage model, as described in his theory of cognitive development, it is extremely rare to regress in stages—to lose the use of higher stage abilities.[14][15] Stages cannot be skipped; each provides a new and necessary perspective, more comprehensive and differentiated than its predecessors but integrated with them.[14][15]

Level 1 (Pre-Conventional)
1. Obedience and punishment orientation (How can I avoid punishment?)
2. Self-interest orientation (What's in it for me?) (Paying for a benefit)

Level 2 (Conventional)
3. Interpersonal accord and conformity (Social norms) (The good boy/girl attitude)
4. Authority and social-order maintaining orientation (Law and order morality)

Level 3 (Post-Conventional)
5. Social contract orientation
6. Universal ethical principles (Principled conscience)

The understanding gained in each stage is retained in later stages, but may be regarded by those in later stages as simplistic, lacking in sufficient attention to detail.

Pre-conventional

The pre-conventional level of moral reasoning is especially common in children, although adults can also exhibit this level of reasoning. Reasoners at this level judge the morality of an action by its direct consequences. The pre-conventional level consists of the first and second stages of moral development, and is solely concerned with the self in an egocentric manner. A child with pre-conventional morality has not yet adopted or internalized society's conventions regarding what is right or wrong, but instead focuses largely on external consequences that certain actions may bring.[7][8][9]

In Stage one (obedience and punishment driven), individuals focus on the direct consequences of their actions on themselves. For example, an action is perceived as morally wrong because the perpetrator is punished. "The last time I did that I got spanked, so I will not do it again." The worse the punishment for the act is, the more "bad" the act is perceived to be.[16] This can give rise to an inference that even innocent victims are guilty in proportion to their suffering. It is "egocentric," lacking recognition that others' points of view are different from one's own.[17] There is "deference to superior power or prestige."[17]

An example of obedience and punishment driven morality would be a child refusing to do something because it is wrong and the consequences could result in punishment. For example, a child's classmate dares the child to play hooky from school. The child applies obedience and punishment driven morality by refusing to play hooky because he would get punished. Another example is a child refusing to cheat on a test because the child would get punished.

Stage two (self-interest driven) expresses the "what's in it for me" position, in which right behavior is defined by whatever the individual believes to be in their best interest, but understood in a narrow way which does not consider one's reputation or relationships to groups of people. Stage two reasoning shows a limited interest in the needs of others, but only to a point where it might further the individual's own interests. As a result, concern for others is not based on loyalty or intrinsic respect, but rather a "you scratch my back, and I'll scratch yours" mentality.[2] The lack of a societal perspective in the pre-conventional level is quite different from the social contract (stage five), as all actions have the purpose of serving the individual's own needs or interests. For the stage two theorist, the world's perspective is often seen as morally relative.

An example of self-interest driven reasoning is when a child is asked by his parents to do a chore. The child asks, "What's in it for me?" The parents offer the child an incentive by giving the child an allowance to pay them for their chores. The child is motivated by self-interest to do the chores. Another example of self-interest driven reasoning is when a child does their homework in exchange for better grades and rewards from their parents.

Conventional

The conventional level of moral reasoning is typical of adolescents and adults. To reason in a conventional way is to judge the morality of actions by comparing them to society's views and expectations. The conventional level consists of the third and fourth stages of moral development. Conventional morality is characterized by an acceptance of society's conventions concerning right and wrong. At this level an individual obeys rules and follows society's norms even when there are no consequences for obedience or disobedience. Adherence to rules and conventions is somewhat rigid, however, and a rule's appropriateness or fairness is seldom questioned.[7][8][9]

In Stage three (good intentions as determined by social consensus), the self enters society by conforming to social standards. Individuals are receptive to approval or disapproval from others as it reflects society's views. They try to be a "good boy" or "good girl" to live up to these expectations,[2] having learned that being regarded as good benefits the self. Stage three reasoning may judge the morality of an action by evaluating its consequences in terms of a person's relationships, which now begin to include things like respect, gratitude and the "golden rule". "I want to be liked and thought well of; apparently, not being naughty makes people like me." Conforming to the rules for one's social role is not yet fully understood. The intentions of actors play a more significant role in reasoning at this stage; one may feel more forgiving if one thinks, "they mean well ..."[2]

In Stage four (authority and social order obedience driven), it is important to obey laws, dictums and social conventions because of their importance in maintaining a functioning society. Moral reasoning in stage four is thus beyond the need for individual approval exhibited in stage three. A central ideal or ideals often prescribe what is right and wrong. If one person violates a law, perhaps everyone would — thus there is an obligation and a duty to uphold laws and rules. When someone does violate a law, it is morally wrong; culpability is thus a significant factor in this stage as it separates the bad domains from the good ones. Most active members of society remain at stage four, where morality is still predominantly dictated by an outside force.[2]

Post-Conventional

The post-conventional level, also known as the principled level, is marked by a growing realization that individuals are separate entities from society, and that the individual’s own perspective may take precedence over society’s view; individuals may disobey rules inconsistent with their own principles. Post-conventional moralists live by their own ethical principles — principles that typically include such basic human rights as life, liberty, and justice. People who exhibit post-conventional morality view rules as useful but changeable mechanisms — ideally rules can maintain the general social order and protect human rights. Rules are not absolute dictates that must be obeyed without question. Because post-conventional individuals elevate their own moral evaluation of a situation over social conventions, their behavior, especially at stage six, can be confused with that of those at the pre-conventional level.

Some theorists have speculated that many people may never reach this level of abstract moral reasoning.[7][8][9]

In Stage five (social contract driven), the world is viewed as holding different opinions, rights and values. Such perspectives should be mutually respected as unique to each person or community. Laws are regarded as social contracts rather than rigid edicts. Those that do not promote the general welfare should be changed when necessary to meet “the greatest good for the greatest number of people."[8] This is achieved through majority decision and inevitable compromise. Democratic government is ostensibly based on stage five reasoning.

In Stage six (universal ethical principles driven), moral reasoning is based on abstract reasoning using universal ethical principles. Laws are valid only insofar as they are grounded in justice, and a commitment to justice carries with it an obligation to disobey unjust laws. Legal rights are unnecessary, as social contracts are not essential for deontic moral action. Decisions are not reached hypothetically in a conditional way but rather categorically in an absolute way, as in the philosophy of Immanuel Kant.[18] This involves an individual imagining what they would do in another’s shoes, if they believed what that other person imagines to be true.[19] The resulting consensus is the action taken. In this way action is never a means but always an end in itself; the individual acts because it is right, and not because it avoids punishment, is in their best interest, expected, legal, or previously agreed upon. Although Kohlberg insisted that stage six exists, he found it difficult to identify individuals who consistently operated at that level.[15]

Montessori education is an educational approach developed by Italian physician and educator Maria Montessori and characterized by an emphasis on independence, freedom within limits, and respect for a child’s natural psychological, physical, and social development. Although a range of practices exists under the name "Montessori", the Association Montessori Internationale (AMI) and the American Montessori Society (AMS) cite these elements as essential:[2][3]

Mixed age classrooms, with classrooms for children ages 2½ or 3 to 6 years old by far the most common
Student choice of activity from within a prescribed range of options
Uninterrupted blocks of work time, ideally three hours
A constructivist or "discovery" model, where students learn concepts from working with materials, rather than by direct instruction
Specialized educational materials developed by Montessori and her collaborators
Freedom of movement within the classroom
A trained Montessori teacher

Montessori education is fundamentally a model of human development, and an educational approach based on that model. The model has two basic principles. First, children and developing adults engage in psychological self-construction by means of interaction with their environments. Second, children, especially under the age of six, have an innate path of psychological development. Based on her observations, Montessori believed that children at liberty to choose and act freely within an environment prepared according to her model would act spontaneously for optimal development.

Understanding by Design, or UbD, is a tool utilized for educational planning focused on "teaching for understanding" advocated by Jay McTighe and Grant Wiggins in their Understanding by Design (1998), published by the Association for Supervision and Curriculum Development.[1][2] The emphasis of UbD is on "backward design", the practice of looking at the outcomes in order to design curriculum units, performance assessments, and classroom instruction.[3]

"Understanding by Design" and "UbD" are registered trademarks of the Association for Supervision and Curriculum Development ("ASCD"). According to Wiggins, "The potential of UbD for curricular improvement has struck a chord in American education. Over 250,000 educators own the book. Over 30,000 Handbooks are in use. More than 150 University education classes use the book as a text."[1] As defined by Wiggins and McTighe, Understanding by Design is a "framework for designing curriculum units, performance assessments, and instruction that lead your students to deep understanding of the content you teach,"[4] UbD expands on "six facets of understanding", which include students being able to explain, interpret, apply, have perspective, empathize, and have self-knowledge about a given topic.[5]

Understanding by Design relies on what Wiggins and McTighe call "backward design" (also known as "backwards planning"). Teachers, according to UbD proponents, traditionally start curriculum planning with activities and textbooks instead of identifying classroom learning goals and planning towards that goal. In backward design, the teacher starts with classroom outcomes and then plans the curriculum, choosing activities and materials that help determine student ability and foster student learning.[6]

The backward design approach is developed in three stages. Stage 1 starts with educators identifying the desired results for their students by establishing the overall goal of the lessons using content standards, Common Core, or state standards. In addition, UbD's Stage 1 defines "Students will understand that..." and lists essential questions that will guide the learner to understanding. Stage 1 also focuses on identifying "what students will know" and, most importantly, "what students will be able to do".

Difficulty Index - Teachers produce a difficulty index for a test item by calculating the proportion of students in class who got the item correct. (The name of this index is counter-intuitive, as one actually gets a measure of how easy the item is, not how difficult.) The larger the proportion, the more students have learned the content measured by the item.

C. Item Analysis

After you create your objective assessment items and give your test, how can you be sure that the items are appropriate -- not too difficult and not too easy? How will you know if the test effectively differentiates between students who do well on the overall test and those who do not? An item analysis is a valuable, yet relatively easy, procedure that teachers can use to answer both of these questions.

To determine the difficulty level of test items, a measure called the Difficulty Index is used. This measure asks teachers to calculate the proportion of students who answered the test item accurately. By looking at each alternative (for multiple choice), we can also find out if there are answer choices that should be replaced. For example, let's say you gave a multiple choice quiz and there were four answer choices (A, B, C, and D). The following table illustrates how many students selected each answer choice for Question #1 and #2.

Question    A     B     C     D
#1          0     3     24*   3
#2          12*   13    3     2

* Denotes correct answer.

For Question #1, we can see that A was not a very good distractor -- no one selected that answer. We can also compute the difficulty of the item by dividing the number of students who chose the correct answer (24) by the total number of students (30). Using this formula, the difficulty of Question #1 (referred to as p) is equal to 24/30, or .80. A rough "rule-of-thumb" is that if the item difficulty is more than .75, it is an easy item; if the difficulty is below .25, it is a difficult item. Given these parameters, this item could be regarded as moderately easy -- lots (80%) of students got it correct. In contrast, Question #2 is much more difficult (12/30 = .40). In fact, on Question #2, more students selected an incorrect answer (B) than selected the correct answer (A). This item should be carefully analyzed to ensure that B is an appropriate distractor.
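To make the arithmetic concrete, here is a minimal Python sketch that reproduces these p-values from the response counts and flags unused distractors. The function and variable names are my own illustrations, not anything from the source.

```python
# Item difficulty (p-value) for the two quiz questions above.
# Response counts are copied from the worked example.

def difficulty_index(correct_count, total_students):
    """Proportion of students who answered the item correctly."""
    return correct_count / total_students

responses = {
    "#1": {"A": 0, "B": 3, "C": 24, "D": 3},   # correct answer: C
    "#2": {"A": 12, "B": 13, "C": 3, "D": 2},  # correct answer: A
}
keys = {"#1": "C", "#2": "A"}

for item, counts in responses.items():
    total = sum(counts.values())  # 30 students in this class
    p = difficulty_index(counts[keys[item]], total)
    print(f"Question {item}: p = {p:.2f}")
    # A distractor nobody picks adds no information; flag it.
    for choice, n in counts.items():
        if choice != keys[item] and n == 0:
            print(f"  Choice {choice} drew no responses; consider replacing it.")
```

Running this prints p = 0.80 for Question #1 (flagging choice A) and p = 0.40 for Question #2, matching the hand calculation above.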

Another measure, the Discrimination Index, refers to how well an assessment differentiates between high and low scorers. In other words, you should be able to expect that the high-performing students would select the correct answer for each question more often than the low-performing students.  If this is true, then the assessment is said to have a positive discrimination index (between 0 and 1) -- indicating that students who received a high total score chose the correct answer for a specific item more often than the students who had a lower overall score. If, however, you find that more of the low-performing students got a specific item correct, then the item has a negative discrimination index (between -1 and 0). Let's look at an example.

Table 2 displays the results of ten questions on a quiz. Note that the students are arranged with the top overall scorers at the top of the table.

Student   Total Score (%)   Q1   Q2   Q3

Asif 90 1 0 1

Sam 90 1 0 1

Jill 80 0 0 1

Charlie 80 1 0 1

Sonya 70 1 0 1

Ruben 60 1 0 0

Clay 60 1 0 1

Kelley 50 1 1 0

Justin 50 1 1 0

Tonya 40 0 1 0

"1" indicates the answer was correct; "0" indicates it was incorrect.

Follow these steps to determine the Difficulty Index and the Discrimination Index.

1. After the students are arranged with the highest overall scores at the top, count the number of students in the upper and lower group who got each item correct. For Question #1, there were 4 students in the top half who got it correct, and 4 students in the bottom half.

2. Determine the Difficulty Index by dividing the number who got it correct by the total number of students. For Question #1, this would be 8/10 or p=.80.

3. Determine the Discrimination Index by subtracting the number of students in the lower group who got the item correct from the number of students in the upper group who got the item correct.  Then, divide by the number of students in each group (in this case, there are five in each group). For Question #1, that means you would subtract 4 from 4, and divide by 5, which results in a Discrimination Index of  0.

4. The answers for Questions 1-3 are provided in Table 2.

Item         # Correct (Upper group)   # Correct (Lower group)   Difficulty (p)   Discrimination (D)
Question 1   4                         4                         .80              0
Question 2   0                         3                         .30              -0.6
Question 3   5                         1                         .60              0.8

Now that we have the table filled in, what does it mean? We can see that Question #2 had a difficulty index of .30 (meaning it was quite difficult), and it also had a negative discrimination index of -0.6 (meaning that the low-performing students were more likely to get this item correct).  This question should be carefully analyzed, and probably deleted or changed. Our "best" overall question is Question 3, which had a moderate difficulty level (.60), and discriminated extremely well (0.8).
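As a cross-check, the following Python sketch reproduces the table using the upper/lower-group method from steps 1-4. The scores are copied from the quiz table above; the variable names are my own.

```python
# Difficulty (p) and discrimination (D) via the upper/lower-group method.
# Students are listed highest overall score first, as in the table.

students = [  # (name, total score %, [Q1, Q2, Q3])
    ("Asif", 90, [1, 0, 1]), ("Sam", 90, [1, 0, 1]),
    ("Jill", 80, [0, 0, 1]), ("Charlie", 80, [1, 0, 1]),
    ("Sonya", 70, [1, 0, 1]), ("Ruben", 60, [1, 0, 0]),
    ("Clay", 60, [1, 0, 1]), ("Kelley", 50, [1, 1, 0]),
    ("Justin", 50, [1, 1, 0]), ("Tonya", 40, [0, 1, 0]),
]

half = len(students) // 2
upper, lower = students[:half], students[half:]

for q in range(3):
    upper_correct = sum(s[2][q] for s in upper)
    lower_correct = sum(s[2][q] for s in lower)
    p = (upper_correct + lower_correct) / len(students)  # Difficulty Index
    d = (upper_correct - lower_correct) / half           # Discrimination Index
    print(f"Question {q + 1}: p = {p:.2f}, D = {d:.1f}")

# Output: Question 1: p = 0.80, D = 0.0
#         Question 2: p = 0.30, D = -0.6
#         Question 3: p = 0.60, D = 0.8
```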

Another consideration for an item analysis is the cognitive level that is being assessed. For example, you might categorize the questions based on Bloom's taxonomy (perhaps grouping questions that address Level I and those that address Level II). In this manner, you would be able to determine if the difficulty index and discrimination index of those groups of questions are appropriate. For example, you might note that the majority of the questions that demand higher levels of thinking skills are too difficult or do not discriminate well. You could then concentrate on improving those questions and focus your revisions where they are most needed.

A. Bloom's Taxonomy

Questions (items) on quizzes and exams can demand different levels of thinking skills. For example, some questions might be simple memorization of facts, and others might require the ability to synthesize information from several sources to select or construct a response. Benjamin Bloom created a hierarchy of cognitive skills (called Bloom's taxonomy) that is often used to categorize the levels of cognitive involvement (thinking skills) in educational settings. The taxonomy provides a good structure to assist teachers in writing objectives and assessments. It can be divided into two levels -- Level I (the lower level) contains knowledge and comprehension; Level II (the higher level) includes application, analysis, synthesis, and evaluation (see the diagram below).

Figure 1. Bloom's Taxonomy.

Bloom's taxonomy is also used to guide the development of standardized assessments. For example, in Florida, about 65% of the questions on the statewide reading test (FCAT) are designed to measure Level II thinking skills (application, analysis, synthesis, and evaluation). To prepare students for these standardized tests, classroom assessments must also demand both Level I and II thinking skills. Integrating higher level skills into instruction and assessment increases the likelihood that students will succeed on tests and become better problem solvers.

Sometimes objective tests (such as multiple choice) are criticized because the questions emphasize only lower-level thinking skills (such as knowledge and comprehension). However, it is possible to address higher-level thinking skills via objective assessments by including items that focus on genuine understanding -- "how" and "why" questions. Multiple choice items that involve scenarios, case studies, and analogies are also effective for requiring students to apply, analyze, synthesize, and evaluate information.

B. Writing Selected Response Assessment Items

Selected response (objective) assessment items are very efficient – once the items are created, you can assess and score a great deal of content rather quickly. Note that the term objective refers to the fact that each question has a right and wrong answer and can be impartially scored. In fact, the scoring can be automated if you have access to an optical scanner for scoring paper tests or a computer for computerized tests. However, the construction of these “objective” items might well include subjective input by the teacher/creator.

Before you write the assessment items, you should create a blueprint that outlines the content areas and the cognitive skills you are targeting.  One way to do this is to list your instructional objectives, along with the corresponding cognitive level. For example, the following table has four different objectives and the corresponding levels of assessment (relative to Bloom's taxonomy). For each objective, five assessment items will be written, some at Level I and some at Level II. This approach helps to ensure that all objectives are covered and that several higher level thinking skills are included in the assessment.

Objective   Number of Items at Level I (Bloom's Taxonomy)   Number of Items at Level II (Bloom's Taxonomy)
1           2                                               3
2           3                                               2
3           1                                               4
4           4                                               1
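A blueprint like this can also be kept in code so the counts are easy to check before any items are written. The following sketch is purely illustrative; the structure and names are my own assumptions, not anything from the source.

```python
# A test blueprint mapping each objective to its planned item counts
# at Bloom's Level I and Level II.

blueprint = {
    # objective: (items at Level I, items at Level II)
    1: (2, 3),
    2: (3, 2),
    3: (1, 4),
    4: (4, 1),
}

level_i = sum(li for li, _ in blueprint.values())
level_ii = sum(lii for _, lii in blueprint.values())
print(f"Level I items: {level_i}, Level II items: {level_ii}, "
      f"total: {level_i + level_ii}")

# Per the plan above, each objective gets five items,
# with higher-level skills represented throughout.
assert all(li + lii == 5 for li, lii in blueprint.values())
```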

After you have determined how many items you need for each level, you can begin writing the assessments. There are several forms of selected response assessments, including multiple choice, matching, and true/false. Regardless of the form you select, be sure the items are clearly worded at the appropriate reading level and do not include unintentional clues.  The validity of your test will suffer tremendously if the students can’t comprehend or read the questions! This section includes a few guidelines for constructing objective assessment items, along with examples and non-examples.

Multiple Choice

Multiple choice questions consist of a stem (question or statement) with several answer choices; the incorrect choices are called distractors. A few guidelines:

All answer choices should be plausible and homogeneous.

Answer choices should be similar in length and grammatical form.

List answer choices in logical (alphabetical or numerical) order.

Avoid using "All of the Above" options.

Matching

Matching items consist of two lists of words, phrases, or images (often referred to as stems and responses). Students review the list of stems and match each with a word, phrase, or image from the list of responses.

Answer choices should be short, homogeneous, and arranged in logical order.

Responses should be plausible and similar in length and grammatical form.

Include more response options than stems.

As a general rule, the stems should be longer and the responses should be shorter.

True/False

True/false questions can appear to be easier to write; however, it is difficult to write effective true/false questions. Also, the reliability of T/F questions is not generally very high because of the high possibility of guessing. In most cases, T/F questions are not recommended.

Statements should be completely true or completely false.

Use simple, easy-to-follow statements.

Avoid using negatives -- especially double negatives.

Avoid absolutes such as "always" and "never."

Step 9. Conduct the Item Analysis

Introduction

The item analysis is an important phase in the development of an exam program. In this phase statistical methods are used to identify any test items that are not working well. If an item is too easy, too difficult, failing to show a difference between skilled and unskilled examinees, or even scored incorrectly, an item analysis will reveal it. The two most common statistics reported in an item analysis are the item difficulty, which is a measure of the proportion of examinees who responded to an item correctly, and the item discrimination, which is a measure of how well the item discriminates between examinees who are knowledgeable in the content area and those who are not. An additional analysis that is often reported is the distractor analysis. The distractor analysis provides a measure of how well each of the incorrect options contributes to the quality of a multiple choice item. Once the item analysis information is available, an item review is often conducted.

Item Analysis Statistics

Item Difficulty Index

The item difficulty index is one of the most useful, and most frequently reported, item analysis statistics. It is a measure of the proportion of examinees who answered the item correctly; for this reason it is frequently called the p-value. As the proportion of examinees who got the item right, the p-value might more properly be called the item easiness index, rather than the item difficulty. It can range between 0.0 and 1.0, with a higher value indicating that a greater proportion of examinees responded to the item correctly, and it was thus an easier item. For criterion-referenced tests (CRTs), with their emphasis on mastery-testing, many items on an exam form will have p-values of .9 or above. Norm-referenced tests (NRTs), on the other hand, are designed to be harder overall and to spread out the examinees' scores. Thus, many of the items on an NRT will have difficulty indexes between .4 and .6.
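To make the computation concrete, here is a minimal sketch in Python; the scored response matrix is invented for illustration:

```python
# Minimal sketch: item difficulty (p-value) = proportion of examinees
# who answered the item correctly. The 0/1 matrix below is invented:
# rows are examinees, columns are items.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 1],
]

num_examinees = len(responses)
for item in range(len(responses[0])):
    p = sum(row[item] for row in responses) / num_examinees
    print(f"Item {item + 1}: p = {p:.2f}")
```

Item 1 here has p = 0.8, so it was easy for this (tiny) group of examinees.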

Item Discrimination Index

The item discrimination index is a measure of how well an item is able to distinguish between examinees who are knowledgeable and those who are not, or between masters and non-masters. There are actually several ways to compute an item discrimination, but one of the most common is the point-biserial correlation. This statistic looks at the relationship between an examinee's performance on the given item (correct or incorrect) and the examinee's score on the overall test. For an item that is highly discriminating, in general the examinees who responded to the item correctly also did well on the test, while in general the examinees who responded to the item incorrectly also tended to do poorly on the overall test.

The possible range of the discrimination index is -1.0 to 1.0; however, if an item has a discrimination below 0.0, it suggests a problem. When an item is discriminating negatively, overall the most knowledgeable examinees are getting the item wrong and the least knowledgeable examinees are getting the item right. A negative discrimination index may indicate that the item is measuring something other than what the rest of the test is measuring. More often, it is a sign that the item has been mis-keyed.

When interpreting the value of a discrimination it is important to be aware that there is a relationship between an item's difficulty index and its discrimination index. If an item has a very high (or very low) p-value, the potential value of the discrimination index will be much less than if the item has a mid-range p-value. In other words, if an item is either very easy or very hard, it is not likely to be very discriminating. A typical CRT, with many high item p-values, may have most item discriminations in the range of 0.0 to 0.3. A useful approach when reviewing a set of item discrimination indexes is to also view each item's p-value at the same time. For example, if a given item has a discrimination index below .1, but the item's p-value is greater than .9, you may interpret the item as being easy for almost the entire set of examinees, and probably for that reason not providing much discrimination between high ability and low ability examinees.
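As a rough illustration of the point-biserial approach described above, the following Python sketch correlates a 0/1 item score with the total test score; the eight examinees' scores are invented:

```python
import math

def point_biserial(item_scores, total_scores):
    """Pearson correlation between a dichotomous item (0/1) and total scores.

    In practice the item is often excluded from the total before
    correlating (the "corrected" point-biserial); this sketch skips that.
    """
    n = len(item_scores)
    mean_i = sum(item_scores) / n
    mean_t = sum(total_scores) / n
    cov = sum((i - mean_i) * (t - mean_t) for i, t in zip(item_scores, total_scores))
    var_i = sum((i - mean_i) ** 2 for i in item_scores)
    var_t = sum((t - mean_t) ** 2 for t in total_scores)
    return cov / math.sqrt(var_i * var_t)

# Invented data: 8 examinees' scores on one item and on the whole test.
item = [1, 1, 0, 1, 0, 1, 0, 0]
total = [9, 8, 4, 7, 5, 9, 3, 2]
print(f"discrimination = {point_biserial(item, total):.2f}")  # about 0.92
```

Here the examinees who got the item right are also the high scorers overall, so the index is strongly positive.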

Distractor Analysis

One important element in the quality of a multiple choice item is the quality of the item's distractors. However, neither the item difficulty nor the item discrimination index considers the performance of the incorrect response options, or distractors. A distractor analysis addresses the performance of these incorrect response options.

Just as the key, or correct response option, must be definitively correct, the distractors must be clearly incorrect (or clearly not the "best" option). In addition to being clearly incorrect, the distractors must also be plausible. That is, the distractors should seem likely or reasonable to an examinee who is not sufficiently knowledgeable in the content area. If a distractor appears so unlikely that almost no examinee will select it, it is not contributing to the performance of the item. In fact, the presence of one or more implausible distractors in a multiple choice item can make the item artificially far easier than it ought to be.

In a simple approach to distractor analysis, the proportion of examinees who selected each of the response options is examined. For the key, this proportion is equivalent to the item p-value, or difficulty. If the proportions are summed across all of an item's response options they will add up to 1.0, or 100% of the examinees' selections.

The proportion of examinees who select each of the distractors can be very informative. For example, it can reveal an item mis-key. Whenever the proportion of examinees who selected a distractor is greater than the proportion of examinees who selected the key, the item should be examined to determine if it has been mis-keyed or double-keyed. A distractor analysis can also reveal an implausible distractor. In CRTs, where the item p-values are typically high, the proportions of examinees selecting all the distractors are, as a result, low. Nevertheless, if examinees consistently fail to select a given distractor, this may be evidence that the distractor is implausible or simply too easy.
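A simple distractor analysis can likewise be sketched in a few lines of Python; the option choices and the key below are invented for illustration:

```python
from collections import Counter

# Invented responses to one multiple choice item; the key is 'B'.
key = "B"
choices = list("BBACBBDBBABBCBBB")

counts = Counter(choices)
n = len(choices)
for option in "ABCD":
    share = counts[option] / n
    flag = " (key)" if option == key else ""
    print(f"Option {option}: {share:.0%}{flag}")

# A distractor chosen more often than the key suggests a possible mis-key;
# a distractor almost never chosen may simply be implausible.
most_common = counts.most_common(1)[0][0]
if most_common != key:
    print(f"Warning: option {most_common} outdraws the key -- check the scoring key.")
```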

Item Review

Once the item analysis data are available, it is useful to hold a meeting of test developers, psychometricians, and subject matter experts. During this meeting the items can be reviewed using the information provided by the item analysis statistics. Decisions can then be made about item changes that are needed or even items that ought to be dropped from the exam. Any item that has been substantially changed should be returned to the bank for pretesting before it is again used operationally. Once these decisions have been made, the exams should be rescored, leaving out any items that were dropped and using the correct key for any items that were found to have been mis-keyed. This corrected scoring will be used for the examinees' score reports.

Summary

In the item analysis phase of test development, statistical methods are used to identify potential item problems. The statistical results should be used along with substantive attention to the item content to determine if a problem exists and what should be done to correct it. Items that are functioning very poorly should usually be removed from consideration and the exams rescored before the test results are released. In other cases, items may still be usable, after modest changes are made to improve their performance on future exams.

In statistics, a bimodal distribution is a continuous probability distribution with two different modes. These appear as two distinct peaks (local maxima) in the probability density function.

How to Compute Mean, Median, Mode, Range, and Standard Deviation

In statistics and data analysis, the mean, median, mode, range, and standard deviation tell researchers how the data is distributed. Each of the five measures can be calculated with simple arithmetic. The mean and median indicate the "center" of the data points. The mode is the value or values that occur most frequently. The range is the span between the smallest value and the largest value. The standard deviation measures how far the data "deviates" from the center, on average. Knowing how to calculate these statistical measures will help you analyze data from surveys and experiments.

Mean

The arithmetic mean, or average, of a set of numbers is the expected value. The mean is calculated by adding up all the values, and then dividing that sum by the number of values.

For example, suppose a teacher has seven students and records the following seven test scores for her class: 98, 96, 96, 84, 81, 81, and 73. The average test score is

(98+96+96+84+81+81+73)/7 = 609/7 = 87.

If one more student entered her class and took the test, the expected score would be an 87.

Median

The median is the middle value in a set of values. To find the median, order the numbers from largest to smallest, and then choose the value in the middle. For example, consider the following set of nine numbers:

10, 13, 4, 25, 8, 12, 9, 19, 18

If we arrange them in descending order, we get

25, 19, 18, 13, 12, 10, 9, 8, 4

The middle value is 12, so the median = 12. What if we have a set with an even number of values? For example, consider the set

1, 2, 3, 4, 5, 6.

Both 3 and 4 are in the middle. In this case, we must take the average of the two middle numbers. Since (3+4)/2 = 3.5, the median = 3.5.

Mode

The mode of a set is the value or values that occur most frequently. There can be more than one mode in a set. If there is more than one mode, you simply list all of the modes; you do not have to average them. For example, consider the set

10, 10, 4, 8, 10, 8, 3, 9, 14

The number 10 occurs three times, and no other numbers occur as frequently. Therefore, the mode = 10.

Now consider this set

10, 10, 4, 8, 10, 8, 3, 8, 14

Both 10 and 8 occur three times each, and no other numbers occur as often. Therefore, the modes are 8 and 10.

Range

The range of a set of numbers is the maximum distance between any two values. In other words, it's the difference between the largest and smallest values. Knowing the range gives you an idea of how close together the data points are. For example, consider the set of test scores

78, 88, 67, 90, 92, 83, 97

The highest test score is 97 and the lowest is 67, so the range is 97 - 67 = 30.
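All four of these measures are easy to check with Python's standard library; the sketch below reuses the numbers from the examples above (note that statistics.multimode requires Python 3.8 or later):

```python
import statistics

print(statistics.mean([98, 96, 96, 84, 81, 81, 73]))          # 87
print(statistics.median([10, 13, 4, 25, 8, 12, 9, 19, 18]))   # 12
print(statistics.median([1, 2, 3, 4, 5, 6]))                  # 3.5
print(statistics.multimode([10, 10, 4, 8, 10, 8, 3, 8, 14]))  # [10, 8]

test_scores = [78, 88, 67, 90, 92, 83, 97]
print(max(test_scores) - min(test_scores))                    # range: 97 - 67 = 30
```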

Standard Deviation

The standard deviation is another way to measure how close together the elements are in a set of data. The s.d. is the average distance between each data point and the mean. Knowing the standard deviation gives a more complete picture of the distribution of elements in a data set. Suppose you have N data points, you label them X1, X2, X3, ... XN, and you call the mean x̄. There are two formulas for the standard deviation, depending on whether your data is a complete set or a sample taken from a larger set.

For example, suppose your data is all of the ACT scores of the students in a small class. Then the data is a complete set, and the standard deviation formula is

s.d. = sqrt[((X1 - x̄)² + (X2 - x̄)² + ... + (XN - x̄)²)/N]

Suppose the scores are 15, 21, 21, 21, 25, 30, and 35. The mean of this set is 24. The s.d. is

sqrt[((15-24)² + (21-24)² + (21-24)² + (21-24)² + (25-24)² + (30-24)² + (35-24)²)/7] = sqrt[266/7] = sqrt[38] ≈ 6.16

If you take a random sample of ACT scores from a large school, the standard deviation formula divides by N - 1 instead:

s.d. = sqrt[((X1 - x̄)² + (X2 - x̄)² + ... + (XN - x̄)²)/(N - 1)]

For example, suppose you select ten students at random from a high school, and their ACT scores are 17, 20, 24, 25, 26, 26, 29, 29, 30 and 32. The average of this set is 25.8. The standard deviation is

sqrt[((17-25.8)² + (20-25.8)² + (24-25.8)² + (25-25.8)² + (26-25.8)² + (26-25.8)² + (29-25.8)² + (29-25.8)² + (30-25.8)² + (32-25.8)²)/(10-1)] = sqrt[191.6/9] = sqrt[21.29] ≈ 4.61
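Both formulas are available in Python's standard statistics module, which can be used to verify the two worked examples (pstdev divides by N for a complete population, stdev by N - 1 for a sample):

```python
import statistics

# Complete set: every ACT score in the small class.
class_scores = [15, 21, 21, 21, 25, 30, 35]
print(round(statistics.pstdev(class_scores), 2))   # 6.16 (divides by N)

# Random sample from a larger group: use the sample formula.
sample_scores = [17, 20, 24, 25, 26, 26, 29, 29, 30, 32]
print(round(statistics.stdev(sample_scores), 2))   # 4.61 (divides by N - 1)
```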

Mean, Mode, Median, and Standard Deviation

 

The Mean and Mode

The sample mean is the average and is computed as the sum of all the observed outcomes from the sample divided by the total number of events.  We use x̄ (read "x-bar") as the symbol for the sample mean.  In math terms,

        x̄ = (x1 + x2 + ... + xn)/n

where n is the sample size and the xi correspond to the observed values.

 

Example

Suppose you randomly sampled six acres in the Desolation Wilderness for a non-indigenous weed and came up with the following counts of this weed in this region:

        34, 43, 81, 106, 106 and 115 

We compute the sample mean by adding and dividing by the number of samples, 6.

        (34 + 43 + 81 + 106 + 106 + 115)/6 = 485/6 ≈ 80.83

We can say that the sample mean count of the non-indigenous weed is 80.83.

The mode of a set of data is the number with the highest frequency.  In the above example 106 is the mode, since it occurs twice and the rest of the outcomes occur only once.

The population mean is the average of the entire population and is usually impossible to compute. We use the Greek letter μ (mu) for the population mean.

 

 

Median and Trimmed Mean

One problem with using the mean is that it often does not depict the typical outcome.  If there is one outcome that is very far from the rest of the data, then the mean will be strongly affected by this outcome.  Such an outcome is called an outlier.  An alternative measure is the median.  The median is the middle score.  If we have an even number of events, we take the average of the two middle values.  The median is better for describing the typical value.  It is often used for income and home prices.

Example

Suppose you randomly selected 10 house prices in the South Lake Tahoe area.  You are interested in the typical house price.  In units of $100,000, the prices were

        2.7,   2.9,   3.1,   3.4,   3.7,  4.1,   4.3,   4.7,  4.7,  40.8

If we computed the mean, we would say that the average house price is $744,000.  Although this number is true, it does not reflect the price of typical available housing in South Lake Tahoe.  A closer look at the data shows that the house valued at 40.8 x $100,000 = $4.08 million skews the data.  Instead, we use the median.  Since there is an even number of outcomes, we take the average of the middle two:

      (3.7 + 4.1)/2 = 3.9

The median house price is $390,000.  This better reflects what house shoppers should expect to spend.

        

There is an alternative measure that is also resistant to outliers.  This is called the trimmed mean, which is the mean computed after removing the extreme values -- for example, the top 5% and the bottom 5% of the data.  We can use the trimmed mean if we are concerned about outliers skewing the data; however, the median is used more often, since more people understand it.
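The following Python sketch contrasts the three measures on the house-price data above; the trimmed_mean helper is a hand-rolled illustration, not a library function:

```python
import statistics

# House prices in units of $100,000, from the example above.
prices = [2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8]

print(statistics.mean(prices))    # 7.44 -> $744,000, inflated by the outlier
print(statistics.median(prices))  # 3.9  -> $390,000

def trimmed_mean(data, proportion=0.05):
    """Mean after dropping the given proportion of values from each end."""
    data = sorted(data)
    k = int(len(data) * proportion)
    trimmed = data[k:len(data) - k] if k else data
    return statistics.mean(trimmed)

# With only 10 values, 5% trimming drops nothing (int(0.5) == 0), so a 10%
# trim is used here to drop one value from each end, removing the outlier:
print(round(trimmed_mean(prices, 0.10), 2))  # 3.86
```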

Example:

At a ski rental shop data was collected on the number of rentals on each of ten consecutive Saturdays: 

        44, 50, 38, 96, 42, 47, 40, 39, 46, 50.

 

To find the sample mean, add them and divide by 10:

         (44 + 50 + 38 + 96 + 42 + 47 + 40 + 39 + 46 + 50)/10 = 492/10 = 49.2

Notice that the mean value is not a value of the sample.

To find the median, first sort the data:

        38, 39, 40, 42, 44, 46, 47, 50, 50, 96

Notice that there are two middle numbers 44 and 46.  To find the median we take the average of the two.

        Median = (44 + 46)/2 = 45

Notice also that the mean is larger than all but three of the data points.  The mean is influenced by outliers while the median is robust.

Variance,  Standard Deviation and Coefficient of Variation

The mean, mode, median, and trimmed mean do a nice job of telling where the center of the data set is, but often we are interested in more.  For example, a pharmaceutical engineer develops a new drug that regulates sugar in the blood.  Suppose she finds out that the average sugar content after taking the medication is at the optimal level.  This does not mean that the drug is effective.  There is a possibility that half of the patients have dangerously low sugar content while the other half have dangerously high content.  Instead of being an effective regulator, the drug is a deadly poison.  What the engineer needs is a measure of how far the data is spread apart.  This is what the variance and standard deviation do.  First we show the formulas for these measurements.  Then we will go through the steps on how to use the formulas.

We define the sample variance to be

        s² = ((x1 - x̄)² + (x2 - x̄)² + ... + (xn - x̄)²)/(n - 1)

and the standard deviation to be

        s = sqrt[((x1 - x̄)² + (x2 - x̄)² + ... + (xn - x̄)²)/(n - 1)]

Variance and Standard Deviation: Step by Step

1. Calculate the mean, x̄.
2. Write a table that subtracts the mean from each observed value.
3. Square each of the differences.
4. Add this column.
5. Divide by n - 1, where n is the number of items in the sample.  This is the variance.
6. To get the standard deviation, take the square root of the variance.
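Translated directly into Python, the six steps look like this; for illustration, the sketch uses the ten restaurant receipts analyzed in the example that follows:

```python
import math

def sample_variance_and_sd(data):
    n = len(data)
    mean = sum(data) / n                     # step 1: calculate the mean
    diffs = [x - mean for x in data]         # step 2: subtract the mean
    squared = [d ** 2 for d in diffs]        # step 3: square the differences
    total = sum(squared)                     # step 4: add the column
    variance = total / (n - 1)               # step 5: divide by n - 1
    return variance, math.sqrt(variance)     # step 6: square root -> s.d.

receipts = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]
var, sd = sample_variance_and_sd(receipts)
print(round(var, 1), round(sd, 1))   # about 288.8 and 17.0
```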

 

Example

The owner of the Ches Tahoe restaurant is interested in how much people spend at the restaurant.  He examines 10 randomly selected receipts for parties of four and writes down the following data.

        44,   50,   38,   96,   42,   47,   40,   39,   46,   50

He calculated the mean by adding and dividing by 10 to get

        x  =  49.2

Below is the table for getting the standard deviation:

 

x        x - 49.2    (x - 49.2)²

44        -5.2          27.04

50         0.8           0.64

38       -11.2         125.44

96        46.8        2190.24

42        -7.2          51.84

47        -2.2           4.84

40        -9.2          84.64

39       -10.2         104.04

46        -3.2          10.24

50         0.8           0.64

Total                 2599.60

 

Now 

        2599.6/(10 - 1) = 288.84

Hence the variance is approximately 289 and the standard deviation is the square root of 289, which is 17.

Since the standard deviation can be thought of as measuring how far the data values lie from the mean, we take the mean and move one standard deviation in either direction.  The mean for this example was about 49.2 and the standard deviation was 17.  We have:

49.2 - 17 = 32.2

 and 49.2 + 17 = 66.2 

What this means is that most of the patrons probably spend between $32.20 and $66.20.

 

The sample standard deviation will be denoted by s and the population standard deviation will be denoted by the Greek letter σ (sigma).

The sample variance will be denoted by s² and the population variance will be denoted by σ².

The variance and standard deviation describe how spread out the data is.  If the data all lies close to the mean, then the standard deviation will be small, while if the data is spread out over a large range of values, s will be large.  Having outliers will increase the standard deviation.

One of the flaws of the standard deviation is that it depends on the units that are used.  One way of handling this difficulty is the coefficient of variation, which is the standard deviation divided by the mean, times 100%:

        CV = (s / x̄) × 100%

In the above example, it is 

        CV = (17 / 49.2) × 100% ≈ 34.6%

This tells us that the standard deviation of the restaurant bills is 34.6% of the mean.
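A quick check in Python; the slight difference from 34.6% comes from using the unrounded standard deviation:

```python
import statistics

receipts = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]
cv = statistics.stdev(receipts) / statistics.mean(receipts) * 100
print(round(cv, 1))  # 34.5 -- the text's 34.6% uses the rounded s = 17
```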

 

Chebyshev's Theorem

A mathematician named Chebyshev came up with bounds on how much of the data must lie close to the mean.  In particular for any positive k, the proportion of the data that lies within k standard deviations of the mean is at least

        1 - 1/k²

For example, if k  =  2 this number is

        1 - 1/2² = 0.75

This tells us that at least 75% of the data lies within two standard deviations of the mean.  In the above example, we can say that at least 75% of the diners spent between

        49.2 - 2(17)  =  15.2

and

        49.2 + 2(17)  =  83.2 

dollars.
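Chebyshev's bound is simple enough to wrap in a small helper; this sketch reproduces the restaurant calculation above:

```python
def chebyshev_interval(mean, sd, k):
    """At least (1 - 1/k**2) of the data lies within k standard deviations."""
    proportion = 1 - 1 / k ** 2
    return proportion, (mean - k * sd, mean + k * sd)

prop, (low, high) = chebyshev_interval(49.2, 17, 2)
print(f"At least {prop:.0%} of bills fall between ${low:.2f} and ${high:.2f}")
# At least 75% of bills fall between $15.20 and $83.20
```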

Skewed Data

Data can be "skewed", meaning it tends to have a long tail on one side or the other:

(Figure: distributions with negative skew, no skew, and positive skew)

 

Negative Skew?

Why is it called negative skew? Because the long "tail" is on the negative side of the peak.

People sometimes say it is "skewed to the left" (the long tail is on the left hand side)

The mean is also on the left of the peak.

The Normal Distribution has No Skew

A Normal Distribution is not skewed.

It is perfectly symmetrical.

And the Mean is exactly at the peak.

Positive Skew

And positive skew is when the long tail is on the positive side of the peak, and some people say it is "skewed to the right".

The mean is on the right of the peak value.

 

Example: Income Distribution

Here is some data extracted from a recent Census: it is positively skewed, and in fact the tail continues way past $100,000.

Calculating Skewness

"Skewness" (the amount of skew) can be calculated, for example you could use the SKEW() function in Excel or OpenOffice Calc.

Normal Distribution

The normal distribution is a continuous probability distribution for a random variable X. The normal distribution is defined by the following equation:

Normal equation. The value of the density Y is:

Y = [1/(σ · sqrt(2π))] · e^(-(x - μ)²/(2σ²))

where X is a normal random variable, μ is the mean, σ is the standard deviation, π is approximately 3.14159, and e is approximately 2.71828.

The graph of the normal distribution depends on two factors - the mean and the standard deviation. The mean of the distribution determines the location of the center of the graph, and the standard deviation determines the height and width of the graph. When the standard deviation is large, the curve is short and wide; when the standard deviation is small, the curve is tall and narrow. All normal distributions look like a symmetric, bell-shaped curve, as shown below.

The curve on the left is shorter and wider than the curve on the right, because the curve on the left has a bigger standard deviation.
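The normal equation above can be evaluated directly; this sketch shows how a larger σ flattens the peak, consistent with the short-and-wide description:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution at x."""
    coeff = 1 / (sigma * math.sqrt(2 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# The peak is taller for a smaller standard deviation:
print(round(normal_pdf(0, mu=0, sigma=1), 4))  # 0.3989
print(round(normal_pdf(0, mu=0, sigma=2), 4))  # 0.1995
```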

The Rorschach test (/ˈrɔːrʃɑːk/ or /ˈrɔərʃɑːk/,[3] German pronunciation: [ˈʁoːɐ̯ʃax]; also known as the Rorschach inkblot test, the Rorschach technique, or simply the inkblot test) is a psychological test in which subjects' perceptions of inkblots are recorded and then analyzed using psychological interpretation, complex algorithms, or both. Some psychologists use this test to examine a person's personality characteristics and emotional functioning. It has been employed to detect underlying thought disorder, especially in cases where patients are reluctant to describe their thinking processes openly.[4] The test is named after its creator, Swiss psychologist Hermann Rorschach.

In the 1960s, the Rorschach was the most widely used projective test.[5] In a national survey in the U.S., the Rorschach was ranked eighth among psychological tests used in outpatient mental health facilities.[6] It is the second most widely used test by members of the Society for Personality Assessment, and it is requested by psychiatrists in 25% of forensic assessment cases,[6] usually in a battery of tests that often include the MMPI-2 and the MCMI-III.[7] In surveys, the use of Rorschach ranges from a low of 20% by correctional psychologists [8] to a high of 80% by clinical psychologists engaged in assessment services, and 80% of psychology graduate programs surveyed teach it.[9]

Although the Exner Scoring System (developed since the 1960s) claims to have addressed and often refuted many criticisms of the original testing system with an extensive body of research,[10] some researchers continue to raise questions. The areas of dispute include the objectivity of testers, inter-rater reliability, the verifiability and general validity of the test, bias of the test's pathology scales towards greater numbers of responses, the limited number of psychological conditions which it accurately diagnoses, the inability to replicate the test's norms, its use in court-ordered evaluations, and the proliferation of the ten inkblot images, potentially invalidating the test for those who have been exposed to them.[11]


Existence of God

There are several main positions with regard to the existence of God that one might take:

1. Theism - the belief in the existence of one or more divinities or deities.
   1. Pantheism - the belief that God exists as all things of the cosmos, that God is one and all is God; God is immanent.
   2. Panentheism - the belief that God encompasses all things of the cosmos but that God is greater than the cosmos; God is both immanent and transcendent.
   3. Deism - the belief that God does exist but does not interfere with human life and the laws of the universe; God is transcendent.
   4. Monotheism - the belief that a single deity exists which rules the universe as a separate and individual entity.
   5. Polytheism - the belief that multiple deities exist which rule the universe as separate and individual entities.
   6. Henotheism - the belief that multiple deities may or may not exist, though there is a single supreme deity.
   7. Henology - the belief that multiple avatars of a deity exist, which represent unique aspects of the ultimate deity.

2. Agnosticism - the belief that the existence or non-existence of deities or God is currently unknown or unknowable and cannot be proven. A weaker form of this might be defined as simply a lack of certainty about gods' existence or nonexistence.

3. Atheism - the rejection of belief in the existence of deities.[12][13]
   1. Strong atheism is specifically the position that there are no deities.[14][15]
   2. Weak atheism is simply the absence of belief that any deities exist.[15][16][17]

4. Apatheism - the lack of caring whether any supreme being exists or not.

5. Possibilianism

These are not mutually exclusive positions. For example, agnostic theists choose to believe God exists while asserting that knowledge of God's existence is inherently unknowable. Similarly, agnostic atheists reject belief in the existence of all deities, while asserting that whether any such entities exist or not is inherently unknowable.

Hinduism, Buddhism, Confucianism, and Taoism

The four major religions of the Far East are Hinduism, Buddhism, Confucianism, and Taoism.

Hinduism

Hinduism, a polytheistic religion and perhaps the oldest of the great world religions, dates back about 6,000 years. Hinduism comprises so many different beliefs and rituals that some sociologists have suggested thinking of it as a grouping of interrelated religions.

Hinduism teaches the concept of reincarnation—the belief that all living organisms continue eternally in cycles of birth, death, and rebirth. Similarly, Hinduism teaches the caste system, in which a person's previous incarnations determine that person's hierarchical position in this life. Each caste comes with its own set of responsibilities and duties, and how well a person executes these tasks in the current life determines that person's position in the next incarnation.

Hindus acknowledge the existence of both male and female gods, but they believe that the ultimate divine energy exists beyond these descriptions and categories. The divine soul is present and active in all living things.

More than 600 million Hindus practice the religion worldwide, though most reside in India. Unlike Muslims and Christians, Hindus do not usually proselytize (attempt to convert others to their religion).

Buddhism, Confucianism, and Taoism

Three other religions of the Far East include Buddhism, Confucianism, and Taoism. These ethical religions have no gods like Yahweh or Allah, but espouse ethical and moral principles designed to improve the believer's relationship with the universe.

Buddhism originates in the teachings of the Buddha, or the “Enlightened One” (Siddhartha Gautama)—a 6th century B.C. Hindu prince of southern Nepal. Humans, according to the Buddha, can escape the cycles of reincarnation by renouncing their earthly desires and seeking a life of meditation and self‐discipline. The ultimate objective of Buddhism is to attain Nirvana, which is a state of total spiritual satisfaction. Like Hinduism, Buddhism allows religious divergence. Unlike it, though, Buddhism rejects ritual and the caste system. While a global religion, Buddhism today is most commonly found in such areas of the Far East as China, Japan, Korea, Sri Lanka, Thailand, and Burma. A recognized “denomination” of Buddhism is Zen Buddhism, which attempts to transmit the ideas of Buddhism without requiring acceptance of all of the teachings of Buddha.

Confucius, or K'ung Futzu, lived at the same time as the Buddha. Confucius's followers, like those of Lao‐tzu, the founder of Taoism, saw him as a moral teacher and wise man—not a religious god, prophet, or leader. Confucianism's main goal is the attainment of inner harmony with nature. This includes the veneration of ancestors. Early on, the ruling classes of China widely embraced Confucianism. Taoism shares similar principles with Confucianism. The teachings of Lao‐tzu stress the importance of meditation and nonviolence as means of reaching higher levels of existence. While some Chinese still practice Confucianism and Taoism, these religions have lost much of their impetus due to resistance from today's Communist government. However, some concepts of Taoism, like reincarnation, have found an expression in modern “New Age” religions.

