Page 1: Comprehensive Exam Review

Comprehensive Exam Review


Page 2: Comprehensive Exam Review

Appraisal: Part 2


Page 3: Comprehensive Exam Review

Statistical Concepts for Appraisal

Page 4: Comprehensive Exam Review

A frequency distribution is a tabulation of scores in numerical order showing the number of persons who obtain each score or group of scores.

A frequency distribution is usually described in terms of its measures of central tendency (i.e., mean, median, and mode), range, and standard deviation.

Page 5: Comprehensive Exam Review

The (arithmetic) mean is the sum of a set of scores divided by the number of scores.

The median is the middle score or point above or below which an equal number of ranked scores lie; it corresponds to the 50th percentile.

The mode is the most frequently occurring score or value in a distribution of scores.

The range is the arithmetic difference between the lowest and the highest scores obtained on a test by a given group.
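These four descriptive measures can be sketched in a few lines of Python, using a small, hypothetical set of test scores:

```python
from statistics import mean, median, mode

# A small, hypothetical frequency distribution of test scores.
scores = [2, 3, 3, 4, 5, 5, 5, 6, 7]

print(mean(scores))               # arithmetic mean: sum divided by count
print(median(scores))             # middle ranked score (50th percentile): 5
print(mode(scores))               # most frequently occurring score: 5
print(max(scores) - min(scores))  # range: highest minus lowest: 5
```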

Page 6: Comprehensive Exam Review

The standard deviation is a measure of the variability in a set of scores (i.e., frequency distribution).

The standard deviation is the square root of the average of the squared deviations around the mean (i.e., the square root of the variance for the set of scores).

Variability is the dispersion or spread of a set of scores; it is usually discussed in terms of standard deviations.
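The variance and standard deviation definitions above translate directly into code. This minimal sketch uses hypothetical scores and the population (divide-by-N) form of the variance:

```python
from math import sqrt

def population_variance(scores):
    """Average of the squared deviations around the mean."""
    m = sum(scores) / len(scores)
    return sum((x - m) ** 2 for x in scores) / len(scores)

def population_sd(scores):
    """Standard deviation: the square root of the variance."""
    return sqrt(population_variance(scores))

scores = [4, 5, 5, 6, 10]   # hypothetical; mean is 6
print(population_variance(scores))  # 4.4
print(population_sd(scores))        # ≈ 2.0976
```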

Page 7: Comprehensive Exam Review

The normal distribution curve is a bell-shaped curve derived from the assumption that variations from the mean are by chance, as determined through repeated occurrences in the frequency distributions of sets of measurements of human characteristics in the behavioral sciences.

Scores are symmetrically distributed above and below the mean, with the percentage of scores decreasing as the scores progress away from the mean in standard deviation units.

Page 8: Comprehensive Exam Review

Skewness is the degree to which a distribution curve with one mode departs horizontally from symmetry, resulting in a positively or negatively skewed curve.

A positive skew is when the “tail” of the curve is on the right and the “hump” is on the left.

A negative skew is when the “tail” of the curve is on the left and the “hump” is on the right.

Page 9: Comprehensive Exam Review

Kurtosis is the degree to which a distribution curve with one mode departs vertically from the normal curve (i.e., is more peaked or flatter than normal).

A leptokurtic distribution is one that is more “peaked” than the normal distribution.

A platykurtic distribution is one that is “flatter” than the normal distribution.

Page 10: Comprehensive Exam Review

Percentiles result from dividing the (normal) distribution into one hundred parts, each containing an equal number of scores.

A percentile rank is the percentage of scores that fall below a particular score.

Two different percentiles may represent vastly different numbers of people in the normal distribution, depending on where the percentiles are in the distribution.
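A percentile rank, as defined above, can be computed with a one-line count. The scores here are hypothetical:

```python
def percentile_rank(score, scores):
    """Percentage of scores in the distribution that fall below `score`."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / len(scores)

scores = [10, 12, 12, 15, 18, 20, 21, 25, 25, 30]
print(percentile_rank(20, scores))  # 50.0 (5 of 10 scores fall below 20)
print(percentile_rank(25, scores))  # 70.0
```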

Page 11: Comprehensive Exam Review

Standardization, sometimes called “normalizing,” is the conversion of a distribution of scores so that the mean equals zero and the standard deviation equals 1.0 for a particular sample or population.

“Normalizing” a distribution is appropriate when the sample size is large and the actual distribution is not grossly different from a normal distribution.

Page 12: Comprehensive Exam Review

Standardization, or normalizing, is an intermediate step in the derivation of standardized scores, such as T scores, SAT scores, or Deviation IQs.

Stanines are a system for assigning a score of one through nine for any particular score. Stanines are derived from a distribution having a mean of five and a standard deviation of two.
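The standard-score families mentioned above all rest on the z score. This sketch converts a hypothetical raw score (from a distribution with mean 100 and SD 15, deviation-IQ style) to a z score, a T score, and a stanine; note that the linear stanine conversion shown here is an approximation, since stanines are properly assigned by fixed percentage bands:

```python
def z_score(raw, mean, sd):
    """Standardized score: mean 0, standard deviation 1."""
    return (raw - mean) / sd

def t_score(raw, mean, sd):
    """T score: mean 50, standard deviation 10."""
    return 50 + 10 * z_score(raw, mean, sd)

def stanine(raw, mean, sd):
    """Stanine: mean 5, standard deviation 2, clipped to 1..9 (approximate)."""
    return max(1, min(9, round(5 + 2 * z_score(raw, mean, sd))))

print(z_score(115, 100, 15))   # 1.0
print(t_score(115, 100, 15))   # 60.0
print(stanine(115, 100, 15))   # 7
```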

Page 13: Comprehensive Exam Review

A correlation coefficient is a measure of relationship between two or more variables or attributes that ranges in value from -1.00 (perfect negative relationship) through 0.00 (no relationship) to +1.00 (perfect positive relationship).

A regression coefficient is a weight expressing the linear relationship between a dependent variable and an independent variable in a regression equation.

Page 14: Comprehensive Exam Review

The coefficient of determination is the square of a correlation coefficient. It is used in the interpretation of the percentage of shared variance between two sets of test scores.
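A minimal Python sketch, using hypothetical paired scores, of computing a Pearson correlation coefficient and squaring it to obtain the coefficient of determination:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two sets of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]   # perfectly linear with xs
r = pearson_r(xs, ys)
print(r)       # ≈ 1.0 (perfect positive relationship)
print(r ** 2)  # coefficient of determination: ≈ 100% shared variance
```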

The probability (also known as the alpha) level is the likelihood that a particular statistical result occurred by chance alone.

Page 15: Comprehensive Exam Review

Error of measurement is the discrepancy between the value of an observed score and the value of the corresponding theoretical true score.

The standard error of measurement is an indicator of how closely an observed score compares with the true score. This statistic is derived by computing the standard deviation of the distribution of errors for the given set of scores.
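One common computational form of the standard error of measurement, assuming the test's standard deviation and reliability coefficient are known, is SEM = SD × √(1 − r). A short sketch with hypothetical values:

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability coefficient)."""
    return sd * sqrt(1 - reliability)

# Hypothetical test: SD = 15, reliability coefficient = .91
sem = standard_error_of_measurement(15, 0.91)
print(sem)  # ≈ 4.5
# Roughly 68% of the time, the true score lies within +/- 1 SEM of the
# observed score; e.g., observed 100 -> band of about 95.5 to 104.5.
```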

Page 16: Comprehensive Exam Review

Measurement error variance is the portion of the observed score variance that is attributed to one or more sources of measurement error (i.e., the square of the standard error of measurement).

Random error is an error associated with statistical analyses that is unsystematic, often indirectly observed, and appears to be unrelated to any measurement variables.

Page 17: Comprehensive Exam Review

Differential item functioning is a statistical property of a test item in which, conditional upon total test score or equivalent measure, different groups of test takers have different rates of correct item response.

The item difficulty index is the percentage of a specified group that answers a test item correctly.

Page 18: Comprehensive Exam Review

The item discrimination index is a statistic that indicates the extent to which a test item differentiates between high and low scorers.
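The two item statistics above can be sketched directly, using hypothetical 0/1 (incorrect/correct) responses and the simple upper-group-minus-lower-group form of the discrimination index:

```python
def item_difficulty(responses):
    """Proportion (often reported as a percentage) answering correctly."""
    return sum(responses) / len(responses)

def item_discrimination(upper, lower):
    """Difference in item difficulty between high and low scorers."""
    return item_difficulty(upper) - item_difficulty(lower)

# Hypothetical item: 1 = correct, 0 = incorrect
upper = [1, 1, 1, 1, 0]   # top scorers on the whole test
lower = [1, 0, 0, 0, 0]   # bottom scorers on the whole test
print(item_difficulty(upper + lower))     # 0.5 (half answered correctly)
print(item_discrimination(upper, lower))  # ≈ 0.6 (item favors high scorers)
```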

Extrapolation is the process of estimating values of a function beyond the range of the available data.

Page 19: Comprehensive Exam Review

A confidence interval is the interval between two points on a scale within which a score of interest lies, based on a certain level of probability.

The error of estimate (standard or probable) is the degree to which criterion scores predicted from test scores correspond with the actual scores.

Page 20: Comprehensive Exam Review

The regression effect is the tendency of a predicted score to lie nearer to the mean of its distribution than the score from which it was predicted.

A factor is a hypothetical dimension underlying a psychological construct that is used to describe the construct and intercorrelations associated with it.

Page 21: Comprehensive Exam Review

Factor analysis is a statistical procedure for analyzing intercorrelations among a group of variables, such as test scores, by identifying a set of underlying hypothetical factors and determining the amount of variation in the variables that can be accounted for by the different factors.

The factorial structure is the set of factors resulting from a factor analysis.

Page 22: Comprehensive Exam Review

Reliability

Page 23: Comprehensive Exam Review

The reliability coefficient is an index that indicates the extent to which scores are free from measurement error. It is an approximation of the ratio of true variance to observed score variance for a particular population of test takers.

Reliability is the degree to which an individual would obtain the same score on a test if the test were re-administered to the individual with no intervening learning or practice effects.

Page 24: Comprehensive Exam Review

The coefficient of equivalence is a correlation between scores for two forms of a test given at essentially the same time; also referred to as alternate-form reliability, a measure of the extent to which two equivalent or parallel forms of a test are consistent in what they measure.

The coefficient of stability is a correlation between scores on two administrations of a test, such as test administration and retest with some intervening time period.

Page 25: Comprehensive Exam Review

The coefficient of internal consistency is a reliability index based on interrelationships of item responses or of scores on sections of a test obtained during a single administration. The most common examples include the Kuder-Richardson and split-half.

Coefficient Alpha is a coefficient of internal consistency for a measure in which there are more than dichotomous response choices, such as in the use of a Likert scale.
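Coefficient alpha can be computed as (k / (k − 1)) × (1 − Σ item variances / total-score variance). A self-contained sketch, using hypothetical Likert responses and population (divide-by-N) variances:

```python
def variance(xs):
    """Population variance: average squared deviation around the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def coefficient_alpha(item_scores):
    """Cronbach's alpha; item_scores[i] holds every person's score on item i."""
    k = len(item_scores)
    totals = [sum(person) for person in zip(*item_scores)]
    item_var = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_var / variance(totals))

# Hypothetical Likert-scale responses: 3 items, 5 respondents
items = [
    [3, 4, 5, 2, 4],
    [3, 5, 5, 1, 4],
    [2, 4, 4, 2, 5],
]
print(round(coefficient_alpha(items), 3))  # 0.922
```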

Page 26: Comprehensive Exam Review

The split-half reliability coefficient is a reliability coefficient that estimates the internal consistency of a power test by correlating the scores of two halves of the test (usually the even-numbered items and the odd-numbered items, if their representative means and variances are equal).

The Spearman-Brown Prophecy Formula projects the reliability of a lengthened (or shortened) test from the calculated reliability of the existing test. It is a “correction” appropriate for use only with a split-half reliability coefficient, projecting full-test reliability from the half-test correlation.
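For the split-half case, the Spearman-Brown correction doubles the test length, giving r_full = 2r / (1 + r). A one-function sketch with a hypothetical half-test correlation:

```python
def spearman_brown(r_half):
    """Full-test reliability projected from a split-half correlation."""
    return 2 * r_half / (1 + r_half)

# A split-half correlation of .60 between odd- and even-item scores
# projects to a full-length reliability of .75:
print(spearman_brown(0.60))  # ≈ 0.75
```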

Page 27: Comprehensive Exam Review

Interrater reliability is an index of the consistency of two or more independent raters’ judgments in an assessment situation.

Intrarater reliability is an index of the consistency of each independent rater’s judgments in an assessment situation.

Page 28: Comprehensive Exam Review

Validity

Page 29: Comprehensive Exam Review

Validity is the extent to which a given test measures or predicts what it purports to measure or predict.

The two basic approaches to the determination of validity include logical analysis, which applies to content validity and item structure, and empirical analysis, which applies to predictive validity and concurrent validity. Construct validity falls under both logical and empirical analyses.

Page 30: Comprehensive Exam Review

Validation is the process by which the validity of an instrument is established.

Validity is application specific, not a generalized concept. That is, a test is not in and of itself valid, but rather is valid for use for a specific purpose for a specific group of people in a specific situation.

Face validity is a measure of the acceptability of a given test and test situation by the examinee or user, in terms of the apparent uses of the test.

Page 31: Comprehensive Exam Review

Concurrent validity is a measure of how well a test score matches a measure of criterion performance.

Example applications include comparing a distribution of scores for men in a given occupation with those for men in general, correlating a personality test score with an estimate of adjustment made in a counseling interview, and correlating an end-of-course achievement or ability test score with a grade-point average.

Page 32: Comprehensive Exam Review

Content validity is a measure of how well the content of a given test represents the subject matter (domain or universe) or situation about which conclusions are to be drawn.

A construct is a grouping of variables or behaviors considered to vary across people. A construct is not directly observable but rather is derived from theory.

Construct validity is a measure of how well a test score yields results in line with theoretical implications associated with the construct label.

Page 33: Comprehensive Exam Review

Predictive validity is a measure of how well predictions made from a given test are confirmed by data collected at a later time.

Example applications of predictive validity include correlating intelligence test scores with course grades or correlating test scores obtained at the beginning of the year with grades earned at the end of the year.

Page 34: Comprehensive Exam Review

Factorial validity is a measure of how well the factor structure resulting from a factor analysis of the test matches the theoretical framework for the test.

Cross-validation is the process of determining whether a decision resulting from one set of data is truly effective when used with another relevant and independent data set.

Page 35: Comprehensive Exam Review

Convergent evidence is validity evidence derived from correlations between test scores and other types of measures of the same construct and in which the relationships are in predicted directions.

Discriminant evidence is validity evidence derived from correlations between test scores and other forms of assessment for different constructs, in which the relationships are in predicted directions.

Page 36: Comprehensive Exam Review

Appraisal of Intelligence

Page 37: Comprehensive Exam Review

A very general definition of intelligence is that it is a person’s global or general level of mental (or cognitive) ability.

However, there is considerable debate as to what intelligence is, and a corresponding amount of debate about how it should be measured.

Page 38: Comprehensive Exam Review

Perhaps the biggest debate in the assessment of intelligence is how to use intelligence tests effectively.

Given that intelligence is a “global” construct, what are the implications of intelligence test results for relatively specific circumstances and/or sets of behaviors?

In general, intelligence test results have been most useful for interpretation in contexts calling for use of mental abilities, such as in educational processes.

Page 39: Comprehensive Exam Review

Another argument concerns whether intelligence is “a (single) thing,” which is reflected in unifactor theories of intelligence, or a unique combination of things, which is reflected in multifactor theories of intelligence.

The measurement implications from this debate result in some intelligence tests attempting to measure a single construct and some attempting to measure a unique set of interrelated constructs.

Page 40: Comprehensive Exam Review

Another debate centers on what proportion of intelligence is genetic or inherited and what proportion is environmentally determined. This is the so-called “nature-nurture” controversy.

So-called “fluid” intelligence (theoretically a person’s inherent capacity to learn and solve problems) is largely nonverbal and is a relatively culture-reduced form of mental efficiency.

Page 41: Comprehensive Exam Review

The nature-nurture concern has significant implications for how intelligence is assessed (e.g., what types of items and/or tasks are included), but there has not been full or consensual resolution of the debate.

So-called “crystallized” intelligence (theoretically) represents what a person has already learned, is most useful in circumstances calling for learned or habitual responses, and is heavily culturally laden.

Page 42: Comprehensive Exam Review

A fourth major debate concerns the extent to which intelligence tests are racially, culturally, or otherwise biased.

Although evidence of such biases was found in some “early” intelligence tests, improvements in psychometrics have done much to alleviate such biases, at least in regard to the resultant psychometric properties of “newer” intelligence tests.

Page 43: Comprehensive Exam Review

In light of these and other considerations, the primary focus for the assessment of intelligence is on the construct validity of intelligence tests.

In general, individually administered intelligence tests have achieved the greatest credibility.

Individual intelligence tests typically are highly verbal in nature, i.e., necessitate command of language for effective performance.

Page 44: Comprehensive Exam Review

Individual intelligence tests typically include both verbal (e.g., response selection or item completion) and performance (e.g., manipulation task) subsets of items.

However, nonverbal and nonlanguage intelligence tests have been developed.

Group administered intelligence tests, such as those commonly used in schools, are typically highly verbal and non-performance in nature.

Page 45: Comprehensive Exam Review

Appraisal of Aptitudes

Page 46: Comprehensive Exam Review

An aptitude is a relatively clearly defined cognitive or behavioral ability.

An aptitude is a much more focused ability than general intelligence, and the measurement of aptitudes also has been more focused.

Literally hundreds of aptitude tests have been developed and are available for a substantial number of rather disparate human abilities.

Page 47: Comprehensive Exam Review

Theoretically, aptitude tests are intended to measure “innate” abilities (or capacities) rather than learned behaviors or skills.

There remains considerable debate as to whether this theoretical premise is actually achieved in practice.

However, this debate is lessened in importance IF the relationship between a current aptitude test result and a future performance indicator is meaningful and useful.

Page 48: Comprehensive Exam Review

Aptitude tests are used primarily for prediction of future behavior, particularly in regard to the application of specific abilities in specific contexts.

Predictive validity is usually the foremost concern in aptitude appraisal and is usually established by determining the correlation between test results and some future behavioral criterion.

Page 49: Comprehensive Exam Review

Although there are many individual aptitude tests, aptitude appraisal is much more commonly achieved through use of multiple-aptitude test batteries.

There are two primary advantages to the use of multiple-aptitude batteries (as opposed to a collection of individual aptitude tests from different sources):

Page 50: Comprehensive Exam Review

First, the subsections of multiple-aptitude test batteries are designed to be used as a collection; therefore, there is usually a common item and response format, greater uniformity in score reporting, and generally better understanding of subsection and overall results.

Second, the norms for the various subtests are from a common population; therefore, comparison of results across subtests is facilitated.

Page 51: Comprehensive Exam Review

Perhaps the most widely recognized use of aptitude tests is for educational purposes, e.g., Scholastic Assessment Test (formerly the Scholastic Aptitude Test; SAT), American College Testing Program (ACT), and Graduate Record Examination (GRE).

However, aptitude tests used specifically for vocational purposes (e.g., General Aptitude Test Battery; GATB) or armed services purposes (e.g., Armed Services Vocational Aptitude Battery; ASVAB) also are very widely used.

Page 52: Comprehensive Exam Review

Appraisal of Achievement

Page 53: Comprehensive Exam Review

Achievement tests are measures of success, mastery, accomplishment, or learning in a subject matter or training area.

The greatest use by far of achievement tests is in school or educational systems to determine student accomplishment levels in academic subject areas.

The vast majority of achievement tests are group tests.

Page 54: Comprehensive Exam Review

Most achievement tests also are actually multiple-achievement test batteries because they typically have subtests for several different subject matter areas.

However, there are achievement tests available that measure across several different subject matter areas but that are designed for individual administration.

Individual achievement tests are used most commonly in processes to diagnose learning disabilities.

Page 55: Comprehensive Exam Review

Most achievement tests are norm-referenced to facilitate comparisons within and between components of educational systems.

However, increasingly, criterion-referenced achievement tests are being used in the attempt to determine with greater specificity the particular skills and/or knowledge students are mastering at various educational levels.

Page 56: Comprehensive Exam Review

Appraisal of Interests

Page 57: Comprehensive Exam Review

The primary goal of interest assessment is to help individuals differentiate preferred activities from among possible activities.

Presumably, the information derived from interest assessment will enable the respondent to achieve greater vocational productivity, success, and/or life satisfaction.

Page 58: Comprehensive Exam Review

Most interest inventories are used in the context of vocational counseling (i.e., to help individuals determine preferences in various aspects of the world of work).

However, increasingly, interest inventories are being developed and used to assess preferences in other aspects of life, such as leisure.

Page 59: Comprehensive Exam Review

Some interest (and some personality) inventories are ipsative measures, which means that the sum (and hence the average) of the subscale scores is the same for all respondents.

Ipsative measures usually have a forced-choice format, which means that a respondent cannot have all high scores or all low scores across subscales.

Page 60: Comprehensive Exam Review

Interest inventories are most commonly used by and developed for young adults, such as late high school or college students.

However, interest inventories suitable and valid for use with persons at any age are available.

The major problem with interest inventories is the tendency for respondents to interpret them as measures of ability or probable satisfaction, neither of which is necessarily directly related to any particular preference.

Page 61: Comprehensive Exam Review

Appraisal of Personality

Page 62: Comprehensive Exam Review

Personality is a vague, difficult-to-define construct. People tend to think of it as “the way a person is.” However, there are at least two points of agreement about personality:

First, each person is consistent to some extent (i.e., has coherent traits and action patterns that are repeated).

Second, each person is distinct or unique to some extent (i.e., has traits and behaviors different from others).

Page 63: Comprehensive Exam Review

It is exactly this strange set of conflicting conditions that makes the assessment of personality so complex.

“Normality” is a relativistic term used to describe how some identifiable group of people (should) behave most of the time.

The assessment of personality thus involves determining the extent to which a person’s traits and/or behaviors fit normality (i.e., are compared to average behavior in some reference group).

Page 64: Comprehensive Exam Review

Projective techniques and self-report inventories are the two primary methods of appraisal of personality.

Projective techniques involve respondents constructing their own responses to vague and ambiguous stimuli.

The projective hypothesis is that personal interpretation of ambiguous stimuli reflects unconscious needs, motives, and/or conflicts.

Page 65: Comprehensive Exam Review

Generally, five types of projective assessment techniques are discussed:

Association techniques, such as the Rorschach or Holtzman Inkblot techniques, ask the respondent to “explain” what is seen in the stimulus.

Construction techniques, such as the Thematic Apperception Test or the Children’s Apperception Test, ask the respondent to “tell a story” about what is represented by the stimulus, usually a vague picture.

Page 66: Comprehensive Exam Review

Expression techniques, such as the Draw-A-Person Test or the House-Tree-Person Test, ask the respondent to create a figure or drawing in response to some instruction.

Arrangement techniques ask the respondent to place in order the elements of a set (usually) of pictures and then to “explain” the sequence.

Completion techniques ask the respondent to make a complete sentence from a sentence stem.

Page 67: Comprehensive Exam Review

Historically, the results of projective techniques have exhibited poor psychometric properties.

However, the use of projective techniques remains quite popular, primarily because respondents often do disclose information, particularly “themes” of information, not easily obtainable through other methods.

Page 68: Comprehensive Exam Review

Generally, three types of self-report personality inventories are discussed in the professional literature:

Theory-based inventories, such as the Myers-Briggs Type Indicator, State-Trait Anxiety Inventory, or Personality Research Form, assess traits and/or behaviors in accord with the constructs upon which the inventory is based.

Page 69: Comprehensive Exam Review

Factor-analytic inventories, such as the Sixteen Personality Factor Questionnaire or the NEO Personality Inventory-Revised, assess personality dynamics outside the context of any particular theory of personality.

Items in these types of instruments are selected from the results of factor analyses of large samples of items and generally have very good psychometric properties.

Page 70: Comprehensive Exam Review

Criterion-keyed inventories, such as the Minnesota Multiphasic Personality Inventory-2 or Millon Clinical Multiaxial Inventory-III, contain subscale items that discriminate between a criterion group (e.g., schizoid or narcissistic) and a relevant control group (e.g., “normals”).

These types of inventories usually are used to assist in making clinical diagnoses.

Page 71: Comprehensive Exam Review

Self-report personality inventories generally have much better psychometric properties than do projective techniques.

However, clinical diagnoses should never be made solely on the basis of personality instrument results; clinical judgments should be used in combination with assessment results.

Page 72: Comprehensive Exam Review

Computers and Appraisal

Page 73: Comprehensive Exam Review

Clearly the most prominent trend in appraisal today is toward “computerization” of testing.

In computer-based testing, instruments or techniques that are or could be in other formats (e.g., “paper-and-pencil”) are converted to a situation in which they are presented on and responded to through use of a computer.

Page 74: Comprehensive Exam Review

Adaptive testing is when each item presented to a respondent is selected based on the nature or correctness of the response to the preceding item.

Adaptive testing is facilitated through the use of computers due to the capability to handle large numbers of contingencies and choices efficiently and accurately.
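The branching logic of adaptive testing can be illustrated with a deliberately simplified sketch. The difficulty levels and step rule here are hypothetical; operational adaptive tests select items using item response theory rather than a fixed step:

```python
# Minimal, hypothetical sketch: step to a harder item after a correct
# response and an easier item after an incorrect one, within bounds 1..7.

def next_difficulty(current, last_correct):
    """Choose the next item's difficulty from the last response."""
    step = 1 if last_correct else -1
    return max(1, min(7, current + step))

level = 4                          # start near the middle of the range
for correct in [True, True, False]:
    level = next_difficulty(level, correct)
print(level)  # path: 4 -> 5 -> 6 -> 5
```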

Page 75: Comprehensive Exam Review

Computer-generated interpretive reports also are increasing in frequency of use.

A computer’s capability to analyze complex data sets and intricate patterns in data is the primary reason for the increasing use of computer-generated interpretive reports.

However, computer-generated interpretive reports are only as good as the programming underlying them, and never as good as when used in conjunction with sound clinical judgment.

Page 76: Comprehensive Exam Review

This concludes Part 2 of the presentation on

APPRAISAL

