
    THE REVISED SAT SCORE AND ITS MARGINAL PREDICTIVE

    VALIDITY

    Journal: Applied Measurement in Education

    Manuscript ID: HAME-2012-0095

    Manuscript Type: Empirical Article

    Keywords: Predictive Validity, SAT, College Admissions, Revised SAT


    ABSTRACT

This paper explores the predictive validity of the Revised SAT (R-SAT) score as an alternative to the student SAT score. Freedle proposed this score for students who may potentially be harmed by the relationship between item difficulty and ethnic DIF observed in the test they took in order to apply to college. The R-SAT score is defined as the score minority students would have received if only the hardest questions from the test had been considered, and it was computed using formula scores and an inverse regression approach. Predictive validity for short- and long-term academic outcomes is considered, as well as the potential effect on the overprediction and underprediction of grades among minorities. The predictive power of the R-SAT score was compared to the predictive capacity of the SAT score and to that of alternative Item Response Theory (IRT) ability estimates based on models that explicitly considered DIF and/or were based on the hardest test questions. We found no evidence of incremental validity in favor of either the R-SAT score or the IRT ability estimates.


THE REVISED SAT SCORE AND ITS RELATIVE PREDICTIVE VALIDITY

    Introduction

    One way that admission examinations are judged is by how well they are able to predict

    college outcomes. Predictive validity studies analyze the degree of association between

    admissions test scores and college outcomes, such as college grades and graduation rates.

Academic outcomes are relatively easy to collect and are also related to the behavior that tests like the SAT are expected to predict: success in college. Some studies have also addressed the prediction of nonacademic outcomes such as earnings, leadership, job satisfaction, satisfaction with life, and civic participation (Bowen & Bok, 1998; Allen, Robbins & Sawyer, 2010; Oswald, Schmitt, Kim, Ramsay & Gillespie, 2004; Willingham, 1985).

    In this study, we examine a measure of academic preparedness that has been proposed to

    complement the SAT. This measure, the Revised-SAT or R-SAT, was proposed by Roy Freedle

(2003) with the goal of correcting the unfairness of SAT results for minorities that he found through

    his application of the Standardization method for DIF (Dorans & Kulick, 1983, 1986; Dorans &

    Holland, 1992). The R-SAT was proposed as a score based on a subset of the SAT questions. We

    will judge its success using the result of predictive validity analyses of short and long-term

    outcomes.

    This article is divided into five sections. The first section summarizes previous research

on the prediction of college outcomes. The research question for this investigation, the data sources, and the methods are presented in the next three sections. Lastly, the results section presents the

    findings obtained when calculating the revised SAT score, and using it to predict academic


    outcomes. The predictive capacity of the R-SAT score will be compared to that of the original

    SAT score and three Item Response Theory (IRT) versions of the SAT score.

    Prior Research on Prediction of College Outcomes

    There is a substantial body of research on the validity of multiple variables to predict

    college outcomes in a wide range of dimensions: education, employment and social outcomes.

    This section presents a brief overview of this literature, with a particular focus on the role of high

    school grades and standardized test scores in the prediction of (i) college grades and (ii)

graduation rates. Although these outcome indicators offer only a partial portrayal of students' educational achievement, the convenience of their collection and updating process makes them the outcomes most commonly used in predictive validity studies. A subsequent section describes recent studies examining the prediction of nonacademic college outcomes and the role of noncognitive predictors.

Freedle proposed computing a new score based on the hardest questions of the most widely taken standardized test in the US in order to compensate for the potentially unfair results for minority students that he found when analyzing differential item functioning and its relationship to item difficulty. Details on the calculation of the R-SAT, Freedle's expectations for this index, and the criticisms made of it are also presented in this section.

    College Grade Point Average

The relationship between high school grade point average, SAT scores, and freshman grade point average has been widely examined by researchers at the College Board and research

    units within higher education institutions (Ramist, Lewis, & McCamley-Jenkins, 1994; Geiser &

Studley, 2004). In general, College Board studies find that SAT scores make a substantial

    contribution to predicting cumulative college GPAs and that the combination of SAT scores and


    high school records provide better predictions than either grades or test scores alone (Burton &

Ramist, 2001; Hezlett, Kuncel, Vey, Ahart, Ones, Campbell & Camara, 2001). College Board

    researchers have studied the validity of the SAT mostly using correlational analysis and have

    taken into consideration the technical issues of range restriction, differences in grading across

    colleges and unreliability of college grades to measure success in college (Camara &

    Echternacht, 2000; Willingham, Lewis, Morgan & Ramist, 1990).1

    Typical correlations between first-year grades and the SAT I (Verbal and Math scores

    combined) range between 0.3 and 0.6 depending on the characteristics of the studies with an

average of 0.4 (Ramist, Lewis & McCamley-Jenkins, 1994; Zwick, 2002). Bridgeman, Pollack and Burton (2004), for example, report a correlation between freshman grades and the SAT I score composite of 0.55; while the SAT Verbal test score has a correlation of 0.50 with freshman grades, the SAT Math correlates 0.52.2 On average, the measurement error of the SAT I Math and Verbal sections is 30 points, and the correlation with the outcome criterion tends to be less strong when measurement error is considered (Zwick, 2002).

    Standardized tests allow all applicants the opportunity to perform in an environment with

    the same testing conditions, instructions and time-constraints, opportunities to ask questions and

procedures for scoring. Standardized test scores permit comparisons among students who

    come from different schools in which grading standards can vary significantly. Zwick (2002),

    aware that SAT scores add little predictive power to high school grades, justifies the use of the

    standardized test scores in admissions to large institutions by noting the cost of interviewing

1 Studies that do not adjust for range restriction and variations in grading standards tend to produce lower observed correlations, underestimating the predictive power of the indexes used in admissions processes (Camara &

    Echternacht, 2000).

    2 They also report a correlation between high school grades and first-year college grades of 0.58.


    candidates or reviewing applications in elaborate detail. The cost for the school of collecting and

processing the scores is minimal. In addition, standardized test scores help reduce the overprediction of African American college grades observed when using high school grades alone.3

    In 2005 the SAT I was revised in a number of ways (Kobrin, Patterson, Shaw, Mattern &

    Barbuti, 2008): analogies were removed and replaced with more questions on reading passages

    and the Verbal section was renamed the Critical Reading section. The Math section now includes

    items from more advanced courses and does not include quantitative comparison items. In

    addition, a third test was added including multiple-choice items on grammar and a student-

produced essay. Kobrin et al. report a correlation between test scores and first-year college grades similar to that from previous studies (unadjusted r=0.35, r adjusted for range restriction=0.53), conclude that the new writing test is the most predictive based on bivariate correlations and multiple correlations (unadjusted r=0.36, r adjusted for range restriction=0.51), and encourage institutions to use both high school GPA and test scores when making admissions decisions, since that maximizes predictability of first-year college grades (unadjusted r=0.46, r adjusted for range restriction=0.62).

    Relative Predictive Validity of Different Academic Indicators

    Previously, Geiser and Studley (2002), from the University of California, analyzed the

    relative contribution of high school GPA, SAT I and SAT II scores to the prediction of college

    success and found that SAT II scores were the best single predictor of first year GPA, and that

3 Overprediction means that a group's average predicted first-year grade point average (GPA) is greater than its average actual first-year GPA. Although this problem is known to be present in the SAT I for African American and Hispanic students, Ramist et al. (1994) find it even more strongly when only using high school GPA to predict first-year college grades.


    the SAT I scores added little to the prediction once SAT II scores and high school GPA were

    already considered.4

    After taking the SAT II and high school GPA into consideration, the SAT I

    scores improved the overall prediction rate by a negligible 0.1% (from 21.0% to 21.1%). The

    standardized coefficient of the SAT I, after controlling for SAT II and high school GPA, was

    0.07, but statistically significant due to the large number of observations used. Geiser & Studley

    (2002) analyzed a sample of 80,000 freshmen who entered the University of California from fall

    1996 to fall 1999 using regression analysis. Their results were confirmed by subsequent findings

from College Board researchers (Ramist et al., 2001; Bridgeman, Burton & Cline, 2001; Kobrin, Camara and Milewski, 2002). For a more detailed review of these articles see Author (Year).

Based on the findings from multivariate analyses considering multiple academic predictors of college performance, such as the one conducted by Geiser & Studley (2002), the National Center for Fair and Open Testing (FairTest) has stated that the SAT I has little value in predicting future college performance (FairTest, 2003) and highlights the better performance of class rank, high school grades, and SAT II scores. Others, however, have chosen to advocate for admissions tests that focus on achievement and that are based on standards and criteria (Atkinson & Geiser, 2009).

    Prediction by Ethnic Group and Gender

    Notable differences in the validity and predictive accuracy of SAT scores and high school

    grades by race and sex have been substantiated through numerous studies (Young, 2004). The

accuracy of high school grades and SAT scores for predicting freshman grade point average is

    higher for women, Asian Americans and White students, and lower for men, African Americans

4 Geiser & Studley (2002) combined three SAT II scores into a single composite variable that weights each SAT II test equally; they did not analyze the predictive validity of separate test scores.


and Hispanics. Furthermore, these admissions variables often overpredict the grades of African American and Hispanic students, and underpredict those of women (Burton & Ramist, 2001).

    Ramist et al. (1994) report an overprediction of first-year GPA of -0.16 for African American

    students and of -0.13 for Hispanic students when considering HSGPA and SAT scores. Geiser &

    Studley (2002), on the other hand, found no significant over-prediction for African Americans

    and an average overprediction of -0.04 for Hispanic students when including high school GPA

    and SAT I scores in the regression equation. Zwick, Brown & Sklar (2004) conducted the same

    type of analyses for each of the University of California campuses and for two merged-cohorts

    (1996-1997 and 1998-1999). Their results vary significantly by the campus and merged-cohort

    analyzed but were interpreted by the authors as supporting previous findings from the literature.5

There are a number of theories about the reasons for over- and underprediction; for details see Zwick, Brown, & Sklar (2004), Zwick (2002), and Steele and Aronson (1998).

More recently, researchers have looked at the differential prediction of test scores and high school grades among students from different language backgrounds (Zwick & Schlemer, 2004; Zwick & Sklar, 2005) and from schools with different financial and teaching resources (Zwick & Himelfarb, 2011) as a way to investigate possible explanations for the issue of over- and underprediction. Results show a reduction of prediction error for Hispanic and African American students, but not a complete elimination (from -0.15 to -0.08 and from -0.13 to -0.03, respectively), when using the second approach, but no change when considering first language:

5 Zwick, Brown & Sklar (2004) observed no significant differences in the overprediction of minorities when the SAT IIs were considered instead of the SAT I, regardless of whether income and parental education were included in the regression equation. Geiser & Studley (2002) also reported no practical change in the overprediction of minority groups when examining the predictive power of using SAT II scores instead of SAT I scores, although the underprediction for African Americans grows to 0.03 and the overprediction for Hispanics grows to 0.08.


    overprediction is still observed for African American and Hispanic students when considering

    first language.

    Relative Predictive Validity Using Multivariate Regression Analysis and Considering

    Sociodemographic Variables

    Parental income and education play a modest role in the prediction of college

    performance when controlling for additional academic indicators such as high school grades and

standardized tests. Geiser & Studley (2002), for example, reported standardized coefficients that ranged between 0.03-0.04 and 0.05-0.06, respectively.6 Modest standardized coefficients associated with parental income and education were also reported by Bowen & Bok (1998) when using multivariate regression analysis to predict college performance.7

The consideration of sociodemographic variables in the predictive validity regression equation, however, is based on the results of Rothstein (2004), who found that most of the SAT's predictive power comes from its correlation with unobserved variables such as high school sociodemographic characteristics.8 Rothstein's estimates show that the predictive contribution of the SAT I score is 60% lower than would be indicated by traditional methods.

6 The R^2 of the regression equation that included high school GPA, SAT I and SAT II scores increased from 22.3% to 22.8% when considering parental income and education.

7 Performance in college was measured as percentile rank based on the cumulative GPA of the entering cohort, rather than freshman grade point average, as a way to avoid school and major differences in grading philosophies and

    practices (pages 72 to 76). The book also looks at differences in economic outcomes (such as employment, wage

    and job satisfaction) and social outcomes (such as civic contribution, marital status and satisfaction with quality of

    life).

8 The student-level variables he included are individual race and gender. The demographic makeup of the school was described by the fraction of students who were Black, Hispanic, and Asian; the fraction of students receiving subsidized lunches; and the average education of students' parents.


    Controversy mounted between Geiser & Studley (2002) and Zwick and her colleagues

    (Zwick et al., 2004; Zwick & Green, 2007) around the issue of whether the SAT I or the SAT IIs

were more sensitive to sociodemographic characteristics. This argument fueled the discussion that prompted the modifications made to the SAT I in 2005, much in line with those suggested by the University of California (Atkinson & Pelfrey, 2004). The sensitivity of scores from different test types to sociodemographic characteristics has also been prominent in the discussion of whether general aptitude tests (like the SAT I) or curriculum-based tests (more like the SAT IIs) should be used for college admissions (Atkinson & Geiser, 2009). For more details about the controversy see Author (Year, pp. 107-109).

    College Graduation

The ultimate goal of post-secondary education is college graduation, yet this goal remains elusive. According to Baum & Ma (2007), people with a college degree earned, on average, 62 percent more than individuals with only a high school diploma in 2005. According to the National Educational Longitudinal Study (NELS)9, 59% of those who started college earned bachelor's degrees by age 26 (Bowen, Chingos & McPherson, 2009). The National Center for Higher Education Management Systems (NCES, IPEDS, 2007) reports that only 77.4 percent of first-time, full-time students attending a four-year institution returned to that institution for their second year of college in 2005 (this figure excludes students who transfer to another institution). Studies typically find that women are slightly more likely to graduate from college than men, and that African Americans, Hispanics, and Native Americans have lower rates of graduation than White students (Astin, Tsui & Avalos, 1996; Bowen and Bok, 1998).

    In general, studies exploring the role of SAT scores and high school grades in college

    persistence and college graduation find a moderate relationship between these college outcomes

9 NELS surveyed students who were in eighth grade in 1988, most of whom graduated from high school in 1992.


    and preadmission measures (Astin et al., 1996; Burton & Ramist, 2001; Mattern & Patterson,

    2009, 2011a, 2011b). Although the traditional variables included in the multivariate regression

    models explain a small proportion of the variance, Author (Year) found high school grades to be

    the strongest predictor of college persistence, followed by the SAT II Writing scores. The

importance of high school grades was corroborated by Zwick & Sklar (2005). Sociodemographic variables play a minor role in explaining college persistence and graduation (Author, Year); nevertheless, Bowen & Bok (1998) found these variables to be more important in college prediction for African American students than for White students.

The lower correlation between college persistence and preadmission characteristics is to be expected, since persistence in college and ultimate graduation are more substantially influenced by nonacademic factors than is college GPA. Some of the variables that research has identified as playing an important role in determining persistence are finances, motivation, social adjustment, family and health problems, and institutional selectivity and size (Reason, 2009; Bowen, Chingos & McPherson, 2009).10

Nonacademic Predictors of College Success

Recently, a number of studies have looked into the importance of nonacademic variables for predicting college success. These studies have called for the expansion of the definition of college success to include longer-term outcomes such as persistence and graduation, as well as less-researched outcomes such as leadership and civic participation, and have stressed the importance of nonacademic predictors (Camara & Kimmel, 2005; Robbins, Lauver, Le, Davis & Langley, 2004; Sternberg 1999, 2003; Kyllonen, 2008). Doing so makes it possible to predict college success more broadly and to avoid relying exclusively on cognitive criteria and predictors. This is in

10 Wilson (1983) observes that the best predictors of college graduation are persistence to sophomore year and first-year GPA. This information is closest in time and in content to what is being predicted, but it is not available at admission.


light of universities' broader missions, which include social and personal outcomes for their students, and the reduced adverse impact that considering such measures may have on the admission of traditional minority students (Oswald, Schmitt, Kim, Ramsay & Gillespie, 2004; Breland, Maxey, Gernard, Cumming & Trapani, 2001). Admissions decisions consider different dimensions of the applicant depending on the institutional mission and philosophy (Perfetto, 1999). Sinha, Oswald, Imus & Schmitt (2011) show that the adverse impact of admissions decisions can be reduced if colleges use a battery of cognitive and non-cognitive predictors that are weighted according to the values institutional stakeholders place on an expanded performance criterion of students' success.

Previous studies that looked into nonacademic measures of success (Bowen & Bok, 1998; Willingham, 1985) showed that traditional academic predictors, such as test scores and high school records, have moderate to no relationship to nonacademic success. Sinha, Oswald, Imus & Schmitt (2011) confirmed the same type of results: SAT/ACT scores and high school GPA were more strongly correlated with college GPA than with noncognitive attributes.11

    The Revised-SAT

Freedle proposed a methodology to correct the unfairness generated by the relationship he observed between item difficulty and differential item functioning in the SAT, now known as the Freedle phenomenon: harder items showed DIF in favor of minority students, while easier items showed DIF in favor of White students (Freedle, 2003).12


11 Allen, Robbins & Sawyer (2010), however, claim that noncognitive indicators and psychosocial factors can increase the marginal prediction of academic college outcomes beyond what is already explained by traditional predictors.

12 Differential item functioning (DIF) studies refer to how items function after differences in score distributions between groups have been statistically removed. The remaining differences indicate that the items function differently for the two groups. Typically, the groups examined are derived from classifications such as gender, race, ethnicity, or socioeconomic status. The performance of the group of interest (focal group) on a given test item is compared to that of a reference or comparison group. White examinees are often used as the reference group, while minority students are often the focal groups (Holland & Wainer, 1993).


The proposed methodology focuses on how students perform on the hard half of the SAT test and is called the Revised-SAT or R-SAT (Freedle, 2003). According to Freedle, the R-SAT would increase the SAT Verbal scores by as much as 200 to 300 points for individual minority test-takers, reduce the mean score differences between White and minority test-takers by a third, and produce a score that is a better indicator of the academic abilities of minority students.

Freedle, citing the work of Diaz-Guerrero & Szalay (1991), interprets the difference between a student's R-SAT and his or her regular SAT score as a measure of the degree to which the examinee's cultural background diverges from White, middle-class culture. In his paper, Freedle recommends exploring the validity of the R-SAT index by comparing the correlation between the observed R-SAT index and college grades to that observed between the SAT score and college grades, and by looking at how many admissions decisions would change if we assume that SAT or R-SAT scores over 600 indicate students qualified for college.13

Freedle was strongly criticized by the College Board (Camara & Sathy, 2004), Dorans (2004), and Dorans and Zeller (2004a, 2004b). Some of the criticisms concerned the method used to study differential item functioning (the standardization approach) and the way Freedle implemented it. Those criticisms were addressed by Author (Year), whose results partially replicated Freedle's findings when the standardization approach was correctly implemented. However, the relationship between item difficulty and DIF was present only in the SAT Verbal test and only for African American students (Author, Year). When IRT methods were used to

    13 Freedle recognizes that predictive validity analyses will necessarily be limited because many people who did not

    attend selective colleges might have matriculated at such schools if their R-SAT scores had been used in the

    admission process, but nevertheless considers it relevant to examine the implications of using the measure he

    proposed.


study DIF and to model guessing, Freedle's findings were also observed for Hispanic students (Author, Year).

Dorans (2004) and Dorans and Zeller (2004a) also criticized the methods Freedle used for calculating the necessary components of the R-SAT: the use of proportion correct rather than formula score, his consideration of different (ethnic) samples for the half-test, and his application of inverse regression. Furthermore, Dorans & Zeller (2004b) explored the fairness of Freedle's

    R-SAT using Score Equity Assessment (SEA), a new methodology presented as a complement to

    the existing procedures for fairness assessment, namely DIF analysis and differential prediction.

Using SEA, Dorans and Zeller (2004b) found that the half-test to total-test linking may be population-dependent and therefore the scores produced on the hard-half test cannot be used interchangeably with scores produced on the full-length SAT Verbal test. For a more

    comprehensive review of the criticisms Dorans (2004) and Dorans & Zeller (2004b) posed, see

    Author (Year, pp. 113-114).

    Research Questions

The current paper provides evidence regarding the predictive validity of the R-SAT and aims to explore the validity of Freedle's measure using multivariate regression models. These models allow the exploration of the predictive validity of the R-SAT while controlling for the effect of other relevant measures influencing the academic outcomes achieved by students in college. The investigation starts by calculating the revised SAT score using Freedle's methodology while considering the methodological criticisms made by Dorans and Zeller of the way Freedle calculated the necessary components of the R-SAT (Dorans, 2004; Dorans & Zeller, 2004a). Once the R-SAT was calculated, we examined how beneficial it was for minority students


    and how it fared in terms of predictive validity of college outcomes in comparison to the original

    SAT score.

The predictive power of the R-SAT was also compared to the predictive capacity of alternative Item Response Theory (IRT) ability estimates. IRT methods consider students' ability to be a latent variable to be inferred from the data and, due to the invariance property, these estimates are not dependent on the set of test items under analysis.14 The model used in this research, the Rasch model (Wu, Adams & Wilson, 1998), provides examinee ability estimates that are a direct transformation of the sum of correct responses and allowed us to include a parameter to consider DIF in the estimation of examinees' ability. The predictive power of the R-SAT score and original SAT score will be compared to the predictive capacity of ability estimates from the Rasch model and the Rasch DIF model (Paek, 2002).15
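For reference, the two measurement models can be written compactly. The following is a minimal sketch in standard IRT notation; the DIF parameterization shown is one common form of the Rasch DIF model and is our rendering, not a formula quoted from Paek (2002):

P(X_{ij} = 1 \mid \theta_j) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)} \quad \text{(Rasch)}

P(X_{ij} = 1 \mid \theta_j) = \frac{\exp(\theta_j - b_i - \gamma_i g_j)}{1 + \exp(\theta_j - b_i - \gamma_i g_j)} \quad \text{(Rasch DIF)}

where \theta_j is the ability of examinee j, b_i is the difficulty of item i, g_j indicates focal-group membership (1 = focal, 0 = reference), and \gamma_i is the DIF parameter for item i.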

    Data Sources

Since the analyses require information about students' SAT test scores and college experience, the information was drawn from two primary sources: the University of California Corporate Data System and the College Board.

The College Board datafiles contained item-level performance and students' individual scores, as well as students' responses to the Student Data Questionnaire (forty-three questions), including self-reported demographic and academic information such as parents' education, family income, and high school grade point average.

    The University of California Corporate Student Information System provides systemwide

    admissions and performance data. Through their applications to UC, students provide academic

14 Note that the invariance property holds only when the models used hold. This is tested using fit statistics.
15 For more details about this model and its application to the Freedle phenomenon see Author (Year).


    and demographic information that is subsequently verified and standardized. For those students

    who enroll at UC, this information is complemented with their academic history including

    college grades, number of courses and number of units completed, persistence and graduation.

    Information about parental education level and family income is also provided.

    Information from the College Board and UC system was complemented with an indicator

    of school performance on a state standardized test (Academic Performance Index) from the

    California Department of Education.

    This study was conducted using the subset of examinees from the College Board file who

    were juniors, came from California public high schools, took the SAT forms DX and QI in 1994

    or SAT forms IZ and VD in 1999, spoke English as their best language and applied and enrolled

at the University of California. Only UC-eligible students are admitted to and allowed to enroll at the University of California. Although at the time there were several routes to becoming UC eligible, most students became eligible through the statewide eligibility path. This path required students to complete a certain number of courses by subject area and to achieve a certain test score depending on their high school grades. In general, the UC eligibility criteria were set with the ultimate goal of identifying the top 12.5% of high school graduates who, according to the California Master Plan for Higher Education, should be considered for the University of California.

As a result of the eligibility criteria and of enrollment decisions, the sample used has a higher mean SAT score, higher high school grade point average, and higher family income and parental education than the College Board sample of all high school juniors from California public high schools who took SAT forms DX and QI in 1994 and SAT forms IZ and VD in 1999 (see Table 1). The difference in academic and demographic characteristics does not change the phenomenon originally described by Freedle and studied by Author (Year). The relationship


between item difficulty and DIF estimates is still observed among high- and low-ability students when using the Rasch model to study DIF (Author, Year).16

    INSERT TABLE 1 HERE

    Methods

    This section presents the details of how the R-SAT score was calculated, how one IRT

    version of the original SAT score and two IRT versions of the R-SAT were estimated, and how

    the relative predictive validity of these scores and ability estimates was assessed. Since a

    previous study found stronger evidence of the relationship between DIF estimates and item

    difficulty in the Verbal test than in the quantitative test (Author, Year), the analyses presented in

    this paper focus exclusively on the Verbal test.

    Calculation of the Revised SAT score and Estimation of IRT Ability Parameters

The R-SAT scores were calculated, and the IRT ability estimates obtained, for the specific SAT forms and ethnic subgroups for which previous studies (Author, Year) showed evidence of a relationship between DIF and item difficulty estimates as defined by the standardization

    method (Dorans & Kulick, 1983) and/or the Item Response Theory approach to DIF (Camilli &

    Shepard, 1994). Table 2 presents a summary of the results obtained when using these two

    methodologies across forms and ethnic groups. Thus, R-SAT was calculated and ability

16 The Freedle phenomenon was analyzed among high- and low-ability students, where ability was defined by the SAT score. The Freedle phenomenon was not analyzed among enrolled and non-enrolled students, as this categorization is not exclusively based on ability but is also determined by financial considerations and personal preferences. In addition, the sample size would have been extremely small for minority students. See Author (Year) for more details.


estimates were obtained for African Americans in forms IZ, QI and DX and for Hispanics in

    forms IZ and VD.
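For reference, the standardization index used in this literature (Dorans & Kulick, 1983) is the weighted average, across matched score levels k, of the difference in proportions correct between the focal group f and the reference group r; the notation below is a standard rendering rather than a formula quoted from the manuscript:

\text{STD P-DIF} = \frac{\sum_k w_k \,(P_{fk} - P_{rk})}{\sum_k w_k}

where the weights w_k are typically taken from the focal group's score distribution.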

    INSERT TABLE 2 HERE

    The R-SAT was obtained by calculating the corresponding formula score17

    in the hardest

    half of the test for all students who took each test form and then assigning African

    American/Hispanic students the total score obtained by White students who performed similarly

in the hard half of that specific test form. Specifically, in order to obtain the revised score that African American/Hispanic students would have received, a linear regression was estimated only

    among the White students who took each form. The linear regression was used to predict their

    SAT scores using the formula score obtained in the hard half of the test. A constant and a slope

    coefficient were estimated and subsequently those parameter estimates were applied to the

    formula score obtained by African American/Hispanic students in the hard half of the test.18
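To make the procedure concrete, the following is a minimal sketch in Python of the inverse-regression step just described. It assumes a data frame for a single test form with hypothetical columns ethnicity, hard_half_formula_score, and sat_verbal; the column names and data layout are illustrative assumptions, not details of the study's actual files:

import numpy as np
import pandas as pd

def rsat_scores(form_df: pd.DataFrame) -> pd.Series:
    """Sketch of the R-SAT calculation for one SAT form.

    Fit SAT = a + b * (hard-half formula score) among White examinees,
    then apply the fitted line to African American/Hispanic examinees.
    """
    white = form_df[form_df["ethnicity"] == "White"]
    minority = form_df[form_df["ethnicity"].isin(["African American", "Hispanic"])]

    # Slope and intercept estimated only among White students who took the form.
    b, a = np.polyfit(white["hard_half_formula_score"], white["sat_verbal"], deg=1)

    # Apply the White-group regression line to minority hard-half scores and
    # keep the result on the familiar 200-800 SAT Verbal scale.
    rsat = a + b * minority["hard_half_formula_score"]
    return rsat.clip(lower=200, upper=800)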

Although the R-SAT was calculated incorporating Dorans and Zeller's recommendations

    regarding the use of formula scores rather than the original proportion correct scores (Dorans,

    2004; Dorans & Zeller, 2004a), the methodology employed to obtain the R-SAT is still subject to

    criticism for the use of inverse regression and combining results from different ethnic groups

    (Dorans & Zeller, 2004a, 2004b).
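Because formula scoring underlies both the R-SAT and the comparisons above, it may help to state it explicitly. The classical correction for guessing, consistent with the Frary (1988) reference cited in footnote 17 but not quoted from the manuscript, is

FS = R - \frac{W}{k - 1}

where R is the number of right answers, W the number of wrong answers (omitted items are not penalized), and k the number of response options per item; for five-option SAT Verbal items this amounts to R - W/4.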

    Hence, in addition to the R-SAT, ability estimates using IRT methodology were also

    obtained. Initially the Rasch and Rasch DIF models (Adams, Wilson & Wang, 1997; Moore,

    1996) were estimated in each form and ethnic group for which there was evidence of the Freedle

17 Formula scoring adjusts scores for the possibility of random guessing (Frary, 1988; Rogers, 1999).
18 This methodology, originally used by Freedle (2003), allowed expressing the number of correct responses (adjusted for random guessing) as a score that ranged from 200 to 800, just like the regular SAT Verbal score. The scores of White students are used as the reference because they have been considered the reference group in all DIF analyses.


    phenomenon, using all students from California public high schools who took the form (see

Table 3). These models were estimated using ConQuest (Wu, Adams & Wilson, 1998), yielding the Rasch Model Ability Estimate and the Rasch DIF Model Ability Estimates, respectively. In addition, an IRT version of Freedle's revised SAT was estimated by considering

    only the hardest half of the items in each test form (Hard Half Ability Estimate using the Rasch

    Model).19,20

    In total, three IRT ability estimates were obtained for each African American or

    Hispanic student.

    While the ability estimates obtained from the Rasch model are a direct (but non-linear)

    transformation of the sum of correct responses, they differ from the original SAT score in that

the IRT ability estimates account for guessing by using the formula score. The ability estimates obtained from the Rasch DIF model directly incorporate a parameter for DIF, and therefore explicitly consider the phenomenon Freedle described in the ability estimation. The third IRT

    ability estimate, obtained from estimating the Rasch model in only the hard half of the test,

    attempts to adjust the ability estimate for the phenomenon described by Freedle following

    exactly the same logic behind the methodology he proposed, but using IRT methods instead.

    Since each of these models is directly estimated for a specific ethnic group comparison, the

    ability estimates generated are not subject to the concerns expressed by Dorans and Zeller

    (Dorans, 2004; Dorans & Zeller, 2004a; 2004b) regarding the use of inverse regression and

aggregation of estimates from different ethnic groups. Because IRT scaling tends to produce ability estimates that are linearly related to the underlying ability measured, they may be more useful than aggregated scores when examining the linear relationship between test scores and

    19 See Author (Year, Appendix 1) for the model fit statistics for the Rasch, Rasch DIF and Hard Half models.

    20 The item difficulty estimates from the original Rasch DIF model were used to define the hardest half of the items.


    external variables (e.g., outcome measures) because IRT ability estimates are less subject to the

    ceiling and/or floor effects observed in aggregated scores (Thissen & Orlando, 2001; Xu &

    Stone, 2011).

    Predictive Validity Analyses

The predictive power of the regular SAT Verbal score, the R-SAT score, and the three IRT ability estimates was compared for African American, Hispanic, and White students. Linear

    regression was used for GPA prediction and logistic regression was used for the prediction of

    graduation because UC GPA is a continuous numerical variable and graduation is a dichotomous

    outcome variable.21

    The ordinary least squares method was used for estimating linear regressions

    and the maximum likelihood technique was implemented for the estimation of logistic

regression. The college outcomes examined were the first- through fourth-year annual UC GPA, the cumulative fourth-year UC GPA, and whether students graduated by their fourth year at UC.

    The academic outcomes included in this study are of particular interest because they are

not limited to grade point averages and they span four years of the college careers of students taking the SAT in 1994 and 1999. Most research in this area has been limited to examining the predictive validity of standardized test scores and high school grades for short-term academic outcomes, especially grades.

    The analyses controlled for academic and sociodemographic variables found to be

    significant in previous college prediction research (Geiser & Studley, 2002; Author, Year;

Rothstein, 2004; Zwick et al., 2004). The sociodemographic variables included parents'

21 Although Bridgeman, Pollack and Burton (2004) find evidence suggesting a potential non-linear relationship between college grades and test scores, Rothstein (2004) does not find evidence along this line. Exploratory analyses conducted in this research sample did not provide evidence to support a non-linear relationship between first-year college grades and SAT scores.


    education and income level from the UC systemwide admissions and performance data. The

    academic variables included a weighted high school GPA, calculated with up to eight honors-

    level courses, the SAT Math score22

    and the school academic performance index expressed as

    quintile ranks for students who took the SAT in 1999. The school academic performance index

    information was not available for the students who took the SAT in 1994 because the index was

    calculated for the first time in 1998.23

    Equations 1, 2 and 3 show the general regression equation models for the prediction of

    annual UC GPA, cumulative fourth-year UC GPA and fourth-year UC graduation respectively.

UCGPA_i = \beta_1 + \beta_2 APIQ + \beta_3 Educ + \beta_4 Inc + \beta_5 HSGPA + \beta_6 SATM + \beta_7 Z_i    (1)

CUMUCGPA_4 = \beta_1 + \beta_2 APIQ + \beta_3 Educ + \beta_4 Inc + \beta_5 HSGPA + \beta_6 SATM + \beta_7 Z_i    (2)

\mathrm{LOGIT}(GRAD_4) = \beta_1 + \beta_2 APIQ + \beta_3 Educ + \beta_4 Inc + \beta_5 HSGPA + \beta_6 SATM + \beta_7 Z_i    (3)

    where

UCGPA_i is the grade point average that a student had in year i of college, where i ranges between 1 and 4;

CUMUCGPA_4 refers to the cumulative grade point average at the fourth college year;

GRAD_4 is a binary variable indicating graduation by the fourth year of college, where 1 indicates a student has graduated and 0 indicates a student who has not graduated;

APIQ refers to the ranking of the school in the California Academic Performance Index;

Educ is the maximum number of years of education achieved by the parents as reported in the UC application;

Inc refers to the family income (expressed in dollars) as reported in the UC application;

22 Different ability estimates/scores for the Verbal section were also included; an explanation is given below.

    23 Regression models excluding API rank as explanatory variables are included in Author (Year, Appendix 5).


HSGPA is the weighted high school GPA considering up to eight honors-level courses;

SATM is the original score obtained in the SAT Math test; and

Z_i refers to different indices of verbal ability. For each of the three regression models

    there were five versions which differed in the verbal ability index included. In the first version of

    each model (models 1.1, 2.1 and 3.1 in the tables) the verbal ability index is the SAT Verbal

    score. The second version of each model uses the original SAT score for White students and the

    highest score between the revised SAT Verbal score and the original SAT score for minority

    students (models 1.2, 2.2 and 3.2 in the tables). The third and fourth versions of the models

    include the Verbal ability estimates from the Rasch (models 1.3, 2.3 and 3.3 in the tables) and

Rasch DIF models, respectively (models 1.4, 2.4 and 3.4 in the tables). Lastly, the fifth version of the models considers the Verbal ability estimate obtained from estimating the Rasch model using

    only the hardest half of the Verbal items (models 1.5, 2.5 and 3.5 in the tables).
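As an illustration of how these specifications translate into practice, the following is a minimal sketch using the statsmodels formula API. The data frame and lower-case column names (apiq, educ, inc, hsgpa, satm, z_verbal, ucgpa1, grad4) are hypothetical stand-ins for the variables defined above, not the study's actual variable names:

import pandas as pd
import statsmodels.formula.api as smf

def fit_models(df: pd.DataFrame):
    """Fit one version of models (1) and (3) for a given verbal index.

    z_verbal stands in for whichever verbal index a model version uses:
    the SAT Verbal score, max(SAT, R-SAT), or one of the IRT estimates.
    """
    # Model (1): first-year UC GPA, estimated by ordinary least squares.
    gpa_model = smf.ols(
        "ucgpa1 ~ apiq + educ + inc + hsgpa + satm + z_verbal", data=df
    ).fit()

    # Model (3): graduation by the fourth year, a maximum-likelihood logit.
    grad_model = smf.logit(
        "grad4 ~ apiq + educ + inc + hsgpa + satm + z_verbal", data=df
    ).fit()

    return gpa_model, grad_model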

    The model presented in the text includes only SAT I Verbal and SAT I Math scores as

    explanatory variables, and not SAT II scores, as most higher education institutions require only

the SAT I (or ACT) exam, so results from these models will be more generalizable to other institutions. Regressions including SAT II test scores as explanatory variables are included in

    Author (Year, Appendix 4) and do not offer stronger evidence in support of the R-SAT Verbal

    test score.

The analyses could not control for the effect of discipline or campus on the dependent variable due to the small sample size of minority groups (Brown & Zwick, 2006). Sample size also limited our ability to properly model the within- and between-school variation in high school GPA and API quintile (Zwick & Green, 2007). In addition, it is important to note that, as in most predictive validity studies, conclusions from this research are


    necessarily limited because many people who did not attend selective colleges might have

    matriculated at such schools if their R-SAT Verbal scores had been used in the admission

    process.

The analyses compared the explained variance as well as the size and statistical significance of the standardized coefficients across models. The explained variance was measured by the adjusted R^2 statistic (Singer & Willett, 2003), an alternative to the R^2 that considers the number of variables included in the model. The adjusted R^2 statistic is presented below:

\text{Adj } R^2 = 1 - \frac{n-1}{n-p}\,(1 - R^2)

    where

    n is the sample size

    p refers to the number of parameters in the model

In logistic regression there is no precise counterpart to the R^2 or adjusted R^2 used in linear regression. Several measures of goodness of fit have been proposed; Nagelkerke's maximum-rescaled R^2, denoted \tilde{R}^2, is used here. The statistic, given below, can achieve a maximum value of 1:

\tilde{R}^2 = \frac{R^2}{R^2_{\max}}

where

R^2 = 1 - \left\{\frac{L(0)}{L(\hat{\beta})}\right\}^{2/n}.

R^2 achieves a maximum of less than 1 for discrete models, where the maximum is given by R^2_{\max} = 1 - \{L(0)\}^{2/n},

L(0) is the likelihood of the intercept-only model,


L(\hat{\beta}) is the likelihood of the specified model, and

    n is the sample size.
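For concreteness, both fit statistics can be computed directly from quantities that standard regression software reports. The following is a minimal sketch, assuming the sample size, parameter count, R^2, and model log-likelihoods are already in hand:

import math

def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 = 1 - [(n-1)/(n-p)] * (1 - R^2)."""
    return 1.0 - (n - 1) / (n - p) * (1.0 - r2)

def nagelkerke_r2(ll0: float, ll: float, n: int) -> float:
    """Nagelkerke's maximum-rescaled R^2 from log-likelihoods.

    ll0: log-likelihood of the intercept-only model, log L(0).
    ll:  log-likelihood of the specified model, log L(beta-hat).
    """
    # R^2 = 1 - {L(0)/L(beta-hat)}^(2/n), written with logs for stability.
    r2 = 1.0 - math.exp(2.0 / n * (ll0 - ll))
    # Its maximum for discrete outcomes: 1 - {L(0)}^(2/n).
    r2_max = 1.0 - math.exp(2.0 / n * ll0)
    return r2 / r2_max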

    Standardized regression coefficients, or beta weights, show the relative strength of

    different predictor variables within a regression equation; the weights represent the number of

    standard deviations that an outcome variable changes for each one standard deviation change in

    any given predictor variable, all other variables held constant. A standardized regression

    coefficient is computed by dividing a parameter estimate by the ratio of the sample standard

    deviation of the dependent variable to the sample standard deviation of the regressor.
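Equivalently, in code, with b a raw estimate and x and y hypothetical predictor and outcome arrays:

import numpy as np

def beta_weight(b: float, x: np.ndarray, y: np.ndarray) -> float:
    """Standardize a raw coefficient b by dividing it by sd(y)/sd(x),
    which equals b * sd(x) / sd(y)."""
    return b / (np.std(y, ddof=1) / np.std(x, ddof=1))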

    Results

    This section presents the results of this research in three parts. The first two sections refer

    to the calculation of the R-SAT and its predictive validity compared to the SAT, including its

performance on the issue of over- and underprediction. The third section offers the predictive

    validity findings related to the IRT ability estimates.

Freedle's Revised SAT Verbal Score

    Table 3 shows the number of students from California public high schools who originally

    took each test form and for whom the adjusted scores were calculated. The adjusted scores were

    calculated for a total of 3,922 Hispanic examinees and 2,234 African American examinees.

    INSERT TABLE 3 HERE

    The R-SAT Verbal score mean is higher than the original mean SAT Verbal score in all

    ethnic groups and test forms (see Author (Year, Appendix 2) for details). On average, the R-SAT


    Verbal score increases the mean performance of African American students from 382.5 to 407

    (6.4%) and the mean performance of Hispanic students from 471.6 to 484.0 (2.6%).

Table 4 displays greater detail about whether and how the R-SAT Verbal score benefits minority students. Note that the bottom three rows represent the students who benefit from the use of the R-SAT Verbal score. We observe that 68% (a total of 1,537 out of 2,234) of African American examinees improve their scores when the R-SAT Verbal score is considered in place of the SAT Verbal score. The same occurs for 58% (a total of 2,271 out of 3,922) of the Hispanic sample. In addition, the R-SAT Verbal tends to benefit students at the low end of the original SAT Verbal score distribution. While most examinees increase their scores by between 0 and 50 points, the increment reaches as high as 202 points in a number of cases. On average, however, the score increase is not as large as Freedle described it to be.

    INSERT TABLE 4 HERE

In order to assess the impact of the revised SAT score on the admissions decisions of minority students, Freedle estimated and compared the number of African American students

    who would be offered admission at competitive colleges when considering each score. Freedle

    hypothesized that receiving an R-SAT score of at least 600 would be sufficiently meritorious to

    interest many colleges in an applicant who received such a score.24

    He found that by considering

    the revised SAT score instead of the original SAT score the number of African Americans

    scoring over 600 in two of the forms he analyzed increased from 166 to 235 (Form 4I) and from

24 Freedle chose to consider an SAT score of 600 or above as meritorious because students whose high school grade point average is between the 97th and 100th percentiles receive an average SAT Verbal score of 610 and, in addition, a score of 600 also reflects a level of test performance that only about 5 percent of the test-taking population achieves under the normal SAT scoring procedures (Freedle, 2003).


117 to 167 (Form OB023), which was equivalent to increases in admission to selective colleges of 42 percent and 43 percent, respectively.

The analyses reported here show an effect in the same direction Freedle described; however, the impact on the number of African American students whose admissions are likely to have changed is more modest. When using the maximum of the SAT and the R-SAT Verbal scores, the number of African American students scoring over 600 increases from 79 to 86. This represents an increase of 8.9% over the original number of African American students in the sample scoring over 600 (see Table 5), or an increase from 3.5% to 3.8% of all African Americans. When considering both African American and Hispanic students, the number of students scoring over 600 increases from 458 (7.4% of all minority students) to 516 (8.3% of all minority students), which is equivalent to an increase of 12.6%.

    Overall, 7.4% of minority examinees score over 600. In comparison, 3,889 White

students, or 19.7% of all White examinees, score 600 or above and receive an average score of 653.

    INSERT TABLE 5 HERE

The consideration of a different cut-off score would only result in significant benefit for minorities if it were drastically reduced. More than 60% of the African American and Hispanic students considered in this analysis would receive an R-SAT Verbal score below 450; therefore, only a cut-off score around or below this level would result in a different admission decision. Such a drastic reduction in score level, however, does not seem consistent with the assumption of admission to highly competitive colleges.


The analyses presented in Table 5 regarding the impact of Freedle's R-SAT on admissions decisions, as well as the subsequent analyses of the R-SAT's predictive validity, consider the maximum of the SAT Verbal score and the R-SAT Verbal score for minority students, and not just the revised SAT score. This is done in consideration of Freedle's own recommendation:

the solution is to recognize that this is a pervasive phenomena that can be easily remedied by reporting two scores, the usual SAT and the R-SAT. (Freedle, 2003)

Since Freedle recommends reporting both scores and interprets the difference between them as the difference between the White majority's culture and the cultural background of minority groups, the consideration of the maximum of the two scores represents the least disadvantageous scenario in which minority groups might compete for admission into selective colleges.

    Predictive Validity of the Revised SAT Verbal Score

This section presents the results on the predictive capacity of the revised SAT Verbal score. Its capacity to predict short- and long-term academic outcomes is compared to that of the original SAT Verbal score by ethnic group and academic outcome. It is important to keep in mind that although the results are presented side-by-side for three ethnic groups, the main focus of this investigation was to compare the goodness-of-fit statistics and parameter estimates within ethnic groups, especially within minority groups. The results for the White student sample are presented as a comparison with minority students' results.

In order to increase the sample size, the R-SAT Verbal scores for all SAT forms were combined. This aggregation was possible because performance on each form was previously scaled by ETS.25 The aggregation also assumes that the four SAT forms were equated during test development.26 The inclusion of the school ranking in the model, however, meant that only students taking the 1999 forms (IZ and VD) were included in the analysis.27

25 Scaling refers to a psychometric process conducted to achieve comparability among test scores from different test forms.

26 Equating is a process different from scaling and aims to adjust for differences in difficulty among test forms. For an introduction to traditional scaling and equating methods, please see Kolen (1988).

27 The maximum of the original SAT and the R-SAT Verbal scores was used for minority students. Models using just the R-SAT Verbal score and excluding school ranking as explanatory variables are presented in Author (Year, Appendix 5); they result in findings similar to the ones displayed in this section and do not provide stronger evidence in favor of the R-SAT score.

Table 6 shows the adjusted R² for the multivariate models estimated within each ethnic group. The overall predictive power of the models examined varies depending on the academic outcome and ethnic group. In general, the models predict college grades better for White students than for minority students. While the capacity to predict annual college grades for all groups tends to decline over time, the overall prediction of cumulative fourth-year grade point average is unexpectedly high for White and Hispanic students. In addition, and only for White students, the prediction of fourth-year graduation is significantly weaker than the prediction of college grades. Interestingly, this is not the case for African American and Hispanic students. The models' capacity to predict long-term outcomes, such as fourth-year cumulative grade point average and four-year graduation, is surprising considering that these indices are measured four years into the students' college careers. Long-term outcomes are often assumed to be affected by variables different from those included here, such as financial aid and previous experience in college (Wilson, 1983; Reason, 2009).
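As a reminder of how the values reported in Table 6 (and in later tables) can fall below zero, the adjusted R² applies a sample-size penalty to the ordinary R²:

$$R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1},$$

where n is the sample size and p the number of predictors. When n is small and the predictors explain little variance, the penalty can push the adjusted value below zero, which is relevant for the small African American subsamples analyzed here.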

    25Scaling refers to a psychometric process conducted to achieve comparability among test score from different test

    forms.

    26Equating is a process different from scaling and aims to adjust for differences in difficulty among test forms. For

    an introduction to traditional scaling and equating methods please Kolen (1988).

    27 The maximum score between the original SAT and the R-SAT Verbal score was used for minority students.

    Models using just R-SAT Verbal score and excluding school ranking as explanatory variables are presented in

    Author (Year, Appendix 5) and result in findings similar to the ones displayed in this section. They and do not

    provide stronger evidence in favor of the R-SAT score.


    INSERT TABLE 6 HERE

In general, the adjusted R² values for Hispanic and White students are consistent with the results reported by similar studies (Author, Year; Author, Year; Geiser & Studley, 2002; Zwick et al., 2004). The power to predict college GPA for African American students, though, is below what has been reported by other studies and below the power to predict college GPA for the other two ethnic groups; we believe this is in part an artifact of the small sample size. Geiser & Studley (2002), for example, reported R² values closer to 10% for African American students (p. 15). When predicting graduation, however, the models predict better for African Americans than for White and Hispanic students.

Table 6 shows that the capacity to predict college outcomes using the R-SAT Verbal score is close to, but slightly lower than, the predictive power achieved when using the original SAT score. The R-SAT Verbal score predicts better than the original SAT score in only two cases, and just for the African American group: fourth-year college grade point average and fourth-year cumulative grade point average. The difference in predictive power, though, does not seem of large practical significance. It ranges between 0 and 1 percentage point, and the maximum increase in predictive capacity is only 0.59%.

The relatively weaker capacity to predict college outcomes associated with the use of the R-SAT can also be observed in Tables 1, 2 and 3 in Author (Year, Appendix 3), which show the standardized coefficient estimates and their statistical significance (p-values) when predicting first-year UC GPA, cumulative fourth-year UC GPA and fourth-year graduation by ethnic group. They also present the adjusted R² for each regression and its sample size. In Author (Year, Appendix 3) we also discuss the results associated with the other explanatory variables included in the regression models, which are similar to the findings from previous literature.

Over- and Underprediction of Freshman Grades

Freedle suggested that the revised SAT score would help reduce the problem of over- and underprediction reported in the literature on the predictive validity of college admissions tests (Zwick et al., 2004; Zwick et al., 2002; Ramist et al., 1994; Ramist et al., 2001). The potential improvement in over- and underprediction obtained from using the revised SAT score rather than the original SAT score was assessed, and the results are presented in this section.

Under- or overprediction is usually assessed by fitting one general prediction model to college students from all ethnic groups and then summing the regression residuals for a particular ethnic group. In order to gauge the average individual over- or underprediction, the sum of residuals is then divided by the number of students in each ethnic group; a positive mean residual indicates underprediction (actual grades exceed predicted grades), while a negative one indicates overprediction. In this case, regression models 1.1 and 1.2 were estimated and the average residuals by ethnic group compared. All explanatory variables included in these models were described in the previous section.

$$\text{1stYRGPA}_i = \beta_1 + \beta_2\,\text{APIQ}_i + \beta_3\,\text{Educ}_i + \beta_4\,\text{Inc}_i + \beta_5\,\text{HSGPA}_i + \beta_6\,\text{SATM}_i + \beta_7\,\text{SATV}_i + \varepsilon_i \quad (1.1)$$

$$\text{1stYRGPA}_i = \beta_1 + \beta_2\,\text{APIQ}_i + \beta_3\,\text{Educ}_i + \beta_4\,\text{Inc}_i + \beta_5\,\text{HSGPA}_i + \beta_6\,\text{SATM}_i + \beta_7\,\max(\text{SATV}, \text{RSATV})_i + \varepsilon_i \quad (1.2)$$
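As an illustration of this procedure, the sketch below fits the joint model (1.1) and returns the mean residual per ethnic group; the DataFrame and column names are hypothetical stand-ins for the study's variables, not its actual data files.

```python
import pandas as pd
import statsmodels.api as sm

def mean_residuals_by_group(df: pd.DataFrame) -> pd.Series:
    """Fit joint model (1.1) and return the average residual per ethnic group."""
    predictors = ["APIQ", "Educ", "Inc", "HSGPA", "SATM", "SATV"]
    data = df.dropna(subset=predictors + ["GPA1", "ethnicity"])
    fit = sm.OLS(data["GPA1"], sm.add_constant(data[predictors])).fit()
    # Positive mean residual = the joint model underpredicts the group's
    # actual grades; negative = it overpredicts them.
    return fit.resid.groupby(data["ethnicity"]).mean()
```

Model (1.2) follows by replacing the SATV column with the maximum of the SAT and R-SAT Verbal scores.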

Table 7 shows the regression output for regression models 1.1 and 1.2 for all first-year UC students. The results are similar to those presented in Table 6 for White students. This is not surprising given that White students are the most numerous ethnic group included in the sample.28

We find underprediction of White students' grades (0.01) and overprediction of Hispanic (-0.025) and African American students' grades (-0.098) when using the SAT, just as previous research did (Ramist et al., 1994, 2001). On average, the overprediction is smaller than the one reported by Ramist et al. (1994) for African American students (-0.16) and larger than that reported by Geiser & Studley (2002) and by Zwick et al. (2004) for African American students, except for the 1998-1999 UCLA mega-cohort for the African American group.29 For Hispanic students the overprediction is smaller than the one reported by Ramist et al. (2001) (-0.13) and similar to some of the results reported by Zwick et al. (2004) (see, for example, the Berkeley 1996-1997 mega-cohort, the Irvine 1998-1999 mega-cohort, and the San Diego 1996-1997 mega-cohort).

We found no improvement in prediction accuracy from using the R-SAT Verbal score for minority groups. On the contrary, the prediction errors for minorities increased when using the maximum of the SAT and R-SAT Verbal scores, to 0.114 for African American students and 0.032 for Hispanic students, respectively.30

28 Although they also somewhat resemble the results obtained for the Hispanic subsample, the standardized coefficients associated with parents' education and income, as well as the overall R², are closer to those observed for the White students. See Author (Year, Appendix 3) for details.

29 We focused our attention on Zwick et al.'s model 6, which is the most similar to the analyses reported in this section.

30 The same analysis was conducted for fourth-year cumulative UC GPA, and the average underprediction for African American and Hispanic students increased as well (from 0.181 to 0.194 and from 0.033 to 0.040, respectively).


    Predictive Validity of IRT Ability Estimates

This section presents the results regarding the predictive power of the IRT ability estimates and compares those results to the predictive capacity of the R-SAT and original SAT Verbal scores. The IRT ability estimates include: (i) ability estimates obtained from fitting the Rasch model to all the test items, (ii) ability estimates obtained from fitting the Rasch DIF model to all the test items, and (iii) ability estimates obtained from fitting the Rasch model to only the hardest half of the items. These three ability estimates were obtained for all White, African American and Hispanic students.31
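To make these estimators concrete, the sketch below shows maximum-likelihood ability estimation under the plain Rasch model. It is an illustration under stated assumptions, not the study's ConQuest code: item difficulties are taken as already calibrated, the Rasch DIF variant would shift the difficulties of flagged items by a group-specific DIF parameter before solving, and the hardest-half estimate simply restricts the inputs to the most difficult items.

```python
import numpy as np
from scipy.optimize import brentq

def rasch_ability(responses: np.ndarray, difficulties: np.ndarray) -> float:
    """ML ability estimate (in logits) for one examinee under the Rasch model.

    responses: 0/1 item scores; difficulties: calibrated item difficulties.
    The ML estimate solves the score equation sum_j P_j(theta) = raw score.
    """
    raw = responses.sum()
    if raw == 0 or raw == len(responses):
        raise ValueError("no finite ML estimate for zero or perfect raw scores")

    def score_equation(theta: float) -> float:
        p = 1.0 / (1.0 + np.exp(-(theta - difficulties)))  # Rasch item probabilities
        return p.sum() - raw

    return brentq(score_equation, -8.0, 8.0)  # root of the score equation on a wide logit interval
```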

These analyses were conducted separately for each combination of ethnic group, academic outcome and test form in which the Freedle phenomenon was observed. This analysis structure translated into reduced sample sizes. Table 2 of this paper shows the ethnic groups and forms in which the relationship between item difficulty and DIF estimates was observed.

Test forms and ethnic groups could not be aggregated as in the R-SAT predictive validity analysis because the ConQuest estimation, especially that of the Rasch DIF model, generates one student ability estimate per ethnic comparison. In addition, ability estimates from different Rasch models, student samples and test forms cannot be directly aggregated because they are on different scales. Even if we assumed that test forms were equated during test development, information about the difficulty parameters of the items used in equating is not available, preventing the use of a common scale for all ability estimates.
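To see why a common scale is out of reach here, note that two separate Rasch calibrations identify ability only up to a linear transformation. With access to anchor items administered in both forms, one standard linking approach, the mean-sigma method, would recover it as

$$\theta^{*} = A\,\theta + B, \qquad A = \frac{\sigma(b^{*})}{\sigma(b)}, \qquad B = \mu(b^{*}) - A\,\mu(b),$$

where b and b* are the two calibrations' difficulty estimates for the common items; without those anchor-item parameters, A and B cannot be estimated.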

Two of the five output tables for form 1999 IZ are presented here (see Tables 8 and 9). Both tables display summary statistics of the analyses conducted using one of the most current forms analyzed (1999 IZ): (i) R² information for each of the models explaining a total of six dependent variables for the African American/White comparison and (ii) R² information for each of the models explaining a total of six dependent variables for the Hispanic/White comparison. Form 1999 IZ has the largest sample size. The two tables presented here are representative of the results obtained for the other test forms and ethnic groups (Hispanic students taking Form 1999 VD; African American students taking Forms 1994 QI and 1994 DX). The remaining output tables are included in Author (Year, pp. 161-164). Although there are differences in the overall predictive capacity by ethnic group, academic outcome and test form, the overall predictive validity results lead to the same conclusions as the findings presented in Tables 8 and 9.

31 This differs from the R-SAT analysis presented in the previous sections, in which the new score was computed only for minority students.

The predictive power of the multivariate regression models assessed is best when predicting college grades of White students, and the performance of the models decreases over time, with the exception of cumulative fourth-year GPA. College grades of minority students, especially African American students, are not predicted in any meaningful way. Surprisingly, the models under study predict fourth-year graduation better for African American and Hispanic students than for White students. This trend was already noted in the previous section. Negative adjusted R² values indicate very low explained variance in spite of the inclusion of a large number of parameters in the regression model.

Although the overall predictive power varies significantly by form, ethnic group and academic outcome, within the same ethnic group and academic outcome there is no practically significant difference among the predictive capacities achieved when using any of the three IRT ability estimates. In addition, there is no clear trend, as measured by the R², indicating the superiority of any of the IRT ability estimates, the original SAT score or the revised SAT score.


The small sample size and related instability of results allow us to present only a tentative conclusion about the small practical difference observed in the overall predictive power associated with the different IRT ability estimates, and how they fare in comparison to the original SAT score. In addition, there is some evidence suggesting that the Rasch and Rasch DIF ability estimates fare better in predicting short-term academic outcomes for minorities, while the original SAT score better predicts long-term outcomes for the same group.

    Discussion

The research presented in this article aimed to examine the predictive validity of the R-SAT score while addressing the methodological criticisms of the way Freedle obtained the different components used to calculate the R-SAT score (Dorans, 2004; Dorans & Zeller, 2004a, 2004b). We did so by using formula score rather than proportion correct as the basis for calculating the R-SAT score and by directly estimating students' ability using the Rasch and Rasch DIF models, both with all items and with only the hardest half of the items. This latter approach addressed the issues of inverse regression and aggregation of estimates from different ethnic groups.
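For reference, formula scoring is the classical correction for guessing used on the SAT of this era: a wrong answer to a five-option item subtracts a quarter of a point, while an omitted item subtracts nothing,

$$FS = R - \frac{W}{k - 1} = R - \frac{W}{4}, \qquad k = 5,$$

where R is the number of right answers and W the number of wrong answers. Proportion correct, by contrast, treats wrong answers and omits identically.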

Analyses presented above show that, in this sample, the R-SAT score helps minority students, although not as much as Freedle expected. On average, it increases scores by 24 points, or 6%, for African American students and by 12 points, or 2.5%, for Hispanic students. Using Freedle's assumptions, the consideration of the R-SAT would change the admissions decisions of minority students admitted into selective colleges by about 10%. This is much less than Freedle's prediction of an approximately 300% increase.32 The small increases in R-SAT scores are consistent with the magnitude of score increase reported by Dorans (2004) and Dorans & Zeller (2004a).

    32 Freedle identified an increase of 342% for Form 4I and an increase of 334% for Form OB023.


In addition, the predictive validity analyses show no significant difference in the capacity to predict short- and long-term outcomes when using either the original or the revised SAT score. Also, results show that the traditional problem of over- and underprediction would remain the same when using the revised SAT score.

Results from using the IRT ability measures are somewhat less straightforward but also support the conclusion that there is little practical difference in the overall predictive power associated with the different IRT ability estimates, and how they fare in comparison to the original SAT and R-SAT scores.

This research has several limitations. Among them is the fact that the predictive validity analyses were conducted on a group of students who were already accepted to college and therefore present significant restriction of range in some of the explanatory variables. In addition, many students who did not attend selective colleges might have matriculated at such schools if their R-SAT scores had been used in the admission process, but this limitation is also observed in other predictive validity studies (Geiser & Studley, 2002; Zwick, 2002; Zwick, Brown & Sklar, 2004; Zwick & Sklar, 2005). This consideration limits to some extent the validity of our findings. The use of inverse regression and the aggregation of different ethnic groups in order to obtain the R-SAT scores (though not the IRT ability estimates) are still subject to Dorans & Zeller's original criticisms. Recent changes to the content of the SAT and the inclusion of a mandatory Writing test may limit the generalizability of the findings presented here, since they were based on somewhat older test forms. Larger sample sizes for each minority group may be desirable for future research, especially for African American students; however, that would require combining data from a number of colleges and universities that exceeds the overall and minority sample sizes of the nine campuses of the University of California combined.


Furthermore, and despite the limited sample size of African American and Hispanic students, we were still able to observe results similar to those reported by previous research, such as the statistical significance and practical importance of high school grades for predicting college grades and graduation. These results provide support for the validity of our findings for these particular samples.

We think it is important to highlight the consistency of the results obtained in the numerous and diverse analyses implemented across African American and Hispanic students: no strong evidence in favor of the R-SAT score is observed when (a) recalculating the scores using only the most difficult items for minorities, (b) using that R-SAT score to directly predict short- and long-term outcomes with models that did and did not consider SAT II scores, (c) using models that did not control for school quality and allowed us to have larger sample sizes, (d) evaluating the over- and underprediction problem for minorities, and (e) using IRT ability estimates (considering all items, all items plus a DIF parameter, and only the hardest half of the items) to predict short- and long-term outcomes.

The findings presented in this article consistently reveal that there are minimal benefits associated with Freedle's R-SAT and suggest that, rather than using measures aimed to complement the SAT, efforts and energy should be directed toward studying the phenomenon behind the systematic relationship between item difficulty and DIF estimates (Author, Year) and directly addressing those issues during test development. The investigation of potential causes should include studies of Freedle's proposed explanation, the influence of academic versus home language (Freedle, 2010), including investigation of the cognitive processes of students while taking the test as well as quantitative analyses and modeling techniques (De Boeck, 2010). In addition, further research should investigate the sensitivity of Freedle's phenomenon to alternative forms of guessing, such as differential guessing strategies between White students and students from other ethnic groups.

These results also suggest that alternative policy options should be considered if the goal is to increase the representation of minority groups in higher education, especially at highly selective institutions (Bowen, Chingos & McPherson, 2009).33 Those options may include the use of school quality indices as input in the admissions process (Zwick & Himelfarb, 2011) and/or explicitly considering nonacademic outcomes as desirable college goals and adjusting the weights of admission indicators accordingly (Sinha, Oswald, Imus & Schmitt, 2011).

33 Bowen et al. (2009) use the term undermatching for the phenomenon by which students enroll in institutions that are less demanding than those they are qualified to attend. The phenomenon is described as most pronounced among well-qualified low-income and minority students, who enroll at two-year institutions or less-selective four-year institutions. Since college completion varies sharply with school selectivity, even after controlling for student characteristics, the phenomenon of undermatching results in minority students graduating from less-demanding colleges at lower rates than similar students at highly selective institutions.


Table 1: Descriptive Statistics. Overall Sample Taking SAT Forms and Subsample of Students Who Enrolled at UC.

Variable              N        Mean     Std. Dev.
Overall Sample
  SAT Composite       28,860   958      224
  HSGPA               28,367   3.23     0.45
  Income              25,678   56,853   30,239
  Max Ed Level        28,489   6.40     2.18
UC Applicant Sample
  SAT Composite       11,155   1,067    206
  HSGPA               11,016   3.47     0.36
  Income              9,866    62,550   30,779
  Max Ed Level        11,027   6.89     2.16
UC Enrolled Sample
  SAT Composite       4,804    1,098    195
  HSGPA               4,754    3.55     0.32
  Income              4,253    63,250   30,938
  Max Ed Level        4,749    6.93     2.19

    Source: College Board


Table 2: Presence of the Freedle Phenomenon According to the Standardization and Rasch Model Approaches, Across Forms and Ethnic Groups. Verbal Tests.*

Group                     Method                     1999 IZ   1999 VD   1994 QI   1994 DX
White, African American   Standardization Approach   YES       NO        YES       NO
                          Rasch Model                YES       NO        YES       YES
White, Hispanic           Standardization Approach   NO        NO        NO        NO
                          Rasch Model                YES       YES       NO        NO

    * Presence of the Freedle phenomenon is defined as a statistically significant and high (above 0.3) correlation.


Table 3: Number of Students for Whom the Revised Score Was Calculated and IRT Ability Parameters Estimated.

Group                        1999 IZ   1999 VD   1994 QI   1994 DX   Total
White Examinees              6,548     6,682     3,360     3,188     19,778
Hispanic Examinees           1,904     2,018     -         -         3,922
African American Examinees   854       -         671       709       2,234


Table 4: Distribution of Score Difference by Ethnic Group and Corresponding Mean SAT Verbal Score. Overall Sample.

Difference Between             African American Examinees           Hispanic Examinees
R-SAT Verbal and SAT           Number   Percentage   Mean SAT       Number   Percentage   Mean SAT
Verbal Scores (both                                  Score                                Score
end points included)
[-106, -101]                   -        -            -              2        0%           515.0
[-100, -51]                    39       2%           433.6          95       2%           506.2
[-50, 0]                       658      29%          438.7          1,554    40%          518.4
[0, 49]                        966      43%          396.2          1,704    43%          468.9
[50, 101]                      452      20%          301.6          418      11%          370.0
[100, 210]                     119      5%           251.7          149      4%           276.1
TOTAL                          2,234    100%         382.5          3,922    100%         471.6


Table 5: Number of Examinees Scoring 600 or Above in the Sample and Their Mean Scores.

                                  Scoring over 600,   Mean SAT   Scoring over 600, Max of   Mean of Max of SAT V   Total Examinees
Ethnic Group                      SAT Verbal          Verbal     SAT V and R-SAT V          and R-SAT V            in the Sample
African American Students         79                  637        86                         643                    2,234
African American and
  Hispanic Students               458                 645        516                        648                    6,156
White Students                    3,889               653        -                          -                      19,778


Table 6: Overall Predictive Power of the Original SAT Verbal Scores and the Maximum of the SAT Verbal and Revised SAT Verbal Scores. Multivariate Regression Models.

                       UCGPA 1st Year                            UCGPA 2nd Year
Score                  African Am.   Hispanic   White            African Am.   Hispanic   White
SAT V                  2.15%         15.36%     21.24%           0.18%         13.16%     16.55%
Max [SATV or RSATV]    1.66%         15.00%     -                0.07%         12.40%     -
N                      78            597        2,253            73            540        2,120

                       UCGPA 3rd Year                            UCGPA 4th Year
SAT V                  -4.39%        8.13%      12.92%           4.81%         5.01%      13.11%
Max [SATV or RSATV]    -5.19%        7.27%      -                4.94%         4.38%      -
N                      67            497        1,964            64            476        1,904

                       UC Cum. GPA 4th Year                      UC Graduation by 4th Year*
SAT V                  0.12%         15.18%     20.68%           15.97%        13.35%     6.91%
Max [SATV or RSATV]    0.71%         14.28%     -                15.08%        13.13%     -
N                      65            481        1,927            78            613        2,314

* Pseudo R² is reported for the logistic regression used to predict fourth-year graduation.


Table 7: Predictive Power of First-Year UC GPA: A Joint Regression Equation. Standardized Estimates and Statistical Significance.

Regression   API        Parents'    Income   HS      SAT      Max [SATV or    SAT      Adjusted   N
Model        Quintile   Education   Level    GPA     Math     R-SAT Verbal]   Verbal   R²
1.1          0.102      0.098       0.039    0.330   -0.021   -               0.191    23.85%     2,928