Automated Writing Evaluation (AWE):
Past, Present and Prospect
Dr. Li Zhang (张荔), Shanghai Jiao Tong University
Shanghai, China
Outline
Introduction to major AWE systems
Introduction to JUKU, an AWE system developed in China
Prospects for the future development of AWE
Some of the most widely used AWE systems
PEG (Project Essay Grader)
IEA (Intelligent Essay Assessor)
E-rater
IntelliMetric
BETSY (Bayesian Essay Test Scoring sYstem)
PEG
Ellis Page, Duke University
1966
uses correlation to predict the intrinsic quality of essays (Chung & O’Neil, 1997).
Trins: intrinsic variables such as fluency, grammar, punctuation
Proxes: the surface features related to the intrinsic variables, such as word length, part of speech or word meaning (Page & Peterson, 1995).
Essay evaluation process
PEG is trained on a sample of more than 300 essays to obtain text features, which are analyzed to establish their correlation with human ratings.
Proxes are determined for each essay and entered into the prediction equation, whose beta weights were obtained through regression analysis.
A score is assigned to the essay by applying the beta weights (coefficients) (Chung & O'Neil, 1997).
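A minimal sketch of this prox-based regression, assuming three toy surface features (essay length, mean word length, comma count) in place of PEG's actual proxes:

```python
# PEG-style scoring sketch: regress human scores on surface "proxes",
# then score a new essay with the resulting beta weights.
# The three features below are illustrative stand-ins, not PEG's real proxes.
import numpy as np

def proxes(essay):
    words = essay.split()
    return [len(words),                               # length as a fluency proxy
            sum(len(w) for w in words) / len(words),  # mean word length
            essay.count(",")]                         # punctuation proxy

train_essays = [
    "A short essay.",
    "A somewhat longer essay, with more words in it.",
    "An essay of moderate length, with a few clauses, and some variety.",
    "An elaborate essay, carefully punctuated, with considerable length, detail, and lexical variety.",
]
human_scores = np.array([2.0, 3.0, 4.0, 5.0])   # toy human ratings

X = np.array([proxes(e) for e in train_essays], dtype=float)
X = np.column_stack([np.ones(len(X)), X])       # add intercept term
beta, *_ = np.linalg.lstsq(X, human_scores, rcond=None)  # beta weights

new_essay = "A new essay, of moderate length, waiting to be scored."
x = np.array([1.0, *proxes(new_essay)])
print(f"Predicted score: {x @ beta:.2f}")
```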
Figure 1: PEG system scoring process. Shadowed blocks refer to major sources of variations. Barred blocks indicate results of computations (Cited from Chung & O’Neil, 1997, p. 7)
IEA
Thomas Landauer and his colleagues, University of Colorado
late 1990s
Latent Semantic Analysis (LSA; Lemaire & Dessus, 2001).
Essay evaluation process
The IEA system is trained on domain-representative texts.
These texts and the new essay are represented as vectors.
The conceptual relevance of the essay to the texts is compared using LSA.
The texts most similar to the essay are selected and weighted by cosine similarity to obtain a score, which is taken as the final score of the essay (Landauer, Laham, & Foltz, 2003).
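A minimal sketch of the LSA step, using scikit-learn's TruncatedSVD over a TF-IDF term-document matrix; the pre-scored reference texts and the top-2 weighting are toy assumptions:

```python
# IEA-style scoring sketch: project texts into a latent semantic space and
# score a new essay by the cosine-weighted average of the most similar
# pre-scored texts. Training texts and scores are toy assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

train_texts = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "Plants use sunlight, water and carbon dioxide to produce glucose.",
    "The water cycle moves water between oceans, atmosphere and land.",
    "Evaporation and condensation drive precipitation in the water cycle.",
]
train_scores = np.array([5.0, 4.0, 4.5, 3.5])

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)

svd = TruncatedSVD(n_components=2)        # the latent semantic space
X_lsa = svd.fit_transform(X)

new_essay = "Sunlight lets plants make glucose from water and carbon dioxide."
e_lsa = svd.transform(vectorizer.transform([new_essay]))

sims = np.clip(cosine_similarity(e_lsa, X_lsa)[0], 0, None)
top = np.argsort(sims)[-2:]               # the two most similar texts
score = np.average(train_scores[top], weights=sims[top])
print(f"Predicted score: {score:.2f}")
```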
E-rater
Educational Testing Service (ETS)
1990s
E-rater uses Natural Language Processing (NLP) techniques, a vector-space model and a linear regression model.
E-rater includes a syntactic module, a discourse module, and a topical-analysis module.
Essay evaluation process
Uses linear regression analysis to process texts scored by human raters.
Determines the optimal weighting model that can predict the human ratings.
Uses NLP to identify features in an essay and combines them into feature scores.
Generates a score by applying the weighting model to the feature scores (Enright & Quinlan, 2010).
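A minimal sketch of the weighting step; the three module-level feature scores are hypothetical placeholders, since e-rater's actual features are proprietary:

```python
# E-rater-style sketch: learn a weighting model over module-level feature
# scores from human-rated essays, then apply it to a new essay.
# Feature columns are hypothetical: [syntactic, discourse, topical-analysis].
import numpy as np
from sklearn.linear_model import LinearRegression

feature_scores = np.array([
    [0.6, 0.5, 0.7],
    [0.8, 0.7, 0.9],
    [0.3, 0.4, 0.2],
    [0.9, 0.8, 0.8],
])
human_ratings = np.array([3.0, 4.5, 2.0, 5.0])   # toy human ratings

model = LinearRegression().fit(feature_scores, human_ratings)  # weighting model

new_essay_features = np.array([[0.7, 0.6, 0.8]])
print(f"Predicted rating: {model.predict(new_essay_features)[0]:.2f}")
print("Optimal feature weights:", model.coef_)
```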
Criterion
Criterion is an online essay scoring and evaluation system that relies on 1) e-rater to score an essay, and 2) Critique writing analysis tools to provide detailed evaluation and feedback on language, discourse, content, etc. (Dikli, 2006).
IntelliMetric
Vantage Learning
1998
A cognitive model of information processing and understanding
Core technology: artificial intelligence, NLP, computational linguistics, statistics, machine learning, CogniSearch and Quantum Reasoning (Elliot, 2003)
Evaluates an essay by features of content and structure (Vantage Learning, 2005)
Five categories:
focus and unity
development and elaboration
organization and structure
sentence structure
mechanics and conventions
(Vantage Learning, 2005).
Essay evaluation process
Preprocesses the electronic form of the essay to make sure it is readable by IntelliMetric.
Extracts information from the essay by using NLP.
Transforms the information into numerical form to support computation of the mathematical models.
Applies the mathematical understanding to a new essay and integrates the information to yield the final score (Vantage Learning, 2005).
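IntelliMetric's internals are proprietary, but the general pattern described above (numeric features feeding multiple mathematical models whose judgments are integrated) can be sketched as follows; every detail here is an assumption:

```python
# Sketch of the IntelliMetric-style pattern: several mathematical models
# score the same numeric representation of an essay, and their judgments
# are integrated (here, simply averaged) into one final score.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# Numeric form of training essays: [word count, mean word length, paragraphs]
X_train = np.array([[120, 4.1, 2], [340, 4.8, 5], [80, 3.9, 1], [260, 4.5, 4]])
y_train = np.array([2.5, 5.0, 2.0, 4.0])          # human ratings (toy data)

models = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=2, random_state=0)]
for m in models:
    m.fit(X_train, y_train)

x_new = np.array([[200, 4.3, 3]])                 # a new essay, in numeric form
final = np.mean([m.predict(x_new)[0] for m in models])
print(f"Integrated score: {final:.2f}")
```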
Figure 4: Architecture of IntelliMetric (cited from Vantage Learning, 2005, p. 12)
MY Access!
MY Access! is a web-based writing assessment tool that relies on IntelliMetric to provide students with a writing environment that offers immediate scoring and diagnostic feedback (Vantage Learning, 2005).
BETSY
Lawrence M. Rudner, University of Maryland, 2002
Two models based on Bayes' theorem: the Multivariate Bernoulli Model and the Multinomial Model.
Core idea: classification of essays on the basis of about 1,000 training texts (Valenti et al., 2003).
This classification is based on essay features including content related features and form related features.
BETSY uses the models to analyze features such as:
specific words and phrases
frequency of certain content words
number of words
sentence length
number of verbs
the order in which concepts are presented
the occurrence of specific noun verb pairs (Rudner & Liang, 2002).
and categorizes new texts into four groups: Advanced, Proficient, Basic, Below Basic.
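A minimal sketch of the Multinomial Model side, using scikit-learn's MultinomialNB over word counts; the training texts and labels are toy assumptions:

```python
# BETSY-style sketch: a multinomial naive Bayes classifier assigns a new
# essay to one of four score categories from word-frequency features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "The argument is developed with clear evidence and precise vocabulary.",
    "The essay states a position and gives some supporting reasons.",
    "The essay has a topic but few reasons and many repeated words.",
    "Words are strung together with no clear topic or support.",
]
labels = ["Advanced", "Proficient", "Basic", "Below Basic"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)      # word-frequency features

clf = MultinomialNB().fit(X, labels)           # Bayes' theorem under the hood

new_essay = "The essay gives a clear position with several supporting reasons."
print(clf.predict(vectorizer.transform([new_essay]))[0])
```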
Reliability

| System | Correlation | Agreement | Citation |
|---|---|---|---|
| PEG | 0.72-0.78 | | Page & Peterson, 1995 |
| IEA | 0.85 | | Landauer et al., 2000 |
| E-rater | 0.73-0.93 | 87%-97% | Burstein et al., 2004 |
| IntelliMetric | 0.83 | 94%-98% | Elliot, 2002 |
| BETSY | | 80% | Rudner & Liang, 2002 |
SEAR (Christie, 1999)
APEX (Lemaire, 2001)
PS-ME (Mason & Grove-Stephenson, 2002)
ATM (Callear et al., 2001)
C-rater (Leacock, 2003)
eGrader (Byrne et al., 2010)
MaxEnt (Sukkarieh & Bolge, 2010)
Writing Roadmap (Rich et al., 2013)
LightSIDE (Mayfield & Rose, 2013)
Crase (Lottridge et al., 2013)
...
JUKU
JUKU (http://www.pigai.org/) was developed by Chinese researchers in 2010.
It is used by more than 200 universities in China.
Based on corpus and cloud-computing technology.
Measures the comparative distance between a student's essay and standard corpora.
Each essay is measured along 192 dimensions within the categories of vocabulary, sentence, discourse and content.
Provides reports with scores, overall comments and line-by-line feedback.
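A minimal sketch of the comparative-distance idea, with four made-up dimensions standing in for JUKU's 192:

```python
# JUKU-style sketch: score an essay by its distance from the mean feature
# vector of a standard corpus. The four dimensions are made-up stand-ins
# for JUKU's 192 dimensions.
import numpy as np

# Rows: reference essays from the standard corpus.
# Columns: [vocabulary, sentence, discourse, content] dimension scores.
corpus_features = np.array([
    [0.82, 0.75, 0.70, 0.80],
    [0.78, 0.80, 0.65, 0.75],
    [0.85, 0.72, 0.74, 0.82],
])
corpus_mean = corpus_features.mean(axis=0)

essay_features = np.array([0.70, 0.68, 0.60, 0.72])

distance = np.linalg.norm(essay_features - corpus_mean)  # comparative distance
score = 100 * max(0.0, 1.0 - distance)                   # nearer corpus, higher score
print(f"Distance: {distance:.3f}  Score: {score:.1f}")
```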
Reliability
Reliability analysis based on 1,456 essays written by students at Nanjing University.
Agreement: 92%
Complete + adjacent agreement (15 points, 5 levels): 93.37% (Zhang, unpublished)
Correlation: less than 0.7
Combination of machine and human evaluation
Teacher feedback
Peer feedback
Recommendation
Praise
Comments
Prospect
Writing evaluation, whether by human raters or automated scoring, should satisfy two conditions: 1) the rubric should reflect the essential aspects of writing competence and 2) the ratings should be consistent with the rubric (Weigle, 2013).
Cope et al. (2011) propose "an alternative potential for NLP based on an understanding of the writing process as a fluid, iterative struggle to make meaning" (p. 87).
Figure 5: CBAL writing competency model (Deane, et al., 2011, p. 3)
DIP, University of California (Warschauer, 2014)
Give formative feedback on each draft
Provide feedback on a wide range of student writing
Use LightSIDE to encourage new feature extraction
Use machine learning technology to provide an open-source engine that can adjust to new problems
Incorporate "machine-student dialogue" and an "intelligent tutoring system"
Summary
1) the design of AWE systems that help improve learners' cognitive and critical-thinking abilities;
2) a shift of emphasis from the language and structure of an essay to its ideas, thinking and rhetorical effectiveness;
3) the evaluation of different genres of writing, including both arts and science articles;
4) the development of new software engines that can provide formative feedback on student writing;
5) the use of machine learning technology to design AWE systems that provide open-source engines for adjusting to new problems;
6) the use of machine-human dialogue and intelligent tutoring systems to enhance the effect of feedback;
7) the cooperation of different disciplines in the development of AWE:
writing teachers,
test developers,
cognitive psychologists,
psychometricians,
and computer scientists
References
Burstein, J., Chodorow, M., & Leacock, C. (2004). Automated essay evaluation: The Criterion online writing service. AI Magazine 25: 27-35.
Chung, K. W. K., & O’Neil, H. F. (1997). Methodological approaches to online scoring of essays. Retrieved from http://www.cse.ucla.edu/products/reports/tech461.pdf
Cope, B., Kalantzis, M., McCarthey, S., Vojak, C., & Kline, S. (2011). Technology-mediated writing assessments: Principles and processes. Computers and Composition 28: 79–96.
Deane, P., Quinlan, T., & Kostin, I. (2011). Automated scoring within a developmental, cognitive model of writing proficiency. Princeton, NJ: Educational Testing Service.
Elliot, S. (2002). A study of expert scoring, standard human scoring and IntelliMetric scoring accuracy for statewide eighth grade writing responses. Newtown, PA: Vantage Learning.
Elliot, S. (2003). IntelliMetric™: From here to validity. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective. Mahwah, NJ: Lawrence Erlbaum, 71-86.
Enright, M., & Quinlan, M. (2010). Complementing human judgment of essays written by English language learners with e-rater® scoring. Language Testing 27: 317-334.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to latent semantic analysis. Discourse Processes 25: 259-284.
Landauer, T. K., Laham, D., & Foltz, P. W. (2000). The Intelligent Essay Assessor. IEEE Intelligent Systems: The Debate on Automated Essay Grading 15: 27-31.
Landauer, T. K., Laham, D., & Foltz, P. W. (2003). Automatic essay assessment. Assessment in Education 10: 295-308.
Lemaire, B., & Dessus, P. (2001). A system to assess the semantic content of student essays. Journal of Educational Computing Research 24: 305-306.
Page, E. B. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 47, 238-243.
Page, E., & Peterson, N. S. (1995). The computer moves into essay grading: Updating the ancient test. Phi Delta Kappan 76: 561 - 565.
Rudner, L. M., & Liang, T. (2002). Automated essay scoring using Bayes’ theorem. The Journal of Technology, Learning, and Assessment 1(2): 3-21.
Shermis, M., & Barrera, F. (2002). Exit assessments: Evaluating writing ability through Automated Essay Scoring (ERIC document reproduction service no ED 464 950).
Shermis, M. D., & Burstein, J. (2003). Introduction. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective. Mahwah, NJ: Lawrence Erlbaum, xiii–xvi.
Valenti, S., Neri, F., & Cucchiarelli, A. (2003). An overview of current research on automated essay grading. Journal of Information Technology Education 2: 319-330.
Vantage Learning. (2005). How IntelliMetric™ works. Retrieved from http://www.cengagesites.com/academic/assets/sites/4994/WE_2_IM_How_IntelliMetric_Works.pdf
Weigle, S. C. (2013). English language learners and automated scoring of essays: Critical considerations. Assessing Writing 18: 85–99.
Warschauer, M. (2014). DIP: Next-Generation automated feedback in support of iterative writing and scientific argumentation. Unpublished research proposal.