Automated Writing Evaluation (AWE):
Past, Present and Prospect
Dr. Li Zhang (张荔), Shanghai Jiao Tong University
Shanghai, China
Outline
Introduction to major AWE systems
Introduction to JUKU, an AWE system developed in China
Prospects for the future development of AWE
Some of the most widely used AWE systems
PEG (Project Essay Grader)
IEA (Intelligent Essay Assessor)
E-rater
IntelliMetric
BETSY (Bayesian Essay Test Scoring sYstem)
PEG
Ellis Page, Duke University
1966
uses correlation to predict the intrinsic quality of essays (Chung & O’Neil, 1997).
Trins: intrinsic variables such as fluency, grammar, punctuation
Proxes: the surface features related to the intrinsic variables, such as word length, part of speech or word meaning (Page & Peterson, 1995).
Essay evaluation process
PEG is trained on a sample of more than 300 essays to obtain text features, which are analyzed to establish their correlation with human ratings.
Proxes are determined for each essay and entered into the prediction equation, whose beta weights were obtained through regression analysis.
A score is assigned to the essay by applying the beta weights (coefficients) (Chung & O'Neil, 1997).
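A minimal sketch of this prox-based regression, assuming three toy surface features (essay length, mean word length, comma count) in place of PEG's actual proxes:

```python
# PEG-style scoring sketch: regress human scores on surface "proxes",
# then score a new essay with the resulting beta weights.
# The three features below are illustrative stand-ins, not PEG's real proxes.
import numpy as np

def proxes(essay):
    words = essay.split()
    return [len(words),                               # length as a fluency proxy
            sum(len(w) for w in words) / len(words),  # mean word length
            essay.count(",")]                         # punctuation proxy

train_essays = [
    "A short essay.",
    "A somewhat longer essay, with more words in it.",
    "An essay of moderate length, with a few clauses, and some variety.",
    "An elaborate essay, carefully punctuated, with considerable length, detail, and lexical variety.",
]
human_scores = np.array([2.0, 3.0, 4.0, 5.0])   # toy human ratings

X = np.array([proxes(e) for e in train_essays], dtype=float)
X = np.column_stack([np.ones(len(X)), X])       # add intercept term
beta, *_ = np.linalg.lstsq(X, human_scores, rcond=None)  # beta weights

new_essay = "A new essay, of moderate length, waiting to be scored."
x = np.array([1.0, *proxes(new_essay)])
print(f"Predicted score: {x @ beta:.2f}")
```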
Figure 1: PEG system scoring process. Shadowed blocks refer to major sources of variations. Barred blocks indicate results of computations (Cited from Chung & O’Neil, 1997, p. 7)
IEA
Thomas Landauer and his colleagues, University of Colorado
late 1990s
Latent Semantic Analysis (LSA; Lemaire & Dessus, 2001).
Essay evaluation process
The IEA system is trained on domain-representative texts.
These texts and the new essay are represented as vectors.
The conceptual relevance of the essay to the texts is compared using LSA.
The texts most similar to the essay are selected and weighted by cosine similarity to obtain a score, which is taken as the final score of the essay (Landauer, Laham, & Foltz, 2003).
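A minimal sketch of the LSA step, using scikit-learn's TruncatedSVD over a TF-IDF term-document matrix; the pre-scored reference texts and the top-2 weighting are toy assumptions:

```python
# IEA-style scoring sketch: project texts into a latent semantic space and
# score a new essay by the cosine-weighted average of the most similar
# pre-scored texts. Training texts and scores are toy assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

train_texts = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "Plants use sunlight, water and carbon dioxide to produce glucose.",
    "The water cycle moves water between oceans, atmosphere and land.",
    "Evaporation and condensation drive precipitation in the water cycle.",
]
train_scores = np.array([5.0, 4.0, 4.5, 3.5])

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)

svd = TruncatedSVD(n_components=2)        # the latent semantic space
X_lsa = svd.fit_transform(X)

new_essay = "Sunlight lets plants make glucose from water and carbon dioxide."
e_lsa = svd.transform(vectorizer.transform([new_essay]))

sims = np.clip(cosine_similarity(e_lsa, X_lsa)[0], 0, None)
top = np.argsort(sims)[-2:]               # the two most similar texts
score = np.average(train_scores[top], weights=sims[top])
print(f"Predicted score: {score:.2f}")
```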
E-rater
Educational Testing Service (ETS)
1990s
E-rater uses Natural Language Processing (NLP) techniques, a vector-space model and a linear regression model.
E-rater includes a syntactic module, a discourse module, and a topical-analysis module.
Essay evaluation process
Uses linear regression analysis to process texts scored by human raters.
Determines the optimal weighting model that can predict the human ratings.
Uses NLP to identify features in an essay and combines them into feature scores.
Generates a score by applying the weighting model to the feature scores (Enright & Quinlan, 2010).
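A minimal sketch of the weighting step; the three module-level feature scores are hypothetical placeholders, since e-rater's actual features are proprietary:

```python
# E-rater-style sketch: learn a weighting model over module-level feature
# scores from human-rated essays, then apply it to a new essay.
# Feature columns are hypothetical: [syntactic, discourse, topical-analysis].
import numpy as np
from sklearn.linear_model import LinearRegression

feature_scores = np.array([
    [0.6, 0.5, 0.7],
    [0.8, 0.7, 0.9],
    [0.3, 0.4, 0.2],
    [0.9, 0.8, 0.8],
])
human_ratings = np.array([3.0, 4.5, 2.0, 5.0])   # toy human ratings

model = LinearRegression().fit(feature_scores, human_ratings)  # weighting model

new_essay_features = np.array([[0.7, 0.6, 0.8]])
print(f"Predicted rating: {model.predict(new_essay_features)[0]:.2f}")
print("Optimal feature weights:", model.coef_)
```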
Criterion
Criterion is an online essay scoring and evaluation system that relies on 1) e-rater to score an essay, and 2) Critique writing analysis tools to provide detailed evaluation and feedback on language, discourse, content, etc. (Dikli, 2006).
IntelliMetric
Vantage Learning
1998
A cognitive model of information processing and understanding
Core technology: artificial intelligence, NLP, computational linguistics, statistics, machine learning, CogniSearch and Quantum Reasoning (Elliot, 2003)
Evaluates an essay by features of content and structure (Vantage Learning, 2005)
Five categories:
focus and unity
development and elaboration
organization and structure
sentence structure
mechanics and conventions
(Vantage Learning, 2005).
Essay evaluation process
Preprocesses the electronic form of the essay to make sure it is readable by IntelliMetric.
Extracts information from the essay by using NLP.
Transforms the information into numerical form to support computation of the mathematical models.
Applies the mathematical understanding to a new essay and integrates the information to yield the final score (Vantage Learning, 2005).
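IntelliMetric's internals are proprietary, but the general pattern described above (numeric features feeding multiple mathematical models whose judgments are integrated) can be sketched as follows; every detail here is an assumption:

```python
# Sketch of the IntelliMetric-style pattern: several mathematical models
# score the same numeric representation of an essay, and their judgments
# are integrated (here, simply averaged) into one final score.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# Numeric form of training essays: [word count, mean word length, paragraphs]
X_train = np.array([[120, 4.1, 2], [340, 4.8, 5], [80, 3.9, 1], [260, 4.5, 4]])
y_train = np.array([2.5, 5.0, 2.0, 4.0])          # human ratings (toy data)

models = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=2, random_state=0)]
for m in models:
    m.fit(X_train, y_train)

x_new = np.array([[200, 4.3, 3]])                 # a new essay, in numeric form
final = np.mean([m.predict(x_new)[0] for m in models])
print(f"Integrated score: {final:.2f}")
```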
Figure 4: Architecture of IntelliMetric (cited from Vantage Learning, 2005, p. 12)
MY Access!
MY Access! is a web-based writing assessment tool that relies on IntelliMetric to provide students with a writing environment that offers immediate scoring and diagnostic feedback (Vantage Learning, 2005).
BETSY
Lawrence M. Rudner, University of Maryland, 2002
Two models based on Bayes' theorem: the Multivariate Bernoulli Model and the Multinomial Model.
Core idea: classification of essays on the basis of about 1,000 training texts (Valenti et al., 2003).
This classification is based on essay features including content related features and form related features.
BETSY uses the models to analyze features such as:
specific words and phrases
frequency of certain content words
number of words
sentence length
number of verbs
the order in which concepts are presented
the occurrence of specific noun verb pairs (Rudner & Liang, 2002).
and categorizes new texts into four groups: Advanced, Proficient, Basic, Below Basic.
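A minimal sketch of the Multinomial Model side, using scikit-learn's MultinomialNB over word counts; the training texts and labels are toy assumptions:

```python
# BETSY-style sketch: a multinomial naive Bayes classifier assigns a new
# essay to one of four score categories from word-frequency features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "The argument is developed with clear evidence and precise vocabulary.",
    "The essay states a position and gives some supporting reasons.",
    "The essay has a topic but few reasons and many repeated words.",
    "Words are strung together with no clear topic or support.",
]
labels = ["Advanced", "Proficient", "Basic", "Below Basic"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)      # word-frequency features

clf = MultinomialNB().fit(X, labels)           # Bayes' theorem under the hood

new_essay = "The essay gives a clear position with several supporting reasons."
print(clf.predict(vectorizer.transform([new_essay]))[0])
```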
Reliability

| System | Correlation | Agreement | Citation |
|---|---|---|---|
| PEG | 0.72-0.78 | | Page & Peterson, 1995 |
| IEA | 0.85 | | Landauer et al., 2000 |
| E-rater | 0.73-0.93 | 87%-97% | Burstein et al., 2004 |
| IntelliMetric | 0.83 | 94%-98% | Elliot, 2002 |
| BETSY | | 80% | Rudner & Liang, 2002 |
SEAR (Christie, 1999)
APEX (Lemaire, 2001)
PS-ME (Mason & Grove-Stephenson, 2002)
ATM (Callear et al., 2001)
C-rater (Leacock, 2003)
eGrader (Byrne et al., 2010)
MaxEnt (Sukkarieh & Bolge, 2010)
Writing Roadmap (Rich et al., 2013)
LightSIDE (Mayfield & Rose, 2013)
Crase (Lottridge et al., 2013)
...
JUKU
JUKU (http://www.pigai.org/) was developed by Chinese researchers in 2010.
It is used by more than 200 universities in China.
Based on corpus and cloud-computing technology.
Measures the comparative distance between a student's essay and standard corpora.
Each essay is measured along 192 dimensions within the categories of vocabulary, sentence, discourse and content.
Provides reports with scores, overall comments and line-by-line feedback.
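A minimal sketch of the comparative-distance idea, with four made-up dimensions standing in for JUKU's 192:

```python
# JUKU-style sketch: score an essay by its distance from the mean feature
# vector of a standard corpus. The four dimensions are made-up stand-ins
# for JUKU's 192 dimensions.
import numpy as np

# Rows: reference essays from the standard corpus.
# Columns: [vocabulary, sentence, discourse, content] dimension scores.
corpus_features = np.array([
    [0.82, 0.75, 0.70, 0.80],
    [0.78, 0.80, 0.65, 0.75],
    [0.85, 0.72, 0.74, 0.82],
])
corpus_mean = corpus_features.mean(axis=0)

essay_features = np.array([0.70, 0.68, 0.60, 0.72])

distance = np.linalg.norm(essay_features - corpus_mean)  # comparative distance
score = 100 * max(0.0, 1.0 - distance)                   # nearer corpus, higher score
print(f"Distance: {distance:.3f}  Score: {score:.1f}")
```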
Reliability
Reliability analysis based on 1,456 essays written by students at Nanjing University.
Agreement: 92%
Complete + adjacent agreement (15 points, 5 levels): 93.37% (Zhang, unpublished)
Correlation: less than 0.7
Combination of machine and human evaluation
Teacher feedback
Peer feedback
Recommendation
Praise
Comments
Prospect
Writing evaluation, whether by human raters or automated scoring, should satisfy two conditions: 1) the rubric should reflect the essential aspects of writing competence and 2) the ratings should be consistent with the rubric (Weigle, 2013).
Cope et al. (2011) propose "an alternative potential for NLP based on an understanding of the writing process as a fluid, iterative struggle to make meaning" (p. 87).
Figure 5: CBAL writing competency model (Deane, et al., 2011, p. 3)
DIP, University of California (Warschauer, 2014)
Give formative feedback on each draft
Provide feedback on a wide range of student writing
Use LightSIDE to encourage new feature extraction
Use machine learning technology to provide an open-source engine that can adjust to new problems
Incorporate "machine-student dialogue" and an "intelligent tutoring system"
Summary
1) the design of AWE systems that help improve learners' cognitive and critical-thinking abilities;
2) a shift of emphasis from the language and structure of an essay to its ideas, thinking and rhetorical effectiveness;
3) the evaluation of different genres of writing, including both arts and science articles;
4) the development of new software engines that can provide formative feedback on student writing;
5) the use of machine learning technology to design AWE systems that provide open-source engines for adjusting to new problems;
6) the use of machine-human dialogue and intelligent tutoring systems to enhance the effect of feedback;
7) the cooperation of different disciplines in the development of AWE:
writing teachers,
test developers,
cognitive psychologists,
psychometricians,
and computer scientists
References
Burstein, J., Chodorow, M., & Leacock, C. (2004). Automated essay evaluation: The Criterion online writing service. AI Magazine 25: 27-35.
Chung, K. W. K., & O’Neil, H. F. (1997). Methodological approaches to online scoring of essays. Retrieved from http://www.cse.ucla.edu/products/reports/tech461.pdf
Cope, B., Kalantzis, M., McCarthey, S., Vojak, C., & Kline, S. (2011). Technology-mediated writing assessments: Principles and processes. Computers and Composition 28: 79–96.
Deane, P., Quinlan, T., & Kostin, I. (2011). Automated scoring within a developmental, cognitive model of writing proficiency. Princeton, NJ: Educational Testing Service.
Elliot, S. (2002). A study of expert scoring, standard human scoring and IntelliMetric scoring accuracy for statewide eighth grade writing responses. Newtown, PA: Vantage Learning.
Elliot, S. (2003). IntelliMetric™: From here to validity. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective. Mahwah, NJ: Lawrence Erlbaum, 71-86.
Enright, M., & Quinlan, M. (2010). Complementing human judgment of essays written by English language learners with e-rater® scoring. Language Testing 27: 317-334.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to latent semantic analysis. Discourse Processes 25: 259-284.
Landauer, T. K., Laham, D., & Foltz, P. W. (2000). The Intelligent Essay Assessor. IEEE Intelligent Systems: The Debate on Automated Essay Grading 15: 27-31.
Landauer, T. K., Laham, D., & Foltz, P. W. (2003). Automatic essay assessment. Assessment in Education 10: 295-308.
Lemaire, B., & Dessus, P. (2001). A system to assess the semantic content of student essays. Journal of Educational Computing Research 24: 305-306.
Page, E. B. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 47, 238-243.
Page, E., & Peterson, N. S. (1995). The computer moves into essay grading: Updating the ancient test. Phi Delta Kappan 76: 561 - 565.
Rudner, L. M., & Liang, T. (2002). Automated essay scoring using Bayes’ theorem. The Journal of Technology, Learning, and Assessment 1(2): 3-21.
Shermis, M., & Barrera, F. (2002). Exit assessments: Evaluating writing ability through Automated Essay Scoring (ERIC document reproduction service no ED 464 950).
Shermis, M. D., & Burstein, J. (2003). Introduction. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective. Mahwah, NJ: Lawrence Erlbaum, xiii–xvi.
Valenti, S., Neri, F., & Cucchiarelli, A. (2003). An overview of current research on automated essay grading. Journal of Information Technology Education 2: 319-330.
Vantage Learning. (2005). How IntelliMetric™ works. Retrieved from http://www.cengagesites.com/academic/assets/sites/4994/WE_2_IM_How_IntelliMetric_Works.pdf
Weigle, S. C. (2013). English language learners and automated scoring of essays: Critical considerations. Assessing Writing 18: 85–99.
Warschauer, M. (2014). DIP: Next-Generation automated feedback in support of iterative writing and scientific argumentation. Unpublished research proposal.