On using context for automatic correction of non-word misspellings in student essays
Michael Flor, Yoko Futagi
Educational Testing Service
2012 ACL
Outline
[ 1. Introduction ] [ 2. Corpus ] [ 3. Annotation ] [ 4. Spelling correction systems ] ConSpel system [ 5. Comparative evaluation ] [ 6. Discussion ] [ 7. Conclusions ]
2. Corpus
Essays come from high-stakes standardized tests:
- TOEFL
- GRE
The corpus includes 3,000 essays, for a total of 963,428 words.
3. Annotation
Annotators were asked to identify all non-word misspellings.
Two annotators:
- native English speakers
- experienced in linguistic annotation
Annotators agreed in 82.6% of the cases (Cohen’s kappa = 0.8, p < .001).
All disagreements were resolved by a third annotator (adjudicator).
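To show how a kappa value relates to raw agreement, here is a minimal sketch of Cohen's kappa for two annotators. The function name `cohens_kappa` and the toy labels ("M" for misspelling, "O" for other) are assumptions made for illustration; they are not the corpus annotations, and the result below is unrelated to the figures reported above.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single, identical label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels, not taken from the corpus.
annotator_1 = ["M", "M", "M", "M", "O", "O", "O", "O", "O", "O"]
annotator_2 = ["M", "M", "M", "O", "O", "O", "O", "O", "O", "M"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

In the toy data the annotators agree on 8 of 10 items, but kappa is lower than 0.8 because some of that agreement would be expected by chance.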
The annotated corpus of 3,000 essays has the following statistics:
- Average essay length is 321 words (the range is 28-798 words)
- 148 essays turned out to have no misspellings at all
- 2.24% of the words in the corpus are non-word misspellings
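As a quick consistency check on these figures, the arithmetic can be sketched as below. The variable names are illustrative, and the misspelling count it prints is an estimate derived from the reported percentage, not a number stated in the slides.

```python
essays = 3000
total_words = 963_428
essays_without_misspellings = 148
misspelling_rate = 0.0224  # 2.24% of all words

# Average essay length implied by the corpus totals.
average_length = total_words / essays
# Essays containing at least one non-word misspelling.
essays_with_misspellings = essays - essays_without_misspellings
# Rough count of misspelled tokens implied by the percentage.
estimated_misspellings = total_words * misspelling_rate

print(f"average essay length: {average_length:.1f} words")
print(f"essays with at least one misspelling: {essays_with_misspellings}")
print(f"estimated non-word misspellings: about {estimated_misspellings:,.0f}")
```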
4. Spelling correction systems (ConSpel system)
The system focuses on detecting and correcting non-word misspellings.
By default, the system ignores:
- numbers
- dates
- web and email addresses
- mixed alphanumeric strings (e.g. ‘RV400’)
- capitalized words (e.g. ‘London’)
- all-uppercase words (e.g. ‘ROME’)
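The ignore rules above could be approximated with a small token filter. This is a sketch under stated assumptions, not ConSpel's actual implementation: the regular expressions and the function name `is_ignored` are hypothetical, and the capitalization test is deliberately naive (it would also skip ordinary words at the start of a sentence).

```python
import re

# Hypothetical patterns; the real system's rules are not given in the slides.
NUMBER = re.compile(r"^[+-]?\d+([.,]\d+)*$")
DATE = re.compile(r"^\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}$")
WEB_OR_EMAIL = re.compile(
    r"^(https?://\S+|www\.\S+|[^@\s]+@[^@\s]+\.[^@\s]+)$", re.IGNORECASE
)


def is_ignored(token):
    """Return True if the token should be skipped by the spell checker."""
    if NUMBER.match(token) or DATE.match(token) or WEB_OR_EMAIL.match(token):
        return True
    has_alpha = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    if has_alpha and has_digit:
        return True  # mixed alphanumeric string, e.g. 'RV400'
    if token[:1].isupper():
        return True  # capitalized ('London') or all uppercase ('ROME')
    return False


for tok in ["RV400", "London", "ROME", "2012", "user@example.com", "interacte"]:
    print(tok, is_ignored(tok))
```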
ConSpel spelling dictionaries include about 360,000 entries, covering:
- all inflectional variants (e.g. ‘love’, ‘loved’, ‘loves’, ‘loving’)
- international spelling variants (e.g. American and British English)
The core set includes 245,000 entries (modern English vocabulary).
Additional dictionaries include about 120,000 entries:
- international surnames and first names
- names of geographical places
Detection of Misspellings
A token is flagged as a misspelling when the string is not found in the system dictionaries.
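Dictionary-based detection amounts to a set-membership test. The sketch below assumes a tiny hypothetical word list and case-insensitive lookup; neither is specified in the slides, and the ignore rules described earlier are left out for brevity.

```python
def find_nonwords(tokens, dictionary):
    """Return (index, token) pairs for alphabetic tokens missing from the dictionary."""
    return [
        (i, tok)
        for i, tok in enumerate(tokens)
        if tok.isalpha() and tok.lower() not in dictionary
    ]


# Hypothetical miniature dictionary; the real one has about 360,000 entries.
toy_dictionary = {"they", "received", "fresh", "air", "with", "other", "youth"}
tokens = "They received fresh air interacte with other youth".split()
print(find_nonwords(tokens, toy_dictionary))
```

Because the test is plain membership, a misspelling that happens to be another valid word is never flagged, which is why the system is described as handling non-word misspellings only.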
Correction of Misspellings
Dictionaries are also the source of suggested corrections.
Candidate suggestions: use edit distance with the default threshold of 5.
Problem: this can easily yield hundreds of correction candidates.
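Candidate generation by edit distance can be sketched as follows. The slides do not say which edit-distance variant ConSpel uses, so plain Levenshtein distance is assumed here; the function names and the four-word dictionary are hypothetical.

```python
def levenshtein(a, b):
    """Levenshtein distance: insertions, deletions and substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def candidates(word, dictionary, max_distance=5):
    """Return sorted (distance, entry) pairs within max_distance of word."""
    scored = ((levenshtein(word, entry), entry) for entry in dictionary)
    return sorted((d, e) for d, e in scored if d <= max_distance)


# Hypothetical miniature dictionary.
toy_dictionary = {"interact", "interacted", "interaction", "banana"}
print(candidates("interacte", toy_dictionary, max_distance=2))
print(candidates("interacte", toy_dictionary))
```

Even in this toy case, loosening the threshold from 2 to the default of 5 admits an extra candidate; against a dictionary of several hundred thousand entries the same effect produces the large candidate sets noted above.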
Candidate suggestions are ranked using a set of algorithms:
- edit distance
- phonetic similarity
- word frequency
- local context
- context-sensitive
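One way to combine several ranking signals is a weighted sum, sketched below for three of them: edit distance, word frequency, and local context (here, a bigram count with the preceding word). The slides do not describe how ConSpel combines its rankers, so the scoring formula, the weights, and all counts are assumptions made for illustration; phonetic similarity is omitted.

```python
import math


def rank_candidates(candidates, prev_word, unigram_counts, bigram_counts,
                    weights=(1.0, 0.5, 1.0)):
    """Rank (edit_distance, word) candidates, best first, by a weighted sum."""
    w_edit, w_freq, w_ctx = weights

    def score(item):
        dist, word = item
        freq = math.log1p(unigram_counts.get(word, 0))
        ctx = math.log1p(bigram_counts.get((prev_word, word), 0))
        # Smaller edit distance is better, so it enters with a negative sign.
        return -w_edit * dist + w_freq * freq + w_ctx * ctx

    return [word for _, word in sorted(candidates, key=score, reverse=True)]


# Hypothetical counts, chosen only to illustrate the effect of context.
unigrams = {"interact": 1000, "interacted": 400}
bigrams = {("they", "interacted"): 50, ("they", "interact"): 20}
cands = [(1, "interact"), (1, "interacted")]

print(rank_candidates(cands, "they", unigrams, {}))       # no context evidence
print(rank_candidates(cands, "they", unigrams, bigrams))  # with bigram evidence
```

With these toy counts, frequency alone favours the more common word, while adding the bigram evidence changes which candidate is ranked first.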
5. Comparative evaluation
All evaluations were performed in “full context” (rather than word-by-word).
6. Discussion
Absence of grammatical errors. For example:
“They received fresh air, interacte with other youth their age, solved problems...”.