Lecture 3: Spell Checking in Context: n-gram Models LING 1330/2330: Computational Linguistics Na-Rae Han
Page 1: Lecture 3: Spell checkers, n-grams

Lecture 3: Spell Checking in Context: n-gram Models

LING 1330/2330: Computational Linguistics

Na-Rae Han

Page 2: Lecture 3: Spell checkers, n-grams

Objectives

Context-aware spell checkers

n-gram as context

Character-level n-grams

Word-level n-grams

Frequent n-grams in English

9/18/2018 2

Page 3: Lecture 3: Spell checkers, n-grams

Spell checkers

What did you find?

Which spell checkers work well, which don’t? In what way?

Anything else you noticed?

Page 4: Lecture 3: Spell checkers, n-grams

Samsung Keypad's candidates for what this word is or would become

What I typed in so far

Samsung Keypad's top pick (correction)

Page 13: Lecture 3: Spell checkers, n-grams

What do you think about the ranking?

Page 15: Lecture 3: Spell checkers, n-grams

typo

correction

correction candidates

Edit Distance in action

Page 17: Lecture 3: Spell checkers, n-grams

typo

correction

correction candidates

Edit Distance in action
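
The slides do not say which edit-distance variant the keyboard uses. As a minimal sketch, assuming plain Levenshtein distance (insertions, deletions, and substitutions each cost 1), correction candidates could be ranked by their distance from the typo. The function names and the toy vocabulary are hypothetical.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    # previous[j] holds the distance between the processed part of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into the empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def rank_candidates(typo, vocabulary):
    """Order vocabulary words by distance from the typo (ties broken alphabetically)."""
    return sorted(vocabulary, key=lambda word: (edit_distance(typo, word), word))


# Hypothetical vocabulary, for illustration only.
print(rank_candidates("teh", ["the", "ten", "tea", "them"]))
```

Because plain Levenshtein has no transposition operation, a swap such as "form" / "from" costs 2 here. Distance alone also says nothing about how common each candidate is, which is where the probability-based ranking discussed later comes in.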

Page 18: Lecture 3: Spell checkers, n-grams

What type of spell checker is this?

Real-time: generates candidates with each character input

cf. MS Word: operates on incoming words (responds after a space or Enter)

Look-ahead mode: two goals, spell correction + saving typing

Predicts whole words based on the first few characters

Aggressive: auto-corrects without user confirmation

Context-aware: considers the characters that have been typed in

Probability-based: ranks candidates based on likelihood
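
The look-ahead and probability-based behaviors above can be sketched together: given the characters typed so far, collect the words that start with them and rank those words by relative frequency. This is only an illustration under stated assumptions; the word counts are made up, and `complete` is a hypothetical name, not anything Samsung Keypad exposes.

```python
from collections import Counter

# Hypothetical toy counts; a real keyboard would use a much larger lexicon.
WORD_COUNTS = Counter({
    "the": 500, "they": 120, "then": 90, "there": 80, "thanks": 40, "tea": 10,
})


def complete(prefix, word_counts=WORD_COUNTS, k=3):
    """Return up to k (word, probability) pairs for words starting with prefix.

    The probability is the word's share of the counts among all matching words.
    """
    matches = {w: c for w, c in word_counts.items() if w.startswith(prefix)}
    total = sum(matches.values())
    ranked = sorted(matches.items(), key=lambda item: (-item[1], item[0]))
    return [(word, count / total) for word, count in ranked[:k]]


# Candidates are re-ranked after every character typed.
for typed in ["t", "th", "the"]:
    print(typed, [word for word, _ in complete(typed)])
```

Calling the function again after each keystroke mirrors the real-time behavior: the candidate list narrows as the prefix grows.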

Page 19: Lecture 3: Spell checkers, n-grams

Larger context?

So Samsung Keypad considers preceding characters as context.

Does it consider preceding words as well?

Page 20: Lecture 3: Spell checkers, n-grams

"Are you"

Page 21: Lecture 3: Spell checkers, n-grams

"Is you"

Same candidates, same top pick. Does NOT take preceding words into consideration.

This would have been a better pick following "is"

Page 22: Lecture 3: Spell checkers, n-grams

MS Word considers word contexts

Page 23: Lecture 3: Spell checkers, n-grams

MS Word considers word contexts

suggests "form" → "from"

suggests "your" → "you're" instead of "from" → "form"

Page 24: Lecture 3: Spell checkers, n-grams

n-grams: character-level

n-gram: a stretch of text n units long

unigrams (1), bigrams (2), trigrams (3), 4-grams, 5-grams, …

'green ideas'

Character unigrams:

['g', 'r', 'e', 'e', 'n', ' ', 'i', 'd', 'e', 'a', 's']

Character bigrams:

['gr', 're', 'ee', 'en', 'n ', ' i', 'id', 'de', 'ea', 'as']

Character trigrams:

['gre', 'ree', 'een', 'en ', 'n i', ' id', 'ide', 'dea', 'eas']

Character 4-grams:

['gree', 'reen', 'een ', 'en i', 'n id', ' ide', 'idea', 'deas']
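
One way to produce the lists above is to slide a window of width n across the string. A minimal sketch, with `char_ngrams` as a hypothetical helper name:

```python
def char_ngrams(text, n):
    """Return every substring of length n, in order, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


for n in (1, 2, 3, 4):
    print(n, char_ngrams("green ideas", n))
```

A string of length L yields L - n + 1 n-grams, which is why the 11-character 'green ideas' has 11 unigrams, 10 bigrams, 9 trigrams, and 8 4-grams.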

Page 25: Lecture 3: Spell checkers, n-grams

n-grams: word-level

n-gram: a stretch of text n units long

unigrams (1), bigrams (2), trigrams (3), 4-grams, 5-grams, …

'Colorless green ideas sleep furiously.'

Word bigrams:

[('colorless', 'green'), ('green', 'ideas'), ('ideas', 'sleep'), ('sleep', 'furiously'), ('furiously', '.')]

Word trigrams:

[('colorless', 'green', 'ideas'), ('green', 'ideas', 'sleep'), ('ideas', 'sleep', 'furiously'), ('sleep', 'furiously', '.')]
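
Word-level n-grams need a tokenization step first. The sketch below assumes a very simple tokenizer (lowercase the text, split off punctuation as separate tokens), which happens to reproduce the lists above. The slides do not state how the sentence was tokenized, so `tokenize` and `word_ngrams` are hypothetical helpers.

```python
import re


def tokenize(text):
    """Lowercase the text and return words and punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def word_ngrams(tokens, n):
    """Return the n-grams of a token list as tuples, in order."""
    # zip over n copies of the list, each shifted one position further
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = tokenize("Colorless green ideas sleep furiously.")
print(word_ngrams(tokens, 2))
print(word_ngrams(tokens, 3))
```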

Page 26: Lecture 3: Spell checkers, n-grams

n-grams and probability

How likely do you think these letter bigrams are in English:

'th' 'ti' 'tb' 'tq' 'tx'

Putting it in terms of conditional probability:

After a user has typed in the letter 't', what is the most likely next character input?

How about after 'q'? After 'io'?
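
The conditional probability of a next character can be estimated from bigram counts as P(next | previous) = count(previous next) / count(previous followed by anything). A minimal sketch, assuming a tiny made-up sample text in place of a real corpus; the names are hypothetical.

```python
from collections import Counter, defaultdict


def next_char_distribution(text):
    """Estimate P(next character | previous character) from adjacent pairs in text."""
    following = defaultdict(Counter)
    for previous, nxt in zip(text, text[1:]):
        following[previous][nxt] += 1
    return {
        previous: {c: n / sum(counts.values()) for c, n in counts.items()}
        for previous, counts in following.items()
    }


# Hypothetical sample; real estimates would need a large corpus.
SAMPLE = "the thin cat sat on the other thick mat"
dist = next_char_distribution(SAMPLE)
best_after_t = max(dist["t"], key=dist["t"].get)
print(best_after_t, dist["t"])
```

In this sample 't' is followed by 'h' five times and by a space twice, so the estimate of P('h' | 't') is 5/7. Conditioning on two characters, as in the 'io' question, works the same way with trigram counts.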

For fun:

What are the most frequent English letter bigrams?

th, he, in, er, an, re, nd, on, en, at

Trigrams?

the, and, ing, her, hat, his, tha, ere, for, ent

Page 27: Lecture 3: Spell checkers, n-grams

Word-level n-grams

How likely do you think these n-grams are in English:

n-gram          count
are you         46622
is you           4441
are you so        428
are you also       26
are you does        -

Putting it in terms of conditional probability:

After a user types in 'are you', what is the most likely next word?

How about 'in the'? 'in the middle'?
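
With the counts on this slide, the conditional probability of a next word can be estimated as count(context + word) / count(context). The sketch below assumes that the bigram and trigram counts come from the same corpus and that an n-gram with no count ('-') has probability 0. Both are assumptions, and the names are hypothetical.

```python
# Counts copied from the slide, keyed by word tuples.
COUNTS = {
    ("are", "you"): 46622,
    ("is", "you"): 4441,
    ("are", "you", "so"): 428,
    ("are", "you", "also"): 26,
    # 'are you does' has no count on the slide ('-'), so it is left out.
}


def conditional_probability(context, word, counts=COUNTS):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    context = tuple(context)
    context_count = counts.get(context, 0)
    if context_count == 0:
        return 0.0  # unseen context: no estimate available, reported as 0 here
    return counts.get(context + (word,), 0) / context_count


for candidate in ["so", "also", "does"]:
    print(candidate, conditional_probability(("are", "you"), candidate))
```

Under these assumptions 'so' after 'are you' comes out at roughly 0.009 and 'also' at roughly 0.0006, so a context-aware checker would rank 'so' well above 'also', and both above 'does'.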

Page 28: Lecture 3: Spell checkers, n-grams

For fun: most frequent bigrams?

2551888 of the

1887475 in the

1041011 to the

861798 on the

676658 and the

648408 to be

578806 for the

561171 at the

498217 in a

479627 do n't

455367 with the

451460 from the

443547 of a

395939 that the

362176 is a

361879 going to

335255 by the

330828 as a

319846 with a

317431 I think

Source: http://www.ngrams.info/download_coca.asp
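
A frequency list in this "count word word" layout can be turned into a simple next-word predictor: index the bigrams by their first word, then rank the possible second words by count. The sketch below uses a six-line excerpt of the list; the function names are hypothetical, and the rankings only reflect the bigrams included in the excerpt.

```python
from collections import defaultdict

# Excerpt of the bigram list on this slide, in the same "count w1 w2" layout.
BIGRAM_LINES = """\
2551888 of the
1887475 in the
1041011 to the
648408 to be
498217 in a
443547 of a
"""


def load_bigrams(lines):
    """Parse 'count w1 w2' lines into a nested table: w1 -> {w2: count}."""
    table = defaultdict(dict)
    for line in lines.splitlines():
        count, first, second = line.split()
        table[first][second] = int(count)
    return table


def next_word_candidates(table, previous_word):
    """Return the words seen after previous_word, most frequent first."""
    followers = table.get(previous_word, {})
    return sorted(followers, key=followers.get, reverse=True)


table = load_bigrams(BIGRAM_LINES)
print(next_word_candidates(table, "to"))
print(next_word_candidates(table, "in"))
```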

Page 29: Lecture 3: Spell checkers, n-grams

Most frequent trigrams?

198630 I do n't

140305 one of the

129406 a lot of

117289 the United States

79825 do n't know

76782 out of the

75015 as well as

73540 going to be

61373 I did n't

61132 to be a

Source: http://www.ngrams.info/download_coca.asp

Page 30: Lecture 3: Spell checkers, n-grams

4-grams? 5-grams?

54647 I do n't know

43766 I do n't think

33975 in the United States

29848 the end of the

27176 do n't want to

12663 I do n't want to

10663 at the end of the

8484 in the middle of the

8038 I do n't know what

6446 I do n't know if

Source: http://www.ngrams.info/download_coca.asp

