Page 1: Swapnil Chhajer schhajer@usc.edu

Spelling Correction for Search Engine Queries

Bruno Martins and Mario J. Silva
Proceedings of EsTAL-04, España for Natural Language Processing (2004)

Swapnil Chhajer
schhajer@usc.edu

http://schhajer.co.nr

Topics Covered in Class
• Peter Norvig’s Spelling Corrector: Query Processing [33-35]
• Levenshtein Algorithm: Query Processing [36-41]
• Evaluation Metrics: Precision & Recall: Introduction to Information Retrieval [16]
• Soundex Algorithm: Query Processing [18]

April 16, 2013 Spelling Correction for Search Engine Queries

Motivation & Abstract
• Misspelled queries retrieve pages that contain the same misspelled words, so the most appropriate pages are left behind.
• 10-12% of queries are misspelled.
• Goal: provide the user with the best possible match instead of making the user choose one of the possible corrections from a correction list.

Google: Spelling Correction

Spelling Correction
• Uses
  • Correcting documents being indexed
  • Retrieving matching documents when the query contains a spelling error
• Flavors
  • Isolated words
    • Check each word on its own
    • Unable to catch typos that are themselves correctly spelled words, e.g., “from” vs. “form”
  • Context-sensitive
    • Look at surrounding words, e.g., “I flew form Heathrow to Narita.”

“a paragraph cud half mini flaws but wood bee past by the isolated spill checker”

General issues in Spelling Correction

• UI
  • “Did you mean” works for one suggestion.
  • What about multiple possible corrections?
• Computational Cost
  • Spelling correction is potentially expensive
  • Avoid running it on every query
  • Maybe run it only on queries that match few documents
  • Guess: spelling correction in major search engines is efficient enough to be run on every query

Kinds of Spelling Mistakes: Typos
• Wrong characters typed by mistake
• Mainly fall into 4 categories:
  • Insertions (extra letter)
    • “prejudice” as “prejudsice”
  • Deletions (missing letter)
    • “plaintiff” as “paintiff”, “judgment” as “judment”, “liability” as “liabilty”, “discovery” as “dicovery”, “fourth amendment” as “fourthamendment”
  • Substitutions (wrong letter)
    • “habeas” as “haceas”, “appellate” as “appellare”
  • Transpositions (two adjacent letters swapped)
    • “fraud” as “fruad”, “bankruptcy” as “banrkuptcy”, “subpoena” as “subpeona”, “plaintiff” as “plaitniff”
• 80-95% of misspellings differ from the correct spelling in just one of these four ways.
• Keyboard layout is important in such cases.
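
The four typo categories map onto four single-character edit operations. The following is a minimal sketch in the style of Peter Norvig's spelling corrector (listed under Topics Covered in Class), not the method of the paper being presented. The function name `edits1` and the lowercase ASCII alphabet are assumptions made for illustration. Note the direction: a misspelling caused by a deletion is repaired by an insertion, and the other way round.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return every string that is one single-character edit away from `word`."""
    # Every way of cutting the word into a left and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Drop one character (repairs an insertion typo).
    deletions = {left + right[1:] for left, right in splits if right}
    # Swap two adjacent characters (repairs a transposition typo).
    transpositions = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    # Replace one character (repairs a substitution typo).
    # This set also contains `word` itself, when a letter is replaced by itself.
    substitutions = {
        left + c + right[1:] for left, right in splits if right for c in alphabet
    }
    # Add one character (repairs a deletion typo).
    insertions = {left + c + right for left, right in splits for c in alphabet}
    return deletions | transpositions | substitutions | insertions
```

Intersecting the returned set with a dictionary of known words yields the candidates that are one edit away from the misspelling.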

Kinds of Spelling Mistakes: Brainos
• Wrong characters typed on purpose
• Most common type of mistake in general web queries
• Mistakes derive from pronunciation, spelling, or semantic confusions
• Brainos: Soundalike (Phonetic Errors)
  • “subpoena” as “supena”, “voir” as “voire”, “latter” as “ladder”, “withholding” as “witholding”, “foreclosure” as “forclosure”
• Brainos: Confusions
  • “preclusion” as “perclusion”, “men” as “mans”, “juries” as “jurys” or “jureys”, “dramshop” as “dram shop”

Dictionary Storage: Ternary Search Trees (TST)
• Data structure: Ternary Search Tree (TST)
  • A type of trie, limited to 3 children per node.
  • A trie is the common name for a tree storing strings, in which there is one node for every common prefix and the strings are stored in extra leaf nodes.
• Searching: O(log(n)+k)
  • n: number of strings in the tree
  • k: length of the string being searched for

TST Continued…

Figure: A ternary search tree storing the words “to”, “too”, “toot”, “tab” and “so”, all with an associated frequency of 1
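
Since the figure itself is not reproduced in this transcript, here is a small sketch of a ternary search tree holding the same five words. The class and method names (`Node`, `TernarySearchTree`, `insert`, `frequency`) are illustrative assumptions. As a simplification, a word's end is marked by a frequency counter on the node of its last character rather than by an extra leaf node.

```python
class Node:
    """One character of the tree with its three children."""

    __slots__ = ("char", "lo", "eq", "hi", "freq")

    def __init__(self, char):
        self.char = char
        self.lo = self.eq = self.hi = None  # smaller char, next char, larger char
        self.freq = 0  # a value above 0 means a stored word ends at this node


class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, word, freq=1):
        if not word:
            raise ValueError("empty words are not supported")
        self.root = self._insert(self.root, word, 0, freq)

    def _insert(self, node, word, i, freq):
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.char:
            node.lo = self._insert(node.lo, word, i, freq)
        elif ch > node.char:
            node.hi = self._insert(node.hi, word, i, freq)
        elif i + 1 < len(word):
            # Character matched: continue with the next character.
            node.eq = self._insert(node.eq, word, i + 1, freq)
        else:
            node.freq += freq
        return node

    def frequency(self, word):
        """Return the stored frequency of `word`, or 0 if it is not stored."""
        node, i = self.root, 0
        while node is not None and word:
            ch = word[i]
            if ch < node.char:
                node = node.lo
            elif ch > node.char:
                node = node.hi
            elif i + 1 < len(word):
                node, i = node.eq, i + 1
            else:
                return node.freq
        return 0


tst = TernarySearchTree()
for w in ["to", "too", "toot", "tab", "so"]:
    tst.insert(w)
```

After these inserts, `tst.frequency("too")` is 1, while a mere prefix such as `"ta"` yields 0 because no word ends at that node.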

Spelling Correction Algorithm
• Can be implemented using edit distance, rule-based techniques, n-grams, probabilistic techniques, neural nets, similarity key techniques, or combinations of these.
• Goal: find the edit distance under different strategies.
• A shorter distance implies a better correction.
• Soundex system:
  • Indexing based on sound.
  • Devised to help with the problem of phonetic errors.
• Metaphone systems:
  • Specific to the English language
  • Transform words into codes based on phonetic properties
  • Based on consonants & diphthongs
• Spelling correction for the web
  • Context-dependent correction is a waste of effort, as users rarely type more than three terms in a query
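
To make "shorter distance implies a better correction" concrete, here is a sketch of the plain Levenshtein edit distance (also listed under Topics Covered in Class). It is a standard dynamic-programming formulation, not code from the paper, and the function name `levenshtein` is an assumption. Plain Levenshtein has no transposition operation, so a swapped pair of letters costs 2.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,  # delete ca
                    curr[j - 1] + 1,  # insert cb
                    prev[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        prev = curr
    return prev[-1]
```

For example, `levenshtein("judment", "judgment")` is 1 (one missing letter), while `levenshtein("fruad", "fraud")` is 2.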

Spelling Correction Algorithm Continued…

• The user-entered query is tokenized, ignoring non-word characters.
• All words are converted to lower case, and each word is checked for correct spelling.
• Frequencies are updated for correctly spelled words. This acts as feedback to the system.
• The feedback can help the spell checker predict patterns in users’ searches.
• Misspelled words are replaced by correctly spelled words.
• Finally, the new query is presented to the user as a suggestion, together with the results page for the original query.
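
The query-handling steps above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the dictionary is a plain frequency table (`collections.Counter`) rather than a TST, and `correct` is a hypothetical callback standing in for the correction algorithm described on the following slides.

```python
import re
from collections import Counter


def suggest_query(query, frequencies, correct):
    """Build a suggested query; `frequencies` maps known words to counts."""
    # Tokenize on runs of word characters and lower-case every token.
    tokens = [t.lower() for t in re.findall(r"\w+", query)]
    suggestion = []
    for token in tokens:
        if token in frequencies:
            # Correctly spelled: count it as feedback and keep it unchanged.
            frequencies[token] += 1
            suggestion.append(token)
        else:
            # Misspelled: replace it with whatever the corrector returns.
            suggestion.append(correct(token))
    return " ".join(suggestion)


freqs = Counter({"spelling": 3, "correction": 2})
toy_corrector = lambda w: {"speling": "spelling"}.get(w, w)  # hypothetical stand-in
suggested = suggest_query("Speling, Correction!", freqs, toy_corrector)
```

In this toy run the suggestion is `"spelling correction"`, and only the count for "correction" grows, because "speling" was not found in the dictionary.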

Spelling Correction Algorithm Continued…
• The algorithm is divided into 2 phases:
  • Phase 1: Generate a set of candidate suggestions
  • Phase 2: Select the best choice among those candidates
• Phase 1
  • 9 steps; each step looks up the dictionary for words that relate to the original misspelling in one of these ways:
    • Differ in one character from the original word.
    • Differ in two characters from the original word.
    • Differ in one letter removed or added.
    • Differ in one letter removed or added, plus one letter different.
    • Differ in repeated characters removed.
    • Correspond to 2 concatenated words (space between words eliminated).
    • Differ in having two consecutive letters exchanged & 1 character different.
    • Have the original word as a prefix.
    • Differ in repeated characters removed & 1 character different.
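
A sketch of three of the Phase 1 steps that go beyond single-character edits: repeated characters removed, two concatenated words, and the original word as a prefix. It assumes the dictionary is a plain Python set (the paper uses a TST), and the names `collapse_repeats` and `candidates` are hypothetical. Comparing collapsed forms is only one possible reading of "repeated characters removed".

```python
import re


def collapse_repeats(word):
    """Reduce every run of a repeated character to a single character."""
    return re.sub(r"(.)\1+", r"\1", word)


def candidates(word, dictionary):
    """Collect dictionary entries related to a misspelled `word`."""
    found = set()
    # Repeated characters removed: both words collapse to the same form.
    key = collapse_repeats(word)
    found |= {w for w in dictionary if w != word and collapse_repeats(w) == key}
    # Two concatenated words: try every position for the missing space.
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in dictionary and right in dictionary:
            found.add(left + " " + right)
    # Dictionary words that have the original word as a proper prefix.
    found |= {w for w in dictionary if w != word and w.startswith(word)}
    return found


words = {"hello", "fourth", "amendment", "to", "too", "toot"}
```

With this toy dictionary, `candidates("helllo", words)` gives `{"hello"}` and `candidates("fourthamendment", words)` gives `{"fourth amendment"}`.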

Spelling Correction Algorithm Continued…
• Phase 2: Heuristics used
  • Return the candidate that differs from the original only in accented characters.
  • Return the candidate that differs in only one character, where the error corresponds to an adjacent letter in the same row of the keyboard.
  • Return the smallest candidate among those that have the same metaphone key as the original string.
  • Return the candidate that differs in only one character, where the error corresponds to an adjacent letter in an adjacent row of the keyboard.
  • As a last resort, return the last word.
• Heuristics are applied in sequence, moving to the next only if no matching word is found.
• If more than one word matches, return the one whose first character matches the original.
• If there is still more than one, choose the word with the highest frequency.
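
A sketch of how the sequential heuristics and the tie-breaking rules could fit together. It covers only the first two heuristics (accents, same-row keyboard neighbours) and leaves out the metaphone, adjacent-row and last-word rules. The QWERTY layout and all function names here are assumptions made for illustration, not details taken from the paper.

```python
import unicodedata

QWERTY_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")  # assumed keyboard layout


def strip_accents(text):
    """Drop combining marks so that 'são' compares equal to 'sao'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def same_row_neighbours(a, b):
    """True if a and b sit next to each other in one keyboard row."""
    for row in QWERTY_ROWS:
        if a in row and b in row:
            return abs(row.index(a) - row.index(b)) == 1
    return False


def differs_by_adjacent_key(word, cand):
    """True if the words differ in exactly one position, by neighbouring keys."""
    if len(word) != len(cand):
        return False
    diffs = [(x, y) for x, y in zip(word, cand) if x != y]
    return len(diffs) == 1 and same_row_neighbours(*diffs[0])


def tie_break(word, matches, frequencies):
    """Prefer a matching first character, then the highest frequency."""
    same_first = [m for m in matches if m[:1] == word[:1]]
    pool = same_first or matches
    return max(pool, key=lambda m: frequencies.get(m, 0))


def select_best(word, cands, frequencies):
    """Apply the heuristics in order; stop at the first one with a match."""
    heuristics = (
        lambda c: c != word and strip_accents(c) == strip_accents(word),
        lambda c: differs_by_adjacent_key(word, c),
    )
    for rule in heuristics:
        matches = [c for c in cands if rule(c)]
        if matches:
            return tie_break(word, matches, frequencies)
    return None  # the remaining heuristics are not covered by this sketch
```

For instance, with the candidates "dog" and "fig" for the input "fog", both pass the keyboard rule, and "fig" is chosen because its first character matches, regardless of frequency.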

Results Comparison
• Aspell Spell Checker
  • http://aspell.sourceforge.net/
  • Aspell uses the Metaphone algorithm with a near-miss strategy
  • 48.33% of correct forms were correctly guessed.
  • Outperformed Aspell by 1.66%

* Doesn’t detect the misspelling - Failed in returning a suggestion.

Results Comparison Continued…
• Tumba!: Search engine for the Portuguese web

Table: Results from spelling checker with Tumba!

Conclusion & Future Work

• The spelling checker uses a ternary search tree data structure to store the dictionary.
• The dictionary data was drawn from two popular Portuguese newspapers.
• Search engine queries may contain company or person names. In such cases, keeping two dictionaries, one in the TST used for correction and another in a hash table used only for checking valid words, could yield good results.
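
The two-dictionary idea could look roughly like this. A Python `set` plays the role of the hash table, a second set stands in for the TST, and `correct` is a hypothetical callback for the correction algorithm. All names and the sample words are assumptions for illustration.

```python
def check_and_correct(word, valid_words, correction_words, correct):
    """Accept known words as they are; correct everything else.

    valid_words      -- hash-based set used only to accept words (e.g. names)
    correction_words -- dictionary that corrections are drawn from
    correct          -- hypothetical callback: (word, dictionary) -> suggestion
    """
    if word in valid_words or word in correction_words:
        return word  # a valid word, never offered as a correction target itself
    return correct(word, correction_words)


valid = {"tumba", "google"}  # names that should not be "corrected"
corrections = {"jornal", "lisboa"}  # ordinary dictionary words
```

Names in the hash table are never flagged as misspellings, yet they are also never proposed as corrections, since suggestions come only from the correction dictionary.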

Pros & Cons
• Pros
  • Considered various factors affecting edit distance, including probabilistic estimations.
  • Used a feedback system to improve the quality of results for user queries.
• Cons
  • Did not consider context-sensitive spell checking.
  • The system is not language independent; it focuses mainly on Portuguese words.
  • No discussion of spell-corrected completion suggestions as a query is incrementally entered.
