March 2006 Introduction to Computational Linguistics
CLINT
Tokenisation
Information Food Chain
Inference ↑
Knowledge Representation ↑
Meaning Extraction ↑
Semantic Relationships ↑
Chunking (noun phrases; verb phrases) ↑
Part of Speech Annotation ↑
Paragraph and sentence identification ↑
Tokenisation ↑
Raw Text
Start with a Corpus
• A corpus is an organised body of materials from language that is used as a basis for empirical studies.
• Corpora are classified according to
– Representativeness
– Medium
– Language
– Information Content
– Structure
Examples of Corpora
• Project Gutenberg: public domain text resources. http://www.promo.net/pg
• Brown Corpus: a tagged corpus of about 1M words put together at Brown University, 1960-70
• Penn Treebank: a corpus of parsed sentences based on text from the WSJ
• Canadian Hansards: bilingual (English-French) corpus of the proceedings of the Canadian Parliament.
Low Level Issues
• Preprocessing: getting rid of junk such as whitespace, images, certain formatting information etc.
• Normalisation: deciding on standard character representations; adopting upper or lower case (or both)
• Tokenisation
Tokenisation
• Tokenisation is a process which divides input text into individual units called tokens.
• Tokens are normally taken to be indivisible by the next level of analysis, but they can be associated with various kinds of information.
• An example of such information is the type of the token: word, punctuation, number
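As a rough illustration of tokens carrying a type, the sketch below splits text with a regular expression and labels each token as word, number or punctuation. The names `Token`, `TOKEN_RE` and `tokenise` are invented for this example, and the three patterns are simplifying assumptions rather than a complete tokeniser.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    kind: str  # "word", "number" or "punctuation"


# Alternatives are tried left to right, so digits are claimed by "number"
# before "word" is considered.
TOKEN_RE = re.compile(r"""
    (?P<number>\d+(?:\.\d+)?)     # integers and simple decimals
  | (?P<word>[^\W\d_]+)           # runs of letters
  | (?P<punctuation>[^\w\s])      # one character that is not a letter, digit, underscore or space
""", re.VERBOSE)


def tokenise(text: str) -> list[Token]:
    """Return the tokens of text, each paired with the name of the pattern that matched."""
    return [Token(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(text)]


for token in tokenise("The cat ate 2 rats."):
    print(token.text, token.kind)
```

Whitespace matches none of the patterns, so it is silently skipped; later stages see only the typed tokens.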
What counts as a word?
• Words are quite tricky to define
• The standard definition: a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes but no other punctuation marks (Kucera and Francis 1967)
• It is easy to find exceptions.
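One possible reading of the standard definition can be written as a regular expression and applied to whitespace-separated chunks. This is a sketch only: `CLASSIC_WORD`, `classic_words` and `exceptions` are names made up here, and restricting hyphens and apostrophes to word-internal positions is an assumption of this example.

```python
import re

# Alphanumeric characters, optionally with internal hyphens or apostrophes.
CLASSIC_WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def classic_words(text: str) -> list[str]:
    """Whitespace-delimited chunks that fit the definition."""
    return [c for c in text.split() if CLASSIC_WORD.fullmatch(c)]


def exceptions(text: str) -> list[str]:
    """Whitespace-delimited chunks that do not fit the definition."""
    return [c for c in text.split() if not CLASSIC_WORD.fullmatch(c)]


sentence = "He won't pay $22.50 on Wednesday."
print(classic_words(sentence))  # ['He', "won't", 'pay', 'on']
print(exceptions(sentence))     # ['$22.50', 'Wednesday.']
```

Even this short sentence produces two exceptions: a price, and a word with a sentence-final period attached.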
Problems Identifying Words
VfB Stuttgart scored twice in quick success-ion early in the second half on their way to a deserved 2-1 victory over Manchester United in the Champions League on Wednesday. (example from Mary Dalrymple, University of London)
• VfB Stuttgart, Manchester United
• succession
• 2-1
• Wednesday
Problems Identifying Words
Problems Involving Spaces
• Lack of spaces between words
– Lebensversicherungsgesellschaftsangestellter (life insurance company employee)
– Ix-Xemx (Maltese: the sun)
• The presence of spaces may not indicate a word break
– Coca Cola; +356 21 456 457
Problems Involving Special Characters
• Words often include non-alphanumeric characters which are actually part of the word.
– $22.50; www.di-ve.com.mt; BSc. IT; :-)
• Words are often terminated by punctuation which is not part of the word.
• Sometimes, terminating punctuation is part of the word.
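One common way to handle such cases is to try specific patterns before the general ones. The sketch below does this for prices, web addresses and emoticons. The patterns and the names `SPECIAL_RE` and `tokenise_special` are illustrative assumptions covering only the examples above, not a general solution.

```python
import re

# Specific patterns come first so they win over the generic word pattern.
SPECIAL_RE = re.compile(r"""
    (?P<money>\$\d+(?:\.\d\d)?)            # prices such as $22.50
  | (?P<url>www\.[\w-]+(?:\.[\w-]+)+)      # addresses such as www.di-ve.com.mt
  | (?P<emoticon>[:;]-?[()DP])             # a few simple emoticons
  | (?P<word>\w+(?:['-]\w+)*)              # ordinary words
  | (?P<punct>[^\w\s])                     # any leftover symbol
""", re.VERBOSE)


def tokenise_special(text: str) -> list[tuple[str, str]]:
    """Return (token, pattern name) pairs in order of appearance."""
    return [(m.group(), m.lastgroup) for m in SPECIAL_RE.finditer(text)]


for tok in tokenise_special("Pay $22.50 at www.di-ve.com.mt today :-)."):
    print(tok)
```

The final period is not absorbed into the emoticon or the address, so it surfaces as a separate punctuation token.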
Periods
• In general, punctuation marks attach to words and can be removed. However, there are special cases:
• Most periods mark the end of a sentence
• Others mark abbreviations, e.g. "e.g.", "Wash."
• Note that when an abbreviation occurs at the end of a sentence there is only one period.
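A simple way to separate the two kinds of period is to consult a list of known abbreviations. The sketch below assumes a tiny hand-made list (`ABBREVIATIONS`) and a hypothetical helper `split_periods`; it deals only with periods and leaves other punctuation attached.

```python
# Assumed toy list; a real system would need a much larger one.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Wash.", "Dr.", "Mr."}


def split_periods(text: str) -> list[str]:
    """Detach a final period unless the chunk is a known abbreviation."""
    tokens = []
    for chunk in text.split():
        if chunk.endswith(".") and chunk not in ABBREVIATIONS:
            if chunk[:-1]:
                tokens.append(chunk[:-1])
            tokens.append(".")
        else:
            # No final period, or the period belongs to the abbreviation.
            tokens.append(chunk)
    return tokens


print(split_periods("Dr. Borg moved to Wash. last year."))
print(split_periods("She lives in Wash."))
```

In the second call the single period stays with "Wash.", so the sentence boundary goes unmarked. This is the one-period case noted above, and a list lookup alone cannot resolve it.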
Apostrophe
• English contractions such as won't or I'll count as one word according to the classic definition
• However there are reasons for wanting two separate tokens – such as interaction with grammar rules (S → NP VP)
• Penn Treebank splits such contractions into two words.
Apostrophe
• This sometimes leaves odd words
– For example isn't yields is + n't
• 's is ambiguous
– Contraction of is (he's strange)
– Possessive (John's car)
• Word-final apostrophe is ambiguous
– end of quotation
– possessive of word ending in s
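Contraction splitting in the style described above can be approximated with a single regular expression. This is a sketch, not the Penn Treebank tokeniser itself: `CLITIC_RE` and `split_clitics` are names invented here, the clitic list is an assumption, and only the straight apostrophe (') is handled.

```python
import re

# n't is split off as a unit; the other clitics are split at the apostrophe.
CLITIC_RE = re.compile(r"(n't|'s|'ll|'re|'ve|'d|'m)$", re.IGNORECASE)


def split_clitics(word: str) -> list[str]:
    """Split a trailing clitic from its host, if there is one."""
    m = CLITIC_RE.search(word)
    if m and m.start() > 0:
        return [word[:m.start()], m.group()]
    return [word]


for w in ["isn't", "won't", "I'll", "he's", "John's", "cats"]:
    print(w, split_clitics(w))
```

The splitter yields the odd stem wo for won't, and it treats he's and John's identically; deciding between contraction and possessive is left to a later stage.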
Exercise
• How is the apostrophe used in Maltese?
• How should a Maltese tokeniser deal with it?
Hyphen
• Issue: do sequences of words joined by hyphens count as one word or more?
• Typesetting hyphens (at end of line) and hyphens in measure phrases (35-year-old) are usually removed.
• Typesetting hyphens can be ambiguous
• Lexical hyphens are usually kept
– hi-fi
• Hyphens standing alone are used as punctuation.
• Texts are often inconsistent in usage of hyphens
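One heuristic for telling typesetting hyphens from lexical ones is to check a word list. The sketch below assumes a toy lexicon (`LEXICON`) and a hypothetical function `resolve_hyphen`; with a realistic lexicon some tokens would still be ambiguous.

```python
# Assumed toy word list for illustration only.
LEXICON = {"success", "succession", "hi-fi"}


def resolve_hyphen(token: str) -> str:
    """Remove a hyphen only when the joined form is a known word."""
    if "-" not in token or token in LEXICON:
        return token          # no hyphen, or a known lexical hyphen
    joined = token.replace("-", "")
    if joined in LEXICON:
        return joined         # looks like a typesetting hyphen
    return token              # unknown: leave unchanged


for t in ["success-ion", "hi-fi", "2-1"]:
    print(t, "->", resolve_hyphen(t))
```

The lexicon is consulted for the hyphenated form first, so a lexical hyphen such as hi-fi is kept even if its joined form also happened to be listed.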
Case
• Types vs. Tokens
– How many tokens in the following sentence: The cat chased the rat on the table
– How many types?
• Tokenisation should correctly identify word types, i.e.
– Tokens of the same type should be identified
– Tokens of different type should be distinguished
• Case representation of ordinary words must be standardised.
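The effect of case on the type count can be checked directly. The sketch below uses plain whitespace splitting, and the helper `count_tokens_and_types` is a name made up for this example.

```python
def count_tokens_and_types(sentence: str, fold_case: bool = False) -> tuple[int, int]:
    """Return (number of tokens, number of distinct types)."""
    tokens = sentence.split()
    if fold_case:
        tokens = [t.lower() for t in tokens]
    return len(tokens), len(set(tokens))


sentence = "The cat chased the rat on the table"
print(count_tokens_and_types(sentence))                  # (8, 7)
print(count_tokens_and_types(sentence, fold_case=True))  # (8, 6)
```

Without case folding, The and the count as different types; after folding they are identified as one, which is why case must be standardised.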
Case
• Heuristics
– Map first character of a sentence to standard case
– Map all words in titles to lowercase
• Problems
– Identification of sentence boundaries
– Identification of proper names
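The first heuristic might be sketched as follows, assuming sentence boundaries have already been found and that a set of proper names is available. Both assumptions are exactly the problems listed above; `normalise_sentence_case` and `proper_names` are hypothetical names.

```python
def normalise_sentence_case(tokens: list[str], proper_names: set[str]) -> list[str]:
    """Lowercase the first letter of a sentence unless its first word is a known proper name."""
    if not tokens:
        return []
    first = tokens[0]
    if first not in proper_names:
        first = first[:1].lower() + first[1:]
    return [first] + tokens[1:]


names = {"Stuttgart", "Manchester"}  # assumed toy list
print(normalise_sentence_case(["The", "cat", "saw", "Manchester"], names))
print(normalise_sentence_case(["Stuttgart", "scored", "twice"], names))
```

A name missing from the list would be lowercased by mistake, so the quality of the result depends on how well proper names can be identified.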
Normalisation
• Character representations.
• Converting all letters to lower or upper case
• Removing punctuation
• Removing accent marks and other diacritics from letters
• Expanding abbreviations
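The normalisation steps listed above can be chained as in the sketch below. It uses Python's standard `unicodedata` and `string` modules; the abbreviation table and the function names are assumptions made for illustration, and only ASCII punctuation is removed.

```python
import string
import unicodedata

# Assumed toy table mapping abbreviations (already lowercased) to expansions.
ABBREVIATION_EXPANSIONS = {"dr": "doctor", "st": "street"}


def strip_diacritics(text: str) -> str:
    """Decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise(text: str) -> list[str]:
    text = strip_diacritics(text).lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [ABBREVIATION_EXPANSIONS.get(w, w) for w in text.split()]


print(normalise("Dr. Schütze visited the Café."))
# ['doctor', 'schutze', 'visited', 'the', 'cafe']
```

Each step discards information (case, accents, punctuation), so which steps to apply is a design decision that depends on the later stages of analysis.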
Further Normalisation
• Stemming: are eats and eating different words?
• They are two different wordforms that have the same stem, eat, but different suffixes, -s and -ing
• Stemming versus full morphological analysis.
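A deliberately naive suffix stripper shows the difference between stemming and morphological analysis. It knows only the two suffixes from the example; `SUFFIXES` and `naive_stem` are names invented here, and the minimum-length check is an arbitrary assumption.

```python
SUFFIXES = ("ing", "s")  # only the two suffixes from the example


def naive_stem(word: str) -> str:
    """Strip a known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


for w in ["eats", "eating", "sing", "running"]:
    print(w, "->", naive_stem(w))
```

Both eats and eating reduce to eat, but running becomes runn. A stemmer only chops strings, whereas full morphological analysis would recover the stem run and identify the suffix.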
Summary
• The tokenisation problem interacts with design decisions at different levels concerning
– Handling of non-alphanumeric characters
– Case
– Punctuation
• Typically many of these problems are dealt with by hand-crafting special rules, each of which matches a particular case.
• Such rules are often built out of regular expressions.
Sources
Foundations of Statistical Natural Language Processing, Manning and Schütze, MIT Press 1999