
March 2006 Introduction to Computational Linguistics

1

CLINT

Tokenisation


2

Information Food Chain

Inference
↑ Knowledge Representation
↑ Meaning Extraction
↑ Semantic Relationships
↑ Chunking (noun phrases; verb phrases)
↑ Part of Speech Annotation
↑ Paragraph and sentence identification
↑ Tokenisation
↑ Raw Text


3

Start with a Corpus

• A corpus is an organised body of materials from language that is used as a basis for empirical studies.

• Corpora are classified according to
– Representativeness
– Medium
– Language
– Information Content
– Structure


4

Examples of Corpora

• Project Gutenberg: public domain text resources. http://www.promo.net/pg

• Brown Corpus: a tagged corpus of about 1M words compiled at Brown University, 1960-70

• Penn Treebank: a corpus of parsed sentences based on text from the WSJ

• Canadian Hansards: bilingual (English-French) corpus of the proceedings of the Canadian parliament.



5

Low Level Issues

• Preprocessing: getting rid of junk such as whitespace, images, certain formatting information etc.

• Normalisation: deciding on standard character representations; adopting upper or lower case (or both)

• Tokenisation


6

Tokenisation

• Tokenisation is a process which divides input text into individual units called tokens.

• Tokens are normally taken to be indivisible by the next level of analysis, but they can be associated with various kinds of information.

• An example of such information is the type of the token: word, punctuation, number
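As an illustration of tokens carrying a type, here is a minimal Python sketch. The pattern, the three type names, and the function name `tokenise` are assumptions chosen for this example, not a standard tokeniser.

```python
import re

# Alternatives are tried left to right, so "2006" is typed as a number
# before the more general word pattern gets a chance to match it.
TOKEN_PATTERN = re.compile(r"""
    (?P<number>\d+(?:\.\d+)?)   # integers and simple decimals
  | (?P<word>\w+)               # runs of letters, digits and underscore
  | (?P<punctuation>[^\w\s])    # any single character that is neither
""", re.VERBOSE)

def tokenise(text):
    """Return a list of (token, type) pairs; whitespace is discarded."""
    return [(m.group(), m.lastgroup) for m in TOKEN_PATTERN.finditer(text)]

print(tokenise("Tokenisation, March 2006."))
# [('Tokenisation', 'word'), (',', 'punctuation'), ('March', 'word'),
#  ('2006', 'number'), ('.', 'punctuation')]
```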


7

What counts as a word?

• Words are quite tricky to define

• The standard definition: a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes but no other punctuation marks (Kucera and Francis 1967)

• It is easy to find exceptions.
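The classic definition can be sketched directly: split on spaces and accept only chunks made of alphanumerics, hyphens and apostrophes. The regular expression and the function name `classic_words` are assumptions for illustration (ASCII only, straight apostrophe); the rejected chunks show how quickly exceptions turn up.

```python
import re

# Alphanumeric characters, possibly with hyphens and apostrophes,
# and no other punctuation.
CLASSIC_WORD = re.compile(r"[A-Za-z0-9'-]+")

def classic_words(text):
    """Split on whitespace; sort chunks into those that fit the definition
    and those that do not."""
    kept, rejected = [], []
    for chunk in text.split():
        if CLASSIC_WORD.fullmatch(chunk):
            kept.append(chunk)
        else:
            rejected.append(chunk)
    return kept, rejected

kept, rejected = classic_words("He won't pay $22.50 for a hi-fi, will he?")
print(kept)      # ['He', "won't", 'pay', 'for', 'a', 'will']
print(rejected)  # ['$22.50', 'hi-fi,', 'he?']
```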


8

Problems Identifying Words

VfB Stuttgart scored twice in quick success-ion early in the second half on their way to a deserved 2-1 victory over Manchester United in the Champions League on Wednesday.
(example from Mary Dalrymple, University of London)

• VfB Stuttgart, Manchester United
• succession
• 2-1
• Wednesday


9

Problems Identifying Words: Problems Involving Spaces

• Lack of spaces between words
– Lebensversicherungsgesellschaftsangestellter (life insurance company employee)
– Ix-Xemx

• The presence of spaces may not indicate a word break
– Coca Cola; +356 21 456 457


10

Problems Involving Special Characters

• Words often include non-alphanumeric characters which are actually part of the word.
– $22.50; www.di-ve.com.mt; BSc. IT; :-)

• Words are often terminated by punctuation which is not part of the word.

• Sometimes, terminating punctuation is part of the word.


11

Periods

• In general, punctuation marks attach to words, and can be removed. However there are special cases:

• Most periods mark end of sentence

• Others mark abbreviations, e.g. "e.g.", "Wash."

• Note that when an abbreviation occurs at the end of a sentence there is only one period.
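A minimal sketch of period handling, assuming a small hand-made abbreviation list (`ABBREVIATIONS` and both function names are hypothetical): a final period is split off as its own token unless the chunk is a known abbreviation.

```python
# Hypothetical, deliberately tiny abbreviation list.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Wash."}

def split_final_period(chunk):
    """Detach a trailing period unless the chunk is a listed abbreviation."""
    if len(chunk) > 1 and chunk.endswith(".") and chunk not in ABBREVIATIONS:
        return [chunk[:-1], "."]
    return [chunk]

def tokenise_periods(text):
    tokens = []
    for chunk in text.split():
        tokens.extend(split_final_period(chunk))
    return tokens

print(tokenise_periods("She moved to Wash. last year."))
# ['She', 'moved', 'to', 'Wash.', 'last', 'year', '.']
print(tokenise_periods("She moved to Wash."))
# ['She', 'moved', 'to', 'Wash.']
```

The second call shows the single-period case: the abbreviation keeps its period, so no separate end-of-sentence token is produced and the sentence boundary has to be recovered by other means.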


12

Apostrophe

• English contractions such as won't or I'll count as one word according to the classic definition

• However there are reasons for wanting two separate tokens – such as interaction with grammar rules (S → NP VP)

• Penn Treebank splits such contractions into two words.


13

Apostrophe

• This sometimes leaves odd words
– For example isn't yields is + n't

• 's is ambiguous
– Contraction of is (he's strange)
– Possessive (John's car)

• Word-final apostrophe is ambiguous
– End of quotation
– Possessive of word ending in s
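Splitting contractions in the Penn Treebank style can be sketched with one regular expression. The list of clitics and the name `split_contraction` are assumptions for this example, and only the straight apostrophe (') is handled.

```python
import re

# A word followed by one of a few English clitics.
# The lazy \w+? lets "isn't" divide as is + n't rather than isn + 't.
CONTRACTION = re.compile(r"^(\w+?)(n't|'ll|'re|'ve|'s|'d|'m)$", re.IGNORECASE)

def split_contraction(word):
    """Return [stem, clitic] for a contraction, or [word] otherwise."""
    m = CONTRACTION.match(word)
    return [m.group(1), m.group(2)] if m else [word]

for w in ["isn't", "won't", "I'll", "John's", "he's", "dogs'"]:
    print(w, "->", split_contraction(w))
# isn't -> ['is', "n't"]
# won't -> ['wo', "n't"]
# I'll -> ['I', "'ll"]
# John's -> ['John', "'s"]
# he's -> ['he', "'s"]
# dogs' -> ["dogs'"]
```

The sketch splits John's and he's identically, so deciding between possessive and contracted is is left to a later stage, and the word-final apostrophe in dogs' is not touched at all.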


14

Exercise

• How is the apostrophe used in Maltese?

• How should a Maltese tokeniser deal with it?


15

Hyphen

• Issue: do sequences of words joined by hyphens count as one word or more?

• Typesetting hyphens (at end of line) and hyphens in measure phrases (35-year-old) are usually removed.

• Typesetting hyphens can be ambiguous

• Lexical hyphens are usually kept: hi-fi

• Hyphens – standing alone – are used as punctuation.

• Texts are often inconsistent in usage of hyphens
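Removing typesetting hyphens can be sketched as a single substitution that rejoins a word broken across a line end. The function name is an assumption; the second example shows the ambiguity, since a lexical hyphen that happens to fall at a line end is removed as well.

```python
import re

def join_line_break_hyphens(text):
    """Rejoin words split by a hyphen at the end of a line."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)

print(join_line_break_hyphens("in quick success-\nion early"))
# in quick succession early
print(join_line_break_hyphens("a new hi-\nfi system"))
# a new hifi system   (the lexical hyphen is lost)
```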


16

Case

• Types vs. Tokens
– How many tokens in the following sentence: The cat chased the rat on the table
– How many types?

• Tokenisation should correctly identify word types, i.e.
– Tokens of the same type should be identified
– Tokens of different type should be distinguished

• Case representation of ordinary words must be standardised.
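The effect of case on the type count can be checked in a few lines of Python, assuming simple whitespace splitting is good enough for this sentence:

```python
sentence = "The cat chased the rat on the table"

tokens = sentence.split()
types_case_sensitive = set(tokens)               # "The" and "the" differ
types_lowercased = {t.lower() for t in tokens}   # "The" and "the" merge

print(len(tokens), len(types_case_sensitive), len(types_lowercased))
# 8 7 6
```

Without case standardisation "The" and "the" count as different types, which is the situation the last bullet is meant to avoid.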


17

Case

• Heuristics
– Map first character of a sentence to standard case
– Map all words in titles to lowercase

• Problems
– Identification of sentence boundaries
– Identification of proper names
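The first heuristic might be sketched as follows, assuming the sentence has already been split into tokens. The function name and the "capital followed by lowercase letters" test are assumptions made for this example.

```python
def normalise_sentence_initial(tokens):
    """Lower-case the first token of a sentence if it looks like an
    ordinary capitalised word (one capital followed by lowercase letters).
    Single-letter words and mixed-case tokens are left unchanged."""
    if not tokens:
        return []
    first = tokens[0]
    if first[:1].isupper() and first[1:].islower():
        first = first.lower()
    return [first] + tokens[1:]

print(normalise_sentence_initial(["The", "cat", "sat"]))
# ['the', 'cat', 'sat']
print(normalise_sentence_initial(["VfB", "Stuttgart", "scored"]))
# ['VfB', 'Stuttgart', 'scored']
print(normalise_sentence_initial(["Manchester", "United", "lost"]))
# ['manchester', 'United', 'lost']
```

The last call illustrates the proper-name problem: without a list of names, a sentence-initial "Manchester" is lower-cased just like "The".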


18

Normalisation

• Character representations.

• Converting all letters to lower or upper case

• Removing punctuation

• Removing accent marks and other diacritics from letters

• Expanding abbreviations
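Three of these steps (case folding, removing diacritics, removing punctuation) can be sketched with the Python standard library. The function names are assumptions, `string.punctuation` covers ASCII punctuation only, and abbreviation expansion is left out because it needs a lookup table.

```python
import string
import unicodedata

def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalise(text):
    """Remove diacritics, lower-case, and delete ASCII punctuation."""
    text = strip_diacritics(text).lower()
    return text.translate(str.maketrans("", "", string.punctuation))

print(normalise("Schütze, MIT 1999!"))
# schutze mit 1999
```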


19

Further Normalisation

• Stemming: are eats and eating different words?

• They are two different wordforms that have the same stem, eat, but different suffixes, -s and -ing

• Stemming versus full morphological analysis.
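A deliberately naive suffix-stripping stemmer shows the idea and its limits. The suffix list and the minimum-length check are assumptions for this sketch; it is not a published stemming algorithm.

```python
# Hypothetical suffix list; longer suffixes are tried first.
SUFFIXES = ("ing", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

for w in ["eats", "eating", "eat", "sing", "running"]:
    print(w, "->", naive_stem(w))
# eats -> eat
# eating -> eat
# eat -> eat
# sing -> sing
# running -> runn
```

The output for "running" is where plain suffix stripping falls short of full morphological analysis, which would relate it to the stem run.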


20

Summary

• The tokenisation problem interacts with design decisions at different levels concerning
– Handling of non-alphanumeric characters
– Case
– Punctuation

• Typically many of these problems are dealt with by hand crafting special rules which match a particular case.

• Such rules are often built out of regular expressions.
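A rule-based tokeniser of this kind can be sketched as an ordered list of named regular expressions, with special cases placed before general ones. The rule names and patterns below are assumptions written for the examples used earlier in these slides and would need extending for real text.

```python
import re

# Ordered rules: the first alternative that matches at a position wins.
RULES = [
    ("url",      r"www\.[\w.-]+\w"),      # www.di-ve.com.mt
    ("money",    r"\$\d+(?:\.\d+)?"),     # $22.50
    ("phone",    r"\+\d+(?:\s\d+)+"),     # +356 21 456 457
    ("emoticon", r":-\)"),                # :-)
    ("number",   r"\d+(?:\.\d+)?"),
    ("word",     r"\w+(?:[-']\w+)*"),     # allows hi-fi, won't
    ("punct",    r"[^\w\s]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))

def tokenise_with_rules(text):
    """Return (rule name, token) pairs in order of appearance."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)]

text = "Pay $22.50 at www.di-ve.com.mt or call +356 21 456 457 :-)"
for kind, token in tokenise_with_rules(text):
    print(kind, token)
# word Pay
# money $22.50
# word at
# url www.di-ve.com.mt
# word or
# word call
# phone +356 21 456 457
# emoticon :-)
```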


21

Sources

Foundations of Statistical Natural Language Processing, Manning and Schütze, MIT Press, 1999