Date post: | 27-Dec-2015 |
Upload: | ella-bradley |
Methods in Computational Linguistics II
with reference to Matt Huenerfauth’s Language Technology material
Lecture 4: Matching Things. Regular Expressions
Regular Expressions
• Can be viewed as a way to:
– Specify search patterns over a text string
– Design a particular kind of machine, a Finite State Automaton (FSA); we probably won’t cover this today
– Define a formal “language”, i.e. a set of strings
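The "set of strings" view can be tried out by testing whether a whole string belongs to the set a pattern defines. A minimal sketch, assuming Python's standard re module; the pattern and the helper name in_language are illustrative and not from the lecture.

```python
import re

# Illustrative pattern: the "language" of one or more a's followed by one b.
PATTERN = r'a+b'

def in_language(s):
    """Return True if the whole string s belongs to the set PATTERN defines."""
    return re.fullmatch(PATTERN, s) is not None

for candidate in ['ab', 'aaab', 'b', 'aba']:
    print(candidate, in_language(candidate))
```

Here 'ab' and 'aaab' are members of the set, while 'b' and 'aba' are not.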
Uses of Regular Expressions
• Simple, powerful tools for large corpus analysis and ‘shallow’ processing
– What word is most likely to begin a sentence?
– What word is most likely to begin a question?
– Are you more or less polite than the people you correspond with?
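The first question can be approximated with one pattern and a counter. This is a rough sketch on a toy corpus invented for illustration; the sentence-boundary heuristic is an assumption and would be fooled by abbreviations such as "Dr.".

```python
import re
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = "The cat sat. The dog barked! Did the cat run? A bird sang. The end."

# A word at the very start of the text, or after . ! ? plus whitespace.
first_words = re.findall(r'(?:^|[.!?]\s+)(\w+)', corpus)

# Lowercase so that "The" and "the" are counted together.
counts = Counter(w.lower() for w in first_words)
print(counts.most_common(1))  # [('the', 3)]
```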
Definitions
• Regular Expression: a formula in algebraic notation for specifying a set of strings
• String: any sequence of characters
• Regular Expression Search
– Pattern: specifies the set of strings we want to search for
– Corpus: the texts we want to search through
Character Groups
• Some groups of characters are used very frequently, so the RE language includes shorthands for them
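As an illustration, three of the shorthands available in Python's re module are \d, \w and \s. The sample text below is invented; this is a small sketch rather than a full list of the shorthands.

```python
import re

text = "Room 101 opens at 9 am"

digits = re.findall(r'\d+', text)   # \d: a digit, like [0-9] for ASCII text
words = re.findall(r'\w+', text)    # \w: a letter, digit or underscore
pieces = re.split(r'\s+', text)     # \s: whitespace (space, tab, newline)

print(digits)  # ['101', '9']
print(words)   # ['Room', '101', 'opens', 'at', '9', 'am']
print(pieces)  # ['Room', '101', 'opens', 'at', '9', 'am']
```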
Escape Characters
• Sometimes you want to use an asterisk “*” as an asterisk and not as a modifier.
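A short sketch of the difference, using Python's re module with invented sample strings: a backslash turns the asterisk into a literal character, and re.escape adds the backslashes for you.

```python
import re

text = "2 * 3 = 6 and a*b"

# Escaped: \* matches a literal asterisk.
stars = re.findall(r'\*', text)

# Unescaped: a* means "zero or more a", so * acts as a modifier.
a_runs = re.findall(r'a*b', "b ab aab")

# re.escape builds a pattern that matches the given text literally.
escaped = re.escape('a*b')

print(stars)                       # ['*', '*']
print(a_runs)                      # ['b', 'ab', 'aab']
print(re.findall(escaped, text))   # ['a*b']
```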
RE Matching in Python NLTK
• Set up:
– import re
– from nltk.util import re_show
– sent = "colourless green ideas sleep furiously"
• re_show(pattern, string)
– Shows where the pattern matches
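NLTK's re_show prints the string with each match marked by braces. So that the sketch runs without NLTK installed, the helper show_matches below is a hypothetical stand-in built on re.sub; it only approximates re_show and is not NLTK's implementation.

```python
import re

sent = "colourless green ideas sleep furiously"

def show_matches(pattern, string):
    """Hypothetical stand-in for re_show: wrap each match in braces."""
    return re.sub(pattern, lambda m: '{' + m.group() + '}', string)

print(show_matches(r'e+', sent))
# colourl{e}ss gr{ee}n id{e}as sl{ee}p furiously
```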
Substitutions
• Replace every l with an s
• re.sub('l', 's', sent)
– 'cosoursess green ideas sseep furioussy'
• re.sub('green', 'red', sent)
– 'colourless red ideas sleep furiously'
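A runnable version of these substitutions, as a small sketch. The count argument in the last call is an addition not shown on the slide; it limits how many replacements are made. Note that re.sub returns a new string and leaves sent unchanged.

```python
import re

sent = "colourless green ideas sleep furiously"

all_l = re.sub('l', 's', sent)              # replace every l
one_word = re.sub('green', 'red', sent)     # replace a whole word
first_l = re.sub('l', 's', sent, count=1)   # replace only the first l

print(all_l)     # cosoursess green ideas sseep furioussy
print(one_word)  # colourless red ideas sleep furiously
print(first_l)   # cosourless green ideas sleep furiously
```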
Findall
• re.findall(pattern, sent)
– Returns all of the non-overlapping substrings that match the pattern
– re.findall('(green|sleep)', sent) returns ['green', 'sleep']
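A runnable sketch of findall on the same sentence. The second pattern, a run of vowels, is an invented example to show that every non-overlapping match is returned in order.

```python
import re

sent = "colourless green ideas sleep furiously"

words = re.findall('(green|sleep)', sent)
vowel_runs = re.findall(r'[aeiou]+', sent)

print(words)       # ['green', 'sleep']
print(vowel_runs)  # ['o', 'ou', 'e', 'ee', 'i', 'ea', 'ee', 'u', 'iou']
```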
Match
• Matches from the beginning of the string
• re.match(pattern, string)
– Returns: a Match object, or None if the pattern does not match at the start
• Match objects contain information about the search
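A brief sketch of what a Match object carries, using the sentence from the set-up slide. Because re.match returns None on failure, the result should be checked before calling methods on it.

```python
import re

sent = "colourless green ideas sleep furiously"

m = re.match('colour', sent)     # pattern occurs at the start: Match object
miss = re.match('green', sent)   # 'green' is not at the start: None

if m is not None:
    print(m.group(), m.start(), m.end())  # colour 0 6
print(miss)  # None
```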
Search
• re.search(pattern, string)
– Finds the pattern anywhere in the string
– re.search(r'\d+', ' 1034 ').group() returns '1034'
– re.search(r'\d+', ' abc123 ').group() returns '123'
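A sketch contrasting search with match on the same input. Like re.match, re.search returns None when nothing is found, so calling .group() directly on the result fails in that case; the guard at the end is one way to handle it.

```python
import re

text = ' abc123 '

found = re.search(r'\d+', text)       # digits appear later in the string
at_start = re.match(r'\d+', text)     # but not at the very beginning
nothing = re.search(r'\d+', 'no digits here')

print(found.group(), found.span())    # 123 (4, 7)
print(at_start)                       # None
print(nothing.group() if nothing else 'no match')  # no match
```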
Splitting
• 'text can be made into lists'.split()
• re.split(pattern, string)
– Uses the pattern to identify the split points
– re.split(r'\d+', "I want 4 cats and 13 dogs") returns ["I want ", " cats and ", " dogs"]
– re.split(r'\s*\d+\s*', "I want 4 cats and 13 dogs") returns ["I want", "cats and", "dogs"]
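The three kinds of split can be compared side by side on one sentence. A small runnable sketch:

```python
import re

text = "I want 4 cats and 13 dogs"

by_space = text.split()                     # split on whitespace
by_digits = re.split(r'\d+', text)          # digits are the split points
trimmed = re.split(r'\s*\d+\s*', text)      # also swallow nearby spaces

print(by_space)   # ['I', 'want', '4', 'cats', 'and', '13', 'dogs']
print(by_digits)  # ['I want ', ' cats and ', ' dogs']
print(trimmed)    # ['I want', 'cats and', 'dogs']
```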
Joining
• ' '.join(['lists', 'can', 'be', 'made', 'into', 'strings'])
• This simple formatting can be helpful to report results or merge information
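A sketch of join used both ways: merging a word list back into a sentence, and formatting a small report. The counts dictionary is invented for illustration.

```python
words = ['lists', 'can', 'be', 'made', 'into', 'strings']
sentence = ' '.join(words)

# Hypothetical match counts, e.g. collected from an earlier search.
counts = {'green': 1, 'sleep': 1}
report = ', '.join(f'{word}: {n}' for word, n in counts.items())

print(sentence)  # lists can be made into strings
print(report)    # green: 1, sleep: 1
```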
Stemming with Regular Expressions
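import re

def stem(word):
    # Lazily capture the stem, then optionally one of the listed suffixes.
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    # findall returns one (stem, suffix) tuple because the pattern has two groups.
    stem, suffix = re.findall(regexp, word)[0]
    return stem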
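Applying the stemmer to each word of the example sentence shows both what it gets right and how crude it is. A usage sketch; the function is repeated so the block runs on its own, with the local variable renamed to stem_part to avoid shadowing the function name.

```python
import re

def stem(word):
    # Same pattern as the slide: lazy stem, then an optional suffix.
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem_part, suffix = re.findall(regexp, word)[0]
    return stem_part

sent = "colourless green ideas sleep furiously"
stems = [stem(w) for w in sent.split()]

print(stems)               # ['colourles', 'green', 'idea', 'sleep', 'furious']
print(stem('processing'))  # process
```

Note that 'colourless' loses only its final s, since the pattern knows nothing about which endings are real suffixes of a given word.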