Methods in Computational Linguistics II

with reference to Matt Huenerfauth’s Language Technology material

Lecture 4: Matching Things. Regular Expressions

Today

• Regular Expressions
• Snippet on Speech Recognition
  – At least half of it.

Regular Expressions

• Can be viewed as a way to:
  – Specify search patterns over a text string
  – Design a particular kind of machine, a Finite State Automaton (FSA)
    • We probably won’t cover this today.
  – Define a formal “language”, i.e. a set of strings

Uses of Regular Expressions

• Simple, powerful tools for large corpus analysis and ‘shallow’ processing
  – What word is most likely to begin a sentence?
  – What word is most likely to begin a question?
  – Are you more or less polite than the people you correspond with?
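
The first two questions can be approximated with one pattern and a counter. This is a minimal sketch, assuming a tiny made-up corpus and a naive notion of "sentence" (anything ending in ., ! or ?); the names corpus, SENTENCE_START and QUESTION_START are illustrative only.

```python
import re
from collections import Counter

# Hypothetical toy corpus; a real study would read a large text collection.
corpus = (
    "The dog barked. The cat slept. What did you see? "
    "What time is it? The end. Is it late?"
)

# First word of each sentence: a word at the start of the text,
# or one that follows sentence-final punctuation and whitespace.
SENTENCE_START = r"(?:^|(?<=[.!?])\s+)(\w+)"
first_words = Counter(re.findall(SENTENCE_START, corpus))

# Same idea, but keep only sentences that end in a question mark.
QUESTION_START = SENTENCE_START + r"[^.!?]*\?"
question_words = Counter(re.findall(QUESTION_START, corpus))

print(first_words.most_common(1))     # [('The', 3)]
print(question_words.most_common(1))  # [('What', 2)]
```

Abbreviations such as "Dr." would fool this sentence splitter, which is part of why this kind of processing is called shallow.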

Definitions

• Regular Expression: Formula in algebraic notation for specifying a set of strings
• String: Any sequence of characters
• Regular Expression Search
  – Pattern: specifies the set of strings we want to search for
  – Corpus: the texts we want to search through

Simple Example

More Examples

And still more examples

Optionality and Repetition

• /[Ww]oodchucks?/
• /colou?r/
• /he{3}/
• /(he){3}/
• /(he){3,}/
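
A quick way to check what such patterns accept is to test whole strings against them in Python. The sketch below assumes the surrounding slashes are only delimiters and drops them; matches is a hypothetical helper, and the {3,} pattern illustrates "three or more" repetitions.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"[Ww]oodchucks?", "Woodchucks"))  # True: the final s is optional
print(matches(r"[Ww]oodchucks?", "woodchuck"))   # True
print(matches(r"colou?r", "color"))              # True: the u is optional
print(matches(r"he{3}", "heee"))                 # True: {3} applies to e only
print(matches(r"he{3}", "hehehe"))               # False
print(matches(r"(he){3}", "hehehe"))             # True: the group is repeated
print(matches(r"(he){3,}", "hehehehe"))          # True: three or more repeats
print(matches(r"(he){3,}", "hehe"))              # False: only two repeats
```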

Character Groups

• Some groups of characters are used very frequently, so the RE language includes shorthands for them
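
In Python's re module, for instance, \d stands for a digit, \w for a word character and \s for whitespace, and the uppercase forms (\D, \W, \S) match the complement. A small sketch on a made-up string:

```python
import re

text = "Room 42, floor 3"  # made-up example string

digits = re.findall(r"\d+", text)      # runs of digits
words = re.findall(r"\w+", text)       # runs of letters, digits, underscore
spaces = re.findall(r"\s", text)       # single whitespace characters
non_digits = re.findall(r"\D+", text)  # runs of anything except digits

print(digits)      # ['42', '3']
print(words)       # ['Room', '42', 'floor', '3']
print(spaces)      # [' ', ' ', ' ']
print(non_digits)  # ['Room ', ', floor ']
```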

Special Characters

• These enable the matching of multiple occurrences of a pattern
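
The three most common of these in Python are * (zero or more), + (one or more) and ? (zero or one). The sketch below applies each to the letter a in a small, made-up list of candidate strings; accepted is a hypothetical helper.

```python
import re

candidates = ["b", "ba", "baa", "baaa"]  # made-up test strings

def accepted(pattern):
    """Return the candidates that match pattern in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(accepted(r"ba*"))  # ['b', 'ba', 'baa', 'baaa']  zero or more a
print(accepted(r"ba+"))  # ['ba', 'baa', 'baaa']       one or more a
print(accepted(r"ba?"))  # ['b', 'ba']                 zero or one a
```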

Escape Characters

• Sometimes you want to use an asterisk “*” as an asterisk and not as a modifier.
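
In Python this is done with a backslash (or by placing the character inside a character class), and re.escape will escape a whole literal string at once. A minimal sketch on a made-up string:

```python
import re

text = "5 * 3 = 15 (approx.)"  # made-up example string

# A backslash turns the * modifier into a literal asterisk.
stars = re.findall(r"\*", text)
# Inside a character class the asterisk is also taken literally.
stars_in_class = re.findall(r"[*]", text)

# re.escape adds the backslashes needed to match a string literally.
literal = re.escape("(approx.)")
found = re.search(literal, text).group()

print(stars)           # ['*']
print(stars_in_class)  # ['*']
print(literal)         # \(approx\.\)
print(found)           # (approx.)
```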

RE Matching in Python NLTK

• Set up:
  – import re
  – from nltk.util import re_show
  – sent = "colourless green ideas sleep furiously"
• re_show(pattern, str)
  – shows where the pattern matches
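
If NLTK is not installed, the same kind of display can be approximated with the standard library alone. The sketch below is a stand-in, not NLTK's implementation: show_matches is a hypothetical helper that wraps every match in braces and returns the result instead of printing it.

```python
import re

def show_matches(pattern, text, left="{", right="}"):
    """Return text with every non-overlapping match wrapped in markers."""
    return re.sub(pattern, lambda m: left + m.group(0) + right, text)

sent = "colourless green ideas sleep furiously"

print(show_matches(r"ee", sent))
# colourless gr{ee}n ideas sl{ee}p furiously
print(show_matches(r"ou", sent))
# col{ou}rless green ideas sleep furi{ou}sly
```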

Substitutions

• Replace every l with an s

• re.sub('l', 's', sent)
  – 'cosoursess green ideas sseep furioussy'
• re.sub('green', 'red', sent)
  – 'colourless red ideas sleep furiously'
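
The first argument to re.sub is a full pattern, not just a fixed string, and the optional count argument limits how many replacements are made. A short sketch, reusing the example sentence:

```python
import re

sent = "colourless green ideas sleep furiously"

# count=1 replaces only the first match.
first_only = re.sub("l", "s", sent, count=1)

# The pattern can be any regular expression, here a run of vowels.
vowel_runs_masked = re.sub(r"[aeiou]+", "_", "green ideas")

print(first_only)         # cosourless green ideas sleep furiously
print(vowel_runs_masked)  # gr_n _d_s
```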

Findall

• re.findall(pattern, sent)
  – will return all of the substrings that match the pattern
  – re.findall('(green|sleep)', sent)
    • ['green', 'sleep']
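
One detail worth knowing: when the pattern contains parentheses, re.findall returns what the groups captured rather than the whole match. The sketch below contrasts no group, several capturing groups, and a non-capturing group (?:...).

```python
import re

sent = "colourless green ideas sleep furiously"

# No group: the whole match is returned.
whole = re.findall(r"green|sleep", sent)

# Two capturing groups: one tuple per match, one item per group.
pairs = re.findall(r"(gr|sl)(ee)", sent)

# Non-capturing group: grouping for | without changing the result.
whole_again = re.findall(r"(?:gr|sl)ee", sent)

print(whole)        # ['green', 'sleep']
print(pairs)        # [('gr', 'ee'), ('sl', 'ee')]
print(whole_again)  # ['gree', 'slee']
```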

Match

• Matches from the beginning of the string
• re.match(pattern, string)
  – Returns: a Match object or None (if not found)

• Match objects contain information about the search
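
A few of the standard Match methods are group, groups, start, end and span. The sketch below shows them on a two-word pattern; the example strings are made up.

```python
import re

m = re.match(r"(\w+) (\w+)", "colourless green ideas")

print(m.group())   # 'colourless green'  the whole match
print(m.group(1))  # 'colourless'        the first parenthesised group
print(m.groups())  # ('colourless', 'green')
print(m.start())   # 0   index where the match begins
print(m.end())     # 16  index just past the match
print(m.span())    # (0, 16)

# match only looks at the beginning of the string, so this is None.
no_match = re.match(r"green", "colourless green ideas")
print(no_match)    # None
```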

Methods in Match

More Match Methods

Search

• re.search(pattern, string)
  – Finds the pattern anywhere in the string.
  – re.search(r'\d+', ' 1034 ').group()
    • '1034'
  – re.search(r'\d+', ' abc123 ').group()
    • '123'
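
The contrast with match, and the need to check for None before calling group, can be sketched as follows. The helper first_number is hypothetical.

```python
import re

text = " abc123 "

at_start = re.match(r"\d+", text)   # digits are not at the start: None
anywhere = re.search(r"\d+", text)  # finds the first run of digits

print(at_start)          # None
print(anywhere.group())  # 123
print(anywhere.span())   # (4, 7)

def first_number(s):
    """Return the first run of digits in s, or None if there is none."""
    found = re.search(r"\d+", s)
    # Calling .group() on None would raise AttributeError, so check first.
    return found.group() if found is not None else None

print(first_number("no digits here"))  # None
```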

Splitting

• 'text can be made into lists'.split()
• re.split(pattern, string)
  – uses the pattern to identify the split point
  – re.split(r'\d+', "I want 4 cats and 13 dogs")
    • ["I want ", " cats and ", " dogs"]
  – re.split(r'\s*\d+\s*', "I want 4 cats and 13 dogs")
    • ["I want", "cats and", "dogs"]
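
If the numbers themselves should be kept, putting them in a capturing group makes re.split return the separators as well. A short sketch on the same sentence:

```python
import re

text = "I want 4 cats and 13 dogs"

# Plain string split: breaks on runs of whitespace.
tokens = text.split()

# Capturing group: the matched numbers are kept in the result.
with_numbers = re.split(r"\s*(\d+)\s*", text)

print(tokens)        # ['I', 'want', '4', 'cats', 'and', '13', 'dogs']
print(with_numbers)  # ['I want', '4', 'cats and', '13', 'dogs']
```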

Joining

• ' '.join(['lists', 'can', 'be', 'made', 'into', 'strings'])

• This simple formatting can be helpful to report results or merge information
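
A sketch of both uses: joining words back into a sentence, and formatting counts into a one-line report. The counts dictionary is made-up illustration data, and join needs strings, so numbers are formatted first.

```python
words = ["lists", "can", "be", "made", "into", "strings"]
sentence = " ".join(words)

# Hypothetical counts to report; each item is formatted as a string.
counts = {"cats": 4, "dogs": 13}
report = ", ".join(f"{name}: {n}" for name, n in counts.items())

# split followed by join also normalises irregular spacing.
normalized = " ".join("  too   many    spaces ".split())

print(sentence)    # lists can be made into strings
print(report)      # cats: 4, dogs: 13
print(normalized)  # too many spaces
```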

Stemming with Regular Expressions

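import re

def stem(word):
    # Lazily capture the stem, then at most one known suffix before the end of the word.
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    # findall returns one (stem, suffix) tuple because the pattern has two groups.
    stem, suffix = re.findall(regexp, word)[0]
    return stem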
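
The lazy quantifier *? is what makes this work. The sketch below restates the stemmer compactly and compares it with a greedy variant; SUFFIXES and stem_greedy are hypothetical names introduced only for the comparison.

```python
import re

SUFFIXES = r"(ing|ly|ed|ious|ies|ive|es|s|ment)?$"

def stem(word):
    # Lazy (.*?): the stem is kept as short as possible, so a suffix can match.
    return re.findall(r"^(.*?)" + SUFFIXES, word)[0][0]

def stem_greedy(word):
    # Greedy (.*): the stem swallows the whole word, leaving no suffix.
    return re.findall(r"^(.*)" + SUFFIXES, word)[0][0]

for w in ["processing", "processes", "furiously", "colourless", "sleep"]:
    print(w, "->", stem(w), "| greedy:", stem_greedy(w))
# processing -> process | greedy: processing
# processes -> process | greedy: processes
# furiously -> furious | greedy: furiously
# colourless -> colourles | greedy: colourless
# sleep -> sleep | greedy: sleep
```

The result for colourless shows the limits of the approach: the final s is stripped even though it is not a suffix here.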

Play with some code

Snippet on Speech Recognition
