Date post: 04-Jan-2016
Basic Text Processing Regular Expressions
Transcript
Page 1: Basic Text Processing Regular Expressions. Dan Jurafsky 2 The original slides from: jurafsky/NLPCourseraSlides.h tml Some changes.

Basic Text Processing

Regular Expressions

Page 2:

Dan Jurafsky

The original slides from:

http://web.stanford.edu/~jurafsky/NLPCourseraSlides.html

Some changes have been made to these slides to fit our NLP course.

Page 3:

Regular expressions

• A formal language for specifying text strings
• How can we search for any of these?
  • woodchuck
  • woodchucks
  • Woodchuck
  • Woodchucks

Page 4:

Regular Expressions: Disjunctions

• Letters inside square brackets []

Pattern         Matches
[wW]oodchuck    Woodchuck, woodchuck
[1234567890]    Any digit

• Ranges [A-Z]

Pattern    Matches                 Example (match shown in red and blue on the slide)
[A-Z]      An upper case letter    Drenched Blossoms
[a-z]      A lower case letter     my beans were impatient
[0-9]      A single digit          Chapter 1: Down the Rabbit Hole
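The bracket patterns on this slide can be tried directly in Python's standard `re` module. The sketch below is only an illustration: the helper name `first_match` is made up here, and the strings are the slide's own examples.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text matched by pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# (pattern, example string) pairs taken from the slide.
examples = [
    (r"[wW]oodchuck", "Woodchuck"),
    (r"[1234567890]", "Chapter 1: Down the Rabbit Hole"),
    (r"[A-Z]", "Drenched Blossoms"),
    (r"[a-z]", "my beans were impatient"),
    (r"[0-9]", "Chapter 1: Down the Rabbit Hole"),
]

for pattern, text in examples:
    print(f"{pattern!r:16} first match in {text!r}: {first_match(pattern, text)!r}")
```

`re.search` reports only the leftmost match, which corresponds to the single character highlighted in each example.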

Page 5:

Regular Expressions: Negation in Disjunction

• Negations [^Ss]
• Caret ^ means negation only when it is the first character inside square brackets []

Pattern    Matches                     Example (match shown in red and blue on the slide)
[^A-Z]     Not an upper case letter    Oyfn pripetchik
[^Ss]      Neither ‘S’ nor ‘s’         I have no exquisite reaSon”
[^e^]      Neither e nor ^             Look here ^
a^b        The pattern a caret b       Look up a^b now
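A short Python check of the negated classes, using the slide's example strings. One caveat that is specific to Python's `re` module and not stated on the slide: an unescaped `^` outside brackets is always treated as an anchor, so the literal "a caret b" pattern has to be written `a\^b` there. Other regex dialects may behave as the slide describes.

```python
import re

# First character that is not an upper case letter.
first_non_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group(0)

# Deleting everything a negated class matches leaves only the excluded characters.
only_s = re.sub(r"[^Ss]", "", "I have no exquisite reaSon")  # caret first: negation
only_e_caret = re.sub(r"[^e^]", "", "Look here ^")           # second caret is literal

# In Python's re, ^ outside brackets is an anchor, so a^b cannot match here;
# escaping the caret gives the literal three-character pattern.
unescaped = re.search(r"a^b", "Look up a^b now")
escaped = re.search(r"a\^b", "Look up a^b now")

print(first_non_upper, only_s, only_e_caret, unescaped, escaped.group(0))
```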

Page 6:

Regular Expressions: More Disjunction

• Woodchuck is another name for groundhog!
• The pipe | for disjunction

Pattern                      Matches
groundhog|woodchuck          groundhog, woodchuck
yours|mine                   yours, mine
a|b|c                        = [abc]
[gG]roundhog|[Ww]oodchuck    groundhog, Groundhog, woodchuck, Woodchuck
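The pipe can be combined with bracket classes, as in the last row. A minimal Python sketch follows; the sentence and the word "cabbage" are invented for the demonstration.

```python
import re

# Invented sentence containing all four spellings.
text = "A groundhog is a Woodchuck; a woodchuck is a Groundhog."
either = re.findall(r"[gG]roundhog|[Ww]oodchuck", text)
print(either)

# For single characters, a|b|c and [abc] select the same matches.
same = re.findall(r"a|b|c", "cabbage") == re.findall(r"[abc]", "cabbage")
print(same)
```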

Photo D. Fletcher

Page 7:

Regular Expressions: ? * + .

Pattern    Meaning                       Matches
colou?r    Optional previous char        color colour
oo*h!      0 or more of previous char    oh! ooh! oooh! ooooh!
o+h!       1 or more of previous char    oh! ooh! oooh! ooooh!
baa+                                     baa baaa baaaa baaaaa
beg.n                                    begin begun beg3n

• A period (.) by itself means any character, but backslash period (\.) means a literal period
• Kleene *, Kleene + (named after Stephen C Kleene)
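To see where each operator stops matching, it helps to test whole words rather than search inside them. The sketch below uses `re.fullmatch` for that; the helper `matches_whole` and the non-matching words (`colouur`, `h!`, `ba`, `begn`) are additions for illustration, not part of the slide.

```python
import re

def matches_whole(pattern, word):
    """True if the entire word, not just a part of it, matches pattern."""
    return re.fullmatch(pattern, word) is not None

checks = {
    r"colou?r": ["color", "colour", "colouur"],   # ? : zero or one 'u'
    r"oo*h!":   ["oh!", "ooooh!", "h!"],          # * : zero or more of the second 'o'
    r"o+h!":    ["oh!", "ooooh!", "h!"],          # + : one or more 'o'
    r"baa+":    ["baa", "baaaaa", "ba"],          # at least two 'a'
    r"beg.n":   ["begin", "beg3n", "begn"],       # . : exactly one character
    r"beg\.n":  ["beg.n", "begin"],               # \.: a literal period only
}

for pattern, words in checks.items():
    for word in words:
        print(f"{pattern:10} {word:10} {matches_whole(pattern, word)}")
```

Note that `oo*h!` and `o+h!` accept the same strings: both require at least one `o`.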

Page 8:

Regular Expressions: Anchors ^ $

• ^ matches the beginning of the line
• $ matches the end of the line

Pattern       Matches (with blue color)
^[A-Z]        Palo Alto
^[^A-Za-z]    1    “Hello”
\.$           The end.
.$            The end? The end!
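The anchor patterns can be checked in Python as sketched below. One detail is specific to Python's `re` and is an addition to the slide: by default `^` and `$` anchor to the whole string, and the `re.MULTILINE` flag is needed to make them apply to every line. The three-line string `lines` is invented for the demonstration.

```python
import re

starts_upper = re.search(r"^[A-Z]", "Palo Alto").group(0)
starts_nonletter = re.search(r"^[^A-Za-z]", "1 Hello").group(0)
ends_period = re.search(r"\.$", "The end.").group(0)
no_period = re.search(r"\.$", "The end?")          # \. only accepts a real period
ends_any = re.search(r".$", "The end?").group(0)   # . accepts any last character

# Invented multi-line text to show the effect of re.MULTILINE.
lines = "Palo Alto\n1 fish\nSan Jose."
string_start_only = re.findall(r"^[A-Z]", lines)
every_line_start = re.findall(r"^[A-Z]", lines, re.MULTILINE)

print(starts_upper, starts_nonletter, ends_period, no_period, ends_any)
print(string_start_only, every_line_start)
```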

Page 9:

Example

• Question: Find me all instances of the word “the” in a text.
• Solutions:

Pattern                     Result
the                         Problem #1: misses capitalized examples
[tT]he                      Problem #2: incorrectly returns other or theology
[^a-zA-Z][tT]he[^a-zA-Z]    Solves both problems 1 and 2
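The three attempts can be compared by counting their matches on one string. The sentence below is contrived so that "The" opens it and "the" closes it, which exposes a limitation of the third pattern: it needs a non-letter on both sides, so it skips a "the" that sits at the very start or end of the text. The fourth pattern is an assumption added here, not part of the slide; it accepts the start or end of the string as a boundary too.

```python
import re

# Contrived sentence: "The" opens the string and "the" closes it.
text = "The other student read the theology book, then hid the"

patterns = [
    r"the",                                    # misses "The"
    r"[tT]he",                                 # also hits other, theology, then
    r"[^a-zA-Z][tT]he[^a-zA-Z]",               # slide's final pattern
    r"(?:^|[^a-zA-Z])[tT]he(?:[^a-zA-Z]|$)",   # assumption: allow string edges
]

counts = {p: len(re.findall(p, text)) for p in patterns}
for p, n in counts.items():
    print(f"{n} match(es) for {p}")
```

The sentence contains the word "the" three times (including the capitalized one), so only the last pattern returns the right count on this input.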

Page 10:

Errors

• The process we just went through was based on fixing two kinds of errors
  • Matching strings that we should not have matched (there, then, other)
    • False positives (Type I)
  • Not matching things that we should have matched (The)
    • False negatives (Type II)

Page 11:

Errors cont.

• In NLP we are always dealing with these kinds of errors.
• Reducing the error rate for an application often involves two antagonistic efforts:
  • Increasing accuracy or precision (minimizing false positives)
  • Increasing coverage or recall (minimizing false negatives).
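Precision and recall can be made concrete by scoring a pattern's matches against hand-labeled answers, using the usual definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below is illustrative only: the function `precision_recall`, the sentence, and the gold spans are all invented here, and a match counts as correct only if its span equals a gold span exactly.

```python
import re

def precision_recall(pattern, text, gold_spans):
    """Score a pattern's matches against hand-labeled (start, end) spans."""
    predicted = {m.span() for m in re.finditer(pattern, text)}
    true_pos = len(predicted & gold_spans)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    return precision, recall

text = "The other student read the theology book"
# Hand-labeled spans of the word "the": text[0:3] and text[23:26].
gold = {(0, 3), (23, 26)}

for pattern in (r"the", r"[tT]he"):
    p, r = precision_recall(pattern, text, gold)
    print(f"{pattern:8} precision={p:.2f} recall={r:.2f}")
```

Here `the` finds one of the two gold words plus two false positives inside "other" and "theology"; `[tT]he` recovers the missed "The" but keeps the same false positives.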

Page 12:

Summary

• Regular expressions play a surprisingly large role
  • Sophisticated sequences of regular expressions are often the first model for any text processing task
• For many hard tasks, we use machine learning classifiers
  • But regular expressions are used as features in the classifiers
  • Can be very useful in capturing generalizations
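One way to picture "regular expressions as features" is to turn each pattern into a 0/1 indicator that a classifier could consume. This is a hypothetical sketch: the feature names, the choice of patterns, and the function `regex_features` are invented for illustration, and no particular classifier is implied.

```python
import re

# Hypothetical feature set; each pattern comes from an earlier slide's syntax.
FEATURE_PATTERNS = {
    "has_digit": r"[0-9]",
    "starts_capitalized": r"^[A-Z]",
    "ends_with_period": r"\.$",
    "mentions_woodchuck": r"[wW]oodchucks?",
}

def regex_features(text):
    """Map a text to a dict of 0/1 indicators, one per pattern."""
    return {
        name: int(re.search(pattern, text) is not None)
        for name, pattern in FEATURE_PATTERNS.items()
    }

print(regex_features("Chapter 1: Down the Rabbit Hole"))
print(regex_features("how much wood would two woodchucks chuck."))
```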


Page 13:

Exercises in the Class

1 - See this link for practicing:

http://regexpal.com

2 - Write the following test text:

We looked! Then we saw him step in on the mat. We looked! And we saw him! The cat in the Hat!

3 - Practice these expressions:

[Ww]
[em]
[A-Z]
[a-z]
[A-Za-z]
[!]
^[Aa]
[^!]
^[A-Za-z]
looked|step
at|ook
o+
[A-Z]$
!$
\.
.
the
[tT]he
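The same practice can be done outside the browser. The sketch below runs a handful of the expressions against the test text with Python's `re.findall`; only patterns that are clearly legible in the exercise list are used, and the variable names are invented here.

```python
import re

test_text = (
    "We looked! Then we saw him step in on the mat. "
    "We looked! And we saw him! The cat in the Hat!"
)

practice = [r"[Ww]", r"o+", r"the", r"[tT]he"]

# Collect every match of each expression, in order of appearance.
results = {pattern: re.findall(pattern, test_text) for pattern in practice}
for pattern, found in results.items():
    print(f"{pattern:8} {len(found)} match(es): {found}")
```

Comparing `the` with `[tT]he` on this text reproduces both error types from the earlier slides: the first misses "The", and the second also picks up the start of "Then".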

