
CSC 4630

Meeting 21

April 4, 2007

Return to Perl

• Where are we?

• What is confusing?

• What practice do you need?

Ray’s Problem

Given a string of the form:

1 b 2 b 3 b 4 b 5 b 6 b 7 b 8 b 9 = 100

replace the 8 b's with
– one plus sign
– two minus signs
– five empty strings, signifying close up the spacing to make a number

and find which replacements yield a true statement.

Ray’s Problem (2)

Thoughts on the answer:

• 1234-56-78+9 = 100 is an example of a candidate string (its left side evaluates to 1109, so this one is not a true statement)

• How many possible strings are there?

• Proof by exhaustion may be the best
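Proof by exhaustion is small enough to sketch directly. The following is one possible Perl approach, not necessarily the one intended in class; the subroutine names candidates and value are made up for this illustration. It builds every arrangement of one plus sign, two minus signs, and five empty strings, then keeps the strings whose left side adds up to 100.

```perl
use strict;
use warnings;

# Build every string with one '+', two '-', and five '' in the 8 gaps.
sub candidates {
    my @out;
    for my $plus (0 .. 7) {
        for my $m1 (0 .. 6) {
            for my $m2 ($m1 + 1 .. 7) {
                next if $plus == $m1 || $plus == $m2;
                my @ops = ('') x 8;
                $ops[$plus] = '+';
                $ops[$m1] = $ops[$m2] = '-';
                my $s = '1';
                $s .= $ops[$_ - 2] . $_ for 2 .. 9;   # gap, then next digit
                push @out, $s;
            }
        }
    }
    return @out;
}

# Evaluate a string such as 1234-56-78+9 as a sum of signed numbers.
sub value {
    my ($expr) = @_;
    my $total = 0;
    while ($expr =~ /([+-]?)(\d+)/g) {
        if ($1 eq '-') { $total -= $2 } else { $total += $2 }
    }
    return $total;
}

my @all       = candidates();
my @solutions = grep { value($_) == 100 } @all;

print scalar(@all), " candidate strings\n";
print "$_ = 100\n" for @solutions;
```

The number of candidates is 8!/(1! 2! 5!) = 168, so checking all of them is cheap.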

Regular Expressions Revisited

Returning to a fundamental structure

• Theoretically defined

• Implemented in grep, egrep

• Implemented in awk, gawk, nawk

• Implemented in Perl

RE(2)

• Theoretically a RE defines a set of strings on an alphabet

• In implementation, matching with a RE checks whether the current string is an element of a set of strings constructed from the strings that the RE defines theoretically.

RE(3)

• A single character c
• Theoretically defines the set of strings {c}
• Which generates the set of matching lines {ScT}, where S and T are arbitrary, possibly empty strings.
• In implementation,
– grep c somelines returns ______________
– awk "/c/" somelines returns ______________
– if (/c/) {print $_;} returns ______________

RE(4)

so grep c somelines is equivalent to

perl re1 <somelines

where re1 is the Perl program

while (<STDIN>) {
    if (/c/) {print $_;}
}

RE(5)

• Theoretically, if r and s are regular expressions defining languages L and M respectively, then
– rs defines the language LM, meaning concatenate a string in L with a string in M
• Hence,
– grep abc somelines
– awk "/abc/" somelines
– while (<STDIN>) { if (/abc/) {print $_;}}

RE(6)

all return the lines that are contained in the set {SabcT} where S and T are arbitrary, possibly empty strings.

Details: /a/ defines {a}, /b/ defines {b}, /c/ defines {c}

/abc/ defines {abc} by concatenation

Lines matching /abc/ are in {SabcT}

RE(7)

• The * operator shows that the previous simple regular expression is repeated 0 or more times.

• /ab*c/ defines the language formed as the union of the languages defined by /ac/, /abc/, /abbc/, /abbbc/, etc. This is the set {ab^n c | n = 0, 1, 2, …} (an infinite set)

• Hence /ab*c/ matches any string of the form S ab^n c T
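A short Perl sketch of this claim, using made-up sample strings: grep keeps the strings that contain an a, then zero or more b's, then a c, anywhere inside them.

```perl
use strict;
use warnings;

# Sample strings are illustrative only.
my @samples = ('ac', 'abc', 'abbbc', 'xxabbcyy', 'adc', 'ab', 'bc');

# Keep the strings of the form S a b^n c T.
my @matched = grep { /ab*c/ } @samples;

print "$_\n" for @matched;
```

The strings adc, ab, and bc are rejected because none of them contains an a followed only by b's and then a c.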

RE(8)

• The symbol . designates any character in the alphabet (What is the alphabet we’re using?) except \n which stands for newline. (A Perl definition, check for the various shells and the various awks).

• Thus . defines the language A - {\n}
• And . matches any line that contains at least one character. Officially an empty line looks like \n and every line ends with \n
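The behavior of . on an empty line can be checked with a small Perl sketch. The variable names are made up for this illustration, and the /s modifier in the last test goes beyond the slide: in Perl it lets . match \n as well.

```perl
use strict;
use warnings;

# An empty line, as read from a file, is just the newline character.
my $empty_line = "\n";
my $short_line = "x\n";

# . does not match \n, so the empty line fails and the one-character line succeeds.
my $empty_matches = ($empty_line =~ /./)  ? 1 : 0;
my $short_matches = ($short_line =~ /./)  ? 1 : 0;

# With the /s modifier, . also matches \n.
my $with_s_flag   = ($empty_line =~ /./s) ? 1 : 0;

print "empty: $empty_matches, short: $short_matches, empty with /s: $with_s_flag\n";
```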

RE(9)

Exercise: Construct all possible lines of text that will not be matched by /a./

Exercise: Construct all possible lines of text that will be matched by /.a.b./

Exercise: Regardless of their content, which lines of text will not be matched by /.a.b./?

RE(10)

Character Classes

• Any set of characters enclosed in brackets
– The vowels [aeiou]
• Any range of consecutive ASCII coded characters enclosed in brackets
– The lower case letters [a-z]
– The digits [0-9]
– The hex digits [0-9A-F]

RE(12)

• Including special characters in the set
– To get ], use \] or []a-z] (Think about reading this string character by character to learn its meaning.)
– To get -, use \- or [a-z-]
• Complementing (not complimenting) a set
– Use ^ as leading character, [^0-9] or [^aeiou]
• More special characters
– To get ^, use \^ or place it away from the first position [a-z^_]
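These placement rules can be tried out in Perl. The sketch below pairs a single test character with each character class from the slide; the test characters and the @cases table are made up for this illustration.

```perl
use strict;
use warnings;

# Each pair is [test character, character class].
my @cases = (
    [ ']', '[]a-z]'  ],   # ] placed first in the set is taken literally
    [ '-', '[a-z-]'  ],   # - placed last cannot start a range
    [ '^', '[a-z^_]' ],   # ^ away from the first position is literal
    [ '7', '[^0-9]'  ],   # leading ^ complements the set, so a digit fails
    [ 'x', '[^0-9]'  ],   # a non-digit is in the complement
);

my @flags;
for my $case (@cases) {
    my ($char, $class) = @$case;
    my $hit = ($char =~ /$class/) ? 1 : 0;
    push @flags, $hit;
    printf "%-3s %-9s %s\n", $char, $class, $hit ? 'match' : 'no match';
}
```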

RE(13)

The Matching Game:
• [0123456789]
• [0-9]
• [0-9\-]
• [a-z0-9]
• [a-zA-Z0-9_]
• [^0-7]
• [^A-M.,;]
• [^\^]
• [0 - 9]
• [.]

RE(14)

Short character set names

• \d means [0-9]

• \D means [^0-9]

• \w means [a-zA-Z0-9_] (identifier characters)

• \W means [^a-zA-Z0-9_]

• \s means [ \r\t\n\f]

• \S means [^ \r\t\n\f]

RE(15)

More repetition symbols
• b* means zero or more repetitions of b, as does b{0,}
• b+ means one or more repetitions of b, as does b{1,}
• b? means zero or one repetitions of b, as does b{0,1}
• b{5,8} means five, six, seven or eight repetitions of b
• b{4} means exactly four repetitions of b
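The equivalences above can be checked by brute force in Perl. This sketch tests strings of 0 to 10 b's; the anchors ^ and $ are an addition not covered on this slide, used here so that the whole string must consist of the counted b's. The variable names are made up for this illustration.

```perl
use strict;
use warnings;

# Short form and the brace form it should agree with.
my @pairs = (
    [ 'b*', 'b{0,}'  ],
    [ 'b+', 'b{1,}'  ],
    [ 'b?', 'b{0,1}' ],
);

# Compare each pair on strings of 0 to 10 b's.
my $all_agree = 1;
for my $n (0 .. 10) {
    my $s = 'b' x $n;
    for my $pair (@pairs) {
        my ($short, $long) = @$pair;
        my $x = ($s =~ /^(?:$short)$/) ? 1 : 0;
        my $y = ($s =~ /^(?:$long)$/)  ? 1 : 0;
        $all_agree = 0 if $x != $y;
    }
}

# Which lengths satisfy the counted forms?
my @five_to_eight = grep { ('b' x $_) =~ /^b{5,8}$/ } 0 .. 10;
my @exactly_four  = grep { ('b' x $_) =~ /^b{4}$/ }   0 .. 10;

print "pairs agree: $all_agree\n";
print "b{5,8} lengths: @five_to_eight\n";
print "b{4} lengths: @exactly_four\n";
```

Without the anchors, /b{4}/ would also match a longer run of b's, because the match may sit anywhere inside the string.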

RE(16)

• Splitting a string

split(/:/,$line) divides $line into substrings at the colons and places the substrings in a list (array)

Note: Two adjacent colons :: produce an empty string.

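split(/:+/,$line) divides $line at each run of one or more colons, so adjacent colons no longer produce empty substrings. (A colon at the very start of $line still yields an empty first substring.)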
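A small Perl sketch of both forms of split, using a made-up line with an empty second field. The last two calls show a related detail of Perl's split that is easy to trip over: trailing empty fields are dropped unless a negative limit is passed as a third argument.

```perl
use strict;
use warnings;

# Illustrative line; the second field is empty.
my $line = 'root::0:0:/root';

my @fields   = split(/:/,  $line);   # adjacent colons give an empty string
my @nonempty = split(/:+/, $line);   # a run of colons counts as one separator

print scalar(@fields), " fields, ", scalar(@nonempty), " nonempty\n";

# Trailing empty fields are dropped by default and kept with limit -1.
my @default = split(/:/, 'a:b::');
my @keep    = split(/:/, 'a:b::', -1);

print scalar(@default), " by default, ", scalar(@keep), " with limit -1\n";
```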

Andy’s Problem

Lines from a text file look like

• 105028|Adam Mrugalski|AJM Residential|1067 Shoecraft rd|Webster|NY|14580||||||[email protected]||No||No|||Thu Dec 21 21:23:23 2006|

• 105029|robert ritchey|robert industries|po box 472|crockett |ca|94525|510-787-7290|||||[email protected]||No||No|||Fri Dec 22 02:54:54 2006|

• 105030|Jack Still|WISE TV|PO BOX 280|Coeburn|VA|24230|2763959339|||||[email protected]||No||No||9feet 1inch floor to floor. Connects to balcony. Need oak 4 feet round with landing at top. Send me a quote. J. Still WISE TV |Fri Dec 22 03:18:19 2006|

Andy (2)

The lines need to be cleaned and parsed into several reports:

• Phone contact information

• Email contact information

• Address labels

• Full data base, checking for unique entries
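One possible starting point for the cleaning and parsing, sketched in Perl. Everything named here is an assumption for illustration: the field positions in %index are inferred from the sample lines, the record in $line is a made-up placeholder in the same layout, and treating the first field as a unique key is a guess.

```perl
use strict;
use warnings;

# Hypothetical record in the same pipe-delimited layout; values are placeholders.
my $line = '200001|pat doe|Example Co|po box 1|anytown |ca|90000|555-0100|||||pat@example.com||No||No|||Fri Dec 22 02:54:54 2006|';

# Assumed field positions, inferred from the sample lines.
my %index = (id => 0, name => 1, company => 2, street => 3, city => 4,
             state => 5, zip => 6, phone => 7, email => 12);

sub parse_record {
    my ($raw) = @_;
    chomp $raw;
    # The | must be escaped; the limit -1 keeps trailing empty fields.
    my @f = split(/\|/, $raw, -1);
    # Clean each field by trimming leading and trailing whitespace.
    s/^\s+|\s+$//g for @f;
    my %rec;
    $rec{$_} = $f[ $index{$_} ] for keys %index;
    return \%rec;
}

my $rec = parse_record($line);
print "Phone: $rec->{name}, $rec->{phone}\n";
print "Email: $rec->{name}, $rec->{email}\n";
print "Label: $rec->{name}, $rec->{street}, $rec->{city}, $rec->{state} $rec->{zip}\n";

# Uniqueness check, assuming the first field identifies a record.
my %seen;
my @unique = grep { !$seen{ $_->{id} }++ } map { parse_record($_) } ($line, $line);
print scalar(@unique), " unique record(s)\n";
```

An unescaped | in the split pattern would mean alternation between two empty patterns, which splits the line into single characters, so the backslash matters.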

