Upload: | dan-chudnov |
hacker 102
code4lib 2010 preconference
Asheville, NC, USA 2010-02-21
iv. regular expressions
JavaScript
if all language looked like
“aabaaaabbbabaababa” it’d be
easy to parse
parsing “aabaaaabbbabaababa”
• there are two elements, “a” and “b”
• either may occur in any order
•/([ab]+)/
• [] denotes “elements” or “class”
• // demarcates regex
• + denotes “one or more of previous thing”
• () denotes “remember this matched group”
• /[ab]/ # an ‘a’ or a ‘b’
• /[ab]+/ # one or more ‘a’s or ‘b’s
• /([ab]+)/ # a group of one or more ‘a’s or ‘b’s
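As a rough illustration of the three patterns above, here is a minimal sketch in Python rather than JavaScript (Python's `re` module writes these constructs the same way, minus the surrounding slashes). The sample string is the one from the slide; the variable names and the "xx"/"yy" padding are made up for this example.

```python
import re

text = "aabaaaabbbabaababa"

# [ab] matches a single 'a' or 'b'
one = re.search(r"[ab]", text)

# [ab]+ matches one or more 'a's or 'b's in a row
run = re.search(r"[ab]+", text)

# ([ab]+) matches the same run and remembers it as group 1;
# the padding shows that only the a/b run is remembered
grouped = re.search(r"([ab]+)", "xx" + text + "yy")

print(one.group(0))      # just the first character
print(run.group(0))      # the whole string, since it is all a's and b's
print(grouped.group(1))  # the remembered group, without the padding
```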
to firebug!
• [a-z] is any lowercase character from a to z
• [0-9] is any digit
• + is one or more of previous thing
• ? is zero or one of previous thing
• | is or, e.g. (a|b) is ‘a’ or ‘b’
• * is zero to many of previous thing
• . matches any character
• [^a-z] is anything *but* [a-z]
• [a-zA-Z0-9] is any of a-z, A-Z, 0-9
• {5} matches exactly 5 of the preceding thing
• {2,} matches at least 2 of the preceding thing
• {2,6} matches from 2 to 6 of preceding thing
• [\d] is like [0-9] (any digit)
• [\S] is any non-whitespace
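The list above can be tried out in one pass. The following is a minimal sketch in Python, assuming its `re` module as a stand-in for the Firebug console; the sample strings and the `first_match` helper are invented for illustration, and each row pairs a pattern with the first match it should find.

```python
import re

# (pattern, sample text, expected first match); all samples are made up
examples = [
    (r"[a-z]+", "Hello", "ello"),             # lowercase run skips the 'H'
    (r"[0-9]+", "vol. 95", "95"),             # any digits
    (r"colou?r", "color", "color"),           # ? is zero or one 'u'
    (r"a|b", "xxb", "b"),                     # | is 'a' or 'b'
    (r"ab*", "abbbc", "abbb"),                # * is zero to many 'b's
    (r"c.t", "cut", "cut"),                   # . is any character
    (r"[^a-z]+", "abc123def", "123"),         # anything but a-z
    (r"[a-zA-Z0-9]+", "--Vol70--", "Vol70"),  # letters and digits
    (r"\d{5}", "zip 28801", "28801"),         # exactly 5 digits
    (r"\d{2,}", "a1b22", "22"),               # at least 2 digits
    (r"\d{2,6}", "12345678", "123456"),       # from 2 to 6 digits
    (r"\S+", "  two words", "two"),           # non-whitespace run
]

def first_match(pattern, text):
    """Return the first substring of text matching pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

for pattern, text, expected in examples:
    print(pattern, repr(text), "->", first_match(pattern, text))
```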
• visit any web page
• open firebug console
• title = window.document.title
• try regexes to match parts of the title
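The same exercise can be sketched outside the browser. This is a minimal Python version, assuming a made-up title string in place of `window.document.title`; the patterns use only the syntax listed earlier.

```python
import re

# hypothetical page title, standing in for window.document.title
title = "code4lib 2010 | Asheville, NC"

# four digits in a row (the lone '4' in "code4lib" is too short)
year = re.search(r"\d{4}", title)

# the first run of non-whitespace characters
first_word = re.search(r"\S+", title)

# a word, a comma and space, then two capital letters: city and state
place = re.search(r"([A-Za-z]+), ([A-Z]{2})", title)

print(year.group(0))
print(first_word.group(0))
print(place.groups())
```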
try this
most every language has regex support
try unix “grep”
v. glue it together
Python
problem: Carol’s data
TITLE: ABA journal. BD. HOLDINGS: Vol. 70 (1984) - Vol. 94 (2008) CURRENT VOL.: Vol. 95 (2009) - OTHER LIBRARIES: Miami:v. 68 (1982) - USDC: v. 88 (2002) - Birm.:v. 89 (2003) - (Formerly: American Bar Association Journal) (Bound and on Hein)
TITLE: Administrative law review. BD. HOLDINGS: Vol. 22 (1969/1970) - Vol. 60 (2008) CURRENT VOL.: Vol. 61 (2009) - (Bound and on Hein)
starter code for you
#!/usr/bin/env python
import re

# a tag is a run of capitals, spaces, and dots followed by a colon
re_tag = re.compile(r'([A-Z \.]+):')
# a title is everything after "TITLE: " on the line
re_title = re.compile(r'TITLE: (.*)')

for line in open('journals-carol-bean.txt'):
    line = line.strip()
    # match() only looks at the start of the line
    m1 = re_tag.match(line)
    m2 = re_title.match(line)
    if line == "":
        continue
    print("\n->", line, "<-")
    if m1 or m2:
        print("MATCH")
    if m1:
        print('tag:', m1.groups())
    if m2:
        print('title:', m2.groups())
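One possible next step from the starter code is to pull the title and the volume/year pairs out of a record. The sketch below is only an illustration under stated assumptions: it treats a record as a single line (the real file's layout may differ), and the `re_volume` pattern and the narrower `re_title` are hypothetical additions, not part of the starter code.

```python
import re

# one record adapted from the slide; assumed here to sit on a single line
record = ("TITLE: Administrative law review. "
          "BD. HOLDINGS: Vol. 22 (1969/1970) - Vol. 60 (2008) "
          "CURRENT VOL.: Vol. 61 (2009) - (Bound and on Hein)")

# title: everything after "TITLE: " up to the first period (an assumption;
# a title that itself contains a period would be cut short)
re_title = re.compile(r"TITLE: ([^.]+)\.")

# a volume number followed by a year, or year range, in parentheses
re_volume = re.compile(r"Vol\. (\d+) \(([\d/]+)\)")

title = re_title.match(record).group(1)
volumes = re_volume.findall(record)  # list of (number, year) tuples

print("title:", title)
for number, year in volumes:
    print("vol.", number, "year", year)
```

Entries written differently, such as "Miami:v. 68 (1982)" in the first record, would not match `re_volume` as written and would need their own pattern.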