Colloquium - grep, v1.0
A. Magee
April 6, 2010
Outline
1 Introduction
    What does grep offer?
    When should I use grep?
2 Understanding Regular Expressions
    Class Basics
    Quantifiers & Grouping
    Online Tools
    Examples
3 Using Regular Expressions With grep
What does grep offer?
grep matches regular expressions.
Your first question should be “What is a regular expression?”
A regular expression (RE) is a language pattern: a compact description of a set of strings.
grep and REs allow us to find complex things in text.
Complex is relative and can vary from a single character to an IP address.
Single character complex: [ajk+0-]
IP complex: ((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
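As a rough illustration of the IP pattern above in use, the sketch below feeds a few candidate lines to grep, with the pattern's outer group written out in full. The `octet` and `ip_re` variables and the `only_ips` function are names invented for this example; `-E` (extended syntax) and `-x` (the whole line must match) are standard grep options.

```shell
# One octet, 0 through 255 (same alternatives as the slide).
octet='(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
# Three "octet plus dot" groups, then a final octet.
ip_re="(${octet}\.){3}${octet}"

# only_ips: hypothetical helper; prints the stdin lines that are exactly one IPv4 address.
only_ips() {
    grep -Ex "$ip_re"
}

printf '192.168.1.1\n256.1.1.1\n10.0.0.255\nnot an ip\n' | only_ips
```

Of the four sample lines, only 192.168.1.1 and 10.0.0.255 are printed, since 256 is not a valid octet.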
When should I use grep?
Always!
Unless you find some better tool.
P.S. - grep stands for g/re/p, an ed command that means global/regex/print
Class Basics
A character class is a symbol or collection of symbols that describes a group of characters.
. (period): This matches any single character.
[...]: This matches any one character in the set.
    [aeiou] matches one of the vowels.
    [a-z] matches one lowercase letter.
    [0-5] matches one numeral 0 through 5.
You will not remember all of these until you use them often, but there are many special classes that can save you some typing.
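To see the sets above matching live, a small sketch along these lines may help. The `matches` function is a helper made up for this example; `-o` (print only the matched text) and `-x` (whole-line match) are standard grep options.

```shell
# matches PATTERN: hypothetical helper; prints each match found on stdin, one per line.
matches() {
    LC_ALL=C grep -o -- "$1"
}

printf 'grep\n'      | matches '[aeiou]'        # the vowel in "grep"
printf 'grep 2010\n' | matches '[0-5]'          # each numeral 0 through 5
printf 'grep\ngrip\ngroup\n' | grep -x 'gr.p'   # "." stands for any one character
```

The last command prints grep and grip but not group, because the period stands for exactly one character.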
Common Classes
Special Class   Meaning                  Simple RE
\d              Digit characters         [0-9]
\D              Non-digit characters     [^0-9]
\w              Word characters          [a-zA-Z_0-9]
\W              Non-word characters      [^a-zA-Z_0-9]
\s              Whitespace characters    [ \f\n\r\t]
\S              Non-space characters     [^ \f\n\r\t]
\b              Word boundary
The word boundary class is very special as it is zero length and matches the transition between \w and \W (or the start or end of the text) and vice versa.
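A short sketch of these shorthand classes with grep follows. It assumes GNU grep, which accepts \w, \s and \b with `-E`; \d is generally available only in Perl-style mode, so the digits are spelled out as [0-9] here. The `whole_word` function is a name invented for this example.

```shell
# whole_word WORD: hypothetical helper; prints WORD each time it appears as a whole word.
whole_word() {
    grep -oE "\\b${1}\\b"
}

printf 'cat concat cats\n' | whole_word cat    # only the standalone "cat"
printf 'id_42 = 7\n' | grep -oE '\w+'          # runs of word characters
printf 'id_42 = 7\n' | grep -oE '[0-9]+'       # digits, spelled out instead of \d
```

The word boundary keeps "concat" and "cats" out of the first result, and the underscore in id_42 counts as a word character in the second.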
More Common Classes
Special Class   Meaning                          Simple RE
[:alpha:]       All alphabetic characters        [a-zA-Z]
[:alnum:]       All alphabetic and numeric       [a-zA-Z0-9]
[:blank:]       Tab and space                    [ \t]
[:cntrl:]       Control characters               [\x00-\x1F\x7F]
[:digit:]       A numeric digit                  [0-9]
[:graph:]       Any visible character            [\x21-\x7E]
[:lower:]       Lowercase characters             [a-z]
[:print:]       Printables (i.e. no controls)    [\x20-\x7E]
[:punct:]       Punctuation & symbols            [!"#$%&'()*+,\-./:;<=>?@[\\\]^_`{|}~]
[:space:]       Space, tab, newline, etc         [ \t\r\n\v\f]
[:upper:]       Uppercase characters             [A-Z]
[:word:]        Word characters                  [a-zA-Z0-9_]
[:xdigit:]      Hex digits                       [A-Fa-f0-9]
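The named classes can be tried out with grep roughly as follows. Both function names are invented for this sketch, and `LC_ALL=C` is set only so that the classes mean the plain ASCII ranges listed in the table.

```shell
# capitalized_words: hypothetical helper; prints words made of one uppercase
# letter followed by one or more lowercase letters.
capitalized_words() {
    LC_ALL=C grep -oE '[[:upper:]][[:lower:]]+'
}

# hex_literals: hypothetical helper; prints C-style hexadecimal literals.
hex_literals() {
    LC_ALL=C grep -oE '0x[[:xdigit:]]+'
}

printf 'Alice met Bob in Paris\n' | capitalized_words
printf 'mask = 0xFF | 0x1a;\n'    | hex_literals
```

Note that each class name sits inside an outer pair of brackets, so it can be combined with other characters in the same set.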
Quantifiers & Grouping
Quantifiers are how a RE counts things.
?        Exactly zero or one occurrence
*        Zero or more occurrences
+        One or more occurrences
*?       Zero or more occurrences, non-greedy
+?       One or more occurrences, non-greedy
{x}      Exactly x occurrences
{x,}     At least x occurrences
{x,y}    At least x but no more than y occurrences
Grouping is used to collect patterns together and to create back-references. A group is simply a set of parentheses ().
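The quantifiers and a back-reference can be exercised with a sketch like the one below. It assumes GNU grep, which accepts \1 in `-E` mode; the non-greedy forms *? and +? belong to Perl-style engines and are left out. `doubled_letters` is a name made up for this example.

```shell
# Optional character: "u?" allows zero or one "u".
printf 'color colour colouur\n' | grep -oE 'colou?r'

# Exact counts: three digits, a dash, four digits.
printf 'call 555-1234 or 5555-12\n' | grep -oE '[0-9]{3}-[0-9]{4}'

# doubled_letters: hypothetical helper; the group captures one letter and the
# back-reference \1 requires that same letter again.
doubled_letters() {
    LC_ALL=C grep -oE '([a-z])\1'
}
printf 'hello bookkeeper\n' | doubled_letters
```

The first command prints color and colour, the second prints 555-1234, and the third prints ll, oo, kk and ee on separate lines.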
Helpful Tools
The best way to understand the rest of this presentation is to see what is being matched live. Here are some online tools that work for our needs.
RegExr - www.gskinner.com/RegExr
    beware Flash, but it works well
regexpal - regexpal.com
    very simple
reanimator - osteele.com/tools/reanimator
    beware Flash, recommend CS 4/570 first
rubular - rubular.com
    nice on-page reference
Your First RE
Let’s skip trivial REs and get on to something useful. These may be more complex than you’re used to, but the quicker you are able to read long, complex REs the better. This is a nice, but not perfect, email address matcher.
[[:alnum:]][[:word:]\.%+-]*@(?:[[:alnum:]-]+\.)+[[:alpha:]]{2,4}
[[:alnum:]][[:word:]\.%+-]*
    Match a word that doesn’t start with [.%+-].
@(?:[[:alnum:]-]+\.)+
    Match the @ symbol and one or more subdomains, each followed by a period.
[[:alpha:]]{2,4}
    Match the top level domain of 2, 3 or 4 characters.
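To try the matcher with plain `grep -E`, the sketch below adapts it slightly. This is an adaptation rather than the slide's exact pattern: [:word:] and (?: ... ) are Perl-style notations, so [[:alnum:]_] and an ordinary group stand in for them, and the period is left unescaped inside the set. The names `email_re` and `find_emails` are invented for this example.

```shell
# email_re: the slide's pattern rewritten for grep -E (an adaptation, not the original).
email_re='[[:alnum:]][[:alnum:]_.%+-]*@([[:alnum:]-]+\.)+[[:alpha:]]{2,4}'

# find_emails: hypothetical helper; prints each address-like match found on stdin.
find_emails() {
    LC_ALL=C grep -oE "$email_re"
}

printf 'Contact: jane.doe+lists@mail.example.com, or bob@localhost\n' | find_emails
```

This prints jane.doe+lists@mail.example.com only; bob@localhost has no period after the @, so the subdomain group never matches.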
Your First RE - Part 2
Let’s examine the first part.
[[:alnum:]][[:word:]\.%+-]*
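[[:alnum:]] - Must start with an alphanumeric character.
NB: All [: ... :] classes must live in a set like [[: ... :]].
[[:word:]\.%+-] - Other characters may be a ‘word’ character, a literal period, percent symbol, plus symbol or a dash.
NB: Outside a set the period must be escaped because it has special meaning. Inside a set it is already literal, so the escape here is optional (and POSIX tools such as grep -E treat that backslash as one more literal character in the set).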
* - repeat the previous set zero or more times.
Your First RE - Part 3
Now the second part, the subdomains, sub-subdomains, etc.
@(?:[[:alnum:]-]+\.)+
@ - Well that literally matches the ‘at’ character.
The opening parenthesis denotes the beginning of a group. The ?: is a confusing notation that suppresses the creation of a back reference. It is here so you’ll know of it, but it is rarely needed.
Again we see a special class for alphanumerics, but we’ve also included a dash. The plus symbol tells us to look for one or more of these characters, followed by a period.
And lastly we close the group and the plus symbol now tells us to look for one or more of these groups.
Your First RE - Part 4
Finally the third part, the top level domain.
[[:alpha:]]{2,4}
Well, now this part is easy. Just match 2, 3 or 4 alphabetical characters.
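One place where the matcher is "not perfect" shows up when it is used to validate a whole string. The sketch below is an assumption-laden adaptation for `grep -E` ([[:alnum:]_] replaces [:word:], and an ordinary group replaces (?: ... )); `is_email` is a hypothetical helper, `-x` demands a whole-line match, and `-q` suppresses output so that only the exit status is used.

```shell
# is_email: hypothetical helper; succeeds only if the whole argument matches.
is_email() {
    printf '%s\n' "$1" |
        LC_ALL=C grep -qxE '[[:alnum:]][[:alnum:]_.%+-]*@([[:alnum:]-]+\.)+[[:alpha:]]{2,4}'
}

is_email 'user@example.org'    && echo 'accepted: user@example.org'
is_email 'user@example.museum' || echo 'rejected: user@example.museum'
```

The second address is rejected because its top level domain has six letters, which {2,4} does not allow.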
Your Second RE
Now we’ll look at a RE that can help us build a header file for a C program file, given that some neglectful programmer has failed to design his/her C program properly. This will be a quicker example.
^[\w\s]*\([\w\s\*&,]*\)\s*{
^[\w\s]*\(
    At the beginning of a line match some keywords and types and the function name and then a literal opening parenthesis.
[\w\s\*&,]*
    Match some more words, keywords, variable modifiers and commas.
\)\s*{
    Finally match the closing parenthesis, some whitespace and the left curly brace, denoting the start of the function body.
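A minimal sketch of the header-building idea, adapted for `grep -E`, might look like this. It is not the slide's exact pattern: [[:alnum:]_] stands in for \w and [[:space:]] for \s, since the backslash shorthands are not reliable inside a POSIX set. The names `func_re` and `prototypes` are invented here, and the sed step that swaps the opening brace for a semicolon is an addition for illustration.

```shell
# func_re: the slide's pattern adapted to grep -E.
func_re='^[[:alnum:]_[:space:]]*\([[:alnum:]_[:space:]*&,]*\)[[:space:]]*\{'

# prototypes: hypothetical helper; reads C source on stdin and prints one
# prototype per matching line by replacing the trailing "{" with ";".
prototypes() {
    LC_ALL=C grep -E "$func_re" | sed -E 's/[[:space:]]*\{[[:space:]]*$/;/'
}

prototypes <<'EOF'
int add(int a, int b) {
    return a + b;
}

static void log_msg(const char *msg, int *count) {
    (*count)++;
}
EOF
```

For this input the sketch prints the two prototypes, one per line. A line such as `if (x) {` has the same shape as a function header, so it would match as well; the pattern only works on source where such statements do not fit it.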
Your Second RE - Fine Details
^[\w\s]*\([\w\s\*&,]*\)\s*{
In general, most RE parsers will not match across multiple lines, even though the \s class matches the newline character. This is very bothersome but is easily overcome by using pcregrep (with its -M multiline option). PCRE stands for Perl Compatible Regular Expressions. This is all I will ever say about Perl.
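Notice that the literal * must be escaped like so, \*. (Strictly, this is required only outside a set; inside [...] a Perl-style engine accepts it either way.)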
As must the parentheses due to their special RE meaning.
Escaping so many characters is very annoying, but unfortunately it is necessary.