Post on 15-Jul-2015
Regex
makes me want to ( weep | give up | (╯°□°)╯︵ ┻━┻ )\.? A presentation by Brett Florio of FoxyCart.com.
Follow along at bit.ly/regex-makes-me-wanna
Who it’s for?
▷ Beginners looking to understand the basics
▷ Intermediate regex devs wanting a review and some new approaches
▷ Advanced programmers who just don’t really grok regular expressions.
▷ Anybody who hates regex because they don’t understand it.
Slides… are available at bit.ly/regex-makes-me-wanna
How we’ll learn: Rather than abstract concepts like “cat” and “dog”, we’ll focus on real use-cases you might run across in your daily programming.
What we’ll learn:
▷ Our goal
▷ A brief history of regex
▷ Matching
▷ Validating
▷ Replacing
▷ Working with HTML
▷ Common gotchas
About this presentation!
▷ Co-founded FoxyCart.com (now Foxy.io) in 2007
▷ Dove into regex when @lukestokes told me something was impossible. Proved him wrong.
▷ Spent the past five years traveling full-time or half-time in an RV with my wife and 3 kids.
▷ Currently in Austin, TX, and happy to grab food or drinks if you’re in town!
@brettflorio
http://brettflorio.com/ has more photos like this -->
FoxyCart.com / Foxy.io is where I solve problems.
About @brettflorio
import re

# Credit card number matcher
CREDIT_CARD = re.compile(r'([^\d])([3456][ -]*?(?:\d[ -]*?){12,15})([^\d])')
# Raw string, so \g<1> reaches re.sub as a group reference
CC_REPLACEMENT = r'\g<1>XXX_CC_LE_REPLACEMENT_XXX\g<3>'
# Password matching
PASSWORD = re.compile(r'customer_password=(.*?)&')
PASSWORD_REPLACEMENT = 'customer_password=XXX_PW_LE_REPLACEMENT_XXX&'
A recent real-life regex…
Extra sanitization of logs, in a Chef recipe:
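As a rough sketch of how those two patterns might be applied to a log line: the `sanitize` helper and the sample line below are made up for illustration and are not part of the original recipe.

```python
import re

# Patterns copied from the slide above.
CREDIT_CARD = re.compile(r'([^\d])([3456][ -]*?(?:\d[ -]*?){12,15})([^\d])')
CC_REPLACEMENT = r'\g<1>XXX_CC_LE_REPLACEMENT_XXX\g<3>'
PASSWORD = re.compile(r'customer_password=(.*?)&')
PASSWORD_REPLACEMENT = 'customer_password=XXX_PW_LE_REPLACEMENT_XXX&'


def sanitize(line):
    """Hypothetical helper: mask card numbers first, then passwords."""
    line = CREDIT_CARD.sub(CC_REPLACEMENT, line)
    return PASSWORD.sub(PASSWORD_REPLACEMENT, line)


sample = "cc=4111 1111 1111 1111&customer_password=hunter2&x=1"
cleaned = sanitize(sample)

# The card pattern needs a non-digit on BOTH sides, so a bare number
# with nothing around it is left alone.
bare = sanitize("4111111111111111")
```

Note the surrounding `[^\d]` groups: they are captured and put back by `\g<1>` and `\g<3>`, which is why the `=` and `&` around the card number survive the replacement.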
1. Find emails
2. Validate custom input
3. Link @mentions and #tags in text
4. Strip <script> tags
5. Truly validate a subdomain: ^(?!-)[a-z0-9-]{1,63}(?<!-)$
Our goals! Understand how to:
http://www.totalprosports.com/2012/06/01/soccer-celebrations-special-effects-win-video/
Big thanks to NomadPHP.com!
Check out Daycamp4Developers (PHP Application Security day in June)
“Some people, when confronted with a problem, think ‘I know, I'll use regular expressions.’ Now they have two problems.”
http://regex.info/blog/2006-09-15/247
▷ 1940s-60s: Lots of smart people
▷ 1970s: g/re/p
▷ 1980s: Perl and Henry Spencer
▷ 1997: PCRE (Perl Compatible Regular Expressions)
Pronunciation: hard or soft ‘g’
Regular expressions’ history
Matching
int preg_match ( string $pattern , string $subject [, array &$matches [, int $flags = 0 [, int $offset = 0 ]]] )
Returns 1 if a match is found, 0 if not, or false on error.
Common regex usage: PHP
Replacing
mixed preg_replace ( mixed $pattern , mixed $replacement , mixed $subject [, int $limit = -1 [, int &$count ]] )
Returns the replaced string or array (based on the $subject).
Matching (all)
int preg_match_all ( string $pattern , string $subject [, array &$matches [, int $flags = PREG_PATTERN_ORDER [, int $offset = 0 ]]] )
Returns # (int) of matches found.
Matching
string.match(RegExp);
Returns an array of matches, or null if no matches.
Replacing
string.replace(RegExp, replacement);
Returns the string with the replacements performed.
Caveats about JavaScript’s regex
▷ No “single-line” or DOTALL mode. (The dot never matches a new line.)
▷ No lookbehind support :(
▷ Same methods for regex and non-regex matching and replacing.
(The first two caveats predate ES2018, which added the s flag and lookbehinds.)
Common regex usage: JS
Problem: Finding email addresses in a codebase.
Goal: /[\w.+-]+@[a-z0-9-]+(\.[a-z0-9-]+)*/i
2. The Basics of Regex Patterns
Hypothetical situation:
Your project has bloated over the years, and both internal and external emails are going everywhere, maybe including terminated employees, personal accounts, etc.
Your mission:
You need to search the whole codebase to find all the emails so you can tidy things up!
Find all the emails!
Or… an alternate story:
You need to strip emails from user-submitted content, to protect privacy or restrict communication (or like Airbnb does).
~12 Special Characters, aka “Metacharacters”
▷ . \ [ ] ? * + { } ( ) ^ $ |
▷ - (sometimes)
Nearly everything else is a literal!
Imagine your input string as bolts, and your pattern as a set of sockets (in order).
An analogy: Sockets!
"Socket wrench and sockets" by Kae - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons
The exact match
If you know exactly what you’re looking for…
You still might get more than you wanted!
https://regex101.com/r/qG8zB1/1
The almighty . and the escape \
The dot (.) matches ANYTHING and EVERYTHING.
Except… new lines, by default. PHP and others can enable DOTALL or single-line mode to have the dot match a new line. JavaScript couldn’t until ES2018 added the s flag.
The backslash \ escapes special characters (metacharacters). So \. makes a dot match just a dot.
https://regex101.com/r/eR9vT7/1
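A small Python sketch of the dot and the escape, using Python's `re` module as a stand-in for the PHP and JavaScript calls on the slides. The sample string is invented for the demo.

```python
import re

text = "a.c abc a\nc"

# Unescaped dot: any single character except a newline.
any_char = re.findall(r"a.c", text)

# Escaped dot: only a literal ".".
literal_dot = re.findall(r"a\.c", text)

# DOTALL (single-line mode): the dot now matches newlines too.
with_dotall = re.findall(r"a.c", text, re.DOTALL)
```

Here `any_char` picks up both `a.c` and `abc`, `literal_dot` only `a.c`, and `with_dotall` additionally catches the `a`, newline, `c` run.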
The almighty . (dot)
The dot (.) matches ANYTHING and EVERYTHING (except newlines, by default).
Gator Grip Universal Socket, available online.
Toysmith Classic Pin Art, ~$20. Buy one!
Square brackets match what’s inside them.
[abc] ‘a’ ‘b’ or ‘c’
[a-z] Lowercase letters
[0-9] Any single digit
[a-z.] Letters and the dot
A common case is… [A-Za-z0-9_] which has a shortcut: \w “Word” characters
So… let’s try this: [\w.+-]
Character Classes!
https://regex101.com/r/iW3bW4/1
Dashes need escaping inside square brackets (unless they’re at the start or the end), since they have special meaning: they define ranges.
So… [\w.+-] is fine. The dash is at the end. But… [\w.\-+] needs escaping.
When in doubt, escaping doesn’t typically hurt.[\w.+\-] is also just fine.
Escaping!
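To see the dash rule in action, here is a hedged Python sketch. The sample address is made up, and the error behavior shown is that of Python's `re` module; other engines may report an unescaped mid-class dash differently.

```python
import re

text = "first.last+tag-1@example.com"

# Dash at the end of the class: treated as a literal dash.
dash_at_end = re.match(r"[\w.+-]+", text).group(0)

# Escaped dash in the middle or at the end: also a literal dash.
dash_escaped_middle = re.match(r"[\w.\-+]+", text).group(0)
dash_escaped_end = re.match(r"[\w.+\-]+", text).group(0)

# Unescaped dash in the middle: read as the range "." through "+",
# which is backwards, so Python refuses to compile it.
try:
    re.compile(r"[\w.-+]")
    middle_dash_error = None
except re.error as exc:
    middle_dash_error = str(exc)
```

All three working classes match the same `first.last+tag-1` prefix and stop at the `@`.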
? 0 or 1 match (optional)
* 0 or more matches
+ 1 or more matches
But what about at least 3, or 1 through 6 matches?
Repetition!
https://regex101.com/r/sF4tM6/1
https://regex101.com/r/aC3iH8/1
https://regex101.com/r/iE3rB4/1
https://regex101.com/r/uF5lB7/1
https://regex101.com/r/tI4nO0/1
https://regex101.com/r/aX5qG6/1
Curly brackets get you minimum and maximum ranges. Minimum is required:
{1,} At least 1
{1,3} 1 through 3
{1,64} 1 through 64
64 characters is the maximum length of the username portion of an email, so…
More Repetition!
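The quantifiers above can be tried out with a short Python sketch. The `LOCAL_PART` name and the test strings are illustrative assumptions; `fullmatch` is used so that the whole string has to fit the pattern.

```python
import re

# ? makes the "s" optional: matches "http" and "https" only.
scheme_checks = [bool(re.fullmatch(r"https?", s))
                 for s in ("http", "https", "httpss")]

# {3,} means "at least three".
two_digits = bool(re.fullmatch(r"\d{3,}", "12"))
five_digits = bool(re.fullmatch(r"\d{3,}", "12345"))

# {1,64} caps the username portion of an email at 64 characters.
LOCAL_PART = re.compile(r"[\w.+-]{1,64}")
fits = LOCAL_PART.fullmatch("a" * 64) is not None
too_long = LOCAL_PART.fullmatch("a" * 65) is not None
empty = LOCAL_PART.fullmatch("") is not None
```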
It looks similar in both PHP…
preg_match('/pattern/i', $subject);
And JavaScript:
string.match(/pattern/i);
Other common modifiers are:
s Makes the dot match newlines as well. (PHP)
g Match all, not just the first. (JavaScript)
m Makes ^ and $ line-specific.
References for PHP and JavaScript
By default, regex is case-sensitive. Adding an “i” after the pattern’s delimiter fixes that.
DON’T FORGET CAPS LOCK
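For comparison, a Python sketch of the same modifiers. Python has no pattern delimiters, so the `i` and `m` modifiers become the `re.IGNORECASE` and `re.MULTILINE` flags; the log text is invented.

```python
import re

log = "Error: disk full\nerror: network down"

# Case-sensitive by default: only the lowercase "error" matches.
case_sensitive = re.findall(r"error", log)

# The "i" modifier.
ignore_case = re.findall(r"error", log, re.IGNORECASE)

# Without "m", ^ only matches at the very start of the string.
start_only = re.findall(r"^error", log, re.IGNORECASE)

# With "m", ^ matches at the start of every line.
per_line = re.findall(r"^error", log, re.IGNORECASE | re.MULTILINE)
```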
Putting it all together
/[\w.+-]+@[a-z0-9-]+(\.[a-z0-9-]+)*/i
(Try it on a project in your text editor.)
A great tool for testing how PHP handles preg_match, preg_match_all, and preg_replace is http://www.phpliveregex.com/
See this example at http://www.phpliveregex.com/p/9yD
What that looks like in PHP
preg_match_all("/[\w.+-]+@[a-z0-9-]+(\.[a-z0-9-]+)*/i", $input_lines, $output_array);
Array (
    [0] => Array (
        [0] => ceo@example.com
        [1] => the.woz@example.com
        [2] => r_wayne@example.commerce.co.uk
        [3] => hello@apple.com
        [4] => cto@example.com
        [5] => coo@example.com
        [6] => press@foo.example.com
        [7] => admin@localhost
        [8] => benedicto@example.com
        [9] => cto@sub.example-com.ca
        [10] => CTO@EXAMPLE.COM
    )
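The same search can be sketched in Python. The sample text is made up, reusing a few addresses from the output above. One difference worth knowing: because the pattern contains a capturing group, Python's `re.findall` returns only that group, so the sketch uses `finditer` to get the full matches.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[a-z0-9-]+(\.[a-z0-9-]+)*", re.IGNORECASE)

text = ("Contact ceo@example.com, the.woz@example.com "
        "or admin@localhost. CTO@EXAMPLE.COM too.")

# group(0) is the complete match, like index [0] in the PHP output.
found = [m.group(0) for m in EMAIL.finditer(text)]

# findall returns the last capture of group 1 instead of the address.
groups_only = EMAIL.findall(text)
```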
. [] ?
* + {}
Square Brackets
Matches characters inside the brackets. Supports ranges.
[abc] ‘a’ ‘b’ or ‘c’
[a-z] Lowercase letters
[0-9] Any single digit
Quick review before funny gifs!
The Dot and the \w
Matches everything but new lines. If you want to match a dot and only a dot, escape it like \.
\w matches letters, numbers, and the underscore.
Optional
The ? matches 0 or 1
The Star
The * matches 0 or more.
The Plus
Matches 1 or more
Curly Brackets
Min and max ranges.
{1,} At least 1
{1,3} 1 through 3
{1,64} 1 through 64
Problem: Make sure input is what we expect.
Goal 1: /[^0-9a-z\-_.]/
Goal 2: /^[0-9]{1,2}[dwmy]$/
3. Using Regex for Validation
▷ Know your target.
▷ Some targets are impossible:
○ "much.more unusual"@example.com
○ "very.unusual.@.unusual.com"@example.com
○ "very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com
○ admin@mailserver1 (local domain name with no TLD)
○ !#$%&'*+-/=?^_`{}|~@example.org
○ "()<>[]:,;@\\\"!#$%&'*+-/=?^_`{}| ~.a"@example.org
○ " "@example.org (space between the quotes)
Hooray! But…
Validating things is where you get to determine exactly what you want.
Finding things… is usually a matter of “good enough”.
When not to use regex
http://php.net/manual/en/function.filter-var.php
http://php.net/manual/en/filter.filters.validate.php
Hammer icon by John Caserta, from The Noun Project
Just because you can use regex for validation doesn’t mean you should. PHP’s got lots handled.
filter_var( 'bob@example.com', FILTER_VALIDATE_EMAIL);
^ Start of string
$ End of string
https://regex101.com/r/sN8pA6/1
if (!preg_match("%^[0-9]{1,2}[dwmy]$%", $_POST["subscription_frequency"])) {
    $IsError = true;
}
Anchors
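A Python sketch of the same anchored check. The `is_valid_frequency` helper and the test values are hypothetical. It also shows one subtlety that holds in Python (and, by default, in PCRE): `$` will also match just before a trailing newline, so a stricter check is to require the whole string to match.

```python
import re

FREQUENCY = re.compile(r"^[0-9]{1,2}[dwmy]$")


def is_valid_frequency(value):
    """Hypothetical helper mirroring the PHP check above."""
    return FREQUENCY.search(value) is not None


results = {v: is_valid_frequency(v)
           for v in ("2w", "12m", "123d", "2 weeks", "x2w")}

# "$" tolerates one trailing newline...
newline_slips_through = is_valid_frequency("2w\n")

# ...whereas fullmatch requires the entire string to fit.
strict = re.fullmatch(r"[0-9]{1,2}[dwmy]", "2w\n") is not None
```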
▷ Imagine writing routing rules. These will do very different things.
Small anchors. Big impact.
index(\.php)?
^index(\.php)?
https://regex101.com/r/dS8zC9/1
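To make the difference concrete, a small Python sketch with a few invented request paths run through both rules:

```python
import re

paths = ["index.php", "index", "admin/index.php", "reindex"]

# Unanchored: matches "index" anywhere in the path.
unanchored = [p for p in paths if re.search(r"index(\.php)?", p)]

# Anchored: the path has to START with "index".
anchored = [p for p in paths if re.search(r"^index(\.php)?", p)]
```

The unanchored rule accepts all four paths, including `reindex`; the anchored one accepts only the first two.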
Negated Character Classes
[^abc] Anything except a, b, or c, including new lines.
// Ensure input only contains
// alphanumeric, dash, dot, underscore
if (preg_match("/[^0-9a-z\-_.]/i", $product_code)) {
    $IsError = true;
}
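The same "look for anything that is not allowed" approach, sketched in Python. The `has_invalid_chars` helper and the sample product codes are assumptions for the demo.

```python
import re

# Anything that is NOT alphanumeric, dash, underscore, or dot.
INVALID_CHAR = re.compile(r"[^0-9a-z\-_.]", re.IGNORECASE)


def has_invalid_chars(code):
    """Hypothetical helper: True if any disallowed character appears."""
    return INVALID_CHAR.search(code) is not None


clean = has_invalid_chars("ABC-123_v2.0")
with_space = has_invalid_chars("abc 123")
with_newline = has_invalid_chars("abc\n")   # negated classes match newlines
empty = has_invalid_chars("")               # nothing to reject in ""
```

Since an empty string contains no invalid characters, it passes this check; a separate length check may be needed if empty input should be rejected.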
First we need to find them…
▷ @foo but not @foo.bar or bar@foo.com
▷ \w works well to get us [A-Za-z0-9_]
▷ \B is an anchor, like ^ or $, but that matches “not a word boundary”. It matches a position, not a character.
▷ Wrap a pattern in parentheses to make a “capturing group”.
But wait… We need pieces: ( )
preg_match_all( "/\B@([\w]{2,})/i", $input, $output_array);
Array (
    [0] => Array ( [0] => @calevans [1] => @FoxyCart )
    [1] => Array ( [0] => calevans [1] => FoxyCart )
)
The result…
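A Python sketch of the @mention pattern, with an invented sentence. `\B` only matches where the `@` is not sitting on a word boundary, so the `@` in an email address (which follows a word character) is skipped.

```python
import re

MENTION = re.compile(r"\B@(\w{2,})")

text = "Hey @calevans, email brett.florio@example.com or ping @FoxyCart."

# Complete matches, like index [0] in the PHP output.
full_matches = [m.group(0) for m in MENTION.finditer(text)]

# First capturing group only, like index [1].
usernames = [m.group(1) for m in MENTION.finditer(text)]
```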
Named capturing groups:
preg_match_all( "/\B@(?P<username>[\w]{2,})/i", $input);
0 => array(0 => @calevans, 1 => @FoxyCart)
username => array(0 => calevans, 1 => FoxyCart)
1 => array(0 => calevans, 1 => FoxyCart)
For complex patterns or ease of reference, you can name capturing groups using (?P<name>) syntax.
The result…
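Python's `re` module happens to use the same `(?P<name>…)` syntax, so a minimal sketch looks like this (the sample sentence is made up):

```python
import re

match = re.search(r"\B@(?P<username>\w{2,})", "Thanks @calevans!")

by_name = match.group("username")   # look the group up by name...
by_number = match.group(1)          # ...or still by position
all_named = match.groupdict()       # every named group as a dict
```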
It’s replacin’ time!
preg_replace( "/\B@([\w]{2,})/i", "<a href=\"foo?user=$1\">$0</a>", $input);
Hey <a href="foo?user=calevans">@calevans</a>, could you pick up some #ice_cream and #gingerbread for #CoderFaire? <a href="foo?user=FoxyCart">@FoxyCart</a> will sponsor. Email me a receipt at brett.florio@example.com.
Notice the $0 and $1. $0 is the complete match. $1 is the first captured group. $2 would be the second, etc.
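The same replacement, sketched in Python, where the references are spelled `\g<0>` (complete match) and `\1` (first group) instead of `$0` and `$1`. The input text is a shortened, made-up version of the example above.

```python
import re

MENTION = re.compile(r"\B@(\w{2,})")

text = ("Hey @calevans, could you pick up some #ice_cream? "
        "@FoxyCart will sponsor. Email brett.florio@example.com.")

# \1 is the captured username, \g<0> is the whole "@username" match.
linked = MENTION.sub(r'<a href="foo?user=\1">\g<0></a>', text)
```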
A recent example…
Find credit card numbers, before they get submitted, emailed, saved, logged, or backed up.
Visualization by https://jex.im/regulex/
▷ Backreferences refer back to previous captured groups in the same pattern.
▷ Syntax is \#, where # is the number of the group.
▷ Useful for matching pairs of things (opening/closing quotes and tags).
Backreferences
http://regexr.com/3a8j0
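As one possible illustration of matching pairs, here is a Python sketch that finds quoted values where the closing quote has to be the same character as the opening one. The pattern and sample text are assumptions, not taken from the linked example.

```python
import re

# Group 1 captures the opening quote; \1 demands the same one to close.
QUOTED = re.compile(r"""(["'])(.*?)\1""")

text = "title=\"It's here\" alt='say \"hi\"'"

values = [m.group(2) for m in QUOTED.finditer(text)]
```

The apostrophe inside the double-quoted value and the double quotes inside the single-quoted value are left alone, because neither one equals the quote captured in group 1 for that match.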
Problem: Strip script tags without stripping extra stuff.
Goal: %<script.*?</script>%
6. Greediness & the Dot
https://regex101.com/r/uJ7jQ6/1
Greedy by default
This pattern will match as much as it possibly can.
Anytime you use a dot, remember how greedy it is.
https://regex101.com/r/lO1sB7/1
Adding a ? after a repetition metacharacter (+, *, or {m,n}) will make it non-greedy.
Notice the difference. It’ll stop the match as soon as it can instead of as late as it can.
In general, always throw a ? after a + or *.
Go non-greedy!
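A Python sketch of the difference, on a made-up HTML snippet. It also shows that multi-line scripts need DOTALL (single-line) mode, since the dot does not cross newlines by default.

```python
import re

html = ("<p>a</p><script>one()</script><p>keep me</p>"
        "<script>two()</script><p>b</p>")

# Greedy: runs from the FIRST <script to the LAST </script>.
greedy = re.sub(r"<script.*</script>", "", html)

# Non-greedy: stops at the nearest </script> each time.
lazy = re.sub(r"<script.*?</script>", "", html)

# A script spanning several lines is only caught with DOTALL.
multi = "<script>\nx()\n</script>ok"
without_dotall = re.sub(r"<script.*?</script>", "", multi)
with_dotall = re.sub(r"<script.*?</script>", "", multi, flags=re.DOTALL)
```

The greedy version throws away the "keep me" paragraph along with both scripts.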
The / is often used as the pattern delimiter, so it needs to be escaped.
preg_match('/https?:\/\/.*?\//i'
In PHP you can use others. % or ` (backtick) work well.
preg_match('%https?://.*?/%i'
preg_match('`https?://.*?/`i'
In JavaScript, you can’t use others, but you can construct without them…
var re = new RegExp("https?://");
http://php.net/manual/en/regexp.reference.delimiters.php
Slashes and HTML
Problem: Validate a subdomain with dashes (which can’t start or end the string)
Goal: ^(?!-)[a-z0-9-]{1,63}(?<!-)$
7. Lookarounds!
Positive Lookahead: Match something followed by something else.
(?=)
Negative Lookahead:
Match something not followed by something else.
(?!)
https://regex101.com/r/gK0mE7/1
https://regex101.com/r/mE1fC4/1
Lookaheads
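A hedged Python sketch of both lookaheads, on an invented CSS-like string. The lookahead checks what comes next without consuming it, so the unit is never part of the match.

```python
import re

css = "width: 10px; margin: 20em; height: 30px"

# Positive lookahead: numbers that ARE followed by "px".
px_values = re.findall(r"\d+(?=px)", css)

# Negative lookahead: numbers NOT followed by "px". The extra \d keeps
# the engine from backtracking to "1" (which is followed by "0", not "px").
non_px_values = re.findall(r"\d+(?!\d|px)", css)
```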
Positive Lookbehind: Match something preceded by something else.
(?<=)
Negative Lookbehind: Match something not preceded by something else.
(?<!)
JavaScript didn’t support lookbehinds until ES2018 added them, and lookbehinds have some limitations (many engines require a fixed-width pattern).
https://regex101.com/r/kL3rA4/1
https://regex101.com/r/xT1gA9/1
Lookbehinds
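A Python sketch of both lookbehinds, using an invented string. The limitation shown at the end is specific to Python's `re` module, which only accepts fixed-width lookbehinds; other engines draw the line in different places.

```python
import re

text = "cost $40, qty 12, tip $7"

# Positive lookbehind: digits preceded by a dollar sign.
dollar_amounts = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: whole numbers NOT preceded by a dollar sign.
# \b stops the match from starting in the middle of "40".
plain_numbers = re.findall(r"(?<!\$)\b\d+", text)

# Variable-width lookbehind: rejected by Python's re module.
try:
    re.compile(r"(?<=\$+)\d+")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)
```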
Subdomains can’t be longer than 63 characters, can only contain letters, numbers, and dashes, but cannot start or end with a dash.
The top is without lookarounds.
The bottom is with ‘em.
https://regex101.com/r/jU0yI3/2 from http://stackoverflow.com/a/7933253/862520
https://regex101.com/r/wV7yQ0/2
Practical lookarounds
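The lookaround version of the subdomain check can be sketched in Python as follows. The `is_valid_subdomain` helper and the test values are hypothetical; `fullmatch` is used so a trailing newline cannot slip past the `$`.

```python
import re

SUBDOMAIN = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")


def is_valid_subdomain(name):
    """Hypothetical helper: 1-63 chars of a-z, 0-9, dash; no edge dashes."""
    return SUBDOMAIN.fullmatch(name) is not None


checks = {
    "my-store": is_valid_subdomain("my-store"),
    "-store": is_valid_subdomain("-store"),     # leading dash
    "store-": is_valid_subdomain("store-"),     # trailing dash
    "63 chars": is_valid_subdomain("a" * 63),
    "64 chars": is_valid_subdomain("a" * 64),   # too long
    "empty": is_valid_subdomain(""),
    "newline": is_valid_subdomain("store\n"),
}
```

The pattern is lowercase-only as written; add the case-insensitive flag if uppercase input should be accepted.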
Special Characters: aka “Metacharacters”
▷ caret ^
▷ dollar sign $
▷ period or dot .
▷ question mark ?
▷ asterisk or star *
▷ plus sign +
▷ parentheses ( )
▷ square brackets [ ]
▷ curly brackets { }
▷ pipe |
▷ backslash \
Reading & Resources:
▷ regular-expressions.info
▷ regexr.com is my jam.
▷ regex101.com does a bit more if you need it.
▷ phpliveregex.com shows PHP’s handling of preg_ methods.
▷ jex.im/regulex/ is a super helpful visualization.
Overview
▷ The pipe character, to match one pattern OR another
▷ All the character classes: \s \S \d \D \W
▷ Unicode support, and how frustrating it can be
▷ Non-capturing (or “passive”) groups
▷ Named capturing groups
▷ How the \b and \B work as they relate to the @mentions example. Why does \B@foo match the way it does? How do they relate to \w and \W?
Homework!
You can find me at:
@brettflorio, brett.florio@foxycart.com
You can leave feedback at https://joind.in/event/lone-star-php-2017/regex-makes-me-weepgive-up-i
Slides available at bit.ly/regex-makes-me-wanna
Thanks! Any questions?
Thanks again to @calevans and @nomadphp for asking me to do this talk in the first place.
Credits
Thanks also to all the people who made and released these awesome resources for free:
▷ Minicons by Webalys
▷ Presentation template by SlidesCarnival