

Regex makes me want to ( weep | give up | (╯°□°)╯︵ ┻━┻ )\.?

A presentation by Brett Florio of FoxyCart.com.

Follow along at bit.ly/regex-makes-me-wanna

Who it’s for?

▷ Beginners looking to understand the basics

▷ Intermediate regex devs wanting a review and some new approaches

▷ Advanced programmers who just don’t really grok regular expressions.

▷ Anybody who hates regex because they don’t understand it.


How we’ll learn:

Rather than abstract concepts like “cat” and “dog”, we’ll focus on real use-cases you might run across in your daily programming.

What we’ll learn:

▷ Our goal

▷ A brief history of regex

▷ Matching

▷ Validating

▷ Replacing

▷ Working with HTML

▷ Common gotchas

About this presentation!

▷ Co-founded FoxyCart.com (now Foxy.io) in 2007

▷ Dove into regex when @lukestokes told me something was impossible. Proved him wrong.

▷ Spent the past five years traveling full-time or half-time in an RV with my wife and 3 kids.

▷ Currently in Austin, TX, and happy to grab food or drinks if you’re in town!


http://brettflorio.com/ has more photos like this -->

FoxyCart.com / Foxy.io is where I solve problems.

About @brettflorio

# Credit card number matcher
CREDIT_CARD = re.compile(r'([^\d])([3456][ -]*?(?:\d[ -]*?){12,15})([^\d])')

# Password matching
PASSWORD = re.compile(r'customer_password=(.*?)&')


A recent real-life regex…

Extra sanitization of logs, in a Chef recipe:

Our goals!

Understand how to:

1. Find emails

2. Validate custom input

3. Link @mentions and #tags in text

4. Strip <script> tags

5. Truly validate a subdomain: ^(?!-)[a-z0-9-]{1,63}(?<!-)$


Big thanks to NomadPHP.com!

Check out Daycamp4Developers (PHP Application Security day in June)

1. REGEX: A Brief Intro

With an even briefer coverage of its history.

“Some people, when confronted with a problem, think ‘I know, I'll use regular expressions.’ Now they have two problems.”


▷ 1940s-60s: Lots of smart people

▷ 1970s: g/re/p

▷ 1980s: Perl and Henry Spencer

▷ 1997: PCRE (Perl Compatible Regular Expressions)

Pronunciation: hard or soft ‘g’

Regular expressions’ history


Matching

int preg_match ( string $pattern , string $subject [, array &$matches [, int $flags = 0 [, int $offset = 0 ]]] )

Returns 1 if a match is found, 0 if not, or false on error.

Common regex usage: PHP

Replacing

mixed preg_replace ( mixed $pattern , mixed $replacement , mixed $subject [, int $limit = -1 [, int &$count ]] )

Returns the replaced string or array (based on the $subject).

Matching (all)

int preg_match_all ( string $pattern , string $subject [, array &$matches [, int $flags = PREG_PATTERN_ORDER [, int $offset = 0 ]]] )

Returns # (int) of matches found.


Matching

string.match(RegExp);

Returns an array of matches, or null if no matches.

Replacing

string.replace(RegExp, replacement);

Returns the string with the replacements performed.

Caveats about JavaScript’s regex

▷ No “single-line” or DOTALL mode before ES2018. (Without the newer s flag, the dot never matches a new line.)

▷ No lookbehind support before ES2018 :(

▷ Same methods for regex and non-regex matching and replacing.

Common regex usage: JS

Problem: Finding email addresses in a codebase.

Goal: /[\w.+-]+@[a-z0-9-]+(\.[a-z0-9-]+)*/i

2. The Basics of Regex Patterns

Hypothetical situation:

Your project has bloated over the years, and both internal and external emails are going everywhere, maybe including terminated employees, personal accounts, etc.

Your mission:

You need to search the whole codebase to find all the emails so you can tidy things up!

Find all the emails!

Or… an alternate story:

You need to strip emails from user-submitted content, to protect privacy or restrict communication (or like Airbnb does).

~12 Special Characters, aka “Metacharacters”

▷ . \ [ ] ? * + { } ( ) ^ $ |

▷ - (sometimes)

Nearly everything else is a literal!

Imagine your input string as bolts, and your pattern as a set of sockets (in order).

An analogy: Sockets!

"Socket wrench and sockets" by Kae - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons

The exact match

If you know exactly what you’re looking for…

You still might get more than you wanted!


The almighty . and the escape \

The dot (.) matches ANYTHING and EVERYTHING.

Except… new lines, by default. PHP and others can enable DOTALL or single-line mode to have the dot match a new line. JavaScript couldn’t until the s flag arrived in ES2018.

The backslash \ escapes special characters (metacharacters). So \. makes a dot match just a dot.
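A minimal sketch of the dot versus the escaped dot. It uses Python's `re` module rather than the PHP `preg_*` functions from the slides; the sample string is made up for illustration, and `re.DOTALL` plays the role of the single-line mode described above.

```python
import re

sample = "abc a.c a\nc"  # hypothetical input

# Unescaped dot: any single character except a newline.
any_char = re.findall(r"a.c", sample)

# Escaped dot: only a literal ".".
literal_dot = re.findall(r"a\.c", sample)

# DOTALL lets the dot match a newline as well.
dotall = re.findall(r"a.c", sample, re.DOTALL)

print(any_char)     # ['abc', 'a.c']
print(literal_dot)  # ['a.c']
print(dotall)       # ['abc', 'a.c', 'a\nc']
```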


The almighty . (dot)

The dot (.) matches ANYTHING and EVERYTHING (except newlines, by default).

Gator Grip Universal Socket, available online.


Toysmith Classic Pin Art, ~$20. Buy one!

Square brackets match what’s inside them.

[abc] ‘a’ ‘b’ or ‘c’
[a-z] Lowercase letters
[0-9] Any single digit
[a-z.] Letters and the dot

A common case is… [A-Za-z0-9_] which has a shortcut:
\w “Word” characters

So… let’s try this: [\w.+-]

Character Classes!
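To see character classes in action, here is a small sketch in Python's `re` module (the slides use PHP). The sample text and address are hypothetical. One assumption worth flagging: in Python 3, `\w` also matches non-ASCII letters unless `re.ASCII` is passed, so only the `re.ASCII` variant below is strictly `[A-Za-z0-9_]`.

```python
import re

text = "Order #A-42 shipped to j.doe+news@example.com"  # made-up sample

# [0-9] matches one digit at a time.
digits = re.findall(r"[0-9]", text)

# [\w.+-]+ grabs runs of word characters, dots, pluses, and dashes.
chunks = re.findall(r"[\w.+-]+", text)

# \w is Unicode-aware by default in Python 3; re.ASCII narrows it.
unicode_words = re.findall(r"\w+", "café_42")
ascii_words = re.findall(r"\w+", "café_42", re.ASCII)

print(digits)         # ['4', '2']
print(chunks)         # ['Order', 'A-42', 'shipped', 'to', 'j.doe+news', 'example.com']
print(unicode_words)  # ['café_42']
print(ascii_words)    # ['caf', '_42']
```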


Dashes need escaping inside square brackets (unless they’re at the start or the end), since they have special meaning there: they define ranges.

So… [\w.+-] is fine. The dash is at the end.
But… [\w.\-+] needs escaping.

When in doubt, escaping doesn’t typically hurt.
[\w.+\-] is also just fine.
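A rough illustration of why dash placement matters, again in Python's `re` rather than PHP. The `[+-.]` class is an example invented for this sketch; how an invalid range is reported differs between regex engines, so the error branch shows Python's behavior only.

```python
import re

# Dash at the end, or escaped: a literal dash.
dash_last = re.findall(r"[\w.+-]", "a-b")
dash_escaped = re.findall(r"[\w.\-+]", "a-b")

# Unescaped dash in the middle: "+-." is the range from "+" to ".",
# which quietly includes the comma sitting between them.
accidental_range = re.findall(r"[+-.]", "+,-.")

# Reversed order: ".-+" runs backwards, so Python refuses to compile it.
try:
    re.compile(r"[\w.-+]")
    reversed_range_error = None
except re.error as exc:
    reversed_range_error = str(exc)

print(dash_last)         # ['a', '-', 'b']
print(dash_escaped)      # ['a', '-', 'b']
print(accidental_range)  # ['+', ',', '-', '.']
print(reversed_range_error is not None)  # True
```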


? 0 or 1 match (optional)
* 0 or more matches
+ 1 or more matches

But what about at least 3, or 1 through 6 matches?





Curly brackets get you minimum and maximum ranges. Minimum is required:

{1,} At least 1
{1,3} 1 through 3
{1,64} 1 through 64

64 characters is the maximum length of the username portion of an email, so…

More Repetition!
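The repetition operators can be tried out with a short Python sketch. `is_local_part` is a hypothetical helper name, and the `{1,64}` limit simply follows the 64-character figure quoted above; it only checks length and allowed characters, nothing more.

```python
import re

# ?, *, and + on small made-up inputs.
optional = re.findall(r"https?", "http https")  # "s" is optional
zero_or_more = re.findall(r"ab*", "a ab abbb")  # "a", then any number of "b"
one_or_more = re.findall(r"ab+", "a ab abbb")   # "a", then at least one "b"

# {1,64}: between 1 and 64 of the allowed characters.
LOCAL_PART = re.compile(r"[\w.+-]{1,64}")

def is_local_part(value: str) -> bool:
    """Return True when the whole value is 1 to 64 allowed characters."""
    return LOCAL_PART.fullmatch(value) is not None

print(optional)                 # ['http', 'https']
print(zero_or_more)             # ['a', 'ab', 'abbb']
print(one_or_more)              # ['ab', 'abbb']
print(is_local_part("brett"))   # True
print(is_local_part("x" * 65))  # False
```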

It looks similar in both PHP…

preg_match('/pattern/i', $subject);

And JavaScript:


Other common modifiers are:

s Makes the dot match newlines as well. (PHP)
g Match all, not just the first. (JavaScript)
m Makes ^ and $ line-specific.

References for PHP and JavaScript

By default, regex is case-sensitive.
Adding an “i” after the pattern’s delimiter fixes that.
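For comparison, a sketch of the same modifiers in Python, where they are passed as flag constants instead of letters after the delimiter: `re.IGNORECASE` for i, `re.DOTALL` for s, `re.MULTILINE` for m. There is no g flag; `re.search` returns the first match and `re.findall` returns all of them. The subject string is made up.

```python
import re

subject = "FoxyCart\nfoxycart"  # hypothetical two-line input

case_sensitive = re.findall(r"foxycart", subject)
ignore_case = re.findall(r"foxycart", subject, re.IGNORECASE)

# Without MULTILINE, ^ only matches at the very start of the string.
string_start = re.findall(r"^\w+", subject)
line_starts = re.findall(r"^\w+", subject, re.MULTILINE)

# DOTALL lets the dot match the newline between the two lines.
across_lines = re.search(r"Cart.foxy", subject, re.DOTALL) is not None

# search() stops at the first match; findall() keeps going.
first_only = re.search(r"foxycart", subject, re.IGNORECASE).group(0)

print(case_sensitive)  # ['foxycart']
print(ignore_case)     # ['FoxyCart', 'foxycart']
print(string_start)    # ['FoxyCart']
print(line_starts)     # ['FoxyCart', 'foxycart']
print(across_lines)    # True
print(first_only)      # FoxyCart
```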


Putting it all together

/[\w.+-]+@[a-z0-9-]+(\.[a-z0-9-]+)*/i

(Try it on a project in your text editor.)

A great tool for testing how PHP handles preg_match, preg_match_all, and preg_replace is http://www.phpliveregex.com/

See this example at http://www.phpliveregex.com/p/9yD

What that looks like in PHP

preg_match_all(
    "/[\w.+-]+@[a-z0-9-]+(\.[a-z0-9-]+)*/i",
    $input_lines,
    $output_array);

Array
(
    [0] => Array
        (
            [0] => [email protected]
            [1] => [email protected]
            [2] => [email protected]
            [3] => [email protected]
            [4] => [email protected]
            [5] => [email protected]
            [6] => [email protected]
            [7] => admin@localhost
            [8] => [email protected]
            [9] => [email protected]
            [10] => [email protected]
        )
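The same search can be sketched in Python. The sample line and every address in it are invented for illustration. Because the pattern contains a capturing group, `re.findall` would return only the group's contents, so this sketch walks `finditer` and reads `group(0)` to get each full match.

```python
import re

# Same pattern as the slide; re.IGNORECASE stands in for the trailing "i".
EMAIL = re.compile(r"[\w.+-]+@[a-z0-9-]+(\.[a-z0-9-]+)*", re.IGNORECASE)

# Hypothetical line of source code to scan.
sample = 'mail("support@example.com"); // cc admin@localhost and Sales+EU@mail.example.org'

found = [match.group(0) for match in EMAIL.finditer(sample)]

print(found)
# ['support@example.com', 'admin@localhost', 'Sales+EU@mail.example.org']
```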

. [] ?

* + {}

Square Brackets

Matches characters inside the brackets. Supports ranges.

[abc] ‘a’ ‘b’ or ‘c’
[a-z] Lowercase letters
[0-9] Any single digit

Quick review before funny gifs!

The Dot and the \w

Matches everything but new lines. If you want to match a dot and only a dot, escape it like \.

\w matches letters, numbers, and the underscore.


The ? matches 0 or 1

The Star

The * matches 0 or more.

The Plus

Matches 1 or more

Curly Brackets

Min and max ranges.

{1,} At least 1
{1,3} 1 through 3
{1,64} 1 through 64

Problem: Make sure input is what we expect.

Goal 1: /[^0-9a-z\-_.]/

Goal 2: /^[0-9]{1,2}[dwmy]$/

3. Using Regex for Validation

▷ Know your target.

▷ Some targets are impossible:

○ "much.more unusual"@example.com
○ "[email protected]"@example.com
○ "very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com
○ admin@mailserver1 (local domain name with no TLD)
○ !#$%&'*+-/=?^_`{}|[email protected]
○ "()<>[]:,;@\\\"!#$%&'*+-/=?^_`{}| ~.a"@example.org
○ " "@example.org (space between the quotes)

Hooray! But…

Validating things is where you get to determine exactly what you want.

Finding things… is usually a matter of “good enough”.

When not to use regex


Hammer icon by John Caserta, from The Noun Project

Just because you can use regex for validation doesn’t mean you should. PHP’s got lots handled.

filter_var( '[email protected]', FILTER_VALIDATE_EMAIL);

^ Start of string
$ End of string


if (!preg_match("%^[0-9]{1,2}[dwmy]$%", $_POST["subscription_frequency"])) {
    $IsError = true;
}
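As a sketch of why the anchors matter for validation, here is the frequency check in Python. `is_valid_frequency` is a hypothetical helper, and the inputs are invented. Without ^ and $, the pattern happily finds a valid-looking fragment inside junk.

```python
import re

# One or two digits followed by d, w, m, or y, and nothing else.
FREQUENCY = re.compile(r"^[0-9]{1,2}[dwmy]$")

def is_valid_frequency(value: str) -> bool:
    """Return True when the whole value looks like '2w' or '12m'."""
    return FREQUENCY.search(value) is not None

# The same pattern without anchors matches the "23d" buried in bad input.
unanchored_hit = re.search(r"[0-9]{1,2}[dwmy]", "123d; rm -rf /") is not None

print(is_valid_frequency("2w"))              # True
print(is_valid_frequency("12m"))             # True
print(is_valid_frequency("123d"))            # False
print(is_valid_frequency("123d; rm -rf /"))  # False
print(unanchored_hit)                        # True
```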


▷ Imagine writing routing rules. These will do very different things.

Small anchors. Big impact.

index(\.php)?
^index(\.php)?
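A quick way to see the difference is to run both patterns over a few paths. The paths below are made up, and the third pattern with a trailing $ is an addition for this sketch, not something taken from the slide.

```python
import re

# Hypothetical request paths.
paths = ["index", "index.php", "index.php.bak", "admin/index.php", "reindex"]

def matching(pattern: str) -> list:
    """Return the paths where the pattern is found anywhere it is allowed to match."""
    return [path for path in paths if re.search(pattern, path)]

unanchored = matching(r"index(\.php)?")
start_anchored = matching(r"^index(\.php)?")
fully_anchored = matching(r"^index(\.php)?$")  # extra variant, not from the slide

print(unanchored)      # ['index', 'index.php', 'index.php.bak', 'admin/index.php', 'reindex']
print(start_anchored)  # ['index', 'index.php', 'index.php.bak']
print(fully_anchored)  # ['index', 'index.php']
```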


Negated Character Classes

[^abc] Anything except a, b, or c, including new lines.

// Ensure input only contains
// alphanumeric, dash, dot, underscore
if (preg_match("/[^0-9a-z\-_.]/i", $product_code)) {
    $IsError = true;
}
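The "search for anything that is not allowed" approach might look like this in Python. `has_invalid_chars` and the product codes are hypothetical. Note that an empty string contains no forbidden characters, so it passes this particular check.

```python
import re

# Any single character that is NOT alphanumeric, dash, underscore, or dot.
INVALID_CHARS = re.compile(r"[^0-9a-z\-_.]", re.IGNORECASE)

def has_invalid_chars(product_code: str) -> bool:
    """Return True when at least one disallowed character is present."""
    return INVALID_CHARS.search(product_code) is not None

print(has_invalid_chars("SKU-123_v2.0"))  # False
print(has_invalid_chars("SKU 123"))       # True (space)
print(has_invalid_chars("sku\n"))         # True (negated classes match newlines)
print(has_invalid_chars(""))              # False (nothing to reject)
```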

Problem: Link @mentions and #tags

Goal: /\B@([\w]{2,})/i

4. Finding… and REPLACING

First we need to find them…

▷ @foo but not @foo.bar or [email protected]

▷ \w works well to get us [A-Za-z0-9_]

▷ \B is an anchor, like ^ or $, but that matches “not a word boundary”. It matches a position, not a character.

▷ Wrap a pattern in parentheses to make a “capturing group”.

But wait… We need pieces: ( )

preg_match_all( "/\B@([\w]{2,})/i", $input, $output_array);

Array
(
    [0] => Array
        (
            [0] => @calevans
            [1] => @FoxyCart
        )
    [1] => Array
        (
            [0] => calevans
            [1] => FoxyCart
        )
)
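A comparable result can be produced in Python. The input sentence is adapted from the talk's example, with a made-up email address standing in for the real one. `group(0)` is the whole match and `group(1)` is the first capturing group.

```python
import re

MENTION = re.compile(r"\B@(\w{2,})")

# Hypothetical input; the address is a placeholder.
text = ("Hey @calevans, could you pick up some #ice_cream? "
        "@FoxyCart will sponsor. Email me at brett@example.com.")

whole_matches = [m.group(0) for m in MENTION.finditer(text)]
usernames = [m.group(1) for m in MENTION.finditer(text)]

print(whole_matches)  # ['@calevans', '@FoxyCart']
print(usernames)      # ['calevans', 'FoxyCart']
```

The "@example" inside the email address is skipped because \B refuses to match between the "t" and the "@", which is a word boundary.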

The result…

Named capturing groups:

preg_match_all( "/\B@(?P<username>[\w]{2,})/i", $input);




For complex patterns or ease of reference, you can name capturing groups using (?P<name>) syntax.
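Python's `re` module accepts the same (?P<name>) syntax, so a minimal sketch looks nearly identical. The input string is made up.

```python
import re

MENTION = re.compile(r"\B@(?P<username>\w{2,})")

match = MENTION.search("Thanks @calevans!")  # hypothetical input

by_name = match.group("username")
by_number = match.group(1)  # named groups still get a number
as_dict = match.groupdict()

print(by_name)    # calevans
print(by_number)  # calevans
print(as_dict)    # {'username': 'calevans'}
```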

The result…

It’s replacin’ time!

preg_replace( "/\B@([\w]{2,})/i", "<a href=\"foo?user=$1\">$0</a>", $input);

Hey <a href="foo?user=calevans">@calevans</a>, could you pick up some #ice_cream and #gingerbread for #CoderFaire? <a href="foo?user=FoxyCart">@FoxyCart</a> will sponsor. Email me a receipt at [email protected].

Notice the $0 and $1. $0 is the complete match. $1 is the first captured group. $2 would be the second, etc.
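For readers following along in Python, a sketch of the same replacement with `re.sub`. The replacement syntax differs from PHP's: `\1` is the first group and `\g<0>` is the complete match. The input text and address are placeholders.

```python
import re

text = "Hey @calevans, send the receipt to brett@example.com."  # hypothetical

linked = re.sub(
    r"\B@(\w{2,})",
    r'<a href="foo?user=\1">\g<0></a>',  # \1 = username, \g<0> = whole match
    text,
)

print(linked)
# Hey <a href="foo?user=calevans">@calevans</a>, send the receipt to brett@example.com.
```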

A recent example…

Find credit card numbers, before they get submitted, emailed, saved, logged, or backed up.

Visualization by https://jex.im/regulex/

“preg_replace is the best.”

Problem: Match some HTML tag attributes.

Goal: %name=(['"]?)amount\1%

5. Backreferences and HTML

▷ Backreferences refer back to previous captured groups in the same pattern.

▷ Syntax is \#, where # is the number of the group.

▷ Useful for matching pairs of things (opening/closing quotes and tags).
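The quote-matching goal pattern can be sketched in Python, where no delimiters are needed. The tags are invented test inputs. The group captures a single quote, a double quote, or nothing, and \1 then demands the very same thing after "amount".

```python
import re

# (['"]?) captures the opening quote (or nothing); \1 must repeat it.
ATTR = re.compile(r"""name=(['"]?)amount\1""")

# Hypothetical snippets of HTML.
tags = [
    '<input name="amount">',
    "<input name='amount'>",
    "<input name=amount>",
    '<input name="amount\'>',  # mismatched quotes
]

results = [ATTR.search(tag) is not None for tag in tags]

print(results)  # [True, True, True, False]
```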



Problem: Strip script tags without stripping extra stuff.

Goal: %<script.*?</script>%

6. Greediness & the Dot


Greedy by default

This pattern will match as much as it possibly can.

Anytime you use a dot, remember how greedy it is.


Adding a ? after a repetition metacharacter (+, *, or {m,n}) will make it non-greedy.

Notice the difference. It’ll stop the match as soon as it can instead of as late as it can.

In general, always throw a ? after a + or *.

Go non-greedy!
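A small Python sketch makes the greedy and non-greedy versions easy to compare. The HTML string is made up; `re.DOTALL` is added as an assumption so that script bodies spanning several lines would also be covered.

```python
import re

# Hypothetical markup with two script blocks and content in between.
html = "<p>a</p><script>x()</script><p>keep</p><script>y()</script><p>b</p>"

# Greedy: .* runs to the LAST </script>, taking "<p>keep</p>" with it.
greedy = re.sub(r"<script.*</script>", "", html, flags=re.DOTALL)

# Non-greedy: .*? stops at the FIRST </script> it can reach.
lazy = re.sub(r"<script.*?</script>", "", html, flags=re.DOTALL)

print(greedy)  # <p>a</p><p>b</p>
print(lazy)    # <p>a</p><p>keep</p><p>b</p>
```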


Slashes and HTML

The / is often used as the pattern delimiter, so it needs to be escaped.


In PHP you can use others. % or ` (backtick) work well.


preg_match('`https?://.*?/`i', $subject);

In JavaScript, you can’t use others, but you can construct without them…

var re = new RegExp("https?://");

http://php.net/manual/en/regexp.reference.delimiters.php

Slashes and HTML

Problem: Validate a subdomain with dashes (which can’t start or end the string)

Goal: ^(?!-)[a-z0-9-]{1,63}(?<!-)$


Positive Lookahead: (?=)
Match something followed by something else.

Negative Lookahead: (?!)
Match something not followed by something else.

Positive Lookbehind: (?<=)
Match something preceded by something else.

Negative Lookbehind: (?<!)
Match something not preceded by something else.

JavaScript didn’t support lookbehinds until ES2018, and there are some limitations.



Subdomains can’t be longer than 63 characters, can only contain letters, numbers, and dashes, but cannot start or end with a dash.

The top is without lookarounds.

The bottom is with ‘em.

https://regex101.com/r/jU0yI3/2 from http://stackoverflow.com/a/7933253/862520
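The lookaround version of the subdomain check can be tried in Python, which supports both lookahead and lookbehind. `is_valid_subdomain` is a hypothetical helper and the inputs are made up. The pattern is case-sensitive as written, so uppercase input is rejected unless it is lowercased first.

```python
import re

# (?!-)  : the first character must not be a dash
# (?<!-) : the last character must not be a dash
SUBDOMAIN = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")

def is_valid_subdomain(value: str) -> bool:
    """Return True for 1 to 63 lowercase letters, digits, or inner dashes."""
    return SUBDOMAIN.search(value) is not None

print(is_valid_subdomain("foxy-cart"))  # True
print(is_valid_subdomain("-foxy"))      # False
print(is_valid_subdomain("foxy-"))      # False
print(is_valid_subdomain("a" * 63))     # True
print(is_valid_subdomain("a" * 64))     # False
```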


Practical lookarounds

Problem: You can’t get enough regex!

Goal: Learn all the regex!

8. Resources & Homework

Special Characters, aka “Metacharacters”

▷ caret ^

▷ dollar sign $

▷ period or dot .

▷ question mark ?

▷ asterisk or star *

▷ plus sign +

▷ parentheses ( )

▷ square brackets [ ]

▷ curly brackets { }

▷ pipe |

▷ backslash \

Reading & Resources:

▷ regular-expressions.info

▷ regexr.com is my jam.

▷ regex101.com does a bit more if you need it.

▷ phpliveregex.com shows PHP’s handling of preg_ methods.

▷ jex.im/regulex/ is a super helpful visualization tool.


▷ The pipe character, to match one pattern OR another

▷ All the character classes: \s \S \d \D \W

▷ Unicode support, and how frustrating it can be

▷ Non-capturing (or “passive”) groups

▷ Named capturing groups

▷ How the \b and \B work as they relate to the @mentions example. Why does \B@foo match the way it does? How do they relate to \w and \W?


You can find me at:

@brettflorio, [email protected]

You can leave feedback at https://joind.in/event/lone-star-php-2017/regex-makes-me-weepgive-up-i

Slides available at bit.ly/regex-makes-me-wanna

Thanks! Any questions?

Thanks again to @calevans and @nomadphp for asking me to do this talk in the first place.


Thanks also to all the people who made and released these awesome resources for free:

▷ Minicons by Webalys

▷ Presentation template by SlidesCarnival
