Fun with Regular Expressions

(no, really)

Based on...

Intermediate JavaScript

Girl Develop It Philly

http://gdiphilly.github.io/Intermediate-JS

This talk...

Elise Wei

  • Drexel grad
  • Self-taught (2002-present)
  • Full-time dev (2008-present)
  • Girl Develop It (2012-present)

Elise Wei

Software Engineer at Ticketleap, working on Port (currently in beta).

Monetate: 3rd party script that allows marketers to personalize/customize their e-commerce sites and measure the impact.

Scraping!

Regular Expressions

Allow us to match patterns (sometimes extremely complex) in strings.

What do you do with them?

Given some text and a pattern, you can "split," "replace," "match," "search," "test," and "exec." How each one behaves is a bit quirky.

String methods that accept regexes

  • split: finds all instances of the pattern, deletes them, and uses those points to divide the text into an array
  • replace: swaps matched text for a replacement, which can be a plain string or can reuse captured parts of the match
  • match: returns an array of all matching substrings when the global modifier is used; without it, returns the first match plus any captured substrings
  • search: returns the numerical index of the first match, or -1 if there is none

    var str = 'which way';
    str.search(/hi/); // returns 1
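
The four string methods above can be sketched side by side. This is a minimal illustration, assuming a made-up sample string (`text`) that is not part of the original slides:

```javascript
// Hypothetical sample text for illustration
const text = 'red, green,blue';

// split: the pattern marks the cut points and is dropped from the result
const parts = text.split(/,\s*/); // ['red', 'green', 'blue']

// replace: without the g modifier, only the first match is replaced
const firstOnly = text.replace(/e/, 'E'); // 'rEd, green,blue'

// match: without g, returns the first match (plus any captured groups)
const firstMatch = text.match(/[a-z]+/); // firstMatch[0] is 'red'

// search: index of the first match, or -1 when nothing matches
const position = text.search(/green/); // 5
const missing = text.search(/purple/); // -1
```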
                        

Regex methods that accept strings

  • test: returns true if the pattern is found, otherwise, false
  • exec: returns only the first matched string and any captured substrings (this differs from "match" when the global modifier is used; more on that later)

    var pattern = /hi/;
    pattern.test('chocolate chip'); // returns true
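
To see "test" and "exec" next to each other, here is a small sketch that reuses the 'chocolate chip' string from the slide. The pattern `/c(h)i/` and the variable names are illustrative choices, not from the original talk:

```javascript
// Parentheses capture part of the match (covered in more detail later)
const chipPattern = /c(h)i/;

// test: a plain yes/no answer
const found = chipPattern.test('chocolate chip'); // true
const notFound = chipPattern.test('vanilla'); // false

// exec: the full match, then each captured group, or null if no match
const result = chipPattern.exec('chocolate chip');
// result[0] is 'chi', result[1] is 'h', result.index is 10
const nothing = chipPattern.exec('vanilla'); // null
```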
                        

JS is like...

And the devs are like...

Regexes are VERY POWERFUL, but can be difficult to use.

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. —Jamie Zawinski

So really, what do you do with them?

Fun/silly stuff:


Real uses:

  • sanitizing and validating user-entered data
  • automatically suggesting search terms or tags for a new blog post
  • templating with strings (text replacement)
  • getting data out of a url
  • and many more

For the following examples, you can use regex101 to easily experiment.

The basics of writing a regular expression


    'Hello, class'
                    
  • Start and end with forward slashes /a/, or instantiate with new RegExp('a'). Either would match the first lowercase "a"
  • Can be used to find an exact string, such as /Hello/
  • Can look for text at the very beginning or end of the string using ^ and $ (which you may recognize from some CSS uses). /^class/ would not find a match, but /class$/ would
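
The three points above can be checked against the 'Hello, class' string. This is a rough sketch; the variable names are only for illustration:

```javascript
const greeting = 'Hello, class';

// A literal and the RegExp constructor build the same pattern
const literal = /a/;
const constructed = new RegExp('a');
const samePattern = literal.source === constructed.source; // true
const firstA = greeting.search(literal); // 9

// An exact string
const hasHello = /Hello/.test(greeting); // true

// ^ anchors to the start of the string, $ to the end
const startsWithClass = /^class/.test(greeting); // false
const endsWithClass = /class$/.test(greeting); // true
```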

Escaping & character ranges


    '$50.00'
                    

/\$/ We've already seen that the dollar sign has special meaning in a regular expression, so what if we're actually looking for a dollar sign character? Use a backslash.

[0123456789] You can also match any of a group of characters by using square brackets.

[0-9] But you can also signify any of a range of characters by using a hyphen.

Escaping & character ranges

[0-9a-z] Or multiple ranges.
A range covers every character whose Unicode number (code point) falls between the two endpoints.

What's the difference between [A-z] and [A-Za-z]?

[^13579] You can also use a caret to exclude characters in a range. This example would match any character that is not an odd digit: even digits, but also letters, spaces, and punctuation.
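
A short sketch of escaping, ranges, and negation, using the '$50.00' string from the slide. The [A-z] check relies on the fact that the code points between "Z" (90) and "a" (97) include characters such as the underscore; the variable names are illustrative:

```javascript
const price = '$50.00';

// Escaped dollar sign: matches the literal character
const hasDollar = /\$/.test(price); // true

// A range of digits
const firstDigit = price.match(/[0-9]/)[0]; // '5'

// [A-z] runs from code point 65 to 122, so it also covers [ \ ] ^ _ `
const looseRange = /[A-z]/.test('_'); // true
const strictRange = /[A-Za-z]/.test('_'); // false

// A negated set matches any character NOT listed
const notOdd = /[^13579]/;
const evenDigit = notOdd.test('4'); // true
const letter = notOdd.test('x'); // true (a letter is not an odd digit either)
const oddDigit = notOdd.test('7'); // false
```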

One of...

x|y|z Any of the pipe-separated options.
Here, there is no difference from [xyz], but it becomes more useful when using longer expressions, for example x1|y2|z3.
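
As a quick illustration of alternation with longer options, here is a sketch using the x1|y2|z3 pattern mentioned above. The test strings are made up:

```javascript
// Single characters: alternation and a character set behave the same
const viaPipe = /x|y|z/.test('y'); // true
const viaSet = /[xyz]/.test('y'); // true

// Longer options: each alternative must match as a whole
const codePattern = /x1|y2|z3/;
const matchesY2 = codePattern.test('y2'); // true
const matchesY1 = codePattern.test('y1'); // false
const matchesX3 = codePattern.test('x3'); // false
```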

Wildcards

There are also some special wildcards that you can use to find pre-defined character sets:

. A period matches any character except a line break. Escape with a backslash to match an actual period.

\s for whitespace (\S for non-whitespace). Whitespace includes spaces, tabs, new lines, carriage returns, etc.

\d for digits (\D for non-digits)

\w for "word" characters, which includes letters, digits, and the underscore (\W for non-word)

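\b for the beginnings or ends of words (\B for everything else). It matches the position between a word character and a non-word character, not a character itself.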
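
Here is a small sketch that exercises each wildcard once. The sample sentence (`line`) is invented for the example:

```javascript
// Hypothetical sample text
const line = 'Room 42_b is open.';

// \d: digits
const twoDigits = line.match(/\d\d/)[0]; // '42'

// \s: whitespace (the first space sits at index 4)
const firstSpace = line.search(/\s/); // 4

// \w: letters, digits, and the underscore
const fourWordChars = line.match(/\w\w\w\w/)[0]; // 'Room'

// \b: a word boundary, which is a position rather than a character
const wholeWord = /\bopen\b/.test(line); // true
const partialWord = /\bpen\b/.test(line); // false ('pen' starts mid-word)

// . matches any character except a line break; \. matches a real period
const dotOnNewline = /./.test('\n'); // false
const literalDot = /\./.test(line); // true
```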

Modifiers

Modifiers added after the final slash change how the search runs: g (global) finds every match instead of stopping at the first, i makes the search case-insensitive, and m (multi-line) makes ^ and $ match at the start and end of each line rather than only the start and end of the string. There are a couple of other modifiers for Unicode and "stickiness" (u and y), but they're not used often.

Modifiers


    'Hello, class.
    How was your week?'
                    

/h/ would have no matches.

/h/i would match the first "H"

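/h/ig would match all instances of "H". This is useful for "match" and "replace" functions. "split" is global by default, and "search" is not affected by a global modifier. "test" and "exec" are affected: with g, the regex remembers where the last match ended (its lastIndex property) and resumes from there on the next call.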

/^h/ig would match only the first "H", while /^h/igm would match both.
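
The modifier examples above can be run directly against the two-line string. The sketch below also shows how a global regex keeps its position between "test" calls through lastIndex; the variable names are illustrative:

```javascript
const message = 'Hello, class.\nHow was your week?';

const caseSensitive = message.match(/h/); // null
const ignoreCase = message.match(/h/i)[0]; // 'H'
const allMatches = message.match(/h/ig); // ['H', 'H']

// Without m, ^ only matches at the start of the whole string
const stringStart = message.match(/^h/ig); // ['H']
// With m, ^ also matches at the start of each line
const lineStarts = message.match(/^h/igm); // ['H', 'H']

// A global regex is stateful when used with test (or exec)
const globalPattern = /h/ig;
const first = globalPattern.test(message); // true, lastIndex is now 1
const second = globalPattern.test(message); // true, lastIndex is now 15
const third = globalPattern.test(message); // false, lastIndex resets to 0
```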

Capturing groups

Sometimes we'll want to look for a pattern, but only use part of it. Parentheses indicate a part of the pattern you would specifically like the search to "capture" and return.


    "ace".match(/a(c)e/); // returns ["ace", "c"]
                    

The whole pattern is matched, and the next element of the returned array is the specifically "captured" group of matched characters.

Capturing groups

Now combine that with the character range syntax /a([a-z])e/


    var pattern = /a([a-z])e/;
    "age".match(pattern); // returns ["age", "g"]
    "ale".match(pattern); // returns ["ale", "l"]
    "variegated".match(pattern); // returns ["ate", "t"]
                    

Are you beginning to see the awesome power?

mind. blown.

Quantifiers

You can also specify different numbers of characters to allow in the match.

+ Allows for any (non-zero) number of the preceding character or expression to qualify for the match.

* Looks for zero or more of the preceding character or expression

? Matches zero or one of the preceding character or expression (in other words, makes it optional).

{3} {1,3} Match a specific quantity, or a specific quantity range.
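
A brief sketch of each quantifier in action. The words and numbers being tested are made up for the example:

```javascript
// + : one or more of the preceding character
const plusMany = /go+al/.test('gooooal'); // true
const plusNone = /go+al/.test('gal'); // false

// * : zero or more
const starNone = /go*al/.test('gal'); // true

// ? : zero or one (optional)
const american = /^colou?r$/.test('color'); // true
const british = /^colou?r$/.test('colour'); // true
const tooMany = /^colou?r$/.test('colouur'); // false

// {n} and {min,max} : an exact count, or a count range
const exactlyThree = /^\d{3}$/.test('215'); // true
const oneToThree = /^\d{1,3}$/.test('2150'); // false (four digits)
```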

Quantifiers

This looks like...


    "variegated".match(/a([a-z]+)e/); // returns ["ariegate", "riegat"]
                    

(This might not be what you expected. Watch out for "greediness"! Use +?, *? or {}? to create a "reluctant", rather than greedy, search.)


    "variegated".match(/a([a-z]+?)e/); // returns ["arie", "ri"]
                    

Bonus

  • parens (and non-capturing)
  • deciphering regex
  • look-ahead
  • multi-part search/replace

Parentheses can...

  • Create a capturing group
  • Apply quantifiers to a group of characters
  • Group multi-character options

    /a([a-z])e/ // matches ace, ate, ale and returns the middle letter
    /a(pa)?ce/ // matches ace or apace and returns "pa" or undefined
    /a(cr|bl)e/ // matches acre or able and returns "cr" or "bl"
                        

Non-capturing parens

In cases 2 and 3, if you want parentheses for grouping, but don't need the result, use ?: inside the parens to indicate a non-capturing group.


    /a(?:[a-z])e/ // this is the same as leaving out the parens entirely
    /a(?:pa)?ce/ // matches ace or apace and captures nothing
    /a(?:cr|bl)e/ // matches acre or able and captures nothing
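
To make the difference concrete, this sketch compares the arrays returned with and without a capturing group. The variable names are illustrative:

```javascript
// Capturing group: the result has a slot for the group
const withGroup = 'apace'.match(/a(pa)?ce/); // ['apace', 'pa']
// If the optional group did not take part, its slot is undefined
const groupSkipped = 'ace'.match(/a(pa)?ce/); // ['ace', undefined]

// Non-capturing group: only the full match is returned
const noCapture = 'apace'.match(/a(?:pa)?ce/); // ['apace']
const noCaptureAlt = 'acre'.match(/a(?:cr|bl)e/); // ['acre']
```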
                        

Deciphering Regexes


    /(\d+)[\,\.](?=\d{3})/
                        

(\d+) Capture one or more digits

[\,\.] A period or comma

(?= ) Followed by... (more on looking ahead/back)

\d{3} Exactly 3 digits
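
Running the pattern shows how the pieces fit together. In this sketch the sample inputs are invented; note that the look-ahead checks the three digits without including them in the match:

```javascript
const separatorPattern = /(\d+)[\,\.](?=\d{3})/;

const result = separatorPattern.exec('1,234');
// result[0] is '1,' (the look-ahead digits are not part of the match)
// result[1] is '1'  (the captured digits)

// Only two digits follow the comma, so the look-ahead fails
const tooShort = separatorPattern.test('1,23'); // false

// A period counts as a separator too, so this also matches
const withPeriod = separatorPattern.test('3.141'); // true
```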

Complex Replacement


    var pattern = /(\d+)[\,\.](?=\d{3})/g;
    function stripSeparators(numStr) {
        return numStr.replace(pattern, "$1");
    }
                        

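Each time the full pattern is matched, it is replaced with the first captured group ($1, which corresponds to element 1 of the array returned by the pattern match). The three digits in the look-ahead are checked but are not part of the match, so they stay in place.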
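
A few sample calls show what the function does in practice. The definition is repeated so the sketch runs on its own, and the input strings are made up. One caveat worth noticing: because the pattern accepts a period as a separator, a decimal number with exactly three decimal places is also collapsed.

```javascript
const pattern = /(\d+)[\,\.](?=\d{3})/g;
function stripSeparators(numStr) {
    return numStr.replace(pattern, "$1");
}

const commas = stripSeparators('1,234,567'); // '1234567'
const periods = stripSeparators('1.234.567'); // '1234567'
const inText = stripSeparators('Total: 12,500 items'); // 'Total: 12500 items'

// Left alone: only two digits follow the comma
const untouched = stripSeparators('1,23'); // '1,23'

// Caveat: the pattern cannot tell a decimal point from a separator
const decimal = stripSeparators('3.141'); // '3141'
```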

Summary

Regular expressions can be incredibly useful and powerful. When you start small and get familiar, they can even be fun!

Contact