R to the E-G-E-X-P
Technical Blog 8: Regular Expressions
In practically every major programming language that’s in operation today, there is a way to search, match, replace, and edit strings on a massive scale without having to already know what’s in the field. Welcome to the world of Regular Expressions, AKA RegExp, AKA RegEx, AKA Rational Expressions.
We’ll get to the nitty gritty of what RegExp looks like and how to create and manipulate it, but first, a quick history lesson: Mathematician Stephen Kleene was the first to come up with the idea for Regular Expressions all the way back in the 1950s, though it wasn’t widely used in the world of programming until Unix came into fruition. Today, RegExp is used so much in code that every time you open up your computer or use your phone, it’s practically guaranteed that you’ll be using a program that takes advantage of RegExp, whether that be web based search engines, word processors, or even your computer’s internal finder tools.
We’ll be focusing today on RegExp in Ruby, though many other languages rely heavily on it, including: JavaScript, Perl, AWK, Tcl, .NET languages, Java, Python and C++. Most other languages offer RegExp via a library.
So, Here’s a string of text:
Now let’s say I wanted to search for the letter ‘m’ in this string. I could go about it two different ways. Here’s the normal way:
And here’s how we accomplish the same task with Regular Expressions:
Hmm. That looks almost exactly the same. Why use RegExp then? We’ll get to that, but first, let’s talk about what’s going on here.
The ‘=~’ symbol that you see is looking for the first occurrence of what’s inside the ‘//’. Inside those forward slashes is our match requirement, in this case a letter. When we want to find a particular letter or character in a string, we can put it in [] and RegExp will look for that exact character, including whether it’s capitalized or not.
So I ask again: What’s the use of RegExp in Ruby when I already have tools that can search for particular characters or patters in a string? The key here is this: What if you don’t know what the string is going to be? On top of that, what if you need to look for lots of characters that all share a common trait, but aren’t exactly the same? RegExp to the rescue!
We can also use ranges of letters or numbers inside those slashes. So, if I wanted to look for any letter between m and z, for example, I could do this:
Wait, what happened to the W? It should have included it, right? Well, with RegExp, you have to be very specific. the ranges of a-z exist in a different group from A-Z (technically they’re in the same range, but A-Z come after z). If you want to include both capital and lowercase letters, you would modify the RegExp a bit:
That’s better.
We can also search for more than one character at a time. If I wanted to, I could search for specific letters and some punctuation, like this:
Also of note: We have to use the #scan method here because it’s very difficult to work with RegExp results as they are. They aren’t quite strings anymore, but aren’t quite arrays, either. By using #scan on the string or #to_s on the RegExp, we can play with those results using the standard methods available to arrays or strings.
Let’s take a look at some more search options for RegExp:
What you’re seeing inside the brackets is called a ‘metacharacter’, which essentially means that RegExp is search for any occurrence of a particular class of characters. In this case, the ‘s’ represents any whitespace character, and Ruby will return the number 3 to us, since there are 3 spaces between the 4 words in our string. Remember, every single part of a string counts as a character, including spaces! Also important to remember: with metacharacters, as with letter ranges, capitalization matters. Searching for whitespace with /\s/ works, but using /\S/ would actually yield us the opposite of whitespace, i.e. all of the other characters. Also, the ‘\’ is important with metacharacters because it ‘cancels out’ the actual letter, so RegExp isn’t attempting to search for the letter ‘s’.
Just for fun, here are the other main metacharacters. See if you can figure out what each one does. I’ll even put the result next to the RegExp code.
Here’s the answer code. How’d you do?
ɹǝʇɔɐɹɐɥɔ ǝlƃuᴉs ʎu∀ = /˙/
ʇᴉƃᴉp ʎu∀ = /p\/
ʇᴉƃᴉp-uou ʎu∀ = /p\/
(ǝɹoɔsɹǝpun ‘ɹǝqɯnu ‘ɹǝʇʇǝl) ɹǝʇɔɐɹɐɥɔ pɹoʍ ʎu∀ = /+ʍ\/
ɹǝʇɔɐɹɐɥɔ pɹoʍ-uou ʎu∀ = /+M\/
ㄣ ʎlʇɔɐxƎ = /{ㄣ}+ʍ\/
I tried to throw you off a bit with a few of those. Let’s go over each one:
/./ is pretty self explanatory base on the result. It scans for every single character, including spaces, letters, numbers, punctuation, everything.
/\d/ looks for digits, which is a pretty neat trick, considering that it’s searching through a string to find them. This would be very useful when you want to pull numbers out of a string and convert them to actual integers.
/\D/ gives us the exact opposite of /\d/, meaning it looks for anything that isn’t a number. For readability, I used #join to turn the RegExp into a readable string, which gives us the entire string, minus the two numbers.
/\w/ would give us every word character, which includes letters, numbers, and even underscores. However, just /\w/ by itself would give us every single word character, separated inside the array. By using the ‘+’ modifier we can actually search for one or more occurrences of word characters, ending each search only when a non-word character is found. In this case, we’re provided with the entire word or number, separated each time there’s a space, i.e., at the end of each word. Other modifiers exist like /\w*/, which would look for 0 or more occurrences, and /\w?/, which would look for 0 or 1 occurrences. You won’t use these as much, but it’s still very important to remember them if you’re looking for items which don’t occur.
Knowing what /\w/ does, /\W/ should be obvious. It will give us every non-word character, namely the spaces and punctuation in our string.
Lastly, we have /\w+{4}/. Based on the result, you probably figured out that this modifier looks for words (or results from the w+) that have 4 characters. We can also search for a range of lengths by using {2,4}, meaning that RegExp will return words that are between 2 and 4 letters long. We can also use {3,} to ask for any words that are 3 letters or longer.
The extra modifiers you see such as the + or the {} can be used in all parts of RegExp, not just on word characters. I used them this way here in order to make the information as understandable as possible.
We are only scratching the surface of the surface with this basic guide. If you’re interested in learning more about Regular Expressions, check out Rubular, The Ruby Docs, and Tuts+ to get started.
That’s it for this week! My name is Edwin Unger, and I’m a web developer. Sort of.