One of my favorite things to write about, back in the VKI days, was RegEx. An incredibly useful tool for people doing anything from simple find and replace scripts in Notepad++ to server admins redirecting pages, RegEx is one of those tools that you really should be familiar with if you work in our industry.
Sprechen Du Regex?
Regex commands can vary in complexity from simple to brain meltingly complex depending on how much “language” (and more importantly: logic) you use with them. The following is a hefty (but not complete) selection of regex terms:
. : The period is a wild card. It can represent any character whatsoever (in most engines, anything except a line break).
+ : repeats the previous character 1 or more times.
* : repeats the previous character 0 or more times.
() : Parentheses represent a set of “tokens” or rule elements. For instance, (.+) would match any set of characters. This allows you to apply an operator to an entire group. So for instance, if you wanted to match the word “what” you would type “what”, but if you wanted it to also catch “whatwhat” then you could use “(what)+”.
Parentheses also create a “back reference”, which can be recalled with a special symbol in many regex engines (in Google Analytics, for instance, you would use $).
[] : Square brackets represent a “character class”, and are often used for ranges. For instance [a-t] would match any lower case letter between a and t. You can also have multiple items within a bracket, such as [a-zA-Z0-9\s\-#"=] which would match any single letter, number, whitespace character, hyphen, number sign, quotation mark, or equals sign. (Yes, this could be shortened to [\w\s\-#"=], which also lets in the underscore, but I was making a point about ranges)
{} : Curly brackets are odd. They define repetition. So (what){2} would only match two repetitions of what (whatwhat). Alternatively (what){2,7} would match between two and seven repetitions of what (so 3, 4, 5, or 6 repetitions match as well).
\d : Represents any digit
\s : Represents any whitespace element (space, tab, etc.)
\w : Represents any alphanumeric character or underscore
\D \S \W : Negation of the above, so not a digit, not a white space, etc.
$ : The dollar sign matches the end of a string. In htaccess it can also be used to recall groups that have previously been defined by parentheses.
^ : The caret has two purposes. It can match the start of a string, but it can also negate the characters in a character set. So ^[a-z]$ will only match a string that consists of a single lower case letter, while [^a-z] will match any single character that is not a lower case letter, and therefore any string containing at least one such character. So aaa will not match, aAa will match, and AAA will match.
- : A hyphen creates a range inside square brackets. For instance, [a-z] would match any character from a to z (though not any uppercase characters).
| : The bar stands for “or”. So a|b will match a or b.
\ : The backslash means “literally”. So while “.” would match any character, “\.” would only match periods. Similarly, while “?” is an operator (see below), “\?” would match a question mark. In certain implementations of regex (e.g. Notepad++) the backslash can also be used with numbers (\1, \2, etc.) to repeat areas that have previously been defined by parentheses (same as $1, $2, etc. in htaccess).
? : Question marks have a lot of uses. Following an expression, it makes that expression optional, so the match works whether or not it is there. So for example “^(1080 )?Howe st” would match “1080 Howe st.” or “Howe st.” but not “64 Howe st.” (note the parentheses: square brackets would turn it into a character class), while “64?” would match “6” or “64”. The question mark also has the dual purpose of making an expression “lazy” (normally regex is greedy). Greed and laziness make my head hurt so I’ll just leave this one to LunaMetrics (good greed and bad greed).
(?i) : I said question marks have a lot of uses. This command turns on case insensitivity. So, oh (?i)my gosh will match “oh my gosh” and “oh MY GOSH“.
(?-i) : Yep, a negative sign. Reverses what (?i) does, turning off case insensitivity (yay double negatives). Think of (?i) and (?-i) as HTML’s <> and </> and you’ll have the idea.
(?=) : A lookahead. It matches the preceding expression only when it is followed by whatever comes after the equals sign, without including that part in the match. So in “oh my g… OH MY GOSH”, G(?=O) would match the G in GOSH.
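If you would rather poke at these than memorize them, here is a minimal sketch using Python's built-in re module. Python is only my pick for illustration, and the names CHECKS and run_checks are made up for this example. Engines differ in the details: recent Python versions, for instance, expect the scoped form (?i:...) rather than a (?i) dropped into the middle of a pattern, so that is what the sketch uses.

```python
import re

# Each entry: (what it shows, pattern, text, whether a match is expected)
CHECKS = [
    ("+ on a group", r"^(what)+$", "whatwhat", True),
    ("{2} is exactly two", r"^(what){2}$", "what", False),
    ("{2,7} allows three", r"^(what){2,7}$", "whatwhatwhat", True),
    (r"\d \s \w", r"^\d\s\w$", "7 a", True),
    (r"\D \S \W", r"^\D\S\W$", "a7!", True),
    ("^ and $ around one class", r"^[a-z]$", "qq", False),
    ("[^a-z] needs a non-lowercase character", r"[^a-z]", "aaa", False),
    ("[^a-z] finds the A", r"[^a-z]", "aAa", True),
    ("escaped period is literal", r"^\.$", "x", False),
    (r"backreference \1", r"^(\w+) \1$", "what what", True),
    ("optional group", r"^(1080 )?Howe st", "Howe st.", True),
    ("optional group, wrong number", r"^(1080 )?Howe st", "64 Howe st.", False),
    ("scoped case insensitivity", r"oh (?i:my gosh)", "oh MY GOSH", True),
]


def run_checks(checks):
    """Return (description, passed) pairs; passed means the outcome was as expected."""
    results = []
    for description, pattern, text, expected in checks:
        found = re.search(pattern, text) is not None
        results.append((description, found == expected))
    return results


for description, passed in run_checks(CHECKS):
    print(("ok   " if passed else "FAIL ") + description)

# Greedy vs lazy: the same pattern with and without the trailing question mark
greedy = re.search(r"<.+>", "<b>bold</b>").group()   # "<b>bold</b>"
lazy = re.search(r"<.+?>", "<b>bold</b>").group()    # "<b>"
print(greedy, lazy)

# Lookahead: only a capital G that is followed by O is matched, and the O is not consumed
TEXT = "oh my g... OH MY GOSH"
lookahead = re.search(r"G(?=O)", TEXT)
print(lookahead.group(), lookahead.start())
```

Swap in your own patterns and strings in CHECKS to see how your engine of choice compares.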
Got all of that remembered? No? I doubt anyone does.
Sprechen Sie Regex?
So how can we use this? Here’s a neat trick.
Say you want to know if there is a behavioral difference between people using longer keyphrases and people using shorter ones. One might assume that longer keyphrases would convert more, since they are more specific, and there is a greater chance that a user is finding exactly what they are looking for. But why on earth would you assume when you have analytics?
Fortunately, a commenter on Avinash Kaushik’s blog has a neat trick for doing this using regular expressions.
Make a new advanced segment with ‘keyword’ ‘matching RegEx’ and input one of the following:
- ^\s*[^\s]+\s*$ – one keyword
- ^\s*[^\s]+(\s+[^\s]+){1}\s*$ – two keywords
- ^\s*[^\s]+(\s+[^\s]+){2}\s*$ – three keywords
- ^\s*[^\s]+(\s+[^\s]+){3}\s*$ – four keywords
So this reads as:
Start of line, then whitespace repeated zero or more times (\s*), followed by not-a-whitespace ([^\s]) one or more times, followed by whitespace zero or more times, then end of line. Then if you want more than one keyword, you add a group of whitespace plus not-a-whitespace, (\s+[^\s]+), and put a repeat ({number}) on it. Repeat once for two keywords, twice for three, etc.
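To see that reading piece by piece, here is a rough sketch in Python that spells the same pattern out with one comment per part (re.VERBOSE lets a pattern carry whitespace and comments). The keyword_pattern helper and the sample phrases are my own inventions for illustration. Google Analytics takes the one-line form shown in the list, not this commented version.

```python
import re


def keyword_pattern(extra_words):
    """Build the pattern for a phrase of 1 + extra_words words (hypothetical helper)."""
    return re.compile(rf"""
        ^                               # start of the keyphrase
        \s*                             # any leading whitespace
        [^\s]+                          # the first word: one or more non-whitespace characters
        (\s+[^\s]+){{{extra_words}}}    # whitespace plus another word, repeated extra_words times
        \s*                             # any trailing whitespace
        $                               # end of the keyphrase
    """, re.VERBOSE)


two_words = keyword_pattern(1)

for phrase in ["regex tutorial", "  regex   tutorial ", "regex", "learn regex fast"]:
    verdict = "two words" if two_words.search(phrase) else "not two words"
    print(repr(phrase), "->", verdict)
```

Only the first two phrases count as two words. The stray spaces in the second one are soaked up by the \s* and \s+ parts.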
You can also do ranges such as:
- ^\s*[^\s]+(\s+[^\s]+){1,4}\s*$ – two to five keywords
- ^\s*[^\s]+(\s+[^\s]+){5,}\s*$ – six or more keywords
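If you have a keyword export handy and want to sanity check the ranges before building segments, something like the following Python sketch would do it. The bucket labels, the bucket_for helper, and the sample keyphrases are all made up for the example, and nothing here touches Google Analytics itself.

```python
import re
from collections import Counter

# The range patterns from above, tried in order
BUCKETS = [
    ("one word", r"^\s*[^\s]+\s*$"),
    ("two to five words", r"^\s*[^\s]+(\s+[^\s]+){1,4}\s*$"),
    ("six or more words", r"^\s*[^\s]+(\s+[^\s]+){5,}\s*$"),
]


def bucket_for(keyphrase):
    """Return the label of the first bucket whose pattern matches (hypothetical helper)."""
    for label, pattern in BUCKETS:
        if re.search(pattern, keyphrase):
            return label
    return "no match"


# Invented sample data standing in for a real keyword report
sample = [
    "regex",
    "google analytics",
    "regex tips and tricks",
    "how to use regex in google analytics",
    "",
]

counts = Counter(bucket_for(phrase) for phrase in sample)
for label, count in counts.items():
    print(label, count)
```

The empty string lands in "no match" because every pattern insists on at least one non-whitespace character.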
There you go. Try those out and let us know how you find longer phrases affect site metrics.
Keep an eye on the blog over the next couple of weeks as we post more Regex tips and tricks that you can use both within Google Analytics and other Regex engines.