Regular expressions in marketing: why are they so great?

Lima Vallantin
Wilame
Marketing Data scientist and Master's student interested in everything concerning Data, Text Mining, and Natural Language Processing. Currently speaking Brazilian Portuguese, French, English, and a tiiiiiiiiny bit of German.

I do, you do, and everyone else you know does too: we all absolutely hate regular expressions, also known as regex. But OMG, it goes without saying that they are extremely important when mining textual data.

Regular expressions are a great way to find patterns in strings, with the inconvenience of a very obscure syntax. Being patient enough to learn it is the price to pay to unlock all of regex’s potential.

If you come from the marketing and automation fields and you have no idea what I am talking about, let me introduce you to regular expressions. They are like a programming language inside programming languages. They can be used to match, extract, or replace data.

Regex can be used wherever you can identify textual patterns, like log files, CSV files, etc., and when strings are well written and free of misspellings, as in newspaper and magazine articles. But that’s not all. You can use regex to mine and extract data from Google Analytics reports and for a wide range of other tasks.

For instance, imagine that you work for an e-commerce company and you just got a Google Analytics report with 100,000 lines of the most visited pages on your website. Since you are a good marketer, you added UTM parameters to your campaigns, and now you want to group information using them.

One way to do it is by using regular expressions to capture the different parameters on the URLs. You could import the URLs and create a Pandas data frame with them. Then, you’d extract the information that you need, create new columns for them and add the extracted values.
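To make that workflow concrete, here is a minimal sketch in plain Python, using a list of dicts as a stand-in for a data frame so that it runs without extra libraries. The sample URLs (apart from the vallant.in one used later in this article) and the names `utm_pattern`, `rows` and `visits_by_source` are assumptions made up for illustration, and the pattern is a variant of the one dissected further down.

```python
import re
from collections import Counter

# Hypothetical sample of page URLs, standing in for the rows of the report
urls = [
    'https://vallant.in/?utm_source=linkedin&utm_medium=social&utm_campaign=regex2021',
    'https://vallant.in/?utm_source=twitter&utm_medium=social&utm_campaign=regex2021',
    'https://vallant.in/?utm_source=linkedin&utm_medium=social&utm_campaign=regex2021',
    'https://vallant.in/',
]

# Group 1: the parameter name after "utm_"; group 2: its value, up to the next "&"
utm_pattern = re.compile(r'utm_([a-z]+)=([^&]*)')

# One dict per URL: the URL itself plus one "column" per UTM parameter found
rows = [{'url': url, **dict(utm_pattern.findall(url))} for url in urls]

# Count visits per source, much like a group-by on a data frame column
visits_by_source = Counter(row.get('source', '(none)') for row in rows)
print(visits_by_source)
# Counter({'linkedin': 2, 'twitter': 1, '(none)': 1})
```

With Pandas, the same idea would usually go through the string methods of a column, such as `Series.str.extract`, but the logic stays the same: one pattern, one new column per captured value.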

Are you following?

Now that I have your confirmation, let’s start working with a few regular expressions. Hopefully, you will understand why they are important by the end of this article.

The basics of regular expressions in Python

Python is a great programming language. It’s versatile, relatively quick, and supports regex. However, in order to use regular expressions, you will have to import the re module first. It’s done with a simple:

import re

Our main goal here is to use regex to extract the UTM values from this URL:

https://vallant.in/?utm_source=linkedin&utm_medium=social&utm_campaign=regex2021

The first thing to do is to analyze if we can use regex here. Texts that are highly structured are very good candidates for regex.

What exactly does that mean? Well, take a look at this URL again. The UTM parameters always start with “utm_” something and end with “=”. Everything between “=” and “&” is the actual UTM value, right?

Let me break this URL so you can see what I am talking about:

https://vallant.in/?
utm_source=linkedin&
utm_medium=social&
utm_campaign=regex2021

We could say that, in general, the UTM values are always between an “=” and an “&”, EXCEPT for the last value, which runs to the end of the URL. So, yes, we could use regex here because there’s a pattern!

Now, let’s say that we want to extract the pairs of keys and values for each UTM using the format “key”,”value”, where the key is the equivalent of the UTM parameter without the “utm_” part.

In this URL, we have 3 UTM parameters, right? I want to extract data in a way that I will keep the UTM parameter, followed by its value, like this:

source   = linkedin
medium   = social
campaign = regex2021

Clear? Good 😀

One way to do it (and I am not saying that this is the best or the most convenient) is this one:

import re

# URL
url = 'https://vallant.in/?utm_source=linkedin&utm_medium=social&utm_campaign=regex2021'

# define the regex (a raw string, so Python leaves the backslash alone)
# group 1: lowercase letters; group 2: lowercase letters, digits, and underscores
regex = r'utm_([a-z]*)\=([a-z0-9_]*)'

# get the matches
matches = re.findall(regex, url)

# print the matches
print(matches)

# result
>>> [('source', 'linkedin'), ('medium', 'social'), ('campaign', 'regex2021')]

Let’s analyse the regex that I just used. Take a look at the image below:

Regular expressions: image with a regex example.

We can break this regex into 4 parts. The first one is the “utm_” part. Basically, we say that we want to match text that starts with exactly these characters.

Then, we capture what we call a group. I want the group of all lowercase letters between “utm_” and “=”.

The parentheses indicate that this is a group, and the square brackets say that I want any single character from the set written inside them (in this case, any lowercase letter from a to z). The “*” says that this character can repeat zero or more times, so the group grabs the whole run of letters.

The third part says that the “=” sign has to sit between the first and the second group. The “\” in front of it is an escape: it tells the regex to match the next character literally. Strictly speaking, “=” is not a special character in regex, so the backslash is optional here, but it is harmless. Characters that do have a special meaning, like “.” or “?”, really need it.

The last part is the second group. It captures every character listed inside its square brackets, no matter how many times it appears. The list has to cover everything a value can contain: lowercase letters and the underscore, but also digits (0-9). Without the digits, a value like “regex2021” would be cut at “regex”.
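The escaping in the third part is easy to check for yourself. The short sketch below (the test strings and variable names are made up for illustration) compares the pattern with and without the backslash in front of “=”, and then does the same with “.”, which is a genuine special character.

```python
import re

text = 'utm_source=linkedin'

# "=" has no special meaning in a regex, so both patterns find the same thing
with_escape = re.findall(r'utm_([a-z]*)\=', text)
without_escape = re.findall(r'utm_([a-z]*)=', text)
print(with_escape, without_escape)
# ['source'] ['source']

# "." is different: unescaped, it matches any character, not just a period
domains = 'vallant.in vallantxin'
unescaped_dot = re.findall(r'vallant.in', domains)
escaped_dot = re.findall(r'vallant\.in', domains)
print(unescaped_dot)
# ['vallant.in', 'vallantxin']
print(escaped_dot)
# ['vallant.in']
```

Writing patterns as raw strings (the `r` prefix) keeps Python from interpreting the backslashes before the regex engine sees them.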

The notion of groups here is important because, when a pattern contains groups, re.findall returns only what’s inside the parentheses and leaves out the rest of the match. If you want, you can use a regex tester to see each step of the process.

In the image above, you can see a lot of different colors. The light blue color is what we call a full match. It’s basically everything that the regex matched. But since we only want what’s inside the groups, we have to look at what’s in green and in pink:

This is what will be returned!
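The same distinction between the full match and the groups can be seen in code. This is a minimal sketch using `re.finditer`, which gives access to both; the pattern has the same shape as the one in this article, with digits allowed in the second group so that “regex2021” is captured whole, and the variable names are only illustrative.

```python
import re

url = 'https://vallant.in/?utm_source=linkedin&utm_medium=social&utm_campaign=regex2021'
pattern = re.compile(r'utm_([a-z]*)=([a-z0-9_]*)')

# group(0) is the full match; group(1) and group(2) are the captured groups
full_matches = [match.group(0) for match in pattern.finditer(url)]
groups = [(match.group(1), match.group(2)) for match in pattern.finditer(url)]

print(full_matches)
# ['utm_source=linkedin', 'utm_medium=social', 'utm_campaign=regex2021']
print(groups)
# [('source', 'linkedin'), ('medium', 'social'), ('campaign', 'regex2021')]

# re.findall returns the groups only, which is the same list as above
print(pattern.findall(url) == groups)
# True
```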

Don’t worry. As I said, regex is not easy. And I still hate it :D. But I hope that you now see why regex is worth learning. You may argue that you are a marketer and not a developer, but the two profiles are converging more and more.

Are regular expressions the solution for everything?

Absolutely NOT. You could accomplish the same result by using the code below:

url = 'https://vallant.in/?utm_source=linkedin&utm_medium=social&utm_campaign=regex2021'

# Remove the URI root and split the arguments 
matches = url[url.index('?') + 1:].split('&')

# Split the matches again by key/value and convert into tuple
matches = [tuple(match.replace('utm_','').split('=')) for match in matches]

print(matches)
>>> [('source', 'linkedin'), ('medium', 'social'), ('campaign', 'regex2021')]

But let’s be honest… this is a mess. Regular expressions are SOMETIMES an elegant solution for difficult problems. Just don’t go crazy on them. Since they are hard to read, you may forget what you were trying to accomplish when you first wrote them.

A second problem is that this split-based solution keeps every parameter in the query string, not just the UTM ones, so it can return much more data than we really need.
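That second problem can be handled without regex too. As a hedged sketch, Python’s standard `urllib.parse` module can split the query string for you, and a simple filter keeps only the keys that start with “utm_”. The extra `page=2` parameter is a hypothetical addition to the example URL, there only to show the filter at work.

```python
from urllib.parse import urlsplit, parse_qsl

# Same URL as before, plus a hypothetical non-UTM parameter
url = 'https://vallant.in/?utm_source=linkedin&utm_medium=social&utm_campaign=regex2021&page=2'

# parse_qsl turns the query string into a list of (key, value) pairs
all_pairs = parse_qsl(urlsplit(url).query)

# Keep only the UTM parameters and drop the "utm_" prefix from the key
utm_pairs = [
    (key[len('utm_'):], value)
    for key, value in all_pairs
    if key.startswith('utm_')
]

print(utm_pairs)
# [('source', 'linkedin'), ('medium', 'social'), ('campaign', 'regex2021')]
```

Whether this or the regex reads better is a matter of taste. The parser takes care of URL-encoded values, while the regex stays a one-liner.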

If you need help with regular expressions, the cheat sheet below is a good starting point.

Regular expressions: basic cheat sheet

The basic regex cheat sheet can be found on RegexOne.

abc…      : letters
123…      : digits
\d        : any digit
\D        : any non-digit
.         : any character (except a line break)
\.        : period
[abc]     : only a, b, or c
[^abc]    : anything but a, b, or c
[a-z]     : characters a to z
[0-9]     : numbers 0 to 9
\w        : any alphanumeric character or underscore
\W        : any character that is not alphanumeric or underscore
{m}       : m repetitions
{m,n}     : m to n repetitions
*         : zero or more repetitions
+         : one or more repetitions
?         : zero or one time
\s        : any whitespace character
\S        : any non-whitespace character
^…$       : start and end of a string
(…)       : group
(a(bc))   : sub-group
(.*)      : capture everything in a group
(abc|def) : abc or def
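A few of these building blocks in action, as a small sketch with made-up test strings. Each line uses one or two entries from the cheat sheet on marketing-flavored text.

```python
import re

# \d and {m}: exactly four digits in a row
years = re.findall(r'\d{4}', 'regex2021 launched in 2021')
print(years)
# ['2021', '2021']

# (abc|def): one alternative or the other
params = re.findall(r'utm_(source|medium)', 'utm_source=linkedin&utm_medium=social&utm_campaign=regex2021')
print(params)
# ['source', 'medium']

# ^…$ with \w and +: the whole string must be letters, digits, or underscores
clean_value = bool(re.search(r'^\w+$', 'regex2021'))
messy_value = bool(re.search(r'^\w+$', 'regex 2021'))
print(clean_value, messy_value)
# True False

# ?: the "s" is optional
schemes = re.findall(r'https?', 'http https')
print(schemes)
# ['http', 'https']

# \s and +: split on any run of whitespace
words = re.split(r'\s+', 'linkedin   social\tregex2021')
print(words)
# ['linkedin', 'social', 'regex2021']
```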
