Regex Basics

Image - Regex Basics

Regex Basics, also known as Regular Expressions, starts with the symbols and basic syntax used to find and manipulate text. Use character sets and repetition expressions to create flexible matching patterns. The Regex engine parses text to find matches and uses greedy and lazy strategies. Capture text as the engine finds matches, and refer back to it later by using backreferences. Discover look-around assertions to create complex matching patterns. Complex Regex can search for almost anything, such as email addresses, phone numbers, URLs, prices, zip codes, etc. Build step by step and explore some of the common and useful Regular Expressions. Write Regular Expressions for pattern matching, text manipulation, and parsing data. In JavaScript, a regular expression is also an object. Regex comes in different flavors depending on the programming language or the application. The PCRE flavor stands for Perl Compatible Regular Expressions. It is credited with popularizing the usage of RegEx and is also used in PHP.

Introduction

Regex Basics helps you search, validate, and transform strings based on distinct rules. It is a tool for pattern matching and text manipulation: slice and dice through strings to extract information and alter data easily. Regex Basics lets Digital Marketers work with textual data and adds to their ability to handle complex string operations. You can depend on Regex Basics to craft complex search patterns, clean up messy datasets, and validate user input.

Regex Basics – Tools

RegExRX for Mac and RegexBuddy for Windows are paid tools. One of the free options for writing regular expressions is RegEx101.

In RegEx101, select the PCRE (PHP) flavor. To spread the search pattern over multiple lines, press the Return key. Be sure to add the free-spacing mode (?x) as an inline modifier at the start of the search pattern when you need to use several lines.

fig1 - Regex Basics - Inline modifier-(?x)

Literal matches are characters in the search pattern that match the same characters in the source text exactly as typed (fig1).

fig2 - Regex Basics - Literal match

For example, type the text 55555 in the source text (fig2). Then type 55 in the search pattern. The engine searches from left to right. It matches the first 55, and those characters cannot be matched again because they have already been consumed. It then finds another 55, which is a second match, so we get two different matches. The single 5 left over does not match because the pattern needs two 5s. In the same way, the pattern 5555 matches the characters 5555 literally (case sensitive).
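The left-to-right, non-overlapping behavior described above can be reproduced outside RegEx101. The sketch below uses Python's built-in re module purely for illustration; the article itself works in the PCRE flavor, and the variable names are only assumptions for this example.

```python
import re

source = "55555"

# findall scans left to right and returns non-overlapping matches,
# so characters consumed by one match are not reused by the next.
matches = re.findall(r"55", source)

print(matches)  # ['55', '55'] - the trailing single 5 is left unmatched
```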

fig3 - RegEx Characters

Regex Basics – Character Classes

fig4 - Regex Basics - Character Class

In the search pattern, type [A-Z0-9a-z] (fig4). It will match every character in the source text CBAPQRSabcdefghMNOPPONMGHIJijklmnopqrstuvwxyz287640.

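For example, in the source text above, [a-z] matches only the lower-case letters, while [A-Za-z] matches both upper and lower case. The negation [^a-z] matches everything that is not a lower-case letter, here the upper-case letters and the numbers. A caret outside the brackets, as in ^[^a-z], has a different meaning. It is referred to as an anchor. To match a dash, put it first or last in the class: [-a-z] or [a-z-].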
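A short sketch can make the difference between a class, a negated class, and an anchored class concrete. It uses Python's re module for illustration only, with the source text from fig4; the variable names and the small "a-b" test string are assumptions for this example.

```python
import re

source = "CBAPQRSabcdefghMNOPPONMGHIJijklmnopqrstuvwxyz287640"

# [a-z] matches lower-case letters only.
lower = "".join(re.findall(r"[a-z]", source))

# [^a-z] is the negation: everything that is NOT a lower-case letter.
not_lower = "".join(re.findall(r"[^a-z]", source))

# A caret outside the brackets is an anchor (start of the text).
first_non_lower = re.search(r"^[^a-z]", source).group()

# A dash placed first (or last) in the class is matched literally.
with_dash = re.findall(r"[-a-z]", "a-b")

print(lower)            # abcdefghijklmnopqrstuvwxyz
print(not_lower)        # CBAPQRSMNOPPONMGHIJ287640
print(first_non_lower)  # C
print(with_dash)        # ['a', '-', 'b']
```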

fig5 - Regex Basics - Character Class @

[@] will match each @ in the Source Text #@&^!(@)%&(&@#%. [^@^] is the negation of [@^]: it matches every character except @ and ^.

ASCII is a standard character encoding for text on computers and the internet.

If we use a range that starts at the space and ends at the tilde by typing [\ -~] in the search pattern, it will match every printable ASCII character in the source text #@&^!(@)%&(&@#%abcdefABCDEF12340 (fig5).

fig6 - Regex Basics - Alternation

Alternations

Put apple|mango in the search pattern and apple mango banana in the source text. The | (pipe) is referred to as the alternation. Both apple and mango in the source text get matched (fig6).

The alternation first tries to match what is on the left of the pipe. If that fails, it tries the next alternative, which is mango. Going through the source text: does apple match the first word? Yes. At the next word, does apple match? No. Does mango match? Yes. Does apple or mango match banana? No. You can keep adding as many alternatives as you want.
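The walk-through above can be checked with a few lines of code. This is a minimal sketch using Python's re module (an assumption for illustration, not the PCRE tester used in the figures), with the same pattern and source text.

```python
import re

source = "apple mango banana"

# At each position the engine tries "apple" first, then "mango".
matches = re.findall(r"apple|mango", source)

print(matches)  # ['apple', 'mango'] - banana matches neither alternative
```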

Metacharacters

Characters in RegEx can be either

  • A regular character with a literal meaning
  • Metacharacter with a special meaning.

Literal Characters

A single literal character, such as ‘a’, matches the first occurrence of that character in the string ‘mango is a fruit’. Any letter (A to Z, a to z), number (0 to 9), or keyboard character such as the tilde ~ or the exclamation mark ! can be used as a single-character pattern. The special characters below cannot be used this way because they have special meaning in RegEx. To match one of them literally, escape it with a backslash.

Special Characters

Metacharacters are characters with special meaning.

They are

  • the opening curly brace {,
  • pipe symbol |,
  • the plus sign +,
  • the asterisk *,
  • the caret ^,
  • the dollar sign $,
  • the opening parenthesis (,
  • the closing parenthesis ),
  • the question mark ?,
  • the dot followed by the asterisk .* (any character, any number of times),
  • the opening square bracket [,
  • the dash - (special only inside a character class),
  • the backslash \,
  • the period or dot (.)

A multiple-character RegEx can be created with

  • A mixture of letters, digits, and other keyboard characters
  • All letters, all digits, all special keyboard characters.

In many programming languages, the compiler processes certain characters inside a string, such as the backslash, before the RegEx library sees the pattern. You need to know which characters get this special treatment inside strings in the programming language you use.
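As one illustration of that string-processing step, the sketch below shows how Python handles it: a raw string (r"...") passes backslashes through to the regex engine untouched, and re.escape adds the backslashes needed to treat metacharacters literally. The sample text and variable names are hypothetical, and other languages have their own escaping rules.

```python
import re

price_text = "Total: $9.99 (approx.)"  # hypothetical sample text

# In a normal string the backslash must be doubled; a raw string avoids that.
assert "\\d" == r"\d"

# \$ and \. match a literal dollar sign and a literal dot.
literal_price = re.findall(r"\$\d+\.\d{2}", price_text)

# re.escape() inserts the backslashes when the text to find is literal.
found = re.search(re.escape("$9.99"), price_text)

print(literal_price)  # ['$9.99']
print(found.group())  # $9.99
```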

fig7 - Regex Basics - Quantifiers

Regex Basics – Quantifier & Iteration

With a Greedy Quantifier, the Regex engine matches as many occurrences of a particular pattern as possible, whereas a Lazy Quantifier stops as soon as it has matched the minimum that the pattern requires.
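A minimal sketch of the difference, using Python's re module for illustration and the ‘WWWWorld19’ source text that appears later in this section. The lazy form W+? is an assumed example: it is the + quantifier with a ? added to make it lazy.

```python
import re

source = "WWWWorld19"

# Greedy: take as many W's as possible.
greedy = re.search(r"W+", source).group()

# Lazy: stop as soon as the minimum (one W) has been matched.
lazy = re.search(r"W+?", source).group()

print(greedy)  # WWWW
print(lazy)    # W
```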

+ Quantifier

+ says one or more. W+ matches both the W’s in the Source Text ‘WWorld19’ as a single match: if it finds 1 W it matches it, 2 W’s it matches them, 100 W’s it matches them, and so on.

? Quantifier

? says zero or one. In the Source Text ‘WWorld19’, W? matches one W at a time, so the two W’s are two separate single-character matches. Two W’s cannot be matched together. Zero W’s can also be matched, which shows up as empty matches at the other positions.

* Quantifier

* says zero or more (an unlimited amount). So zero W’s, WW or WWWW are all allowed, and W* matches the WWWW in the Source Text ‘WWWWorld19’ as a single match.
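The three quantifiers can be compared side by side. The sketch below uses Python's re module for illustration; because ? and * are allowed to match zero characters, they also return empty matches, which are filtered out here so that only the visible matches remain. Variable names are assumptions for this example.

```python
import re

# + : one or more, so both W's come back as one match.
plus = re.findall(r"W+", "WWorld19")

# ? : zero or one, so each W is its own match (empty matches dropped).
optional = [m for m in re.findall(r"W?", "WWorld19") if m]

# * : zero or more, so the four W's come back as one match.
star = [m for m in re.findall(r"W*", "WWWWorld19") if m]

print(plus)      # ['WW']
print(optional)  # ['W', 'W']
print(star)      # ['WWWW']
```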

fig8 - +Quantifier with Ranges

Quantifier with Ranges

[a-z]+ matches ‘orld’ and ‘i’ in the Source Text ‘WWorld19iN’. [a-zA-Z]+ matches all lower and upper case letters, ‘WWorld’ and ‘iN’, in the Source Text ‘WWorld19iN’ (fig8).
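A quick check of the two ranges, sketched with Python's re module for illustration. The second pattern is written as [a-zA-Z], on the assumption that this is the intended class for all lower and upper case letters.

```python
import re

source = "WWorld19iN"

lower_runs = re.findall(r"[a-z]+", source)
letter_runs = re.findall(r"[a-zA-Z]+", source)

print(lower_runs)   # ['orld', 'i'] - the upper-case N is not in [a-z]
print(letter_runs)  # ['WWorld', 'iN']
```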

fig9 - Iteration

Iteration is just like a Quantifier, but it matches a particular number of times. An iteration uses curly braces (fig9).
5{4} matches the Source Text 5555.
5{2,4} is a range of 2 to 4 times and matches the Source Text 5555.
5{2,4}? is non-greedy and matches in sets of two in the Source Text 555555. We get three different matches: 55, 55 and 55.
5{2,4} defaults to greedy, so in the Source Text 555555 it matches 5 four times and then grabs the remaining two for the next match.
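The greedy and non-greedy iterations above can be reproduced as follows. This is a sketch in Python's re module, used for illustration only, with the same patterns and source texts.

```python
import re

exact = re.findall(r"5{4}", "5555")

# Greedy range: take up to four 5s first, then whatever is left.
greedy = re.findall(r"5{2,4}", "555555")

# Lazy range: take the minimum (two 5s) each time.
lazy = re.findall(r"5{2,4}?", "555555")

print(exact)   # ['5555']
print(greedy)  # ['5555', '55']
print(lazy)    # ['55', '55', '55']
```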

Iterations also work with Character Classes:
\w matches each character of the Source Text ‘World’ separately: ‘W’, ‘o’, ‘r’, ‘l’, ‘d’.
\w{5} matches the Source Text ‘World’.
\w{2} will grab the first two characters ‘Wo’ and then the next two characters ‘rl’.
\w{6} will not match as the Source Text is a 5-letter word.
\w{3} will grab the first three characters ‘Wor’.
[a-zA-Z0-9_]{2} will grab two matches of 2 characters, ‘Wo’ and ‘rl’, from the Source Text ‘World’.
[a-zA-Z0-9_]{1,2} will grab three matches, ‘Wo’, ‘rl’ and ‘d’, from the Source Text ‘World’.
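The same results can be listed in one pass. The sketch below uses Python's re module for illustration; the dictionary name results is just an assumption for this example.

```python
import re

source = "World"
patterns = [r"\w", r"\w{5}", r"\w{2}", r"\w{6}", r"\w{3}",
            r"[a-zA-Z0-9_]{2}", r"[a-zA-Z0-9_]{1,2}"]

# Map each pattern to the list of matches it finds in the source text.
results = {p: re.findall(p, source) for p in patterns}

for pattern, found in results.items():
    print(pattern, "->", found)
# \w{6} gives an empty list because 'World' has only five characters.
```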

fig10 - Capture 10 digit number

Regex Basics – Capture Groups & Non-Capture Groups

In a Capture Group, the matched character sequences are captured. Parentheses group part of the regex so that a quantifier can be applied to the whole group. The part of the string matched by the regex inside the parentheses is stored in a numbered capturing group for possible re-use with a numbered backreference (fig10).

Group 0 – 950-784-7659, Group 1 – 784, Group 2 – 7659
\d matches 9,5,0,7,8,4,7,6,5,9
\d{3} matches 950, 784 and 765
\d{3}[-.)]\d matches 950-7
\d{3}[-.)]\d{3} matches 950-784
\d{3}[-.)]\d{3}[-.] matches 950-784-
\d{3}[-.)]\d{3}[-.]\d matches 950-784-7
\d{3}[-.)]\d{3}[-.]\d{4} matches 950-784-7659
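The group numbers listed above suggest that parentheses were placed around the second and third blocks of digits. The sketch below assumes exactly that placement (the full grouped pattern is not spelled out in the text) and uses Python's re module for illustration. The masked replacement at the end is a hypothetical use of numbered backreferences.

```python
import re

phone = "950-784-7659"

# Assumed grouping: (\d{3}) is Group 1 and (\d{4}) is Group 2.
pattern = re.compile(r"\d{3}[-.)](\d{3})[-.](\d{4})")
match = pattern.search(phone)

print(match.group(0))  # 950-784-7659 (Group 0 is the full match)
print(match.group(1))  # 784
print(match.group(2))  # 7659

# Numbered backreferences \1 and \2 re-use the captured text.
masked = pattern.sub(r"***-\1-\2", phone)
print(masked)          # ***-784-7659
```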

fig11 - Non-Capture Group

Non-capture groups (fig11) do not store anything. Use (?:...) to create a non-capture group. It can be used when we need the grouping but do not need the group to capture its match.
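To see the effect, one group of the phone-number pattern can be switched to a non-capture group. This is a sketch under the assumption that the middle block of digits is the one we do not need to store; it uses Python's re module for illustration.

```python
import re

phone = "950-784-7659"

# (?:\d{3}) still groups the middle digits but stores nothing,
# so the last four digits become Group 1.
match = re.search(r"\d{3}[-.)](?:\d{3})[-.](\d{4})", phone)

print(match.group(0))  # 950-784-7659
print(match.groups())  # ('7659',)
```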

fig12 - Look around - Lookahead & Lookbehind

Look Around

There are two types of look around: look ahead and look behind (fig12). These are zero-length assertions, which means they check characters the way a normal pattern would, but they give up that match immediately and only return the result ‘match’ or ‘no match’. The characters they check do not become part of the overall match.

fig13 - Look around - Negative Lookahead

Positive Lookahead: sweet(?=\ apple) in the search pattern will match ‘sweet’ but not ‘apple’ for ‘sweet apple’.
Negative Lookahead: sweet(?!\ apple) in the search pattern will match only ‘sweet’, and not ‘mango’, ‘watermelon’ or ‘peach’, for ‘sweet mango’, ‘sweet watermelon’ and ‘sweet peach’. It will not match ‘sweet apple’ (fig13).
Positive Lookbehind: (?<=sweet\ )apple in the search pattern will match ‘apple’ but not ‘sweet’ for ‘sweet apple’.
Negative Lookbehind: (?<!sweet\ )apple in the search pattern will match ‘apple’ when it comes after words such as ‘red’, ‘green’ or ‘custard’, without matching those words. It will not match the ‘apple’ in ‘sweet apple’.
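The four assertions can be tried on one short line of text. The sketch below uses Python's re module for illustration, with a hypothetical sample sentence and a plain space instead of the escaped space (the escape is only needed in free-spacing mode). The negative lookbehind is written as (?<!...), which is the standard syntax for it. The helper starts is an assumption for this example and reports where each match begins.

```python
import re

text = "sweet apple, sweet mango, red apple"  # hypothetical sample text

def starts(pattern):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

pos_ahead = starts(r"sweet(?= apple)")    # 'sweet' followed by ' apple'
neg_ahead = starts(r"sweet(?! apple)")    # 'sweet' NOT followed by ' apple'
pos_behind = starts(r"(?<=sweet )apple")  # 'apple' preceded by 'sweet '
neg_behind = starts(r"(?<!sweet )apple")  # 'apple' NOT preceded by 'sweet '

print(pos_ahead, neg_ahead, pos_behind, neg_behind)  # [0] [13] [6] [30]
```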

fig14 - Word Boundary

Word Boundary

A word boundary is the position between a word character and a character that is not a word character (fig14). The non-word character can be a dash as in ‘spider-web’, a space as in ‘spider web’, a tab, etc. Numbers in Regex are considered word characters. A word boundary is a zero-length assertion.

Type the word ‘web’ in the search pattern. It will be found in all its forms in the source text.
web\b matches ‘web’ directly before a word boundary, that is at the end of ‘spiderweb’ and as the whole word ‘web’.
\bweb matches ‘web’ directly after a word boundary, that is at the start of ‘webspider’, as the whole word ‘web’, and in ‘spider-web’ and ‘cob-web’.
\bweb\w+ matches ‘webspider’, where ‘web’ starts a longer word, but not the stand-alone word ‘web’.
\w+web\b matches ‘spiderweb’, where ‘web’ is preceded by a word character.
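The boundary patterns can be compared on one line of text. The sketch below uses Python's re module for illustration; the source text is a hypothetical line assembled from the words mentioned above, since the text of fig14 is not reproduced here.

```python
import re

text = "web spiderweb webspider spider-web cob-web"  # hypothetical source text

ends_with_web = len(re.findall(r"web\b", text))    # web, spiderweb, spider-web, cob-web
starts_with_web = len(re.findall(r"\bweb", text))  # web, webspider, spider-web, cob-web
whole_word = len(re.findall(r"\bweb\b", text))     # web, spider-web, cob-web
longer_start = re.findall(r"\bweb\w+", text)
longer_end = re.findall(r"\w+web\b", text)

print(ends_with_web, starts_with_web, whole_word)  # 4 4 3
print(longer_start)                                # ['webspider']
print(longer_end)                                  # ['spiderweb']
```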

Anchor

Take two strings, ‘site’ and ‘sitemap’. Suppose you want to match ‘site’ only when it is on its own, and not when it is part of a longer text. Do this by using the anchors ^ and $.

site – matches both ‘site’ and ‘sitemap’ in the source text
^site$ – matches ‘site’ only if it is on its own and does not match the word ‘sitemap’
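A small sketch of the difference, using Python's re module for illustration. The list name words and the two result names are assumptions for this example.

```python
import re

words = ["site", "sitemap"]

# Without anchors, 'site' is found inside 'sitemap' as well.
unanchored = [w for w in words if re.search(r"site", w)]

# ^ pins the match to the start and $ to the end of the string.
anchored = [w for w in words if re.search(r"^site$", w)]

print(unanchored)  # ['site', 'sitemap']
print(anchored)    # ['site']
```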

fig15 - 's' modifier

Modifiers

The common regex modifiers are m, i, s and x, combined inline as (?misx).

(?m)\w+$ -> modifier ‘m’ (multi-line) matches the last word of each string, that is fun and thrilling
(?m)^\w+ -> modifier ‘m’ matches the first word of each string, that is Hunting and SEEING
(?i)[a-z]+ -> modifier ‘i’ is the case-insensitive modifier that matches all upper and lower case text in the strings
(?s).+ -> modifier ‘s’ lets the dot match line breaks, so it matches both the strings as one match (fig15)
(?x)\w+ -> modifier ‘x’ (free-spacing) ignores whitespace in the pattern and matches all the words in the two strings
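The modifiers can be tried on a two-line text. The exact strings from fig15 are not reproduced in the article, so the sketch below assumes a hypothetical pair of lines that start with Hunting and SEEING and end with fun and thrilling. Python's re module accepts the same inline modifiers at the start of a pattern and is used here for illustration only.

```python
import re

# Hypothetical two-line source text, built from the words named above.
text = "Hunting is fun\nSEEING is thrilling"

last_words = re.findall(r"(?m)\w+$", text)     # m: $ matches at each line end
first_words = re.findall(r"(?m)^\w+", text)    # m: ^ matches at each line start
any_case = re.findall(r"(?i)[a-z]+", text)     # i: ignore case
whole_text = re.findall(r"(?s).+", text)       # s: dot also matches the line break
free_spacing = re.findall(r"(?x) \w+   # one or more word characters", text)

print(last_words)       # ['fun', 'thrilling']
print(first_words)      # ['Hunting', 'SEEING']
print(len(any_case))    # 6
print(len(whole_text))  # 1 - both lines come back as a single match
print(free_spacing == any_case)  # True
```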

Build logical patterns using Regex. These patterns identify strings of text that fit the pattern. Most programming languages support Regex. Typical uses include identifying files on a computer that end with a given extension, validating an email address entered in an online form, and performing redirects for URLs recognized with a Regex pattern.

fig16 - Regex Tester - regex101.com

Regex Basics – Test

Test your regex by entering a “Test String” and a “Regular Expression” at https://regex101.com/. Fig16 above shows a test string for query parameter extraction, say for “utm_source”, and the regular expression ([^&]+), which captures everything up to the next & character.
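A minimal sketch of query parameter extraction follows. It assumes the capture group is preceded by the literal text utm_source= and uses a hypothetical example.com URL; neither is taken from fig16. Python's re module is used for illustration only.

```python
import re

# Hypothetical URL with UTM parameters.
url = "https://example.com/page?utm_source=newsletter&utm_medium=email"

# ([^&]+) captures one or more characters that are not '&',
# i.e. the value up to the next parameter.
match = re.search(r"utm_source=([^&]+)", url)

utm_source = match.group(1) if match else None
print(utm_source)  # newsletter
```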


fig17 - Export Matches and view in JSON | CSV | Plain Text

Export Matches and view the data in JSON | CSV | Plain Text fig17.

Other Regex Generators commonly used are regexr.com, regex-generator.olafneumann.org, etc.

Regex Basics – Digital Marketing Applications

Many applications are available for implementing Regex Basics in the Digital Marketing landscape. Clean your URLs to make them readable by taking out unwanted parameters, special characters, subdomains, or tracking codes. A few applications are listed below –

Regex Basics in SEO

Regex Basics helps to restructure URLs during the migration of your website by setting up 301 redirects that map old URLs to their new ones. You can also analyze your content using Regex to identify meta titles/ descriptions.
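As a rough illustration of matching old URLs to new ones, the sketch below rewrites a hypothetical old blog path to a new one with a capture group and a backreference. The paths, the function name new_url, and the use of Python are all assumptions for this example; on a live site the same pattern would normally go into the server or plugin that issues the 301 redirects.

```python
import re

# Hypothetical rule: /blog/<year>/<slug> moves to /articles/<slug>.
old_pattern = re.compile(r"^/blog/\d{4}/(.+)$")

def new_url(path):
    """Return the redirect target for an old path, or the path unchanged."""
    return old_pattern.sub(r"/articles/\1", path)

print(new_url("/blog/2019/regex-basics"))  # /articles/regex-basics
print(new_url("/about"))                   # /about (no redirect needed)
```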

fig18 - Exported Data - grouping long-tail keywords containing ‘free’ and regex is ^.*free.*$

In Keyword Research, Regex Basics can segment keywords into relevant groups or organize lists acquired from tools such as Moz, Semrush, etc. For example, take the test string “grouping long-tail keywords containing free” and the Regular Expression ^.*free.*$, which matches any line that contains ‘free’.

Fig18 shows the exported data for the test string “grouping long-tail keywords containing free” and the regular expression ^.*free.*$, exported as Plain Text from the Regex Tester at regex101.com.
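The same idea can be applied to a whole keyword list. The sketch below assumes the intended pattern is ^.*free.*$ (any line containing ‘free’) and uses a hypothetical list of keywords; in practice the list would come from an export of a keyword tool. Python's re module is used for illustration only.

```python
import re

# Hypothetical keyword list, e.g. exported from a keyword research tool.
keywords = [
    "free seo audit tool",
    "best keyword planner",
    "seo checklist free download",
    "paid link building",
]

contains_free = re.compile(r"^.*free.*$")

free_group = [k for k in keywords if contains_free.search(k)]
print(free_group)  # ['free seo audit tool', 'seo checklist free download']
```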

Regex Basics in GA & GSC

In Google Analytics, RegEx is used to find specific patterns in a list, applying to –

  • goals,
  • filters/ views,
  • audiences,
  • channel groupings,
  • segments,
  • content groups, for finding URLs that match particular descriptions, eg (1) all pages within a subdirectory, or (2) all pages with a query string more than ten characters long.
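The two URL descriptions mentioned above could be expressed roughly as follows. This is only a sketch: the paths are hypothetical, the patterns are one possible way to write each rule, and Python is used for illustration rather than the Google Analytics interface.

```python
import re

# Hypothetical page paths.
urls = [
    "/search-engine-optimization/regex-basics",
    "/digital-marketing/email",
    "/shop?utm_source=newsletter",
    "/shop?id=1",
]

# (1) All pages within a subdirectory.
in_subdirectory = [u for u in urls if re.search(r"^/search-engine-optimization/", u)]

# (2) A query string more than ten characters long: '?' then 11+ characters.
long_query = [u for u in urls if re.search(r"\?.{11,}$", u)]

print(in_subdirectory)  # ['/search-engine-optimization/regex-basics']
print(long_query)       # ['/shop?utm_source=newsletter']
```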

For example, ‘search-engine-optimization’ is one of the categories, and it contains many pages.

These category pages are important. They correlate unique views against traffic sources & destination pages.

fig19 - Regex Basics: Google Analytics - Goal>Destination

Regex Basics helps Google Analytics refine data analysis by including or excluding user agents, specific URLs, and query parameters. In Google Search Console, Regex Basics can be used to filter search queries, for example to exclude brand terms.

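In Google Analytics, a common filter excludes traffic from your own IP address(es) (fig19). You can set up exclusions for a series of IPs with Google Analytics regex, eg: ^73\.234\.191\.[1-9]$ would exclude all IP addresses from 73.234.191.1 to 73.234.191.9. The dots are escaped with a backslash because an unescaped dot matches any character, and the anchors stop the pattern from also matching longer addresses such as 73.234.191.10.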
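The sketch below checks an IP range pattern against a few sample addresses. It compares a strict version (escaped dots plus ^ and $ anchors) with the looser form 73.234.191.[1-9]; the sample addresses are hypothetical, and Python's re module stands in for the Google Analytics filter purely for illustration.

```python
import re

strict = re.compile(r"^73\.234\.191\.[1-9]$")
loose = re.compile(r"73.234.191.[1-9]")

ips = ["73.234.191.1", "73.234.191.9", "73.234.191.10", "73x234x191x5"]

excluded_strict = [ip for ip in ips if strict.search(ip)]
excluded_loose = [ip for ip in ips if loose.search(ip)]

print(excluded_strict)  # ['73.234.191.1', '73.234.191.9']
print(excluded_loose)   # all four: '.' matches any character and nothing is anchored
```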

Use Regular Expression to segment mobile and desktop traffic.

Regex Basics in CMS

In CMS platforms like WordPress, you can use Regex patterns to rewrite URLs, create user-friendly URLs, tag content to categorize posts or pages, etc.

Use a WordPress plugin such as Search Regex that adds search and replace functions to your posts, pages, custom post types, etc. to search/ replace any data stored on your WordPress website.

Regex Basics in Email Marketing

Regex Basics can segment user email lists according to level of engagement, preferences, or behavior. Regex placeholders may be used for user names by dynamic insertion of variables that customize email content.

Conclusion

Regex Basics deals with sequences of characters that define a search pattern inside a text. Regular expressions are used for string searching, input validation, and find/ replace operations. A regex contains both literal and special characters and is supported by programming languages such as Python, Perl, etc. Regex testers are recommended for debugging and optimizing patterns, as regular expressions can be complex and difficult to read. Some common regex patterns are found in URL matching, IP address matching, password strength (having at least 8 characters), digits (to handle fractions, negative/ positive & whole/ decimal numbers), alphanumeric characters (with or without spaces) such as usernames (including hyphens and underscores), email address validation, etc. Regex finds its application in Google Analytics, text editors, and validation tasks for checking user input or searching for specific texts. Use Regex to find URLs, extract dates from text, extract hashtags from posts, match email addresses, validate phone numbers, replace words in text, etc. Regex generator tools help developers craft accurate regular expressions for matching and manipulating text.

By D Anthony

D Anthony is an SEO content writer and blogger with a passion for digital marketing. When not at work, he goes sightseeing.