Patterns in Text

Jon Reades

Can We Describe Text?

Consider the following character sequences:

  • foo@bar.com
  • https://www.ucl.ac.uk/bartlett/casa/
  • (555) 102-1111
  • E17 5RS
  • Now, fair Hippolyta, our nuptial hour / Draws on apace. Four happy days bring in / Another moon. But, oh, methinks how slow / This old moon wanes. She lingers my desires, / Like to a stepdame or a dowager / Long withering out a young man’s revenue. (I.i.)

String Methods Are Not Enough

'123foo456'.index('foo') # 3
'123foo456'.split('foo') # ['123', '456']
' 123 foo 456 '.strip()  # '123 foo 456'
'HOW NOW BROWN COW?'.lower() # 'how now brown cow?'
'How now brown cow?'.replace('brown ','green-')
# 'How now green-cow?'

See dir(str) for the full list of string methods.
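To see where string methods run out of road, here is a rough sketch that checks whether a string has the same shape as the phone number above. The helper name looks_like_phone is made up for illustration, and the layout it checks is an assumption based on that one example.

```python
# Hypothetical helper: checks the '(555) 102-1111' layout using only
# string methods, one slice at a time.
def looks_like_phone(s):
    return (
        len(s) == 14
        and s[0] == '('
        and s[1:4].isdigit()
        and s[4:6] == ') '
        and s[6:9].isdigit()
        and s[9] == '-'
        and s[10:].isdigit()
    )

print(looks_like_phone('(555) 102-1111'))  # True
print(looks_like_phone('(555) 102-111'))   # False
```

Every new format (no brackets, dots instead of dashes) would need another hand-written set of checks like this one.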

Regular Expressions

Regexes are a way of talking about patterns observed in text, although their origins are rooted in philosophy and linguistics.

Implemented in Python as:

import re
# re.search(<regex>, <str>)
s = '123foo456'
if re.search('123',s):
  print("Found a match.")
else:
  print("No match.")

Prints 'Found a match.'

Capturing Matches

m = re.search('123',s)
print(m.start())
print(m.end())
print(m.span())
print(m.group())

Outputs:

0
3
(0, 3)
123
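A small sketch of how these values might be used: the span can slice the original string, and because re.search returns None when nothing matches, it is safer to check before calling methods on the result. The variable names are illustrative only.

```python
import re

s = '123foo456'
m = re.search('foo', s)

if m is not None:
    start, end = m.span()
    matched = s[start:end]          # same text as m.group()
    before, after = s[:start], s[end:]
    print(matched)                  # foo
    print(before, after)            # 123 456

# No match: re.search returns None, so m.start() would raise AttributeError.
missing = re.search('bar', s)
print(missing)                      # None
```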

Configuring Matches

s = '123foo456'
m = re.search('FOO',s)
print(m)
m = re.search('FOO',s,re.IGNORECASE)
print(m)

Outputs:

None
<re.Match object; span=(3, 6), match='foo'>

The third parameter sets flags that change how matching works: let . match newlines too (re.DOTALL), ignore case (re.IGNORECASE), take the current locale into account (re.LOCALE, bytes patterns only), make ^ and $ match at the start and end of every line (re.MULTILINE), and write patterns across multiple lines with comments (re.VERBOSE). If you need multiple options, combine them with a bitwise OR: re.DOTALL | re.IGNORECASE. Bitwise again!
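A minimal sketch of how the flags change behaviour, using a made-up two-line string:

```python
import re

text = "First line\nSECOND line"

# No flags: ^ only anchors at the very start of the string, and case matters.
no_flags = re.search(r'^second', text)
print(no_flags)                      # None

# MULTILINE lets ^ match after the newline; IGNORECASE ignores case.
m = re.search(r'^second', text, re.MULTILINE | re.IGNORECASE)
print(m.group())                     # SECOND

# Without DOTALL the '.' stops at the newline...
one_line = re.search(r'first.*line', text, re.IGNORECASE)
print(repr(one_line.group()))        # 'First line'

# ...with DOTALL it carries on across it.
two_lines = re.search(r'first.*line', text, re.IGNORECASE | re.DOTALL)
print(repr(two_lines.group()))       # 'First line\nSECOND line'
```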

More Than One Match

s = '123foo456foo789'
lst = re.findall('foo',s)
print(lst)
lst = re.finditer('foo',s)
print([x for x in lst])
rs  = re.sub('foo',' ',s)
print(rs)
rs  = re.split(' ',rs)
print(rs)

Outputs:

['foo', 'foo']
[<re.Match object; span=(3, 6), match='foo'>, <re.Match object; span=(9, 12), match='foo'>]
123 456 789
['123', '456', '789']

Let’s Get Meta

Regular Expressions Do Much More

import re
m = re.search(r'\$((\d+,){2,}\d+)',
        "'That will be $1,000,000 he said...'")
print(m.group(1)) # '1,000,000'

This looks for sequences of 1-or-more digits followed by a comma… and for those sequences to repeat two or more times:

text = "'That will be $1,000,000 he said...'"
# Look for a literal '$'
re.search(r'\$', text)
# Group of >=1 digits followed by a comma...
re.search(r'(\d+,)', text)
# Repeated two or more times...
re.search(r'(\d+,){2,}', text)
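To see what each pair of brackets remembers, the sketch below prints every group from the full pattern. Note that a repeated group only keeps its last repetition. Converting the result to an integer is an extra step added here for illustration.

```python
import re

m = re.search(r'\$((\d+,){2,}\d+)',
        "'That will be $1,000,000 he said...'")

print(m.group(0))  # $1,000,000  (the whole match)
print(m.group(1))  # 1,000,000   (outer brackets)
print(m.group(2))  # 000,        (inner brackets: last repetition only)

# Strip the commas to get a usable number.
amount = int(m.group(1).replace(',', ''))
print(amount)      # 1000000
```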

Character Classes

Characters                              Regex Meta Class Options                        ‘Antonyms’
--------------------------------------  ----------------------------------------------  ----------
a…z                                     [a-z], \w (word-like characters)                [^a-z], \W
A…Z                                     [A-Z], \w (word-like characters)                [^A-Z], \W
0…9                                     [0-9], \d (digits)                              [^0-9], \D
' ', \n, \t, \r, \f, \v                 \s                                              \S
., [, ], +, $, ^, |, {, }, *, (, ), ?   For safety always precede character with a \.   None
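A quick sketch of the classes in action on a made-up address string, plus the difference that escaping makes:

```python
import re

s = 'Flat 3, 22 Gordon St.'

digits   = re.findall(r'\d', s)         # single digits
numbers  = re.findall(r'\d+', s)        # runs of digits
words    = re.findall(r'\w+', s)        # runs of word-like characters
negated  = re.findall(r'[^a-z\s]', s)   # not lower-case, not whitespace

print(digits)   # ['3', '2', '2']
print(numbers)  # ['3', '22']
print(words)    # ['Flat', '3', '22', 'Gordon', 'St']
print(negated)  # ['F', '3', ',', '2', '2', 'G', 'S', '.']

# Unescaped '.' matches any character; escaped '\.' matches only a full stop.
any_char  = re.findall(r'.', 'a.b')
full_stop = re.findall(r'\.', 'a.b')
print(any_char)   # ['a', '.', 'b']
print(full_stop)  # ['.']

# re.escape() adds the backslashes for you.
price = re.search(re.escape('$1.50'), 'It cost $1.50 today')
print(price.group())  # $1.50
```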

Metacharacters

Metacharacter   Meaning                                           Example
-------------   -----------------------------------------------   -------
.               Any character (except a newline by default)       c.t
^               Start of a string/line                            ^start
$               End of a string/line                              end$
*               0 or more of something                            -*
+               1 or more of something                            -+
?               0 or 1 of something; also lazy modifier           ,?
{m,n}           Repeat between m and n times                      \d{1,4}
[ ]             A set of character literals                       [1-5]
( )             Group/remember this sequence of characters        (\d+)
|               Or                                                (A|B)
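The 'lazy modifier' meaning of ? is the least obvious entry, so here is a small sketch contrasting it with the default greedy behaviour, alongside a few of the other metacharacters. The sample strings are invented.

```python
import re

s = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ grabs as much as it can, so we get one big match.
greedy = re.findall(r'<.+>', s)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Lazy: .+? grabs as little as it can, so each tag matches separately.
lazy = re.findall(r'<.+?>', s)
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# ? as '0 or 1 of something'.
spellings = re.findall(r'colou?r', 'color colour')
print(spellings)  # ['color', 'colour']

# ^ and $ pin the pattern to the whole string; . stands in for one character.
print(bool(re.search(r'^c.t$', 'cut')))   # True
print(bool(re.search(r'^c.t$', 'cart')))  # False

# {m,n}: between 1 and 4 digits at a time.
chunks = re.findall(r'\d{1,4}', '7 2024 123456')
print(chunks)  # ['7', '2024', '1234', '56']
```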

I am Completely Lost

Building Blocks

Regex                 Interpretation
-------------------   ------------------------------------------------------------
r'\s*'                0 or more whitespace characters
r'\d+'                1 or more digits
r'[A-Fa-f0-9]{5}'     Exactly 5 hexadecimal ‘digits’
r'\w+\.\d{2,}'        1 or more ‘wordish’ characters, followed by a full-stop, then 2 or more digits
r'^[^@]+@\w+'         One or more non-@ characters at the start of a line, followed by a ‘@’ then 1 or more ‘wordish’ characters.
r'(uk|eu|fr)$'        The characters ‘uk’ or ‘eu’ or ‘fr’ at the end of a line.
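One way to check your reading of a building block is to run it against a short string. The sketch below does that for a few of the patterns; the test strings are made up or borrowed from the opening examples.

```python
import re

tests = [
    (r'\d+',          'E17 5RS'),
    (r'\w+\.\d{2,}',  'version.42'),
    (r'^[^@]+@\w+',   'foo@bar.com'),
    (r'(uk|eu|fr)$',  'www.ucl.ac.uk'),
]

results = []
for pattern, text in tests:
    m = re.search(pattern, text)
    found = m.group() if m else None
    results.append(found)
    print(f"{pattern!r} on {text!r} -> {found!r}")

# Matches found, in order: '17', 'version.42', 'foo@bar', 'uk'
```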

Exploring

Regex101 can be a useful way to build a regex interactively.

What’s This?

re.match(r'^[^@]+@([a-z0-9\-]+\.){1,5}[a-z0-9\-]+$', s)

What’s This?

re.match(r'\d{4}-\d{2}-\d{2}', s)

What’s This?

re.match(r'^\s*$', s)

What’s This?

re.match(r'^(http|https|ftp):[\/]{2}([a-zA-Z0-9\-]+\.){1,4}[a-zA-Z]{2,5}(:[0-9]+)?\/?([a-zA-Z0-9\-\._\?\'\/\\\+\&\%\$#\=~]*)',s)

What’s This?

re.match(r'([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s?[0-9][A-Za-z]{2})',s)

To Help…

re.VERBOSE to the rescue:

regex = r"""
(GIR[ ]0A{2})|    # Girobank; a literal space must be written as [ ] or \s in verbose mode
(
  (
    ([A-Z][0-9]{1,2})| # e.g. A0...Z99
      (
        ([A-Z][A-HJ-Y][0-9]{1,2})|  # e.g. AB54...ZX11
          (([A-Z][0-9][A-Z])|  # e.g. A0B...Z9Z 
          ([A-Z][A-HJ-Y][0-9][A-Z]?))  # e.g. WC1 or WC1H
        )
      )
    \s?[0-9][A-Z]{2} # e.g. 5RX
  )
"""
re.match(regex,s,re.VERBOSE|re.IGNORECASE) # Can also use: re.X|re.I
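One thing to watch with re.VERBOSE is that whitespace inside the pattern is ignored, so a literal space has to be written as [ ] or \s. The sketch below shows this on a smaller pattern (a variation on the dollar amount regex from earlier, with an invented ' USD' suffix).

```python
import re

amount = re.compile(r"""
    \$          # a literal dollar sign
    (           # group 1: the whole number
      (\d+,)+   # one or more blocks of digits ending in a comma
      \d+       # the final block of digits
    )
    [ ]USD      # a space inside [ ] survives verbose mode
""", re.VERBOSE)

m = amount.search('Total: $1,000,000 USD')
print(m.group(1))  # 1,000,000

# A bare space in a verbose pattern is thrown away: 'a b' behaves like 'ab'.
squashed = re.search(r'a b', 'ab', re.VERBOSE)
spaced   = re.search(r'a b', 'a b', re.VERBOSE)
print(squashed.group())  # ab
print(spaced)            # None
```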

Applications of Regular Expressions

If our problem follows some set of articulable rules about permissible sequences of characters then we can probably validate it using a regex:

Examples                             More Examples
----------------------------------   -----------------------------------
Email                                Password
Postcode                             Phone number
Date                                 Credit cards
Web scraping                         Syntax highlighting
Sentence structure                   Data wrangling
Searching for/within files/content   Lexical analysis/Language detection
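As a rough sketch of validation in practice, the example below checks dates. A regex can confirm the shape of the string, but not that the date exists, so the sketch hands that second step to datetime. The function name is_valid_date and the YYYY-MM-DD format are assumptions made for illustration.

```python
import re
from datetime import datetime

# Shape only: four digits, dash, two digits, dash, two digits.
DATE_SHAPE = re.compile(r'\d{4}-\d{2}-\d{2}')

def is_valid_date(s):
    """Hypothetical validator: regex for the shape, datetime for the meaning."""
    if DATE_SHAPE.fullmatch(s) is None:
        return False
    try:
        datetime.strptime(s, '%Y-%m-%d')
    except ValueError:
        return False
    return True

print(is_valid_date('2024-02-29'))  # True  (leap year)
print(is_valid_date('2024-02-30'))  # False (right shape, no such day)
print(is_valid_date('2024-2-3'))    # False (wrong shape)
```

Using fullmatch rather than match means trailing characters (such as '2024-02-29abc') are rejected as well.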

Resources

Thanks to Yogesh Chavan and Nicola Pietroluongo for examples.