Patterns in Text

Jon Reades - j.reades@ucl.ac.uk

1st October 2025

Can We Describe Text?

Consider the following character sequences:

  • foo@bar.com
  • https://www.ucl.ac.uk/bartlett/casa/
  • (555) 102-1111
  • E17 5RS
  • Now, fair Hippolyta, our nuptial hour / Draws on apace. Four happy days bring in / Another moon. But, oh, methinks how slow / This old moon wanes. She lingers my desires, / Like to a stepdame or a dowager / Long withering out a young man’s revenue. (I.i.)

String Methods are Not Enough

print(f"Index: {'123foo456'.index('foo')}")
print(f"Split: {'123foo456'.split('foo')}") 
print(f"Strip: {' 123 foo 456 '.strip()}") 
print(f"Lower: {'HOW NOW BROWN COW?'.lower()}")
print(f"Replace: {'How now brown cow?'.replace('brown ','green-')}")
Index: 3
Split: ['123', '456']
Strip: 123 foo 456
Lower: how now brown cow?
Replace: How now green-cow?

See dir(str) for the full list of string methods.
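A minimal sketch of where these methods run out. The sample text below is an illustrative assumption, not one of the examples above. String methods find a literal you already know, but "anything shaped like a phone number" has to be worked out by hand:

```python
# Illustrative sample text (an assumption, not from the slides)
text = 'Call (555) 102-1111 or (555) 102-2222'

# str.find() only locates an exact, known literal...
known = text.find('(555) 102-1111')
unknown = text.find('(555) 102-9999')
print(known)    # position of the number we already knew
print(unknown)  # -1: a number we did not anticipate is not found

# ...so pulling out 'any digits' needs hand-written logic,
# and it loses where one number ends and the next begins.
digits = ''.join(ch for ch in text if ch.isdigit())
print(digits)
```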

Regular Expressions

Regexes are a way of describing patterns observed in text, although their origins are rooted in philosophy and linguistics.

Implemented in Python as:

import re
s = '123foo456'

# re.search(<regex>, <str>)
if re.search('123',s):
  print("Found a match.")
else:
  print("No match.")
Found a match.

Capturing Matches

For a single match it’s fairly straightforward:

print(f"String: {s}")
m = re.search('123',s)

print(f"Start of match: {m.start()}")
print(f"End of match: {m.end()}")
print(f"Span of match: {m.span()}")
print(f"Match group: {m.group()}")
String: 123foo456
Start of match: 0
End of match: 3
Span of match: (0, 3)
Match group: 123
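One thing to watch for: re.search returns None when nothing matches, so calling .group() on the result would raise an AttributeError. A small sketch of the usual guard (the variable name result is just for illustration):

```python
import re

s = '123foo456'

# '789' is not in s, so m is None rather than a Match object
m = re.search('789', s)

# Check for None before calling .group(), .span(), etc.
if m is None:
    result = 'No match'
else:
    result = m.group()

print(result)
```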

Configuring Matches

s = '123foo456'

m = re.search('FOO',s)
print(f"First search: {m}")

m = re.search('FOO',s,re.IGNORECASE)
print(f"Second search: {m}")
First search: None
Second search: <re.Match object; span=(3, 6), match='foo'>

The third parameter sets options: let . match newlines too (re.DOTALL), ignore case (re.IGNORECASE), take the current locale into account (re.LOCALE), let ^ and $ match at the start and end of every line (re.MULTILINE), and write patterns across multiple lines with comments (re.VERBOSE). If you need multiple options, combine them with |, as in re.DOTALL | re.IGNORECASE. Bitwise again!
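A short sketch of combining options, using a made-up two-line string:

```python
import re

# Made-up sample text spanning two lines
text = 'First line\nSECOND line'

# No flags: '.' stops at the newline and the search is case-sensitive
no_flags = re.search('first.*second', text)
print(no_flags)

# Combine options with the bitwise OR operator
m = re.search('first.*second', text, re.DOTALL | re.IGNORECASE)
print(repr(m.group()))

# re.MULTILINE lets '^' match at the start of every line
starts = re.findall(r'^\w+', text, re.MULTILINE)
print(starts)
```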

More Than One Match

s = '123foo456foo789'

lst1 = re.findall('foo',s)
print(f"Match list: {lst1}")

lst2 = re.finditer('foo',s)
print(f"List iterator: {[x for x in lst2]}")

rs  = re.sub('foo',' ',s)
print(f"Substitution: {rs}")

rs  = re.split(' ',rs)
print(f"Splitting: {rs}")
Match list: ['foo', 'foo']
List iterator: [<re.Match object; span=(3, 6), match='foo'>, <re.Match object; span=(9, 12), match='foo'>]
Substitution: 123 456 789
Splitting: ['123', '456', '789']
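The same string can also be handled a little more directly. This sketch splits on the pattern itself, pulls positions out of the finditer results, and limits the number of substitutions with count:

```python
import re

s = '123foo456foo789'

# Split on the pattern directly, no substitution step needed
parts = re.split('foo', s)
print(parts)

# finditer yields Match objects, so positions are available
spans = [m.span() for m in re.finditer('foo', s)]
print(spans)

# count limits how many substitutions are made
once = re.sub('foo', ' ', s, count=1)
print(once)
```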

Let’s Get Meta

Regular Expressions Do Much More

s = "'That will be $1,000,000, he said...'"
m = re.search(r'\$((\d+,){2,}\d+)',s)
        
print(m.group(1))
1,000,000

It breaks down like this:

# Look for a literal '$'
print(re.search(r'\$',s))
# Group of >=1 digits followed by a comma...
print(re.search(r'(\d+,)',s))
# Repeated two or more times...
print(re.search(r'(\d+,){2,}',s))
<re.Match object; span=(14, 15), match='$'>
<re.Match object; span=(15, 17), match='1,'>
<re.Match object; span=(15, 25), match='1,000,000,'>
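To finish the breakdown: the trailing \d+ is what stops the match from swallowing the final comma, and each pair of parentheses becomes a numbered group. A sketch using the same string and pattern (the amount variable is added for illustration):

```python
import re

s = "'That will be $1,000,000, he said...'"
m = re.search(r'\$((\d+,){2,}\d+)', s)

print(m.group(0))  # the whole match, including the '$'
print(m.group(1))  # outer group: the full number
print(m.group(2))  # inner group: only its last repetition is kept

# Turn the captured text into a number
amount = int(m.group(1).replace(',', ''))
print(amount)
```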

Character Classes

| Characters | Regex Meta Class Options | ‘Antonyms’ |
|---|---|---|
| a…z | [a-z], \w (word-like characters) | [^a-z], \W |
| A…Z | [A-Z], \w (word-like characters) | [^A-Z], \W |
| 0…9 | [0-9], \d (digits) | [^0-9], \D |
| ' ', \n, \t, \r, \f, \v | \s | \S |
| ., [, ], +, $, ^, \|, {, }, *, (, ), ? | For safety always precede character with a \. | None |
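A quick way to get a feel for the classes is to run each one over the same string. The address below is made up for the purpose; note that \w also picks up digits and the underscore:

```python
import re

# Made-up sample containing letters, digits, '_', punctuation and a tab
s = 'Flat 2b, 10 Main_St.\tLondon'

digits = re.findall(r'\d', s)        # single digits
lower = re.findall(r'[a-z]+', s)     # runs of lowercase letters only
words = re.findall(r'\w+', s)        # letters, digits and underscore
spaces = re.findall(r'\s', s)        # whitespace, including the tab
other = re.findall(r'[^\w\s]', s)    # neither wordish nor whitespace
print(digits, lower, words, spaces, other)

# re.escape adds the protective backslashes for you
price = re.search(re.escape('$1.50'), 'cost: $1.50')
print(price.group())
```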

Metacharacters

| Metacharacter | Meaning | Example |
|---|---|---|
| . | Any character at all | c.t |
| ^ | Start of a string/line | ^start |
| $ | End of a string/line | end$ |
| * | 0 or more of something | -* |
| + | 1 or more of something | -+ |
| ? | 0 or 1 of something; also lazy modifier | ,? |
| {m,n} | Repeat between m and n times | \d{1,4} |
| [ ] | A set of character literals | [1-5] |
| ( ) | Group/remember this sequence of characters | (\d+) |
| \| | Or | (A\|B) |
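A few of these in action, as a sketch on made-up strings. The last two lines show the difference between the default greedy behaviour and the lazy modifier:

```python
import re

any_char = re.findall(r'c.t', 'cat cot c-t ct')      # '.' needs exactly one character
repeat = re.findall(r'\d{1,4}', '7 2025 123456')     # takes up to 4 digits at a time
optional = re.findall(r'colou?r', 'color colour')    # '?' makes the 'u' optional
either = re.findall(r'(A|B)\d', 'A1 B2 C3')          # findall returns the group's text
print(any_char, repeat, optional, either)

html = '<b>bold</b>'
greedy = re.search(r'<.+>', html).group()   # as much as possible
lazy = re.search(r'<.+?>', html).group()    # as little as possible
print(greedy, lazy)
```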

I am Completely Lost

Building Blocks

| Regex | Interpretation |
|---|---|
| r'\s*' | 0 or more whitespace characters |
| r'\d+' | 1 or more digits |
| r'[A-Fa-f0-9]{5}' | Exactly 5 hexadecimal ‘digits’ |
| r'\w+\.\d{2,}' | 1 or more ‘wordish’ characters, followed by a full-stop, then 2 or more digits |
| r'^[^@]+@\w+' | One or more non-@ characters at the start of a line, followed by a ‘@’ then 1 or more ‘wordish’ characters. |
| r'(uk\|eu\|fr)$' | The characters ‘uk’ or ‘eu’ or ‘fr’ at the end of a line. |
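Trying some of these blocks on real strings shows what they do and do not catch. A sketch, with sample strings chosen for illustration:

```python
import re

# Wordish characters, a full-stop, then 2+ digits
version = re.search(r'\w+\.\d{2,}', 'version v2.10 released')
print(version.group())

# Everything up to the '@', then the first wordish run after it.
# Note that it stops at the first '.', since '.' is not wordish.
email = re.search(r'^[^@]+@\w+', 'j.reades@ucl.ac.uk')
print(email.group())

# One of three endings, anchored to the end of the string
ending = re.search(r'(uk|eu|fr)$', 'j.reades@ucl.ac.uk')
print(ending.group(1))

# No match when the ending is something else
other = re.search(r'(uk|eu|fr)$', 'foo@bar.com')
print(other)
```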

Exploring

Regex101 might be a fun way to learn.

What’s This?

re.VERBOSE to the rescue:

regex = r"""
(GIR\s?0A{2})|    # Girobank (GIR 0AA); plain spaces are ignored under re.VERBOSE
(
  (
    ([A-Z][0-9]{1,2})| # e.g. A00...Z99
      (
        ([A-Z][A-HJ-Y][0-9]{1,2})|  # e.g. AB54...ZX11
          (([A-Z][0-9][A-Z])|  # e.g. A0B...Z9Z 
          ([A-Z][A-HJ-Y][0-9][A-Z]?))  # e.g. WC1 or WC1H
        )
      )
    \s?[0-9][A-Z]{2} # e.g. 5RX
  )
"""
re.match(regex,s,re.VERBOSE|re.IGNORECASE) # Can also use: re.X|re.I
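The same idea on a smaller scale: a sketch of a verbose pattern for the US-style phone number from the opening slide. The pattern is a hypothetical one written for this example, not a complete phone-number validator. It also shows the main catch with re.VERBOSE, which is that unescaped whitespace in the pattern is ignored:

```python
import re

# Hypothetical pattern for numbers shaped like (555) 102-1111
phone = re.compile(r"""
    \(?(\d{3})\)?   # area code, optionally in parentheses
    [\s-]?          # optional space or hyphen
    (\d{3})         # exchange
    [\s-]?          # optional space or hyphen
    (\d{4})         # line number
""", re.VERBOSE)

m = phone.search('(555) 102-1111')
print(m.groups())

# Under re.VERBOSE a bare space in the pattern is dropped...
ignored = re.search(r'a b', 'a b', re.VERBOSE)
# ...so write it as \s or put it in a character class instead
kept = re.search(r'a[ ]b', 'a b', re.VERBOSE)
print(ignored, kept.group())
```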

Applications of Regular Expressions

If our problem follows some set of articulable rules about permissible sequences of characters then we can probably validate it using a regex:

| Examples | More Examples |
|---|---|
| Email | Password |
| Postcode | Phone number |
| Date | Credit cards |
| Web scraping | Syntax highlighting |
| Sentence structure | Data wrangling |
| Searching for/within files/content | Lexical analysis/Language detection |
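As a sketch of validation, here is a check that a string has the shape of a YYYY-MM-DD date. The pattern and the looks_like_date helper are assumptions made for this example. Note that it validates the sequence of characters only, not whether the date exists:

```python
import re

# Hypothetical pattern: four digits, two digits, two digits
DATE = re.compile(r'\d{4}-\d{2}-\d{2}')

def looks_like_date(value):
    """Return True if the whole string has the shape YYYY-MM-DD."""
    # fullmatch requires the pattern to cover the entire string
    return DATE.fullmatch(value) is not None

print(looks_like_date('2025-10-01'))   # right shape
print(looks_like_date('01/10/2025'))   # wrong separators and order
print(looks_like_date('2025-99-99'))   # right shape, impossible date
```

Rules about permissible characters are a good fit for a regex; rules about meaning (is month 99 real?) usually need a second check in code.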

Additional Resources

Thanks to Yogesh Chavan and Nicola Pietroluongo for examples.

Thank You
