Patterns in Text

Jon Reades

Can We Describe Text?

Consider the following character sequences:

foo@bar.com
https://www.ucl.ac.uk/bartlett/casa/
(555) 102-1111
E17 5RS
Now, fair Hippolyta, our nuptial hour / Draws on apace. Four happy days bring in / Another moon. But, oh, methinks how slow / This old moon wanes. She lingers my desires, / Like to a stepdame or a dowager / Long withering out a young man’s revenue. (I.i.)

String Methods Are Not Enough

'123foo456'.index('foo') # 3
'123foo456'.split('foo') # ['123', '456']
' 123 foo 456 '.strip()  # '123 foo 456'
'HOW NOW BROWN COW?'.lower() # 'how now brown cow?'
'How now brown cow?'.replace('brown ','green-')
# 'How now green-cow?'

See dir(str) for the full list of string methods.
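To see where string methods run out, consider pulling the phone number from the first slide out of a longer sentence. This is a minimal sketch that previews the `re` module introduced next; the sample sentence and variable names are made up for illustration.

```python
import re

text = "Call (555) 102-1111 before noon."  # hypothetical sample sentence

# String methods can locate a substring we already know in full...
print(text.find("(555) 102-1111"))  # 5

# ...but they cannot express "anything shaped like a phone number".
pattern = r"\(\d{3}\) \d{3}-\d{4}"
m = re.search(pattern, text)
print(m.group())  # (555) 102-1111
```

The string method needs the exact number up front; the pattern only needs its shape.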

Regular Expressions

Regexes are a way of talking about patterns observed in text, although their origins are rooted in philosophy and linguistics.

Implemented in Python as:

import re
# re.search(<regex>, <str>)
s = '123foo456'
if re.search('123',s):
  print("Found a match.")
else:
  print("No match.")

Prints 'Found a match.'

Capturing Matches

m = re.search('123',s)
print(m.start())
print(m.end())
print(m.span())
print(m.group())

Outputs:

0
3
(0, 3)
123
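A short sketch of how these values relate to the original string, and of what happens when nothing matches. The variable name `missing` is only illustrative.

```python
import re

s = '123foo456'
m = re.search('foo', s)

# start() and end() are ordinary slice indices into the original string.
same = s[m.start():m.end()] == m.group()
print(same)      # True
print(m.span())  # (3, 6)

# re.search returns None when nothing matches, so check before calling
# methods on the result.
missing = re.search('bar', s)
print(missing is None)  # True
```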

Configuring Matches

s = '123foo456'
m = re.search('FOO',s)
print(m)
m = re.search('FOO',s,re.IGNORECASE)
print(m)

Outputs:

None
<re.Match object; span=(3, 6), match='foo'>

The third parameter allows us to: let . match newlines (re.DOTALL), ignore case (re.IGNORECASE), take the current locale into account (re.LOCALE, bytes patterns only), let ^ and $ match at the start and end of every line (re.MULTILINE), and write patterns across multiple lines with comments (re.VERBOSE). If you need multiple options, combine them with |, as in re.DOTALL | re.IGNORECASE. Bitwise again!
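A small sketch of what some of these flags change in practice. The three-line test string is invented for the example.

```python
import re

s = "Foo\nbar\nFOO"  # hypothetical three-line string

# Combine flags with |. re.MULTILINE lets ^ match at the start of every
# line; re.IGNORECASE makes 'foo' match 'Foo' and 'FOO'.
starts = re.findall(r'^foo', s, re.IGNORECASE | re.MULTILINE)
print(starts)  # ['Foo', 'FOO']

# Without re.DOTALL, '.' will not match a newline.
no_dotall = re.search(r'Foo.bar', s)
print(no_dotall)  # None

# With re.DOTALL, '.' matches the newline between 'Foo' and 'bar'.
with_dotall = re.search(r'Foo.bar', s, re.DOTALL)
print(with_dotall.group() == "Foo\nbar")  # True
```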

More Than One Match

s = '123foo456foo789'
lst = re.findall('foo',s)
print(lst)
lst = re.finditer('foo',s)
print([x for x in lst])
rs  = re.sub('foo',' ',s)
print(rs)
rs  = re.split(' ',rs)
print(rs)

Outputs:

['foo', 'foo']
[<re.Match object; span=(3, 6), match='foo'>, <re.Match object; span=(9, 12), match='foo'>]
123 456 789
['123', '456', '789']
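A few variations on the same functions, sketched with the same input string. The variable names are illustrative only.

```python
import re

s = '123foo456foo789'

# finditer yields match objects one at a time, which is handy when the
# positions matter as well as the text.
spans = [m.span() for m in re.finditer('foo', s)]
print(spans)  # [(3, 6), (9, 12)]

# re.split takes a pattern directly, so the sub-then-split step is optional.
parts = re.split('foo', s)
print(parts)  # ['123', '456', '789']

# count limits how many replacements re.sub makes.
once = re.sub('foo', ' ', s, count=1)
print(once)  # 123 456foo789
```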

Let’s Get Meta

Regular Expressions Do Much More

import re
m = re.search(r'\$((\d+,){2,}\d+)',
        "'That will be $1,000,000 he said...'")
print(m.group(1)) # '1,000,000'

This looks for sequences of 1-or-more digits followed by a comma… and for those sequences to repeat two or more times:

s = "'That will be $1,000,000 he said...'"
# Look for a literal '$'
re.search(r'\$', s)
# Group of >=1 digits followed by a comma...
re.search(r'(\d+,)', s)
# Repeated two or more times...
re.search(r'(\d+,){2,}', s)
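One way to watch the pattern grow is to run each stage against the same sentence and print what it matches. This is a sketch; the `steps` list and variable names are not part of the original example.

```python
import re

text = "'That will be $1,000,000 he said...'"

# Each stage extends the previous pattern.
steps = [
    r'\$',                 # a literal dollar sign
    r'\$(\d+,)',           # ...then digits followed by a comma
    r'\$(\d+,){2,}',       # ...with that digit-comma group at least twice
    r'\$((\d+,){2,}\d+)',  # ...then the final run of digits
]

matched = [re.search(pattern, text).group() for pattern in steps]
for pattern, found in zip(steps, matched):
    print(pattern, '->', found)
# \$ -> $
# \$(\d+,) -> $1,
# \$(\d+,){2,} -> $1,000,
# \$((\d+,){2,}\d+) -> $1,000,000
```

Because the group must repeat at least twice, an amount with a single comma such as $1,000 would not match the full pattern.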

Character Classes

Characters	Regex Meta Class Options	‘Antonyms’
a…z	`[a-z]`, `\w` (word-like characters)	`[^a-z]`, `\W`
A…Z	`[A-Z]`, `\w` (word-like characters)	`[^A-Z]`, `\W`
0…9	`[0-9]`, `\d` (digits)	`[^0-9]`, `\D`
`' '`, `\n`, `\t`, `\r`, `\f`, `\v`	`\s`	`\S`
`.`, `[`, `]`, `+`, `$`, `^`, `\`, `|`, `{`, `}`, `*`, `(`, `)`, `?`	To match one of these literally, precede it with a `\`.	None
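A quick sketch of the classes in the table applied to one string. The sample string reuses the postcode and phone number from the first slide; `re.escape` is the standard-library helper that adds the backslashes for you.

```python
import re

s = "E17 5RS, tel: (555) 102-1111"

digits = re.findall(r'\d+', s)
print(digits)  # ['17', '5', '555', '102', '1111']

uppers = re.findall(r'[A-Z]+', s)
print(uppers)  # ['E', 'RS']

# Parentheses are metacharacters, so escape them to match them literally.
area = re.findall(r'\(\d+\)', s)
print(area)  # ['(555)']

# re.escape does the escaping for an arbitrary literal string.
escaped = re.escape('(555)')
print(escaped)  # \(555\)
```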

Metacharacters

Metacharacter	Meaning	Example
.	Any character at all	`c.t`
^	Start of a string/line	`^start`
$	End of a string/line	`end$`
*	0 or more of something	`-*`
+	1 or more of something	`-+`
?	0 or 1 of something; also lazy modifier	`,?`
{m,n}	Repeat between m and n times	`\d{1,4}`
[ ]	A set of character literals	`[1-5]`
( )	Group/remember this sequence of characters	`(\d+)`
|	Or	`(A|B)`
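The double role of ? is easiest to see side by side. A minimal sketch, with an invented HTML-like string and word list:

```python
import re

s = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ takes as much as it can, so the match runs to the last '>'.
greedy = re.search(r'<.+>', s).group()
print(greedy)  # <b>bold</b> and <i>italic</i>

# Lazy: a ? after the quantifier makes it stop at the first '>'.
lazy = re.search(r'<.+?>', s).group()
print(lazy)  # <b>

# ? as "0 or 1", a small character set, and | as "or" in one pattern.
words = re.findall(r'colou?r|gr[ae]y', "color colour grey gray")
print(words)  # ['color', 'colour', 'grey', 'gray']
```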

I am Completely Lost

Building Blocks

Regex	Interpretation
`r'\s*'`	0 or more whitespace characters
`r'\d+'`	1 or more digits
`r'[A-Fa-f0-9]{5}'`	Exactly 5 hexadecimal ‘digits’
`r'\w+\.\d{2,}'`	1 or more ‘wordish’ characters, followed by a full-stop, then 2 or more digits
`r'^[^@]+@\w+'`	One or more non-@ characters at the start of a line, followed by a ‘@’ then 1 or more ‘wordish’ characters.
`r'(uk|eu|fr)$'`	The characters ‘uk’ or ‘eu’ or ‘fr’ at the end of a line.
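Trying each building block against a short string is a good way to check your reading of it. The test strings below are invented for the purpose, and the hexadecimal pattern is written with 0-9 so that it covers the full range of hex digits.

```python
import re

# fullmatch requires the whole string to fit the pattern.
five_hex = bool(re.fullmatch(r'[A-Fa-f0-9]{5}', 'c0ffe'))
six_hex = bool(re.fullmatch(r'[A-Fa-f0-9]{5}', 'c0ffee'))
print(five_hex, six_hex)  # True False

fig = re.search(r'\w+\.\d{2,}', 'see fig.12 below').group()
print(fig)  # fig.12

user = re.search(r'^[^@]+@\w+', 'foo@bar.com').group()
print(user)  # foo@bar

tld = re.search(r'(uk|eu|fr)$', 'www.ucl.ac.uk').group(1)
print(tld)  # uk
```

Note that `\w+` stops at the full stop, which is why the email pattern only captures as far as `foo@bar`.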

Exploring

Regex101 can be a useful way to build a regex interactively.

What’s This?

re.match(r'^[^@]+@([a-z0-9\-]+\.){1,5}[a-z0-9\-]+$', s)

What’s This?

re.match(r'\d{4}-\d{2}-\d{2}', s)

What’s This?

re.match(r'^\s*$', s)

What’s This?

re.match(r'^(http|https|ftp):[\/]{2}([a-zA-Z0-9\-]+\.){1,4}[a-zA-Z]{2,5}(:[0-9]+)?\/?([a-zA-Z0-9\-\._\?\'\/\\\+\&\%\$#\=~]*)',s)

What’s This?

re.match(r'([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s?[0-9][A-Za-z]{2})',s)

To Help…

re.VERBOSE to the rescue:

regex = r"""
(GIR[ ]0A{2})|    # Girobank (GIR 0AA); [ ] keeps the space under re.VERBOSE
(
  (
    ([A-Z][0-9]{1,2})| # e.g. A0...Z99
      (
        ([A-Z][A-HJ-Y][0-9]{1,2})|  # e.g. AB54...ZX11
          (([A-Z][0-9][A-Z])|  # e.g. A0B...Z9Z 
          ([A-Z][A-HJ-Y][0-9][A-Z]?))  # e.g. WC1 or WC1H
        )
      )
    \s?[0-9][A-Z]{2} # e.g. 5RX
  )
"""
re.match(regex,s,re.VERBOSE|re.IGNORECASE) # Can also use: re.X|re.I
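One thing to watch under re.VERBOSE is that bare whitespace in the pattern is ignored, so a literal space has to be escaped or placed in a character class. The sketch below uses a deliberately simplified postcode pattern (not the full rule above) with the hypothetical name `simple_postcode` to show this.

```python
import re

# Simplified, illustrative pattern: NOT the complete UK postcode rule.
simple_postcode = re.compile(r"""
    ^([A-Z]{1,2}[0-9][0-9A-Z]?)  # outward code, e.g. E17 or WC1H
    [ ]?                         # optional space, in a class so VERBOSE keeps it
    ([0-9][A-Z]{2})$             # inward code, e.g. 5RS
""", re.VERBOSE | re.IGNORECASE)

with_space = simple_postcode.match('e17 5rs')
print(with_space.groups())  # ('e17', '5rs')

no_space = simple_postcode.match('E175RS')
print(no_space.groups())  # ('E17', '5RS')

not_postcode = simple_postcode.match('not a postcode')
print(not_postcode)  # None
```

Compiling the pattern once with re.compile also means the flags travel with it, rather than being repeated at every call.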

Applications of Regular Expressions

If our problem follows some set of articulable rules about permissible sequences of characters then we can probably validate it using a regex:

Examples	More Examples
Email	Password
Postcode	Phone number
Date	Credit cards
Web scraping	Syntax highlighting
Sentence structure	Data wrangling
Searching for/within files/content	Lexical analysis / Language detection
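As a taste of the data wrangling case, here is a minimal sketch that tidies some messy strings and pulls out a number. The `rows` data is entirely made up for the example.

```python
import re

# Hypothetical messy input rows.
rows = ["  Flat 2,  12 High St ", "3   bed / 1 bath", "Price: £1,250 pcm"]

# Collapse runs of whitespace into one space and trim the ends.
clean = [re.sub(r'\s+', ' ', r).strip() for r in rows]
print(clean)
# ['Flat 2, 12 High St', '3 bed / 1 bath', 'Price: £1,250 pcm']

# Capture the digits and commas after the pound sign, then drop the commas.
m = re.search(r'£([\d,]+)', clean[2])
price = int(m.group(1).replace(',', ''))
print(price)  # 1250
```

The regex finds the right shape of text; ordinary string methods and type conversion then finish the job.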

Resources

Thanks to Yogesh Chavan and Nicola Pietroluongo for examples.