Jon Reades - j.reades@ucl.ac.uk
1st October 2025
Consider the following character sequences:
Index: 3
Split: ['123', '456']
Strip: 123 foo 456
Lower: how now brown cow?
Replace: How now green-cow?
See: dir(str) for full list of string methods.
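The outputs above could come from calls like the following. This is a minimal sketch: the input strings are assumptions chosen to reproduce the printed results, not necessarily the ones used originally.

```python
# Hypothetical inputs, picked so the output matches the lines shown above
s = '123foo456'

index = s.index('foo')            # position of the first 'foo'
split = s.split('foo')            # break the string wherever 'foo' occurs
strip = ' 123 foo 456 '.strip()   # trim leading and trailing whitespace
lower = 'HOW NOW BROWN COW?'.lower()
replace = 'How now brown cow?'.replace('brown ', 'green-')

print("Index:", index)
print("Split:", split)
print("Strip:", strip)
print("Lower:", lower)
print("Replace:", replace)
```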
Regular expressions (regexes) are a way of talking about patterns observed in text, although their origins are rooted in philosophy and linguistics.
Implemented in Python as:
For singular matches it’s fairly straightforward:
First search: None
Second search: <re.Match object; span=(3, 6), match='foo'>
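A sketch of two searches that would give these results. The string and patterns are assumed for illustration; the point is that `re.search` returns `None` when nothing matches and a match object otherwise.

```python
import re

s = '123foo456'  # assumed example string

first = re.search(r'FOO', s)    # case-sensitive by default, so no match
second = re.search(r'foo', s)   # 'foo' occupies positions 3 to 6

print("First search:", first)
print("Second search:", second)

# A failed search returns None, so test before using the match object
if second:
    print(second.group(), second.start(), second.end())
```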
The third parameter takes flags that change how the pattern is applied: let . match newlines (re.DOTALL), ignore case (re.IGNORECASE), take the current locale into account (re.LOCALE), let ^ and $ match at the start and end of every line (re.MULTILINE), and write patterns across multiple lines with comments (re.VERBOSE). If you need multiple options, combine them with |, as in re.DOTALL | re.IGNORECASE. Bitwise again!
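A short sketch of flags in action, using an assumed two-line string:

```python
import re

s = 'One FOO\ntwo foo'  # assumed two-line example string

# No flag: case-sensitive, so only the lower-case 'foo' is found
plain = re.search(r'foo', s)

# re.IGNORECASE: the upper-case 'FOO' now matches first
ignore = re.search(r'foo', s, re.IGNORECASE)

# Without re.DOTALL the '.' stops at the newline, so this fails
no_dotall = re.search(r'foo.+foo', s, re.IGNORECASE)

# Flags combine with bitwise OR
both = re.search(r'foo.+foo', s, re.DOTALL | re.IGNORECASE)

# re.MULTILINE: '^' matches at the start of every line
starts = re.findall(r'^\w+', s, re.MULTILINE)
```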
Match list: ['foo', 'foo']
List iterator: [<re.Match object; span=(3, 6), match='foo'>, <re.Match object; span=(9, 12), match='foo'>]
Substitution: 123 456 789
Splitting: ['123', '456', '789']
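For multiple matches, the four results above are consistent with calls like these. The input string is an assumption inferred from the match positions.

```python
import re

s = '123foo456foo789'  # assumed input, consistent with the spans shown above

match_list = re.findall(r'foo', s)       # list of matching strings
iterator = list(re.finditer(r'foo', s))  # match objects, with positions
substituted = re.sub(r'foo', ' ', s)     # replace every match
pieces = re.split(r'foo', s)             # split on every match

print("Match list:", match_list)
print("List iterator:", iterator)
print("Substitution:", substituted)
print("Splitting:", pieces)
```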
1,000,000
It breaks down like this:
<re.Match object; span=(14, 15), match='$'>
<re.Match object; span=(15, 17), match='1,'>
<re.Match object; span=(15, 25), match='1,000,000,'>
| Characters | Regex Meta Class Options | ‘Antonyms’ |
|---|---|---|
| a…z | [a-z], \w (word-like characters) | [^a-z], \W |
| A…Z | [A-Z], \w (word-like characters) | [^A-Z], \W |
| 0…9 | [0-9], \d (digits) | [^0-9], \D |
| ' ', \n, \t, \r, \f, \v | \s | \S |
| ., [, ], +, $, ^, \|, {, }, *, (, ), ? | For safety always precede character with a \. | None |
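A quick sketch of the classes and their ‘antonyms’ on assumed example text, plus `re.escape`, which adds the protective backslashes for you:

```python
import re

text = 'Room 101\tfloor 2'  # hypothetical example text

digits = re.findall(r'\d+', text)     # runs of digits
lower = re.findall(r'[a-z]+', text)   # runs of lower-case letters only
spaces = re.findall(r'\s', text)      # each whitespace character
non_space = re.findall(r'\S+', text)  # runs of anything except whitespace

# An unescaped '.' matches any character; '\.' matches only a full stop
any_char = re.findall(r'a.b', 'a.b axb')
literal = re.findall(r'a\.b', 'a.b axb')

# re.escape() backslashes the special characters in a literal string
escaped = re.escape('$1.50')
```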
| Metacharacter | Meaning | Example |
|---|---|---|
| . | Any character at all | c.t |
| ^ | Start of a string/line | ^start |
| $ | End of a string/line | end$ |
| * | 0 or more of something | -* |
| + | 1 or more of something | -+ |
| ? | 0 or 1 of something; also lazy modifier | ,? |
| {m,n} | Repeat between m and n times | \d{1,4} |
| [ ] | A set of character literals | [1-5] |
| ( ) | Group/remember this sequence of characters | (\d+) |
| \| | Or | (A\|B) |
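One possible way to try each metacharacter from the table; all input strings are made up for illustration:

```python
import re

any_char = re.findall(r'c.t', 'cat cot cut ct')    # '.' needs exactly one character
at_start = re.search(r'^start', 'start here') is not None
at_end = re.search(r'end$', 'the end') is not None
squeezed = re.sub(r'-+', '-', 'a---b-c')           # collapse runs of hyphens
optional = [bool(re.fullmatch(r'1,?000', t)) for t in ('1000', '1,000')]
bounded = re.findall(r'\d{1,4}', '7 2025 123456')  # at most 4 digits per match
in_set = re.findall(r'[1-5]', '0123456789')
either = re.findall(r'(A|B)', 'ABC')               # returns what the group captured

# '?' after a quantifier makes it lazy: match as little as possible
greedy = re.search(r'<.+>', '<a><b>').group()
lazy = re.search(r'<.+?>', '<a><b>').group()
```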
| Regex | Interpretation |
|---|---|
| r'\s*' | 0 or more whitespace characters |
| r'\d+' | 1 or more digits |
| r'[A-Fa-f0-9]{5}' | Exactly 5 hexadecimal ‘digits’ |
| r'\w+\.\d{2,}' | 1 or more ‘wordish’ characters, followed by a full-stop, then 2 or more digits |
| r'^[^@]+@\w+' | One or more non-@ characters at the start of a line, followed by a ‘@’ then 1 or more ‘wordish’ characters. |
| r'(uk\|eu\|fr)$' | The characters ‘uk’ or ‘eu’ or ‘fr’ at the end of a line. |
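A small sketch for checking interpretations like these: pair each pattern with one string that should match and one that should not. The test strings are assumptions.

```python
import re

# (pattern, text that should match, text that should not)
cases = [
    (r'\d+', 'abc123', 'abc'),
    (r'\w+\.\d{2,}', 'file.123', 'file.1'),
    (r'^[^@]+@\w+', 'j.reades@ucl', '@ucl'),
    (r'(uk|eu|fr)$', 'ucl.ac.uk', 'uk.example.com'),
]

results = []
for pattern, good, bad in cases:
    hit = re.search(pattern, good) is not None
    miss = re.search(pattern, bad) is not None
    results.append((hit, miss))
    print(pattern, hit, miss)
```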
Regex101 might be a fun way to learn:
re.VERBOSE to the rescue:
import re
regex = r"""
(GIR\s?0A{2})| # Girobank: the literal GIR 0AA
(
(
([A-Z][0-9]{1,2})| # e.g. A00...Z99
(
([A-Z][A-HJ-Y][0-9]{1,2})| # e.g. AB54...ZX11
(([A-Z][0-9][A-Z])| # e.g. A0B...Z9Z
([A-Z][A-HJ-Y][0-9][A-Z]?)) # e.g. WC1 or WC1H
)
)
\s?[0-9][A-Z]{2} # e.g. 5RX
)
"""
s = 'WC1E 6BT' # Example postcode to test (illustrative value)
re.match(regex,s,re.VERBOSE|re.IGNORECASE) # Can also use: re.X|re.I
If our problem follows some set of articulable rules about permissible sequences of characters then we can probably validate it using a regex:
| Examples | More Examples |
|---|---|
| Password | |
| Postcode | Phone number |
| Date | Credit cards |
| Web scraping | Syntax highlighting |
| Sentence structure | Data wrangling |
| Searching for/within files/content | Lexical analysis/Language detection |
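As a sketch of validation, here is the ‘Date’ case under an assumed rule that dates are written YYYY-MM-DD. The function name and the format are hypothetical, and the regex only checks the shape: it would still accept 2025-02-31.

```python
import re

# Assumed rule: four-digit year, month 01-12, day 01-31, separated by hyphens
DATE = re.compile(r'\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])')

def looks_like_date(text):
    """Return True if the whole of text has the shape YYYY-MM-DD."""
    return DATE.fullmatch(text) is not None

for candidate in ['2025-10-01', '2025-13-01', '01/10/2025']:
    print(candidate, looks_like_date(candidate))
```

Using `fullmatch` rather than `match` means trailing junk such as `2025-10-01abc` is rejected too.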
Thanks to Yogesh Chavan and Nicola Pietroluongo for examples.