Jon Reades - j.reades@ucl.ac.uk
1st October 2025
Consider the following character sequences:
Index: 3
Split: ['123', '456']
Strip: 123 foo 456
Lower: how now brown cow?
Replace: How now green-cow?
See dir(str) for the full list of string methods.
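The outputs listed above could come from calls like the ones below. This is only a sketch: the input strings are assumptions chosen to reproduce the printed values, since the original inputs are not shown.

```python
# Assumed inputs (not shown in the original); chosen to reproduce the outputs.
s = "123foo456"
q = "How now brown cow?"

index = s.index("foo")                    # position of the first 'foo'
parts = s.split("foo")                    # split on the literal 'foo'
stripped = "  123 foo 456  ".strip()      # drop leading/trailing whitespace
lowered = q.lower()                       # lower-case every character
replaced = q.replace("brown ", "green-")  # literal substring replacement

print(f"Index: {index}")
print(f"Split: {parts}")
print(f"Strip: {stripped}")
print(f"Lower: {lowered}")
print(f"Replace: {replaced}")
```

All of these work on literal text only, which is the limitation regexes address.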
Regular expressions (regexes) are a way of describing patterns observed in text, although their origins are rooted in philosophy and linguistics.
Implemented in Python as:
For singular matches it’s fairly straightforward:
First search: None
Second search: <re.Match object; span=(3, 6), match='foo'>
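The two results above are consistent with a case-sensitive search that fails and a case-insensitive one that succeeds. A minimal sketch, assuming the input string and pattern (neither is shown in the original):

```python
import re

s = "123foo456"  # assumed input

# Case-sensitive by default, so 'FOO' does not match 'foo'.
first = re.search(r"FOO", s)

# The third parameter carries flags; re.IGNORECASE makes the match succeed.
second = re.search(r"FOO", s, re.IGNORECASE)

print("First search:", first)
print("Second search:", second)
```

re.search returns None when nothing matches, so it is worth checking the result before calling methods such as .group() on it.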
The third parameter takes flags that allow us to: let . match newlines (re.DOTALL), ignore case (re.IGNORECASE), take locale into account (re.LOCALE), let ^ and $ match at the start and end of every line (re.MULTILINE), and write patterns across multiple lines with comments (re.VERBOSE). If you need multiple options, combine them with |, as in re.DOTALL | re.IGNORECASE. Bitwise again!
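To make the flags concrete, here is a small sketch on a made-up three-line string (the text and patterns are illustrative assumptions, not taken from the original):

```python
import re

text = "Foo\nbar\nFOO"  # hypothetical three-line input

no_flags = re.findall(r"foo", text)                   # case-sensitive: nothing
ignore_case = re.findall(r"foo", text, re.IGNORECASE)

# Without re.MULTILINE, ^ and $ only anchor to the whole string.
per_line = re.findall(r"^\w+$", text, re.MULTILINE)

# '.' does not match a newline unless re.DOTALL is set; flags combine with |.
without_dotall = re.search(r"foo.bar", text, re.IGNORECASE)
with_dotall = re.search(r"foo.bar", text, re.DOTALL | re.IGNORECASE)

print(no_flags, ignore_case, per_line)
print(without_dotall, with_dotall)
```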
Match list: ['foo', 'foo']
List iterator: [<re.Match object; span=(3, 6), match='foo'>, <re.Match object; span=(9, 12), match='foo'>]
Substitution: 123 456 789
Splitting: ['123', '456', '789']
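For multiple matches, the four results above line up with re.findall, re.finditer, re.sub and re.split. A sketch, assuming an input string that reproduces the printed values:

```python
import re

s = "123foo456foo789"  # assumed input

matches = re.findall(r"foo", s)           # list of matched strings
iterator = list(re.finditer(r"foo", s))   # Match objects, with positions
substituted = re.sub(r"foo", " ", s)      # replace every match
pieces = re.split(r"foo", s)              # split on every match

print(f"Match list: {matches}")
print(f"List iterator: {iterator}")
print(f"Substitution: {substituted}")
print(f"Splitting: {pieces}")
```

re.finditer is the one to reach for when the position of each match matters, since findall returns only the text.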
1,000,000
It breaks down like this:
<re.Match object; span=(14, 15), match='$'>
<re.Match object; span=(15, 17), match='1,'>
<re.Match object; span=(15, 25), match='1,000,000,'>
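The breakdown above can be reproduced step by step. In this sketch both the input sentence and the final combined pattern are assumptions, picked so that the spans agree with the ones printed above:

```python
import re

# Assumed input: the '$' sits at index 14, as in the spans above.
s = "'That will be $1,000,000, he said...'"

step1 = re.search(r"\$", s)           # a literal dollar sign (escaped)
step2 = re.search(r"(\d+,)", s)       # digits followed by a comma
step3 = re.search(r"(\d+,){2,}", s)   # that group repeated 2+ times

# Putting it together: '$', comma-separated groups, then the final digits.
m = re.search(r"\$((\d+,){2,}\d+)", s)
amount = m.group(1)

print(step1, step2, step3, sep="\n")
print(amount)
```

The outer parentheses capture the whole number without the dollar sign, which is why group(1) is used rather than group(0).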
Characters | Regex Meta Class Options | ‘Antonyms’ |
---|---|---|
a…z | [a-z], \w (word-like characters) | [^a-z], \W |
A…Z | [A-Z], \w (word-like characters) | [^A-Z], \W |
0…9 | [0-9], \d (digits) | [^0-9], \D |
' ', \n, \t, \r, \f, \v | \s | \S |
., [, ], +, $, ^, \|, {, }, *, (, ), ? | For safety always precede character with a \. | None |
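A short sketch of the classes in the table and their ‘antonyms’, using a made-up string (all inputs here are illustrative assumptions):

```python
import re

s = "Room 42b, Floor-3\tEast"  # hypothetical input

digits = re.findall(r"\d", s)          # every single digit
lower_runs = re.findall(r"[a-z]+", s)  # runs of lower-case letters
on_space = re.split(r"\s", s)          # split on any whitespace character
non_word = re.sub(r"\W", "_", s)       # replace anything that is not word-like

# An unescaped '.' matches any character; escaping makes it a literal dot.
unescaped = re.findall(r"1.5", "1.5 125")
escaped = re.findall(r"1\.5", "1.5 125")

print(digits, lower_runs, on_space, non_word, sep="\n")
print(unescaped, escaped)
```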
Metacharacter | Meaning | Example |
---|---|---|
. | Any character at all | c.t |
^ | Start of a string/line | ^start |
$ | End of a string/line | end$ |
* | 0 or more of something | -* |
+ | 1 or more of something | -+ |
? | 0 or 1 of something; also lazy modifier | ,? |
{m,n} | Repeat between m and n times | \d{1,4} |
[ ] | A set of character literals | [1-5] |
( ) | Group/remember this sequence of characters | (\d+) |
\| | Or | (A\|B) |
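The metacharacters in the table can be tried out one at a time. The strings below are invented for illustration:

```python
import re

any_char = re.findall(r"c.t", "cat cot cut ct")      # '.' needs one character
line_starts = re.findall(r"^\w+", "first line\nsecond line", re.MULTILINE)
repeats = re.findall(r"\d{1,4}", "7 2025 123456")    # at most 4 digits per match
either = re.findall(r"(A|B)\d", "A1 B2 C3")          # findall returns the group

# '?' after a quantifier makes it lazy: stop as early as possible.
greedy = re.search(r"<.+>", "<a><b>").group()
lazy = re.search(r"<.+?>", "<a><b>").group()

print(any_char, line_starts, repeats, either, sep="\n")
print(greedy, lazy)
```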
Regex | Interpretation |
---|---|
r'\s*' | 0 or more whitespace characters |
r'\d+' | 1 or more digits |
r'[A-Fa-f0-9]{5}' | Exactly 5 hexadecimal ‘digits’ |
r'\w+\.\d{2,}' | 1 or more ‘wordish’ characters, followed by a full-stop, then 2 or more digits |
r'^[^@]+@\w+' | One or more non-@ characters at the start of a line, followed by a ‘@’ then 1 or more ‘wordish’ characters |
r'(uk\|eu\|fr)$' | The characters ‘uk’ or ‘eu’ or ‘fr’ at the end of a line |
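Several of these patterns can be checked directly. The test strings, including the e-mail address, are hypothetical:

```python
import re

email = "name@example.co.uk"  # hypothetical address

wordish = re.search(r"\w+\.\d{2,}", "see file_v1.2025 now").group()
local_part = re.search(r"^[^@]+@\w+", email).group()
suffix = re.search(r"(uk|eu|fr)$", email).group()
no_suffix = re.search(r"(uk|eu|fr)$", "example.com")
numbers = re.findall(r"\d+", "10 Downing St, SW1A 2AA")

print(wordish, local_part, suffix, no_suffix, numbers, sep="\n")
```

Note that \w+ stops at the first full-stop, so the e-mail pattern only captures the first label of the domain.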
Regex101 might be a fun way to learn.
re.VERBOSE to the rescue:
import re

regex = r"""
(GIR\s?0A{2})|                       # Girobank: GIR 0AA
(
  (
    ([A-Z][0-9]{1,2})|               # e.g. A00...Z99
    (
      ([A-Z][A-HJ-Y][0-9]{1,2})|     # e.g. AB54...ZX11
      (([A-Z][0-9][A-Z])|            # e.g. A0B...Z9Z
      ([A-Z][A-HJ-Y][0-9][A-Z]?))    # e.g. WC1 or WC1H
    )
  )
  \s?[0-9][A-Z]{2}                   # e.g. 5RX
)
"""
s = "WC1H 5RX"  # example input (assumed); any candidate postcode string
re.match(regex, s, re.VERBOSE | re.IGNORECASE)  # Can also use: re.X|re.I
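One thing to watch with re.VERBOSE is that unescaped whitespace inside the pattern is ignored, so a space that must be matched has to be written as \s (or inside a character class). A small sketch using the Girobank postcode as the example:

```python
import re

# In verbose mode the space in the pattern is dropped, so this is 'GIR0AA'.
loose = re.compile(r"GIR 0AA", re.VERBOSE)

# Writing the whitespace explicitly keeps it part of the pattern.
strict = re.compile(r"GIR \s 0AA", re.VERBOSE)

loose_no_space = loose.fullmatch("GIR0AA") is not None
loose_with_space = loose.fullmatch("GIR 0AA") is not None
strict_with_space = strict.fullmatch("GIR 0AA") is not None

print(loose_no_space, loose_with_space, strict_with_space)
```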
If our problem follows some set of articulable rules about permissible sequences of characters then we can probably validate it using a regex:
Examples | More Examples |
---|---|
Password | |
Postcode | Phone number |
Date | Credit cards |
Web scraping | Syntax highlighting |
Sentence structure | Data wrangling |
Searching for/within files/content | Lexical analysis/Language detection |
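As a sketch of validation in practice, here are two deliberately simple patterns for a date and a phone number. The patterns, the helper name looks_valid, and the sample values are all assumptions for illustration; real-world rules are usually stricter.

```python
import re

# Hypothetical, simplified rules: ISO-style date and a loose phone format.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
PHONE = re.compile(r"\+?\d[\d ]{6,14}\d")

def looks_valid(pattern, text):
    """Return True if the whole string fits the pattern."""
    return pattern.fullmatch(text) is not None

good_date = looks_valid(DATE, "2025-10-01")
bad_date = looks_valid(DATE, "1st October 2025")
impossible_date = looks_valid(DATE, "2025-13-45")  # right shape, not a real date
good_phone = looks_valid(PHONE, "+44 20 7946 0958")
bad_phone = looks_valid(PHONE, "12345")

print(good_date, bad_date, impossible_date, good_phone, bad_phone)
```

The third case shows the limit of the approach: a regex checks that the characters follow the rules, not that the value itself makes sense.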
Thanks to Yogesh Chavan and Nicola Pietroluongo for examples.