Consider the following character sequences:
'123foo456'.index('foo') # 3
'123foo456'.split('foo') # ['123', '456']
' 123 foo 456 '.strip() # '123 foo 456'
'HOW NOW BROWN COW?'.lower() # 'how now brown cow?'
'How now brown cow?'.replace('brown ','green-')
# 'How now green-cow?'
See dir(str) for the full list of string methods.
Regexes are a way of talking about patterns observed in text, although their origins are rooted in philosophy and linguistics.
Implemented in Python as:
import re
# re.search(<regex>, <str>)
s = '123foo456'
if re.search('123',s):
    print("Found a match.")
else:
    print("No match.")
Prints 'Found a match.'
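The if test above works because re.search returns a match object on success and None on failure. A minimal sketch of what that object holds (the variable names here are illustrative, not from the slides):

```python
import re

s = '123foo456'

# A successful search returns a match object...
m = re.search(r'foo', s)
matched_text = m.group(0)   # the text that matched
start, end = m.span()       # where it matched within s
print(matched_text, start, end)

# ...while a failed search returns None, which is falsy in an if test.
no_match = re.search(r'bar', s)
print(no_match)
```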
The optional third parameter (flags) allows us to: let . match newlines (re.DOTALL), ignore case (re.IGNORECASE), take language into account (re.LOCALE), match across lines (re.MULTILINE), and write patterns across multiple lines (re.VERBOSE). If you need multiple options, combine them with a bitwise OR: re.DOTALL | re.IGNORECASE. Bitwise again!
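As a rough illustration of how the flags change behaviour, the sketch below runs the same pattern with and without them. The sample text and variable names are made up for the example.

```python
import re

text = "First line\nSECOND line\nthird line"

# No flags: '.' stops at a newline and matching is case-sensitive.
plain = re.search(r'first.*second', text)

# Flags are combined with the bitwise OR operator.
flagged = re.search(r'first.*second', text, re.DOTALL | re.IGNORECASE)

# re.MULTILINE lets ^ match at the start of every line, not just the string.
line_starts = re.findall(r'^\w+', text, re.MULTILINE)

print(plain)
print(flagged.group(0))
print(line_starts)
```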
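s = '123foo456foo789'
lst = re.findall('foo',s)
print(lst) # ['foo', 'foo']
lst = re.finditer('foo',s)
print([x for x in lst]) # Two match objects, with spans (3, 6) and (9, 12)
rs = re.sub('foo',' ',s)
print(rs) # 123 456 789
rs = re.split(' ',rs)
print(rs) # ['123', '456', '789']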
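Two details of these functions are worth a short sketch: finditer yields match objects (so positions are available as well as text), and re.split takes a pattern directly, so the sub-then-split step is optional. Variable names are illustrative.

```python
import re

s = '123foo456foo789'

# Each item from finditer is a match object with start/end positions.
positions = [(m.start(), m.end()) for m in re.finditer('foo', s)]

# Splitting on the pattern itself avoids the intermediate re.sub call.
parts = re.split('foo', s)

print(positions)
print(parts)
```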
import re
m = re.search(r'\$((\d+,){2,}\d+)',
              "'That will be $1,000,000 he said...'")
print(m.group(1)) # '1,000,000'
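This looks for a dollar sign, then sequences of 1-or-more digits followed by a comma, with those sequences repeating two or more times, and then a final run of digits. The outer parentheses capture the whole number without the dollar sign.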
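To see what each pair of parentheses captures, it can help to print every group. This is a small sketch using the same pattern; the conversion to an integer at the end is an added illustration, not part of the original example.

```python
import re

text = "'That will be $1,000,000 he said...'"
m = re.search(r'\$((\d+,){2,}\d+)', text)

whole = m.group(0)     # the entire match, including the dollar sign
amount = m.group(1)    # outer group: the number with its commas
last_rep = m.group(2)  # inner group: a repeated group keeps only its LAST repetition

# Strip the commas to get a usable number.
value = int(amount.replace(',', ''))

print(whole, amount, last_rep, value)
```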
Characters | Regex Meta Class Options | ‘Antonyms’ |
---|---|---|
a…z | [a-z], \w (word-like characters) | [^a-z], \W |
A…Z | [A-Z], \w (word-like characters) | [^A-Z], \W |
0…9 | [0-9], \d (digits) | [^0-9], \D |
' ', \n, \t, \r, \f, \v | \s | \S |
. , [ , ] , + , $ , ^ , \| , { , } , * , ( , ) , ? | For safety, always precede the character with a \ . | None |
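The classes in the table can be tried out directly. The sketch below uses a made-up sample string; note that \w is broader than a single letter range, since it also matches digits and the underscore. For literal metacharacters, re.escape adds the backslashes for you.

```python
import re

s = "Room 42b,\tFloor 3\n"

digit_runs = re.findall(r'\d+', s)   # runs of digits
words = re.findall(r'\w+', s)        # runs of 'wordish' characters
spaces = re.findall(r'\s', s)        # individual whitespace characters

# \w covers letters, digits, and the underscore, but not the hyphen.
wordish = re.findall(r'\w', 'a_1-')

# Unescaped, '$' means end-of-string, so this pattern can never match.
unescaped = re.search(r'$1.50', 'costs $1.50 now')
# re.escape backslash-escapes the metacharacters so they match literally.
escaped = re.search(re.escape('$1.50'), 'costs $1.50 now')

print(digit_runs, words, spaces, wordish)
```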
Metacharacter | Meaning | Example |
---|---|---|
. | Any character at all | c.t |
^ | Start of a string/line | ^start |
$ | End of a string/line | end$ |
* | 0 or more of something | -* |
+ | 1 or more of something | -+ |
? | 0 or 1 of something; also lazy modifier | ,? |
{m,n} | Repeat between m and n times | \d{1,4} |
[ ] | A set of character literals | [1-5] |
( ) | Group/remember this sequence of characters | (\d+) |
\| | Or | (A\|B) |
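A few of these metacharacters are easier to understand by comparison. The sketch below contrasts greedy and lazy matching (the second role of ? in the table) and shows anchors, bounded repetition, and alternation on invented sample strings.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ grabs as much as possible, up to the LAST '>'.
greedy = re.findall(r'<.+>', html)
# Lazy: .+? grabs as little as possible, stopping at the FIRST '>'.
lazy = re.findall(r'<.+?>', html)

# ? as "0 or 1 of something".
optional = re.findall(r'colou?r', 'color colour')

# {m,n} takes up to n repetitions at a time.
bounded = re.findall(r'\d{1,4}', '7 2024 123456')

# Anchors tie the pattern to the start or end of the string.
at_start = re.search(r'^start', 'restart')
at_end = re.search(r'end$', 'the end')

# Alternation inside a group.
either = re.findall(r'(A|B)', 'ABC')

print(greedy, lazy, optional, bounded, either)
```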
Regex | Interpretation |
---|---|
r'\s*' | 0 or more whitespace characters |
r'\d+' | 1 or more digits |
r'[A-Fa-f0-9]{5}' | Exactly 5 hexadecimal ‘digits’ |
r'\w+\.\d{2,}' | 1 or more ‘wordish’ characters, followed by a full-stop, then 2 or more digits |
r'^[^@]+@\w+' | One or more non-@ characters at the start of a line, followed by a ‘@’ then 1 or more ‘wordish’ characters. |
r'(uk\|eu\|fr)$' | The characters ‘uk’ or ‘eu’ or ‘fr’ at the end of a line. |
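Some of these interpretations can be checked against sample strings. The inputs below are invented for illustration; note in particular that \w+ stops at the first full-stop, so the @ pattern matches only part of an address.

```python
import re

# 'wordish' characters, a full-stop, then at least two digits.
two_digits = re.search(r'\w+\.\d{2,}', 'file.2024')
one_digit = re.search(r'\w+\.\d{2,}', 'v1.5')   # only one digit after the dot

# \w does not include '.', so the match ends before '.org'.
address = re.search(r'^[^@]+@\w+', 'user@example.org')

# One of three endings, anchored to the end of the string.
uk_domain = re.search(r'(uk|eu|fr)$', 'www.example.co.uk')
com_domain = re.search(r'(uk|eu|fr)$', 'www.example.com')

print(two_digits.group(0), address.group(0), uk_domain.group(1))
```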
Regex101 can be a useful way to build a regex interactively.
re.VERBOSE to the rescue:
regex = r"""
(GIR\s?0A{2})|                            # Girobank (GIR 0AA)
(
  (
    ([A-Z][0-9]{1,2})|                    # e.g. A00...Z99
    (
      ([A-Z][A-HJ-Y][0-9]{1,2})|          # e.g. AB54...ZX11
      (([A-Z][0-9][A-Z])|                 # e.g. A0B...Z9Z
       ([A-Z][A-HJ-Y][0-9][A-Z]?))        # e.g. WC1 or WC1H
    )
  )
  \s?[0-9][A-Z]{2}                        # e.g. 5RX
)
"""
re.match(regex,s,re.VERBOSE|re.IGNORECASE) # Can also use: re.X|re.I
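One thing to watch in verbose mode is that whitespace inside the pattern is ignored, so a literal space has to be written as \s (or escaped). A smaller sketch of the same technique, using a made-up date pattern rather than the postcode regex:

```python
import re

# Whitespace and '#' comments in the pattern are ignored under re.VERBOSE.
date_pattern = re.compile(r"""
    (\d{4})   # year
    -         # literal hyphen
    (\d{2})   # month
    -         # literal hyphen
    (\d{2})   # day
""", re.VERBOSE)

year, month, day = date_pattern.match('2024-03-15').groups()

# A bare space in a verbose pattern is dropped, so this means 'helloworld'.
bare = re.compile(r"hello world", re.VERBOSE)
# Writing \s keeps the whitespace as part of the pattern.
spaced = re.compile(r"hello \s world", re.VERBOSE)

print(year, month, day)
print(bare.match('hello world'), spaced.match('hello world'))
```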
If our problem follows some set of articulable rules about permissible sequences of characters then we can probably validate it using a regex:
Examples | More Examples |
---|---|
Password | |
Postcode | Phone number |
Date | Credit cards |
Web scraping | Syntax highlighting |
Sentence structure | Data wrangling |
Searching for/within files/content | Lexical analysis/Language detection |
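As one possible illustration of validation, the sketch below checks a password against a set of rules expressed as regexes. The rule set and the function name looks_acceptable are hypothetical, chosen only for the example.

```python
import re

def looks_acceptable(password):
    """Hypothetical rule set: 8+ characters, with at least one
    uppercase letter, one lowercase letter, and one digit."""
    checks = [
        r'.{8,}',   # at least 8 characters in a row
        r'[A-Z]',   # an uppercase letter somewhere
        r'[a-z]',   # a lowercase letter somewhere
        r'\d',      # a digit somewhere
    ]
    # Every rule must find a match for the password to pass.
    return all(re.search(check, password) for check in checks)

print(looks_acceptable('Tr0ubadour'))   # True
print(looks_acceptable('password'))     # False: no uppercase letter or digit
```

Keeping each rule as its own small pattern tends to be easier to read and debug than one large combined regex.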
Thanks to Yogesh Chavan and Nicola Pietroluongo for examples.
Patterns in Text • Jon Reades