Cleaning Text

Jon Reades - j.reades@ucl.ac.uk

1st October 2025

Bye, bye, for loop!

When a Loop Is Not Best

If you need to apply the same operation to lots of data, why do it sequentially?

  • Your computer has many cores and can run many threads in parallel.
  • The computer divides the work across the threads as it sees fit.
  • The computer then reassembles the answers from the threads at the end.

If you have 4 cores, then perfect parallelisation can cut analysis time by up to 75% (the work runs in a quarter of the time).
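One way to put those cores to work from Python is the standard library’s multiprocessing module; a minimal sketch (the function and data are purely illustrative):

from multiprocessing import Pool

def slow_op(x):
    # Stand-in for any expensive per-item operation
    return x * x

if __name__ == '__main__':
    with Pool(4) as pool:  # one worker process per core
        results = pool.map(slow_op, range(1_000_000))  # split, compute, reassemble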

So Do More with Each Clock Cycle

We can tackle this problem in multiple ways:

  • Implicit: many libraries/packages implement weak forms of vectorisation (or parallelisation).
  • Explicit: you must request it because it requires hardware or other support and is highly optimised.
  • Massive: you must commission it because it requires specialist hardware and software.

Pandas.apply() vs. Numpy

Numpy is fully vectorised and will almost always outperform Pandas’ apply, but both are massive improvements on for loops:

  • Execute row-wise and column-wise operations.
  • Apply any arbitrary function to individual elements or whole axes (i.e. row or col).
  • Can make use of lambda functions too for ‘one-off’ operations (ad-hoc functions).
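For example, with a made-up DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(5, 3), columns=['a', 'b', 'c'])

print(df.apply(np.mean, axis=0))  # column-wise: the mean of each column
print(df.apply(np.mean, axis=1))  # row-wise: the mean of each row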

Lambda Functions

Functional equivalent of list comprehensions: 1-line, anonymous functions.

For example:

x = lambda a : a + 10
print(x(5))
15

Or:

full_name = lambda first, last: f'Full name: {first.title()} {last.title()}'
print(full_name('guido', 'van rossum'))
Full name: Guido Van Rossum

These are very useful with pandas.
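For instance, applying a ‘one-off’ transformation to a whole column (the DataFrame is made up):

import pandas as pd

df = pd.DataFrame({'name': ['guido van rossum', 'ada lovelace']})
df['name'] = df['name'].apply(lambda n: n.title())  # title-case every value
print(df['name'].tolist())
['Guido Van Rossum', 'Ada Lovelace']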

Let’s Compare

import time
import numpy as np
def func(a, b):
    c = 0
    for i in range(len(a)):
        c += a[i] * b[i]
    return c

a = np.random.rand(1000000)
b = np.random.rand(1000000)
t1 = time.time()
func(a, b)    # dot product via the for loop
t2 = time.time()
np.dot(a, b)  # the same dot product, vectorised
t3 = time.time()

print(f"For loop took {(t2-t1)*1000:.0f} milliseconds")
print(f"Numpy took {(t3-t2)*1000:.0f} milliseconds")
For loop took 207 milliseconds
Numpy took 2 milliseconds

Timings vary from run to run (you would need to repeat this hundreds of times for a stable average), but in general numpy is orders of magnitude faster.

Dealing with Structured Text

Beautiful Soup & Selenium

Two stages to working with web-based documents:

  1. Accessing the document: urllib can deal with many issues (even authentication), but not with dynamic web pages (which are increasingly common); for those, you need Selenium (a library plus a browser driver).
  2. Processing the document: simple data can be extracted from web pages with Regular Expressions, but complex (esp. dynamic) content cannot; for that, you need BeautifulSoup4.

These interact with wider issues of Fair Use (e.g. rate limits and licences); processing pipelines (e.g. saving WARCs or just the text file, multiple stages, etc.); and other practical constraints.
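As a minimal sketch of both stages for a static page (the URL is illustrative; a dynamic page would need Selenium instead of urllib):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://example.com/').read()  # 1. access the document
soup = BeautifulSoup(html, 'html.parser')      # 2. process the document
print(soup.title.string)                       # the page's <title> text
for p in soup.find_all('p'):                   # every paragraph element
    print(p.get_text())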

Regular Expressions / Breaks

Need to look at how the data is organised:

  • For very large corpora, you might want one document at a time (batch).
  • For very large files, you might want one line at a time (streaming).
  • For large files in large corpora, you might want more than one ‘machine’.
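As a sketch of the streaming case (the filename and pattern are purely illustrative):

import re

pattern = re.compile(r'\bLondon\b')  # compile once, reuse on every line
with open('corpus.txt', encoding='utf-8') as f:
    for line in f:                   # only one line in memory at a time
        if pattern.search(line):
            print(line.strip())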

Managing Vocabularies

Starting Points

These strategies can be used singly or all together:

  • Stopwords
  • Case
  • Accent-stripping
  • Punctuation
  • Numbers

But these are just a starting point!
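A sketch combining several of these steps (assumes nltk.download('stopwords') has been run; the sample sentence is made up):

import re
import string
import unicodedata
from nltk.corpus import stopwords

stops = set(stopwords.words('english'))

def clean(text):
    text = text.lower()                                    # case
    text = unicodedata.normalize('NFKD', text).encode(
        'ascii', 'ignore').decode('ascii')                 # accent-stripping
    text = text.translate(
        str.maketrans('', '', string.punctuation))         # punctuation
    text = re.sub(r'\d+', '', text)                        # numbers
    return [w for w in text.split() if w not in stops]     # stopwords

print(clean("Café visits rose 15% in 2024, didn't they?"))
['cafe', 'visits', 'rose', 'didnt']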

Distributional Pruning

We can prune from both ends of the distribution:

  • Overly rare words: what does a word used in one document help us to understand about a corpus?
  • Overly common ones: what does a word used in every document help us to understand about a corpus?
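In practice this can be a one-liner: scikit-learn’s CountVectorizer, for instance, takes min_df and max_df thresholds (the values and toy corpus here are illustrative):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]
vectoriser = CountVectorizer(
    min_df=2,    # drop words found in fewer than 2 documents (too rare)
    max_df=0.9   # drop words found in over 90% of documents (too common)
)
X = vectoriser.fit_transform(docs)
print(vectoriser.get_feature_names_out())
['cat' 'sat']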

Stemming & Lemmatisation

Different Approaches

Humans use a lot of words/concepts:

  • Stemming: rules-based truncation to a stem (can be augmented by language awareness).
  • Lemmatisation: usually dictionary-based ‘deduplication’ to a lemma (can be augmented by POS-tagging).

Different Outcomes

Source       Porter    Snowball   Lemmatisation
monkeys      monkey    monkey     monkey
cities       citi      citi       city
complexity   complex   complex    complexity
Reades       read      read       Reades

from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
wnl = WordNetLemmatizer()

for w in ['monkeys', 'cities', 'complexity', 'Reades']:
    print(f"Porter: {porter.stem(w)}")
    print(f"Snowball: {snowball.stem(w)}")
    print(f"Lemmatisation: {wnl.lemmatize(w)}")


Thank You
