Cleaning Text

Jon Reades

Bye, bye, for loop!

When a Loop Is Not Best

If you need to apply the same operation to lots of data, why do it sequentially?

  • Your computer has many cores and can run many threads in parallel.
  • The computer divides the work across the threads as it sees fit.
  • The computer reassembles the answers from the threads at the end.

If you have 4 cores, then perfect parallelisation cuts analysis time by up to 75% (in practice, overheads mean the gain is usually smaller).
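As a minimal sketch of the idea (in CPython, CPU-bound work is usually spread across processes rather than threads because of the GIL; square() is just a stand-in for real work):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    # Divide the work across 4 worker processes...
    with Pool(4) as pool:
        results = pool.map(square, range(10))
    # ...and reassemble the answers, in order, at the end
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]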

So Do More with Each Clock Cycle

  • Many libraries/packages implement weak forms of vectorisation or parallelisation, but some go much further.
  • You must request it explicitly because it requires hardware or other support and is highly optimised.
  • For example: multiple separate machines acting as one.
  • Or: multiple GPUs acting as one.

Pandas.apply() vs. Numpy

Numpy is fully vectorised and will almost always outperform pandas' apply, but both are massive improvements on for loops:

  • Execute row-wise and column-wise operations.
  • Apply any arbitrary function to individual elements or whole axes (i.e. rows or columns).
  • Can make use of lambda functions too, for 'one-off' (ad hoc) operations.

Lambda Functions

Functional equivalent of list comprehensions: 1-line, anonymous functions.

For example:

x = lambda a : a + 10
print(x(5)) # 15

Or:

full_name = lambda first, last: f'Full name: {first.title()} {last.title()}'
print(full_name('guido', 'van rossum')) # 'Full name: Guido Van Rossum'

These are very useful with pandas.
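For example, applying a lambda across a (made-up) DataFrame:

import pandas as pd

df = pd.DataFrame({'first': ['guido', 'ada'], 'last': ['van rossum', 'lovelace']})
# Element-wise on a single column...
df['surname'] = df['last'].apply(lambda s: s.title())
# ...or row-wise across the whole frame (axis=1)
df['full'] = df.apply(lambda row: f"{row['first'].title()} {row['last'].title()}", axis=1)
print(df['full'].tolist())  # ['Guido Van Rossum', 'Ada Lovelace']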

Let’s Compare

import time
import numpy as np

def func(a, b):
    # Naive dot product: a pure-Python for loop
    c = 0
    for i in range(len(a)):
        c += a[i] * b[i]
    return c

a = np.random.rand(1000000)
b = np.random.rand(1000000)

t1 = time.time()
print(func(a, b))    # loop version
t2 = time.time()
print(np.dot(a, b))  # vectorised version
t3 = time.time()

print(f"For loop took {(t2-t1)*1000:.0f} milliseconds")
print(f"Numpy took {(t3-t2)*1000:.0f} milliseconds")

Generally, I get numpy taking 86ms, while the for loop takes 331ms!

Dealing with Structured Text

Beautiful Soup & Selenium

Two stages to acquiring web-based documents:

  1. Accessing the document: urllib can deal with many issues (even authentication), but not with dynamic web pages (which are increasingly common); for those, you need Selenium (a library plus a browser driver).
  2. Processing the document: simple data can be extracted from web pages with regular expressions, but not complex (esp. dynamic) content; for that, you need BeautifulSoup4.

These interact with wider issues of Fair Use (e.g. rate limits and licenses); processing pipelines (e.g. saving WARCs or just the text file, multiple stages, etc.); and other practical constraints.
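A minimal sketch of the two stages for a static page (the URL is a placeholder; a dynamic page would need Selenium for stage 1):

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Stage 1: access the document
html = urlopen('https://example.com/').read()

# Stage 2: process it with BeautifulSoup4
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)      # the page's <title>
for p in soup.find_all('p'):  # every paragraph element
    print(p.get_text())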

Regular Expressions / Breaks

Need to look at how the data is organised:

  • For very large corpora, you might want one document at a time (batch).
  • For very large files, you might want one line at a time (streaming).
  • For large files in large corpora, you might want more than one ‘machine’.
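For instance, the streaming case might look like this (corpus.txt is a hypothetical file):

import re

# Read one line at a time instead of loading the whole file into memory
with open('corpus.txt') as f:
    for line in f:
        tokens = re.findall(r"[a-z']+", line.lower())
        # ...process the tokens for this line...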

Managing Vocabularies

Starting Points

These strategies can be used singly or all together:

  • Stopwords
  • Case
  • Accent-stripping
  • Punctuation
  • Numbers

But these are just a starting point!
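A sketch combining several of these strategies (assumes NLTK's stopwords corpus has been downloaded):

import re
import unicodedata
from nltk.corpus import stopwords

stops = set(stopwords.words('english'))

def normalise(text):
    text = text.lower()                       # case
    text = unicodedata.normalize('NFKD', text).encode(
        'ascii', 'ignore').decode()           # accent-stripping
    tokens = re.findall(r'[a-z]+', text)      # drops punctuation and numbers
    return [t for t in tokens if t not in stops]  # stopwords

print(normalise('Cafés sold 42 coffees to THE visitors!'))
# ['cafes', 'sold', 'coffees', 'visitors']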

Distributional Pruning

We can prune from both ends of the distribution:

  • Overly rare words: what does a word used in one document help us to understand about a corpus?
  • Overly common ones: what does a word used in every document help us to understand about a corpus?
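One common implementation uses document-frequency thresholds, e.g. scikit-learn's min_df and max_df (a sketch; the thresholds themselves are a judgement call):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['the cat sat', 'the dog sat', 'the cat ran']
# min_df=2 drops terms found in fewer than 2 documents (overly rare);
# max_df=0.9 drops terms found in more than 90% of documents (overly common)
vec = CountVectorizer(min_df=2, max_df=0.9)
vec.fit(docs)
print(vec.get_feature_names_out())  # ['cat' 'sat']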

Stemming & Lemmatisation

Different Approaches

Humans use a lot of words/concepts:

  • Stemming: rules-based truncation to a stem (can be augmented by language awareness).
  • Lemmatisation: usually dictionary-based ‘deduplication’ to a lemma (can be augmented by POS-tagging).

Different Outcomes

Source       Porter    Snowball   Lemmatisation
monkeys      monkey    monkey     monkey
cities       citi      citi       city
complexity   complex   complex    complexity
Reades       read      read       Reades

Generated with:

from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
for w in ['monkeys', 'cities', 'complexity', 'Reades']:
    print(f"Source: {w}")
    print(f"Porter: {PorterStemmer().stem(w)}")
    print(f"Snowball: {SnowballStemmer('english').stem(w)}")
    print(f"Lemmatisation: {wnl.lemmatize(w)}")

Resources