If you need to apply the same operation to lots of data, why do it sequentially? With 4 cores, parallelisation can cut analysis time by up to 75% (the ideal case; in practice coordination overhead takes a share).
NumPy is fully vectorised and will almost always out-perform Pandas' `apply`, but both are massive improvements on `for` loops.
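A minimal sketch of that difference (the column name and the operation are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.rand(100_000)})

# Element-wise apply: one Python-level function call per value
slow = df['x'].apply(lambda v: v * 2 + 1)

# Vectorised NumPy: one call operating on the whole array at once
fast = df['x'].to_numpy() * 2 + 1

print(np.allclose(slow.to_numpy(), fast))  # True
```

Both produce identical results; the vectorised version simply pushes the loop down into compiled code.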
We can also use `lambda` functions for 'one-off' (ad hoc) operations. They are the functional equivalent of list comprehensions: one-line, anonymous functions.
For example:
full_name = lambda first, last: f'Full name: {first.title()} {last.title()}'
print(full_name('guido', 'van rossum')) # 'Full name: Guido Van Rossum'
These are very useful with pandas.
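For instance, a one-off lambda can build a formatted column row by row; the DataFrame contents here are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({'first': ['guido', 'ada'],
                   'last':  ['van rossum', 'lovelace']})

# A throwaway lambda applied across rows (axis=1) to combine columns
df['full'] = df.apply(
    lambda row: f"{row['first'].title()} {row['last'].title()}", axis=1)

print(df['full'].tolist())  # ['Guido Van Rossum', 'Ada Lovelace']
```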
import time
import numpy as np

def func(a, b):
    # Pure-Python dot product: one multiply-add per iteration
    c = 0
    for i in range(len(a)):
        c += a[i] * b[i]
    return c

a = np.random.rand(1000000)
b = np.random.rand(1000000)

t1 = time.time()
print(func(a, b))
t2 = time.time()
print(np.dot(a, b))  # vectorised equivalent
t3 = time.time()

print(f"For loop took {(t2-t1)*1000:.0f} milliseconds")
print(f"Numpy took {(t3-t2)*1000:.0f} milliseconds")
Generally, I get numpy taking 86ms, while the for loop takes 331ms!
Two stages to acquiring web-based documents:
urllib can deal with many issues (even authentication), but not with dynamic web pages, which are increasingly common; for those you need Selenium (a library plus a browser driver). These tools also interact with wider issues of Fair Use (e.g. rate limits and licenses), processing pipelines (e.g. saving WARCs or just the text file, multiple stages, etc.), and other practical constraints.
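A minimal urllib fetch might look like this; the URL and User-Agent header are placeholders, and a real pipeline would add error handling and rate limiting:

```python
from urllib.request import Request, urlopen

# Placeholder URL; swap in the page you actually need
url = 'https://www.example.com/'

# Some servers reject requests that lack a User-Agent header
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
with urlopen(req, timeout=10) as response:
    html = response.read().decode('utf-8', errors='replace')

print(html[:60])  # first few characters of the raw HTML
```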
Need to look at how the data is organised:
These strategies can be used singly or all together:
But these are just a starting point!
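As a starting-point sketch, a common combination of cleaning strategies (lower-casing, stripping punctuation and digits, dropping stopwords) might look like this; the stopword list is purely illustrative, and in practice you would use a fuller one (e.g. from NLTK):

```python
import re

# Illustrative stopword list only
stopwords = {'the', 'a', 'of', 'and', 'to', 'is'}

def clean(text):
    text = text.lower()                    # normalise case
    text = re.sub(r'[^a-z\s]', ' ', text)  # strip punctuation and digits
    tokens = text.split()                  # simple whitespace tokenising
    return [t for t in tokens if t not in stopwords]

print(clean("The 2 monkeys, and the City!"))  # ['monkeys', 'city']
```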
We can prune from both ends of the distribution:
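Pruning by document frequency can be sketched in pure Python; the toy corpus and the thresholds (drop terms in every document or in only one) are made up for illustration:

```python
from collections import Counter

docs = [
    ['the', 'monkey', 'eats', 'a', 'banana'],
    ['the', 'city', 'is', 'busy'],
    ['the', 'monkey', 'and', 'the', 'city'],
]

# Count how many documents each term appears in
doc_freq = Counter(t for doc in docs for t in set(doc))

# Prune both ends: terms in every document are too common to
# discriminate; terms in only one document are too rare to generalise
n = len(docs)
vocab = {t for t, c in doc_freq.items() if 1 < c < n}
print(sorted(vocab))  # ['city', 'monkey']
```

Libraries such as scikit-learn expose the same idea via `min_df`/`max_df` parameters on their vectorisers.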
Humans use a lot of words/concepts:
| Source | Porter | Snowball | Lemmatisation |
|---|---|---|---|
| monkeys | monkey | monkey | monkey |
| cities | citi | citi | city |
| complexity | complex | complex | complexity |
| Reades | read | read | Reades |
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer

# Instantiate once, rather than on every loop iteration
porter = PorterStemmer()
snowball = SnowballStemmer('english')
wnl = WordNetLemmatizer()

for w in ['monkeys', 'cities', 'complexity', 'Reades']:
    print(f"Porter: {porter.stem(w)}")
    print(f"Snowball: {snowball.stem(w)}")
    print(f"Lemmatisation: {wnl.lemmatize(w)}")
Cleaning Text • Jon Reades