Analysing Text

Jon Reades

Dummy Variables

The concept of ‘dummy variables’ in economics/regression is a useful point to start thinking about text:

| Topic         | Dummy |
|---------------|-------|
| News          | 0     |
| Culture       | 1     |
| Politics      | 2     |
| Entertainment | 3     |
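
A minimal sketch of how such codes can be assigned with pandas (the library and the labels here are illustrative, not from the slides):

import pandas as pd

# A hypothetical set of article labels
topics = pd.Series(['News', 'Culture', 'Politics', 'Entertainment', 'News'])

# Each topic is replaced by a single integer code (assigned alphabetically)
print(topics.astype('category').cat.codes.tolist())  # [2, 0, 3, 1, 2]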

One-Hot Encoders

| Document           | UK | Top | Pop | Coronavirus |
|--------------------|----|-----|-----|-------------|
| News item          | 1  | 1   | 0   | 1           |
| Culture item       | 0  | 1   | 1   | 0           |
| Politics item      | 1  | 0   | 0   | 1           |
| Entertainment item | 1  | 1   | 1   | 1           |
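
A minimal sketch using scikit-learn’s CountVectorizer with binary=True as the one-hot encoder (the two documents are invented stand-ins):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['UK tops coronavirus news',    # invented stand-ins for real articles
        'top pop culture round-up']

onehot = CountVectorizer(binary=True)  # 1 if a term appears in a document, else 0
print(onehot.fit_transform(docs).toarray())
print(onehot.get_feature_names_out())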

The ‘Bag of Words’

Just like the one-hot (binarised) approach on the preceding slide, but now we count occurrences:

| Document           | UK | Top | Pop | Coronavirus |
|--------------------|----|-----|-----|-------------|
| News item          | 4  | 2   | 0   | 6           |
| Culture item       | 0  | 4   | 7   | 0           |
| Politics item      | 3  | 0   | 0   | 3           |
| Entertainment item | 3  | 4   | 8   | 1           |

BoW in Practice

Enter, stage left, scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer

# Example corpora (any lists of strings will do)
texts  = ['the cat sat on the mat', 'the cat sat on the fluffy mat']
texts1 = ['the fluffy ginger cat']
texts2 = ['the cat sat on the ginger mat']

vectorizer = CountVectorizer()

# One-step form: fit the vocabulary and transform in a single call
vectors = vectorizer.fit_transform(texts)

# Reusable form: fit the vocabulary once, then apply it to any corpus
vectorizer.fit(texts)
vectors1 = vectorizer.transform(texts1)
vectors2 = vectorizer.transform(texts2)

print(f'Vocabulary: {vectorizer.vocabulary_}')
print(f'All vectors: {vectors.toarray()}')

TF/IDF

Builds on Count Vectorisation by weighting each term’s count in a document by the inverse of that term’s frequency across the whole corpus, so words that are common across many documents receive a large penalty:

\[ W(t,d) = TF(t,d) \times \log(N/DF_{t}) \]

For example:

  • If the term ‘cat’ appears 3 times in a document of 100 words then the Term Frequency is \(TF(t,d)=3/100=0.03\), and
  • If there are 10,000 documents and ‘cat’ appears in 1,000 of them then \(N/DF_{t}=10,000/1,000=10\), so the Inverse Document Frequency is \(\log(10)=1\),
  • So IDF = 1 and the TF/IDF weight is \(0.03 \times 1 = 0.03\) (reproduced in code below).
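
The same arithmetic in Python, using a base-10 log as in the worked example:

import math

tf  = 3 / 100                      # term frequency of 'cat' in the document
idf = math.log10(10_000 / 1_000)   # inverse document frequency: log(10) = 1
print(tf * idf)                    # 0.03

Note that scikit-learn’s TfidfVectorizer (next slide) uses a natural log with smoothing by default, so its weights will differ slightly from this hand calculation.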

TF/IDF in Practice

from sklearn.feature_extraction.text import TfidfVectorizer

# Example corpus (any list of strings will do)
texts = ['the cat sat on the mat', 'the cat sat on the fluffy mat']

vectorizer = TfidfVectorizer()

# One-step form: fit the vocabulary and transform in a single call
vectors = vectorizer.fit_transform(texts)

# Reusable form: fit the vocabulary once, then transform any corpus
vectorizer.fit(texts)
vectors = vectorizer.transform(texts)

print(f'Vocabulary: {vectorizer.vocabulary_}')
print(f'Full vectors: {vectors.toarray()}')

Term Co-Occurrence Matrix (TCM)

Three input texts, with co-occurrences counted within a window of two words and weighted by \(1/d\), where \(d < 3\) is the distance between terms: adjacent terms score 1.0 and terms two words apart score 0.5. Blank cells in the table are zero, and a short script reproducing the weighting follows it:

  • the cat sat on the mat
  • the cat sat on the fluffy mat
  • the fluffy ginger cat sat on the mat

|        | fluffy | mat | ginger | sat | on  | cat | the |
|--------|--------|-----|--------|-----|-----|-----|-----|
| fluffy |        | 1.0 | 1.0    |     | 0.5 | 0.5 | 2.0 |
| mat    |        |     |        |     | 1.0 |     | 2.5 |
| ginger |        |     |        | 0.5 |     | 1.0 | 0.5 |
| sat    |        |     |        |     | 3.0 | 3.0 | 2.5 |
| on     |        |     |        |     |     | 1.5 | 3.0 |
| cat    |        |     |        |     |     |     | 2.0 |
| the    |        |     |        |     |     |     |     |
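
A minimal sketch of the same weighting in pure Python (illustrative only; this is not how you would build a TCM at scale):

from collections import defaultdict
from itertools import combinations

texts = ['the cat sat on the mat',
         'the cat sat on the fluffy mat',
         'the fluffy ginger cat sat on the mat']

# Count co-occurrences within a window of two words, weighted by 1/d
tcm = defaultdict(float)
for text in texts:
    tokens = text.split()
    for i, j in combinations(range(len(tokens)), 2):
        d = j - i
        if d < 3:
            tcm[tuple(sorted((tokens[i], tokens[j])))] += 1 / d

print(tcm[('on', 'sat')])   # 3.0
print(tcm[('sat', 'the')])  # 2.5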

How Big is a TCM?

The problem:

  • A corpus with a vocabulary of 10,000 unique terms has a TCM with \(10,000^{2}\) (100,000,000) cells
  • A corpus with a vocabulary of 50,000 unique terms has a TCM with \(50,000^{2}\) (2,500,000,000) cells

Cleaning is necessary, but it’s not sufficient to create a tractable TCM on a large corpus.

Enter Document Embeddings

Typically, some kind of two- or three-layer neural network that ‘learns’ how to embed the TCM into a lower-dimensional representation: from \(m \times m\) to \(m \times n\), where \(n \ll m\).

Similar to PCA in terms of what we’re trying to achieve, but the process is utterly different.
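
A minimal sketch, assuming the gensim library (not named in the slides). Word2Vec is a shallow neural network that learns an \(n\)-dimensional vector for each term; it builds its own co-occurrence windows from raw text rather than taking a pre-built TCM:

from gensim.models import Word2Vec

sentences = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
             ['the', 'cat', 'sat', 'on', 'the', 'fluffy', 'mat'],
             ['the', 'fluffy', 'ginger', 'cat', 'sat', 'on', 'the', 'mat']]

# Toy corpus, so the vectors themselves are illustrative only
model = Word2Vec(sentences, vector_size=25, window=2, min_count=1, seed=42)

print(model.wv['cat'])                       # a 25-dimensional vector for 'cat'
print(model.wv.most_similar('cat', topn=3))  # nearest terms in the embedded space

Document-level embeddings (e.g. gensim’s Doc2Vec) apply the same idea, learning one vector per document.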

Sentiment Analysis

Requires us to pay close attention to bi- and tri-grams because negation and sarcasm are hard to detect. It also tends to require labelled training data.
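
A minimal sketch using the VADER lexicon shipped with NLTK (an assumption, since the slides name no library); lexicon-based tools like this avoid the need for labelled training data, at some cost in accuracy:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-off download of the lexicon

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores('The fluffy cat was not impressed by the mat.'))
# returns neg/neu/pos shares and a compound score between -1 and 1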


Clustering

| Cluster | Geography | Earth Science | History | Computer Science | Total  |
|---------|-----------|---------------|---------|------------------|--------|
| 1       | 126       | 310           | 104     | 11,018           | 11,558 |
| 2       | 252       | 10,673        | 528     | 126              | 11,579 |
| 3       | 803       | 485           | 6,730   | 135              | 8,153  |
| 4       | 100       | 109           | 6,389   | 28               | 6,626  |
| Total   | 1,281     | 11,577        | 13,751  | 11,307           | 37,916 |
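
A minimal sketch of k-means clustering on TF/IDF vectors (the documents and the number of clusters here are invented; the table above used four clusters):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ['plate tectonics and volcanoes',
        'the geography of cities and regions',
        'the history of the roman empire',
        'machine learning in computer science']

vectors = TfidfVectorizer(stop_words='english').fit_transform(docs)
kmeans  = KMeans(n_clusters=2, n_init=10, random_state=42).fit(vectors)

print(kmeans.labels_)   # cluster assignment for each document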

Topic Modelling

Learning the associations between words (or images, or many other things) and the hidden ‘topics’ that generate them:
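
A minimal sketch using scikit-learn’s LatentDirichletAllocation (the documents are invented):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ['volcanoes erupt and tectonic plates shift',
        'the empire expanded its borders for centuries',
        'neural networks learn patterns from data',
        'rivers slowly erode the surrounding landscape']

counts = CountVectorizer(stop_words='english').fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(counts)

# Each row of components_ is a 'topic': a weighting over the whole vocabulary
print(lda.components_.shape)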

Word Clouds
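
A word cloud simply scales each term by its frequency in the corpus. A minimal sketch, assuming the third-party wordcloud package (not named in the slides):

from wordcloud import WordCloud

text = 'the cat sat on the mat the cat sat on the fluffy mat'
wc = WordCloud(width=400, height=300, background_color='white').generate(text)
wc.to_file('wordcloud.png')   # writes an image with terms scaled by frequency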

Resources

Basically any of the lessons on The Programming Historian.

More Resources

  • PlasticityAI Embedding Models
  • Clustering text documents using k-means
  • Topic extraction with Non-negative Matrix Factorization and LDA
  • Topic Modeling with LSA, pLSA, LDA, NMF, BERTopic, Top2Vec: a Comparison