The concept of ‘dummy variables’ in economics/regression is a useful starting point for thinking about text:
Topic | Dummy |
---|---|
News | 0 |
Culture | 1 |
Politics | 2 |
Entertainment | 3 |
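A minimal sketch of this kind of encoding in pandas (the DataFrame below is invented for illustration; `pd.factorize` assigns integer codes in order of first appearance):

```python
import pandas as pd

# Invented example data
df = pd.DataFrame({'Topic': ['News', 'Culture', 'Politics', 'Entertainment']})

# Assign an integer code to each topic in order of first appearance
df['Dummy'], categories = pd.factorize(df['Topic'])
print(df)
```

For the strict 0/1 dummy columns used in a regression you would use `pd.get_dummies(df['Topic'])` instead.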
Document | UK | Top | Pop | Coronavirus |
---|---|---|---|---|
News item | 1 | 1 | 0 | 1 |
Culture item | 0 | 1 | 1 | 0 |
Politics item | 1 | 0 | 0 | 1 |
Entertainment item | 1 | 1 | 1 | 1 |
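A hedged sketch of building this kind of presence/absence matrix with scikit-learn’s `CountVectorizer(binary=True)`; the documents below are invented stand-ins for the four items:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented stand-ins for the news/culture/politics/entertainment items
docs = [
    'UK coronavirus cases reach a new top',
    'Top pop albums of the year reviewed',
    'UK election fallout and coronavirus policy',
    'UK pop stars top coronavirus charity single',
]
vectorizer = CountVectorizer(binary=True)   # presence/absence rather than counts
vectors = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(vectors.toarray())
```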
Just like the one-hot (binarised) approach on the preceding slide, but now we count occurrences:
Document | UK | Top | Pop | Coronavirus |
---|---|---|---|---|
News item | 4 | 2 | 0 | 6 |
Culture item | 0 | 4 | 7 | 0 |
Politics item | 3 | 0 | 0 | 3 |
Entertainment item | 3 | 4 | 8 | 1 |
Enter, stage left, scikit-learn:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
# One-step form: learn the vocabulary and transform in a single call
# (texts, texts1 and texts2 are assumed to be lists of strings)
vectors = vectorizer.fit_transform(texts)
# Reusable form: learn the vocabulary once, then transform any number of corpora
vectorizer.fit(texts)
vectors1 = vectorizer.transform(texts1)
vectors2 = vectorizer.transform(texts2)
print(f'Vocabulary: {vectorizer.vocabulary_}')
print(f'All vectors: {vectors.toarray()}')
TF/IDF builds on Count Vectorisation by weighting each term’s frequency within a document by how rare the term is across the corpus as a whole, so common words receive a large penalty:
\[ W(t,d) = TF(t,d) \times \log\left(\frac{N}{DF_{t}}\right) \]
Here \(TF(t,d)\) is the frequency of term \(t\) in document \(d\), \(N\) is the total number of documents, and \(DF_{t}\) is the number of documents containing \(t\); a term found in every document has \(\log(N/DF_{t}) = \log(1) = 0\) and so contributes nothing, however frequent it is.
For example:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
# One-step form: learn the vocabulary and IDF weights, then transform, in a single call
vectors = vectorizer.fit_transform(texts)
# Reusable form: fit once, then transform any corpus
vectorizer.fit(texts)
vectors = vectorizer.transform(texts)
print(f'Vocabulary: {vectorizer.vocabulary_}')
print(f'All vectors: {vectors.toarray()}')
Three input texts combined into a Term Co-occurrence Matrix (TCM), with a distance weighting (\(d/2\), where \(d<3\)):
 | fluffy | mat | ginger | sat | on | cat | the |
---|---|---|---|---|---|---|---|
fluffy | 1 | 1 | 0.5 | 0.5 | 2.0 | | |
mat | 0.5 | 1.5 | | | | | |
ginger | 0.5 | 0.5 | 1.0 | 1.5 | | | |
sat | 3.0 | 3.0 | 2.5 | | | | |
on | 1.5 | 3.0 | | | | | |
cat | 2.0 | | | | | | |
the | | | | | | | |
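A hedged sketch of how a TCM like this can be built by hand, applying the \(d/2\) weighting to word pairs within a window of \(d<3\); the three input texts are invented to roughly match the vocabulary above, so the values will not reproduce the table exactly:

```python
from collections import defaultdict

# Invented input texts, for illustration only
texts = [
    'the fluffy cat sat on the mat',
    'the ginger cat sat on the mat',
    'the cat sat on the fluffy mat',
]

tcm = defaultdict(float)
for text in texts:
    tokens = text.split()
    for i, w1 in enumerate(tokens):
        for d in (1, 2):                      # only pairs at distance d < 3
            if i + d < len(tokens):
                pair = tuple(sorted((w1, tokens[i + d])))
                tcm[pair] += d / 2            # the d/2 distance weighting

for (w1, w2), weight in sorted(tcm.items()):
    print(f'{w1}-{w2}: {weight}')
```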
The problem:
Cleaning is necessary, but it is not sufficient to make a TCM tractable on a large corpus.
Typically, some kind of 2- or 3-layer neural network that ‘learns’ how to embed the TCM into a lower-dimensional representation: from \(m \times m\) to \(m \times n\), where \(n \ll m\).
Similar to PCA in terms of what we’re trying to achieve, but the process is utterly different.
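As a hedged sketch of what that looks like in practice, gensim’s `Word2Vec` (version 4+ API) learns an \(n\)-dimensional vector per vocabulary term; it works from sliding windows over the tokens rather than from a pre-computed TCM, but the result is the same kind of \(m \times n\) representation. The corpus and parameter values below are illustrative only:

```python
from gensim.models import Word2Vec

# Tiny, invented corpus of pre-tokenised documents
sentences = [
    ['the', 'fluffy', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'ginger', 'cat', 'sat', 'on', 'the', 'mat'],
]

# vector_size is the reduced dimension n; window caps the co-occurrence distance
model = Word2Vec(sentences=sentences, vector_size=25, window=2, min_count=1)

print(model.wv['cat'])                # the 25-dimensional vector for 'cat'
print(model.wv.most_similar('cat'))   # nearest neighbours in the embedding space
```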
Requires us to deal in great detail with bi- and tri-grams because negation and sarcasm are hard. Also tends to require training/labelled data.
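A minimal sketch of keeping bi- and tri-grams with scikit-learn’s `ngram_range` parameter, so that a negation like ‘not good’ survives as a feature in its own right; the two sentences are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['The film was not good at all.', 'The film was very good.']

# Keep unigrams, bigrams and trigrams so that 'not good' becomes a feature
vectorizer = CountVectorizer(ngram_range=(1, 3))
vectors = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
```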
Cluster | Geography | Earth Science | History | Computer Science | Total |
---|---|---|---|---|---|
1 | 126 | 310 | 104 | 11,018 | 11,558 |
2 | 252 | 10,673 | 528 | 126 | 11,579 |
3 | 803 | 485 | 6,730 | 135 | 8,153 |
4 | 100 | 109 | 6,389 | 28 | 6,626 |
Total | 1,281 | 11,577 | 13,751 | 11,307 | 37,916 |
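As a hedged sketch, a cross-tabulation like the one above can be produced by clustering TF/IDF vectors with k-means and comparing the resulting cluster labels against the known categories; the texts, labels and parameters below are invented for illustration:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Invented mini-corpus with known categories
texts = [
    'plate tectonics, volcanoes and earthquakes',
    'the history of the roman empire',
    'python programming and data structures',
    'urban geography and the growth of cities',
]
labels = ['Earth Science', 'History', 'Computer Science', 'Geography']

X = TfidfVectorizer().fit_transform(texts)
clusters = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Rows are cluster numbers, columns are the known categories
print(pd.crosstab(pd.Series(clusters, name='Cluster'),
                  pd.Series(labels, name='Category')))
```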
Learning the associations between words (or images, or many other things) and the hidden ‘topics’ that generate them:
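A hedged sketch of the idea using scikit-learn’s `LatentDirichletAllocation` on a bag-of-words matrix; the corpus, the number of topics and the other parameters are illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented mini-corpus
texts = [
    'the government announced new coronavirus rules',
    'voters go to the polls in the uk election',
    'the band released a new pop album',
    'the film festival opens with a new comedy',
]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

# Show the words most strongly associated with each hidden topic
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[::-1][:5]]
    print(f'Topic {i}: {", ".join(top_terms)}')
```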
Basically any of the lessons on The Programming Historian.
:::: {.columns}
::: {.column width="50%"}
- Introduction to Word Embeddings
- The Current Best of Universal Word Embeddings and Sentence Embeddings
- Using GloVe Embeddings
- Working with Facebook’s FastText Library
- Word2Vec and FastText Word Embedding with Gensim
- Sentence Embeddings. Fast, please!
:::
::: {.column width="50%"}
- PlasticityAI Embedding Models
- Clustering text documents using k-means
- Topic extraction with Non-negative Matrix Factorization and LDA
- Topic Modeling with LSA, pLSA, LDA, NMF, BERTopic, Top2Vec: a Comparison
:::
::::