Textual Data

Overview

Although the direct use of textual (both structured and unstructured) data is still relatively rare in spatial analyses, the growth of crowd-sourced and user-generated content points to its increasing importance. The tools and approaches in this area are also evolving rapidly, so this week is intended primarily to familiarise you with the basic landscape in preparation for developing your skills further in your own time!

Learning Outcomes
  1. An awareness of the benefits of separating content from presentation.
  2. A basic understanding of pattern-matching in Python (you will have been exposed to this in Week 2 of CASA0005).
  3. A basic understanding of how text can be ‘cleaned’ to make it more amenable to analysis.
  4. An appreciation of parallelisation in the context of text processing.
  5. An appreciation of how text can be analysed.

The manipulation of text requires a high level of abstraction – of thinking about words as data in ways that are deeply counter-intuitive – but the ability to do so forms a critical bridge between this block and the subsequent one, while also reinforcing the idea that numerical, spatial, and textual data analyses provide alternative (and often complementary) views into the data.

Lectures

Come to class prepared to present/discuss:

Session                  Video   Presentation
Notebooks as Documents   Video   Slides
Patterns in Text         Video   Slides
Cleaning Text            Video   Slides
Analysing Text           Video   Slides

Other Prep

Connections

Conceptually, this is by far the hardest week of the entire term: there is very little to draw upon from other modules, and the processing of text with computers rarely makes it beyond simple regular expressions. However, the growth in data that is ‘accidental, open, and everywhere’ (Arribas-Bel 2014) means that much more of it is unstructured and contains free text written by humans as well as numerical and coordinate data generated by sensors or transactions. Using tutorials from the Programming Historian, we’re going to look at the foundations of text processing: how we can extract important terms from a document and, ultimately, the foundations upon which modern Large Language Models are built.
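To give a concrete sense of what ‘simple regular expressions’ can do, here is a minimal sketch in Python. The listing text and the patterns are invented for illustration only; they are not taken from the practical or the InsideAirbnb data.

```python
import re

# A hypothetical Airbnb-style listing description (illustrative only)
description = "Sunny 2-bed flat, 10 mins walk to King's Cross. £120/night, min. 3 nights."

# Find all prices expressed in pounds (the pattern is an assumption about the text)
prices = re.findall(r"£\d+", description)
print(prices)   # ['£120']

# Find any number directly followed by 'night' or 'nights'
stays = re.findall(r"(\d+)\s*nights?", description)
print(stays)    # ['3']
```

Patterns like these are brittle (they break as soon as the text uses ‘$’ or spells out ‘three nights’), which is precisely why the week moves on from pattern-matching to more general ways of cleaning and analysing text.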

Practical

In the practical we will continue to work with the InsideAirbnb data, here focussing on the third ‘class’ of data in the data set: text. We will see how working with text is more complex than working with numeric or spatial data and, consequently, why the computational costs rise accordingly. This practical should suggest some new lines of inquiry for your Group Project.

Connections

The practical focusses on:

  • Applying simple regular expressions to find patterns in text.
  • How to clean text in preparation for further analysis.
  • Simple transformations that allow you to analyse text (e.g. TF-IDF).
  • Ways of exploring groups/similarity in textual data (a short sketch of these last two steps follows this list).
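As a rough illustration of those last two steps, the sketch below uses scikit-learn to turn a handful of made-up listing descriptions into TF-IDF vectors and then compares them with cosine similarity. The documents are invented, and the practical may use different libraries and cleaning steps, so treat this purely as an indicative example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three hypothetical listing descriptions (stand-ins for the InsideAirbnb text)
docs = [
    "Bright double room in a quiet Victorian terrace near the park.",
    "Quiet double room close to the park, ideal for couples.",
    "Modern studio apartment in the city centre with fast wifi.",
]

# Basic 'cleaning' is delegated to the vectoriser here: lowercasing and
# English stop-word removal; real workflows often add lemmatisation etc.
vectoriser = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectoriser.fit_transform(docs)   # documents x terms sparse matrix

# Pairwise cosine similarity between the documents
sims = cosine_similarity(tfidf)
print(sims.round(2))   # the first two listings should score higher against each other
```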

To access the practical:

  1. Preview on GitHub
  2. Download the Notebook

References

Arribas-Bel, Daniel. 2014. “Accidental, Open and Everywhere: Emerging Data Sources for the Understanding of Cities.” Applied Geography 49: 45–53.
Ladd, John R. 2020. “Understanding and Using Common Similarity Measures for Text Analysis.” The Programming Historian, no. 9. https://doi.org/10.46430/phen0089.
Lavin, Matthew J. 2019. “Analyzing Documents with TF-IDF.” The Programming Historian, no. 8. https://doi.org/10.46430/phen0082.
Reades, Jonathan, and Jennie Williams. 2023. “Clustering and Visualising Documents Using Word Embeddings.” The Programming Historian. https://doi.org/10.46430/phen0111.