Textual Data
Overview
Although the direct use of textual data (both structured and unstructured) is still relatively rare in spatial analyses, the growth of crowd-sourced and user-generated content points to the increasing importance of this area. The tools and approaches are also evolving rapidly, so this week is intended primarily to familiarise you with the basic landscape so that you can develop your skills further in your own time!
- An awareness of the benefits of separating content from presentation.
- A basic understanding of pattern-matching in Python (you will have been exposed to this in Week 2 of CASA0005).
- A basic understanding of how text can be ‘cleaned’ to make it more amenable to analysis.
- An appreciation of parallelisation in the context of text processing.
- An appreciation of how text can be analysed.
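As a small illustration of the pattern-matching and cleaning outcomes above, the sketch below uses only Python's built-in `re` module. The listing text, the `price_pattern` name and the `clean` function are hypothetical and chosen for illustration; real listings will need more careful rules than these.

```python
import re

# Hypothetical listing description; not taken from the InsideAirbnb data.
raw = "<b>Bright flat</b> 5 mins walk to the Tube!<br/>Sleeps 4, from £85 per night."

# Pattern-matching: capture a price written as a pound sign followed by digits,
# with an optional two-digit pence component.
price_pattern = re.compile(r"£(\d+(?:\.\d{2})?)")
prices = [float(p) for p in price_pattern.findall(raw)]

def clean(text: str) -> str:
    """Strip HTML tags, lower-case, drop punctuation, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)           # replace HTML tags with a space
    text = re.sub(r"[^\w\s]", " ", text.lower())   # replace punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()       # collapse runs of whitespace

cleaned = clean(raw)
print(prices)
print(cleaned)
```

Note that the order of operations matters: the price is extracted before cleaning, because cleaning removes the `£` symbol that the pattern relies on.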
The manipulation of text requires a high level of abstraction, of thinking about words as data in ways that are deeply counter-intuitive, but the ability to do so forms a critical bridge between this block and the subsequent one, while also reinforcing the idea that numerical, spatial, and textual data analyses provide alternative (and often complementary) views into the data.
Preparatory Lectures
Come to class prepared to present/discuss:
Session | Video | Presentation |
---|---|---|
Notebooks as Documents | Video | Slides |
Patterns in Text | Video | Slides |
Cleaning Text | Video | Slides |
Analysing Text | Video | Slides |
Other Preparation
Readings
Come to class prepared to discuss the following readings:
Citation | Article | ChatGPT Summary |
---|---|---|
Miller and Goodchild (2015) | URL | N/A |
Delmelle and Nilsson (2021) | URL | N/A |
Reades et al. (in review) | URL | N/A |
Study Guide
Reading Miller and Goodchild (2015):
- How does “data-driven geography” differ from traditional geographic research?
- How can “data-driven approaches” be incorporated into geographic research, and what are their potential benefits and limitations?
Reflecting on Reades et al. (in review):
- Why has text become increasingly interesting to computational social scientists?
- What are the specific advantages of textual data for understanding cities?
- What are some of the key challenges and limitations of using textual data in urban research, and how can researchers address these challenges?
Connecting this to Delmelle and Nilsson (2021):
- What is the framework that Delmelle and Nilsson developed for understanding the language used to advertise properties, and how does it connect to the racial and income profiles of neighborhoods?
- What are the implications for understanding neighborhood change and (potential) discrimination in the housing market?
Collectively:
- How do these readings connect to the broader themes of the course, and what are the implications for your own research?
Conceptually, this is by far the hardest week of the entire term: there is very little upon which to draw from other modules, and the processing of text with computers rarely makes it beyond simple regular expressions. However, the growth in data that is ‘accidental, open, and everywhere’ (Arribas-Bel 2014) means that a lot more of it is unstructured and contains free text written by humans as well as numerical and coordinate data generated by sensors or transactions.
If you’re feeling ambitious, you can use the tutorial from the Programming Historian to look at the foundations of text processing, how we can extract important terms from a document and, ultimately, the ideas upon which modern Large Language Models are built.
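To give a flavour of what ‘extracting important terms’ can mean, here is a minimal sketch of TF-IDF weighting written with the standard library only. It assumes one common formulation (relative term frequency multiplied by the natural log of N/df), which is not necessarily the variant used in the Programming Historian tutorial, and the three short documents are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned listing descriptions.
docs = [
    "quiet garden flat near the park",
    "modern flat near the station",
    "quiet flat with garden view",
]

tokenised = [d.split() for d in docs]
n_docs = len(tokenised)

# Document frequency: the number of documents in which each term appears.
df = Counter(term for tokens in tokenised for term in set(tokens))

def tf_idf(tokens):
    """Weight each term by its relative frequency times log(N / df)."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in counts.items()}

weights = [tf_idf(tokens) for tokens in tokenised]

# The highest-weighted term in the first document.
top_first = max(weights[0], key=weights[0].get)
print(top_first, weights[0])
```

In this toy corpus ‘flat’ appears in every document, so its weight is zero, while ‘park’ appears only in the first document and so scores highest there. That is the intuition behind the method: terms that are frequent locally but rare globally are treated as distinctive.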
Practical
In the practical we will continue to work with the InsideAirbnb data, here focussing on the third ‘class’ of data in the data set: text. We will see how working with text is more complex than working with numeric or spatial data and, consequently, why the computational costs rise accordingly. This practical should suggest some new lines of inquiry for the Group Project.
The practical focusses on:
- Applying simple regular expressions to find patterns in text.
- How to clean text in preparation for further analysis.
- Simple transformations that allow you to analyse text (e.g. TF/IDF).
- Ways of exploring groups/similarity in textual data.
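As one hedged example of exploring similarity in textual data, the sketch below computes cosine similarity between simple term-count vectors using only the standard library. The practical itself may well use different tools and measures; the `cosine` helper and the three short documents are hypothetical.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical, already-cleaned listing descriptions.
docs = {
    "a": "quiet garden flat near the park",
    "b": "quiet flat with garden view",
    "c": "loft studio beside busy nightlife",
}
vectors = {name: Counter(text.split()) for name, text in docs.items()}

sim_ab = cosine(vectors["a"], vectors["b"])  # share 'quiet', 'garden', 'flat'
sim_ac = cosine(vectors["a"], vectors["c"])  # share no terms
print(round(sim_ab, 3), sim_ac)
```

Documents with overlapping vocabulary score closer to 1 and documents with no shared terms score 0. Raw counts are used here for simplicity; weighted vectors (for example, TF-IDF weights) could be substituted without changing the function.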
To access the practical:
Bonus material (not necessary for the assessment, just ‘nice to know’ if you’re interested in the topic) containing material related to Natural Language Processing (NLP):