Building Foundations:
Reproducible (Geographic) Data Science

Context: Who is this guy?

Jon in five bullet points:

  1. Undergraduate degree in Comparative Literature.
  2. Dot.com startup web dev in New York/London for nearly 10 years.
  3. PhD and post-doc in Urban Planning at UCL (CASA).
  4. Lecturer/SL in Quantitative Geography at KCL for nearly 10 years.
  5. Associate Prof in Spatial Data Science at UCL (CASA).

Our Vision…

“Our vision for the modern teacher-scholar is having consistent reproducibility practices in how they conduct research, what they teach to students, and how they prepare teaching materials. We distinguish the three aspects as reproducible research, teaching reproducibility, and reproducible teaching…” (Dogucu and Çetinkaya-Rundel, 2022)

Some Benefits of Reproducible Workflows

For Students        | For Teacher-Scholars
Abstraction         | Abstraction
Employability       | Employability
Learning-by-Seeing  | Learning-by-Seeing
Learning-by-Breaking | Learning-by-Breaking
Workload Management | Workload Management

Context: Foundations in Spatial Data Science

This module provides students with an introduction to programming through a mix of discussion and coursework built around an applied spatial data science question using real-world data. The module is intended to… show how geographic and quantitative concepts are applied in a computational context as part of a piece of spatial data science analysis.

Technical Evolution

            | v1 (2013–2018)    | v2 (2019–2021)   | v3 (2022–2023)
Platform    | USB Key w/ Ubuntu | Conda YAML       | Docker
Language    | Python 3.4        | Python 3.6       | Python 3.10
Versioning  | SVN/Dropbox       | GitHub           | GitHub
Content     | PowerPoint        | PowerPoint       | Quarto/GitHub.io
Environment | Spyder            | Jupyter          | Quarto/JupyterLab
Assessment  | Word              | Jupyter Notebook | Quarto

The v3 ‘Stack’

  • Why Docker?
  • Why Git/GitHub?
  • Why JupyterLab?
  • Why Quarto?

Module Structure

  1. Foundations
    • Setting Up
    • Foundations Part 1
    • Foundations Part 2
    • Objects & Classes
  2. Data
    • Numeric Data
    • Spatial Data
    • Textual Data
  3. Analysis
    • Dimensions in Data
    • Grouping Data
    • Visualising Data
Week-by-Week

Each week entails:

  • Assigned readings from a mix of academic and non-academic sources (e.g. Medium/Towards Data Science).
  • Pre-recorded short videos on specific concepts or topics.
  • An optional Moodle quiz to test understanding.
  • A live-coding session that incorporates discussion of the week’s assigned readings (students are selected using a Python random-number generator 😅).
  • A small group practical using a Jupyter notebook.
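
The random selection of students for the readings discussion could be sketched roughly as follows. This is a minimal illustration, not the module's actual script: the function name pick_students, the roster, and the optional seed are all assumptions made for the example.

```python
import random

def pick_students(roster, k, seed=None):
    """Return k distinct students chosen uniformly at random from roster."""
    # A dedicated generator; pass a seed only when a repeatable demo is wanted.
    rng = random.Random(seed)
    return rng.sample(list(roster), k)

# Hypothetical roster, for illustration only.
roster = ["Student A", "Student B", "Student C", "Student D", "Student E"]
chosen = pick_students(roster, k=2)
print(chosen)
```

Sampling without replacement means nobody is called on twice in the same session; leaving seed unset keeps the draw unpredictable.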

Assessment

“… three pedagogical strategies that are particularly effective for teaching reproducibility successfully: 1. placing extra emphasis on motivation; 2. guided instruction; 3. lots of practice.” (Ostblom and Timbers, 2022)

Assessments

  1. Time-limited coding quiz (30%)
    • Hidden randomisation of data
    • More obvious randomisation of questions
  2. Group critical data science project proposal (50%)
    • Quarto document (incl. references)
    • Reproducibility (12%)
    • Output quality (6%)
    • Code ‘quality’ (6%)
    • Content (36%)
  3. Peer evaluation of contributions (20%)
    • GitHub history used in event of group meltdown
    • Contributions assessed across project dimensions
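
One way the randomisation of data and questions in the coding quiz might work is to derive a per-student seed, so that each student receives a different but repeatable variant. The sketch below is an assumption about the general approach, not the module's implementation: student_rng, build_quiz, the salt value, and the sample rows and questions are all hypothetical.

```python
import hashlib
import random

def student_rng(student_id, salt="quiz-2023"):
    """Build a random generator whose seed depends only on salt and student ID."""
    # Hashing gives a stable seed across runs and machines (unlike built-in hash()).
    digest = hashlib.sha256(f"{salt}:{student_id}".encode("utf-8")).hexdigest()
    return random.Random(int(digest, 16))

def build_quiz(student_id, rows, questions, n_rows):
    """Return a per-student data sample and question order."""
    rng = student_rng(student_id)
    data = rng.sample(rows, n_rows)   # the 'hidden' randomisation of data
    order = list(questions)
    rng.shuffle(order)                # the more obvious randomisation of questions
    return data, order

# Hypothetical inputs, for illustration only.
rows = list(range(100))
questions = ["Q1", "Q2", "Q3", "Q4"]
data_a, order_a = build_quiz("student-a", rows, questions, n_rows=10)
data_a2, order_a2 = build_quiz("student-a", rows, questions, n_rows=10)
print(order_a)
```

Because the seed is derived from the student ID, a marker can regenerate exactly the data and question order a given student saw, which keeps the quiz itself reproducible.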

Towards the Promised Land?

  • Automate All the Things?
  • Multiple Docker images
  • REF (2028) it?

Logos of CASA and British Library

Jon / @jreades / jreades

Acknowledgements

The work presented here builds on the contributions of many (not least the FOSS community!), but I’m particularly indebted to Dani Arribas-Bel and Andy MacLachlan for pointing me towards critical pieces of the puzzle.

Module content jreades.github.io/fsds/

References

Arribas-Bel, D. (2019) “A course on Geographic Data Science,” Journal of Open Source Education, 2(16), p. 42. doi: 10.21105/jose.00042.
Dogucu, M. and Çetinkaya-Rundel, M. (2022) “Tools and recommendations for reproducible teaching,” Journal of Statistics and Data Science Education, 30(3), pp. 251–260. doi: 10.1080/26939169.2022.2138645.
Kernohan, D. (2023) “REF 2028 is coming.” Available at: https://wonkhe.com/blogs/ref-2028-is-coming/.
MacLachlan, A. and Dennett, A. (2022) “An Applied Geographic Information Systems and Science Course in R,” Journal of Open Source Education, 5(50). doi: 10.21105/jose.00141.
Ostblom, J. and Timbers, T. (2022) “Opinionated Practices for Teaching Reproducibility: Motivation, Guided Instruction and Practice,” Journal of Statistics and Data Science Education, 30(3), pp. 241–250. doi: 10.1080/26939169.2022.2074922.
Reades, J. (2020) “Teaching on Jupyter,” Region, 7(1), pp. 21–34. doi: 10.18335/region.v7i1.282.
Reades, J. and Rey, S. J. (2021) “Geographical Python Teaching Resources: GeoPyTer,” Journal of Geographical Systems, 23(4), pp. 579–597. doi: 10.1007/s10109-021-00346-6.