Grouping Data

Overview

This week we look at various ways of grouping data, whether by variable or by algorithm. We begin by covering how data can be aggregated in Python using Pandas before turning to the practical challenges of classification (labeled data) and clustering (unlabeled data).
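As a minimal sketch of what grouping 'by variable' can look like in Pandas, the example below aggregates a small, made-up table. The data frame, its column names (`borough`, `room_type`, `price`) and its values are assumptions for illustration only and are not taken from the practical.

```python
import pandas as pd

# Hypothetical listings data: column names and values are illustrative only.
listings = pd.DataFrame({
    "borough": ["Camden", "Camden", "Hackney", "Hackney", "Hackney"],
    "room_type": ["Entire home", "Private room", "Entire home",
                  "Private room", "Private room"],
    "price": [150.0, 60.0, 120.0, 50.0, 70.0],
})

# Group rows by a single variable and derive several summary
# statistics for each group in one pass.
summary = listings.groupby("borough")["price"].agg(["count", "mean", "median"])
print(summary)

# Grouping by two variables yields one row per combination that occurs.
by_type = listings.groupby(["borough", "room_type"])["price"].mean()
print(by_type)
```

Grouped summaries like these are one way of deriving new variables or proxies from raw records before any classification or clustering takes place.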

We are now ‘completing’ the pipeline begun in Week 5 using the concepts introduced in Weeks 1–4. If you remember your ‘epicycles of analysis’, though, you’ll realise that this is, at best, a first pass through the data science process: there are multiple places where insights derived from the practicals (on outliers/problematic records, on data quality issues, on data selection, etc.) could be fed back through the pipeline to adjust and improve the analytical outputs.

We will also be shifting our focus in the live session to the final parts of the group submission, but you should also be looking at how this module connects and integrates ideas covered in CASA0001 (UST), CASA0005 (GIS), and CASA0007 (QM). So there will be only a minimal live-coding session in order to leave as much time as possible for the groups to meet and start working on their final projects.

Learning Objectives
  1. An understanding of the differences between aggregation, classification, and clustering.
  2. An appreciation of the utility of deriving grouped variables and proxies from raw data.
  3. An appreciation of how clustering as part of an analytical pipeline differs from the material covered in CASA0007 and so enhances our understanding of ‘paradigms’ in CASA0001.
  4. A general appreciation of how different clustering algorithms work and how this differs from classification.

Lectures

You should, by now, be familiar with the concept of how to cluster data from the QM module (CASA0007), so this week is actually focussed on how to move beyond k-means. The point is to contextualise these approaches as part of a data science ‘pipeline’ and to contrast them with the more theoretical aspects covered elsewhere. We are less interested in the mathematical and technical aspects, and more interested in how one might go about selecting the appropriate algorithm for a particular problem.
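To illustrate why the choice of algorithm matters, the sketch below compares k-means with DBSCAN on a synthetic two-moons dataset from scikit-learn. The dataset and the parameter values (`eps=0.2`, `min_samples=5`) are assumptions chosen for this toy example, not settings from the lectures or the practical.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Synthetic data: two interleaved crescents that are not linearly separable.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

# k-means partitions space around centroids, so it tends to cut
# each crescent in half rather than follow its shape.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN links points through dense neighbourhoods, so it can follow
# curved shapes. eps and min_samples are illustrative choices only.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# The Adjusted Rand Index compares each result with the known grouping.
# This is only possible here because the synthetic data comes with labels.
kmeans_ari = adjusted_rand_score(y_true, kmeans_labels)
dbscan_ari = adjusted_rand_score(y_true, dbscan_labels)
print(f"k-means ARI: {kmeans_ari:.2f}")
print(f"DBSCAN ARI:  {dbscan_ari:.2f}")
```

Real data rarely comes with a known grouping to check against, which is part of why algorithm selection depends on what you think a 'cluster' should mean for the problem at hand.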

Session                    Video  Presentation
Grouping Data              Video  Slides
Classification             Video  Slides
Clustering                 Video  Slides
Clustering and Geography   Video  Slides

Other Prep

Connections

We’re trying to move between technical and critical representations of data and methods – showing (again) how all data analysis represents a series of choices about what matters. Ultimately, it’s up to us whether we make these consciously or unconsciously: being a ‘critical’ (spatial) data scientist positions us to question the data constructively to ensure that it is ‘fit for purpose’ – that it is appropriate and adequate to the processes or behaviours that we wish to study – be it for profit, policy, or public engagement.

  • You should come to class prepared to present/discuss:
    • D’Ignazio and Klein (2020), chap. 3, On Rational, Scientific, Objective Viewpoints from Mythical, Imaginary, Impossible Standpoints <URL>
    • Badger, Bui, and Gebeloff (2019) <URL>
    • Massey (1996) <URL>

Practical

The previous week has set you up nicely for approaching aggregation, classification, and clustering as functions of the (transformed and reduced) data space. With this, you have essentially covered a full data science analytical pipeline from start (setting up) to finish (cluster/classification analysis) and can hopefully see how these pieces fit together to support one another, and how there is no such thing as a ‘right’ way to approach an analysis… but that there are better and worse ways.

Note that, while you should be trying to advance your understanding of clustering and classification in Python, these final practicals are also a very good time to be working on your group project. So look at whether the techniques covered this week can help with (or distract from) this work and adjust the time you give them accordingly.

Connections

The practical focusses on:

  • How to group and aggregate data.
  • The connections between classification and clustering.
  • The use of classification as a predictive process with labeled data.
  • The choice of k in k-means and extraction of representative centroids.
  • The use of alternative clustering algorithms (DBSCAN, OPTICS, Self-Organising Maps, and ADBSCAN).
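One common way to approach the choice of k is to compare silhouette scores across candidate values. The sketch below does this on synthetic blobs and then extracts the centroids, together with the real observation nearest to each one. The data, the range of k and the use of the silhouette score are assumptions for illustration; the practical may use a different criterion or define 'representative' differently.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin_min, silhouette_score

# Synthetic data with three well-separated groups (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Fit k-means for a range of k and record the mean silhouette score,
# which rewards compact clusters that are far from their neighbours.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette score: {best_k}")

# Refit with the selected k and extract the centroids.
model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
centroids = model.cluster_centers_

# A centroid is an average and need not match any real record, so we
# also look up the observation closest to each centroid.
nearest_idx, _ = pairwise_distances_argmin_min(centroids, X)
representatives = X[nearest_idx]
print(representatives)
```

The silhouette score is only one heuristic: on messier data it can disagree with other criteria, so treat the selected k as a starting point for interpretation rather than an answer.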

To access the practical:

  1. Preview on GitHub
  2. Download the Notebook

References

Badger, E., Q. Bui, and R. Gebeloff. 2019. “Neighborhood Is Mostly Black. The Home Buyers Are Mostly White.” New York Times. https://www.nytimes.com/interactive/2019/04/27/upshot/diversity-housing-maps-raleigh-gentrification.html.
D’Ignazio, Catherine, and Lauren F. Klein. 2020. Data Feminism. MIT Press. https://bookbook.pubpub.org/data-feminism.
Massey, Doreen. 1996. “Politicising Space and Place.” Scottish Geographical Magazine 112 (2). Routledge:117–23. https://doi.org/10.1080/14702549608554458.