Grouping Data

Overview

Supplemental

This session is ‘supplemental’, meaning that it is here to help you integrate ideas seen across Term 1 (and which will be encountered again in Term 2) in a way that supports long-term learning. It is not essential to passing the course and there are no ‘bonus points’ for using methods found in this session.

This week we will be looking at various ways of grouping data, whether it is by variable or by algorithm. So we begin by covering how data can be aggregated in Python using Pandas before turning to the practical challenges of classification (labeled data) and clustering (unlabeled data).

We hare now ‘completing’ the pipeline begun in Week 5 using the concepts introduced in Weeks 1–4, but if you remember your ‘epicycles of analysis’ then you’ll realise that this is, at best, a first pass through the data science process and there are multiple places where insights derived from the practicals (on outliers/problematic records, on data quality issues, on data selection, etc.) could be fed back through the pipeline to adjust and improve the analytical outputs.

We will also be shifting our focus in the live session to the final parts of the group submission, but you should also be looking at how this module connects and integrates ideas covered in CASA0001 (UST), CASA0005 (GIS), and CASA0007 (QM). So there will be only a minimal live-coding session in order to leave as much time as possible for the groups to meet and start working on their final projects.

Learning Objectives

An understanding of the differences between aggregation, classification, and clustering.
An appreciation of the utility of deriving grouped variables and proxies from raw data.
An appreciation of how clustering as part of an analytical pipeline differs from the material covered in CASA0007 and so enhances our understanding of ‘paradigms’ in CASA0001.
A general appreciation of how different clustering algorithms work and how this differs from classifcation.

Preparatory Lectures

You should, by now, be familiar with the concept of how to cluster data from the QM module (CASA0007), so this week is actually focussed on how to move beyond k-means. The point is to contextualise these approaches as part of a data science ‘pipeline’ and to contrast to them with the more theoretical aspects covered elsewhere. We are less interested in the mathematical and technical aspects, and more interested in how one might go about selecting the appropriate algorithm for a particular problem.

Session	Video	Presentation
Grouping Data	Video	Slides
Classification	Video	Slides
Clustering	Video	Slides
Clustering and Geography	Video	Slides

Other Preparation

Connections

We’re trying to move between technical and critical representations of data and methods – showing (again) how all data analysis represents a series of choices about what matters. Ultimately, it’s up to us whether we make these consciously or unconsciously: being a ‘critical’ (spatial) data scientist positions us to question the data constructively to ensure that it is ‘fit for purpose’ – that it is appropriate and adequate to the the processes or behaviours that we wish to study – be it for profit, policy, and public engagement.

The follow readings may be useful for reflecting on the topics covered in this session:
- Badger, Bui, and Gebeloff (2019) <URL>
- Shapiro and Yavuz (2017) <URL>

Practical

The previous week has set up nicely for approaching aggregation, classification, and clustering as functions of the (transformed and reduced) data space. With this, you have essentially covered a full data science analytical pipeline from start (setting up) to finish (cluster/classification analysis) and can hopefully see how these pieces fit together to support one another, and how there is no such thing as a ‘right’ way to approach an analysis… but that there are better and worse ways.

Note that, while you should be trying to advance your understanding of clustering and classification in Python, these final practicals are also a very good time to be working on your group project. So look at whether the techniques covered this week can help (or distract) you on this work and adjust the time given accordingly.

Connections

The practical focusses on:

How to group and aggregate data.
The connections between classification and clustering.
The use of classification as a predictive process with labeled data.
The choice of k in k-means and extraction of representative centroids.
The use of alternative clustering algorithms (DBSCAN, OPTICS, Self-Organising Maps, and ADBSCAN).

To access the practical:

References

Badger, E., Q. Bui, and R. Gebeloff. 2019. “Neighborhood Is Mostly Black. The Home Buyers Are Mostly White. New York Times.” New York Times. https://www.nytimes.com/interactive/2019/04/27/upshot/diversity-housing-maps-raleigh-gentrification.html.

Shapiro, W., and M. Yavuz. 2017. “Rethinking ’distance’ in New York City.” Medium. https://medium.com/topos-ai/rethinking-distance-in-new-york-city-d17212d24919.