Dimensions in Data

Supplemental

This session is ‘supplemental’, meaning that it is here to help you integrate ideas seen across Term 1 (and which will be encountered again in Term 2) in a way that supports long-term learning. It is not essential to passing the course and there are no ‘bonus points’ for using methods found in this session.

Overview

This is the most profoundly abstract aspect of data analysis: how to conceive of your data as a multi-dimensional space that can be reshaped and transformed to support your analytical objectives. This foregrounds the importance of judgement since, as the economist Ronald Coase is reputed to have said:

“If you torture the data long enough, it will confess.”

In other words, transformation is a form of ‘torture’¹: it can force the data to reveal relationships that were previously hidden from the data scientist. Taken too far, however, the data will confess to whatever you want, which is not the purpose of critical, reproducible, sound data science!

Learning Outcomes

- A deeper understanding of the issues surrounding clustering that were covered in Week 6 of CASA0005 (GIS) and CASA0007 (QM).
- An understanding of how data transformation works and the reasons for choosing one transform over another.
- An appreciation of the pros and cons of at least two dimensionality reduction techniques.

Preparatory Lectures

Come to class prepared to present/discuss:

Session	Video	Presentation
The Data Space	Video	Slides
Transformation	Video	Slides
Dimensionality	Video	Slides

Other Preparation

The following readings may be useful for reflecting on the topics covered in this session:
- Bunday (n.d.) <URL>
- Harris (n.d.) <URL>
- Cima (n.d.) <URL, PDF with Figures>

Connections

These readings provide very practical insights into the ways that data transformation can ‘torture the data until it confesses’, as well as how we can use data transformation to generate new ways of seeing our data and, consequently, new ways of understanding it. You should come away from these readings with a clearer understanding of why there is rarely a ‘right’ or ‘wrong’ approach to a real-world data set, only ‘better’ and ‘worse’ ones. The readings are predominantly non-academic, so they should (I hope) be fairly accessible and quick to read despite the potential dryness of the topics.

Practical

This practical will show you how data transformation is an essential, but often overlooked, aspect of data analysis: depending on the choices we make here, we can reduce (or increase) the dimensionality of the data and make it more (or less) tractable for subsequent analysis. This approach to the pipeline relies on you being able to see your data as existing in an abstract ‘space’ that can be manipulated in order to foreground, compress, or even mask attributes.
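As a minimal sketch of how a transformation can increase or reduce the dimensionality of a data space, the example below uses synthetic data rather than the practical's data set. The choice of `PolynomialFeatures` and `PCA` is an assumption made for illustration; the practical itself may use different transformers.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption): 200 observations with 3 attributes
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))

# Increase dimensionality: add squared and pairwise interaction terms,
# so 3 original columns become 3 + 6 = 9 columns
poly = PolynomialFeatures(degree=2, include_bias=False)
X_wide = poly.fit_transform(X)

# Reduce dimensionality: project the 3 original columns onto the
# 2 directions of greatest variance
pca = PCA(n_components=2)
X_narrow = pca.fit_transform(X)

print(X.shape, X_wide.shape, X_narrow.shape)
# (200, 3) (200, 9) (200, 2)
```

Neither result is more ‘correct’ than the original: each foregrounds some structure and compresses or masks the rest, which is why the choice needs to be justified by the analytical objective.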

Connections

The practical focusses on:

- Working with a more complex data structure to create new ‘grouped’ variables (as the simplest form of transformation).
- Using sklearn to fit and transform data in a flexible manner.
- Applying two types of dimensionality reduction to demonstrate how linear and non-linear techniques differ.
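The second and third items above can be sketched as follows. This is an illustration under stated assumptions, not the practical's actual code: the data are synthetic concentric circles from `make_circles`, `KernelPCA` stands in for ‘a non-linear technique’ (the practical may use a different one), and `gamma=10` is simply an illustrative setting.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): two concentric rings in 2D
X, _ = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# fit() learns the parameters (here: column means and standard deviations)
# from the training data only; transform() then applies them to any data
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Linear: with 2 components on 2D data, PCA only centres and rotates,
# so the rings remain rings
pca = PCA(n_components=2).fit(X_train_std)
X_test_pca = pca.transform(X_test_std)

# Non-linear: an RBF kernel lets the projection bend the space,
# so distances from the centre are no longer preserved
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit(X_train_std)
X_test_kpca = kpca.transform(X_test_std)

# Compare each point's distance from the training centroid before and
# after the linear projection: a rotation leaves it unchanged
before = np.linalg.norm(X_test_std - pca.mean_, axis=1)
after_pca = np.linalg.norm(X_test_pca, axis=1)
print(np.allclose(before, after_pca))
```

Separating `fit` from `transform` is what makes the approach flexible: the same fitted object can be reused on new data without re-learning its parameters, and scalers and reducers can be swapped without changing the surrounding code.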

To access the practical:

References

Bunday, B. D. n.d. “A Final Tale or You Can Prove Anything with Figures.” https://www.ucl.ac.uk/~ucahhwi/AFinalTale.pdf.

Cima, R. n.d. “The Most and Least Diverse Cities in America.” Priceonomics. https://priceonomics.com/the-most-and-least-diverse-cities-in-america/.

Harris, R. n.d. “The Certain Uncertainty of University Rankings.” RPubs. https://rpubs.com/profrichharris/uni-rankings.

Footnotes

¹ To be clear, this is a metaphor only!