Dimensions in Data

Overview

This week deals with perhaps the most abstract aspect of data analysis: how to conceive of your data as a multi-dimensional space that can be reshaped and transformed to support your analytical objectives. Doing this well depends on judgement since, as the economist Ronald Coase is reputed to have said:

“If you torture the data long enough, it will confess.”

In other words, transformation is a form of ‘torture’[1]: it can force the data to reveal relationships that were previously hidden from the data scientist. Taken too far, however, the data will confess to whatever you want, which isn’t the purpose of critical, reproducible, sound data science!

Learning Outcomes
  1. A deeper understanding of the issues surrounding clustering that were covered in Week 6 of CASA0005 (GIS) and CASA0007 (QM).
  2. An understanding of how data transformation works and the reasons for choosing one transform over another.
  3. An appreciation of the pros and cons of at least two dimensionality reduction techniques.

Lectures

Come to class prepared to present/discuss:

Session          Video   Presentation
The Data Space   Video   Slides
Transformation   Video   Slides
Dimensionality   Video   Slides

Other Prep

Connections

These readings provide very practical insights into the ways that data transformation can ‘torture the data until it confesses’, as well as how we can use data transformation to generate new ways of seeing our data and, consequently, new ways of understanding it. You should come away from these readings with a clearer understanding of why there’s rarely a ‘right’ or ‘wrong’ approach to a real-world data set, even though there are ‘better’ and ‘worse’ approaches. The readings are predominantly non-academic, so they should (I hope) be fairly accessible and quick to read despite the potential dryness of the topics.

Practical

This practical will show you how data transformation is an essential, but often overlooked, aspect of data analysis: depending on the choices we make here, we can reduce (or increase) the dimensionality of the data and make it more (or less) tractable for subsequent analysis. This approach to the pipeline relies on you being able to see your data as existing in an abstract ‘space’ that can be manipulated in order to foreground, compress, or even mask attributes.
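As a rough illustration of what reducing or increasing dimensionality can look like in code, the sketch below uses scikit-learn on synthetic data. The data, the variable names, and the choice of PolynomialFeatures and PCA are assumptions made for this example; they are not taken from the practical notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

# Hypothetical data: 200 observations of 3 attributes.
# x2 is almost a rescaled copy of x1, so the data is 'really' close to 2-D.
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Increase dimensionality: add squared and pairwise interaction terms.
# 3 original columns + 3 squares + 3 interactions = 9 columns.
expanded = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

# Reduce dimensionality: project the 3 columns onto 2 principal components.
pca = PCA(n_components=2)
reduced = pca.fit_transform(X)

print("original:", X.shape)
print("expanded:", expanded.shape)
print("reduced: ", reduced.shape)
print("share of variance kept by 2 components:",
      round(pca.explained_variance_ratio_.sum(), 3))
```

Because two of the three synthetic attributes are nearly redundant, very little variance is lost by dropping to two components. Real data is rarely this tidy, which is where the judgement discussed above comes in.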

Connections

The practical focusses on:

  • Working with a more complex data structure to create new ‘grouped’ variables (as the simplest form of transformation).
  • Using sklearn to fit and transform data in a flexible manner.
  • Applying two types of dimensionality reduction to demonstrate how linear and non-linear techniques differ.
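The three steps above might be sketched roughly as follows. Everything here is an assumption made for illustration: the occupation columns and their groupings are hypothetical, the values are random, and StandardScaler, PCA, and t-SNE are simply common scikit-learn choices, not necessarily the ones used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical table: counts of residents by occupation for 60 areas.
cols = ["managers", "professionals", "sales", "process", "elementary"]
df = pd.DataFrame(rng.integers(10, 500, size=(60, len(cols))), columns=cols)

# 1. 'Grouped' variables: collapse detailed columns into broader categories.
groups = {
    "office": ["managers", "professionals"],
    "service": ["sales"],
    "manual": ["process", "elementary"],
}
grouped = pd.DataFrame(
    {name: df[members].sum(axis=1) for name, members in groups.items()}
)

# 2. Fit, then transform: fit() learns each column's mean and standard
#    deviation; transform() applies them. Keeping the fitted scaler means
#    the same parameters can later be applied to new data.
scaler = StandardScaler()
scaler.fit(df)
scaled = scaler.transform(df)

# 3. Two kinds of dimensionality reduction on the same scaled data.
#    PCA is linear; t-SNE is non-linear and mainly preserves local
#    neighbourhoods, so its axes have no direct interpretation.
pca_xy = PCA(n_components=2).fit_transform(scaled)
tsne_xy = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(scaled)

print(grouped.head())
print("PCA output:", pca_xy.shape, " t-SNE output:", tsne_xy.shape)
```

Since the values are random, the resulting embeddings carry no real structure; the point is only the shape of the workflow. Plotting pca_xy and tsne_xy side by side for a real data set is one way to see how differently the two approaches arrange the same observations.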

To access the practical:

  1. Preview on GitHub
  2. Download the Notebook

References

Bunday, B. D. n.d. “A Final Tale or You Can Prove Anything with Figures.” https://www.ucl.ac.uk/~ucahhwi/AFinalTale.pdf.
Cima, R. n.d. “The Most and Least Diverse Cities in America.” Priceonomics. https://priceonomics.com/the-most-and-least-diverse-cities-in-america/.
Harris, R. n.d. “The Certain Uncertainty of University Rankings.” RPubs. https://rpubs.com/profrichharris/uni-rankings.
Lu, Yonggang, and Kevin S. S. Henning. 2013. “Are Statisticians Cold-Blooded Bosses? A New Perspective on the ‘Old’ Concept of Statistical Population.” Teaching Statistics 35 (1). Wiley Online Library: 66–71. https://doi.org/10.1111/j.1467-9639.2012.00524.x.

Footnotes

  1. To be clear, this is a metaphor only!