Data transformation just means changing the raw data in some way to make it more tractable for analysis.
For example:
\[ x-\bar{x} \]
Input | Output |
---|---|
12 | -2 |
13 | -1 |
14 | 0 |
15 | +1 |
16 | +2 |
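As a minimal sketch of the table above, the following subtracts the mean from each input using only Python's standard library. The function name `centre` is a hypothetical one chosen for illustration and is not part of the original material.

```python
from statistics import mean

def centre(values):
    """Subtract the mean of `values` from every observation."""
    m = mean(values)
    return [v - m for v in values]

# The Input column from the table above.
inputs = [12, 13, 14, 15, 16]
centred = centre(inputs)

# Each pair is (input, input minus mean), matching the Output column.
for raw, shifted in zip(inputs, centred):
    print(raw, shifted)
```

The ordering of the observations and the gaps between them are unchanged; only the origin has moved to the mean.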
Transformations are mathematical operations applied to every observation in a data set that preserve some of the relationships between them.
If we subtract the mean from everyone’s height then we can immediately tell if someone is taller or shorter than we would expect.
If we subtract the mean from everyone’s income then we cannot immediately tell if someone is earning more or less than we would expect.
So a transformation that is useful in one context may not be useful in another!
Question: How can you tell if you did better than everyone else on the Quiz or on the Final Report?
Answer: Just subtracting the mean is not enough because the distributions are not the same. For that we also need to standardise the data in some way.
\[ z = \dfrac{x-\bar{x}}{\sigma} \]
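A rough sketch of how the z-score makes two assessments comparable. The marks below are made up for illustration, and \(\sigma\) is assumed here to be the population standard deviation (`statistics.pstdev`); the sample version (`statistics.stdev`) would be an equally defensible choice.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return (x - mean) / sigma for every x, using the population sigma."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical marks: the same five students, in the same order.
quiz = [6, 7, 8, 9, 10]          # marked out of 10
report = [40, 55, 60, 65, 80]    # marked out of 100

quiz_z = z_scores(quiz)
report_z = z_scores(report)

# Compare the fourth student's relative performance on each assessment.
print(round(quiz_z[3], 3), round(report_z[3], 3))
```

In this invented example the fourth student is above the mean on both assessments, but sits further above it, in standard deviation units, on the quiz than on the report.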
Divide through by a measure of the distribution’s spread!
\[ \dfrac{x-\bar{x}}{\sigma} \]
\[ \dfrac{x_{i}-x_{Q2}}{x_{Q3}-x_{Q1}} \]
\[ \dfrac{x_{i}-x_{50^{th}}}{x_{90^{th}}-x_{10^{th}}} \]
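The quantile-based versions can be sketched with `statistics.quantiles` (Python 3.8+). The helper names and the income figures are hypothetical, and the `"inclusive"` interpolation method is an assumption: other quantile definitions will give slightly different cut points.

```python
from statistics import median, quantiles

def iqr_scale(values):
    """(x - median) / (Q3 - Q1) for every x."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return [(v - med) / (q3 - q1) for v in values]

def decile_scale(values):
    """(x - median) / (90th percentile - 10th percentile) for every x."""
    deciles = quantiles(values, n=10, method="inclusive")
    p10, p90 = deciles[0], deciles[-1]
    med = median(values)
    return [(v - med) / (p90 - p10) for v in values]

# Made-up incomes (in thousands) with one extreme value.
incomes = [18, 22, 25, 27, 30, 34, 40, 55, 250]

iqr_scaled = iqr_scale(incomes)
decile_scaled = decile_scale(incomes)
print([round(v, 2) for v in iqr_scaled])
print([round(v, 2) for v in decile_scaled])
```

Because the median and the quantile range barely move when the extreme value changes, the scaled values of the other observations stay stable, which is less true when dividing by \(\sigma\).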
\[ x'_{a,i} = \dfrac{x_{a,i}}{\sum_{g} r_{N,g} P_{a,g}} \]
Details:
\[ \dfrac{x_{i}}{\sum_{i=1}^{n} x_{i}} \]
\[ \dfrac{x_{i}-x_{min}}{x_{max}-x_{min}} \]
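The two formulas above, dividing by the total and rescaling to the observed range, can be sketched as follows. The function names `proportion` and `min_max` are hypothetical, and the data simply reuse the inputs from the earlier table.

```python
def proportion(values):
    """Express each observation as a share of the total."""
    total = sum(values)
    return [v / total for v in values]

def min_max(values):
    """Rescale so the minimum maps to 0 and the maximum maps to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

x = [12, 13, 14, 15, 16]
shares = proportion(x)
rescaled = min_max(x)

print(shares)
print(rescaled)
```

Note that `min_max` assumes the values are not all identical; otherwise the range is zero and the division fails.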
Recall: logs are the inverse of exponentiation!
Let’s assume that \(x = \{10, 100, 1000, 10000\}\), consider what happens if:
The natural log (\(\ln\), i.e. the log to base \(e\)) has certain advantages over other logs and should probably be your default choice for log transformations.
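A short sketch applying three common logs to the example values \(x = \{10, 100, 1000, 10000\}\), using the standard `math` module. The choice of bases to compare is an assumption made for illustration.

```python
import math

x = [10, 100, 1000, 10000]

log10_x = [math.log10(v) for v in x]   # base 10
ln_x = [math.log(v) for v in x]        # natural log, base e
log2_x = [math.log2(v) for v in x]     # base 2

print([round(v, 3) for v in log10_x])
print([round(v, 3) for v in ln_x])
print([round(v, 3) for v in log2_x])

# Exponentiation undoes the log, which is why it is called the inverse.
recovered = [math.exp(v) for v in ln_x]
print([round(v, 6) for v in recovered])
```

Each tenfold jump in \(x\) becomes a constant step on the log scale, whatever the base: changing the base only multiplies every value by the same constant.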
Arbitrarily transforming data isn’t a panacea. ‘Robust’ tests can be another approach when all else fails and two common approaches are:
The term normalization is used in many contexts, with distinct, but related, meanings. Basically, normalizing means transforming so as to render normal. When data are seen as vectors, normalizing means transforming the vector so that it has unit norm. When data are thought of as random variables, normalizing means transforming to normal distribution. When the data are hypothesized to be normal, normalizing means transforming to unit variance.
Source: Stack Exchange
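To make the "unit norm" sense of the quote concrete, here is a minimal sketch that scales a vector to Euclidean length 1. The function name `unit_norm` and the example vector are hypothetical, and the Euclidean (L2) norm is an assumption, since the quote does not say which norm is meant.

```python
import math

def unit_norm(vector):
    """Divide every component by the vector's Euclidean (L2) length."""
    length = math.sqrt(sum(c * c for c in vector))
    return [c / length for c in vector]

v = [3.0, 4.0]
v_hat = unit_norm(v)

# The direction is kept, but the length is now 1.
new_length = math.sqrt(sum(c * c for c in v_hat))
print(v_hat, new_length)
```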
Are spatial (a.k.a. geometrical) transformations any different from the other mathematical transformations covered in this session?
Data exists in a ‘space’ that we can transform and manipulate in various ways using functions to serve our exploratory and analytical purposes.
Transformation • Jon Reades