Question: Retail cashier annual salaries have a Normal distribution with a mean equal to $25,000 and a standard deviation equal to $2,000. What is the probability that a randomly selected retail cashier earns more than $27,000?
Answer: 15.87%
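The answer follows from standardising: $27,000 is exactly one standard deviation above the mean, so we want the upper-tail probability P(Z > 1). A minimal sketch of that calculation, assuming Python's standard-library `statistics.NormalDist` (the variable names are illustrative and not part of the original question):

```python
from statistics import NormalDist

# Salary distribution from the question: mean $25,000, standard deviation $2,000
salaries = NormalDist(mu=25_000, sigma=2_000)

# z-score of the threshold: (27,000 - 25,000) / 2,000 = 1 standard deviation
z = (27_000 - salaries.mean) / salaries.stdev

# P(salary > 27,000) is the upper tail: 1 minus the cumulative probability
p_more = 1 - salaries.cdf(27_000)

print(f"z = {z:.1f}, P(X > 27,000) = {p_more:.2%}")  # z = 1.0, P(X > 27,000) = 15.87%
```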
Result: All models are wrong, but some are useful (George Box)
Instinctively, we know that Bill Gates’ wealth is much further from ‘normal’ than is his height. But how?
We need:
\[ d(i,j) = |i_{1}-j_{1}| \]
\[ d(i,j) = \sqrt{(i_{1}-j_{1})^{2}+(i_{2}-j_{2})^{2}} \]
We can keep adding dimensions…
\[ d(i,j) = \sqrt{(i_{1}-j_{1})^{2}+(i_{2}-j_{2})^{2}+(i_{3}-j_{3})^{2}} \]
You can continue adding dimensions indefinitely, but from here on out you are dealing with hyperspaces!
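The pattern in the formulas above (one squared difference per dimension, summed, then square-rooted) can be sketched as a single function that works for any number of dimensions. The function name `euclidean` is an illustrative choice; on Python 3.8+ the standard library's `math.dist` performs the same calculation.

```python
import math

def euclidean(i, j):
    """Euclidean distance between two points with the same number of dimensions."""
    if len(i) != len(j):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared difference on every dimension, then take the square root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

print(euclidean((1,), (4,)))            # 1 dimension: 3.0
print(euclidean((0, 0), (3, 4)))        # 2 dimensions: 5.0
print(euclidean((0, 0, 0), (2, 3, 6)))  # 3 dimensions: 7.0
```

Nothing in the function changes as dimensions are added: only the length of the coordinate tuples grows.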
We can write the coordinates of an observation with 3 attributes (e.g. height, weight, income) as:
\[ x_{i} = \{ x_{i1}, x_{i2}, x_{i3} \} \]
Something with 8 attributes (e.g. height, weight, income, age, year of birth, …) ‘occupies’ an 8-dimensional space…
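To make this concrete, here is a minimal sketch that turns rows of a table into points, with one dimension per attribute. The two observations and their values are invented purely for illustration, and the helper `as_point` is a hypothetical name; `math.dist` requires Python 3.8+.

```python
import math

# Hypothetical observations (values invented for illustration only)
attributes = ("height_cm", "weight_kg", "income")
people = {
    "a": {"height_cm": 170.0, "weight_kg": 65.0, "income": 30_000.0},
    "b": {"height_cm": 182.0, "weight_kg": 81.0, "income": 30_000.0},
}

def as_point(row, attributes):
    """Turn one row of a table into coordinates: one dimension per attribute."""
    return tuple(row[name] for name in attributes)

x_a = as_point(people["a"], attributes)
x_b = as_point(people["b"], attributes)

print(len(x_a))              # 3 attributes, so a 3-dimensional space
print(math.dist(x_a, x_b))   # distance between the two observations: 20.0
```

Adding more attributes to `attributes` (and to each row) raises the dimensionality of the space without changing any of the code that measures distance.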
If you can shift from thinking in columns of data to thinking of a data space, then you’ll have a much easier time dealing with dimensionality reduction and clustering.
The Data Space • Jon Reades