Foundations of Spatial Data Science

Useful, But Limited?

Method	Achieves
`count()`	Total number of items
`first()`, `last()`	First and last item
`mean()`, `median()`	Mean and median
`min()`, `max()`	Minimum and maximum
`std()`, `var()`	Standard deviation and variance
`mad()`	Mean absolute deviation
`prod()`	Product of all items
`sum()`	Sum of all items

Grouping Operations

In Pandas these follow a split / apply / combine approach:

In Practice

grouped_df = df.groupby(<fields>).<function>

For instance, if we had a Local Authority (LA) field:

grouped_df = df.groupby('LA').sum()

Using apply the function could be anything:

def norm_by_data(x): # x is a column from the grouped df
    x['d1'] /= x['d2'].sum() 
    return x

df.groupby('LA').apply(norm_by_data)

Grouping by Arbitrary Mappings

mapping = {'HAK':'Inner', 'TH':'Outer', 'W':'Inner'}
df.set_index('LA', inplace=True)
df.groupby(mapping).sum()

Pivot Tables

A ‘special case’ of Group By features:

Commonly-used in business to summarise data for reporting.
Grouping (summarisation) happens along both axes (Group By operates only on one).
pandas.cut(<series>, <bins>) can be a useful feature here since it chops a continuous feature into bins suitable for grouping.

In Practice

age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')

Deriving Measures of Diversity

One of the benefits of grouping is that it enables us to derive measures of density and diversity; here are just a few… Location Quotient (LQ), Herfindah-Hirschman Index (HHI), Shanon Entropy.

Location Quotient

The LQ for industry i in zone z is the share of employment for i in z divided by the share of employment of i in the entire region R. \[ LQ_{zi} = \dfrac{Emp_{zi}/Emp_{z}}{Emp_{Ri}/Emp_{R}} \]

	High Local Share	Low Local Share
High Regional Share	\[\approx 1\]	\[< 1\]
Low Regional Share	\[> 1\]	\[\approx 1\]

Herfindahl-Hirschman index

The HHI for an industry i is the sum of squared market shares for each company in that industry: \[ H = \sum_{i=1}^{N} s_{i}^{2} \]

Concentration Level	HHI
Monopolistic: one firm accounts for 100% of the market	\[1.0\]
Oligopolistic: top five firms account for 60% of the market	\[\approx 0.8\]
Competitive: anything else?	\[< 0.5\]?

Shannon Entropy

Shannon Entropy is an information-theoretic measure: \[ H(X) = - \sum_{i=1}^{n} P(x_{i}) log P(x_{i}) \]

Resources

Stop aggregating away the signal in your data

Grouping Data

Useful, But Limited?

Grouping Operations

In Practice

Grouping by Arbitrary Mappings

Pivot Tables

In Practice

Deriving Measures of Diversity

Location Quotient

Herfindahl-Hirschman index

Shannon Entropy

Resources