Pandas v Ducks

Jon Reades - j.reades@ucl.ac.uk

1st October 2025

Loading

Summarising

Useful, But Limited?

Method              Achieves
count()             Total number of items
first(), last()     First and last item
mean(), median()    Mean and median
min(), max()        Minimum and maximum
std(), var()        Standard deviation and variance
mad()               Mean absolute deviation (removed in pandas 2.0)
prod()              Product of all items
sum()               Sum of all items
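
A minimal sketch of these methods on a toy Series (the values are made up purely for illustration). Because mad() is no longer available in pandas 2.0 and later, the mean absolute deviation is computed by hand here:

```python
import pandas as pd

# Hypothetical toy data, not taken from any real dataset
s = pd.Series([2.0, 4.0, 4.0, 10.0])

summary = {
    'count': s.count(),    # number of non-null items
    'mean': s.mean(),
    'median': s.median(),
    'min': s.min(),
    'max': s.max(),
    'var': s.var(),        # sample variance (ddof=1 by default)
    'prod': s.prod(),
    'sum': s.sum(),
}

# Mean absolute deviation, written out explicitly
mean_abs_dev = (s - s.mean()).abs().mean()

print(summary)
print(mean_abs_dev)
```

The same methods can be called on a DataFrame, where they summarise each column separately.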

Grouping Operations

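In Pandas these follow a split / apply / combine approach: split the data into groups by key, apply a function to each group, then combine the results into a new data structure.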
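
The three steps can be spelled out by hand to show what groupby does in one call. This is only a sketch: the 'LA' and 'value' columns and their contents are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    'LA': ['A', 'A', 'B', 'B', 'B'],
    'value': [1, 2, 3, 4, 5],
})

# Split: one sub-DataFrame per distinct key
pieces = {key: group for key, group in df.groupby('LA')}

# Apply: run a function on each piece
totals = {key: group['value'].sum() for key, group in pieces.items()}

# Combine: gather the per-group results into one structure
by_hand = pd.Series(totals, name='value')

# The usual one-step equivalent
one_step = df.groupby('LA')['value'].sum()

print(by_hand)
print(one_step)
```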

In Practice

grouped_df = df.groupby(<fields>).<function>()

For instance, if we had a Local Authority (LA) field:

grouped_df = df.groupby('LA').sum()
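
A small illustration of the pattern, grouping first on one field and then on two. The 'year' and 'sales' columns and all values are assumptions made up for this example:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    'LA': ['Hackney', 'Hackney', 'Camden', 'Camden'],
    'year': [2020, 2021, 2020, 2020],
    'sales': [10, 20, 30, 40],
})

# One grouping field: the result is indexed by LA
by_la = df.groupby('LA')['sales'].sum()

# Two grouping fields: the result has a (LA, year) MultiIndex
by_la_year = df.groupby(['LA', 'year'])['sales'].mean()

print(by_la)
print(by_la_year)
```

Selecting the column before aggregating keeps non-numeric fields out of the calculation.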

Using apply(), the function can be anything that accepts one group's DataFrame:

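def norm_by_data(x): # x is the sub-DataFrame for one group, not a single column
    x = x.copy() # avoid modifying the original data in place
    x['d1'] = x['d1'] / x['d2'].sum() # divide d1 by the group's d2 total
    return x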

df.groupby('LA').apply(norm_by_data)
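
For this particular normalisation, one alternative worth knowing is transform(), which returns a result aligned to the original rows. A sketch with made-up values (the 'd1_norm' column name is hypothetical):

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    'LA': ['A', 'A', 'B'],
    'd1': [1.0, 3.0, 5.0],
    'd2': [2.0, 2.0, 10.0],
})

# Per-row copy of each group's d2 total: 4, 4, 10
group_totals = df.groupby('LA')['d2'].transform('sum')

# Normalise d1 by the total for the row's own group
df['d1_norm'] = df['d1'] / group_totals

print(df)
```

apply() remains the more general tool; transform() is simply convenient when the output should have the same shape as the input.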

Grouping by Arbitrary Mappings

mapping = {'HAK':'Inner', 'TH':'Outer', 'W':'Inner'}
df.set_index('LA', inplace=True)
df.groupby(mapping).sum()
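
A self-contained sketch of the same idea with toy counts (the 'count' column and its values are assumptions). The dictionary keys are matched against the index labels, and the dictionary values become the group names:

```python
import pandas as pd

# Hypothetical toy data, one row per Local Authority code
flat = pd.DataFrame({
    'LA': ['HAK', 'TH', 'W'],
    'count': [1, 2, 3],
})

mapping = {'HAK': 'Inner', 'TH': 'Outer', 'W': 'Inner'}

# Option 1: make LA the index, then group by the mapping
by_zone = flat.set_index('LA').groupby(mapping).sum()

# Option 2: leave the index alone and map the column instead
by_zone_alt = flat.groupby(flat['LA'].map(mapping))['count'].sum()

print(by_zone)
print(by_zone_alt)
```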

Pivot Tables

Pivot tables are a ‘special case’ of Group By:

  • Commonly-used in business to summarise data for reporting.
  • Grouping (summarisation) happens along both axes (Group By operates only on one).
  • pandas.cut(<series>, <bins>) can be a useful feature here since it chops a continuous feature into bins suitable for grouping.
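
A quick sketch of what pandas.cut returns, using made-up ages. By default the bins are closed on the right, so 18 falls in the first bin; the 'child' and 'adult' labels are hypothetical:

```python
import pandas as pd

# Hypothetical ages
ages = pd.Series([5, 18, 19, 45, 80])

# Two bins: (0, 18] and (18, 80]
binned = pd.cut(ages, [0, 18, 80])

# The same bins with readable labels instead of intervals
labelled = pd.cut(ages, [0, 18, 80], labels=['child', 'adult'])

print(binned)
print(labelled.value_counts())
```

The resulting categorical Series can be passed straight to groupby or pivot_table as a grouping key.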

In Practice

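# Assumes pandas is imported as pd and titanic is a DataFrame with 'age', 'sex', 'class' and 'survived' columns
age = pd.cut(titanic['age'], [0, 18, 80]) # bin ages into (0, 18] and (18, 80]
titanic.pivot_table('survived', index=['sex', age], columns='class') # mean survival per cell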
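
To make the link between pivot tables and Group By concrete, here is a sketch on a tiny made-up passenger table (not the real Titanic data). The pivot table gives the same numbers as grouping on two fields and then unstacking one of them into columns:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the example above
df = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male', 'female'],
    'class': ['First', 'Third', 'First', 'Third', 'Third', 'First'],
    'survived': [1, 1, 0, 0, 1, 1],
})

# Pivot: rows are sex, columns are class, cells are mean survival
table = df.pivot_table('survived', index='sex', columns='class', aggfunc='mean')

# Group By equivalent: group on both fields, then move class to the columns
unstacked = df.groupby(['sex', 'class'])['survived'].mean().unstack()

print(table)
print(unstacked)
```

The default aggfunc is already the mean; it is written out here only to make the aggregation explicit.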

Counts

Pivots & Groups

Extracting