Jon Reades - j.reades@ucl.ac.uk
1st October 2025
Pros:
Cons:
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
We can compare performance using profiling1:
Pandas execution time: 1.176813 seconds
Polars execution time: 1.050749 seconds
10.71% faster than pandas.
DuckDB execution time: 0.271305 seconds
76.95% faster than pandas and 74.18% faster than polars.
Method | Achieves |
---|---|
count() |
Total number of items |
first() , last() |
First and last item |
mean() , median() |
Mean and median |
min() , max() |
Minimum and maximum |
std() , var() |
Standard deviation and variance |
mad() |
Mean absolute deviation |
prod() |
Product of all items |
sum() |
Sum of all items |
In Pandas these follow a split / apply / combine approach:
class | sex | age | |
---|---|---|---|
0 | Third | m | 22.0 |
1 | First | f | 38.0 |
2 | Third | f | 26.0 |
3 | First | f | 35.0 |
4 | Third | m | 35.0 |
5 | Third | m | NaN |
6 | First | m | 54.0 |
7 | Third | m | 2.0 |
class | sex | age | |
---|---|---|---|
1 | First | f | 38.0 |
3 | First | f | 35.0 |
6 | First | m | 54.0 |
class | sex | age | |
---|---|---|---|
0 | Third | m | 22.0 |
2 | Third | f | 26.0 |
4 | Third | m | 35.0 |
5 | Third | m | NaN |
7 | Third | m | 2.0 |
class | age | |
---|---|---|
First | 42.33 |
class | age | |
---|---|---|
Third | 21.25 |
class | age | |
---|---|---|
First | 42.33 | |
Third | 21.25 |
For instance, to get multiple summary values:
sum | mean | |
---|---|---|
class | ||
First | 7111.42 | 38.233441 |
Second | 5168.83 | 29.877630 |
Third | 8924.92 | 25.140620 |
In cases where we want to group by a derived value that doesn’t exist in the data we can apply a mapping as part of the process:
You can take this one step further by ‘applying’ an arbitrary function in a row- or column-wise fashion.
A ‘special case’ of Group By features:
pandas.cut(<series>, <bins>)
can be a useful feature here since it chops a continuous feature into bins suitable for grouping.class | First | Second | Third | |
---|---|---|---|---|
sex | age | |||
female | (0, 18] | 0.909 | 1.000 | 0.512 |
(18, 80] | 0.973 | 0.900 | 0.424 | |
male | (0, 18] | 0.800 | 0.600 | 0.216 |
(18, 80] | 0.375 | 0.071 | 0.134 |
sql = """PIVOT (
SELECT sex, survived, class, CASE WHEN age < 18 then '(0, 18]' else '(18, 80]' end as age
FROM read_parquet('titanic.pq'))
ON class USING SUM(survived) AS survived GROUP BY sex, age"""
pd.set_option('display.float_format', '{:,.0f}'.format)
df = duck.sql(sql).to_df().set_index('sex')
df.columns = ['Age','First','Second','Third']
df
Age | First | Second | Third | |
---|---|---|---|---|
sex | ||||
female | (18, 80] | 84 | 58 | 53 |
male | (0, 18] | 4 | 9 | 10 |
male | (18, 80] | 41 | 8 | 37 |
female | (0, 18] | 7 | 12 | 19 |
Other ways to summarise:
survived | n | ||
---|---|---|---|
class | sex | ||
First | female | Y | 91 |
female | N | 3 | |
Second | female | Y | 70 |
female | N | 6 | |
Third | female | Y | 72 |
female | N | 72 |