Pandas

Jon Reades

Why Pandas?

Pandas is probably (together with scipy, numpy, and sklearn) the main reason that Python has become popular for data science. According to ‘Learn Data Sci’ it accounts for 1% of all Stack Overflow question views!

You will want to bookmark these:

Pandas Terminology (Data Frame)

Pandas Terminology (Index)

Pandas Terminology (Series)

Pandas Terminology (Slice)

Using Pandas

Here’s code to read a (remote) CSV file:

import pandas as pd      # import package
# Bitly: https://raw.githubusercontent.com/jreades/fsds/master/data/2019-sample-crime.csv
url='https://bit.ly/39SJpfp'
df = pd.read_csv(url)       # load a (remote) CSV
print(type(df))             # not simple data type
print(df.columns.to_list()) # column names
print(df.columns.values)    # Also works but deprecated

Output:

<class 'pandas.core.frame.DataFrame'>
['ID', 'Case Number', 'Date', 'Primary Type', 'Description',
 'Location Description', 'Arrest', 'Domestic', 'Year', 'Latitude', 'Longitude']

Summarise a Data Frame

df.describe() # Information about each Series
df.info()     # Information about each Series and the df

df.info() focuses on data types and memory usage, while df.describe() summarises the distribution of values in each Series.

Familiar?

This should be looking eerily familiar:

print(type(df['Latitude']))          # type for column
print(type(df['Latitude'].array))    # type for values
print(df['Latitude'].array[:5])     # first five values
print(f"1: {df['Latitude'].mean()}") # summarise a series/column
print(f"2: {df.Latitude.mean()}")    # if no spaces in name

Produces:

<class 'pandas.core.series.Series'>
<class 'numpy.ndarray'>
[41.75130706 41.90399688 41.88032861 41.92438396 41.75579713]
1: 41.84550008439
2: 41.84550008439

Jupyter Formatting

Pandas is also ‘Jupyter-aware’, meaning that output can be displayed directly in Jupyter in ‘fancy’ ways:

Familiar?

df.head(3)                       # First 3 rows of df
df[['ID','Date','Year']].tail(3) # Last 3 rows of selected columns
df.sample(frac=0.3)              # A random 30% sample
df.sample(3, random_state=42)    # A random sample with a seed
df.sample(3, random_state=42)    # Same sample!

Data Frames vs Series

Pandas operates on two principles:

  1. Any operation on a Data Frame returns a Data Frame.
  2. Any operation on a Series returns a Series.

‘Shallow’ Copies

More subtly, operations on a Series or Data Frame return a shallow copy, which is like a ‘view’ in a database…

  1. The original is unchanged unless you specify inplace=True (where supported).
  2. Attempts to change a subset of the data frame will often trigger a SettingWithCopyWarning warning.

If you need a full copy then use the copy() method (e.g. df.copy() or df['Latitude'].copy()).
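A quick sketch of the difference, using a made-up two-column frame for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# drop() returns a modified copy; the original frame is untouched
dropped = df.drop(columns=['b'])
print('b' in df.columns)                      # True: df still has column 'b'

# for an independent copy you can safely modify, use copy()
df2 = df.copy()
df2.drop(columns=['b'], inplace=True)         # modifies df2 only
print('b' in df.columns, 'b' in df2.columns)  # True False
```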

Putting These Ideas Together

# Returns a series but not a column
df.Latitude - 1 
# Saves returned series as a new column
df['lat'] = df.Latitude - 1
# Returns a new data frame w/o 'lat' 
df.drop(columns=['lat']) 
# Modifies df directly
df.drop(columns=['lat'], inplace=True) 
# Try to modify a view of df (triggers warning)
df[df['Primary Type']=='BURGLARY'].Latitude = 41.7 
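The safe alternative to that last line is label-based assignment with .loc, sketched here on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Primary Type': ['BURGLARY', 'THEFT', 'BURGLARY'],
    'Latitude':     [41.75, 41.90, 41.88],
})

# Chained indexing (df[mask].Latitude = ...) assigns to a temporary
# copy; .loc with a boolean mask modifies df itself in one step.
df.loc[df['Primary Type'] == 'BURGLARY', 'Latitude'] = 41.7
print(df.Latitude.to_list())  # [41.7, 41.9, 41.7]
```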

What Can We Do?

Chaining

Operations on a Data Frame return a DataFrame and operations on a Series return a Series, allowing us to ‘chain’ steps together:

df.sort_values(by=['Year','ID'], ascending=False).sample(frac=0.5).head(20).median(numeric_only=True)

Selection

# Returns a selection (Boolean series)
df['Primary Type']=='ASSAULT'

# All rows where Primary Type is ASSAULT
df[ df['Primary Type']=='ASSAULT' ]

# Calculations on a slice (returns mean centroid!)
df[df['Primary Type']=='ASSAULT'][['Longitude','Latitude']].mean()

# Two conditions with a bit-wise AND
df[
  (df['Primary Type']=='ASSAULT') &
  (df['Description']=='AGGRAVATED: HANDGUN')
]

# Two conditions with a bit-wise OR
df[
  (df['Primary Type']=='ASSAULT') |
  (df['Primary Type']=='THEFT')
]

Now we can automate… data analysis!

Dealing with Types

A Data Series can only be of one type:

Pandas Dtype    Python Type      Usage
object          str or mixed     Text or mixed columns (including arrays)
int64           int              Integer columns
float64         float            Floating point columns
bool            bool             True/False columns
datetime64      N/A (datetime)   Date and time columns
timedelta[ns]   N/A (datetime)   Datetime difference columns
category        N/A (set)        Categorical columns
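A small illustration (with toy data) of how pandas infers a single dtype for each Series:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])      # all ints -> int64
s2 = pd.Series([1, 2, 3.5])    # one float 'promotes' the whole series -> float64
s3 = pd.Series([1, 'two', 3])  # genuinely mixed types fall back to object
print(s1.dtype, s2.dtype, s3.dtype)  # int64 float64 object
```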

Changing the Type

print(df['Primary Type'].unique())   # Find unique values
print(df['Primary Type'].dtype.name) # Confirm is 'object'
df['Primary Type'] = df['Primary Type'].astype('category')
print(df['Primary Type'].dtype.name) # Confirm is 'category'
print(df['Primary Type'].describe()) # Category column info

Outputs:

['BURGLARY' 'DECEPTIVE PRACTICE' 'BATTERY'...]
object   # < before `astype`
category # < after `astype`
count       100
unique       15
top       THEFT
freq         28
Name: Primary Type, dtype: object # category==special class of object

Datetime Data

What do we do here?

print(df.Date.dtype.name)
# object
df.Date.to_list()[:3]
# ['04/20/2019 11:00:00 PM', '12/02/2019 10:35:00 AM', '10/06/2019 04:50:00 PM']

This shows that Date is currently a string of dates+times.

Pandas handles date and times using a datetime type that also works as an index (more on these later):

df['dt'] = pd.to_datetime(df.Date.array, 
              format="%m/%d/%Y %I:%M:%S %p")
print(df.dt.dtype.name)
# datetime64[ns]
df.dt.to_list()[:3]
# [Timestamp('2019-04-20 23:00:00'), Timestamp('2019-12-02 10:35:00'), Timestamp('2019-10-06 16:50:00')]

These follow the formatting conventions of strftime (string format time) for conversion.

Datetime Formats

Examples of strftime conventions include:

Format   Applies To
%d       2-digit day
%m       2-digit month
%y       2-digit year
%Y       4-digit year
%I       2-digit hour (12-hour clock)
%H       2-digit hour (24-hour clock)
%M       2-digit minute
%S       2-digit second
%p       AM/PM

So that is why:

pd.to_datetime(df.Date.array, format="%m/%d/%Y %I:%M:%S %p")

Note the other things happening here:

  1. pd.to_datetime(...) is not a method, it’s a function from the pandas package.
  2. df.Date.array (and df.Date.to_numpy() and df.Date.to_list()) gives access to the data directly, whereas df.Date gives access to the Series.

Deprecation Warning!

From time to time, real-world software projects will change the way things work. Pandas is just such a project!

Warning

We recommend using Series.array or Series.to_numpy(), depending on whether you need a reference to the underlying data or a NumPy array. See the API Documentation.

So while Series.values still works, and will continue to work for some time, you are being advised to start using Series.array or Series.to_numpy() instead. Meaning, we should consider using df.Date.array.
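The distinction is easy to see on a toy Series: one accessor is a reference to the underlying pandas data, the other a plain NumPy array.

```python
import numpy as np
import pandas as pd

s = pd.Series([41.75, 41.90, 41.88])
print(type(s.array))       # a pandas extension array (a reference to the data)
print(type(s.to_numpy()))  # <class 'numpy.ndarray'> (a plain NumPy array)
```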

Tidying Up

This is one way, there are many options and subtleties…

# Fix categories
mapping = {}

# df['Primary Type'].unique().tolist() also works
for x in df['Primary Type'].cat.categories.to_list():
  mapping[x]=x.title()

# And update
df['Primary Type'] = df['Primary Type'].cat.rename_categories(mapping)

How would you work out what this code does? 1

To deal with pricing information treated as a string:

df2['price'].str.replace('$', '', regex=False).astype(float)

Many more examples accessible via Google!

Dropping Rows and Columns

There are multiple ways to drop ‘stuff’:

df2 = df.copy()
print(f"The data frame has {df2.shape[0]:,} rows and {df2.shape[1]:,} cols.")
df2.drop(index=range(5,10), inplace=True) # Row 'numbers' or index values
print(f"The data frame has {df2.shape[0]:,} rows and {df2.shape[1]:,} cols.")
df2.drop(columns=['Year'], inplace=True)  # Column name(s)
print(f"The data frame has {df2.shape[0]:,} rows and {df2.shape[1]:,} cols.")

There is also df.dropna() which can apply to rows or columns with NULL or np.nan values.
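A minimal sketch (toy data) of the main dropna() options:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

print(df.dropna().shape)                   # drop rows with any NaN -> (1, 2)
print(df.dropna(subset=['a']).shape)       # only consider column 'a' -> (2, 2)
print(df.dropna(axis=1, how='all').shape)  # drop all-NaN columns; none here -> (3, 2)
```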

I often prefer df = df[df.index > 15] (negative selection) to df.drop(index=range(0,16)) (positive selection).

Accessing Data by Location

Index   0           1             2                         3
        ID          Case Number   Date                      Primary Type
0       11667185    JC237601      04/20/2019 11:00:00 PM    BURGLARY
1       11909178    JC532226      12/02/2019 10:35:00 AM    DECEPTIVE PRACTICE
2       11852571    JC462365      10/06/2019 04:50:00 PM    BATTERY

We can interact with rows and columns by position or name:

df.iloc[0:2,0:2] # List selection! (':' means 'all')
df.loc[0:2,['ID','Case Number']] # Dict selection

These actually return different results because of the index:

  • df.loc returns the rows labeled 0, 1, and 2 ([0..2]), whereas
  • df.iloc returns the range 0..2 ([0..2))!
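The off-by-one is easy to demonstrate on a small made-up frame with the default 0..n index:

```python
import pandas as pd

df = pd.DataFrame({'ID': [10, 11, 12, 13], 'Year': [2019] * 4})

print(df.loc[0:2].shape)   # label slice is inclusive  -> (3, 2)
print(df.iloc[0:2].shape)  # integer slice is half-open -> (2, 2)
```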

Indexes

So by default, pandas creates a row index whose values are 0..n and a column index whose values are the column names. You will see this if you print out the head:

df.head(3)

The left-most column (without a name) is the index.

df.set_index('ID', inplace=True)
df.head(3)

Now we see:

         Case Number                    Date  ...  Longitude                  dt
ID                                            ...
11667185    JC237601  04/20/2019 11:00:00 PM  ... -87.603468 2019-04-20 23:00:00
11909178    JC532226  12/02/2019 10:35:00 AM  ... -87.643230 2019-12-02 10:35:00
11852571    JC462365  10/06/2019 04:50:00 PM  ... -87.758473 2019-10-06 16:50:00

So ID is now the index and is not accessible as a column: df.ID will now throw an error because it’s no longer part of the Column Index.

Indexes (cont’d)

Notice the change to the data frame:

            0             1                         2
ID          Case Number   Date                      Primary Type
11667185    JC237601      04/20/2019 11:00:00 PM    BURGLARY
11909178    JC532226      12/02/2019 10:35:00 AM    DECEPTIVE PRACTICE
11852571    JC462365      10/06/2019 04:50:00 PM    BATTERY

And now:

print(df.loc[11667185,:])
print(df.loc[11667185:11852571,'Case Number':'Date'])

Mnemonic: we used iloc to select rows/cols based on integer location and we use loc to select rows/cols based on name location.

P.S. You can reset the data frame using df.reset_index(inplace=True).

Saving

Pandas can write to a wide range of file types, here are some of the more popular ones:

Command                  Saved As…
df.to_csv(<path>)        CSV file. Note the options to change sep (default is ',') and to suppress index output (index=False).
df.to_excel(<path>)      XLSX file. Note the options to specify a sheet_name, na_rep, and so on, as well as to suppress the index (index=False).
df.to_feather(<path>)    Feather file, directly usable by R. Requires pyarrow to be installed.
df.to_parquet(<path>)    Parquet file, directly usable by many languages. Requires pyarrow to be installed.
df.to_latex(<path>)      LaTeX-formatted table. Display requires booktabs. Could do copy+paste with print(df.to_latex()).
df.to_markdown(<path>)   Markdown-formatted table. Requires tabulate. Could do copy+paste with print(df.to_markdown()).
In most cases compression is detected automatically (e.g. df.to_csv('file.csv.gz')) but you can also specify it (e.g. df.to_csv('file.csv.gz', compression='gzip')).1
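A round-trip sketch using to_csv(index=False); an in-memory StringIO buffer stands in for a file path here purely for illustration:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'ID': [1, 2], 'Year': [2019, 2019]})

# index=False suppresses the unnamed index column in the output,
# so reading the file back reproduces the original frame exactly
buf = StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
round_trip = pd.read_csv(buf)
print(round_trip.equals(df))  # True
```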

Resources