Linking Data

Jon Reades

> Data sets in isolation are of limited use, but the ability to link data unlocks all kinds of new possibilities!

Non-Spatial Linkages

Data Set 1

SensorID	Place	Owner
1	LHR	BAA
2	LGA	GIP
3	STA	MAG
4	LUT	Luton LA
5	SEN	Stobart

df1 = pd.DataFrame({
  'SensorID': [1,2,3,4,5],
  'Place': ['LHR','LGA','STA','LUT','SEN'],
  'Owner': ['BAA','GIP','MAG','Luton LA','Stobart']
})

Data Set 2

SensorID	Parameter	Value
1	Temperature	5ºC
1	Humidity	15%
3	Temperature	7ºC
4	Temperature	7ºC
6	Humidity	18%

df2 = pd.DataFrame({
  'SensorID': [1,2,3,4,5],
  'Parameter': ['Temperature','Humidity','Temperature','Temperature','Humidity'],
  'Value': ['5ºC','15%','7ºC','7ºC','18%']
})

General Join Syntax

A join refers to the merging of two (or more) data tables using one (or more) matching columns:

pd.merge(df1, df2, on='SensorID')

Inner Join

Data Set 1

SensorID	Place	Owner
1 ⇒	LHR	BAA
2 ✘	LGA	GIP
3 ⇒	STA	MAG
4 ⇒	LUT	Luton LA
5 ✘	SEN	Stobart

Data Set 2

SensorID	Parameter	Value
1 ⇐	Temperature	5ºC
1 ⇐	Humidity	15%
3 ⇐	Temperature	7ºC
4 ⇐	Temperature	7ºC
6 ✘	Humidity	18%

Inner Join Result

On an Inner Join all non-matching rows are dropped:

pd.merge(df1, df2, 
         how = 'inner',
         on  = 'SensorID')

SensorID	Place	Owner	Parameter	Value
1	LHR	BAA	Temperature	5ºC
1	LHR	BAA	Humidity	15%
3	STA	MAG	Temperature	7ºC
4	LUT	Luton LA	Temperature	7ºC

But What If…

If Data Set 2 had a SensorKey instead of a SensorID then:

pd.merge(df1, df2, 
         how      = 'inner',
         left_on  = 'SensorID',
         right_on = 'SensorKey')

We will get an ‘extra’ field:

SensorID	Place	Owner	SensorKey	Parameter	Value
1	LHR	BAA	1	Temperature	5ºC
1	LHR	BAA	1	Humidity	15%
3	STA	MAG	3	Temperature	7ºC
4	LUT	Luton LA	4	Temperature	7ºC

Outer Join

On an Outer Join all rows are retained, including ones with no match:

pd.merge(df1, df2,
         how = 'outer',
         on  = 'SensorID')

SensorID	Place	Owner	Parameter	Value
1	LHR	BAA	Temp.	5℃
1	LHR	BAA	Hum.	15%
2	LGA	GIP	NaN	NaN
3	STA	MAG	Temp.	7℃
4	LUT	Luton Borough	NaN	NaN
5	SEN	Stobart	NaN	NaN
6	NaN	NaN	Hum.	20%

Left Join

On a Left Join all rows on the left table are retained, including ones with no match, but unmatched right rows are dropped:

pd.merge(df1, df2, 
        how = 'left',
        on  = 'SensorID')

SensorID	Place	Owner	Parameter	Value
1	LHR	BAA	Temp.	5℃
1	LHR	BAA	Hum.	15%
2	LGA	GIP	NaN	NaN
3	STA	MAG	Temp.	7℃
4	LUT	Luton Borough	NaN	NULL
5	SEN	Stobart	NaN	NaN

> Any guesses for the fourth type?

Append & Concat

Pandas has two additional join-like functions:

Append: can be used to add a dict, Series, or DataFrame to the ‘bottom’ of an existing df. It’s not advisable to extend a df one row at a time (do bulk concatenations instead).
Concat: can be used to concatenate two dfs together along either axis (rows or columns) “while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.”

Concat

df3 = pd.DataFrame.from_dict({
    'SensorID': [2,3,8,9,10],
    'Place': ['STA','LUT','BHX','MAN','INV'],
    'Owner': ['BAA','Luton LA','???','???','???']
})
pd.concat([df1, df3], ignore_index=True)

Outputs:

	SensorID	Place	Owner
0	1	LHR	BAA
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)
4	5	SEN	Stobart
5	2	STA	BAA
6	3	LUT	GIP
7	8	BHX	???
8	9	MAN	???
9	10	INV	???

Append

to_append = [
    {'SensorID': 0, 'Parameter': 5,  'Humidity', 'Value': 0.45},
    {'SensorID': 1, 'Parameter': 5,  'Humidity', 'Value': 0.31},
    {'SensorID': 2, 'Parameter': 4, 'Temperature', 'Value': 2},
    {'SensorID': 3, 'Parameter': 3, 'Temperature', 'Value': 3}]
df2.append(to_append)

Outputs:

SensorID	Parameter	Value
0	1	Temperature	5.00
\(\vdots\)	\(\vdots\)	\(\vdots\)	\(\vdots\)
4	6	Humidity	0.18
0	5	Humidity	0.45
1	5	Humidity	0.31
2	4	Temperature	2.00
3	3	Temperature	3.00

Merge vs. Append vs. Concat vs. Join?

As usual, Stack Overflow to the rescue:

A very high level difference is that merge() is used to combine two (or more) dataframes on the basis of values of common columns (indices can also be used, use left_index=True and/or right_index=True), and concat() is used to append one (or more) dataframes one below the other (or sideways, depending on whether the axis option is set to 0 or 1).

join() is used to merge 2 dataframes on the basis of the index; instead of using merge() with the option left_index=True we can use join().

Hint: axis=0 refers to the row index & axis=1 to the column index.

Merge vs. Append vs. Concat vs. Join?

These achieve the same thing, but they are not always equivalent:

pd.merge(df1, df2, left_index=True, right_index=True)
pd.concat([df1, df2], axis=1)
df1.join(df2)

Generally:

Concat expects the number of columns in all data frames to match (if concatenating vertically) and the number of rows in all data frames to match (if concatenating horizontally). It does not deal well with linking.
Append assumes that either the columns or the rows will match.
Join is basically a functionality-restricted merge.

Linking Data

> Data sets in isolation are of limited use, but the ability to link data unlocks all kinds of new possibilities!

Non-Spatial Linkages

Data Set 1

Data Set 2

General Join Syntax

Inner Join

Data Set 1

Data Set 2

Inner Join Result

But What If…

Outer Join

Left Join

> Any guesses for the fourth type?

Append & Concat

Concat

Append

Merge vs. Append vs. Concat vs. Join?

Merge vs. Append vs. Concat vs. Join?

Resources