Linking Data

Jon Reades

Non-Spatial Linkages

Data Set 1

SensorID Place Owner
1 LHR BAA
2 LGA GIP
3 STA MAG
4 LUT Luton LA
5 SEN Stobart
df1 = pd.DataFrame({
  'SensorID': [1,2,3,4,5],
  'Place': ['LHR','LGA','STA','LUT','SEN'],
  'Owner': ['BAA','GIP','MAG','Luton LA','Stobart']
})

Data Set 2

SensorID Parameter Value
1 Temperature 5ºC
1 Humidity 15%
3 Temperature 7ºC
4 Temperature 7ºC
6 Humidity 18%
df2 = pd.DataFrame({
  'SensorID': [1,2,3,4,5],
  'Parameter': ['Temperature','Humidity','Temperature','Temperature','Humidity'],
  'Value': ['5ºC','15%','7ºC','7ºC','18%']
})

General Join Syntax

A join refers to the merging of two (or more) data tables using one (or more) matching columns:

pd.merge(df1, df2, on='SensorID')

Inner Join

Data Set 1

SensorID Place Owner
1 ⇒ LHR BAA
2 ✘ LGA GIP
3 ⇒ STA MAG
4 ⇒ LUT Luton LA
5 ✘ SEN Stobart

Data Set 2

SensorID Parameter Value
1 ⇐ Temperature 5ºC
1 ⇐ Humidity 15%
3 ⇐ Temperature 7ºC
4 ⇐ Temperature 7ºC
6 ✘ Humidity 18%

Inner Join Result

On an Inner Join all non-matching rows are dropped:

pd.merge(df1, df2, 
         how = 'inner',
         on  = 'SensorID')

 

SensorID Place Owner Parameter Value
1 LHR BAA Temperature 5ºC
1 LHR BAA Humidity 15%
3 STA MAG Temperature 7ºC
4 LUT Luton LA Temperature 7ºC

But What If…

If Data Set 2 had a SensorKey instead of a SensorID then:

pd.merge(df1, df2, 
         how      = 'inner',
         left_on  = 'SensorID',
         right_on = 'SensorKey')

 

We will get an ‘extra’ field:

SensorID Place Owner SensorKey Parameter Value
1 LHR BAA 1 Temperature 5ºC
1 LHR BAA 1 Humidity 15%
3 STA MAG 3 Temperature 7ºC
4 LUT Luton LA 4 Temperature 7ºC

Outer Join

On an Outer Join all rows are retained, including ones with no match:

pd.merge(df1, df2,
         how = 'outer',
         on  = 'SensorID')

 

SensorID Place Owner Parameter Value
1 LHR BAA Temp. 5℃
1 LHR BAA Hum. 15%
2 LGA GIP NaN NaN
3 STA MAG Temp. 7℃
4 LUT Luton Borough NaN NaN
5 SEN Stobart NaN NaN
6 NaN NaN Hum. 20%

Left Join

On a Left Join all rows on the left table are retained, including ones with no match, but unmatched right rows are dropped:

pd.merge(df1, df2, 
        how = 'left',
        on  = 'SensorID')

 

SensorID Place Owner Parameter Value
1 LHR BAA Temp. 5℃
1 LHR BAA Hum. 15%
2 LGA GIP NaN NaN
3 STA MAG Temp. 7℃
4 LUT Luton Borough NaN NULL
5 SEN Stobart NaN NaN

> Any guesses for the fourth type?

Append & Concat

Pandas has two additional join-like functions:

  • Append: can be used to add a dict, Series, or DataFrame to the ‘bottom’ of an existing df. It’s not advisable to extend a df one row at a time (do bulk concatenations instead).
  • Concat: can be used to concatenate two dfs together along either axis (rows or columns) “while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.”

Concat

df3 = pd.DataFrame.from_dict({
    'SensorID': [2,3,8,9,10],
    'Place': ['STA','LUT','BHX','MAN','INV'],
    'Owner': ['BAA','Luton LA','???','???','???']
})
pd.concat([df1, df3], ignore_index=True)

Outputs:

SensorID Place Owner
0 1 LHR BAA
\(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\)
4 5 SEN Stobart
5 2 STA BAA
6 3 LUT GIP
7 8 BHX ???
8 9 MAN ???
9 10 INV ???

Append

to_append = [
    {'SensorID': 0, 'Parameter': 5,  'Humidity', 'Value': 0.45},
    {'SensorID': 1, 'Parameter': 5,  'Humidity', 'Value': 0.31},
    {'SensorID': 2, 'Parameter': 4, 'Temperature', 'Value': 2},
    {'SensorID': 3, 'Parameter': 3, 'Temperature', 'Value': 3}]
df2.append(to_append)

Outputs:

SensorID Parameter Value
0 1 Temperature 5.00
\(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\)
4 6 Humidity 0.18
0 5 Humidity 0.45
1 5 Humidity 0.31
2 4 Temperature 2.00
3 3 Temperature 3.00

Merge vs. Append vs. Concat vs. Join?

As usual, Stack Overflow to the rescue:

A very high level difference is that merge() is used to combine two (or more) dataframes on the basis of values of common columns (indices can also be used, use left_index=True and/or right_index=True), and concat() is used to append one (or more) dataframes one below the other (or sideways, depending on whether the axis option is set to 0 or 1).

join() is used to merge 2 dataframes on the basis of the index; instead of using merge() with the option left_index=True we can use join().

Hint: axis=0 refers to the row index & axis=1 to the column index.

Merge vs. Append vs. Concat vs. Join?

These achieve the same thing, but they are not always equivalent:

pd.merge(df1, df2, left_index=True, right_index=True)
pd.concat([df1, df2], axis=1)
df1.join(df2)

Generally:

  • Concat expects the number of columns in all data frames to match (if concatenating vertically) and the number of rows in all data frames to match (if concatenating horizontally). It does not deal well with linking.
  • Append assumes that either the columns or the rows will match.
  • Join is basically a functionality-restricted merge.

Resources