Linking Data

Jon Reades - j.reades@ucl.ac.uk

1st October 2025

Non-Spatial Linkages

Data Set 1

df1 = pd.DataFrame({
  'SensorID': [1,2,3,4,5],
  'Place': ['LHR','LGA','STA','LUT','SEN'],
  'Owner': ['BAA','GIP','MAG','Luton LA','Stobart']
})
df1
SensorID Place Owner
0 1 LHR BAA
1 2 LGA GIP
2 3 STA MAG
3 4 LUT Luton LA
4 5 SEN Stobart

Data Set 2

df2 = pd.DataFrame({
  'SensorID': [1,1,3,4,6],
  'Parameter': ['Temperature','Humidity','Temperature','Temperature','Humidity'],
  'Value': ['5ºC','15%','7ºC','7ºC','18%']
})
df2
SensorID Parameter Value
0 1 Temperature 5ºC
1 1 Humidity 15%
2 3 Temperature 7ºC
3 4 Temperature 7ºC
4 6 Humidity 18%

General Join Syntax

A join refers to the merging of two (or more) data tables using one (or more) matching columns:

pd.merge(df1, df2, on='SensorID')
SensorID Place Owner Parameter Value
0 1 LHR BAA Temperature 5ºC
1 1 LHR BAA Humidity 15%
2 3 STA MAG Temperature 7ºC
3 4 LUT Luton LA Temperature 7ºC

This can also be written:

pd.merge(df1, df2, how='inner', on='SensorID')

Inner Join

Data Set 1

SensorID Place Owner
1 ⇒ LHR BAA
2 ✘ LGA GIP
3 ⇒ STA MAG
4 ⇒ LUT Luton LA
5 ✘ SEN Stobart

Data Set 2

SensorID Parameter Value
1 ⇐ Temperature 5ºC
1 ⇐ Humidity 15%
3 ⇐ Temperature 7ºC
4 ⇐ Temperature 7ºC
6 ✘ Humidity 18%

But What If…

If Data Set 2 had a SensorKey instead of a SensorID then:

pd.merge(df1, df2.rename(columns={'SensorID':'SensorKey'}), 
         how      = 'inner',
         left_on  = 'SensorID',
         right_on = 'SensorKey')
SensorID Place Owner SensorKey Parameter Value
0 1 LHR BAA 1 Temperature 5ºC
1 1 LHR BAA 1 Humidity 15%
2 3 STA MAG 3 Temperature 7ºC
3 4 LUT Luton LA 4 Temperature 7ºC

Outer Join

On an Outer Join all rows are retained, including ones with no match:

pd.merge(df1, df2,
         how = 'outer',
         on  = 'SensorID')
SensorID Place Owner Parameter Value
0 1 LHR BAA Temperature 5ºC
1 1 LHR BAA Humidity 15%
2 2 LGA GIP NaN NaN
3 3 STA MAG Temperature 7ºC
4 4 LUT Luton LA Temperature 7ºC
5 5 SEN Stobart NaN NaN
6 6 NaN NaN Humidity 18%

Left Join

On a Left Join all rows on the left table are retained, including ones with no match, but unmatched right rows are dropped:

pd.merge(df1, df2, 
         how = 'left',
         on  = 'SensorID')
SensorID Place Owner Parameter Value
0 1 LHR BAA Temperature 5ºC
1 1 LHR BAA Humidity 15%
2 2 LGA GIP NaN NaN
3 3 STA MAG Temperature 7ºC
4 4 LUT Luton LA Temperature 7ºC
5 5 SEN Stobart NaN NaN

> Any guesses for the fourth type?

Append & Concat

Pandas has two additional join-like functions:

  • Append: can be used to add a dict, Series, or DataFrame to the ‘bottom’ of an existing df. It’s not advisable to extend a df one row at a time (do bulk concatenations instead).
  • Concat: can be used to concatenate two dfs together along either axis (rows or columns) “while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.”

Concat

df3 = pd.DataFrame.from_dict({
    'SensorID': [2,3,8,9,10],
    'Place': ['STA','LUT','BHX','MAN','INV'],
    'Owner': ['BAA','Luton LA','???','???','???']
})
pd.concat([df1, df3], ignore_index=True)
SensorID Place Owner
0 1 LHR BAA
1 2 LGA GIP
2 3 STA MAG
3 4 LUT Luton LA
4 5 SEN Stobart
5 2 STA BAA
6 3 LUT Luton LA
7 8 BHX ???
8 9 MAN ???
9 10 INV ???

Append Also Concat

to_append = [
    {'SensorID': 0, 'Parameter': 'Humidity', 'Value': 0.45},
    {'SensorID': 1, 'Parameter': 'Humidity', 'Value': 0.31},
    {'SensorID': 2, 'Parameter': 'Temperature', 'Value': 2},
    {'SensorID': 3, 'Parameter': 'Temperature', 'Value': 3}]
pd.concat([df2, pd.DataFrame(to_append)], ignore_index=True)
SensorID Parameter Value
0 1 Temperature 5ºC
1 1 Humidity 15%
2 3 Temperature 7ºC
3 4 Temperature 7ºC
4 6 Humidity 18%
5 0 Humidity 0.45
6 1 Humidity 0.31
7 2 Temperature 2.0
8 3 Temperature 3.0

Merge vs. Concat vs. Join?

As usual, Stack Overflow to the rescue:

A very high level difference is that merge() is used to combine two (or more) dataframes on the basis of values of common columns (indices can also be used, use left_index=True and/or right_index=True), and concat() is used to append one (or more) dataframes one below the other (or sideways, depending on whether the axis option is set to 0 or 1).

join() is used to merge 2 dataframes on the basis of the index; instead of using merge() with the option left_index=True we can use join().

Hint: axis=0 refers to the row index & axis=1 to the column index.

Merge vs. Concat vs. Join?

These achieve the same thing, but they are not always equivalent or what you want:

pd.merge(df1, df2, left_index=True, right_index=True)
pd.concat([df1, df2], axis=1)
df1.join(df2, rsuffix='_r')
SensorID Place Owner SensorID_r Parameter Value
0 1 LHR BAA 1 Temperature 5ºC
1 2 LGA GIP 1 Humidity 15%
2 3 STA MAG 3 Temperature 7ºC
3 4 LUT Luton LA 4 Temperature 7ºC
4 5 SEN Stobart 6 Humidity 18%

Generally:

  • Concat expects the number of columns in all data frames to match (if concatenating vertically) and the number of rows in all data frames to match (if concatenating horizontally). It does not deal well with linking (docs).
  • Join is basically a functionality-restricted merge (docs).
  • Append assumes that either the columns or the rows will match. [Append has been deprecated]

Additional Resources

Thank You

References