SensorID | Place | Owner |
---|---|---|
1 | LHR | BAA |
2 | LGA | GIP |
3 | STA | MAG |
4 | LUT | Luton LA |
5 | SEN | Stobart |
A join refers to the merging of two (or more) data tables using one (or more) matching columns:
SensorID | Place | Owner |
---|---|---|
1 ⇒ | LHR | BAA |
2 ✘ | LGA | GIP |
3 ⇒ | STA | MAG |
4 ⇒ | LUT | Luton LA |
5 ✘ | SEN | Stobart |
SensorID | Parameter | Value |
---|---|---|
1 ⇐ | Temperature | 5ºC |
1 ⇐ | Humidity | 15% |
3 ⇐ | Temperature | 7ºC |
4 ⇐ | Temperature | 7ºC |
6 ✘ | Humidity | 18% |
On an Inner Join all non-matching rows are dropped:
SensorID | Place | Owner | Parameter | Value |
---|---|---|---|---|
1 | LHR | BAA | Temperature | 5ºC |
1 | LHR | BAA | Humidity | 15% |
3 | STA | MAG | Temperature | 7ºC |
4 | LUT | Luton LA | Temperature | 7ºC |
If Data Set 2 had a SensorKey
instead of a SensorID
then:
We will get an ‘extra’ field:
SensorID | Place | Owner | SensorKey | Parameter | Value |
---|---|---|---|---|---|
1 | LHR | BAA | 1 | Temperature | 5ºC |
1 | LHR | BAA | 1 | Humidity | 15% |
3 | STA | MAG | 3 | Temperature | 7ºC |
4 | LUT | Luton LA | 4 | Temperature | 7ºC |
On an Outer Join all rows are retained, including ones with no match:
SensorID | Place | Owner | Parameter | Value |
---|---|---|---|---|
1 | LHR | BAA | Temp. | 5℃ |
1 | LHR | BAA | Hum. | 15% |
2 | LGA | GIP | NaN | NaN |
3 | STA | MAG | Temp. | 7℃ |
4 | LUT | Luton Borough | NaN | NaN |
5 | SEN | Stobart | NaN | NaN |
6 | NaN | NaN | Hum. | 20% |
On a Left Join all rows on the left table are retained, including ones with no match, but unmatched right rows are dropped:
SensorID | Place | Owner | Parameter | Value |
---|---|---|---|---|
1 | LHR | BAA | Temp. | 5℃ |
1 | LHR | BAA | Hum. | 15% |
2 | LGA | GIP | NaN | NaN |
3 | STA | MAG | Temp. | 7℃ |
4 | LUT | Luton Borough | NaN | NULL |
5 | SEN | Stobart | NaN | NaN |
Pandas has two additional join-like functions:
df3 = pd.DataFrame.from_dict({
'SensorID': [2,3,8,9,10],
'Place': ['STA','LUT','BHX','MAN','INV'],
'Owner': ['BAA','Luton LA','???','???','???']
})
pd.concat([df1, df3], ignore_index=True)
Outputs:
SensorID | Place | Owner | |
---|---|---|---|
0 | 1 | LHR | BAA |
\(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
4 | 5 | SEN | Stobart |
5 | 2 | STA | BAA |
6 | 3 | LUT | GIP |
7 | 8 | BHX | ??? |
8 | 9 | MAN | ??? |
9 | 10 | INV | ??? |
to_append = [
{'SensorID': 0, 'Parameter': 5, 'Humidity', 'Value': 0.45},
{'SensorID': 1, 'Parameter': 5, 'Humidity', 'Value': 0.31},
{'SensorID': 2, 'Parameter': 4, 'Temperature', 'Value': 2},
{'SensorID': 3, 'Parameter': 3, 'Temperature', 'Value': 3}]
df2.append(to_append)
Outputs:
SensorID | Parameter | Value | |
---|---|---|---|
0 | 1 | Temperature | 5.00 |
\(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
4 | 6 | Humidity | 0.18 |
0 | 5 | Humidity | 0.45 |
1 | 5 | Humidity | 0.31 |
2 | 4 | Temperature | 2.00 |
3 | 3 | Temperature | 3.00 |
As usual, Stack Overflow to the rescue:
A very high level difference is that
merge()
is used to combine two (or more) dataframes on the basis of values of common columns (indices can also be used, useleft_index=True
and/orright_index=True
), andconcat()
is used to append one (or more) dataframes one below the other (or sideways, depending on whether the axis option is set to 0 or 1).
join()
is used to merge 2 dataframes on the basis of the index; instead of usingmerge()
with the optionleft_index=True
we can usejoin()
.
Hint: axis=0
refers to the row index & axis=1
to the column index.
These achieve the same thing, but they are not always equivalent:
Generally:
Linking Data • Jon Reades