Combining Datasets#
Download Rmd Version#
If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.
Exercises incomplete: Download Combining_datasets.Rmd Exercises complete: Download Combining_datasets_complete.Rmd
Learning Objectives#
Understand the purpose and methods of joining dataset in R using the
dplyr
packagePerform left joins to combine datasets based on common columns
Handle and resolve issues that arise from repeated keys during joins
Recognize and avoid pitfalls related to joining on columns with double data types
Joining datasets#
In your data wrangling journey, you will often find yourself wanting to combine one dataframe with some kind of supplementary or partner dataframe. In our case, we have the penguins and weather data stored separately, but if we ever wanted to explore any relationships between them, we’d ideally want them in a single dataframe. This requires lining up the observations and variables in the datasets appropriately, something which is accomplished by performing appropriate joins.
The key to joining is to identify the variables by which you want to join the data. That is, we want to ask the question: which columns in each data are the ones that link them together? In some cases these may be one-to-one matches (e.g. ID numbers to IDs numbers), or in other cases there is data at different levels that need to be lined up.
Left joins#
There are several kinds of join function in dplyr
, but we’ll just focus on
left_join
and leave you to explore the others for yourself.
Like all the join functions, left_join
takes three arguments: the two dataframes
you’d like to join, and the name of the column (or columns) by which to join.
dplyr::left_join(data_left, data_right, by = <cols_to_join_on>)
The way left_join
works is to match up the columns given in the by
column
and create a new dataframe by pasting new columns from data_right
alongside
the columns from data_left
. As many rows from data_right
are brought in as
possible. The ‘left’ in left_join
indicates that we’re keeping everything from
the ‘left’ dataframe (i.e. the first one) and joining the other dataframe onto
the ‘left’ one.
It’s helpful to use small, toy dataframes to explore how joins work (or remind yourself after a period of time away).
%%R
df1 <- data.frame(colA = c(1, 2, 3, 4),
colB = c(2021, 2022, 2023, 2024),
colC = c('a', 'b', 'c', 'd'))
df2 <- data.frame(colA = c(1, 3),
colD = c('foo', 'bar'),
colE = c('dog', 'cat'))
%%R
df1
colA colB colC
1 1 2021 a
2 2 2022 b
3 3 2023 c
4 4 2024 d
%%R
df2
colA colD colE
1 1 foo dog
2 3 bar cat
Now let’s join df2
onto df1
by the column colA
:
%%R
df1 |>
dplyr::left_join(df2, by = "colA")
colA colB colC colD colE
1 1 2021 a foo dog
2 2 2022 b <NA> <NA>
3 3 2023 c bar cat
4 4 2024 d <NA> <NA>
Notice:
All the data from the left-hand dataframe,
df1
, is kept (columnscolA
–colC
).The data from the right-hand dataframe
df2
has been brought over for the rows where the ‘by’ columncolA
has matching values in the left-hand dataframedf1
.The rows where there aren’t matching values in the ‘by’ column
colA
have missing values for the other right-hand dataframe columns (colD
andcolE
in this case).
Exercise: left join practice#
Suppose now the right hand dataframe is as follows:
%%R
df2 <- data.frame(colA = c(1, 3, 2, 4, 5),
colD = c('foo', 'bar', 'baz', 'qux', 'foo'),
colE = c('dog', 'cat', 'mouse', 'rabbit', 'horse'))
Can you guess what the output of dplyr::left_join(df1, df2, by = "colA")
will
be? Check your guess with code below.
Solution
%%R
df1 |>
dplyr::left_join(df2, by = "colA")
Now what if you swap the order of `df1` and `df2`? Guess the output of
`dplyr::left_join(df2, df1, by = "colA")` and check your answer below.
%%R
df2 |>
dplyr::left_join(df1, by = "colA")
colA colD colE colB colC
1 1 foo dog 2021 a
2 3 bar cat 2023 c
3 2 baz mouse 2022 b
4 4 qux rabbit 2024 d
5 5 foo horse NA <NA>
Optional exercise: repeated keys#
The values in the ‘by’ column(s) are sometimes called keys for the join. The rules we described in the above example are not the whole story and can be complicated by the presence of repeated keys.
What do you think happens if there are repeated keys? For example, try to guess the output of the following code:
%%R
df1 <- data.frame(colA = c(1, 2, 3, 3), # repeated key 3
colB = c(2021, 2022, 2023, 2024),
colC = c('a', 'b', 'c', 'd'))
df2 <- data.frame(colA = c(1, 3),
colD = c('foo', 'bar'),
colE = c('dog', 'cat'))
Solution
%%R
df1 |>
dplyr::left_join(df2, by = "colA")
Now what if we swap the roles of `df1` and `df2`?
%%R
df2 |>
dplyr::left_join(df1, by = "colA")
colA colD colE colB colC
1 1 foo dog 2021 a
2 3 bar cat 2023 c
3 3 bar cat 2024 d
Warning about doubles#
It’s generally not a good idea to join on a ‘by’ column (or columns)
that are of type double. This is because matching in the join will be done
by an exact equality test on the doubles, which can create strange
results due to numerical imprecision and be difficult to reproduce. Example: in
the following, two doubles that look distinct are considered equal to
1 / 3 = 0.333...
, so ‘match’ on the 1 / 3
entry in data frame x
.
%%R
x <- data.frame(colA = c(1 / 3, 2 / 3),
colB = c(1000, 2000))
y <- data.frame(colA = c(0.33333333333333331, 0.33333333333333334),
colB = c("foo", "bar"))
dplyr::left_join(x, y, by = "colA")
colA colB.x colB.y
1 0.3333333 1000 bar
2 0.6666667 2000 <NA>
Exercise: joining penguin and weather data#
Recall the penguin data from the palmerpenguins
package. We’re going to join
this with some annual weather data, taken from the Palmer Station in Antarctica
from 1989 - 2019 [LTE23].
%%R
library(palmerpenguins) # loads `penguins` data
weather <- readr::read_csv("./data/weather_annual.csv")
penguins
weather
Rows: 35 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (4): Year, Average_temp, Max_temp, Min_temp
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 35 × 4
Year Average_temp Max_temp Min_temp
<dbl> <dbl> <dbl> <dbl>
1 1989 NA 7.8 -11.1
2 1990 NA 8.7 -20.6
3 1991 NA 9.8 -18
4 1992 NA 10.3 -24.7
5 1993 NA 10.1 -17.6
6 1994 -7.2 9 -25.6
7 1995 0.0613 9 -26
8 1996 -0.642 9.6 -22.7
9 1997 -2.43 9.5 -25.2
10 1998 -0.989 8.5 -21.7
# ℹ 25 more rows
# ℹ Use `print(n = ...)` to see more rows
Use a join to create a single dataframe that has all the penguin data and weather data combined. Hint: there are some subtleties to be aware of:
First think about which column(s) to join on – a call to
dplyr::rename
might be in order!We should make sure that the ‘by’ column(s) to join on are of the same type. Examine the kind of data in each dataframe closely and coerce if necessary!
%%R
weather_cleaned <- weather |>
dplyr::rename(year = Year) |>
dplyr::mutate(year = as.integer(year))
Solution
penguins |>
dplyr::left_join(weather_cleaned, by = "year")
Acknowledgement#
The material in this notebook is adapted from Eliza Wood’s Tidyverse: Data wrangling & visualization course, which is licensed under Creative Commons BY-NC-SA 4.0. This in itself is based on material from UC Davis’s R-DAVIS course, which draws heavily on Carpentries R lessons.