Combining Datasets#

Download Rmd Version#

If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.

Exercises incomplete: Download Combining_datasets.Rmd Exercises complete: Download Combining_datasets_complete.Rmd

Learning Objectives#

  • Understand the purpose and methods of joining dataset in R using the dplyr package

  • Perform left joins to combine datasets based on common columns

  • Handle and resolve issues that arise from repeated keys during joins

  • Recognize and avoid pitfalls related to joining on columns with double data types

Joining datasets#

In your data wrangling journey, you will often find yourself wanting to combine one dataframe with some kind of supplementary or partner dataframe. In our case, we have the penguins and weather data stored separately, but if we ever wanted to explore any relationships between them, we’d ideally want them in a single dataframe. This requires lining up the observations and variables in the datasets appropriately, something which is accomplished by performing appropriate joins.

The key to joining is to identify the variables by which you want to join the data. That is, we want to ask the question: which columns in each data are the ones that link them together? In some cases these may be one-to-one matches (e.g. ID numbers to IDs numbers), or in other cases there is data at different levels that need to be lined up.

Left joins#

There are several kinds of join function in dplyr, but we’ll just focus on left_join and leave you to explore the others for yourself.

Like all the join functions, left_join takes three arguments: the two dataframes you’d like to join, and the name of the column (or columns) by which to join.

dplyr::left_join(data_left, data_right, by = <cols_to_join_on>)

The way left_join works is to match up the columns given in the by column and create a new dataframe by pasting new columns from data_right alongside the columns from data_left. As many rows from data_right are brought in as possible. The ‘left’ in left_join indicates that we’re keeping everything from the ‘left’ dataframe (i.e. the first one) and joining the other dataframe onto the ‘left’ one.

It’s helpful to use small, toy dataframes to explore how joins work (or remind yourself after a period of time away).

%%R
df1 <- data.frame(colA = c(1, 2, 3, 4),
                  colB = c(2021, 2022, 2023, 2024),
                  colC = c('a', 'b', 'c', 'd'))

df2 <- data.frame(colA = c(1, 3),
                  colD = c('foo', 'bar'),
                  colE = c('dog', 'cat'))
%%R
df1
  colA colB colC
1    1 2021    a
2    2 2022    b
3    3 2023    c
4    4 2024    d
%%R
df2
  colA colD colE
1    1  foo  dog
2    3  bar  cat

Now let’s join df2 onto df1 by the column colA:

%%R
df1 |>
  dplyr::left_join(df2, by = "colA")
  colA colB colC colD colE
1    1 2021    a  foo  dog
2    2 2022    b <NA> <NA>
3    3 2023    c  bar  cat
4    4 2024    d <NA> <NA>

Notice:

  • All the data from the left-hand dataframe, df1, is kept (columns colAcolC).

  • The data from the right-hand dataframe df2 has been brought over for the rows where the ‘by’ column colA has matching values in the left-hand dataframe df1.

  • The rows where there aren’t matching values in the ‘by’ column colA have missing values for the other right-hand dataframe columns (colD and colE in this case).

Exercise: left join practice#

Suppose now the right hand dataframe is as follows:

%%R
df2 <- data.frame(colA = c(1, 3, 2, 4, 5),
                  colD = c('foo', 'bar', 'baz', 'qux', 'foo'),
                  colE = c('dog', 'cat', 'mouse', 'rabbit', 'horse'))

Can you guess what the output of dplyr::left_join(df1, df2, by = "colA") will be? Check your guess with code below.


Now what if you swap the order of `df1` and `df2`? Guess the output of
`dplyr::left_join(df2, df1, by = "colA")` and check your answer below.
%%R
df2 |>
  dplyr::left_join(df1, by = "colA")
  colA colD   colE colB colC
1    1  foo    dog 2021    a
2    3  bar    cat 2023    c
3    2  baz  mouse 2022    b
4    4  qux rabbit 2024    d
5    5  foo  horse   NA <NA>

Optional exercise: repeated keys#

The values in the ‘by’ column(s) are sometimes called keys for the join. The rules we described in the above example are not the whole story and can be complicated by the presence of repeated keys.

What do you think happens if there are repeated keys? For example, try to guess the output of the following code:

%%R
df1 <- data.frame(colA = c(1, 2, 3, 3),  # repeated key 3
                  colB = c(2021, 2022, 2023, 2024),
                  colC = c('a', 'b', 'c', 'd'))

df2 <- data.frame(colA = c(1, 3),
                  colD = c('foo', 'bar'),
                  colE = c('dog', 'cat'))

Now what if we swap the roles of `df1` and `df2`?
%%R
df2 |>
  dplyr::left_join(df1, by = "colA")
  colA colD colE colB colC
1    1  foo  dog 2021    a
2    3  bar  cat 2023    c
3    3  bar  cat 2024    d

Warning about doubles#

It’s generally not a good idea to join on a ‘by’ column (or columns) that are of type double. This is because matching in the join will be done by an exact equality test on the doubles, which can create strange results due to numerical imprecision and be difficult to reproduce. Example: in the following, two doubles that look distinct are considered equal to 1 / 3 = 0.333..., so ‘match’ on the 1 / 3 entry in data frame x.

%%R
x <- data.frame(colA = c(1 / 3, 2 / 3),
                colB = c(1000, 2000))

y <- data.frame(colA = c(0.33333333333333331, 0.33333333333333334),
                colB = c("foo", "bar"))

dplyr::left_join(x, y, by = "colA")
       colA colB.x colB.y
1 0.3333333   1000    bar
2 0.6666667   2000   <NA>

Exercise: joining penguin and weather data#

Recall the penguin data from the palmerpenguins package. We’re going to join this with some annual weather data, taken from the Palmer Station in Antarctica from 1989 - 2019 [LTE23].

%%R
library(palmerpenguins)  # loads `penguins` data
weather <- readr::read_csv("./data/weather_annual.csv")

penguins
weather
Rows: 35 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (4): Year, Average_temp, Max_temp, Min_temp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 35 × 4
    Year Average_temp Max_temp Min_temp
   <dbl>        <dbl>    <dbl>    <dbl>
 1  1989      NA           7.8    -11.1
 2  1990      NA           8.7    -20.6
 3  1991      NA           9.8    -18  
 4  1992      NA          10.3    -24.7
 5  1993      NA          10.1    -17.6
 6  1994      -7.2         9      -25.6
 7  1995       0.0613      9      -26  
 8  1996      -0.642       9.6    -22.7
 9  1997      -2.43        9.5    -25.2
10  1998      -0.989       8.5    -21.7
# ℹ 25 more rows
# ℹ Use `print(n = ...)` to see more rows

Use a join to create a single dataframe that has all the penguin data and weather data combined. Hint: there are some subtleties to be aware of:

  • First think about which column(s) to join on – a call to dplyr::rename might be in order!

  • We should make sure that the ‘by’ column(s) to join on are of the same type. Examine the kind of data in each dataframe closely and coerce if necessary!

%%R
weather_cleaned <- weather |>
  dplyr::rename(year = Year) |>
  dplyr::mutate(year = as.integer(year))

Acknowledgement#

The material in this notebook is adapted from Eliza Wood’s Tidyverse: Data wrangling & visualization course, which is licensed under Creative Commons BY-NC-SA 4.0. This in itself is based on material from UC Davis’s R-DAVIS course, which draws heavily on Carpentries R lessons.