Combining Datasets#

Download Rmd Version#

If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.

Exercises incomplete: Download Combining_datasets.Rmd Exercises complete: Download Combining_datasets_complete.Rmd

Learning Objectives#

  • Understand the purpose and methods of joining dataset in R using the dplyr package

  • Perform left joins to combine datasets based on common columns

  • Handle and resolve issues that arise from repeated keys during joins

  • Recognize and avoid pitfalls related to joining on columns with double data types

Joining datasets#

In your data wrangling journey, you will often find yourself wanting to combine one dataframe with some kind of supplementary or partner dataframe. In our case, we have the penguins and weather data stored separately, but if we ever wanted to explore any relationships between them, we’d ideally want them in a single dataframe. This requires lining up the observations and variables in the datasets appropriately, something which is accomplished by performing appropriate joins.

The key to joining is to identify the variables by which you want to join the data. That is, we want to ask the question: which columns in each data are the ones that link them together? In some cases these may be one-to-one matches (e.g. ID numbers to IDs numbers), or in other cases there is data at different levels that need to be lined up.

Left joins#

There are several kinds of join function in dplyr, but we’ll just focus on left_join and leave you to explore the others for yourself.

Like all the join functions, left_join takes three arguments: the two dataframes you’d like to join, and the name of the column (or columns) by which to join.

dplyr::left_join(data_left, data_right, by = <cols_to_join_on>)

The way left_join works is to match up the columns given in the by column and create a new dataframe by pasting new columns from data_right alongside the columns from data_left. As many rows from data_right are brought in as possible. The ‘left’ in left_join indicates that we’re keeping everything from the ‘left’ dataframe (i.e. the first one) and joining the other dataframe onto the ‘left’ one.

It’s helpful to use small, toy dataframes to explore how joins work (or remind yourself after a period of time away).

df1 <- data.frame(colA = c(1, 2, 3, 4),
                  colB = c(2021, 2022, 2023, 2024),
                  colC = c('a', 'b', 'c', 'd'))

df2 <- data.frame(colA = c(1, 3),
                  colD = c('foo', 'bar'),
                  colE = c('dog', 'cat'))
df1
A data.frame: 4 × 3
colAcolBcolC
<dbl><dbl><chr>
12021a
22022b
32023c
42024d
df2
A data.frame: 2 × 3
colAcolDcolE
<dbl><chr><chr>
1foodog
3barcat

Now let’s join df2 onto df1 by the column colA:

df1 |>
  dplyr::left_join(df2, by = "colA")
A data.frame: 4 × 5
colAcolBcolCcolDcolE
<dbl><dbl><chr><chr><chr>
12021afoodog
22022bNA NA
32023cbarcat
42024dNA NA

Notice:

  • All the data from the left-hand dataframe, df1, is kept (columns colAcolC).

  • The data from the right-hand dataframe df2 has been brought over for the rows where the ‘by’ column colA has matching values in the left-hand dataframe df1.

  • The rows where there aren’t matching values in the ‘by’ column colA have missing values for the other right-hand dataframe columns (colD and colE in this case).

Exercise: left join practice#

Suppose now the right hand dataframe is as follows:

df2 <- data.frame(colA = c(1, 3, 2, 4, 5),
                  colD = c('foo', 'bar', 'baz', 'qux', 'foo'),
                  colE = c('dog', 'cat', 'mouse', 'rabbit', 'horse'))

Can you guess what the output of dplyr::left_join(df1, df2, by = "colA") will be? Check your guess with code below.


Now what if you swap the order of `df1` and `df2`? Guess the output of
`dplyr::left_join(df2, df1, by = "colA")` and check your answer below.
df2 |>
  dplyr::left_join(df1, by = "colA")
A data.frame: 5 × 5
colAcolDcolEcolBcolC
<dbl><chr><chr><dbl><chr>
1foodog 2021a
3barcat 2023c
2bazmouse 2022b
4quxrabbit2024d
5foohorse NANA

Optional exercise: repeated keys#

The values in the ‘by’ column(s) are sometimes called keys for the join. The rules we described in the above example are not the whole story and can be complicated by the presence of repeated keys.

What do you think happens if there are repeated keys? For example, try to guess the output of the following code:

df1 <- data.frame(colA = c(1, 2, 3, 3),  # repeated key 3
                  colB = c(2021, 2022, 2023, 2024),
                  colC = c('a', 'b', 'c', 'd'))

df2 <- data.frame(colA = c(1, 3),
                  colD = c('foo', 'bar'),
                  colE = c('dog', 'cat'))

Now what if we swap the roles of `df1` and `df2`?
df2 |>
  dplyr::left_join(df1, by = "colA")
A data.frame: 3 × 5
colAcolDcolEcolBcolC
<dbl><chr><chr><dbl><chr>
1foodog2021a
3barcat2023c
3barcat2024d

Warning about doubles#

It’s generally not a good idea to join on a ‘by’ column (or columns) that are of type double. This is because matching in the join will be done by an exact equality test on the doubles, which can create strange results due to numerical imprecision and be difficult to reproduce. Example: in the following, two doubles that look distinct are considered equal to 1 / 3 = 0.333..., so ‘match’ on the 1 / 3 entry in data frame x.

x <- data.frame(colA = c(1 / 3, 2 / 3),
                colB = c(1000, 2000))

y <- data.frame(colA = c(0.33333333333333331, 0.33333333333333334),
                colB = c("foo", "bar"))

dplyr::left_join(x, y, by = "colA")
A data.frame: 2 × 3
colAcolB.xcolB.y
<dbl><dbl><chr>
0.33333331000bar
0.66666672000NA

Exercise: joining penguin and weather data#

Recall the penguin data from the palmerpenguins package. We’re going to join this with some annual weather data, taken from the Palmer Station in Antarctica from 1989 - 2019 [LTE23].

library(palmerpenguins)  # loads `penguins` data
weather <- readr::read_csv("./data/weather_annual.csv")

penguins
weather
Rows: 35 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (4): Year, Average_temp, Max_temp, Min_temp

 Use `spec()` to retrieve the full column specification for this data.
 Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 344 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
AdelieTorgersen39.118.71813750male 2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen NA NA NA NANA 2007
AdelieTorgersen36.719.31933450female2007
AdelieTorgersen39.320.61903650male 2007
AdelieTorgersen38.917.81813625female2007
AdelieTorgersen39.219.61954675male 2007
AdelieTorgersen34.118.11933475NA 2007
AdelieTorgersen42.020.21904250NA 2007
AdelieTorgersen37.817.11863300NA 2007
AdelieTorgersen37.817.31803700NA 2007
AdelieTorgersen41.117.61823200female2007
AdelieTorgersen38.621.21913800male 2007
AdelieTorgersen34.621.11984400male 2007
AdelieTorgersen36.617.81853700female2007
AdelieTorgersen38.719.01953450female2007
AdelieTorgersen42.520.71974500male 2007
AdelieTorgersen34.418.41843325female2007
AdelieTorgersen46.021.51944200male 2007
AdelieBiscoe 37.818.31743400female2007
AdelieBiscoe 37.718.71803600male 2007
AdelieBiscoe 35.919.21893800female2007
AdelieBiscoe 38.218.11853950male 2007
AdelieBiscoe 38.817.21803800male 2007
AdelieBiscoe 35.318.91873800female2007
AdelieBiscoe 40.618.61833550male 2007
AdelieBiscoe 40.517.91873200female2007
AdelieBiscoe 37.918.61723150female2007
AdelieBiscoe 40.518.91803950male 2007
ChinstrapDream46.916.61922700female2008
ChinstrapDream53.519.92054500male 2008
ChinstrapDream49.019.52103950male 2008
ChinstrapDream46.217.51873650female2008
ChinstrapDream50.919.11963550male 2008
ChinstrapDream45.517.01963500female2008
ChinstrapDream50.917.91963675female2009
ChinstrapDream50.818.52014450male 2009
ChinstrapDream50.117.91903400female2009
ChinstrapDream49.019.62124300male 2009
ChinstrapDream51.518.71873250male 2009
ChinstrapDream49.817.31983675female2009
ChinstrapDream48.116.41993325female2009
ChinstrapDream51.419.02013950male 2009
ChinstrapDream45.717.31933600female2009
ChinstrapDream50.719.72034050male 2009
ChinstrapDream42.517.31873350female2009
ChinstrapDream52.218.81973450male 2009
ChinstrapDream45.216.61913250female2009
ChinstrapDream49.319.92034050male 2009
ChinstrapDream50.218.82023800male 2009
ChinstrapDream45.619.41943525female2009
ChinstrapDream51.919.52063950male 2009
ChinstrapDream46.816.51893650female2009
ChinstrapDream45.717.01953650female2009
ChinstrapDream55.819.82074000male 2009
ChinstrapDream43.518.12023400female2009
ChinstrapDream49.618.21933775male 2009
ChinstrapDream50.819.02104100male 2009
ChinstrapDream50.218.71983775female2009
A spec_tbl_df: 35 × 4
YearAverage_tempMax_tempMin_temp
<dbl><dbl><dbl><dbl>
1989 NA 7.8-11.1
1990 NA 8.7-20.6
1991 NA 9.8-18.0
1992 NA10.3-24.7
1993 NA10.1-17.6
1994-7.20000000 9.0-25.6
1995 0.06129032 9.0-26.0
1996-0.64180328 9.6-22.7
1997-2.43150685 9.5-25.2
1998-0.98931507 8.5-21.7
1999-1.49753425 9.2-23.4
2000-1.3680327910.8-19.6
2001-1.58657534 9.5-17.6
2002-2.62191781 8.5-21.6
2003-1.18931507 9.4-19.7
2004-1.62868852 9.8-22.9
2005-1.79479452 9.8-19.9
2006-0.9971988810.6-15.2
2007-1.62876712 9.9-16.3
2008-0.79207650 6.8-18.9
2009-1.78862275 7.7-14.1
2010-0.7208219211.6-17.7
2011-1.88767123 8.1-19.1
2012-1.42841530 8.9-15.9
2013-1.94904110 8.7-24.7
2014-1.57616438 8.5-16.3
2015-3.31808219 8.9-23.3
2016-1.86393443 9.4-18.7
2017-1.52170330 9.3-20.9
2018-1.65808219 9.7-21.9
2019-1.81095890 9.0-21.4
2020-1.1293956010.6-15.3
2021-0.9621917810.9-18.3
2022-0.2071232911.6 -9.0
2023 0.43259669 8.7 -9.2

Use a join to create a single dataframe that has all the penguin data and weather data combined. Hint: there are some subtleties to be aware of:

  • First think about which column(s) to join on – a call to dplyr::rename might be in order!

  • We should make sure that the ‘by’ column(s) to join on are of the same type. Examine the kind of data in each dataframe closely and coerce if necessary!

weather_cleaned <- weather |>
  dplyr::rename(year = Year) |>
  dplyr::mutate(year = as.integer(year))

Acknowledgement#

The material in this notebook is adapted from Eliza Wood’s Tidyverse: Data wrangling & visualization course, which is licensed under Creative Commons BY-NC-SA 4.0. This in itself is based on material from UC Davis’s R-DAVIS course, which draws heavily on Carpentries R lessons.

What is the primary purpose of joining datasets?

In a left join, what happens to the rows in the left dataframe that do not have matching values in the right dataframe?

What is a potential issue with joining datasets on columns of type double?