Combining Datasets

Download Rmd Version#

If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.

Exercises incomplete: Download Combining_datasets.Rmd
Exercises complete: Download Combining_datasets_complete.Rmd

Learning Objectives#

Understand the purpose and methods of joining datasets in R using the dplyr package
Perform left joins to combine datasets based on common columns
Handle and resolve issues that arise from repeated keys during joins
Recognize and avoid pitfalls related to joining on columns with double data types

Joining datasets#

In your data wrangling journey, you will often find yourself wanting to combine one dataframe with some kind of supplementary or partner dataframe. In our case, we have the penguins and weather data stored separately, but if we ever wanted to explore any relationships between them, we’d ideally want them in a single dataframe. This requires lining up the observations and variables in the datasets appropriately, something which is accomplished by performing appropriate joins.

The key to joining is to identify the variables by which you want to join the data. That is, we want to ask the question: which columns in each dataset are the ones that link them together? In some cases these may be one-to-one matches (e.g. ID numbers to ID numbers), while in other cases the data sit at different levels (e.g. individual observations against yearly summaries) that need to be lined up.
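
One quick way to look for candidate key columns is to compare the column names of the two dataframes. The sketch below uses two made-up dataframes (birds and sites are hypothetical, not part of the course data) together with base R's intersect:

```r
# Two hypothetical dataframes, invented purely for illustration
birds <- data.frame(site_id = c(1, 2, 2),
                    species = c("Adelie", "Gentoo", "Chinstrap"))
sites <- data.frame(site_id = c(1, 2),
                    island  = c("Torgersen", "Biscoe"))

# Column names that appear in both dataframes are candidate keys
shared_cols <- intersect(names(birds), names(sites))
shared_cols
```

Shared names are only a starting point: two columns can hold the same information under different names, so it is still worth inspecting the data itself.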

Left joins#

There are several kinds of join function in dplyr, but we’ll just focus on left_join and leave you to explore the others for yourself.

Like all the join functions, left_join takes three arguments: the two dataframes you’d like to join, and the name of the column (or columns) by which to join.

dplyr::left_join(data_left, data_right, by = <cols_to_join_on>)

The way left_join works is to match up the values in the column (or columns) given in the by argument and create a new dataframe by pasting the columns from data_right alongside the columns from data_left. Every row of data_left is kept; values from data_right are filled in wherever a match is found, and rows of data_right with no match in data_left are dropped. The ‘left’ in left_join indicates that we’re keeping everything from the ‘left’ dataframe (i.e. the first one) and joining the other dataframe onto the ‘left’ one.
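
When a single column is not enough to identify a row, the by argument can be given a character vector of column names. The following is a minimal sketch with invented dataframes (counts and temps are hypothetical names), joining on the combination of site and year:

```r
# Hypothetical data: bird counts per site and year
counts <- data.frame(site    = c("A", "A", "B"),
                     year    = c(2021L, 2022L, 2021L),
                     n_birds = c(10, 12, 7))

# Hypothetical data: mean temperature per site and year (no entry for A in 2022)
temps <- data.frame(site      = c("A", "B"),
                    year      = c(2021L, 2021L),
                    mean_temp = c(-1.5, -2.0))

# A row matches only when both site and year agree
joined <- dplyr::left_join(counts, temps, by = c("site", "year"))
joined
```

Here the row for site A in 2022 has no partner in temps, so its mean_temp is NA.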

It’s helpful to use small, toy dataframes to explore how joins work (or remind yourself after a period of time away).

df1 <- data.frame(colA = c(1, 2, 3, 4),
                  colB = c(2021, 2022, 2023, 2024),
                  colC = c('a', 'b', 'c', 'd'))

df2 <- data.frame(colA = c(1, 3),
                  colD = c('foo', 'bar'),
                  colE = c('dog', 'cat'))

df1

A data.frame: 4 × 3
colA	colB	colC
<dbl>	<dbl>	<chr>
1	2021	a
2	2022	b
3	2023	c
4	2024	d

df2

A data.frame: 2 × 3
colA	colD	colE
<dbl>	<chr>	<chr>
1	foo	dog
3	bar	cat

Now let’s join df2 onto df1 by the column colA:

df1 |>
  dplyr::left_join(df2, by = "colA")

A data.frame: 4 × 5
colA	colB	colC	colD	colE
<dbl>	<dbl>	<chr>	<chr>	<chr>
1	2021	a	foo	dog
2	2022	b	NA	NA
3	2023	c	bar	cat
4	2024	d	NA	NA

Notice:

All the data from the left-hand dataframe, df1, is kept (columns colA – colC).
The data from the right-hand dataframe df2 has been brought over for the rows where the ‘by’ column colA has matching values in the left-hand dataframe df1.
The rows where there aren’t matching values in the ‘by’ column colA have missing values for the other right-hand dataframe columns (colD and colE in this case).
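
To see which left-hand rows went unmatched without scanning for NA values by eye, one option is dplyr::anti_join, which returns the rows of its first argument that have no match in the second. The sketch below rebuilds small dataframes of the same shape as the example above under new, hypothetical names (left and right) so that it runs on its own:

```r
# Small stand-ins for the toy dataframes (names chosen for this sketch)
left  <- data.frame(colA = c(1, 2, 3, 4),
                    colC = c("a", "b", "c", "d"))
right <- data.frame(colA = c(1, 3),
                    colD = c("foo", "bar"))

joined <- dplyr::left_join(left, right, by = "colA")

# With no repeated keys on the right, the row count of the left side is preserved
nrow(joined) == nrow(left)

# Rows of `left` whose key has no match in `right`
unmatched <- dplyr::anti_join(left, right, by = "colA")
unmatched
```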

Exercise: left join practice#

Suppose now the right-hand dataframe is as follows:

df2 <- data.frame(colA = c(1, 3, 2, 4, 5),
                  colD = c('foo', 'bar', 'baz', 'qux', 'foo'),
                  colE = c('dog', 'cat', 'mouse', 'rabbit', 'horse'))

Can you guess what the output of dplyr::left_join(df1, df2, by = "colA") will be? Check your guess with the code below.

Solution

df1 |>
  dplyr::left_join(df2, by = "colA")

Now what if you swap the order of df1 and df2? Guess the output of dplyr::left_join(df2, df1, by = "colA") and check your answer below.

df2 |>
  dplyr::left_join(df1, by = "colA")

A data.frame: 5 × 5
colA	colD	colE	colB	colC
<dbl>	<chr>	<chr>	<dbl>	<chr>
1	foo	dog	2021	a
3	bar	cat	2023	c
2	baz	mouse	2022	b
4	qux	rabbit	2024	d
5	foo	horse	NA	NA

Optional exercise: repeated keys#

The values in the ‘by’ column(s) are sometimes called keys for the join. The rules we described in the above example are not the whole story and can be complicated by the presence of repeated keys.

What do you think happens if there are repeated keys? For example, try to guess the output of the following code:

df1 <- data.frame(colA = c(1, 2, 3, 3),  # repeated key 3
                  colB = c(2021, 2022, 2023, 2024),
                  colC = c('a', 'b', 'c', 'd'))

df2 <- data.frame(colA = c(1, 3),
                  colD = c('foo', 'bar'),
                  colE = c('dog', 'cat'))

Solution

df1 |>
  dplyr::left_join(df2, by = "colA")


Now what if we swap the roles of df1 and df2?

df2 |>
  dplyr::left_join(df1, by = "colA")

A data.frame: 3 × 5
colA	colD	colE	colB	colC
<dbl>	<chr>	<chr>	<dbl>	<chr>
1	foo	dog	2021	a
3	bar	cat	2023	c
3	bar	cat	2024	d
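
Because repeated keys change the number of rows in the result, it can help to check for them before joining. A minimal sketch, assuming a dataframe shaped like the one in this exercise (rebuilt here under the hypothetical name left):

```r
# A dataframe with a repeated key, as in the exercise above
left <- data.frame(colA = c(1, 2, 3, 3),
                   colB = c(2021, 2022, 2023, 2024))

# Count how often each key value occurs and keep those that occur more than once
repeated <- left |>
  dplyr::count(colA) |>
  dplyr::filter(n > 1)
repeated
```

An empty result means every key is unique. If your installed dplyr is version 1.1.0 or later, the join functions also accept a relationship argument (for example relationship = "many-to-one") that raises an error when the keys do not follow the stated pattern.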

Warning about doubles#

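It’s generally not a good idea to join on a ‘by’ column (or columns) of type double. Matching in the join is done by an exact equality test on the doubles, which can give surprising results due to numerical imprecision and can be difficult to reproduce. Example: in the following, the two values in y are written differently from each other, and neither is written as 1 / 3, yet in the output shown the second one (‘bar’) ends up exactly equal to 1 / 3 = 0.333... and so matches the 1 / 3 entry in data frame x. Because both dataframes also contain a non-key column called colB, the result distinguishes the two with the suffixes .x and .y.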

x <- data.frame(colA = c(1 / 3, 2 / 3),
                colB = c(1000, 2000))

y <- data.frame(colA = c(0.33333333333333331, 0.33333333333333334),
                colB = c("foo", "bar"))

dplyr::left_join(x, y, by = "colA")

A data.frame: 2 × 3
colA	colB.x	colB.y
<dbl>	<dbl>	<chr>
0.3333333	1000	bar
0.6666667	2000	NA
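
The underlying issue is easy to reproduce outside of a join. The sketch below shows two doubles that print identically but fail an exact equality test, followed by one possible workaround: joining on an integer key derived from the values. The column name tenths and the rounding rule are assumptions made for this illustration, not a general recipe.

```r
# Two values that look the same but are not exactly equal as doubles
a <- 0.1 + 0.2
b <- 0.3

a == b                    # FALSE: exact comparison, the kind a join relies on
isTRUE(all.equal(a, b))   # TRUE: comparison within a small tolerance

# Hypothetical workaround: derive an integer key (here, the value in tenths)
x <- data.frame(tenths = as.integer(round(c(a, 0.5) * 10)),
                val_x  = c("p", "q"))
y <- data.frame(tenths = as.integer(round(c(b, 0.7) * 10)),
                val_y  = c("foo", "bar"))

# Integer keys are compared exactly without any rounding surprises
joined <- dplyr::left_join(x, y, by = "tenths")
joined
```

Integer and character keys avoid the problem entirely, which is why identifiers and years are best stored in one of those types before joining.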

Exercise: joining penguin and weather data#

Recall the penguin data from the palmerpenguins package. We’re going to join this with some annual weather data recorded at Palmer Station in Antarctica [LTE23]; the table printed below covers the years 1989 to 2023.

library(palmerpenguins)  # loads `penguins` data
weather <- readr::read_csv("./data/weather_annual.csv")

penguins
weather

Rows: 35 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (4): Year, Average_temp, Max_temp, Min_temp

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

A tibble: 344 × 8
species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
<fct>	<fct>	<dbl>	<dbl>	<int>	<int>	<fct>	<int>
Adelie	Torgersen	39.1	18.7	181	3750	male	2007
Adelie	Torgersen	39.5	17.4	186	3800	female	2007
Adelie	Torgersen	40.3	18.0	195	3250	female	2007
Adelie	Torgersen	NA	NA	NA	NA	NA	2007
Adelie	Torgersen	36.7	19.3	193	3450	female	2007
Adelie	Torgersen	39.3	20.6	190	3650	male	2007
Adelie	Torgersen	38.9	17.8	181	3625	female	2007
Adelie	Torgersen	39.2	19.6	195	4675	male	2007
Adelie	Torgersen	34.1	18.1	193	3475	NA	2007
Adelie	Torgersen	42.0	20.2	190	4250	NA	2007
Adelie	Torgersen	37.8	17.1	186	3300	NA	2007
Adelie	Torgersen	37.8	17.3	180	3700	NA	2007
Adelie	Torgersen	41.1	17.6	182	3200	female	2007
Adelie	Torgersen	38.6	21.2	191	3800	male	2007
Adelie	Torgersen	34.6	21.1	198	4400	male	2007
Adelie	Torgersen	36.6	17.8	185	3700	female	2007
Adelie	Torgersen	38.7	19.0	195	3450	female	2007
Adelie	Torgersen	42.5	20.7	197	4500	male	2007
Adelie	Torgersen	34.4	18.4	184	3325	female	2007
Adelie	Torgersen	46.0	21.5	194	4200	male	2007
Adelie	Biscoe	37.8	18.3	174	3400	female	2007
Adelie	Biscoe	37.7	18.7	180	3600	male	2007
Adelie	Biscoe	35.9	19.2	189	3800	female	2007
Adelie	Biscoe	38.2	18.1	185	3950	male	2007
Adelie	Biscoe	38.8	17.2	180	3800	male	2007
Adelie	Biscoe	35.3	18.9	187	3800	female	2007
Adelie	Biscoe	40.6	18.6	183	3550	male	2007
Adelie	Biscoe	40.5	17.9	187	3200	female	2007
Adelie	Biscoe	37.9	18.6	172	3150	female	2007
Adelie	Biscoe	40.5	18.9	180	3950	male	2007
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
Chinstrap	Dream	46.9	16.6	192	2700	female	2008
Chinstrap	Dream	53.5	19.9	205	4500	male	2008
Chinstrap	Dream	49.0	19.5	210	3950	male	2008
Chinstrap	Dream	46.2	17.5	187	3650	female	2008
Chinstrap	Dream	50.9	19.1	196	3550	male	2008
Chinstrap	Dream	45.5	17.0	196	3500	female	2008
Chinstrap	Dream	50.9	17.9	196	3675	female	2009
Chinstrap	Dream	50.8	18.5	201	4450	male	2009
Chinstrap	Dream	50.1	17.9	190	3400	female	2009
Chinstrap	Dream	49.0	19.6	212	4300	male	2009
Chinstrap	Dream	51.5	18.7	187	3250	male	2009
Chinstrap	Dream	49.8	17.3	198	3675	female	2009
Chinstrap	Dream	48.1	16.4	199	3325	female	2009
Chinstrap	Dream	51.4	19.0	201	3950	male	2009
Chinstrap	Dream	45.7	17.3	193	3600	female	2009
Chinstrap	Dream	50.7	19.7	203	4050	male	2009
Chinstrap	Dream	42.5	17.3	187	3350	female	2009
Chinstrap	Dream	52.2	18.8	197	3450	male	2009
Chinstrap	Dream	45.2	16.6	191	3250	female	2009
Chinstrap	Dream	49.3	19.9	203	4050	male	2009
Chinstrap	Dream	50.2	18.8	202	3800	male	2009
Chinstrap	Dream	45.6	19.4	194	3525	female	2009
Chinstrap	Dream	51.9	19.5	206	3950	male	2009
Chinstrap	Dream	46.8	16.5	189	3650	female	2009
Chinstrap	Dream	45.7	17.0	195	3650	female	2009
Chinstrap	Dream	55.8	19.8	207	4000	male	2009
Chinstrap	Dream	43.5	18.1	202	3400	female	2009
Chinstrap	Dream	49.6	18.2	193	3775	male	2009
Chinstrap	Dream	50.8	19.0	210	4100	male	2009
Chinstrap	Dream	50.2	18.7	198	3775	female	2009

A spec_tbl_df: 35 × 4
Year	Average_temp	Max_temp	Min_temp
<dbl>	<dbl>	<dbl>	<dbl>
1989	NA	7.8	-11.1
1990	NA	8.7	-20.6
1991	NA	9.8	-18.0
1992	NA	10.3	-24.7
1993	NA	10.1	-17.6
1994	-7.20000000	9.0	-25.6
1995	0.06129032	9.0	-26.0
1996	-0.64180328	9.6	-22.7
1997	-2.43150685	9.5	-25.2
1998	-0.98931507	8.5	-21.7
1999	-1.49753425	9.2	-23.4
2000	-1.36803279	10.8	-19.6
2001	-1.58657534	9.5	-17.6
2002	-2.62191781	8.5	-21.6
2003	-1.18931507	9.4	-19.7
2004	-1.62868852	9.8	-22.9
2005	-1.79479452	9.8	-19.9
2006	-0.99719888	10.6	-15.2
2007	-1.62876712	9.9	-16.3
2008	-0.79207650	6.8	-18.9
2009	-1.78862275	7.7	-14.1
2010	-0.72082192	11.6	-17.7
2011	-1.88767123	8.1	-19.1
2012	-1.42841530	8.9	-15.9
2013	-1.94904110	8.7	-24.7
2014	-1.57616438	8.5	-16.3
2015	-3.31808219	8.9	-23.3
2016	-1.86393443	9.4	-18.7
2017	-1.52170330	9.3	-20.9
2018	-1.65808219	9.7	-21.9
2019	-1.81095890	9.0	-21.4
2020	-1.12939560	10.6	-15.3
2021	-0.96219178	10.9	-18.3
2022	-0.20712329	11.6	-9.0
2023	0.43259669	8.7	-9.2

Use a join to create a single dataframe that has all the penguin data and weather data combined. Hint: there are some subtleties to be aware of:

First think about which column(s) to join on – a call to dplyr::rename might be in order!
We should make sure that the ‘by’ column(s) to join on are of the same type. Examine the kind of data in each dataframe closely and coerce if necessary!

weather_cleaned <- weather |>
  dplyr::rename(year = Year) |>
  dplyr::mutate(year = as.integer(year))

Solution

penguins |>
  dplyr::left_join(weather_cleaned, by = "year")
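
The same steps can be tried on a small scale without the CSV file. In this sketch, obs is an invented stand-in for the penguin data and wx copies two rows of Max_temp from the weather table above; both names are hypothetical. It checks the key types first, then renames, coerces, and joins:

```r
# Toy stand-in for the penguin data (invented rows)
obs <- data.frame(species = c("Adelie", "Gentoo", "Adelie"),
                  year    = c(2007L, 2008L, 2009L))

# Toy stand-in for the weather data (two rows copied from the table above)
wx <- data.frame(Year     = c(2007, 2008),
                 Max_temp = c(9.9, 6.8))

# Compare the types of the two key columns before joining
class(obs$year)   # "integer"
class(wx$Year)    # "numeric", i.e. double

# Give the key the same name and type on both sides, then join
wx_clean <- wx |>
  dplyr::rename(year = Year) |>
  dplyr::mutate(year = as.integer(year))

combined <- dplyr::left_join(obs, wx_clean, by = "year")
combined
```

The 2009 row has no weather entry in this toy table, so its Max_temp is NA. As an alternative to renaming, dplyr also accepts a named vector such as by = c("year" = "Year") to match columns with different names; the key columns should still be of compatible types.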

Acknowledgement#

The material in this notebook is adapted from Eliza Wood’s Tidyverse: Data wrangling & visualization course, which is licensed under Creative Commons BY-NC-SA 4.0. That course is itself based on material from UC Davis’s R-DAVIS course, which draws heavily on Carpentries R lessons.

What is the primary purpose of joining datasets?

In a left join, what happens to the rows in the left dataframe that do not have matching values in the right dataframe?

What is a potential issue with joining datasets on columns of type double?
