Tidyverse#

Download Rmd Version#

If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.

Exercises incomplete: Download Introducing_Tidyverse.Rmd
Exercises complete: Download Introducing_Tidyverse_complete.Rmd

Learning Objectives#

Understand the concept of the Tidyverse and its importance in data science
Identify and use the core packages within the Tidyverse, such as readr, dplyr, tidyr, stringr, and lubridate
Differentiate between dataframes and tibbles and understand their usage in the Tidyverse
Apply the principles of tidy data, including structuring data with variables as columns and observations as rows
Load and manipulate datasets using Tidyverse packages

What is the Tidyverse?#

The Tidyverse is self-described as: “…an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”

When you install the tidyverse package, you install a suite of packages. These include the following; we’ve marked the ones this course will introduce with a ->:

-> readr: Load ‘rectangular’ data into an R session (e.g. csv files).
-> dplyr: Manipulate data (filtering, computing summaries, etc.).
-> tidyr: Reshape data (e.g. into ‘tidy’ format).
-> stringr: Work with strings.
-> lubridate: Work with dates and times.
ggplot2: Visualise data.
tibble: A modern refresh of the core R dataframe, used throughout Tidyverse packages.
forcats: Tools for working with R ‘factors’, often used for categorical data.
purrr: Tools for working with R objects in a functional way.
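
As a minimal sketch of how the suite is loaded in practice (assuming the tidyverse package is already installed; the commented-out line shows the usual one-off installation), a single library() call attaches the core packages, and tidyverse_packages() lists every package in the collection:

```r
# install.packages("tidyverse")  # one-off installation, if not already installed

library(tidyverse)  # attaches the core Tidyverse packages in one call

# List every package that belongs to the Tidyverse
tidyverse_packages()

# Individual packages can also be attached on their own if preferred
library(dplyr)
```

Which packages count as 'core' depends on the installed version of tidyverse, so the startup message may differ slightly from one machine to another.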

These packages are designed to help you work with data, from cleaning and manipulation to plotting and modelling. The Tidyverse packages have several benefits:

They are increasingly popular and have large user bases (good for support and advice).
They are generally very well documented.
They are designed to work together seamlessly and operate well with many other modern R packages.
They provide features that help you write expressive code and avoid common pitfalls when working with data.
They are built around a consistent philosophy of how to structure data for analysis (the ‘tidy’ data format).

Dataframes / tibbles#

The key data structure that Tidyverse packages are designed to work with is the dataframe.

Actually, this is not quite the whole story. As you read the documentation for Tidyverse packages, you’ll inevitably come across the term tibble. This is essentially the same thing as a dataframe, although there are some minor differences (for example, printing a tibble shows only the first few rows, along with the type of each column). For the purposes of this course, you can consider dataframes and tibbles to be interchangeable and regard them as ‘the same thing’.

To demonstrate this, let’s consider the Palmer Station penguins data related to three species of Antarctic penguins from Horst, Hill, and Gorman [HHG20]. We’ll work with this dataset throughout this course. The dataset contains size measurements for male and female adult foraging Adélie, Chinstrap, and Gentoo penguins observed on islands in the Palmer Archipelago near Palmer Station, Antarctica, between 2007 and 2009. Data were collected and made available by Dr Kristen Gorman and the Palmer Station Long Term Ecological Research (LTER) Program. You can read more about the package on the palmerpenguins documentation website.

To load the dataset, we attach the palmerpenguins package with library() and look at the penguins object.

library(palmerpenguins)  # loads `penguins` object
penguins

A tibble: 344 × 8
species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
<fct>	<fct>	<dbl>	<dbl>	<int>	<int>	<fct>	<int>
Adelie	Torgersen	39.1	18.7	181	3750	male	2007
Adelie	Torgersen	39.5	17.4	186	3800	female	2007
Adelie	Torgersen	40.3	18.0	195	3250	female	2007
Adelie	Torgersen	NA	NA	NA	NA	NA	2007
Adelie	Torgersen	36.7	19.3	193	3450	female	2007
Adelie	Torgersen	39.3	20.6	190	3650	male	2007
Adelie	Torgersen	38.9	17.8	181	3625	female	2007
Adelie	Torgersen	39.2	19.6	195	4675	male	2007
Adelie	Torgersen	34.1	18.1	193	3475	NA	2007
Adelie	Torgersen	42.0	20.2	190	4250	NA	2007
Adelie	Torgersen	37.8	17.1	186	3300	NA	2007
Adelie	Torgersen	37.8	17.3	180	3700	NA	2007
Adelie	Torgersen	41.1	17.6	182	3200	female	2007
Adelie	Torgersen	38.6	21.2	191	3800	male	2007
Adelie	Torgersen	34.6	21.1	198	4400	male	2007
Adelie	Torgersen	36.6	17.8	185	3700	female	2007
Adelie	Torgersen	38.7	19.0	195	3450	female	2007
Adelie	Torgersen	42.5	20.7	197	4500	male	2007
Adelie	Torgersen	34.4	18.4	184	3325	female	2007
Adelie	Torgersen	46.0	21.5	194	4200	male	2007
Adelie	Biscoe	37.8	18.3	174	3400	female	2007
Adelie	Biscoe	37.7	18.7	180	3600	male	2007
Adelie	Biscoe	35.9	19.2	189	3800	female	2007
Adelie	Biscoe	38.2	18.1	185	3950	male	2007
Adelie	Biscoe	38.8	17.2	180	3800	male	2007
Adelie	Biscoe	35.3	18.9	187	3800	female	2007
Adelie	Biscoe	40.6	18.6	183	3550	male	2007
Adelie	Biscoe	40.5	17.9	187	3200	female	2007
Adelie	Biscoe	37.9	18.6	172	3150	female	2007
Adelie	Biscoe	40.5	18.9	180	3950	male	2007
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
Chinstrap	Dream	46.9	16.6	192	2700	female	2008
Chinstrap	Dream	53.5	19.9	205	4500	male	2008
Chinstrap	Dream	49.0	19.5	210	3950	male	2008
Chinstrap	Dream	46.2	17.5	187	3650	female	2008
Chinstrap	Dream	50.9	19.1	196	3550	male	2008
Chinstrap	Dream	45.5	17.0	196	3500	female	2008
Chinstrap	Dream	50.9	17.9	196	3675	female	2009
Chinstrap	Dream	50.8	18.5	201	4450	male	2009
Chinstrap	Dream	50.1	17.9	190	3400	female	2009
Chinstrap	Dream	49.0	19.6	212	4300	male	2009
Chinstrap	Dream	51.5	18.7	187	3250	male	2009
Chinstrap	Dream	49.8	17.3	198	3675	female	2009
Chinstrap	Dream	48.1	16.4	199	3325	female	2009
Chinstrap	Dream	51.4	19.0	201	3950	male	2009
Chinstrap	Dream	45.7	17.3	193	3600	female	2009
Chinstrap	Dream	50.7	19.7	203	4050	male	2009
Chinstrap	Dream	42.5	17.3	187	3350	female	2009
Chinstrap	Dream	52.2	18.8	197	3450	male	2009
Chinstrap	Dream	45.2	16.6	191	3250	female	2009
Chinstrap	Dream	49.3	19.9	203	4050	male	2009
Chinstrap	Dream	50.2	18.8	202	3800	male	2009
Chinstrap	Dream	45.6	19.4	194	3525	female	2009
Chinstrap	Dream	51.9	19.5	206	3950	male	2009
Chinstrap	Dream	46.8	16.5	189	3650	female	2009
Chinstrap	Dream	45.7	17.0	195	3650	female	2009
Chinstrap	Dream	55.8	19.8	207	4000	male	2009
Chinstrap	Dream	43.5	18.1	202	3400	female	2009
Chinstrap	Dream	49.6	18.2	193	3775	male	2009
Chinstrap	Dream	50.8	19.0	210	4100	male	2009
Chinstrap	Dream	50.2	18.7	198	3775	female	2009

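If we look closely, we see that the class of penguins (i.e. the kind of object it is) is a tibble, indicated by the "tbl_df" in the output below. The same output also lists "data.frame", which is why a tibble can generally be used wherever a dataframe is expected: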

class(penguins)

'tbl_df'
'tbl'
'data.frame'
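
To see how closely the two structures are related, the following sketch converts penguins to a base R dataframe and back again. It assumes the tibble and palmerpenguins packages are installed; the names penguins_df and penguins_tbl are made up for illustration. It also shows one of the minor differences: selecting a single column with [ simplifies to a vector for a dataframe, but stays a one-column table for a tibble.

```r
library(tibble)
library(palmerpenguins)

# Tibble -> base R dataframe
penguins_df <- as.data.frame(penguins)
class(penguins_df)

# Base R dataframe -> tibble
penguins_tbl <- as_tibble(penguins_df)
is_tibble(penguins_tbl)

# Both hold the same data: same number of rows and columns
dim(penguins_df)
dim(penguins_tbl)

# One minor difference: single-column subsetting with `[`
is.factor(penguins_df[, "species"])    # dataframe: simplified to a factor vector
is_tibble(penguins_tbl[, "species"])   # tibble: still a (one-column) tibble
```

For the kind of work covered in this course, either form can be passed to Tidyverse functions.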

Tidy data#

Tidy data is a convention that specifies how to arrange our data into a table. It states that:

The columns of the table should correspond to the variables in the data.
The rows of the table should correspond to observations (or samples) of the variables in the data.

Packages in the Tidyverse are designed to work with tidy data.

You can read more about the tidy data format in Hadley Wickham’s Tidy Data paper [Wic14].

Tidy data: Example#

Suppose we have a pet dog and a pet cat and we want to study their average weight by year, to see if there is some kind of association between the species and the trend in weight variation over time. We could represent this in a ‘matrix-like’ format, with the years corresponding to rows and species corresponding to columns:

Year	Dog_weight_kg	Cat_weight_kg
2021	21.4	8.7
2022	20.9	8.4
2023	21.8	8.1

Table: Weights of pets by year (non-tidy format)

However, this is not in a tidy format. The variables we’re interested in studying are the year, animal species and weight, but here the species is encoded in the column names, with one weight column per species, rather than being stored as a variable in its own right. Instead, we should have a separate column for the species:

Year	Species	Weight_kg
2021	dog	21.4
2021	cat	8.7
2022	dog	20.9
2022	cat	8.4
2023	dog	21.8
2023	cat	8.1

Table: Weights of pets by year (tidy format)
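
The reshaping from the non-tidy table to the tidy one can be sketched in code. This is one possible approach, assuming the tibble and tidyr packages are installed, and using pivot_longer() from tidyr; the object names pets_wide and pets_tidy are made up for illustration.

```r
library(tibble)
library(tidyr)

# The non-tidy ('wide') table: one weight column per species
pets_wide <- tribble(
  ~Year, ~Dog_weight_kg, ~Cat_weight_kg,
   2021,           21.4,            8.7,
   2022,           20.9,            8.4,
   2023,           21.8,            8.1
)

# Gather the two weight columns into a Species column and a Weight_kg column
pets_tidy <- pivot_longer(
  pets_wide,
  cols = c(Dog_weight_kg, Cat_weight_kg),
  names_to = "Species",
  values_to = "Weight_kg"
)

# Tidy up the species labels: "Dog_weight_kg" -> "dog"
pets_tidy$Species <- tolower(sub("_weight_kg$", "", pets_tidy$Species))

pets_tidy
```

The result has one row per combination of year and species (six rows in total), with Year, Species and Weight_kg each in their own column.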

Summary quiz#

What is the Tidyverse?

What is a tibble in the Tidyverse?

In tidy data format, how should the data be structured?

Acknowledgement#

The material in this notebook is adapted from Eliza Wood’s Tidyverse: Data wrangling & visualization course, which is licensed under Creative Commons BY-NC-SA 4.0. That course is itself based on material from UC Davis’s R-DAVIS course, which draws heavily on Carpentries R lessons.