Tidyverse

Learning Objectives

  • Understand the concept of the Tidyverse and its importance in data science

  • Identify and use the core packages within the Tidyverse, such as readr, dplyr, tidyr, stringr, and lubridate

  • Differentiate between dataframes and tibbles and understand their usage in the Tidyverse

  • Apply the principles of tidy data, including structuring data with variables as columns and observations as rows

  • Load and manipulate datasets using Tidyverse packages

What is the Tidyverse?

The Tidyverse describes itself as “…an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”

When you install the tidyverse package, you install a suite of packages. These include the following; we’ve marked the ones this course will introduce with a ->:

  • -> readr: Load ‘rectangular’ data into an R session (e.g. csv files).

  • -> dplyr: Manipulate data (filtering, computing summaries, etc.).

  • -> tidyr: Reshape data (e.g. into ‘tidy’ format).

  • -> stringr: Work with strings.

  • -> lubridate: Work with dates and times.

  • ggplot2: Visualise data.

  • tibble: A modern refresh of the core R dataframe, used throughout Tidyverse packages.

  • forcats: Tools for working with R ‘factors’, often used for categorical data.

  • purrr: Tools for working with R objects in a functional way.

These packages are designed to help you work with data, from cleaning and manipulation to plotting and modelling. The benefits of the Tidyverse include:

  • The packages are increasingly popular and have large user bases (good for support and advice).

  • The packages are generally very well documented.

  • The packages are designed to work together seamlessly and operate well with many other modern R packages.

  • The packages provide features to help you write expressive code and avoid common pitfalls when working with data.

  • The packages are built around a consistent philosophy of how to structure data for analysis (the ‘tidy’ data format).
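As a rough illustration of how the packages fit together, the sketch below builds a small tibble and passes it through stringr and dplyr functions in a single pipeline. The `pets` table and its column names are made up for this example, and the native pipe `|>` assumes R 4.1 or later.

```r
library(tidyverse)  # attaches the core packages, including tibble, dplyr and stringr

# A small made-up table (hypothetical data, for illustration only)
pets <- tibble(
  name = c("  Rex", "Tom ", "Bella"),
  species = c("dog", "cat", "dog"),
  weight_kg = c(21.4, 8.7, 18.2)
)

dog_weights <- pets |>
  mutate(name = str_trim(name)) |>  # stringr: remove stray whitespace
  filter(species == "dog") |>       # dplyr: keep only the dogs
  arrange(desc(weight_kg))          # dplyr: heaviest first

dog_weights
```

Each function takes a dataframe (or a column of one) as input and returns an object of the same kind, which is what lets the steps chain together.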

Dataframes / tibbles

The key data structure that Tidyverse packages are designed to work with is the dataframe.

Actually, this is not quite the whole story. As you read the documentation for Tidyverse packages, you’ll inevitably come across the term tibble. This is essentially the same thing as a dataframe, although there are some minor differences. For the purposes of this course, you can consider dataframes and tibbles to be interchangeable and regard them as ‘the same thing’.
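For the curious, here is a minimal sketch of one of those minor differences, using a made-up two-column table (the names `df`, `tb`, `height` and `weight` are ours and do not come from any dataset). It assumes the tibble package is installed.

```r
library(tibble)

df <- data.frame(height = c(1.5, 1.7), weight = c(60, 72))
tb <- as_tibble(df)  # same data, stored as a tibble

# Both count as dataframes as far as R is concerned
is.data.frame(df)
is.data.frame(tb)

# Selecting a single column with [ , ]:
# a base dataframe simplifies to a plain vector, a tibble stays a tibble
df_col <- df[, "height"]
tb_col <- tb[, "height"]
class(df_col)
class(tb_col)
```

Tibbles also print more compactly, showing only the first few rows along with each column's type. None of this changes how the functions in this course are used.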

To demonstrate this, let’s consider the Palmer Station penguins data on three species of Antarctic penguins from Horst, Hill, and Gorman [HHG20]. We’ll work with this dataset throughout this course. The data contains size measurements for male and female adult foraging Adélie, Chinstrap, and Gentoo penguins observed on islands in the Palmer Archipelago near Palmer Station, Antarctica, between 2007 and 2009. Data were collected and made available by Dr Kristen Gorman and the Palmer Station Long Term Ecological Research (LTER) Program. You can read more about the package on the palmerpenguins documentation website.

To load the dataset, we attach the palmerpenguins package with library() and look at the penguins object.

%%R
library(palmerpenguins)  # loads `penguins` object
penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
# ℹ Use `print(n = ...)` to see more rows

If we look closely, we see that the class of penguins (i.e. the kind of object it is) is a tibble, indicated by the "tbl_df" in the output below:

%%R
class(penguins)
[1] "tbl_df"     "tbl"        "data.frame"
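Because "data.frame" also appears in that class vector, functions that expect an ordinary dataframe will generally accept a tibble too. The following sketch checks this and converts in both directions. It assumes the palmerpenguins and tibble packages are installed, and the names `penguins_df` and `penguins_tbl` are ours.

```r
library(palmerpenguins)
library(tibble)

inherits(penguins, "tbl_df")  # is it a tibble?
is.data.frame(penguins)       # is it also a dataframe?

# Convert to a plain base R dataframe (drops the tibble classes)
penguins_df <- as.data.frame(penguins)
class(penguins_df)

# Convert back to a tibble
penguins_tbl <- as_tibble(penguins_df)
class(penguins_tbl)
```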

Tidy data

Tidy data is a convention that specifies how to arrange our data into a table. It states that:

  • The columns of the table should correspond to the variables in the data.

  • The rows of the table should correspond to observations (or samples) of the variables in the data.

Packages in the Tidyverse are designed to work with tidy data.

You can read more about the tidy data format in Hadley Wickham’s Tidy Data paper[Wic14].
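The penguins table is already tidy: each row is one observed penguin and each column is one variable. As a sketch of what that buys us, assuming the dplyr and palmerpenguins packages are installed and R is version 4.1 or later (for the `|>` pipe), a grouped summary needs no reshaping beforehand. The names `mass_by_species`, `n_penguins` and `mean_body_mass_g` are chosen here for illustration.

```r
library(dplyr)
library(palmerpenguins)

mass_by_species <- penguins |>
  group_by(species) |>  # one group per value of the species column
  summarise(
    n_penguins = n(),                                   # rows (observations) per group
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE)  # skip missing measurements
  )

mass_by_species
```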

Tidy data: Example

Suppose we have a pet dog and a pet cat and we want to study their average weight by year, to see if there is some kind of association between the species and the trend in weight over time. We could represent this in a ‘matrix-like’ format, with the years corresponding to rows and species corresponding to columns:

Year  Dog_weight_kg  Cat_weight_kg
----  -------------  -------------
2021  21.4           8.7
2022  20.9           8.4
2023  21.8           8.1

Table: Weights of pets by year (non-tidy format)

However, this is not in a tidy format. The variables we’re interested in studying are the year, animal species and weight, but here the weights are spread across one column per species, so the species is encoded in the column names rather than stored as a variable. Instead, we should have a separate column for the species:

Year  Species  Weight_kg
----  -------  ---------
2021  dog      21.4
2021  cat      8.7
2022  dog      20.9
2022  cat      8.4
2023  dog      21.8
2023  cat      8.1

Table: Weights of pets by year (tidy format)
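One possible way to carry out this reshaping in code is tidyr’s pivot_longer() function. The sketch below is a minimal example under a few assumptions: the object names `pets_wide` and `pets_tidy` are ours, the tables are typed in by hand rather than loaded from a file, and R is version 4.1 or later (for the `|>` pipe).

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# The non-tidy table from above: one weight column per species
pets_wide <- tribble(
  ~Year, ~Dog_weight_kg, ~Cat_weight_kg,
  2021,  21.4,           8.7,
  2022,  20.9,           8.4,
  2023,  21.8,           8.1
)

pets_tidy <- pets_wide |>
  pivot_longer(
    cols = -Year,            # reshape every column except Year
    names_to = "Species",    # old column names go into a Species column
    values_to = "Weight_kg"  # cell values go into a Weight_kg column
  ) |>
  # Turn "Dog_weight_kg" into "dog", and likewise for the cat column
  mutate(Species = str_to_lower(str_remove(Species, "_weight_kg")))

pets_tidy

# With species as its own column, a per-species summary is straightforward
pets_tidy |>
  group_by(Species) |>
  summarise(mean_weight_kg = mean(Weight_kg))
```

The result has one row per year and species, matching the tidy table above.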

Acknowledgement

The material in this notebook is adapted from Eliza Wood’s Tidyverse: Data wrangling & visualization course, which is licensed under Creative Commons BY-NC-SA 4.0. That course is in turn based on material from UC Davis’s R-DAVIS course, which draws heavily on Carpentries R lessons.