Tidyverse

Download Rmd Version

If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.

  • Exercises incomplete: Download Introducing_Tidyverse.Rmd

  • Exercises complete: Download Introducing_Tidyverse_complete.Rmd

Learning Objectives

  • Understand the concept of the Tidyverse and its importance in data science

  • Identify and use the core packages within the Tidyverse, such as readr, dplyr, tidyr, stringr, and lubridate

  • Differentiate between dataframes and tibbles and understand their usage in the Tidyverse

  • Apply the principles of tidy data, including structuring data with variables as columns and observations as rows

  • Load and manipulate datasets using Tidyverse packages

What is the Tidyverse?

The Tidyverse is self-described as: “…an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”

When you install the tidyverse package, you install a suite of packages. These include the following; we’ve marked the ones this course will introduce with a ->:

  • -> readr: Load ‘rectangular’ data into an R session (e.g. csv files).

  • -> dplyr: Manipulate data (filtering, computing summaries, etc.)

  • -> tidyr: Reshape data (e.g. into ‘tidy’ format).

  • -> stringr: Working with strings.

  • -> lubridate: Working with dates.

  • ggplot2: Visualise data.

  • tibble: A modern refresh of the core R dataframe, used throughout Tidyverse packages.

  • forcats: Tools for working with R ‘factors’, often used for categorical data.

  • purrr: Tools for working with R objects in a functional way.

These packages are designed to help you work with data, from cleaning and manipulation to plotting and modelling. The benefits of the Tidyverse include:

  • The packages are increasingly popular and have large user bases (good for support and advice).

  • The packages are generally very well documented.

  • The packages are designed to work together seamlessly and operate well with many other modern R packages.

  • They provide features to help you write expressive code and avoid common pitfalls when working with data.

  • They are built around a consistent philosophy of how to structure data for analysis (the ‘tidy’ data format).
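To give a flavour of how these packages fit together, here is a minimal sketch that builds a small tibble and then filters and summarises it with dplyr. The `pets` table, its column names and its values are made up purely for illustration, and the sketch assumes the tibble and dplyr packages are installed.

```r
library(tibble)
library(dplyr)

# A small, made-up table (hypothetical values, for illustration only)
pets <- tibble(
  species   = c("dog", "dog", "cat", "cat"),
  weight_kg = c(21.4, 20.9, 8.7, 8.4)
)

# Keep only the rows where the weight exceeds 10 kg
heavy <- pets %>%
  filter(weight_kg > 10)

# Compute the mean weight for each species
mean_by_species <- pets %>%
  group_by(species) %>%
  summarise(mean_weight_kg = mean(weight_kg))

heavy
mean_by_species
```

Both steps take a tibble in and return a tibble, which is part of what lets the packages chain together with the pipe operator `%>%`.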

Dataframes / tibbles

The key data structure that Tidyverse packages are designed to work with is the dataframe.

Actually, this is not quite the whole story. As you read the documentation for Tidyverse packages, you’ll inevitably come across the term tibble. This is essentially the same thing as a dataframe, although there are some minor differences. For the purposes of this course, you can consider dataframes and tibbles to be interchangeable and regard them as ‘the same thing’.
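As a rough illustration of one of those minor differences, the sketch below converts a small base R dataframe into a tibble and compares how the two behave when a single column is selected with `[`. The example data and the object names (`df`, `tb`) are made up for illustration; the sketch assumes the tibble package is installed.

```r
library(tibble)

# A small, made-up dataframe (hypothetical values)
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

# Convert it to a tibble
tb <- as_tibble(df)

class(df)  # a plain data.frame
class(tb)  # a tibble, which also inherits from data.frame

# Selecting a single column with [ , ] behaves differently:
df_col <- df[, "name"]  # base dataframe: simplified to a plain vector
tb_col <- tb[, "name"]  # tibble: still a one-column tibble

is.data.frame(df_col)
is_tibble(tb_col)
```

Differences like this rarely matter for the material in this course, which is why the two can be treated as interchangeable here.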

To demonstrate this, let’s consider the Palmer Station penguins data on three species of Antarctic penguins from Horst, Hill, and Gorman[HHG20]. We’ll work with this dataset throughout this course. The data contains size measurements for male and female adult foraging Adélie, Chinstrap, and Gentoo penguins observed on islands in the Palmer Archipelago near Palmer Station, Antarctica, between 2007 and 2009. Data were collected and made available by Dr Kristen Gorman and the Palmer Station Long Term Ecological Research (LTER) Program. You can read more about the package on the palmerpenguins documentation website.

To access the dataset, we load the palmerpenguins package and look at the penguins object.

library(palmerpenguins)  # loads `penguins` object
penguins
A tibble: 344 × 8

species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
<fct>      <fct>      <dbl>           <dbl>          <int>              <int>        <fct>   <int>
Adelie     Torgersen  39.1            18.7           181                3750         male    2007
Adelie     Torgersen  39.5            17.4           186                3800         female  2007
Adelie     Torgersen  40.3            18.0           195                3250         female  2007
Adelie     Torgersen  NA              NA             NA                 NA           NA      2007
Adelie     Torgersen  36.7            19.3           193                3450         female  2007
Adelie     Torgersen  39.3            20.6           190                3650         male    2007
Adelie     Torgersen  38.9            17.8           181                3625         female  2007
Adelie     Torgersen  39.2            19.6           195                4675         male    2007
Adelie     Torgersen  34.1            18.1           193                3475         NA      2007
Adelie     Torgersen  42.0            20.2           190                4250         NA      2007
Adelie     Torgersen  37.8            17.1           186                3300         NA      2007
Adelie     Torgersen  37.8            17.3           180                3700         NA      2007
Adelie     Torgersen  41.1            17.6           182                3200         female  2007
Adelie     Torgersen  38.6            21.2           191                3800         male    2007
Adelie     Torgersen  34.6            21.1           198                4400         male    2007
Adelie     Torgersen  36.6            17.8           185                3700         female  2007
Adelie     Torgersen  38.7            19.0           195                3450         female  2007
Adelie     Torgersen  42.5            20.7           197                4500         male    2007
Adelie     Torgersen  34.4            18.4           184                3325         female  2007
Adelie     Torgersen  46.0            21.5           194                4200         male    2007
Adelie     Biscoe     37.8            18.3           174                3400         female  2007
Adelie     Biscoe     37.7            18.7           180                3600         male    2007
Adelie     Biscoe     35.9            19.2           189                3800         female  2007
Adelie     Biscoe     38.2            18.1           185                3950         male    2007
Adelie     Biscoe     38.8            17.2           180                3800         male    2007
Adelie     Biscoe     35.3            18.9           187                3800         female  2007
Adelie     Biscoe     40.6            18.6           183                3550         male    2007
Adelie     Biscoe     40.5            17.9           187                3200         female  2007
Adelie     Biscoe     37.9            18.6           172                3150         female  2007
Adelie     Biscoe     40.5            18.9           180                3950         male    2007
⋮          ⋮          ⋮               ⋮              ⋮                  ⋮            ⋮       ⋮
Chinstrap  Dream      46.9            16.6           192                2700         female  2008
Chinstrap  Dream      53.5            19.9           205                4500         male    2008
Chinstrap  Dream      49.0            19.5           210                3950         male    2008
Chinstrap  Dream      46.2            17.5           187                3650         female  2008
Chinstrap  Dream      50.9            19.1           196                3550         male    2008
Chinstrap  Dream      45.5            17.0           196                3500         female  2008
Chinstrap  Dream      50.9            17.9           196                3675         female  2009
Chinstrap  Dream      50.8            18.5           201                4450         male    2009
Chinstrap  Dream      50.1            17.9           190                3400         female  2009
Chinstrap  Dream      49.0            19.6           212                4300         male    2009
Chinstrap  Dream      51.5            18.7           187                3250         male    2009
Chinstrap  Dream      49.8            17.3           198                3675         female  2009
Chinstrap  Dream      48.1            16.4           199                3325         female  2009
Chinstrap  Dream      51.4            19.0           201                3950         male    2009
Chinstrap  Dream      45.7            17.3           193                3600         female  2009
Chinstrap  Dream      50.7            19.7           203                4050         male    2009
Chinstrap  Dream      42.5            17.3           187                3350         female  2009
Chinstrap  Dream      52.2            18.8           197                3450         male    2009
Chinstrap  Dream      45.2            16.6           191                3250         female  2009
Chinstrap  Dream      49.3            19.9           203                4050         male    2009
Chinstrap  Dream      50.2            18.8           202                3800         male    2009
Chinstrap  Dream      45.6            19.4           194                3525         female  2009
Chinstrap  Dream      51.9            19.5           206                3950         male    2009
Chinstrap  Dream      46.8            16.5           189                3650         female  2009
Chinstrap  Dream      45.7            17.0           195                3650         female  2009
Chinstrap  Dream      55.8            19.8           207                4000         male    2009
Chinstrap  Dream      43.5            18.1           202                3400         female  2009
Chinstrap  Dream      49.6            18.2           193                3775         male    2009
Chinstrap  Dream      50.8            19.0           210                4100         male    2009
Chinstrap  Dream      50.2            18.7           198                3775         female  2009

If we look closely, we see that the class of penguins (i.e. the kind of object it is) is a tibble, indicated by the "tbl_df" in the output below:

class(penguins)
  1. 'tbl_df'
  2. 'tbl'
  3. 'data.frame'
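Because "data.frame" also appears in that class vector, a tibble can be used wherever a dataframe is expected. The short sketch below checks this directly; it assumes the palmerpenguins and tibble packages are installed, and uses only the `penguins` object introduced above.

```r
library(palmerpenguins)  # provides the `penguins` tibble
library(tibble)

is.data.frame(penguins)  # TRUE: a tibble is also a dataframe
is_tibble(penguins)      # TRUE: and it is specifically a tibble
dim(penguins)            # number of rows and columns (344 and 8)
```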

Tidy data

Tidy data is a convention that specifies how to arrange our data into a table. It states that:

  • The columns of the table should correspond to the variables in the data.

  • The rows of the table should correspond to observations (or samples) of the variables in the data.

Packages in the Tidyverse are designed to work with tidy data.

You can read more about the tidy data format in Hadley Wickham’s Tidy Data paper[Wic14].

Tidy data: Example

Suppose we have a pet dog and a pet cat and we want to study their average weight by year, to see if there is some kind of association between the species and the trend in weight over time. We could represent this in a ‘matrix-like’ format, with the years corresponding to rows and the species corresponding to columns:

Year  Dog_weight_kg  Cat_weight_kg
----  -------------  -------------
2021  21.4           8.7
2022  20.9           8.4
2023  21.8           8.1

Table: Weights of pets by year (non-tidy format)

However, this is not in a tidy format. The variables we’re interested in studying are the year, animal species and weight, but here the columns correspond to the weights for each kind of species. Instead, we should have a separate column for the species:

Year  Species  Weight_kg
----  -------  ---------
2021  dog      21.4
2021  cat      8.7
2022  dog      20.9
2022  cat      8.4
2023  dog      21.8
2023  cat      8.1

Table: Weights of pets by year (tidy format)
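The reshaping from the non-tidy table to the tidy one can be sketched with tidyr's `pivot_longer()`. This is one possible approach rather than the only one; the object names `pets_wide` and `pets_tidy` are made up for illustration, and the sketch assumes the tibble and tidyr packages are installed.

```r
library(tibble)
library(tidyr)

# The non-tidy ('matrix-like') table from the example above
pets_wide <- tibble(
  Year          = c(2021, 2022, 2023),
  Dog_weight_kg = c(21.4, 20.9, 21.8),
  Cat_weight_kg = c(8.7, 8.4, 8.1)
)

# Gather the two weight columns into a Species column and a Weight_kg column.
# names_pattern keeps only the part of each column name before "_weight_kg".
pets_tidy <- pivot_longer(
  pets_wide,
  cols          = c(Dog_weight_kg, Cat_weight_kg),
  names_to      = "Species",
  names_pattern = "(.*)_weight_kg",
  values_to     = "Weight_kg"
)

# Lower-case the species labels to match the tidy table above
pets_tidy$Species <- tolower(pets_tidy$Species)

pets_tidy
```

Each row of `pets_tidy` is now a single observation (one animal in one year), and each column holds a single variable.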

Summary quiz

What is the Tidyverse?

What is a tibble in the Tidyverse?

In tidy data format, how should the data be structured?

Acknowledgement

The material in this notebook is adapted from Eliza Wood’s Tidyverse: Data wrangling & visualization course, which is licensed under Creative Commons BY-NC-SA 4.0. That course is itself based on material from UC Davis’s R-DAVIS course, which draws heavily on Carpentries R lessons.