Tidyverse#
Download Rmd Version#
If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.
Exercises incomplete: Download Introducing_Tidyverse.Rmd
Exercises complete: Download Introducing_Tidyverse_complete.Rmd
Learning Objectives#
Understand the concept of the Tidyverse and its importance in data science
Identify and use the core packages within the Tidyverse, such as readr, dplyr, tidyr, stringr, and lubridate
Differentiate between dataframes and tibbles and understand their usage in the Tidyverse
Apply the principles of tidy data, including structuring data with variables as columns and observations as rows
Load and manipulate datasets using Tidyverse packages
What is the Tidyverse?#
The Tidyverse describes itself as: “…an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”
When you install the tidyverse package, you install a suite of packages. These include the following; we’ve marked the ones this course will introduce with a ->:
-> readr: Load ‘rectangular’ data into an R session (e.g. csv files).
-> dplyr: Manipulate data (filtering, computing summaries, etc.).
-> tidyr: Reshape data (e.g. into ‘tidy’ format).
-> stringr: Working with strings.
-> lubridate: Working with dates.
ggplot2: Visualise data.
tibble: A modern refresh of the core R dataframe, used throughout Tidyverse packages.
forcats: Tools for working with R ‘factors’, often used for categorical data.
purrr: Tools for working with R objects in a functional way.
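As a minimal sketch of what installing and loading the suite looks like in practice, the snippet below assumes a working R installation with internet access for the (commented-out) install step. The `tidyverse_packages()` helper comes from the tidyverse package itself and lists the packages that belong to the suite.

```r
# Install the whole suite once (commented out so the snippet can be re-run safely)
# install.packages("tidyverse")

# Attach the core Tidyverse packages in one go
library(tidyverse)

# List the names of all packages that belong to the Tidyverse
pkgs <- tidyverse_packages()
print(pkgs)
```

Individual packages can also be loaded on their own, e.g. `library(dplyr)`, if you only need part of the suite.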
These are designed to help you work with data, from cleaning and manipulation to plotting and modelling. The benefits of the Tidyverse include:
They are increasingly popular with large user bases (good for support / advice).
The packages are generally very well documented.
The packages are designed to work together seamlessly and operate well with many other modern R packages.
They provide features to help you write expressive code and avoid common pitfalls when working with data.
They are built around a consistent philosophy of how to structure data for analysis (the ‘tidy’ data format).
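To give a rough feel for what ‘expressive code’ and packages ‘working together’ can mean, here is a small sketch that builds a table with tibble and summarises it with dplyr. The `pets` data and the column names are hypothetical, invented purely for illustration.

```r
library(tibble)
library(dplyr)

# Hypothetical toy data, invented purely for illustration
pets <- tibble(
  species   = c("dog", "dog", "cat", "cat"),
  weight_kg = c(21.4, 20.9, 8.7, 8.4)
)

# Read the pipeline top to bottom: take pets, group by species,
# then compute the mean weight within each group
mean_weights <- pets %>%
  group_by(species) %>%
  summarise(mean_weight_kg = mean(weight_kg))

print(mean_weights)
```

Each step takes a table and returns a table, which is what lets the functions chain together with the pipe (`%>%`).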
Dataframes / tibbles#
The key data structure that Tidyverse packages are designed to work with is the dataframe.
Actually, this is not quite the whole story. As you read the documentation for Tidyverse packages, you’ll inevitably come across the term tibble. This is essentially the same thing as a dataframe, although there are some minor differences. For the purposes of this course, you can consider dataframes and tibbles to be interchangeable and regard them as ‘the same thing’.
To demonstrate this, let’s consider the Palmer Station penguins data related to
three species of Antarctic penguins from Horst, Hill, and Gorman[HHG20]. We’ll
work with this dataset throughout this course. The data contains size
measurements for male and female adult foraging Adélie, Chinstrap, and Gentoo
penguins observed on islands in the Palmer Archipelago near Palmer Station,
Antarctica between 2007 and 2009. Data were collected and made available by
Dr Kristen Gorman and the Palmer Station Long Term Ecological Research (LTER)
Program. You can read more about the package on the palmerpenguins documentation website.
To access the dataset, we just load the palmerpenguins package and look at the penguins object.
%%R
library(palmerpenguins) # loads `penguins` object
penguins
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
# ℹ Use `print(n = ...)` to see more rows
If we check the class of penguins (i.e. the kind of object it is), we see that it is a tibble, indicated by the "tbl_df" in the output below. The output also includes "data.frame", showing that a tibble is a kind of dataframe:
%%R
class(penguins)
[1] "tbl_df" "tbl" "data.frame"
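The ‘minor differences’ between dataframes and tibbles mentioned earlier can be seen in a small sketch. The toy table `df` below is hypothetical; `as_tibble()` comes from the tibble package and `as.data.frame()` from base R. One difference worth knowing is that selecting a single column with `[ , ]` simplifies a dataframe to a vector, whereas a tibble stays a tibble.

```r
library(tibble)

# Hypothetical toy table, invented purely for illustration
df <- data.frame(x = c(1.5, 2.5, 3.5), y = c("a", "b", "c"))

# Convert the base R dataframe to a tibble
tb <- as_tibble(df)

class(df)  # a plain dataframe
class(tb)  # a tibble, which is also a dataframe

# Single-column selection behaves differently
df_col <- df[, "x"]  # simplified to a numeric vector
tb_col <- tb[, "x"]  # still a one-column tibble

# Convert back to a plain dataframe if another package needs one
df_again <- as.data.frame(tb)
```

For the purposes of this course the two can still be treated as interchangeable; the conversion functions are there if you ever need one form specifically.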
Tidy data#
Tidy data is a convention that specifies how to arrange our data into a table. It states that:
The columns of the table should correspond to the variables in the data.
The rows of the table should correspond to observations (or samples) of the variables in the data.
Packages in the Tidyverse are designed to work with tidy data.
You can read more about the tidy data format in Hadley Wickham’s Tidy Data paper[Wic14].
Tidy data: Example#
Suppose we have a pet dog and a pet cat and we want to study their average weight by year, to see whether there is an association between species and the trend in weight over time. We could represent this in a ‘matrix-like’ format, with the years corresponding to rows and species corresponding to columns:
Year | Dog_weight_kg | Cat_weight_kg |
---|---|---|
2021 | 21.4 | 8.7 |
2022 | 20.9 | 8.4 |
2023 | 21.8 | 8.1 |
: Weights of pets by year (non-tidy format)
However, this is not in a tidy format. The variables we’re interested in studying are the year, the animal species and the weight, but here each column holds the weights for one species, so the species is encoded in the column names rather than stored as a variable. Instead, we should have a separate column for the species:
Year | Species | Weight_kg |
---|---|---|
2021 | dog | 21.4 |
2021 | cat | 8.7 |
2022 | dog | 20.9 |
2022 | cat | 8.4 |
2023 | dog | 21.8 |
2023 | cat | 8.1 |
: Weights of pets by year (tidy format)
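The step from the first table to the second can be sketched in code with `pivot_longer()` from tidyr. This is one possible way to do the reshaping, not the only one; the object names `pets_wide` and `pets_tidy` are hypothetical, and the values are copied from the tables above.

```r
library(tibble)
library(tidyr)

# The non-tidy ('wide') table: one weight column per species
pets_wide <- tibble(
  Year          = c(2021, 2022, 2023),
  Dog_weight_kg = c(21.4, 20.9, 21.8),
  Cat_weight_kg = c(8.7, 8.4, 8.1)
)

# Gather the two weight columns into Species / Weight_kg pairs.
# names_pattern keeps only the part of the column name before "_weight_kg".
pets_tidy <- pivot_longer(
  pets_wide,
  cols          = c(Dog_weight_kg, Cat_weight_kg),
  names_to      = "Species",
  names_pattern = "(.*)_weight_kg",
  values_to     = "Weight_kg"
)

# Lower-case the species labels to match the tidy table above
pets_tidy$Species <- tolower(pets_tidy$Species)

print(pets_tidy)
```

The result has one row per (year, species) observation, with each variable in its own column.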
Acknowledgement#
The material in this notebook is adapted from Eliza Wood’s Tidyverse: Data wrangling & visualization course, which is licensed under Creative Commons BY-NC-SA 4.0. That course is itself based on material from UC Davis’s R-DAVIS course, which draws heavily on Carpentries R lessons.