Manipulating a Dataframe#
Download Rmd Version#
If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.
Exercises incomplete: Download Manipulating_a_dataframe.Rmd Exercises complete: Download Manipulating_a_dataframe_complete.Rmd
Learning Objectives#
Learn how to subset data by selecting specific columns and filtering rows based on conditions using the
dplyr
packageUnderstand how to create new variables and rename existing ones within a dataframe
Understand sorting data based on one or more variables
Learn how to summarise and aggregate data, including calculating summary statistics for the entire dataset and for grouped data
Develop techniques for identifying and filtering out missing values in a dataset
Introduction#
The package dplyr
from the Tidyverse contains functions for manipulating
dataframes. We will use the penguins
dataset contained in the palmerpenguins
package. Note that this data is already in tidy format.
%%R
library(palmerpenguins) # loads `penguins` data
head(penguins)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
Sub-setting data#
For taking subsets of data, we can use the functions select
and
filter
from dplyr
:
dplyr::select
is used to keep only specified columns of the dataframe, i.e. to ‘select’ certain variables from the data.dplyr::filter
is used to keep rows that meet some specified condition, i.e. to ‘filter’ the observations in the data.
Selecting columns#
To use select
, name the columns to keep after supplying the dataframe
(or tibble), like so:
dplyr::select(data, column1, column2, ...)
Note: We don’t use strings to specify the columns! Instead, we write them
as if they were variables e.g. dplyr::select(data, column1)
instead of
dplyr::select(data, "column1")
.
Let’s first check the column names in our tibble:
%%R
colnames(penguins)
[1] "species" "island" "bill_length_mm"
[4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
[7] "sex" "year"
Now let’s select the ‘species’, ‘year’, ‘sex’, and ‘body mass’ columns:
%%R
# Select species, year, sex, and body mass
penguins_selected <- dplyr::select(penguins, species, year, sex, body_mass_g)
head(penguins_selected)
# A tibble: 6 × 4
species year sex body_mass_g
<fct> <int> <fct> <int>
1 Adelie 2007 male 3750
2 Adelie 2007 female 3800
3 Adelie 2007 female 3250
4 Adelie 2007 <NA> NA
5 Adelie 2007 female 3450
6 Adelie 2007 male 3650
The select
function can also be used to remove columns by using the -
operator. For example, we may want to remove only two columns, ‘island’ and
‘year’:
%%R
# Remove the island and year columns by using `-`.
penguins_deselected <- dplyr::select(penguins, -island, -year)
# Take a look at the columns
colnames(penguins_deselected)
[1] "species" "bill_length_mm" "bill_depth_mm"
[4] "flipper_length_mm" "body_mass_g" "sex"
Filtering rows#
The filter
function from dplyr
is used to keep rows from a dataframe/tibble
that meet a predicate condition in terms of the column values, i.e. a
statement that is either TRUE
or FALSE
.
Comparitive operators#
One way to create predicate conditions is to use the comparative operators, which should be familiar:
x == y : equal
x != y : not equal
x > y : greater than
x < y : less than
x >= y : greater than or equal
x <= y : less than or equal
For example, suppose we want to keep all penguin data only from the year
2008. In base R, we could accomplish this by indexing on a boolean vector
created using the ==
operator:
%%R
# Create boolean vector indicating where year = 2008
is_year_2008 <- penguins$year == 2008
head(is_year_2008, n = 75)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[73] TRUE TRUE TRUE
%%R
# Index on this vector
head(penguins[is_year_2008,])
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Biscoe 39.6 17.7 186 3500
2 Adelie Biscoe 40.1 18.9 188 4300
3 Adelie Biscoe 35 17.9 190 3450
4 Adelie Biscoe 42 19.5 200 4050
5 Adelie Biscoe 34.5 18.1 187 2900
6 Adelie Biscoe 41.4 18.6 191 3700
# ℹ 2 more variables: sex <fct>, year <int>
The filter
function works by specifying the predicate condition for filtering
in terms of the column names, like so:
dplyr::filter(data, condition) # keep rows satisfying 'condition'
So to filter the penguins data to keep only the 2008 observations:
%%R
penguins_2008 <- dplyr::filter(penguins, year == 2008)
head(penguins_2008)
# A tibble: 6 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Biscoe 39.6 17.7 186 3500
2 Adelie Biscoe 40.1 18.9 188 4300
3 Adelie Biscoe 35 17.9 190 3450
4 Adelie Biscoe 42 19.5 200 4050
5 Adelie Biscoe 34.5 18.1 187 2900
6 Adelie Biscoe 41.4 18.6 191 3700
# ℹ 2 more variables: sex <fct>, year <int>
Logical operators#
We can build more complicated predicate conditions by using the logical operators on columns:
A & B : A AND B e.g. col1 == 2 & col2 == 10
A | B : A OR B e.g. col1 > 2 | col2 != 10
!A : NOT A e.g. !(col1 < 2)
In addition, filter
allows us to specify multiple AND operations by listing
out multiple predicate conditions:
dplyr::filter(data, condition1, condition2, ...) # keep rows satisfying
# 'condition1' AND 'condition2' AND ...
Some examples on the penguin data:
%%R
# Observations of male penguins from the year 2008:
dplyr::filter(penguins, sex == "male" & year == 2008) # using &
dplyr::filter(penguins, sex == "male", year == 2008) # listing multiple conditions in filter
# Observations of penguins with bill length > 40mm or bill depth < 20mm
dplyr::filter(penguins, bill_length_mm > 40 | bill_depth_mm < 20)
# Observations except those of male penguins on the island of Biscoe
dplyr::filter(penguins, !(sex == "male" & island == "Biscoe"))
# A tibble: 256 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 246 more rows
# ℹ 2 more variables: sex <fct>, year <int>
# ℹ Use `print(n = ...)` to see more rows
Filtering missing values#
There are other functions we can use to evaluate columns and get a true/false
output. An important one is is.na()
. This function evaluates a column and
reports back a TRUE
value when there is an NA
in that column’s row. For
example, we can use the head()
function to look at the top 6 values of the
sex
column, and see that there is an NA in the fourth row.
%%R
head(penguins$sex)
[1] male female female <NA> female male
Levels: female male
The is.na
function evaluates the whole column and gives us TRUE
s whenever it
sees an NA. Not surprisingly, we see a TRUE
in the fourth observation.
%%R
head(is.na(penguins$sex))
[1] FALSE FALSE FALSE TRUE FALSE FALSE
Since is.na()
gives us a TRUE
/FALSE
vector, we can use it with filter
.
The following gives us all rows where the sex column has missing data (i.e. is
set to NA
):
%%R
dplyr::filter(penguins, is.na(sex))
# A tibble: 11 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen NA NA NA NA
2 Adelie Torgersen 34.1 18.1 193 3475
3 Adelie Torgersen 42 20.2 190 4250
4 Adelie Torgersen 37.8 17.1 186 3300
5 Adelie Torgersen 37.8 17.3 180 3700
6 Adelie Dream 37.5 18.9 179 2975
7 Gentoo Biscoe 44.5 14.3 216 4100
8 Gentoo Biscoe 46.2 14.4 214 4650
9 Gentoo Biscoe 47.3 13.8 216 4725
10 Gentoo Biscoe 44.5 15.7 217 4875
11 Gentoo Biscoe NA NA NA NA
# ℹ 2 more variables: sex <fct>, year <int>
To do the reverse, i.e. keep only the observations where the sex value is not
missing, combine is.na
with the NOT operator:
%%R
dplyr::filter(penguins, !is.na(sex))
# A tibble: 333 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen 36.7 19.3 193 3450
5 Adelie Torgersen 39.3 20.6 190 3650
6 Adelie Torgersen 38.9 17.8 181 3625
7 Adelie Torgersen 39.2 19.6 195 4675
8 Adelie Torgersen 41.1 17.6 182 3200
9 Adelie Torgersen 38.6 21.2 191 3800
10 Adelie Torgersen 34.6 21.1 198 4400
# ℹ 323 more rows
# ℹ 2 more variables: sex <fct>, year <int>
# ℹ Use `print(n = ...)` to see more rows
Exercise: subsetting data#
Write code to extract data for Gentoo penguins with weight between 4.8kg and 5.2kg (inclusive). Only keep the species, sex and body mass columns in the output. Give the numbers of resulting observations for male penguins, female penguins and penguins with unknown sex.
Solution
%%R
penguins_subset <- dplyr::filter(penguins,
species == "Gentoo",
4800 <= body_mass_g & body_mass_g <= 5200)
penguins_subset <- dplyr::select(penguins_subset, species, sex, body_mass_g)
penguins_subset
cat("Number of male penguins:", nrow(dplyr::filter(penguins_subset, sex == "male")), "\n")
cat("Number of female penguins:", nrow(dplyr::filter(penguins_subset, sex == "female")), "\n")
cat("Number of penguins of unknown sex:", nrow(dplyr::filter(penguins_subset, is.na(sex))), "\n")
Creating new variables and renaming variables#
We often want to make a new column with some updated or transformed value. We
can use the mutate
function in dplyr
for this:
# value1, value2,... are expressions, possibly involving column names
dplyr::mutate(data, new_column1 = value1, new_column2 = value2, ...)
For example, to add a column of IDs for identifying each row in the data:
%%R
# Add column of ID numbers
penguins_with_ids <- dplyr::mutate(penguins, id = 1:nrow(penguins))
# Optional: Put the ID column at the beginning of the dataframe
penguins_with_ids <- dplyr::relocate(penguins_with_ids, id, .before = 1)
head(penguins_with_ids)
# A tibble: 6 × 9
id species island bill_length_mm bill_depth_mm flipper_length_mm
<int> <fct> <fct> <dbl> <dbl> <int>
1 1 Adelie Torgersen 39.1 18.7 181
2 2 Adelie Torgersen 39.5 17.4 186
3 3 Adelie Torgersen 40.3 18 195
4 4 Adelie Torgersen NA NA NA
5 5 Adelie Torgersen 36.7 19.3 193
6 6 Adelie Torgersen 39.3 20.6 190
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>
Another example: if we wanted to calculate a new value, the ratio of bill length to bill depth, we could do the following
%%R
dplyr::mutate(penguins, bill_ratio = bill_length_mm / bill_depth_mm)
# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, bill_ratio <dbl>
# ℹ Use `print(n = ...)` to see more rows
Note: the output of mutate is not just a new column on its own,
but the whole dataframe with the new column appended. To update penguins
to
have the new column, we need to overwrite it:
%%R
penguins <- dplyr::mutate(penguins, bill_ratio = bill_length_mm / bill_depth_mm)
colnames(penguins)
[1] "species" "island" "bill_length_mm"
[4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
[7] "sex" "year" "bill_ratio"
When we just want to change the name of a column, we can use the rename
function in dplyr
:
dplyr::rename(data, New_column_name1 = Old_column_name1, New_column_name2 = Old_column_name2, ...)
For example, we could change the name of some variables to remove the units (probably not recommended in general, but it serves as an example):
%%R
penguins_renamed <- dplyr::rename(penguins,
bill_length = bill_length_mm,
bill_depth = bill_depth_mm,
flipper_length = flipper_length_mm,
body_mass = body_mass_g
)
colnames(penguins_renamed)
[1] "species" "island" "bill_length" "bill_depth"
[5] "flipper_length" "body_mass" "sex" "year"
[9] "bill_ratio"
Exercise: creating variables#
Update the penguins
dataframe so that it contains an ID column
with IDs represented as strings.
Solution
%%R
# Solution 1
penguins <- dplyr::mutate(penguins, id = as.character(1:nrow(penguins)))
# Solution 2
penguins <- dplyr::mutate(penguins, id = 1:nrow(penguins)) # int IDs
penguins <- dplyr::mutate(penguins, id = as.character(id)) # update column to strings
Sorting observations#
We can sort the rows of data by the value of a particular column using the
arrange
function in dplyr
. By default, the sorting is performed in increasing
order, although we can use the desc()
function on the column to sort in
descending order:
dplyr::arrange(data, col) # sort rows by ascending order of values from column
# `col`
dplyr::arrange(data, dplyr::desc(col)) # sort rows by descending order of
# values from column `col`
For example, to sort the penguins
data by descending value of flipper length:
%%R
penguins_sorted_flipper <- dplyr::arrange(penguins, dplyr::desc(flipper_length_mm))
penguins_sorted_flipper
# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Gentoo Biscoe 54.3 15.7 231 5650
2 Gentoo Biscoe 50 16.3 230 5700
3 Gentoo Biscoe 59.6 17 230 6050
4 Gentoo Biscoe 49.8 16.8 230 5700
5 Gentoo Biscoe 48.6 16 230 5800
6 Gentoo Biscoe 52.1 17 230 5550
7 Gentoo Biscoe 51.5 16.3 230 5500
8 Gentoo Biscoe 55.1 16 230 5850
9 Gentoo Biscoe 49.5 16.2 229 5800
10 Gentoo Biscoe 49.8 15.9 229 5950
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, bill_ratio <dbl>
# ℹ Use `print(n = ...)` to see more rows
Exercise: sorting#
Sort the penguins
data by the id column we created in the previous exercise.
Explain the resulting order on the id column.
Solution
%%R
penguins_sorted_id <- dplyr::arrange(penguins, id)
head(dplyr::select(penguins_sorted_id, id, species, year, bill_length_mm))
Summarising / aggregating data#
Often we want to aggregate data at certain levels to better understand differences across groups. For instance, does flipper length differ by species? Does body mass change between years?
Summarising all the data#
First, we can use summarise
on its own, without any grouping, to get a single
summary from all the rows in the dataframe. Here are some examples using the
flipper length variable:
%%R
# Count number of observations
dplyr::summarise(penguins, num_rows = dplyr::n())
# Count number of different years in the data
dplyr::summarise(penguins, num_years = dplyr::n_distinct(year))
# Get the min and max year
dplyr::summarise(penguins, min_year = min(year), max_year = max(year))
# calculate the sum of flipper lengths for the whole data frame
dplyr::summarise(penguins, sum_flipper_length_mm = sum(flipper_length_mm))
# A tibble: 1 × 1
sum_flipper_length_mm
<int>
1 NA
If the result is NA
, it’s likely because the column contained missing values.
We include the na.rm = TRUE
optional argument to tell summarise
to remove
NA
s before calculating:
%%R
# With missing values removed
dplyr::summarise(penguins, sum_flipper_length_mm = sum(flipper_length_mm, na.rm = TRUE))
# A tibble: 1 × 1
sum_flipper_length_mm
<int>
1 68713
Summarising grouped data#
More generally, we can compute summaries for subgroups of the data by combining
the group_by
and summarise
(or summarize
) functions in dplyr
. The steps
to do this are:
First use
group_by
to declare the column (or columns) that you want to group the data by.Then use
summarise
to actually do the summarising on each of the subgroups from step 1. The resulting dataframe will have one summary per subgroup.
For example, to get the average flipper length of each sex:
%%R
# First group the data by sex
penguins_grouped_by_sex <- dplyr::group_by(penguins, sex)
# Then find the mean flipper length for each sex
dplyr::summarise(penguins_grouped_by_sex, mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE))
# A tibble: 3 × 2
sex mean_flipper_length_mm
<fct> <dbl>
1 female 197.
2 male 205.
3 <NA> 199
These functions are powerful. We can group by multiple columns at once to create pairwise groups. As we saw above, we can also create several summary variables at the same time:
dplyr::group_by(data, col1, col2, ...) # group by combinations of values in
# col1, col2, ...
# Compute multiple summaries on the same data
dplyr::summarise(grouped_data, summary1 = <...>, summary2 = <...>, ...)
Exercise: grouping and summarising#
Create a summary dataframe that gives the mean, range (i.e. max - min
)
and standard deviation of the body mass for each species / sex combination, as
well as the number of observations in each group.
Solution
%%R
penguins_grouped <- dplyr::group_by(penguins, species, sex)
dplyr::summarise(penguins_grouped,
mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
sd_body_mass_g = sd(body_mass_g, na.rm = TRUE),
range_body_mass_g = max(body_mass_g, na.rm = TRUE) - min(body_mass_g, na.rm = TRUE),
Group_count = dplyr::n())
An unimportant detail: grouped dataframes#
dplyr::group_by
returns a ‘grouped dataframe’ (dplyr::grouped_df
). These
behave just like tibbles/dataframes except they have some extra information
about the grouping attached to them.
When you group by multiple columns, the result of summarise
will typically
also be a grouped dataframe. Usually, the groups are given by all columns except
the last one in the grouped dataframe supplied to summarise
. By default, you
get a console message telling you on which columns the output is grouped by (see
the result of the exercise above).
The upshot of this is that it allows us to do successive summaries without having to keep regrouping the data. For example, the code below will produce a ‘max of averages’:
%%R
# Group by sex and species
penguins_grouped <- dplyr::group_by(penguins, sex, species)
# Average mass for each species / sex combination
# Result is grouped by sex (see console output)
penguins_grouped_mass <- dplyr::summarise(penguins_grouped,
mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))
print(penguins_grouped_mass)
# Max of the average mass over species, for each sex
penguins_max_species_avg_mass <- dplyr::summarise(
penguins_grouped_mass,
max_species_avg_body_mass_g = max(mean_body_mass_g, na.rm = TRUE)
)
print(penguins_max_species_avg_mass)
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
# A tibble: 8 × 3
# Groups: sex [3]
sex species mean_body_mass_g
<fct> <fct> <dbl>
1 female Adelie 3369.
2 female Chinstrap 3527.
3 female Gentoo 4680.
4 male Adelie 4043.
5 male Chinstrap 3939.
6 male Gentoo 5485.
7 <NA> Adelie 3540
8 <NA> Gentoo 4588.
# A tibble: 3 × 2
sex max_species_avg_body_mass_g
<fct> <dbl>
1 female 4680.
2 male 5485.
3 <NA> 4588.
In practice, you don’t have to worry about this detail. If you want to get back
to a plain old tibble/dataframe, use the dplyr::ungroup
function on a grouped
dataframe.
Acknowledgement#
The material in this notebook is adapted from Eliza Wood’s Tidyverse: Data wrangling & visualization course, which is licensed under Creative Commons BY-NC-SA 4.0. This in itself is based on material from UC Davis’s R-DAVIS course, which draws heavily on Carpentries R lessons.