Manipulating A Dataframe#

Learning Objectives#

  • Learn how to subset data by selecting specific columns and filtering rows based on conditions using the dplyr package

  • Understand how to create new variables and rename existing ones within a dataframe

  • Understand sorting data based on one or more variables

  • Learn how to summarise and aggregate data, including calculating summary statistics for the entire dataset and for grouped data

  • Develop techniques for identifying and filtering out missing values in a dataset

Introduction#

The package dplyr from the Tidyverse contains functions for manipulating dataframes. We will use the penguins dataset contained in the palmerpenguins package. Note that this data is already in tidy format.

%%R
library(palmerpenguins)  # loads `penguins` data
head(penguins)
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

Sub-setting data#

For taking subsets of data, we can use the functions select and filter from dplyr:

  • dplyr::select is used to keep only specified columns of the dataframe, i.e. to ‘select’ certain variables from the data.

  • dplyr::filter is used to keep rows that meet some specified condition, i.e. to ‘filter’ the observations in the data.

Selecting columns#

To use select, name the columns to keep after supplying the dataframe (or tibble), like so:

dplyr::select(data, column1, column2, ...)

Note: We don’t use strings to specify the columns! Instead, we write them as if they were variables e.g. dplyr::select(data, column1) instead of dplyr::select(data, "column1").

Let’s first check the column names in our tibble:

%%R
colnames(penguins)
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"             

Now let’s select the ‘species’, ‘year’, ‘sex’, and ‘body mass’ columns:

%%R
# Select species, year, sex, and body mass
penguins_selected <- dplyr::select(penguins, species, year, sex, body_mass_g)

head(penguins_selected)
# A tibble: 6 × 4
  species  year sex    body_mass_g
  <fct>   <int> <fct>        <int>
1 Adelie   2007 male          3750
2 Adelie   2007 female        3800
3 Adelie   2007 female        3250
4 Adelie   2007 <NA>            NA
5 Adelie   2007 female        3450
6 Adelie   2007 male          3650

The select function can also be used to remove columns by using the - operator. For example, we may want to remove only two columns, ‘island’ and ‘year’:

%%R
# Remove the island and year columns by using `-`.
penguins_deselected <- dplyr::select(penguins, -island, -year)

# Take a look at the columns
colnames(penguins_deselected)
[1] "species"           "bill_length_mm"    "bill_depth_mm"    
[4] "flipper_length_mm" "body_mass_g"       "sex"              

Filtering rows#

The filter function from dplyr is used to keep rows from a dataframe/tibble that meet a predicate condition in terms of the column values, i.e. a statement that is either TRUE or FALSE.

Comparitive operators#

One way to create predicate conditions is to use the comparative operators, which should be familiar:

x == y   : equal
x != y   : not equal
x > y    : greater than
x < y    : less than
x >= y   : greater than or equal
x <= y   : less than or equal

For example, suppose we want to keep all penguin data only from the year 2008. In base R, we could accomplish this by indexing on a boolean vector created using the == operator:

%%R
# Create boolean vector indicating where year = 2008
is_year_2008 <- penguins$year == 2008
head(is_year_2008, n = 75)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[73]  TRUE  TRUE  TRUE
%%R
# Index on this vector
head(penguins[is_year_2008,])
# A tibble: 6 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
1 Adelie  Biscoe           39.6          17.7               186        3500
2 Adelie  Biscoe           40.1          18.9               188        4300
3 Adelie  Biscoe           35            17.9               190        3450
4 Adelie  Biscoe           42            19.5               200        4050
5 Adelie  Biscoe           34.5          18.1               187        2900
6 Adelie  Biscoe           41.4          18.6               191        3700
# ℹ 2 more variables: sex <fct>, year <int>

The filter function works by specifying the predicate condition for filtering in terms of the column names, like so:

dplyr::filter(data, condition)  # keep rows satisfying 'condition'

So to filter the penguins data to keep only the 2008 observations:

%%R
penguins_2008 <- dplyr::filter(penguins, year == 2008)
head(penguins_2008)
# A tibble: 6 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
1 Adelie  Biscoe           39.6          17.7               186        3500
2 Adelie  Biscoe           40.1          18.9               188        4300
3 Adelie  Biscoe           35            17.9               190        3450
4 Adelie  Biscoe           42            19.5               200        4050
5 Adelie  Biscoe           34.5          18.1               187        2900
6 Adelie  Biscoe           41.4          18.6               191        3700
# ℹ 2 more variables: sex <fct>, year <int>

Logical operators#

We can build more complicated predicate conditions by using the logical operators on columns:

A & B   : A AND B   e.g. col1 == 2 & col2 == 10
A | B   : A OR B    e.g. col1 > 2 | col2 != 10
!A      : NOT A     e.g. !(col1 < 2)

In addition, filter allows us to specify multiple AND operations by listing out multiple predicate conditions:

dplyr::filter(data, condition1, condition2, ...)  # keep rows satisfying
                                                  # 'condition1' AND 'condition2' AND ...

Some examples on the penguin data:

%%R
# Observations of male penguins from the year 2008:
dplyr::filter(penguins, sex == "male" & year == 2008)  # using &
dplyr::filter(penguins, sex == "male", year == 2008)  # listing multiple conditions in filter

# Observations of penguins with bill length > 40mm or bill depth < 20mm 
dplyr::filter(penguins, bill_length_mm > 40 | bill_depth_mm < 20)

# Observations except those of male penguins on the island of Biscoe
dplyr::filter(penguins, !(sex == "male" & island == "Biscoe"))
# A tibble: 256 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 246 more rows
# ℹ 2 more variables: sex <fct>, year <int>
# ℹ Use `print(n = ...)` to see more rows

Filtering missing values#

There are other functions we can use to evaluate columns and get a true/false output. An important one is is.na(). This function evaluates a column and reports back a TRUE value when there is an NA in that column’s row. For example, we can use the head() function to look at the top 6 values of the sex column, and see that there is an NA in the fourth row.

%%R
head(penguins$sex)
[1] male   female female <NA>   female male  
Levels: female male

The is.na function evaluates the whole column and gives us TRUEs whenever it sees an NA. Not surprisingly, we see a TRUE in the fourth observation.

%%R
head(is.na(penguins$sex))
[1] FALSE FALSE FALSE  TRUE FALSE FALSE

Since is.na() gives us a TRUE/FALSE vector, we can use it with filter. The following gives us all rows where the sex column has missing data (i.e. is set to NA):

%%R
dplyr::filter(penguins, is.na(sex))
# A tibble: 11 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           NA            NA                  NA          NA
 2 Adelie  Torgersen           34.1          18.1               193        3475
 3 Adelie  Torgersen           42            20.2               190        4250
 4 Adelie  Torgersen           37.8          17.1               186        3300
 5 Adelie  Torgersen           37.8          17.3               180        3700
 6 Adelie  Dream               37.5          18.9               179        2975
 7 Gentoo  Biscoe              44.5          14.3               216        4100
 8 Gentoo  Biscoe              46.2          14.4               214        4650
 9 Gentoo  Biscoe              47.3          13.8               216        4725
10 Gentoo  Biscoe              44.5          15.7               217        4875
11 Gentoo  Biscoe              NA            NA                  NA          NA
# ℹ 2 more variables: sex <fct>, year <int>

To do the reverse, i.e. keep only the observations where the sex value is not missing, combine is.na with the NOT operator:

%%R
dplyr::filter(penguins, !is.na(sex))
# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           41.1          17.6               182        3200
 9 Adelie  Torgersen           38.6          21.2               191        3800
10 Adelie  Torgersen           34.6          21.1               198        4400
# ℹ 323 more rows
# ℹ 2 more variables: sex <fct>, year <int>
# ℹ Use `print(n = ...)` to see more rows

Exercise: subsetting data#

Write code to extract data for Gentoo penguins with weight between 4.8kg and 5.2kg (inclusive). Only keep the species, sex and body mass columns in the output. Give the numbers of resulting observations for male penguins, female penguins and penguins with unknown sex.

Creating new variables and renaming variables#

We often want to make a new column with some updated or transformed value. We can use the mutate function in dplyr for this:

# value1, value2,... are expressions, possibly involving column names

dplyr::mutate(data, new_column1 = value1, new_column2 = value2, ...)

For example, to add a column of IDs for identifying each row in the data:

%%R
# Add column of ID numbers
penguins_with_ids <- dplyr::mutate(penguins, id = 1:nrow(penguins))

# Optional: Put the ID column at the beginning of the dataframe
penguins_with_ids <- dplyr::relocate(penguins_with_ids, id, .before = 1)

head(penguins_with_ids)
# A tibble: 6 × 9
     id species island    bill_length_mm bill_depth_mm flipper_length_mm
  <int> <fct>   <fct>              <dbl>         <dbl>             <int>
1     1 Adelie  Torgersen           39.1          18.7               181
2     2 Adelie  Torgersen           39.5          17.4               186
3     3 Adelie  Torgersen           40.3          18                 195
4     4 Adelie  Torgersen           NA            NA                  NA
5     5 Adelie  Torgersen           36.7          19.3               193
6     6 Adelie  Torgersen           39.3          20.6               190
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>

Another example: if we wanted to calculate a new value, the ratio of bill length to bill depth, we could do the following

%%R
dplyr::mutate(penguins, bill_ratio = bill_length_mm / bill_depth_mm)
# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, bill_ratio <dbl>
# ℹ Use `print(n = ...)` to see more rows

Note: the output of mutate is not just a new column on its own, but the whole dataframe with the new column appended. To update penguins to have the new column, we need to overwrite it:

%%R
penguins <- dplyr::mutate(penguins, bill_ratio = bill_length_mm / bill_depth_mm)
colnames(penguins)
[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"              "bill_ratio"       

When we just want to change the name of a column, we can use the rename function in dplyr:

dplyr::rename(data, New_column_name1 = Old_column_name1, New_column_name2 = Old_column_name2, ...)

For example, we could change the name of some variables to remove the units (probably not recommended in general, but it serves as an example):

%%R
penguins_renamed <- dplyr::rename(penguins,
                                  bill_length = bill_length_mm,
                                  bill_depth = bill_depth_mm,
                                  flipper_length = flipper_length_mm,
                                  body_mass = body_mass_g
                                  )
colnames(penguins_renamed)
[1] "species"        "island"         "bill_length"    "bill_depth"    
[5] "flipper_length" "body_mass"      "sex"            "year"          
[9] "bill_ratio"    

Exercise: creating variables#

Update the penguins dataframe so that it contains an ID column with IDs represented as strings.

Sorting observations#

We can sort the rows of data by the value of a particular column using the arrange function in dplyr. By default, the sorting is performed in increasing order, although we can use the desc() function on the column to sort in descending order:

dplyr::arrange(data, col)  # sort rows by ascending order of values from column
                           # `col`

dplyr::arrange(data, dplyr::desc(col))  # sort rows by descending order of
                                        # values from column `col`

For example, to sort the penguins data by descending value of flipper length:

%%R
penguins_sorted_flipper <- dplyr::arrange(penguins, dplyr::desc(flipper_length_mm))
penguins_sorted_flipper
# A tibble: 344 × 9
   species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
 1 Gentoo  Biscoe           54.3          15.7               231        5650
 2 Gentoo  Biscoe           50            16.3               230        5700
 3 Gentoo  Biscoe           59.6          17                 230        6050
 4 Gentoo  Biscoe           49.8          16.8               230        5700
 5 Gentoo  Biscoe           48.6          16                 230        5800
 6 Gentoo  Biscoe           52.1          17                 230        5550
 7 Gentoo  Biscoe           51.5          16.3               230        5500
 8 Gentoo  Biscoe           55.1          16                 230        5850
 9 Gentoo  Biscoe           49.5          16.2               229        5800
10 Gentoo  Biscoe           49.8          15.9               229        5950
# ℹ 334 more rows
# ℹ 3 more variables: sex <fct>, year <int>, bill_ratio <dbl>
# ℹ Use `print(n = ...)` to see more rows

Exercise: sorting#

Sort the penguins data by the id column we created in the previous exercise. Explain the resulting order on the id column.

Summarising / aggregating data#

Often we want to aggregate data at certain levels to better understand differences across groups. For instance, does flipper length differ by species? Does body mass change between years?

Summarising all the data#

First, we can use summarise on its own, without any grouping, to get a single summary from all the rows in the dataframe. Here are some examples using the flipper length variable:

%%R
# Count number of observations
dplyr::summarise(penguins, num_rows = dplyr::n())

# Count number of different years in the data
dplyr::summarise(penguins, num_years = dplyr::n_distinct(year))

# Get the min and max year
dplyr::summarise(penguins, min_year = min(year), max_year = max(year))

# calculate the sum of flipper lengths for the whole data frame
dplyr::summarise(penguins, sum_flipper_length_mm = sum(flipper_length_mm))
# A tibble: 1 × 1
  sum_flipper_length_mm
                  <int>
1                    NA

If the result is NA, it’s likely because the column contained missing values. We include the na.rm = TRUE optional argument to tell summarise to remove NAs before calculating:

%%R
# With missing values removed
dplyr::summarise(penguins, sum_flipper_length_mm = sum(flipper_length_mm, na.rm = TRUE))
# A tibble: 1 × 1
  sum_flipper_length_mm
                  <int>
1                 68713

Summarising grouped data#

More generally, we can compute summaries for subgroups of the data by combining the group_by and summarise (or summarize) functions in dplyr. The steps to do this are:

  1. First use group_by to declare the column (or columns) that you want to group the data by.

  2. Then use summarise to actually do the summarising on each of the subgroups from step 1. The resulting dataframe will have one summary per subgroup.

For example, to get the average flipper length of each sex:

%%R
# First group the data by sex
penguins_grouped_by_sex <- dplyr::group_by(penguins, sex)

# Then find the mean flipper length for each sex
dplyr::summarise(penguins_grouped_by_sex, mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE))
# A tibble: 3 × 2
  sex    mean_flipper_length_mm
  <fct>                   <dbl>
1 female                   197.
2 male                     205.
3 <NA>                     199 

These functions are powerful. We can group by multiple columns at once to create pairwise groups. As we saw above, we can also create several summary variables at the same time:

dplyr::group_by(data, col1, col2, ...)  # group by combinations of values in
                                        # col1, col2, ...

# Compute multiple summaries on the same data
dplyr::summarise(grouped_data, summary1 = <...>, summary2 = <...>, ...)

Exercise: grouping and summarising#

Create a summary dataframe that gives the mean, range (i.e. max - min) and standard deviation of the body mass for each species / sex combination, as well as the number of observations in each group.

An unimportant detail: grouped dataframes#

dplyr::group_by returns a ‘grouped dataframe’ (dplyr::grouped_df). These behave just like tibbles/dataframes except they have some extra information about the grouping attached to them.

When you group by multiple columns, the result of summarise will typically also be a grouped dataframe. Usually, the groups are given by all columns except the last one in the grouped dataframe supplied to summarise. By default, you get a console message telling you on which columns the output is grouped by (see the result of the exercise above).

The upshot of this is that it allows us to do successive summaries without having to keep regrouping the data. For example, the code below will produce a ‘max of averages’:

%%R
# Group by sex and species
penguins_grouped <- dplyr::group_by(penguins, sex, species)

# Average mass for each species / sex combination
# Result is grouped by sex (see console output)
penguins_grouped_mass <- dplyr::summarise(penguins_grouped,
                                          mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))
print(penguins_grouped_mass)

# Max of the average mass over species, for each sex
penguins_max_species_avg_mass <- dplyr::summarise(
  penguins_grouped_mass,
  max_species_avg_body_mass_g = max(mean_body_mass_g, na.rm = TRUE)
  )
print(penguins_max_species_avg_mass)
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
# A tibble: 8 × 3
# Groups:   sex [3]
  sex    species   mean_body_mass_g
  <fct>  <fct>                <dbl>
1 female Adelie               3369.
2 female Chinstrap            3527.
3 female Gentoo               4680.
4 male   Adelie               4043.
5 male   Chinstrap            3939.
6 male   Gentoo               5485.
7 <NA>   Adelie               3540 
8 <NA>   Gentoo               4588.
# A tibble: 3 × 2
  sex    max_species_avg_body_mass_g
  <fct>                        <dbl>
1 female                       4680.
2 male                         5485.
3 <NA>                         4588.

In practice, you don’t have to worry about this detail. If you want to get back to a plain old tibble/dataframe, use the dplyr::ungroup function on a grouped dataframe.

Acknowledgement#

The material in this notebook is adapted from Eliza Wood’s Tidyverse: Data wrangling & visualization course, which is licensed under Creative Commons BY-NC-SA 4.0. This in itself is based on material from UC Davis’s R-DAVIS course, which draws heavily on Carpentries R lessons.

Summary Quiz#