Piping and Functional Style

Learning Objectives

Understand the functional philosophy of the Tidyverse and its importance in data analysis
Identify the role of different Tidyverse packages and functions in data manipulation
Use the piping operator to streamline and simplify data manipulation workflows
Implement chaining of multiple functions using the pipe operator to enhance code readability and efficiency
Apply Tidyverse functions to perform common data manipulation tasks, such as selecting, filtering, mutating and summarizing data

The functional philosophy of Tidyverse

The Tidyverse packages are built around a philosophy of having functions that work well together. Let’s look at some of the functions we’ve seen so far:

Package	Function	Data input type	Output type
`readr`	`read_csv`	string (file path)	dataframe
`dplyr`	`filter`	dataframe	dataframe
`dplyr`	`select`	dataframe	dataframe
`dplyr`	`mutate`	dataframe	dataframe
`dplyr`	`relocate`	dataframe	dataframe
`dplyr`	`arrange`	dataframe	dataframe
`dplyr`	`summarise`	dataframe / grouped dataframe	dataframe / grouped dataframe
`dplyr`	`group_by`	dataframe	grouped dataframe

Notice how consistent these are in accepting a dataframe as input and returning a dataframe (or something like a dataframe) as output. It is this consistency that makes it easy to compose functions.
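As a rough illustration of this consistency, the sketch below feeds the output of one function into the next by hand. The dataframe `df` and its columns `group` and `value` are made up for this example and are not part of the course data; it assumes the dplyr package is installed.

```r
# Hypothetical toy data, invented for illustration only
df <- data.frame(group = c("a", "a", "b"), value = c(1, 2, 10))

# Each step takes a dataframe and returns a dataframe (or grouped dataframe)
filtered <- dplyr::filter(df, value > 1)
grouped  <- dplyr::group_by(filtered, group)
result   <- dplyr::summarise(grouped, total = sum(value))

# Every intermediate result is still a dataframe,
# so it can be passed straight to the next function
is.data.frame(filtered)
is.data.frame(grouped)
is.data.frame(result)
```

Because each result is itself a dataframe, any of these functions could be applied next in any order that makes sense for the analysis.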

In addition, it’s important to note that none of the functions we’ve seen so far modify the input dataframe ‘in place’; they always return a new dataframe. (This is even the case for the mutate function, despite the name!) This encourages a ‘pipeline’ approach, where we successively take the output of one function and feed it in as the input to another function.
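A small sketch can make this concrete. The dataframe `df` and the column names `x` and `y` are hypothetical, chosen only for this example; it assumes dplyr is installed.

```r
# Hypothetical toy data, invented for illustration only
df <- data.frame(x = c(1, 2, 3))

# mutate() returns a new dataframe with the extra column y
df_new <- dplyr::mutate(df, y = x * 2)

names(df)      # the original dataframe still only has the column "x"
names(df_new)  # the returned dataframe has both "x" and "y"
```

To keep the result of a transformation, it has to be assigned to a variable (either a new one, or the original name if overwriting is intended).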

Piping operator

Since version 4.1, R has had a pipe operator |> which facilitates chaining function calls.

Note: Before a pipe operator was introduced into the base R language, people used another pipe operator %>% from the magrittr package. It is still very common to see people using the magrittr pipe and it behaves in a similar way to base R’s |> (though there are some minor differences). You can use whichever you like, though for this course we’ll use the base R pipe.

The pipe operator takes the value on the left-hand side and ‘feeds it into’ the first argument of a function call on the right-hand side. So, using dplyr::select as an example, the following two expressions are equivalent ways to select two columns colA and colB from a dataframe data:

dplyr::select(data, colA, colB)

data |> dplyr::select(colA, colB)

Notice how it looks like the dataframe argument to dplyr::select is missing in the second expression. The pipe takes care of putting data into the first argument slot, and the remaining slots are available for all other arguments (in this case, the column names colA and colB).
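One way to check this equivalence is to run both forms on a small dataframe and compare the results. The values below are invented for illustration (only the names data, colA and colB come from the text above, and colC is an extra hypothetical column). The last part assumes the magrittr package is installed and shows that, for this simple case, its pipe gives the same result.

```r
# Hypothetical toy data, invented for illustration only
data <- data.frame(
  colA = c(1, 2, 3),
  colB = c("x", "y", "z"),
  colC = c(TRUE, FALSE, TRUE)
)

# Ordinary function call: the dataframe is the first argument
direct <- dplyr::select(data, colA, colB)

# Base R pipe: the dataframe is supplied from the left-hand side
piped <- data |> dplyr::select(colA, colB)

identical(direct, piped)  # compares the two results

# The magrittr pipe (assumes magrittr is installed) used the same way
library(magrittr)
piped_magrittr <- data %>% dplyr::select(colA, colB)

identical(piped, piped_magrittr)
```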

We can now chain multiple pipe calls together, like so:

data |> dplyr::select(colA, colB) |> dplyr::filter(colA > 10)

In words, the above expression:

1. takes a variable data, then
2. selects columns colA and colB, then
3. takes the result of step 2 and filters it to keep rows where colA > 10.
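For comparison, here is a sketch of the same chain written without the pipe, as a nested function call. The data values are hypothetical, and the example assumes dplyr is installed. The nested version has to be read from the inside out, whereas the piped version reads in the order the steps happen.

```r
# Hypothetical toy data, invented for illustration only
data <- data.frame(
  colA = c(5, 12, 30),
  colB = c("x", "y", "z"),
  colC = c(TRUE, FALSE, TRUE)
)

# Nested form: select() is evaluated first, even though filter() is written first
nested <- dplyr::filter(dplyr::select(data, colA, colB), colA > 10)

# Piped form: the steps appear in the order they are carried out
piped <- data |> dplyr::select(colA, colB) |> dplyr::filter(colA > 10)

identical(nested, piped)  # compares the two results
```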

Usually, people will put successive pipe calls on new lines to make it easier to read, like so:

data |>
  dplyr::select(colA, colB) |>
  dplyr::filter(colA > 10)

If we want to store the overall result in a new variable data_transformed, we can do so by putting the assignment data_transformed <- at the beginning of the chain, like this:

data_transformed <- data |>
  dplyr::select(colA, colB) |>
  dplyr::filter(colA > 10)

Exercise: piping

The following code calculates the average bill ratio for penguins observed in the year 2009.

library(palmerpenguins)  # loads `penguins` data

# Select year, species, bill_length_mm, bill_depth_mm
penguins1 <- dplyr::select(penguins, year, species, bill_length_mm, bill_depth_mm)

# Filter to keep year 2009 data only
penguins2 <- dplyr::filter(penguins1, year == 2009)

# Compute new bill_ratio column
penguins3 <- dplyr::mutate(penguins2, bill_ratio = bill_length_mm / bill_depth_mm)

# Compute average bill ratio
penguins4 <- dplyr::summarise(penguins3, mean_bill_ratio = mean(bill_ratio, na.rm = TRUE))

print(penguins4)

# A tibble: 1 × 1
  mean_bill_ratio
            <dbl>
1            2.63

Rewrite this code to use the pipe operator instead of storing each intermediate dataframe, storing the overall result in a new variable called summarised.

Solution

summarised <- penguins |>
  dplyr::select(year, species, bill_length_mm, bill_depth_mm) |>
  dplyr::filter(year == 2009) |>
  dplyr::mutate(bill_ratio = bill_length_mm / bill_depth_mm) |>
  dplyr::summarise(mean_bill_ratio = mean(bill_ratio, na.rm = TRUE))

print(summarised)

Summary Quiz

What is the main benefit of using the pipe operator in R?

How does the pipe operator `%>%` from the `magrittr` package behave compared to the base R pipe operator `|>`?

What is the purpose of the `mutate()` function in `dplyr`?