Manipulating a Dataframe#

Download Rmd Version#

If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.

Exercises incomplete: Download Manipulating_a_dataframe.Rmd Exercises complete: Download Manipulating_a_dataframe_complete.Rmd

Learning Objectives#

  • Learn how to subset data by selecting specific columns and filtering rows based on conditions using the dplyr package

  • Understand how to create new variables and rename existing ones within a dataframe

  • Understand sorting data based on one or more variables

  • Learn how to summarise and aggregate data, including calculating summary statistics for the entire dataset and for grouped data

  • Develop techniques for identifying and filtering out missing values in a dataset

Introduction#

The package dplyr from the Tidyverse contains functions for manipulating dataframes. We will use the penguins dataset contained in the palmerpenguins package. Note that this data is already in tidy format.

library(palmerpenguins)  # loads `penguins` data
head(penguins)
A tibble: 6 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
AdelieTorgersen39.118.71813750male 2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen NA NA NA NANA 2007
AdelieTorgersen36.719.31933450female2007
AdelieTorgersen39.320.61903650male 2007

Sub-setting data#

For taking subsets of data, we can use the functions select and filter from dplyr:

  • dplyr::select is used to keep only specified columns of the dataframe, i.e. to ‘select’ certain variables from the data.

  • dplyr::filter is used to keep rows that meet some specified condition, i.e. to ‘filter’ the observations in the data.

Selecting columns#

To use select, name the columns to keep after supplying the dataframe (or tibble), like so:

dplyr::select(data, column1, column2, ...)

Note: We don’t use strings to specify the columns! Instead, we write them as if they were variables e.g. dplyr::select(data, column1) instead of dplyr::select(data, "column1").

Let’s first check the column names in our tibble:

colnames(penguins)
  1. 'species'
  2. 'island'
  3. 'bill_length_mm'
  4. 'bill_depth_mm'
  5. 'flipper_length_mm'
  6. 'body_mass_g'
  7. 'sex'
  8. 'year'

Now let’s select the ‘species’, ‘year’, ‘sex’, and ‘body mass’ columns:

# Select species, year, sex, and body mass
penguins_selected <- dplyr::select(penguins, species, year, sex, body_mass_g)

head(penguins_selected)
A tibble: 6 × 4
speciesyearsexbody_mass_g
<fct><int><fct><int>
Adelie2007male 3750
Adelie2007female3800
Adelie2007female3250
Adelie2007NA NA
Adelie2007female3450
Adelie2007male 3650

The select function can also be used to remove columns by using the - operator. For example, we may want to remove only two columns, ‘island’ and ‘year’:

# Remove the island and year columns by using `-`.
penguins_deselected <- dplyr::select(penguins, -island, -year)

# Take a look at the columns
colnames(penguins_deselected)
  1. 'species'
  2. 'bill_length_mm'
  3. 'bill_depth_mm'
  4. 'flipper_length_mm'
  5. 'body_mass_g'
  6. 'sex'

Filtering rows#

The filter function from dplyr is used to keep rows from a dataframe/tibble that meet a predicate condition in terms of the column values, i.e. a statement that is either TRUE or FALSE.

Comparitive operators#

One way to create predicate conditions is to use the comparative operators, which should be familiar:

x == y   : equal
x != y   : not equal
x > y    : greater than
x < y    : less than
x >= y   : greater than or equal
x <= y   : less than or equal

For example, suppose we want to keep all penguin data only from the year 2008. In base R, we could accomplish this by indexing on a boolean vector created using the == operator:

# Create boolean vector indicating where year = 2008
is_year_2008 <- penguins$year == 2008
head(is_year_2008, n = 75)
  1. FALSE
  2. FALSE
  3. FALSE
  4. FALSE
  5. FALSE
  6. FALSE
  7. FALSE
  8. FALSE
  9. FALSE
  10. FALSE
  11. FALSE
  12. FALSE
  13. FALSE
  14. FALSE
  15. FALSE
  16. FALSE
  17. FALSE
  18. FALSE
  19. FALSE
  20. FALSE
  21. FALSE
  22. FALSE
  23. FALSE
  24. FALSE
  25. FALSE
  26. FALSE
  27. FALSE
  28. FALSE
  29. FALSE
  30. FALSE
  31. FALSE
  32. FALSE
  33. FALSE
  34. FALSE
  35. FALSE
  36. FALSE
  37. FALSE
  38. FALSE
  39. FALSE
  40. FALSE
  41. FALSE
  42. FALSE
  43. FALSE
  44. FALSE
  45. FALSE
  46. FALSE
  47. FALSE
  48. FALSE
  49. FALSE
  50. FALSE
  51. TRUE
  52. TRUE
  53. TRUE
  54. TRUE
  55. TRUE
  56. TRUE
  57. TRUE
  58. TRUE
  59. TRUE
  60. TRUE
  61. TRUE
  62. TRUE
  63. TRUE
  64. TRUE
  65. TRUE
  66. TRUE
  67. TRUE
  68. TRUE
  69. TRUE
  70. TRUE
  71. TRUE
  72. TRUE
  73. TRUE
  74. TRUE
  75. TRUE
# Index on this vector
head(penguins[is_year_2008,])
A tibble: 6 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
AdelieBiscoe39.617.71863500female2008
AdelieBiscoe40.118.91884300male 2008
AdelieBiscoe35.017.91903450female2008
AdelieBiscoe42.019.52004050male 2008
AdelieBiscoe34.518.11872900female2008
AdelieBiscoe41.418.61913700male 2008

The filter function works by specifying the predicate condition for filtering in terms of the column names, like so:

dplyr::filter(data, condition)  # keep rows satisfying 'condition'

So to filter the penguins data to keep only the 2008 observations:

penguins_2008 <- dplyr::filter(penguins, year == 2008)
head(penguins_2008)
A tibble: 6 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
AdelieBiscoe39.617.71863500female2008
AdelieBiscoe40.118.91884300male 2008
AdelieBiscoe35.017.91903450female2008
AdelieBiscoe42.019.52004050male 2008
AdelieBiscoe34.518.11872900female2008
AdelieBiscoe41.418.61913700male 2008

Logical operators#

We can build more complicated predicate conditions by using the logical operators on columns:

A & B   : A AND B   e.g. col1 == 2 & col2 == 10
A | B   : A OR B    e.g. col1 > 2 | col2 != 10
!A      : NOT A     e.g. !(col1 < 2)

In addition, filter allows us to specify multiple AND operations by listing out multiple predicate conditions:

dplyr::filter(data, condition1, condition2, ...)  # keep rows satisfying
                                                  # 'condition1' AND 'condition2' AND ...

Some examples on the penguin data:

# Observations of male penguins from the year 2008:
dplyr::filter(penguins, sex == "male" & year == 2008)  # using &
dplyr::filter(penguins, sex == "male", year == 2008)  # listing multiple conditions in filter

# Observations of penguins with bill length > 40mm or bill depth < 20mm 
dplyr::filter(penguins, bill_length_mm > 40 | bill_depth_mm < 20)

# Observations except those of male penguins on the island of Biscoe
dplyr::filter(penguins, !(sex == "male" & island == "Biscoe"))
A tibble: 57 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
Adelie Biscoe 40.118.91884300male2008
Adelie Biscoe 42.019.52004050male2008
Adelie Biscoe 41.418.61913700male2008
Adelie Biscoe 40.618.81933800male2008
Adelie Biscoe 37.619.11943750male2008
Adelie Biscoe 41.321.11954400male2008
Adelie Biscoe 41.118.21924050male2008
Adelie Biscoe 41.618.01923950male2008
Adelie Biscoe 41.119.11884100male2008
Adelie Torgersen41.819.41984450male2008
Adelie Torgersen39.718.41903900male2008
Adelie Torgersen45.818.91974150male2008
Adelie Torgersen42.818.51954250male2008
Adelie Torgersen37.219.41843900male2008
Adelie Torgersen42.119.11954000male2008
Adelie Torgersen42.917.61964700male2008
Adelie Torgersen35.119.41934200male2008
Adelie Dream 41.320.31943550male2008
Adelie Dream 36.319.51903800male2008
Adelie Dream 38.319.21893950male2008
Adelie Dream 41.118.12054300male2008
Adelie Dream 39.618.11864450male2008
Adelie Dream 40.818.92084300male2008
Adelie Dream 40.318.51964350male2008
Adelie Dream 43.218.51924100male2008
Gentoo Biscoe 48.416.32205400male2008
Gentoo Biscoe 44.417.32195250male2008
Gentoo Biscoe 48.715.72085350male2008
Gentoo Biscoe 49.616.02255700male2008
Gentoo Biscoe 49.615.02164750male2008
Gentoo Biscoe 50.515.92225550male2008
Gentoo Biscoe 50.515.92255400male2008
Gentoo Biscoe 45.215.82155300male2008
Gentoo Biscoe 48.514.12205300male2008
Gentoo Biscoe 50.115.02255000male2008
Gentoo Biscoe 45.015.42205050male2008
Gentoo Biscoe 45.515.02205000male2008
Gentoo Biscoe 50.415.32245550male2008
Gentoo Biscoe 46.214.92215300male2008
Gentoo Biscoe 54.315.72315650male2008
Gentoo Biscoe 49.816.82305700male2008
Gentoo Biscoe 49.516.22295800male2008
Gentoo Biscoe 50.715.02235550male2008
Gentoo Biscoe 46.415.62215000male2008
Gentoo Biscoe 48.215.62215100male2008
Gentoo Biscoe 48.616.02305800male2008
Gentoo Biscoe 51.116.32206000male2008
Gentoo Biscoe 45.216.42235950male2008
ChinstrapDream 49.519.02003800male2008
ChinstrapDream 52.820.02054550male2008
ChinstrapDream 54.220.82014300male2008
ChinstrapDream 51.018.82034100male2008
ChinstrapDream 49.718.61953600male2008
ChinstrapDream 52.020.72104800male2008
ChinstrapDream 53.519.92054500male2008
ChinstrapDream 49.019.52103950male2008
ChinstrapDream 50.919.11963550male2008
A tibble: 57 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
Adelie Biscoe 40.118.91884300male2008
Adelie Biscoe 42.019.52004050male2008
Adelie Biscoe 41.418.61913700male2008
Adelie Biscoe 40.618.81933800male2008
Adelie Biscoe 37.619.11943750male2008
Adelie Biscoe 41.321.11954400male2008
Adelie Biscoe 41.118.21924050male2008
Adelie Biscoe 41.618.01923950male2008
Adelie Biscoe 41.119.11884100male2008
Adelie Torgersen41.819.41984450male2008
Adelie Torgersen39.718.41903900male2008
Adelie Torgersen45.818.91974150male2008
Adelie Torgersen42.818.51954250male2008
Adelie Torgersen37.219.41843900male2008
Adelie Torgersen42.119.11954000male2008
Adelie Torgersen42.917.61964700male2008
Adelie Torgersen35.119.41934200male2008
Adelie Dream 41.320.31943550male2008
Adelie Dream 36.319.51903800male2008
Adelie Dream 38.319.21893950male2008
Adelie Dream 41.118.12054300male2008
Adelie Dream 39.618.11864450male2008
Adelie Dream 40.818.92084300male2008
Adelie Dream 40.318.51964350male2008
Adelie Dream 43.218.51924100male2008
Gentoo Biscoe 48.416.32205400male2008
Gentoo Biscoe 44.417.32195250male2008
Gentoo Biscoe 48.715.72085350male2008
Gentoo Biscoe 49.616.02255700male2008
Gentoo Biscoe 49.615.02164750male2008
Gentoo Biscoe 50.515.92225550male2008
Gentoo Biscoe 50.515.92255400male2008
Gentoo Biscoe 45.215.82155300male2008
Gentoo Biscoe 48.514.12205300male2008
Gentoo Biscoe 50.115.02255000male2008
Gentoo Biscoe 45.015.42205050male2008
Gentoo Biscoe 45.515.02205000male2008
Gentoo Biscoe 50.415.32245550male2008
Gentoo Biscoe 46.214.92215300male2008
Gentoo Biscoe 54.315.72315650male2008
Gentoo Biscoe 49.816.82305700male2008
Gentoo Biscoe 49.516.22295800male2008
Gentoo Biscoe 50.715.02235550male2008
Gentoo Biscoe 46.415.62215000male2008
Gentoo Biscoe 48.215.62215100male2008
Gentoo Biscoe 48.616.02305800male2008
Gentoo Biscoe 51.116.32206000male2008
Gentoo Biscoe 45.216.42235950male2008
ChinstrapDream 49.519.02003800male2008
ChinstrapDream 52.820.02054550male2008
ChinstrapDream 54.220.82014300male2008
ChinstrapDream 51.018.82034100male2008
ChinstrapDream 49.718.61953600male2008
ChinstrapDream 52.020.72104800male2008
ChinstrapDream 53.519.92054500male2008
ChinstrapDream 49.019.52103950male2008
ChinstrapDream 50.919.11963550male2008
A tibble: 333 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
AdelieTorgersen39.118.71813750male 2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen36.719.31933450female2007
AdelieTorgersen38.917.81813625female2007
AdelieTorgersen39.219.61954675male 2007
AdelieTorgersen34.118.11933475NA 2007
AdelieTorgersen42.020.21904250NA 2007
AdelieTorgersen37.817.11863300NA 2007
AdelieTorgersen37.817.31803700NA 2007
AdelieTorgersen41.117.61823200female2007
AdelieTorgersen36.617.81853700female2007
AdelieTorgersen38.719.01953450female2007
AdelieTorgersen42.520.71974500male 2007
AdelieTorgersen34.418.41843325female2007
AdelieTorgersen46.021.51944200male 2007
AdelieBiscoe 37.818.31743400female2007
AdelieBiscoe 37.718.71803600male 2007
AdelieBiscoe 35.919.21893800female2007
AdelieBiscoe 38.218.11853950male 2007
AdelieBiscoe 38.817.21803800male 2007
AdelieBiscoe 35.318.91873800female2007
AdelieBiscoe 40.618.61833550male 2007
AdelieBiscoe 40.517.91873200female2007
AdelieBiscoe 37.918.61723150female2007
AdelieBiscoe 40.518.91803950male 2007
AdelieDream 39.516.71783250female2007
AdelieDream 37.218.11783900male 2007
AdelieDream 39.517.81883300female2007
AdelieDream 40.918.91843900male 2007
ChinstrapDream46.916.61922700female2008
ChinstrapDream53.519.92054500male 2008
ChinstrapDream49.019.52103950male 2008
ChinstrapDream46.217.51873650female2008
ChinstrapDream50.919.11963550male 2008
ChinstrapDream45.517.01963500female2008
ChinstrapDream50.917.91963675female2009
ChinstrapDream50.818.52014450male 2009
ChinstrapDream50.117.91903400female2009
ChinstrapDream49.019.62124300male 2009
ChinstrapDream51.518.71873250male 2009
ChinstrapDream49.817.31983675female2009
ChinstrapDream48.116.41993325female2009
ChinstrapDream51.419.02013950male 2009
ChinstrapDream45.717.31933600female2009
ChinstrapDream50.719.72034050male 2009
ChinstrapDream42.517.31873350female2009
ChinstrapDream52.218.81973450male 2009
ChinstrapDream45.216.61913250female2009
ChinstrapDream49.319.92034050male 2009
ChinstrapDream50.218.82023800male 2009
ChinstrapDream45.619.41943525female2009
ChinstrapDream51.919.52063950male 2009
ChinstrapDream46.816.51893650female2009
ChinstrapDream45.717.01953650female2009
ChinstrapDream55.819.82074000male 2009
ChinstrapDream43.518.12023400female2009
ChinstrapDream49.618.21933775male 2009
ChinstrapDream50.819.02104100male 2009
ChinstrapDream50.218.71983775female2009
A tibble: 256 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
AdelieTorgersen39.118.71813750male 2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen NA NA NA NANA 2007
AdelieTorgersen36.719.31933450female2007
AdelieTorgersen39.320.61903650male 2007
AdelieTorgersen38.917.81813625female2007
AdelieTorgersen39.219.61954675male 2007
AdelieTorgersen34.118.11933475NA 2007
AdelieTorgersen42.020.21904250NA 2007
AdelieTorgersen37.817.11863300NA 2007
AdelieTorgersen37.817.31803700NA 2007
AdelieTorgersen41.117.61823200female2007
AdelieTorgersen38.621.21913800male 2007
AdelieTorgersen34.621.11984400male 2007
AdelieTorgersen36.617.81853700female2007
AdelieTorgersen38.719.01953450female2007
AdelieTorgersen42.520.71974500male 2007
AdelieTorgersen34.418.41843325female2007
AdelieTorgersen46.021.51944200male 2007
AdelieBiscoe 37.818.31743400female2007
AdelieBiscoe 35.919.21893800female2007
AdelieBiscoe 35.318.91873800female2007
AdelieBiscoe 40.517.91873200female2007
AdelieBiscoe 37.918.61723150female2007
AdelieDream 39.516.71783250female2007
AdelieDream 37.218.11783900male 2007
AdelieDream 39.517.81883300female2007
AdelieDream 40.918.91843900male 2007
AdelieDream 36.417.01953325female2007
ChinstrapDream46.916.61922700female2008
ChinstrapDream53.519.92054500male 2008
ChinstrapDream49.019.52103950male 2008
ChinstrapDream46.217.51873650female2008
ChinstrapDream50.919.11963550male 2008
ChinstrapDream45.517.01963500female2008
ChinstrapDream50.917.91963675female2009
ChinstrapDream50.818.52014450male 2009
ChinstrapDream50.117.91903400female2009
ChinstrapDream49.019.62124300male 2009
ChinstrapDream51.518.71873250male 2009
ChinstrapDream49.817.31983675female2009
ChinstrapDream48.116.41993325female2009
ChinstrapDream51.419.02013950male 2009
ChinstrapDream45.717.31933600female2009
ChinstrapDream50.719.72034050male 2009
ChinstrapDream42.517.31873350female2009
ChinstrapDream52.218.81973450male 2009
ChinstrapDream45.216.61913250female2009
ChinstrapDream49.319.92034050male 2009
ChinstrapDream50.218.82023800male 2009
ChinstrapDream45.619.41943525female2009
ChinstrapDream51.919.52063950male 2009
ChinstrapDream46.816.51893650female2009
ChinstrapDream45.717.01953650female2009
ChinstrapDream55.819.82074000male 2009
ChinstrapDream43.518.12023400female2009
ChinstrapDream49.618.21933775male 2009
ChinstrapDream50.819.02104100male 2009
ChinstrapDream50.218.71983775female2009

Filtering missing values#

There are other functions we can use to evaluate columns and get a true/false output. An important one is is.na(). This function evaluates a column and reports back a TRUE value when there is an NA in that column’s row. For example, we can use the head() function to look at the top 6 values of the sex column, and see that there is an NA in the fourth row.

head(penguins$sex)
  1. male
  2. female
  3. female
  4. <NA>
  5. female
  6. male
Levels:
  1. 'female'
  2. 'male'

The is.na function evaluates the whole column and gives us TRUEs whenever it sees an NA. Not surprisingly, we see a TRUE in the fourth observation.

head(is.na(penguins$sex))
  1. FALSE
  2. FALSE
  3. FALSE
  4. TRUE
  5. FALSE
  6. FALSE

Since is.na() gives us a TRUE/FALSE vector, we can use it with filter. The following gives us all rows where the sex column has missing data (i.e. is set to NA):

dplyr::filter(penguins, is.na(sex))
A tibble: 11 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
AdelieTorgersen NA NA NA NANA2007
AdelieTorgersen34.118.11933475NA2007
AdelieTorgersen42.020.21904250NA2007
AdelieTorgersen37.817.11863300NA2007
AdelieTorgersen37.817.31803700NA2007
AdelieDream 37.518.91792975NA2007
GentooBiscoe 44.514.32164100NA2007
GentooBiscoe 46.214.42144650NA2008
GentooBiscoe 47.313.82164725NA2009
GentooBiscoe 44.515.72174875NA2009
GentooBiscoe NA NA NA NANA2009

To do the reverse, i.e. keep only the observations where the sex value is not missing, combine is.na with the NOT operator:

dplyr::filter(penguins, !is.na(sex))
A tibble: 333 × 8
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<fct><fct><dbl><dbl><int><int><fct><int>
AdelieTorgersen39.118.71813750male 2007
AdelieTorgersen39.517.41863800female2007
AdelieTorgersen40.318.01953250female2007
AdelieTorgersen36.719.31933450female2007
AdelieTorgersen39.320.61903650male 2007
AdelieTorgersen38.917.81813625female2007
AdelieTorgersen39.219.61954675male 2007
AdelieTorgersen41.117.61823200female2007
AdelieTorgersen38.621.21913800male 2007
AdelieTorgersen34.621.11984400male 2007
AdelieTorgersen36.617.81853700female2007
AdelieTorgersen38.719.01953450female2007
AdelieTorgersen42.520.71974500male 2007
AdelieTorgersen34.418.41843325female2007
AdelieTorgersen46.021.51944200male 2007
AdelieBiscoe 37.818.31743400female2007
AdelieBiscoe 37.718.71803600male 2007
AdelieBiscoe 35.919.21893800female2007
AdelieBiscoe 38.218.11853950male 2007
AdelieBiscoe 38.817.21803800male 2007
AdelieBiscoe 35.318.91873800female2007
AdelieBiscoe 40.618.61833550male 2007
AdelieBiscoe 40.517.91873200female2007
AdelieBiscoe 37.918.61723150female2007
AdelieBiscoe 40.518.91803950male 2007
AdelieDream 39.516.71783250female2007
AdelieDream 37.218.11783900male 2007
AdelieDream 39.517.81883300female2007
AdelieDream 40.918.91843900male 2007
AdelieDream 36.417.01953325female2007
ChinstrapDream46.916.61922700female2008
ChinstrapDream53.519.92054500male 2008
ChinstrapDream49.019.52103950male 2008
ChinstrapDream46.217.51873650female2008
ChinstrapDream50.919.11963550male 2008
ChinstrapDream45.517.01963500female2008
ChinstrapDream50.917.91963675female2009
ChinstrapDream50.818.52014450male 2009
ChinstrapDream50.117.91903400female2009
ChinstrapDream49.019.62124300male 2009
ChinstrapDream51.518.71873250male 2009
ChinstrapDream49.817.31983675female2009
ChinstrapDream48.116.41993325female2009
ChinstrapDream51.419.02013950male 2009
ChinstrapDream45.717.31933600female2009
ChinstrapDream50.719.72034050male 2009
ChinstrapDream42.517.31873350female2009
ChinstrapDream52.218.81973450male 2009
ChinstrapDream45.216.61913250female2009
ChinstrapDream49.319.92034050male 2009
ChinstrapDream50.218.82023800male 2009
ChinstrapDream45.619.41943525female2009
ChinstrapDream51.919.52063950male 2009
ChinstrapDream46.816.51893650female2009
ChinstrapDream45.717.01953650female2009
ChinstrapDream55.819.82074000male 2009
ChinstrapDream43.518.12023400female2009
ChinstrapDream49.618.21933775male 2009
ChinstrapDream50.819.02104100male 2009
ChinstrapDream50.218.71983775female2009

Exercise: subsetting data#

Write code to extract data for Gentoo penguins with weight between 4.8kg and 5.2kg (inclusive). Only keep the species, sex and body mass columns in the output. Give the numbers of resulting observations for male penguins, female penguins and penguins with unknown sex.

Creating new variables and renaming variables#

We often want to make a new column with some updated or transformed value. We can use the mutate function in dplyr for this:

# value1, value2,... are expressions, possibly involving column names

dplyr::mutate(data, new_column1 = value1, new_column2 = value2, ...)

For example, to add a column of IDs for identifying each row in the data:

# Add column of ID numbers
penguins_with_ids <- dplyr::mutate(penguins, id = 1:nrow(penguins))

# Optional: Put the ID column at the beginning of the dataframe
penguins_with_ids <- dplyr::relocate(penguins_with_ids, id, .before = 1)

head(penguins_with_ids)
A tibble: 6 × 9
idspeciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyear
<int><fct><fct><dbl><dbl><int><int><fct><int>
1AdelieTorgersen39.118.71813750male 2007
2AdelieTorgersen39.517.41863800female2007
3AdelieTorgersen40.318.01953250female2007
4AdelieTorgersen NA NA NA NANA 2007
5AdelieTorgersen36.719.31933450female2007
6AdelieTorgersen39.320.61903650male 2007

Another example: if we wanted to calculate a new value, the ratio of bill length to bill depth, we could do the following

dplyr::mutate(penguins, bill_ratio = bill_length_mm / bill_depth_mm)
A tibble: 344 × 9
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyearbill_ratio
<fct><fct><dbl><dbl><int><int><fct><int><dbl>
AdelieTorgersen39.118.71813750male 20072.090909
AdelieTorgersen39.517.41863800female20072.270115
AdelieTorgersen40.318.01953250female20072.238889
AdelieTorgersen NA NA NA NANA 2007 NA
AdelieTorgersen36.719.31933450female20071.901554
AdelieTorgersen39.320.61903650male 20071.907767
AdelieTorgersen38.917.81813625female20072.185393
AdelieTorgersen39.219.61954675male 20072.000000
AdelieTorgersen34.118.11933475NA 20071.883978
AdelieTorgersen42.020.21904250NA 20072.079208
AdelieTorgersen37.817.11863300NA 20072.210526
AdelieTorgersen37.817.31803700NA 20072.184971
AdelieTorgersen41.117.61823200female20072.335227
AdelieTorgersen38.621.21913800male 20071.820755
AdelieTorgersen34.621.11984400male 20071.639810
AdelieTorgersen36.617.81853700female20072.056180
AdelieTorgersen38.719.01953450female20072.036842
AdelieTorgersen42.520.71974500male 20072.053140
AdelieTorgersen34.418.41843325female20071.869565
AdelieTorgersen46.021.51944200male 20072.139535
AdelieBiscoe 37.818.31743400female20072.065574
AdelieBiscoe 37.718.71803600male 20072.016043
AdelieBiscoe 35.919.21893800female20071.869792
AdelieBiscoe 38.218.11853950male 20072.110497
AdelieBiscoe 38.817.21803800male 20072.255814
AdelieBiscoe 35.318.91873800female20071.867725
AdelieBiscoe 40.618.61833550male 20072.182796
AdelieBiscoe 40.517.91873200female20072.262570
AdelieBiscoe 37.918.61723150female20072.037634
AdelieBiscoe 40.518.91803950male 20072.142857
ChinstrapDream46.916.61922700female20082.825301
ChinstrapDream53.519.92054500male 20082.688442
ChinstrapDream49.019.52103950male 20082.512821
ChinstrapDream46.217.51873650female20082.640000
ChinstrapDream50.919.11963550male 20082.664921
ChinstrapDream45.517.01963500female20082.676471
ChinstrapDream50.917.91963675female20092.843575
ChinstrapDream50.818.52014450male 20092.745946
ChinstrapDream50.117.91903400female20092.798883
ChinstrapDream49.019.62124300male 20092.500000
ChinstrapDream51.518.71873250male 20092.754011
ChinstrapDream49.817.31983675female20092.878613
ChinstrapDream48.116.41993325female20092.932927
ChinstrapDream51.419.02013950male 20092.705263
ChinstrapDream45.717.31933600female20092.641618
ChinstrapDream50.719.72034050male 20092.573604
ChinstrapDream42.517.31873350female20092.456647
ChinstrapDream52.218.81973450male 20092.776596
ChinstrapDream45.216.61913250female20092.722892
ChinstrapDream49.319.92034050male 20092.477387
ChinstrapDream50.218.82023800male 20092.670213
ChinstrapDream45.619.41943525female20092.350515
ChinstrapDream51.919.52063950male 20092.661538
ChinstrapDream46.816.51893650female20092.836364
ChinstrapDream45.717.01953650female20092.688235
ChinstrapDream55.819.82074000male 20092.818182
ChinstrapDream43.518.12023400female20092.403315
ChinstrapDream49.618.21933775male 20092.725275
ChinstrapDream50.819.02104100male 20092.673684
ChinstrapDream50.218.71983775female20092.684492

Note: the output of mutate is not just a new column on its own, but the whole dataframe with the new column appended. To update penguins to have the new column, we need to overwrite it:

penguins <- dplyr::mutate(penguins, bill_ratio = bill_length_mm / bill_depth_mm)
colnames(penguins)
  1. 'species'
  2. 'island'
  3. 'bill_length_mm'
  4. 'bill_depth_mm'
  5. 'flipper_length_mm'
  6. 'body_mass_g'
  7. 'sex'
  8. 'year'
  9. 'bill_ratio'

When we just want to change the name of a column, we can use the rename function in dplyr:

dplyr::rename(data, New_column_name1 = Old_column_name1, New_column_name2 = Old_column_name2, ...)

For example, we could change the name of some variables to remove the units (probably not recommended in general, but it serves as an example):

penguins_renamed <- dplyr::rename(penguins,
                                  bill_length = bill_length_mm,
                                  bill_depth = bill_depth_mm,
                                  flipper_length = flipper_length_mm,
                                  body_mass = body_mass_g
                                  )
colnames(penguins_renamed)
  1. 'species'
  2. 'island'
  3. 'bill_length'
  4. 'bill_depth'
  5. 'flipper_length'
  6. 'body_mass'
  7. 'sex'
  8. 'year'
  9. 'bill_ratio'

Exercise: creating variables#

Update the penguins dataframe so that it contains an ID column with IDs represented as strings.

Sorting observations#

We can sort the rows of data by the value of a particular column using the arrange function in dplyr. By default, the sorting is performed in increasing order, although we can use the desc() function on the column to sort in descending order:

dplyr::arrange(data, col)  # sort rows by ascending order of values from column
                           # `col`

dplyr::arrange(data, dplyr::desc(col))  # sort rows by descending order of
                                        # values from column `col`

For example, to sort the penguins data by descending value of flipper length:

penguins_sorted_flipper <- dplyr::arrange(penguins, dplyr::desc(flipper_length_mm))
penguins_sorted_flipper
A tibble: 344 × 9
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexyearbill_ratio
<fct><fct><dbl><dbl><int><int><fct><int><dbl>
GentooBiscoe54.315.72315650male 20083.458599
GentooBiscoe50.016.32305700male 20073.067485
GentooBiscoe59.617.02306050male 20073.505882
GentooBiscoe49.816.82305700male 20082.964286
GentooBiscoe48.616.02305800male 20083.037500
GentooBiscoe52.117.02305550male 20093.064706
GentooBiscoe51.516.32305500male 20093.159509
GentooBiscoe55.116.02305850male 20093.443750
GentooBiscoe49.516.22295800male 20083.055556
GentooBiscoe49.815.92295950male 20093.132075
GentooBiscoe50.817.32285600male 20092.936416
GentooBiscoe52.217.12285400male 20093.052632
GentooBiscoe55.917.02285600male 20093.288235
GentooBiscoe49.115.02285500male 20093.273333
GentooBiscoe50.815.72265200male 20093.235669
GentooBiscoe49.616.02255700male 20083.100000
GentooBiscoe50.515.92255400male 20083.176101
GentooBiscoe50.115.02255000male 20083.340000
GentooBiscoe51.116.52255250male 20093.096970
GentooBiscoe50.415.32245550male 20083.294118
GentooBiscoe50.015.92245350male 20093.144654
GentooBiscoe49.516.12245650male 20093.074534
GentooBiscoe50.715.02235550male 20083.380000
GentooBiscoe45.216.42235950male 20082.756098
GentooBiscoe48.715.12225350male 20073.225166
GentooBiscoe47.315.32225250male 20073.091503
GentooBiscoe50.515.92225550male 20083.176101
GentooBiscoe46.914.62224875female20093.212329
GentooBiscoe48.816.22226000male 20093.012346
GentooBiscoe50.415.72225750male 20093.210191
Adelie Torgersen37.219.41843900male 20081.917526
Adelie Biscoe 39.718.91843550male 20092.100529
Adelie Dream 36.618.41843475female20091.989130
Adelie Biscoe 40.618.61833550male 20072.182796
Adelie Biscoe 37.716.01833075female20092.356250
Adelie Torgersen41.117.61823200female20072.335227
Adelie Dream 36.518.01823150female20072.027778
Adelie Dream 41.119.01823425male 20072.163158
Adelie Torgersen39.118.71813750male 20072.090909
Adelie Torgersen38.917.81813625female20072.185393
Adelie Dream 37.619.31813300female20071.948187
Adelie Biscoe 36.516.61812850female20082.198795
Adelie Biscoe 38.117.01813175female20092.241176
ChinstrapDream 58.017.81813700female20073.258427
ChinstrapDream 42.417.31813600female20072.450867
Adelie Torgersen37.817.31803700NA 20072.184971
Adelie Biscoe 37.718.71803600male 20072.016043
Adelie Biscoe 38.817.21803800male 20072.255814
Adelie Biscoe 40.518.91803950male 20072.142857
Adelie Dream 42.218.51803550female20072.281081
Adelie Dream 37.518.91792975NA 20071.984127
Adelie Dream 39.516.71783250female20072.365269
Adelie Dream 37.218.11783900male 20072.055249
Adelie Dream 33.116.11782900female20082.055901
ChinstrapDream 46.118.21783250female20072.532967
Adelie Torgersen40.217.01763450female20092.364706
Adelie Biscoe 37.818.31743400female20072.065574
Adelie Biscoe 37.918.61723150female20072.037634
Adelie Torgersen NA NA NA NANA 2007 NA
Gentoo Biscoe NA NA NA NANA 2009 NA

Exercise: sorting#

Sort the penguins data by the id column we created in the previous exercise. Explain the resulting order on the id column.

Summarising / aggregating data#

Often we want to aggregate data at certain levels to better understand differences across groups. For instance, does flipper length differ by species? Does body mass change between years?

Summarising all the data#

First, we can use summarise on its own, without any grouping, to get a single summary from all the rows in the dataframe. Here are some examples using the flipper length variable:

# Count number of observations
dplyr::summarise(penguins, num_rows = dplyr::n())

# Count number of different years in the data
dplyr::summarise(penguins, num_years = dplyr::n_distinct(year))

# Get the min and max year
dplyr::summarise(penguins, min_year = min(year), max_year = max(year))

# calculate the sum of flipper lengths for the whole data frame
dplyr::summarise(penguins, sum_flipper_length_mm = sum(flipper_length_mm))
A tibble: 1 × 1
num_rows
<int>
344
A tibble: 1 × 1
num_years
<int>
3
A tibble: 1 × 2
min_yearmax_year
<int><int>
20072009
A tibble: 1 × 1
sum_flipper_length_mm
<int>
NA

If the result is NA, it’s likely because the column contained missing values. We include the na.rm = TRUE optional argument to tell summarise to remove NAs before calculating:

# With missing values removed
dplyr::summarise(penguins, sum_flipper_length_mm = sum(flipper_length_mm, na.rm = TRUE))
A tibble: 1 × 1
sum_flipper_length_mm
<int>
68713

Summarising grouped data#

More generally, we can compute summaries for subgroups of the data by combining the group_by and summarise (or summarize) functions in dplyr. The steps to do this are:

  1. First use group_by to declare the column (or columns) that you want to group the data by.

  2. Then use summarise to actually do the summarising on each of the subgroups from step 1. The resulting dataframe will have one summary per subgroup.

For example, to get the average flipper length of each sex:

# First group the data by sex
penguins_grouped_by_sex <- dplyr::group_by(penguins, sex)

# Then find the mean flipper length for each sex
dplyr::summarise(penguins_grouped_by_sex, mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE))
A tibble: 3 × 2
sexmean_flipper_length_mm
<fct><dbl>
female197.3636
male 204.5060
NA 199.0000

These functions are powerful. We can group by multiple columns at once to create pairwise groups. As we saw above, we can also create several summary variables at the same time:

dplyr::group_by(data, col1, col2, ...)  # group by combinations of values in
                                        # col1, col2, ...

# Compute multiple summaries on the same data
dplyr::summarise(grouped_data, summary1 = <...>, summary2 = <...>, ...)

Exercise: grouping and summarising#

Create a summary dataframe that gives the mean, range (i.e. max - min) and standard deviation of the body mass for each species / sex combination, as well as the number of observations in each group.

An unimportant detail: grouped dataframes#

dplyr::group_by returns a ‘grouped dataframe’ (dplyr::grouped_df). These behave just like tibbles/dataframes except they have some extra information about the grouping attached to them.

When you group by multiple columns, the result of summarise will typically also be a grouped dataframe. Usually, the groups are given by all columns except the last one in the grouped dataframe supplied to summarise. By default, you get a console message telling you on which columns the output is grouped by (see the result of the exercise above).

The upshot of this is that it allows us to do successive summaries without having to keep regrouping the data. For example, the code below will produce a ‘max of averages’:

# Group by sex and species
penguins_grouped <- dplyr::group_by(penguins, sex, species)

# Average mass for each species / sex combination
# Result is grouped by sex (see console output)
penguins_grouped_mass <- dplyr::summarise(penguins_grouped,
                                          mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))
print(penguins_grouped_mass)

# Max of the average mass over species, for each sex
penguins_max_species_avg_mass <- dplyr::summarise(
  penguins_grouped_mass,
  max_species_avg_body_mass_g = max(mean_body_mass_g, na.rm = TRUE)
  )
print(penguins_max_species_avg_mass)
`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.
# A tibble: 8 × 3
# Groups:   sex [3]
  sex    species   mean_body_mass_g
  <fct>  <fct>                <dbl>
1 female Adelie               3369.
2 female Chinstrap            3527.
3 female Gentoo               4680.
4 male   Adelie               4043.
5 male   Chinstrap            3939.
6 male   Gentoo               5485.
7 NA     Adelie               3540 
8 NA     Gentoo               4588.
# A tibble: 3 × 2
  sex    max_species_avg_body_mass_g
  <fct>                        <dbl>
1 female                       4680.
2 male                         5485.
3 NA                           4588.

In practice, you don’t have to worry about this detail. If you want to get back to a plain old tibble/dataframe, use the dplyr::ungroup function on a grouped dataframe.

Acknowledgement#

The material in this notebook is adapted from Eliza Wood’s Tidyverse: Data wrangling & visualization course, which is licensed under Creative Commons BY-NC-SA 4.0. This in itself is based on material from UC Davis’s R-DAVIS course, which draws heavily on Carpentries R lessons.

Summary Quiz#

Which function from the `dplyr` package is used to select specific columns from a dataframe?

Which operator is used in `dplyr` to filter rows based on a condition?

What does the `mutate()` function do in the `dplyr` package?