Reshaping Data#

Download Rmd Version#

If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.

Exercises incomplete: Download Reshaping_data.Rmd Exercises complete: Download Reshaping_data_complete.Rmd

Learning Objectives#

  • Understand the concept of reshaping data and its importance in data analysis

  • Learn how to use the pivot_longer function from the tidyr package to convert matrix format data into tidy format

  • Understand how to use the pivot_wider function from the tidyr package to convert tidy format data into matrix format

  • Gain experience in reshaping dataset using practical examples and exercises

  • Explore the tidyr package and its capabilities for data manipulation

Introduction#

There are often cases where you’d like to change the shape of your data, by which we mean to change the structure of the columns and/or rows. The tidyr package contains functions to support this.

As an example dataset for reshaping, we’ll use data on territorial fossil CO2 emissions by country (Friedlingstein et al. 2023)[eal23].

Note: To read the data, we’ll use a call to the read_excel function from the readxl package. We only touched on this in notebook 2, so don’t worry about understanding exactly how the data is read in. Treat this as a glimpse of seeing the Tidyverse in action ‘in the wild’. You can read up about read_excel as extra homework, if you’re keen :-)

%%R
emissions <- readxl::read_excel("data/National_Fossil_Carbon_Emissions_2023v1.0.xlsx",
                                sheet = "Territorial Emissions",
                                skip = 11) |>
  dplyr::rename(Year = `...1`) |>  # missing column name in data got assigned name `...1` when reading
  dplyr::select(Year, World, Africa, Asia, `Central America`, Europe,
                `Middle East`, `North America`, Oceania, `South America`,
                Bunkers) |>
  dplyr::mutate(Year = as.integer(Year))

print(emissions)
New names:
• `` -> `...1`
# A tibble: 173 × 11
    Year World Africa    Asia `Central America` Europe `Middle East`
   <int> <dbl>  <dbl>   <dbl>             <dbl>  <dbl>         <dbl>
 1  1850  53.7      0 0.00360                 0   48.2             0
 2  1851  54.2      0 0                       0   47.5             0
 3  1852  56.6      0 0                       0   49.2             0
 4  1853  59.3      0 0                       0   51.0             0
 5  1854  69.6      0 0                       0   60.5             0
 6  1855  71.0      0 0.0109                  0   60.5             0
 7  1856  75.7      0 0                       0   64.6             0
 8  1857  76.4      0 0                       0   65.0             0
 9  1858  77.6      0 0.123                   0   65.9             0
10  1859  82.3      0 0.190                   0   69.5             0
# ℹ 163 more rows
# ℹ 4 more variables: `North America` <dbl>, Oceania <dbl>,
#   `South America` <dbl>, Bunkers <dbl>
# ℹ Use `print(n = ...)` to see more rows

Converting to/from tidy format#

One scenario for reshaping data is when we receive data that is not in tidy format and want to transform it so that it is tidy.

In the emissions data above, notice how each column other than ‘Year’ corresponds to a region of the world (or the category ‘Bunkers’):

%%R
colnames(emissions)
 [1] "Year"            "World"           "Africa"          "Asia"           
 [5] "Central America" "Europe"          "Middle East"     "North America"  
 [9] "Oceania"         "South America"   "Bunkers"        

This is not in tidy format. The observations correspond to year-region pairs, while the variable being measured is territorial emissions. But the data is structured as a matrix, where to get the emissions value for a region and year you look up the entry for the appropriate row and column.

What we’d like to do is create a dataframe with columns ‘Year’, ‘Country’ and ‘Territorial_emissions’ columns, with rows corresponding to year-region pairs.

Matrix format to tidy format with pivot_longer#

We can use the function pivot_longer from the tidyr package to help us make the required transformation. Like most of the Tidyverse functions, there are a lot of possible parameters to use in this function, but we only need to use it as follows:

tidyr::pivot_longer(data, <columns-to-gather>, names_to = "Name_for_gathered_columns", values_to = "Variable_name")

where

  • data is the dataframe to transform.

  • <columns-to-gather> is a specification of the columns that should be gathered into a new column.

  • names_to defines the name of the new column to create, whose values will be the names in <columns-to-gather>.

  • values_to is the name of a new column that contains the values that were in the columns <columns-to-gather> (i.e. the variable that the original data was giving).

Note: More specifically, <columns-to-gather> is a so-called ‘tidy-select’. This is a way of specifying collections of columns within the Tidyverse. Unfortunately details of this are outside the scope of this course, but you can learn more by running ?tidyr::tidyr_tidy_select in the R console. We will explain our use of ‘tidy-select’ in code comments below.

The behaviour of pivot_longer is much easier to see with an example. The following creates a new column called ‘Region’ that contains all the column names in emissions from ‘World’ to ‘Bunkers’. The values from these columns are put into a new column called ‘Territorial_emissions’.

%%R
emissions_tidy <- emissions |>
  tidyr::pivot_longer(World:Bunkers,  # All columns from 'World' to 'Bunkers'
                      names_to = "Region",
                      values_to = "Territorial_emissions")

print(emissions_tidy)
# A tibble: 1,730 × 3
    Year Region          Territorial_emissions
   <int> <chr>                           <dbl>
 1  1850 World                        53.7    
 2  1850 Africa                        0      
 3  1850 Asia                          0.00360
 4  1850 Central America               0      
 5  1850 Europe                       48.2    
 6  1850 Middle East                   0      
 7  1850 North America                 5.42   
 8  1850 Oceania                       0.0306 
 9  1850 South America                 0      
10  1850 Bunkers                       0      
# ℹ 1,720 more rows
# ℹ Use `print(n = ...)` to see more rows

Notice how each row corresponds to a year-region pair. The ‘Territorial_emissions’ values line up with the year and region they corresponded to in the original data.

Notice that the number of rows is equal to the number of regions multiplied by the number of years, as we’d expect:

%%R
# Number of rows in tidy dataframe
cat("Number of rows in emissions_tidy:", nrow(emissions_tidy), "\n")

# Number of year-region pairs
# (Use ncol(emissions) - 1 so not to include the year column)
cat("Number of year-region pairs:", length(emissions$Year) * (ncol(emissions) - 1), "\n")
Number of rows in emissions_tidy: 1730 
Number of year-region pairs: 1730 

In general, pivot_longer is called such because it make the table of data ‘longer’ i.e. have more rows.

Exercise: pivoting#

What do you think the output of the following pivot_longer will be?

emissions |>
  tidyr::pivot_longer(!Africa,  # this means all columns except 'Africa'
                      names_to = "Names_column",
                      values_to = "Territorial_emissions?")

Run the code below to see if you’re correct!

What do values in the ‘Africa’ and ‘Territorial_emissions?’ columns correspond to in the following output?

%%R
emissions |>
  tidyr::pivot_longer(!Africa,  # this means all columns except 'Africa'
                      names_to = "Names_column",
                      values_to = "Territorial_emissions?") |>
  dplyr::filter(Names_column == "Year")
# A tibble: 173 × 3
   Africa Names_column `Territorial_emissions?`
    <dbl> <chr>                           <dbl>
 1      0 Year                             1850
 2      0 Year                             1851
 3      0 Year                             1852
 4      0 Year                             1853
 5      0 Year                             1854
 6      0 Year                             1855
 7      0 Year                             1856
 8      0 Year                             1857
 9      0 Year                             1858
10      0 Year                             1859
# ℹ 163 more rows
# ℹ Use `print(n = ...)` to see more rows

Tidy format to matrix format with pivot_wider#

What if we want to go the other way, i.e. turn a tidy dataframe into a matrix format? In that case, we can use pivot_wider from the tidyr package.

tidyr::pivot_wider(data, names_from = <column-to-make-new-columns-from>, values_from = <column-of-variable>)

where

  • data is the dataframe to transform.

  • names_from is the column whose values will be the names of new columns in the new dataframe.

  • values_from is the name of a column whose values will go into the ‘cells’ of the new dataframe.

To see this in action, the following use of pivot_wider will take our tidied emissions data and put it back into the original ‘matrix’ format, where regions corresponded to columns and the entries gave the emissions for each year-region combination.

%%R
emissions_tidy |>
  tidyr::pivot_wider(names_from = Region, values_from = Territorial_emissions)
# A tibble: 173 × 11
    Year World Africa    Asia `Central America` Europe `Middle East`
   <int> <dbl>  <dbl>   <dbl>             <dbl>  <dbl>         <dbl>
 1  1850  53.7      0 0.00360                 0   48.2             0
 2  1851  54.2      0 0                       0   47.5             0
 3  1852  56.6      0 0                       0   49.2             0
 4  1853  59.3      0 0                       0   51.0             0
 5  1854  69.6      0 0                       0   60.5             0
 6  1855  71.0      0 0.0109                  0   60.5             0
 7  1856  75.7      0 0                       0   64.6             0
 8  1857  76.4      0 0                       0   65.0             0
 9  1858  77.6      0 0.123                   0   65.9             0
10  1859  82.3      0 0.190                   0   69.5             0
# ℹ 163 more rows
# ℹ 4 more variables: `North America` <dbl>, Oceania <dbl>,
#   `South America` <dbl>, Bunkers <dbl>
# ℹ Use `print(n = ...)` to see more rows

Acnowledgements#

We acknowledge the Global Carbon Project, which is responsible for the Global Carbon Budget and we thank the fossil carbon emissions modelling groups for producing and making available their model output.

!pwd
/Users/lb788/Documents/CfRR/CfRR_Courses/individual_modules/working_with_data_in_R