Reshaping Data#

Download Rmd Version#

If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.

Exercises incomplete: Download Reshaping_data.Rmd Exercises complete: Download Reshaping_data_complete.Rmd

Learning Objectives#

  • Understand the concept of reshaping data and its importance in data analysis

  • Learn how to use the pivot_longer function from the tidyr package to convert matrix format data into tidy format

  • Understand how to use the pivot_wider function from the tidyr package to convert tidy format data into matrix format

  • Gain experience in reshaping dataset using practical examples and exercises

  • Explore the tidyr package and its capabilities for data manipulation

Introduction#

There are often cases where you’d like to change the shape of your data, by which we mean to change the structure of the columns and/or rows. The tidyr package contains functions to support this.

As an example dataset for reshaping, we’ll use data on territorial fossil CO2 emissions by country (Friedlingstein et al. 2023)[eal23].

Note: To read the data, we’ll use a call to the read_excel function from the readxl package. We only touched on this in notebook 2, so don’t worry about understanding exactly how the data is read in. Treat this as a glimpse of seeing the Tidyverse in action ‘in the wild’. You can read up about read_excel as extra homework, if you’re keen :-)

emissions <- readxl::read_excel("data/National_Fossil_Carbon_Emissions_2023v1.0.xlsx",
                                sheet = "Territorial Emissions",
                                skip = 11) |>
  dplyr::rename(Year = `...1`) |>  # missing column name in data got assigned name `...1` when reading
  dplyr::select(Year, World, Africa, Asia, `Central America`, Europe,
                `Middle East`, `North America`, Oceania, `South America`,
                Bunkers) |>
  dplyr::mutate(Year = as.integer(Year))

print(emissions)
New names:
 `` -> `...1`
# A tibble: 173 × 11
    Year World Africa    Asia `Central America` Europe `Middle East`
   <int> <dbl>  <dbl>   <dbl>             <dbl>  <dbl>         <dbl>
 1  1850  53.7      0 0.00360                 0   48.2             0
 2  1851  54.2      0 0                       0   47.5             0
 3  1852  56.6      0 0                       0   49.2             0
 4  1853  59.3      0 0                       0   51.0             0
 5  1854  69.6      0 0                       0   60.5             0
 6  1855  71.0      0 0.0109                  0   60.5             0
 7  1856  75.7      0 0                       0   64.6             0
 8  1857  76.4      0 0                       0   65.0             0
 9  1858  77.6      0 0.123                   0   65.9             0
10  1859  82.3      0 0.190                   0   69.5             0
# ℹ 163 more rows
# ℹ 4 more variables: `North America` <dbl>, Oceania <dbl>,
#   `South America` <dbl>, Bunkers <dbl>

Converting to/from tidy format#

One scenario for reshaping data is when we receive data that is not in tidy format and want to transform it so that it is tidy.

In the emissions data above, notice how each column other than ‘Year’ corresponds to a region of the world (or the category ‘Bunkers’):

colnames(emissions)
  1. 'Year'
  2. 'World'
  3. 'Africa'
  4. 'Asia'
  5. 'Central America'
  6. 'Europe'
  7. 'Middle East'
  8. 'North America'
  9. 'Oceania'
  10. 'South America'
  11. 'Bunkers'

This is not in tidy format. The observations correspond to year-region pairs, while the variable being measured is territorial emissions. But the data is structured as a matrix, where to get the emissions value for a region and year you look up the entry for the appropriate row and column.

What we’d like to do is create a dataframe with columns ‘Year’, ‘Country’ and ‘Territorial_emissions’ columns, with rows corresponding to year-region pairs.

Matrix format to tidy format with pivot_longer#

We can use the function pivot_longer from the tidyr package to help us make the required transformation. Like most of the Tidyverse functions, there are a lot of possible parameters to use in this function, but we only need to use it as follows:

tidyr::pivot_longer(data, <columns-to-gather>, names_to = "Name_for_gathered_columns", values_to = "Variable_name")

where

  • data is the dataframe to transform.

  • <columns-to-gather> is a specification of the columns that should be gathered into a new column.

  • names_to defines the name of the new column to create, whose values will be the names in <columns-to-gather>.

  • values_to is the name of a new column that contains the values that were in the columns <columns-to-gather> (i.e. the variable that the original data was giving).

Note: More specifically, <columns-to-gather> is a so-called ‘tidy-select’. This is a way of specifying collections of columns within the Tidyverse. Unfortunately details of this are outside the scope of this course, but you can learn more by running ?tidyr::tidyr_tidy_select in the R console. We will explain our use of ‘tidy-select’ in code comments below.

The behaviour of pivot_longer is much easier to see with an example. The following creates a new column called ‘Region’ that contains all the column names in emissions from ‘World’ to ‘Bunkers’. The values from these columns are put into a new column called ‘Territorial_emissions’.

emissions_tidy <- emissions |>
  tidyr::pivot_longer(World:Bunkers,  # All columns from 'World' to 'Bunkers'
                      names_to = "Region",
                      values_to = "Territorial_emissions")

print(emissions_tidy)
# A tibble: 1,730 × 3
    Year Region          Territorial_emissions
   <int> <chr>                           <dbl>
 1  1850 World                        53.7    
 2  1850 Africa                        0      
 3  1850 Asia                          0.00360
 4  1850 Central America               0      
 5  1850 Europe                       48.2    
 6  1850 Middle East                   0      
 7  1850 North America                 5.42   
 8  1850 Oceania                       0.0306 
 9  1850 South America                 0      
10  1850 Bunkers                       0      
# ℹ 1,720 more rows

Notice how each row corresponds to a year-region pair. The ‘Territorial_emissions’ values line up with the year and region they corresponded to in the original data.

Notice that the number of rows is equal to the number of regions multiplied by the number of years, as we’d expect:

# Number of rows in tidy dataframe
cat("Number of rows in emissions_tidy:", nrow(emissions_tidy), "\n")

# Number of year-region pairs
# (Use ncol(emissions) - 1 so not to include the year column)
cat("Number of year-region pairs:", length(emissions$Year) * (ncol(emissions) - 1), "\n")
Number of rows in emissions_tidy: 1730 
Number of year-region pairs: 1730 

In general, pivot_longer is called such because it make the table of data ‘longer’ i.e. have more rows.

Exercise: pivoting#

What do you think the output of the following pivot_longer will be?

emissions |>
  tidyr::pivot_longer(!Africa,  # this means all columns except 'Africa'
                      names_to = "Names_column",
                      values_to = "Territorial_emissions?")

Run the code below to see if you’re correct!

What do values in the ‘Africa’ and ‘Territorial_emissions?’ columns correspond to in the following output?

emissions |>
  tidyr::pivot_longer(!Africa,  # this means all columns except 'Africa'
                      names_to = "Names_column",
                      values_to = "Territorial_emissions?") |>
  dplyr::filter(Names_column == "Year")
A tibble: 173 × 3
AfricaNames_columnTerritorial_emissions?
<dbl><chr><dbl>
0Year1850
0Year1851
0Year1852
0Year1853
0Year1854
0Year1855
0Year1856
0Year1857
0Year1858
0Year1859
0Year1860
0Year1861
0Year1862
0Year1863
0Year1864
0Year1865
0Year1866
0Year1867
0Year1868
0Year1869
0Year1870
0Year1871
0Year1872
0Year1873
0Year1874
0Year1875
0Year1876
0Year1877
0Year1878
0Year1879
192.8631Year1993
215.6851Year1994
230.4983Year1995
238.0758Year1996
243.9897Year1997
249.0260Year1998
245.5422Year1999
253.8296Year2000
250.9639Year2001
245.4321Year2002
269.8020Year2003
285.1680Year2004
290.1662Year2005
298.0772Year2006
306.4166Year2007
320.8539Year2008
320.1835Year2009
331.7038Year2010
342.5417Year2011
342.7512Year2012
345.3763Year2013
367.0591Year2014
360.2010Year2015
369.9451Year2016
373.8261Year2017
370.3777Year2018
397.8039Year2019
370.2013Year2020
391.7021Year2021
386.6290Year2022

Tidy format to matrix format with pivot_wider#

What if we want to go the other way, i.e. turn a tidy dataframe into a matrix format? In that case, we can use pivot_wider from the tidyr package.

tidyr::pivot_wider(data, names_from = <column-to-make-new-columns-from>, values_from = <column-of-variable>)

where

  • data is the dataframe to transform.

  • names_from is the column whose values will be the names of new columns in the new dataframe.

  • values_from is the name of a column whose values will go into the ‘cells’ of the new dataframe.

To see this in action, the following use of pivot_wider will take our tidied emissions data and put it back into the original ‘matrix’ format, where regions corresponded to columns and the entries gave the emissions for each year-region combination.

emissions_tidy |>
  tidyr::pivot_wider(names_from = Region, values_from = Territorial_emissions)
A tibble: 173 × 11
YearWorldAfricaAsiaCentral AmericaEuropeMiddle EastNorth AmericaOceaniaSouth AmericaBunkers
<int><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
1850 53.6986800.0035991230 48.246500.000 5.4180.0305820000
1851 54.2216400.0000000000 47.450610.000 6.7420.0290334900
1852 56.6086800.0000000000 49.244740.000 7.3350.0289450300
1853 59.2577600.0000000000 50.956190.000 8.2600.0415723000
1854 69.6051200.0000000000 60.471030.000 9.0840.0500891100
1855 70.9891900.0109134700 60.463410.00010.4560.0588640000
1856 75.6813400.0000000000 64.623770.00010.9760.0815737600
1857 76.4006600.0000000000 65.046290.00011.2640.0903658300
1858 77.5723700.1229912190 65.917450.00011.4390.0929264900
1859 82.2820000.1899671860 69.504680.00012.4550.1323546700
1860 90.3282200.1994489990 76.914070.00013.0510.1636924600
1861 94.9197100.1661063470 82.008610.00012.5920.1529954600
1862 96.7491600.1776078030 83.253580.00013.1030.2149659200
1863103.2151000.1951867230 87.672290.00015.1510.1966294300
1864111.2789200.1896311290 94.603800.00016.2390.2464956700
1865118.1384100.1854942110101.314220.04116.3320.2656953000
1866121.8759800.2718754410104.692740.05716.5050.3493612300
1867130.5023900.1116134450109.668390.06220.3220.3383876800
1868134.1571900.0958603170110.861140.04822.7340.4181930600
1869142.4547500.1309692880116.092030.07325.7590.3997500700
1870145.5617500.1417038370117.744300.04327.2500.3827486700
1871154.7192600.1915566470125.477440.05428.6030.3932614500
1872171.2066700.1984712600135.518530.06734.9760.4466730200
1873181.8999000.1888313870142.605170.06538.5140.5268966700
1874170.3691400.3758549710132.247150.07037.0960.5801390500
1875184.8018100.6414466140145.918550.09637.5610.5848077700
1876187.4400200.6914566240149.327540.07536.7570.5890178600
1877191.1584100.6667938880149.002210.12240.7200.6474064800
1878192.6783701.4006395860150.589600.08939.8030.7961385700
1879206.3292801.5138506200155.453860.03448.4860.8415685300
1993 6222.257192.86311767.90924.733871858.070302.15811658.322 87.53554178.3418152.3197
1994 6286.784215.68511848.71425.736511759.722318.15311690.972 88.93646179.1228159.7370
1995 6420.440230.49831946.74927.592351753.479307.63021705.467 92.10548188.1586168.7586
1996 6618.494238.07582034.09029.141121766.688312.82191764.086 94.39378196.4329182.7634
1997 6658.283243.98972071.28530.935401710.626308.59871794.232 97.13335209.2761192.2064
1998 6640.542249.02601994.13131.024741703.319319.39781813.733100.51114225.4773203.9215
1999 6777.700245.54222100.53532.464001680.612343.59361838.242103.71644221.7123211.2822
2000 6959.958253.82962167.05135.677781680.019379.83521902.205105.66316221.5420214.1338
2001 7007.350250.96392207.22437.090831714.589384.07651876.444108.36968215.9629212.6275
2002 7163.834245.43212344.67738.339901705.785389.08761889.145109.80886223.6542217.9042
2003 7546.029269.80202594.74440.238251745.740414.35151918.249112.31317223.8123226.7782
2004 7811.188285.16802742.78839.383761754.702446.46311946.463115.88986233.7669246.5613
2005 8076.447290.16622956.47541.762191752.605464.05091957.232117.25944240.4743256.4204
2006 8353.102298.07723161.75644.548621780.033494.07121936.884118.94012248.0915270.6985
2007 8597.089306.41663354.81145.914321768.476492.21881964.614120.73416257.5851286.3159
2008 8745.061320.85393517.82745.751551748.086530.63751901.140122.16701269.5415289.0567
2009 8595.324320.18353637.54445.262971608.319552.18081770.180122.30569262.3309277.0165
2010 9090.155331.70383920.99747.219881671.395582.69991826.725122.11061292.6122294.6914
2011 9398.600342.54174241.42547.197381650.984595.07191800.100121.60677297.1936302.4791
2012 9534.785342.75124388.56246.938021637.398634.92621750.022122.69098312.1908299.3058
2013 9615.848345.37634465.31347.072441595.210622.24161786.520120.63339329.8006303.6800
2014 9679.638367.05914523.79247.923921532.263652.44941795.780119.30493335.2772305.7898
2015 9678.825360.20104533.88649.853671532.340684.39881754.166121.65083325.7104316.6178
2016 9677.955369.94514553.11948.200101530.234701.14161716.977123.78293314.2274320.3272
2017 9832.274373.82614676.43547.273111538.889722.15771704.434125.19382309.4195334.6460
201810034.647370.37774841.29647.334591531.780728.77991753.753125.77717291.2735344.2744
201910109.199397.80394964.59148.977301483.723729.58331722.073126.17338291.2435345.0302
2020 9554.514370.20134914.10442.237391372.597688.31431550.317119.96938262.3909234.3823
202110048.183391.70215132.60544.359281449.320720.73271648.145117.12639296.3718247.8202
202210139.133386.62905213.58944.169101393.370730.78091669.725118.24632296.2315286.3923

Acnowledgements#

We acknowledge the Global Carbon Project, which is responsible for the Global Carbon Budget and we thank the fossil carbon emissions modelling groups for producing and making available their model output.

What is the primary purpose of the pivot_longer function in the tidyr package?

Which arguement in the pivot_wider function specifies the column whose values will become the names of new columns?

Which function is used to convert a tidy dataframe back into a matrix format?