Reshaping Data#
Download Rmd Version#
If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.
Exercises incomplete: Download Reshaping_data.Rmd Exercises complete: Download Reshaping_data_complete.Rmd
Learning Objectives#
Understand the concept of reshaping data and its importance in data analysis
Learn how to use the
pivot_longer
function from thetidyr
package to convert matrix format data into tidy formatUnderstand how to use the
pivot_wider
function from thetidyr
package to convert tidy format data into matrix formatGain experience in reshaping dataset using practical examples and exercises
Explore the
tidyr
package and its capabilities for data manipulation
Introduction#
There are often cases where you’d like to change the shape of your data, by
which we mean to change the structure of the columns and/or rows. The tidyr
package contains functions to support this.
As an example dataset for reshaping, we’ll use data on territorial fossil CO2 emissions by country (Friedlingstein et al. 2023)[eal23].
Note: To read the data, we’ll use a call to the read_excel
function from
the readxl
package. We only touched on this in notebook 2, so don’t worry
about understanding exactly how the data is read in. Treat this as a glimpse of
seeing the Tidyverse in action ‘in the wild’. You can read up about read_excel
as extra homework, if you’re keen :-)
%%R
emissions <- readxl::read_excel("data/National_Fossil_Carbon_Emissions_2023v1.0.xlsx",
sheet = "Territorial Emissions",
skip = 11) |>
dplyr::rename(Year = `...1`) |> # missing column name in data got assigned name `...1` when reading
dplyr::select(Year, World, Africa, Asia, `Central America`, Europe,
`Middle East`, `North America`, Oceania, `South America`,
Bunkers) |>
dplyr::mutate(Year = as.integer(Year))
print(emissions)
New names:
• `` -> `...1`
# A tibble: 173 × 11
Year World Africa Asia `Central America` Europe `Middle East`
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1850 53.7 0 0.00360 0 48.2 0
2 1851 54.2 0 0 0 47.5 0
3 1852 56.6 0 0 0 49.2 0
4 1853 59.3 0 0 0 51.0 0
5 1854 69.6 0 0 0 60.5 0
6 1855 71.0 0 0.0109 0 60.5 0
7 1856 75.7 0 0 0 64.6 0
8 1857 76.4 0 0 0 65.0 0
9 1858 77.6 0 0.123 0 65.9 0
10 1859 82.3 0 0.190 0 69.5 0
# ℹ 163 more rows
# ℹ 4 more variables: `North America` <dbl>, Oceania <dbl>,
# `South America` <dbl>, Bunkers <dbl>
# ℹ Use `print(n = ...)` to see more rows
Converting to/from tidy format#
One scenario for reshaping data is when we receive data that is not in tidy format and want to transform it so that it is tidy.
In the emissions data above, notice how each column other than ‘Year’ corresponds to a region of the world (or the category ‘Bunkers’):
%%R
colnames(emissions)
[1] "Year" "World" "Africa" "Asia"
[5] "Central America" "Europe" "Middle East" "North America"
[9] "Oceania" "South America" "Bunkers"
This is not in tidy format. The observations correspond to year-region pairs, while the variable being measured is territorial emissions. But the data is structured as a matrix, where to get the emissions value for a region and year you look up the entry for the appropriate row and column.
What we’d like to do is create a dataframe with columns ‘Year’, ‘Country’ and ‘Territorial_emissions’ columns, with rows corresponding to year-region pairs.
Matrix format to tidy format with pivot_longer
#
We can use the function pivot_longer
from the tidyr
package to help us make
the required transformation. Like most of the Tidyverse functions, there are a
lot of possible parameters to use in this function, but we only need to use it
as follows:
tidyr::pivot_longer(data, <columns-to-gather>, names_to = "Name_for_gathered_columns", values_to = "Variable_name")
where
data
is the dataframe to transform.<columns-to-gather>
is a specification of the columns that should be gathered into a new column.names_to
defines the name of the new column to create, whose values will be the names in<columns-to-gather>
.values_to
is the name of a new column that contains the values that were in the columns<columns-to-gather>
(i.e. the variable that the original data was giving).
Note: More specifically, <columns-to-gather>
is a so-called ‘tidy-select’.
This is a way of specifying collections of columns within the Tidyverse.
Unfortunately details of this are outside the scope of this course, but you can
learn more by running ?tidyr::tidyr_tidy_select
in the R console. We will
explain our use of ‘tidy-select’ in code comments below.
The behaviour of pivot_longer
is much easier to see with an example. The
following creates a new column called ‘Region’ that contains all the column
names in emissions
from ‘World’ to ‘Bunkers’. The values from these columns
are put into a new column called ‘Territorial_emissions’.
%%R
emissions_tidy <- emissions |>
tidyr::pivot_longer(World:Bunkers, # All columns from 'World' to 'Bunkers'
names_to = "Region",
values_to = "Territorial_emissions")
print(emissions_tidy)
# A tibble: 1,730 × 3
Year Region Territorial_emissions
<int> <chr> <dbl>
1 1850 World 53.7
2 1850 Africa 0
3 1850 Asia 0.00360
4 1850 Central America 0
5 1850 Europe 48.2
6 1850 Middle East 0
7 1850 North America 5.42
8 1850 Oceania 0.0306
9 1850 South America 0
10 1850 Bunkers 0
# ℹ 1,720 more rows
# ℹ Use `print(n = ...)` to see more rows
Notice how each row corresponds to a year-region pair. The ‘Territorial_emissions’ values line up with the year and region they corresponded to in the original data.
Notice that the number of rows is equal to the number of regions multiplied by the number of years, as we’d expect:
%%R
# Number of rows in tidy dataframe
cat("Number of rows in emissions_tidy:", nrow(emissions_tidy), "\n")
# Number of year-region pairs
# (Use ncol(emissions) - 1 so not to include the year column)
cat("Number of year-region pairs:", length(emissions$Year) * (ncol(emissions) - 1), "\n")
Number of rows in emissions_tidy: 1730
Number of year-region pairs: 1730
In general, pivot_longer
is called such because it make the table of data
‘longer’ i.e. have more rows.
Exercise: pivoting#
What do you think the output of the following pivot_longer
will be?
emissions |>
tidyr::pivot_longer(!Africa, # this means all columns except 'Africa'
names_to = "Names_column",
values_to = "Territorial_emissions?")
Run the code below to see if you’re correct!
Solution
%%R
emissions |>
tidyr::pivot_longer(!Africa, # this means all columns except 'Africa'
names_to = "Names_column",
values_to = "Territorial_emissions?")
What do values in the ‘Africa’ and ‘Territorial_emissions?’ columns correspond to in the following output?
%%R
emissions |>
tidyr::pivot_longer(!Africa, # this means all columns except 'Africa'
names_to = "Names_column",
values_to = "Territorial_emissions?") |>
dplyr::filter(Names_column == "Year")
# A tibble: 173 × 3
Africa Names_column `Territorial_emissions?`
<dbl> <chr> <dbl>
1 0 Year 1850
2 0 Year 1851
3 0 Year 1852
4 0 Year 1853
5 0 Year 1854
6 0 Year 1855
7 0 Year 1856
8 0 Year 1857
9 0 Year 1858
10 0 Year 1859
# ℹ 163 more rows
# ℹ Use `print(n = ...)` to see more rows
Tidy format to matrix format with pivot_wider
#
What if we want to go the other way, i.e. turn a tidy dataframe into a matrix
format? In that case, we can use pivot_wider
from the tidyr
package.
tidyr::pivot_wider(data, names_from = <column-to-make-new-columns-from>, values_from = <column-of-variable>)
where
data
is the dataframe to transform.names_from
is the column whose values will be the names of new columns in the new dataframe.values_from
is the name of a column whose values will go into the ‘cells’ of the new dataframe.
To see this in action, the following use of pivot_wider
will take our tidied
emissions data and put it back into the original ‘matrix’ format, where regions
corresponded to columns and the entries gave the emissions for each year-region
combination.
%%R
emissions_tidy |>
tidyr::pivot_wider(names_from = Region, values_from = Territorial_emissions)
# A tibble: 173 × 11
Year World Africa Asia `Central America` Europe `Middle East`
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1850 53.7 0 0.00360 0 48.2 0
2 1851 54.2 0 0 0 47.5 0
3 1852 56.6 0 0 0 49.2 0
4 1853 59.3 0 0 0 51.0 0
5 1854 69.6 0 0 0 60.5 0
6 1855 71.0 0 0.0109 0 60.5 0
7 1856 75.7 0 0 0 64.6 0
8 1857 76.4 0 0 0 65.0 0
9 1858 77.6 0 0.123 0 65.9 0
10 1859 82.3 0 0.190 0 69.5 0
# ℹ 163 more rows
# ℹ 4 more variables: `North America` <dbl>, Oceania <dbl>,
# `South America` <dbl>, Bunkers <dbl>
# ℹ Use `print(n = ...)` to see more rows
Acnowledgements#
We acknowledge the Global Carbon Project, which is responsible for the Global Carbon Budget and we thank the fossil carbon emissions modelling groups for producing and making available their model output.
!pwd
/Users/lb788/Documents/CfRR/CfRR_Courses/individual_modules/working_with_data_in_R