Loading Data#

Download Rmd Version#

If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.

Exercises incomplete: Download Loading_data.Rmd Exercises complete: Download Loading_data_complete.Rmd

Learning Objectives#

  • Understand the process of loading data into R from various file formats, including CSV and other delimited files

  • Use the readr package to read data from files and understand its advantages over base R functions

  • Specify and control data parsing options to ensure accurate data import

  • Handle different data types (e.g. strings, dates, numbers) correctly during data import

  • Parse data explicitly using column specifications to avoid incorrect assumptions about data types

Introduction#

The package readr provides functions for reading data from ‘flat’ files, such as csv or other delimited files. The functionality provided by readr offers some advantages over the base R functions available for performing the same task.

Loading data from a csv file#

You may have used the R function read.csv to load data from a csv file into a dataframe:

%%R
# Read in ./data/timeseries_data.csv
timeseries <- read.csv("./data/timeseries_data.csv")
print(timeseries)
   Id       Date  X_variable Y_variable
1   1 2024-02-01  1.19678372   7.900464
2   2 2024-02-02 -0.75075633  12.690405
3   3 2024-02-03 -0.78504927  10.280089
4   4 2024-02-04 -0.61854575  10.073199
5   5 2024-02-05  0.61882513   9.183893
6   6 2024-02-06 -1.29555973   4.830592
7   7 2024-02-07 -1.04786101  11.950496
8   8 2024-02-08  1.27728217   7.008456
9   9 2024-02-09 -0.06582643   3.403482
10 10 2024-02-10 -0.07455295  11.637056

This is fine in a lot of cases, however, note some assumptions have been made. For example, in the above, the Id column is read as an integer and the Date column is read in as a string. In fact, the original data file records the Id with padded zeros, like so:

Id,Date,X_variable,Y_variable
001,2024-02-01,1.1967837194583961,7.900464364159182
002,2024-02-02,-0.7507563344053951,12.690405000912477
003,2024-02-03,-0.7850492689583279,10.280088829160382
004,2024-02-04,-0.6185457481359723,10.073199260115578
005,2024-02-05,0.6188251316696602,9.183892504016947
006,2024-02-06,-1.295559729374875,4.830591718912964
007,2024-02-07,-1.0478610080617479,11.950495760617637
008,2024-02-08,1.2772821688688454,7.008456322119265
009,2024-02-09,-0.06582643175389392,3.4034822391265145
010,2024-02-10,-0.07455295463910701,11.6370557133845

What we really want is to read the Id as a string and the Date as a date object.

The read_csv function from the readr package is more ‘shouty’ about assumptions made when parsing the data, and can have better default interpretations. For example:

%%R
timeseries <- readr::read_csv("./data/timeseries_data.csv")
Rows: 10 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Id
dbl  (2): X_variable, Y_variable
date (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Notice how it has explicitly told us that

  • Id is parsed as a string (which is what we wanted!)

  • Date is parsed as a date (more about dates later in the course).

  • X_variable and Y_variable are parsed as doubles, as expected.

In general, it’s good practice to be explicit about how we parse data from files, rather than assuming the defaults will work. For example, we’ve no way of knowing whether the readr package maintainers will decide in the future to interpret padded numbers as in the Id column as integers.

We can be explicit about how columns are parsed by using the col_types optional argument. The general way to do this is to supply a list with names being the column names and values being a column specifier, readr::col_<data-type>():

readr::read_csv("path/to/data.csv",
                col_types = list(colA = readr::col_logical(),    # column colA is a boolean
                                 colB = readr::col_integer(),    # column colB is an integer
                                 colC = readr::col_double(),     # column colC is a floating point number
                                 colD = readr::col_character(),  # column colD is a string
                                 colE = readr::col_date(),       # column colE is a date
                                 .default = readr::col_double()  # columns are floating point numbers
                                                                 # unless specified otherwise
                                 ))

Note: For a full list of the available column specifiers, run vignette("readr") in the R console.

So, to specify that we want to read the Id column as a string, the Date column as a date, and the remaining columns as floating point numbers:

%%R
timeseries <- readr::read_csv("./data/timeseries_data.csv",
                              col_types = list(Id = readr::col_character(),
                                               Date = readr::col_date(),
                                               .default = readr::col_double()))
print(timeseries)
# A tibble: 10 × 4
   Id    Date       X_variable Y_variable
   <chr> <date>          <dbl>      <dbl>
 1 001   2024-02-01     1.20         7.90
 2 002   2024-02-02    -0.751       12.7 
 3 003   2024-02-03    -0.785       10.3 
 4 004   2024-02-04    -0.619       10.1 
 5 005   2024-02-05     0.619        9.18
 6 006   2024-02-06    -1.30         4.83
 7 007   2024-02-07    -1.05        12.0 
 8 008   2024-02-08     1.28         7.01
 9 009   2024-02-09    -0.0658       3.40
10 010   2024-02-10    -0.0746      11.6 

Notice how the informational message about parsing has gone away, because we explicitly gave the types.

Exercise: parsing data from files#

Change the above so that the Id column is parsed as an integer and the Date column is parsed as a string.

Other file formats#

There are related packages in the Tidyverse ecosystem that provide support for reading data from file types other than csv:

  • readxl: package for reading data from Excel workbooks.

  • googlesheets4: package for reading data from Google Sheets spreadsheets.

We also mention here the ncdf4 package for reading NetCDF files (not part of the Tidyverse).

Summary Quiz#