Loading Data#
Download Rmd Version#
If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.
Exercises incomplete: Download Loading_data.Rmd Exercises complete: Download Loading_data_complete.Rmd
Learning Objectives#
Understand the process of loading data into R from various file formats, including CSV and other delimited files
Use the
readr
package to read data from files and understand its advantages over base R functionsSpecify and control data parsing options to ensure accurate data import
Handle different data types (e.g. strings, dates, numbers) correctly during data import
Parse data explicitly using column specifications to avoid incorrect assumptions about data types
Introduction#
The package readr
provides functions for reading data from ‘flat’ files,
such as csv or other delimited files. The functionality provided by readr
offers some advantages over the base R functions available for performing the
same task.
Loading data from a csv file#
You may have used the R function read.csv
to load data from a csv file into
a dataframe:
%%R
# Read in ./data/timeseries_data.csv
timeseries <- read.csv("./data/timeseries_data.csv")
print(timeseries)
Id Date X_variable Y_variable
1 1 2024-02-01 1.19678372 7.900464
2 2 2024-02-02 -0.75075633 12.690405
3 3 2024-02-03 -0.78504927 10.280089
4 4 2024-02-04 -0.61854575 10.073199
5 5 2024-02-05 0.61882513 9.183893
6 6 2024-02-06 -1.29555973 4.830592
7 7 2024-02-07 -1.04786101 11.950496
8 8 2024-02-08 1.27728217 7.008456
9 9 2024-02-09 -0.06582643 3.403482
10 10 2024-02-10 -0.07455295 11.637056
This is fine in a lot of cases, however, note some assumptions have been made.
For example, in the above, the Id
column is read as an integer and the Date
column is read in as a string. In fact, the original data file records the Id
with padded zeros, like so:
Id,Date,X_variable,Y_variable
001,2024-02-01,1.1967837194583961,7.900464364159182
002,2024-02-02,-0.7507563344053951,12.690405000912477
003,2024-02-03,-0.7850492689583279,10.280088829160382
004,2024-02-04,-0.6185457481359723,10.073199260115578
005,2024-02-05,0.6188251316696602,9.183892504016947
006,2024-02-06,-1.295559729374875,4.830591718912964
007,2024-02-07,-1.0478610080617479,11.950495760617637
008,2024-02-08,1.2772821688688454,7.008456322119265
009,2024-02-09,-0.06582643175389392,3.4034822391265145
010,2024-02-10,-0.07455295463910701,11.6370557133845
What we really want is to read the Id as a string and the Date as a date object.
The read_csv
function from the readr
package is more ‘shouty’ about
assumptions made when parsing the data, and can have better default
interpretations. For example:
%%R
timeseries <- readr::read_csv("./data/timeseries_data.csv")
Rows: 10 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Id
dbl (2): X_variable, Y_variable
date (1): Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Notice how it has explicitly told us that
Id
is parsed as a string (which is what we wanted!)Date
is parsed as a date (more about dates later in the course).X_variable
andY_variable
are parsed as doubles, as expected.
In general, it’s good practice to be explicit about how we parse data from
files, rather than assuming the defaults will work. For example, we’ve no way of
knowing whether the readr
package maintainers will decide in the future to
interpret padded numbers as in the Id
column as integers.
We can be explicit about how columns are parsed by using the col_types
optional argument. The general way to do this is to supply a list with names
being the column names and values being a column specifier,
readr::col_<data-type>()
:
readr::read_csv("path/to/data.csv",
col_types = list(colA = readr::col_logical(), # column colA is a boolean
colB = readr::col_integer(), # column colB is an integer
colC = readr::col_double(), # column colC is a floating point number
colD = readr::col_character(), # column colD is a string
colE = readr::col_date(), # column colE is a date
.default = readr::col_double() # columns are floating point numbers
# unless specified otherwise
))
Note: For a full list of the available column specifiers, run
vignette("readr")
in the R console.
So, to specify that we want to read the Id
column as a string, the Date
column as a date, and the remaining columns as floating point numbers:
%%R
timeseries <- readr::read_csv("./data/timeseries_data.csv",
col_types = list(Id = readr::col_character(),
Date = readr::col_date(),
.default = readr::col_double()))
print(timeseries)
# A tibble: 10 × 4
Id Date X_variable Y_variable
<chr> <date> <dbl> <dbl>
1 001 2024-02-01 1.20 7.90
2 002 2024-02-02 -0.751 12.7
3 003 2024-02-03 -0.785 10.3
4 004 2024-02-04 -0.619 10.1
5 005 2024-02-05 0.619 9.18
6 006 2024-02-06 -1.30 4.83
7 007 2024-02-07 -1.05 12.0
8 008 2024-02-08 1.28 7.01
9 009 2024-02-09 -0.0658 3.40
10 010 2024-02-10 -0.0746 11.6
Notice how the informational message about parsing has gone away, because we explicitly gave the types.
Exercise: parsing data from files#
Change the above so that the Id
column is parsed as an integer and the
Date
column is parsed as a string.
Solution
%%R
timeseries <- readr::read_csv("./data/timeseries_data.csv",col_types = list(Id = readr::col_integer(),Date = readr::col_character(),.default = readr::col_double()))
print(timeseries)
Other file formats#
There are related packages in the Tidyverse ecosystem that provide support for reading data from file types other than csv:
readxl
: package for reading data from Excel workbooks.googlesheets4
: package for reading data from Google Sheets spreadsheets.
We also mention here the ncdf4
package for reading NetCDF files (not part of
the Tidyverse).