%load_ext rpy2.ipython
/Users/cc1333/Library/Python/3.9/lib/python/site-packages/rpy2/ipython/rmagic.py:77: UserWarning: The Python package `pandas` is strongly recommended when using `rpy2.ipython`. Unfortunately it could not be loaded (error: No module named 'pandas'), but at least we found `numpy`.
warnings.warn('The Python package `pandas` is strongly '
%%R
suppressPackageStartupMessages({
suppressMessages({
library(data.table)
library(dplyr)
})
})
Fast Data Manipulation Using the data.table
Package#
Overview#
The data.table
package is designed specifically for efficient manipulation of large datasets. The introductory vignette accompanying the data.table
package states the following:
“… if you are interested in reducing programming and compute time tremendously, then this package is for you.”
In research we often deal with large datasets, so this package merits further investigation! We closely follow the aforementioned vignette in this section, which you can view by executing vignette("datatable-intro", package = "data.table")
in your R session.
It is likely that when analysing your data you structure your workflow as follows:
You read the data.
You analyse the data e.g. cleaning, grouping, aggregating, filtering etc.
You write the results to a file and maybe send them to a collaborator.
The data.table
package speeds up reading and writing using the following functions:
Reading using
fread
. Similar toread.csv()
andread.delim()
but faster and more convenient.Writing using
fwrite
. As write.csv but much faster (e.g. 2 seconds versus 1 minute) and just as flexible.
The following code uses fread
to read data stored in a csv file from the data.table
GitHub repository. We will work with this dataset throughout this section. It contains On-Time flights data from the Bureau of Transportation Statistics for all the flights that departed from New York City airports in 2014. The data is available only for Jan 2014 - October 14.
%%R
input <- if (file.exists("flights14.csv")) {
"flights14.csv"
} else {
"https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
}
flights <- fread(input)
flights
year month day dep_delay arr_delay carrier origin dest air_time
<int> <int> <int> <int> <int> <char> <char> <char> <int>
1: 2014 1 1 14 13 AA JFK LAX 359
2: 2014 1 1 -3 13 AA JFK LAX 363
3: 2014 1 1 2 9 AA JFK LAX 351
4: 2014 1 1 -8 -26 AA LGA PBI 157
5: 2014 1 1 2 1 AA JFK LAX 350
---
253312: 2014 10 31 1 -30 UA LGA IAH 201
253313: 2014 10 31 -5 -14 UA EWR IAH 189
253314: 2014 10 31 -8 16 MQ LGA RDU 83
253315: 2014 10 31 -4 15 MQ LGA DTW 75
253316: 2014 10 31 -5 1 MQ LGA SDF 110
distance hour
<int> <int>
1: 2475 9
2: 2475 11
3: 2475 19
4: 1035 7
5: 2475 13
---
253312: 1416 14
253313: 1400 8
253314: 431 11
253315: 502 11
253316: 659 8
trying URL 'https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv'
Content type 'text/plain; charset=utf-8' length 9587529 bytes (9.1 MB)
==================================================
downloaded 9.1 MB
Note, fread
returns a data.table
object by default, otherwise a data.frame
with the argument data.table=FALSE
.
Analysis#
OK, so points 1. and 3. above can be accelerated pretty easily using fread
and fwrite
, but how can we accelerate our analysis? The purpose of this section is to introduce the basic syntax of the data.table
object for speeding up our analysis (i.e. point 2 above).
In base R, the standard way of storing data is in a data.frame
. If you attended our ‘Working With Data in R’ course, one of the motivating factors for using the Tidyverse functionality was that Tidyverse functions were designed with consistent interfaces, meaning the functions accept a data.frame
object as an input and return a data.frame
object as an output. The data.table
package provides an enhanced version of the data.frame
object.
If not reading the data in using fread
, a data.table
can be created from scratch as follows:
%%R
DT <- data.table(x = rnorm(200),
y = rnorm(200),
z = c("A", "B"))
The general form of the data.table
syntax is as follows:
DT[ i, j, by ] # + extra arguments
| | |
| | -------> grouped by what?
| -------> what to do?
---> on which rows?
We should interpret this as “Take DT
, subset/reorder rows using i
, then calculate j
, grouped by by
.”
Selecting Row and Columns#
To subsetting our data, we use the i
and j
arguments within DT[i, j, by]
.
Using j
to select columns is analogous to dplyr::select()
:
%%R
# return as vector
flights[, dep_delay]
# return as data.table
flights[, .(dep_delay)]
dep_delay
<int>
1: 14
2: -3
3: 2
4: -8
5: 2
---
253312: 1
253313: -5
253314: -8
253315: -4
253316: -5
Note, the .()
notation is an alias for list()
here.
Columns name in a variable can also be selected using ..
. Here, the ..
with []
is signalling to R
to ‘look up a level’ into the global environment to look for a variable named my_cols
, reminiscent of using cd ..
to change directories in a Unix shell.
%%R
my_cols <- c("dep_delay", "arr_delay")
flights[, ..my_cols]
dep_delay arr_delay
<int> <int>
1: 14 13
2: -3 13
3: 2 9
4: -8 -26
5: 2 1
---
253312: 1 -30
253313: -5 -14
253314: -8 16
253315: -4 15
253316: -5 1
Using i
to select rows subject to some condition is analogous to dplyr::filter()
:
%%R
# select rows with a departure delay of more than four hours
flights[dep_delay > 4]
year month day dep_delay arr_delay carrier origin dest air_time
<int> <int> <int> <int> <int> <char> <char> <char> <int>
1: 2014 1 1 14 13 AA JFK LAX 359
2: 2014 1 1 7 -5 AA JFK SFO 365
3: 2014 1 1 142 133 AA JFK LAX 345
4: 2014 1 1 18 69 AA JFK ORD 155
5: 2014 1 1 25 36 AA JFK IAH 234
---
80664: 2014 10 31 14 11 UA EWR MIA 151
80665: 2014 10 31 41 19 UA EWR SFO 344
80666: 2014 10 31 427 393 UA EWR ORD 100
80667: 2014 10 31 10 -27 UA EWR LAX 326
80668: 2014 10 31 18 -14 UA EWR LAS 291
distance hour
<int> <int>
1: 2475 9
2: 2586 17
3: 2475 19
4: 740 17
5: 1417 16
---
80664: 1085 19
80665: 2565 12
80666: 719 21
80667: 2454 10
80668: 2227 16
We can combine them both in a natural way:
%%R
# select values from departure delay and arrival delay columns, where of more
# than four hours where departure delay is more than four hours
flights[dep_delay > 4, .(dep_delay, arr_delay)]
dep_delay arr_delay
<int> <int>
1: 14 13
2: 7 -5
3: 142 133
4: 18 69
5: 25 36
---
80664: 14 11
80665: 41 19
80666: 427 393
80667: 10 -27
80668: 18 -14
Subset and Operate#
We can apply the above to perform operations on the columns
%%R
# returns a data.frame
flights[dep_delay > 4, .(mean_delay = mean(dep_delay, na.rm = TRUE), sum_delay = sum(dep_delay, na.rm = TRUE))]
mean_delay sum_delay
<num> <int>
1: 47.27013 3813187
In dplyr, the equivalent would code be:
%%R
# returns a data.frame
flights %>%
filter(dep_delay > 4) %>%
summarise(
mean_delay = mean(dep_delay, na.rm = TRUE),
sum_delay = sum(dep_delay, na.rm = TRUE)
)
mean_delay sum_delay
1 47.27013 3813187
Notice, the data.table
code is more compact, and all operation occurs within the frame of a data.table, i.e., within [ ... ]
. This is also referred to as ‘querying’ the data.table
l; analogous to SQL. Though, arguably, the dplyr
code is easier to understand.
Count#
In data.table
we can use the special symbol .N
, see ?.N
, to count the number of records.
%%R
# returns a data.table
flights[dep_delay > 4, .(total_above_four_hours = .N)]
total_above_four_hours
<int>
1: 80668
In dplyr, the equivalent code would be:
%%R
# returns a data.frame
flights %>%
filter(dep_delay > 4) %>%
summarise(total_above_four_hours = n())
total_above_four_hours
1 80668
Updating and Removing Columns#
Adding columns can be achieved using the :=
operator. This adds columns ‘by reference’, meaning the data.table
is never copied. To illustrate this, consider the following smaller examples:
%%R
my_dt <- data.table(x = rnorm(200))
tracemem(my_dt)
[1] "<0x11d9e5800>"
%%R
my_dt[, z := rnorm(200)]
tracemem(my_dt)
[1] "<0x11d9e5800>"
%%R
my_df <- data.frame(x = rnorm(200))
tracemem(my_df)
[1] "<0x11d49cd40>"
%%R
my_df <- my_df %>%
mutate(z = rnorm(200))
tracemem(my_df)
tracemem[0x11d49cd40 -> 0x11d3caf38]: initialize <Anonymous> mutate_cols mutate.data.frame mutate %>% <Anonymous> <Anonymous> <Anonymous>
tracemem[0x11d3caf38 -> 0x11d3cb280]: dplyr_new_list initialize <Anonymous> mutate_cols mutate.data.frame mutate %>% <Anonymous> <Anonymous> <Anonymous>
tracemem[0x11d3cb280 -> 0x11d3cb2b8]: dplyr_new_list initialize <Anonymous> mutate_cols mutate.data.frame mutate %>% <Anonymous> <Anonymous> <Anonymous>
tracemem[0x11d49cd40 -> 0x11d39e090]: new_data_frame vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate %>% <Anonymous> <Anonymous> <Anonymous>
tracemem[0x11d39e090 -> 0x11d39e1a8]: as.list.data.frame as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate %>% <Anonymous> <Anonymous> <Anonymous>
[1] "<0x14631ea08>"
With the data.table
, using :=
within the frame [ ... ]
is sufficient to add a new column to the data.table
; we didn’t have to assign the result back to my_dt
using <-
. Note, the tracemem()
function shows the addresses are the same for the my_dt
example, so a copy was not made. However, for the my_df
example, this was not the case. Much more detail on this behaviour is available in this vignette.
Removing a column, by reference, can also be achieved using :=
%%R
my_dt[, z := NULL]
my_dt
x
<num>
1: 0.6642543
2: 0.2826647
3: -0.8785894
4: -0.5096216
5: 0.1665126
---
196: -1.9311627
197: 1.2559794
198: -0.3076927
199: 0.2261218
200: 0.5263950
Aggregation#
We have discussed the i
and j
arguments of data.table
, now let’s discuss by
.
Grouping
Let’s group the flight data by month, and get the mean dep_delay
.
%%R
flights[, .(mean_delay = mean(dep_delay)), by = month]
month mean_delay
<int> <num>
1: 1 22.957624
2: 2 17.809878
3: 3 8.927260
4: 4 10.243083
5: 5 13.684233
6: 6 14.084906
7: 7 16.463060
8: 8 10.012459
9: 9 4.742795
10: 10 7.850555
Let’s find the number of flights grouped by month.
%%R
flights[, .N, by = month]
month N
<int> <int>
1: 1 22796
2: 2 20813
3: 3 26423
4: 4 25588
5: 5 25522
6: 6 26488
7: 7 27003
8: 8 27450
9: 9 25190
10: 10 26043
Chaining
In the above example where we found the number of flights grouped by month; we can chain compound expressions, for example to arrange the answer in increasing or decreasing order.
%%R
flights[, .(total_flights = .N), by = month
][
order(total_flights, decreasing = TRUE)
]
month total_flights
<int> <int>
1: 8 27450
2: 7 27003
3: 6 26488
4: 3 26423
5: 10 26043
6: 4 25588
7: 5 25522
8: 9 25190
9: 1 22796
10: 2 20813
Note the vertical chaining of the operations. This helps with readability, analogously to separating pipes %>%
over separate lines in the Tidyverse!
Sorting by Grouping Variables
The keyby
argument is available for sorting by the variables you group by, though the following three are all equivalent ways of getting the mean air time of flights leaving from JFK grouped by destination, then ordering the results by destination (increasing).
%%R
flights[origin == "JFK", .(mean_air_time = mean(air_time)), keyby = .(dest)]
flights[origin == "JFK", .(mean_air_time = mean(air_time)), by = .(dest), keyby = TRUE]
# using head for smaller print
head(flights[origin == "JFK", .(mean_air_time = mean(air_time)), by = .(dest)][order(dest)])
dest mean_air_time
<char> <num>
1: ABQ 251.71583
2: ACK 39.53430
3: ATL 110.49559
4: AUS 210.75912
5: BNA 118.59155
6: BOS 38.07954
Example: A Data Wrangling Task#
How many minutes does each flight ‘make up’ on its journey per minute of air time? Hint: (dep_delay - arr_delay)
Order your answer increasing by destination and decreasing by time made up per minute of air time.
%%R
# Write your code here
NULL
I highly recommend looking through the additional vignettes for further information on what is happening under the hood in data.table
.