Fast Data Manipulation Using the data.table Package

Fast Data Manipulation Using the `data.table` Package#

Overview#

The data.table package is designed specifically for efficient manipulation of large datasets. The introductory vignette accompanying the data.table package states the following:

“… if you are interested in reducing programming and compute time tremendously, then this package is for you.”

In research we often deal with large datasets, so this package merits further investigation! We closely follow the aforementioned vignette in this section, which you can view by executing vignette("datatable-intro", package = "data.table") in your R session.

It is likely that when analysing your data you structure your workflow as follows:

You read the data.
You analyse the data e.g. cleaning, grouping, aggregating, filtering etc.
You write the results to a file and maybe send them to a collaborator.

The data.table package speeds up reading and writing using the following functions:

Reading using fread. Similar to read.csv() and read.delim() but faster and more convenient.
Writing using fwrite. As write.csv but much faster (e.g. 2 seconds versus 1 minute) and just as flexible.

The following code uses fread to read data stored in a csv file from the data.table GitHub repository. We will work with this dataset throughout this section. It contains On-Time flights data from the Bureau of Transportation Statistics for all the flights that departed from New York City airports in 2014. The data is available only for Jan 2014 - October 14.

input <- if (file.exists("flights14.csv")) {
   "flights14.csv"
} else {
  "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
}
flights <- fread(input)
flights

         year month   day dep_delay arr_delay carrier origin   dest air_time
        <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>
     1:  2014     1     1        14        13      AA    JFK    LAX      359
     2:  2014     1     1        -3        13      AA    JFK    LAX      363
     3:  2014     1     1         2         9      AA    JFK    LAX      351
     4:  2014     1     1        -8       -26      AA    LGA    PBI      157
     5:  2014     1     1         2         1      AA    JFK    LAX      350
    ---                                                                     
253312:  2014    10    31         1       -30      UA    LGA    IAH      201
253313:  2014    10    31        -5       -14      UA    EWR    IAH      189
253314:  2014    10    31        -8        16      MQ    LGA    RDU       83
253315:  2014    10    31        -4        15      MQ    LGA    DTW       75
253316:  2014    10    31        -5         1      MQ    LGA    SDF      110
        distance  hour
           <int> <int>
     1:     2475     9
     2:     2475    11
     3:     2475    19
     4:     1035     7
     5:     2475    13
    ---               
253312:     1416    14
253313:     1400     8
253314:      431    11
253315:      502    11
253316:      659     8

trying URL 'https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv'
Content type 'text/plain; charset=utf-8' length 9587529 bytes (9.1 MB)
==================================================
downloaded 9.1 MB

Note, fread returns a data.table object by default, otherwise a data.frame with the argument data.table=FALSE.

Analysis#

OK, so points 1. and 3. above can be accelerated pretty easily using fread and fwrite, but how can we accelerate our analysis? The purpose of this section is to introduce the basic syntax of the data.table object for speeding up our analysis (i.e. point 2 above).

In base R, the standard way of storing data is in a data.frame. If you attended our ‘Working With Data in R’ course, one of the motivating factors for using the Tidyverse functionality was that Tidyverse functions were designed with consistent interfaces, meaning the functions accept a data.frame object as an input and return a data.frame object as an output. The data.table package provides an enhanced version of the data.frame object.

If not reading the data in using fread, a data.table can be created from scratch as follows:

DT <- data.table(x = rnorm(200),
                    y = rnorm(200),
                    z = c("A", "B"))

The general form of the data.table syntax is as follows:

DT[ i,  j,  by ] # + extra arguments
    |   |   |
    |   |    -------> grouped by what?
    |    -------> what to do?
     ---> on which rows?

We should interpret this as “Take DT, subset/reorder rows using i, then calculate j, grouped by by.”

Selecting Row and Columns#

To subsetting our data, we use the i and j arguments within DT[i, j, by].

Using j to select columns is analogous to dplyr::select():

# return as vector
flights[, dep_delay]

#  return as data.table
flights[, .(dep_delay)]

        dep_delay
            <int>
     1:        14
     2:        -3
     3:         2
     4:        -8
     5:         2
    ---          
253312:         1
253313:        -5
253314:        -8
253315:        -4
253316:        -5

Note, the .() notation is an alias for list() here.

Columns name in a variable can also be selected using ... Here, the .. with [] is signalling to R to ‘look up a level’ into the global environment to look for a variable named my_cols, reminiscent of using cd .. to change directories in a Unix shell.

my_cols <- c("dep_delay", "arr_delay")
flights[, ..my_cols]

        dep_delay arr_delay
            <int>     <int>
     1:        14        13
     2:        -3        13
     3:         2         9
     4:        -8       -26
     5:         2         1
    ---                    
253312:         1       -30
253313:        -5       -14
253314:        -8        16
253315:        -4        15
253316:        -5         1

Using i to select rows subject to some condition is analogous to dplyr::filter():

# select rows with a departure delay of more than four hours
flights[dep_delay > 4]

        year month   day dep_delay arr_delay carrier origin   dest air_time
       <int> <int> <int>     <int>     <int>  <char> <char> <char>    <int>
 2014     1     1        14        13      AA    JFK    LAX      359
 2014     1     1         7        -5      AA    JFK    SFO      365
 2014     1     1       142       133      AA    JFK    LAX      345
 2014     1     1        18        69      AA    JFK    ORD      155
 2014     1     1        25        36      AA    JFK    IAH      234
   ---                                                                     
 2014    10    31        14        11      UA    EWR    MIA      151
 2014    10    31        41        19      UA    EWR    SFO      344
 2014    10    31       427       393      UA    EWR    ORD      100
 2014    10    31        10       -27      UA    EWR    LAX      326
 2014    10    31        18       -14      UA    EWR    LAS      291
       distance  hour
          <int> <int>
    2475     9
    2586    17
    2475    19
     740    17
    1417    16
   ---               
    1085    19
    2565    12
     719    21
    2454    10
    2227    16

We can combine them both in a natural way:

# select values from departure delay and arrival delay columns, where of more
# than four hours where departure delay is more than four hours
flights[dep_delay > 4, .(dep_delay, arr_delay)]

       dep_delay arr_delay
           <int>     <int>
       14        13
        7        -5
      142       133
       18        69
       25        36
   ---                    
       14        11
       41        19
      427       393
       10       -27
       18       -14

Subset and Operate#

We can apply the above to perform operations on the columns

# returns a data.frame
flights[dep_delay > 4, .(mean_delay = mean(dep_delay, na.rm = TRUE), sum_delay = sum(dep_delay, na.rm = TRUE))]

   mean_delay sum_delay
        <num>     <int>
1:   47.27013   3813187

In dplyr, the equivalent would code be:

# returns a data.frame
flights %>%
  filter(dep_delay > 4) %>%
  summarise(
    mean_delay = mean(dep_delay, na.rm = TRUE),
    sum_delay = sum(dep_delay, na.rm = TRUE)
  )

  mean_delay sum_delay
1   47.27013   3813187

Notice, the data.table code is more compact, and all operation occurs within the frame of a data.table, i.e., within [ ... ]. This is also referred to as ‘querying’ the data.tablel; analogous to SQL. Though, arguably, the dplyr code is easier to understand.

Count#

In data.table we can use the special symbol .N, see ?.N, to count the number of records.

# returns a data.table
flights[dep_delay > 4, .(total_above_four_hours = .N)]

   total_above_four_hours
                    <int>
1:                  80668

In dplyr, the equivalent code would be:

# returns a data.frame
flights %>%
  filter(dep_delay > 4) %>%
  summarise(total_above_four_hours = n())

  total_above_four_hours
1                  80668

Updating and Removing Columns#

Adding columns can be achieved using the := operator. This adds columns ‘by reference’, meaning the data.table is never copied. To illustrate this, consider the following smaller examples:

my_dt <- data.table(x = rnorm(200))
tracemem(my_dt)

[1] "<0x11d9e5800>"

my_dt[, z := rnorm(200)]
tracemem(my_dt)

[1] "<0x11d9e5800>"

my_df <- data.frame(x = rnorm(200))
tracemem(my_df)

[1] "<0x11d49cd40>"

my_df <- my_df %>%
  mutate(z = rnorm(200))
tracemem(my_df)

tracemem[0x11d49cd40 -> 0x11d3caf38]: initialize <Anonymous> mutate_cols mutate.data.frame mutate %>% <Anonymous> <Anonymous> <Anonymous> 
tracemem[0x11d3caf38 -> 0x11d3cb280]: dplyr_new_list initialize <Anonymous> mutate_cols mutate.data.frame mutate %>% <Anonymous> <Anonymous> <Anonymous> 
tracemem[0x11d3cb280 -> 0x11d3cb2b8]: dplyr_new_list initialize <Anonymous> mutate_cols mutate.data.frame mutate %>% <Anonymous> <Anonymous> <Anonymous> 
tracemem[0x11d49cd40 -> 0x11d39e090]: new_data_frame vec_data as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate %>% <Anonymous> <Anonymous> <Anonymous> 
tracemem[0x11d39e090 -> 0x11d39e1a8]: as.list.data.frame as.list dplyr_col_modify.data.frame dplyr_col_modify mutate.data.frame mutate %>% <Anonymous> <Anonymous> <Anonymous> 
[1] "<0x14631ea08>"

With the data.table, using := within the frame [ ... ] is sufficient to add a new column to the data.table; we didn’t have to assign the result back to my_dt using <-. Note, the tracemem() function shows the addresses are the same for the my_dt example, so a copy was not made. However, for the my_df example, this was not the case. Much more detail on this behaviour is available in this vignette.

Removing a column, by reference, can also be achieved using :=

my_dt[, z := NULL]
my_dt

Aggregation#

We have discussed the i and j arguments of data.table, now let’s discuss by.

Grouping Let’s group the flight data by month, and get the mean dep_delay.

flights[, .(mean_delay = mean(dep_delay)), by = month]

    month mean_delay
    <int>      <num>
    1  22.957624
    2  17.809878
    3   8.927260
    4  10.243083
    5  13.684233
    6  14.084906
    7  16.463060
    8  10.012459
    9   4.742795
   10   7.850555

Let’s find the number of flights grouped by month.

flights[, .N, by = month]

    month     N
    <int> <int>
    1 22796
    2 20813
    3 26423
    4 25588
    5 25522
    6 26488
    7 27003
    8 27450
    9 25190
   10 26043

Chaining

In the above example where we found the number of flights grouped by month; we can chain compound expressions, for example to arrange the answer in increasing or decreasing order.

flights[, .(total_flights = .N), by = month
        ][
          order(total_flights, decreasing = TRUE)
          ]

    month total_flights
    <int>         <int>
    8         27450
    7         27003
    6         26488
    3         26423
   10         26043
    4         25588
    5         25522
    9         25190
    1         22796
    2         20813

Note the vertical chaining of the operations. This helps with readability, analogously to separating pipes %>% over separate lines in the Tidyverse!

Sorting by Grouping Variables

The keyby argument is available for sorting by the variables you group by, though the following three are all equivalent ways of getting the mean air time of flights leaving from JFK grouped by destination, then ordering the results by destination (increasing).

flights[origin == "JFK", .(mean_air_time = mean(air_time)), keyby = .(dest)]
flights[origin == "JFK", .(mean_air_time = mean(air_time)), by = .(dest), keyby = TRUE]

# using head for smaller print
head(flights[origin == "JFK", .(mean_air_time = mean(air_time)), by = .(dest)][order(dest)])

     dest mean_air_time
   <char>         <num>
   ABQ     251.71583
   ACK      39.53430
   ATL     110.49559
   AUS     210.75912
   BNA     118.59155
   BOS      38.07954

Example: A Data Wrangling Task#

How many minutes does each flight ‘make up’ on its journey per minute of air time? Hint: (dep_delay - arr_delay)

Order your answer increasing by destination and decreasing by time made up per minute of air time.

# Write your code here

I highly recommend looking through the additional vignettes for further information on what is happening under the hood in data.table.